arxiv:2603.27844

Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Published on Apr 16 · Submitted by Natapong Nitarach (Schwyter) on Apr 17

Abstract

Majority voting improves mathematical reasoning but is limited by correlated errors; diverse reasoning strategies and model capability are more impactful than prompt engineering.

AI-generated summary

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to assign different reasoning strategies to different voters. The approach, Diverse Prompt Mixer, is tested on the AIMO 3 competition: 3 models, 23+ experiments, 50 IMO-level problems, one H100 80 GB, 5-hour limit. Every prompt-level intervention fails. High-temperature sampling already decorrelates errors; weaker strategies reduce accuracy more than they reduce correlation. Across an 8-point capability gap at equal N=8 and every optimization tested, model capability dominates. The gap between the best majority-vote score (42/50) and pass@20 (~45.5) is selection loss, not prompt loss. A verifier-based selector could close it. Prompt engineering cannot.

Community

Paper author · Paper submitter

(Figure 3: p vs. score)

Diverse Prompt Mixer assigns different reasoning strategies to majority-voting members to decorrelate errors. Tested on 50 IMO-level problems (1×H100, 5-hour limit, 3 models, 23+ experiments). It does not work.

Why it fails:
High-temperature sampling already pushes pairwise error correlation to zero or below (mean ρ̂ = −0.348 across 19 computable points). There is no correlation headroom left. Diverse prompts reduce per-attempt accuracy more than they reduce correlation.
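The pairwise error correlation can be estimated directly from attempt outcomes: record which attempts got which problems wrong, then average the Pearson correlation of the error indicators over attempt pairs, skipping pairs where the correlation is undefined (one attempt is all-correct or all-wrong). This is a minimal sketch of that statistic, not the paper's code; the boolean-matrix layout is an assumption.

```python
import numpy as np

def mean_pairwise_error_correlation(correct):
    """correct: (n_attempts, n_problems) 0/1 array; correct[i, j] == 1 means
    attempt i solved problem j. Returns the mean Pearson correlation of error
    indicators over all attempt pairs with a defined correlation."""
    errors = 1.0 - np.asarray(correct, dtype=float)
    n_attempts = errors.shape[0]
    rhos = []
    for i in range(n_attempts):
        for j in range(i + 1, n_attempts):
            a, b = errors[i], errors[j]
            if a.std() == 0 or b.std() == 0:
                continue  # constant error vector: correlation undefined, skip
            rhos.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(rhos)) if rhos else float("nan")
```

A negative mean (like the reported ρ̂ = −0.348) means voters tend to fail on *different* problems, which is exactly when majority voting helps most and why extra decorrelation has no headroom.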

What dominates:
At equal N=8, the 8-point model capability gap (gpt-oss-120b at 39.3 vs. gpt-oss-20b at 31.0) is 4× larger than any prompt optimization (±2 points). Scaling N past the compute budget backfires.
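The voting baseline itself is just a plurality over the N final answers (integers, in AIMO's format). A minimal sketch, assuming first-seen tie-breaking (the paper's tie-breaking rule is not specified here):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among N attempts.
    Ties break toward the answer seen first (Counter preserves insertion order)."""
    return Counter(answers).most_common(1)[0][0]
```

Usage: `majority_vote([17, 42, 42, 9])` returns `42`.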

Where the real gap is:
The best model's pass@20 ≈ 45.5, but majority voting peaks at 42/50: six points of selection loss. A verifier-based selector could close it. Prompt engineering cannot.
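pass@k is commonly computed with the standard unbiased estimator from Chen et al. (2021): draw k attempts from n samples of which c are correct, and take the probability that at least one is correct. Whether the paper uses exactly this estimator is an assumption; the formula below is the standard one.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), the chance that a random size-k subset
    of n samples (c of them correct) contains at least one success."""
    if n - c < k:
        return 1.0  # not enough failures to fill a size-k subset
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The gap this section describes is then simply `pass_at_k(...) - majority_vote_accuracy`: the answers exist among the 20 samples, but plurality voting fails to select them.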

