Will MathNet Set a New Bar for Olympiad Math and AI?

Laurent Giraid is a technologist steeped in the craft and consequences of AI. His work in machine learning and natural language processing intersects with ethics, which shows in how he thinks about data provenance, representation, and the human stakes of benchmarking. In this conversation, he walks us through the creation and impact of a global archive of Olympiad-level math problems—more than 30,000 problems drawn from 47 countries, 17 languages, and 143 competitions—discussing the design decisions, multilingual challenges, and benchmarking insights that followed. We explore the practical machinery behind assembling 1,595 PDF volumes and 25,000+ pages, how peer-reviewed multi-page solutions reshape both student learning and model training, and what the latest results—like ~69.3% average accuracy on a 6,400-problem benchmark—really say about AI’s current limits. He shares lessons from collaborations and from calibrating evaluators across cultures, explains how retrieval can add up to 12 points when it is relevant (and how the roughly 22% of irrelevant retrievals backfire), and unpacks why near-duplicate detection still trips up top embeddings, which find the correct match only about 5% of the time at top-1. We end with a grounded forecast for AI models tackling Olympiad-level math and what it will take to move from scattered progress to robust reasoning, especially in visual and low-resource language settings, where some open models score 0% in Mongolian.

What problem were you trying to solve when you set out to compile a global archive of Olympiad-level math problems, and how did that initial goal shape your design choices and success metrics?

We were closing two gaps at once: access and rigor. Access, because the national booklets get traded at events and then effectively vanish; rigor, because most datasets pull from informal forums instead of verified, multi-page solutions. That drove three design choices. First, provenance: we restricted ourselves to official national booklets so every entry had peer-reviewed pedigree. Second, breadth: the target was not just size but diversity—47 countries, 17 languages, and 143 competitions—to reflect distinct problem traditions. Third, evaluation: success meant clean, searchable problems and solutions with normalized metadata and demonstrable value as a benchmark, which we measured through controlled experiments on a 6,400-problem slice and cross-validated human grading on thousands of items.

With more than 30,000 problems spanning 47 countries and 17 languages, what trade-offs did you make between breadth and consistency, and how did you decide which competitions and years to prioritize?

Breadth without standards becomes noise, so we pushed for a spine of consistency while accepting regional texture. We prioritized competitions with sustained historical depth—four decades where possible—so that the same country’s style could be traced year to year, and the cross-country variance wasn’t just a snapshot. When faced with scanning limitations or missing years, we favored booklets with complete, multi-page official solutions over sparse or fragmentary releases; that decision preserves reasoning signals that matter for models and learners. We also weighted earlier decades if they filled geographic gaps—say, underrepresented regions—so our 47-country span was substantive, not just token coverage. Consistency came from standardizing metadata fields and notation mappings, even when the surface language differed.

Walk us through the nuts and bolts of tracking down 1,595 PDF volumes and 25,000+ pages—what workflows, tools, and human processes turned decades-old scans into clean, searchable data?

The workflow had three lanes: acquisition, reconstruction, and validation. Acquisition meant combing institutional repositories and personal archives, including long-kept scans from community stalwarts; that’s how we aggregated 1,595 volumes totaling 25,000+ pages. Reconstruction involved page-level OCR for text, math-aware parsing for formulas, and a layout pass that stitched multi-page solutions into coherent threads and reflowed footnotes when needed. On top of that, we layered human passes for ambiguous symbols, cross-referenced problem IDs against national indices, and flagged any instances where figures degraded under OCR. Each batch shipped with a manifest that logged source provenance and a checksum so future corrections could be merged without losing track of lineage.
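
To make the validation lane concrete, here is a minimal sketch of what a batch manifest with per-file checksums could look like; the directory layout, field names, and `build_manifest` helper are illustrative stand-ins, not the project's actual tooling.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large scanned volumes never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(batch_dir: Path, source: str) -> dict:
    """Record provenance and a checksum per PDF so later corrections can be merged without losing lineage."""
    return {
        "source": source,  # where the scans came from, e.g. an institutional repository or personal archive
        "files": [
            {"name": p.name, "sha256": sha256_of(p), "bytes": p.stat().st_size}
            for p in sorted(batch_dir.glob("*.pdf"))
        ],
    }

if __name__ == "__main__":
    # Hypothetical batch directory; prints the manifest that would ship alongside the scans.
    manifest = build_manifest(Path("booklets/example_batch"), source="personal_archive")
    print(json.dumps(manifest, indent=2))
```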

When cleaning multi-page, expert-written solutions, how did you preserve multiple solution paths while standardizing notation and formatting, and what edge cases forced you to revise your pipeline?

We treated each solution as a tree, not a line: the canonical path plus named branches. Wherever an official booklet presented two or three approaches, we kept them as numbered sub-solutions inside a shared problem node, linking shared lemmas to avoid duplication. Standardization happened at the notation layer—aligning symbols for common objects, like re-mapping local variants of “floor,” “choose,” or angle marks—while preserving author voice in the prose so the pedagogical texture survived. Edge cases were often geometry-heavy: mixed text-in-figure labels, page breaks that split a diagram from its argument, and proof sketches that referenced prior-year results; those forced us to add a cross-document reference resolver and an image-anchored caption index to keep everything coherent.
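
As a rough illustration of the tree-not-line idea, a minimal schema might look like the sketch below; the class and field names are hypothetical, and the real schema is certainly richer.

```python
from dataclasses import dataclass, field

@dataclass
class Lemma:
    lemma_id: str
    statement: str

@dataclass
class SolutionPath:
    label: str                  # e.g. "Solution 2 (extremal argument)"
    steps: list[str]            # prose kept close to the original author voice
    uses_lemmas: list[str] = field(default_factory=list)  # IDs of shared lemmas, stored once per problem

@dataclass
class ProblemNode:
    problem_id: str
    statement: str
    lemmas: dict[str, Lemma] = field(default_factory=dict)   # shared across all solution paths
    solutions: list[SolutionPath] = field(default_factory=list)  # numbered sub-solutions from the booklet
```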

What were the biggest multilingual hurdles across 17 languages—OCR accuracy, math notation, or terminology—and how did you validate that translations kept problem intent and difficulty intact?

All three bit us, but in different ways. OCR struggled with diacritics and older typefaces in languages that rarely appear in commercial OCR training, while notation drift—think decimal separators or localized combinatorial shorthand—created subtle errors that would tank a proof. Terminology was the trickiest: a one-word shift can soften a combinatorics constraint or change the quantifier scope. To validate, we used bilingual cross-checkers for each language pair where possible, back-and-forth translation checks to test for semantic loss, and difficulty calibration by asking native-language evaluators whether the English rendering still “felt” like the same problem. If difficulty judgments diverged, we reverted to the source phrasing and annotated the ambiguity rather than forcing a brittle standard.

You include both text and image-based problems. How did you handle figures, diagrams, and geometry notations at scale, and what quality checks caught errors that text-only pipelines would miss?

We separated image handling from text to avoid silent failures. Figures were extracted at high resolution, de-noised, and then re-linked to their exact reference points in the text so that “See Figure 2” always points to the right object. For geometry, we built a lightweight symbol table from labels in the figure—A, B, C, O, etc.—and reconciled them with mentions in the proof, catching cases where OCR misread a label or the scan cropped a key construction. Quality checks included bounding-box sanity tests (is the labeled segment present?), angle sum consistency when angles were given numerically, and human spot reviews when the pipeline flagged conflicts. These checks often surfaced issues that a text-only pass would miss, like a missing auxiliary line that made an otherwise valid proof seem to jump.
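
One of those checks, the angle-sum consistency test, is simple enough to sketch; the tolerance value and function name here are placeholders rather than the pipeline's actual settings.

```python
def triangle_angle_sum_ok(angles_deg: list[float], tol: float = 0.5) -> bool:
    """Flag figures whose stated triangle angles fail to sum to 180 degrees within tolerance."""
    if len(angles_deg) != 3:
        return False
    return abs(sum(angles_deg) - 180.0) <= tol

# A 68/72/40 triangle passes; an OCR misread of 48 for 40 gets flagged for human review.
assert triangle_angle_sum_ok([68.0, 72.0, 40.0])
assert not triangle_angle_sum_ok([68.0, 72.0, 48.0])
```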

Why rely exclusively on official national competition booklets rather than community forums, and how did peer-reviewed, multi-page solutions change the dataset’s usefulness for both students and models?

Provenance and depth. Official booklets come with peer-reviewed solutions that often run across multiple pages, walking through alternatives and justifying each leap. That gives students true worked examples rather than crowd-sourced shortcuts, and it gives models the longer reasoning chains they need to learn structure instead of pattern-matching. It also standardizes metadata—topics, year, country—so analytics across 143 competitions are meaningful. Compared with forum posts, the error rate on final solutions is far lower, and the presence of multiple solution paths lets us study strategy choice directly, not just correctness.

Can you share stories or metrics from the collaboration with Navid Safaei, whose personal scans formed the backbone of the archive, and how his contributions influenced data coverage and reliability?

Navid’s archive was a quiet miracle. He had been scanning booklets by hand since 2006, and those decades of care became the backbone of the corpus when we needed continuity across years that official sites no longer hosted. In practical terms, his scans filled holes that would have shattered time series for certain countries, and the image quality—consistent margins, careful de-skewing—meant our OCR error rates dropped materially in those segments. Reliability isn’t just fewer typos; it’s knowing a 1990s booklet links cleanly to a 2000s follow-up, so problem metadata chains don’t break. That continuity is what lets you run longitudinal analyses without constantly second-guessing the source.

You worked with 30+ human evaluators from countries like Armenia, Ukraine, Vietnam, and Poland. How did you train, calibrate, and reconcile graders to ensure consistent judgments across traditions?

We started with shared rubrics that spelled out what “correct,” “incomplete,” and “invalid” proofs mean, plus examples showing how different traditions phrase the same argument. Calibration rounds used a held-out set representing multiple countries and eras; graders scored independently, then we reconciled disagreements in group calls, capturing the resolution in guideline updates. We also ran periodic drift checks: the same grader re-scored a small historical batch to see if their bar shifted as they acclimated to the dataset. Importantly, we flagged culturally specific phrasing—say, terse Olympiad styles—so graders didn’t penalize concise but valid steps that matched the originating tradition.
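
One way to quantify those calibration rounds, not necessarily the exact statistic the team used, is chance-corrected agreement such as Cohen's kappa between pairs of graders on the shared held-out set; a minimal sketch follows.

```python
from collections import Counter

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two graders scoring the same calibration items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if expected >= 1 else (observed - expected) / (1 - expected)

# Two graders scoring the same ten proofs as correct / incomplete / invalid.
a = ["correct", "correct", "incomplete", "invalid", "correct",
     "incomplete", "correct", "invalid", "correct", "incomplete"]
b = ["correct", "incomplete", "incomplete", "invalid", "correct",
     "correct", "correct", "invalid", "correct", "incomplete"]
print(f"kappa = {cohen_kappa(a, b):.2f}")  # ~0.68 here; low values trigger another reconciliation call
```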

Students often train alone without formal coaching. How should a learner use this dataset day-to-day—problem selection, spaced repetition, and self-grading—and what pitfalls should they avoid?

Start narrow, then widen. Pick a topic focus for a week—e.g., geometry or number theory—pull problems stratified by year and country to sample different flavors, and aim for a daily set that mixes one warm-up, two mid-range, and one stretch problem. Log your attempts, then revisit unsolved problems via spaced repetition—day 1, day 3, day 7—making sure you write a minimal proof outline before re-reading the official multi-page solution paths. For self-grading, align with the rubric: separate “idea found” from “execution clean,” and mark gaps you’d need to fill under time pressure. The main pitfall is reaching for the solutions too early; give yourself a fixed think-time window—say, 30–45 minutes—so you’re building strategy muscles, not just memorizing tricks.
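
A minimal sketch of the day 1 / day 3 / day 7 revisit schedule, with illustrative names, might look like this:

```python
from datetime import date, timedelta

REVISIT_OFFSETS = (1, 3, 7)  # days after the first failed attempt

def revisit_dates(first_attempt: date, offsets=REVISIT_OFFSETS) -> list[date]:
    """Schedule spaced-repetition revisits for a problem you could not solve."""
    return [first_attempt + timedelta(days=d) for d in offsets]

# An unsolved problem attempted today gets queued for days 1, 3, and 7.
for when in revisit_dates(date.today()):
    print(when.isoformat())
```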

Models still stumble on visual reasoning. What specific failure modes did you see on figure-based problems, and what concrete steps—data augmentation, symbolic tools, or chain-of-thought pruning—actually help?

The big three failures are mislabeling, missed constructions, and brittle angle/length reasoning. Models conflate points with similar labels, ignore an auxiliary line introduced in step two, or accept numerically inconsistent angle sums because the textual chain “sounds” plausible. What helps is multi-view input (cropped sub-figures aligned to referenced steps), symbolic checks that enforce constraints like parallelism or cyclicity, and chain-of-thought pruning that strips redundant paraphrase while preserving derived invariants. Augmentations that synthetically vary label placements or reflect the figure without changing the geometry can reduce overfitting to layout. In short, tie the narrative proof to verifiable geometric predicates rather than letting it drift.
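
As a sketch of what "verifiable geometric predicates" can mean in practice, here are two checks, parallelism and collinearity, computed on coordinates extracted from a figure; the tolerance and function names are illustrative, not a specific library's API.

```python
Point = tuple[float, float]

def cross(o: Point, a: Point, b: Point) -> float:
    """Signed area of the parallelogram spanned by OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def parallel(p1: Point, p2: Point, q1: Point, q2: Point, tol: float = 1e-6) -> bool:
    """Check that segments p1p2 and q1q2 are parallel, up to numeric tolerance."""
    dp = (p2[0] - p1[0], p2[1] - p1[1])
    dq = (q2[0] - q1[0], q2[1] - q1[1])
    return abs(dp[0] * dq[1] - dp[1] * dq[0]) <= tol

def collinear(a: Point, b: Point, c: Point, tol: float = 1e-6) -> bool:
    """Check that three labeled points lie on one line."""
    return abs(cross(a, b, c)) <= tol

# If a model's step claims "AB parallel to CD", verify it on the figure's extracted coordinates.
assert parallel((0, 0), (2, 1), (1, 3), (5, 5))
assert collinear((0, 0), (1, 1), (3, 3))
```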

Even top systems averaged around 69.3% on a 6,400-problem benchmark. Where do models most frequently break—problem understanding, strategy selection, or proof verification—and how would you measure progress?

All three hurt, but strategy selection is the cliff. Many models parse the statement and produce plausible fragments, yet they fail to choose the right invariant or lemma that unlocks the problem. Verification failures show up as confident but ungrounded steps, especially when the solution spans multiple pages and revisits earlier claims. To measure progress, we track per-step entailment checks against the official multi-page structure, assess strategy identification via match-to-solution-path metrics, and segment results by modality so figure-inclusive items don’t get averaged away. An uptick from, say, mid-60s to high-70s overall matters less than closing the gap on the hardest third and on figure-based subsets.
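
Segmenting results by modality, so figure-inclusive items are reported separately rather than averaged away, is straightforward to sketch; the record fields below are illustrative.

```python
from collections import defaultdict

def accuracy_by_modality(results: list[dict]) -> dict[str, float]:
    """Report accuracy per modality so the easier text-only majority cannot hide figure failures."""
    totals = defaultdict(lambda: [0, 0])  # modality -> [correct, total]
    for r in results:
        bucket = totals[r["modality"]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {m: c / n for m, (c, n) in totals.items()}

results = [
    {"modality": "text", "correct": True},
    {"modality": "text", "correct": True},
    {"modality": "figure", "correct": False},
    {"modality": "figure", "correct": True},
]
print(accuracy_by_modality(results))  # {'text': 1.0, 'figure': 0.5}
```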

Some open-source models scored 0% on Mongolian. What does that reveal about training data coverage and tokenization, and what practical fixes—curated corpora, targeted finetuning, or synthetic bootstrapping—moved the needle?

A 0% score is a flashing red light for both coverage and subword modeling. It tells you the model barely sees the language in pretraining and may shard words into noisy fragments that destabilize reasoning. Practical fixes start with curated Mongolian math corpora—problem statements, official solutions, and expository texts—followed by targeted finetuning so the model learns the language’s mathematical register. Synthetic bootstrapping helps only if it’s grounded in authentic structures; we found that pairing real problems with paraphrases and aligned bilingual versions avoids drifting into empty patterning. Even small, well-constructed slices can unlock transfer when the underlying math is shared across languages.
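
A quick diagnostic for the subword problem, assuming you can call whatever tokenizer the model uses, is token fertility: subword tokens per word. The toy tokenizer below only stands in for a real one, to show how fragmentation inflates the ratio on non-Latin text.

```python
from typing import Callable

def fertility(text: str, tokenize: Callable[[str], list[str]]) -> float:
    """Average subword tokens per whitespace word; high values signal heavy fragmentation."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)

# Stand-in tokenizer for illustration only: it falls back to characters for non-ASCII words,
# mimicking how an under-trained subword vocabulary shatters low-resource text.
def toy_tokenize(text: str) -> list[str]:
    pieces: list[str] = []
    for word in text.split():
        pieces.extend([word] if word.isascii() else list(word))
    return pieces

english = "Prove that the sum of the angles is 180 degrees"
sample_mn = "Өнцгүүдийн нийлбэр 180 градус болохыг батал"  # illustrative sample; any genuine Mongolian problem text works here
print(fertility(english, toy_tokenize), fertility(sample_mn, toy_tokenize))
```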

Near-duplicate detection is hard; top embeddings found correct matches only ~5% top-1. What features proved most predictive—structure, invariants, or proof outlines—and how would you redesign embeddings for mathematical equivalence?

Structural signatures and invariants beat surface similarity. If two statements hinge on the same invariant—say, a conserved parity or a fixed sum under a transformation—they’re often equivalent despite wildly different wording or notation. Proof outlines are gold: capturing the sequence of lemmas, not the exact sentences, correlates with equivalence. I’d redesign embeddings to factor in three channels: a normalized statement graph, a candidate invariant set estimated from the text, and a compressed proof-plan vector extracted from the official multi-page solution. Training would use contrastive pairs from the archive—true equivalents across languages versus strong distractors—so top-1 isn’t stuck at 5% when the models face deep paraphrase.
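
Here is a minimal sketch of the three-channel fusion, with stub vectors standing in for real statement-graph, invariant, and proof-plan encoders; contrastive training on archive pairs would sit on top of this and is not shown.

```python
import numpy as np

def combine_channels(statement_graph_vec: np.ndarray,
                     invariant_vec: np.ndarray,
                     proof_plan_vec: np.ndarray,
                     weights=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Concatenate the three channels, each L2-normalized and weighted, into one embedding."""
    parts = []
    for v, w in zip((statement_graph_vec, invariant_vec, proof_plan_vec), weights):
        norm = np.linalg.norm(v)
        parts.append(w * v / norm if norm > 0 else v)
    return np.concatenate(parts)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stub vectors stand in for real channel encoders; only the fusion logic is illustrated.
rng = np.random.default_rng(0)
query = combine_channels(rng.normal(size=64), rng.normal(size=16), rng.normal(size=32))
candidate = combine_channels(rng.normal(size=64), rng.normal(size=16), rng.normal(size=32))
print(f"similarity = {cosine(query, candidate):.3f}")
```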

Retrieval-augmented generation boosted some models by up to 12 points, but irrelevant retrieval hurt in ~22% of cases. How should practitioners build retrieval pipelines that maximize relevance and minimize distraction?

Make retrieval precise and accountable. First, retrieve by structure, not just text—use problem graphs and invariant hints as keys. Second, re-rank with a teacher model that scores alignment to the target’s proof-plan, discarding near-misses that would distract; that alone cuts the ~22% degradation you see from off-topic context. Third, present retrieved content as scaffolds—high-level hints, key lemmas—rather than long passages that the model might copy blindly. Finally, include an abstain option: if confidence in relevance is low, solve without retrieval rather than poisoning the context window.
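
Here is a skeletal version of that retrieve, re-rank, abstain flow; the retriever and relevance scorer are passed in as callables, and the toy scorer below exists only to exercise the logic.

```python
from typing import Callable

def retrieve_with_abstain(query: str,
                          retrieve: Callable[[str, int], list[str]],
                          score_relevance: Callable[[str, str], float],
                          k: int = 20,
                          keep: int = 3,
                          min_score: float = 0.6) -> list[str]:
    """Retrieve broadly, re-rank by a relevance scorer, keep only high-confidence scaffolds,
    and return an empty list (abstain) when nothing clears the bar."""
    scored = sorted(((score_relevance(query, doc), doc) for doc in retrieve(query, k)),
                    key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:keep] if score >= min_score]

# Toy corpus, retriever, and overlap scorer stand in for a real index and re-ranker.
corpus = ["parity invariant on a chessboard", "cyclic quadrilateral angle chase", "weather report"]
hits = retrieve_with_abstain(
    "invariant argument for a coloring problem",
    retrieve=lambda q, k: corpus[:k],
    score_relevance=lambda q, d: len(set(q.split()) & set(d.split())) / len(set(d.split())),
)
print(hits)  # [] here: nothing clears the bar, so the model solves without retrieval
```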

Leaders want to test originality before proposing new Olympiad problems. What workflow would you recommend—candidate generation, structural retrieval, human review—and which metrics best detect subtle equivalences across languages and notations?

I’d run a three-stage gate. Stage one: canonicalize the candidate into a statement graph and derive likely invariants; then retrieve nearest neighbors across the 30,000+ problems using structure-first indexing. Stage two: compare proof-plan sketches against multi-page official solutions to spot deep overlaps—if your plan maps step-for-step to an older item, that’s a flag. Stage three: human committee review with bilingual checks, especially when cross-language paraphrase might hide duplication. Metrics should include structural overlap scores, invariant match rates, and plan-alignment precision/recall, calibrated on known near-duplicates. Anything that hits high overlap in two out of three channels deserves extra scrutiny.
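
The "two out of three channels" rule can be expressed directly; the thresholds below are placeholders to be calibrated on known near-duplicates, as described above.

```python
def needs_committee_review(structural_overlap: float,
                           invariant_match: float,
                           plan_alignment: float,
                           thresholds=(0.8, 0.7, 0.75)) -> bool:
    """Flag a candidate problem when at least two of the three channels show high overlap."""
    scores = (structural_overlap, invariant_match, plan_alignment)
    hits = sum(score >= t for score, t in zip(scores, thresholds))
    return hits >= 2

# A candidate whose statement graph and proof plan both track an archived problem gets flagged.
print(needs_committee_review(structural_overlap=0.85, invariant_match=0.4, plan_alignment=0.9))
```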

How did you define success metrics for coverage and reliability as the archive grew to 47 countries, 17 languages, and 143 competitions, and what checkpoints ensured you weren’t introducing silent errors?

Coverage had two axes: horizontal (countries and competitions) and temporal (year-by-year continuity). We tracked gaps as a percentage of missing years per competition and flagged any region that trailed the median by more than a set threshold so volunteers could prioritize targeted acquisition. Reliability rode on provenance checksums, cross-language concordance rates for translated problems, and re-grading audits by the 30+ evaluators. Before promoting a batch, we ran spot-solve trials on a random slice to ensure the problems were solvable as rendered—if a figure mislink or notation slip broke the logic, it wouldn’t pass. Those checkpoints are what keep a 25,000+ page collection from quietly drifting off course.
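
A minimal sketch of the missing-years metric and the median-trailing flag, with an arbitrary margin standing in for the real threshold:

```python
from statistics import median

def gap_percentage(years_present: set[int], first_year: int, last_year: int) -> float:
    """Share of years missing between a competition's first and last archived editions."""
    expected = set(range(first_year, last_year + 1))
    return 100.0 * len(expected - years_present) / len(expected)

def flag_trailing(gaps: dict[str, float], margin: float = 10.0) -> list[str]:
    """Competitions whose gap exceeds the median by more than `margin` percentage points."""
    med = median(gaps.values())
    return [name for name, g in gaps.items() if g > med + margin]

# Toy numbers: the third competition trails badly and gets routed to volunteers.
gaps = {"Competition A": 5.0, "Competition B": 8.0, "Competition C": 40.0}
print(flag_trailing(gaps))  # ['Competition C']
```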

What ethical considerations guided your decision to open the dataset publicly, and how did you balance researcher needs with fairness to students and competition organizers?

Ethically, provenance and intent mattered most. These are official national materials intended for broad educational use, not private notes scraped without consent, and opening them levels the playing field for students who train alone. We balanced researcher freedom with fairness by preserving attribution, maintaining original context, and resisting the urge to “gamify” active examination cycles—when a competition requested a blackout window for very recent items, we honored it. The result is a resource that serves both public education and rigorous research, without collapsing the safeguards that keep competitions meaningful.

What did you learn about cross-cultural problem styles—say, a Romanian combinatorics approach versus a Brazilian number theory angle—and how should models adapt to that diversity rather than averaging it away?

Diversity shows up in the moves each tradition treats as “obvious.” Some cultures lean on constructive bijections in combinatorics, others on extremal arguments; number theory can swing from local lifting tricks to global structure. If models average these away, they miss the sharp edges that actually solve problems. Instead, they should learn style-conditioned priors—recognize, for instance, that a country’s problems in a given decade often echo certain proof strategies—and use that to guide strategy selection. Our 47-country, four-decade span gives models enough signal to learn those priors if the training objective rewards strategy fit, not just final answers.

Looking ahead, how would you extend the retrieval and embedding benchmarks to better reflect real committee work where near-duplicates slip into IMO exams despite best efforts?

I’d raise the stakes with adversarial curation. Construct challenge sets where two problems share an invariant and proof-plan but diverge in surface cues across different languages and notations, then add distractors that are textually similar yet structurally unrelated. The benchmark would score both the retrieval step and a human-in-the-loop review step, since committees don’t operate purely automatically. We’d also evaluate cross-year leakage, asking whether the system can detect that a 1998 problem reappeared as a 2012 variant, so the metric aligns with real incidents. If an embedding can surface the true twin in the top-3 consistently, that’s the kind of performance committees can act on.
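
Scoring the retrieval side of such a benchmark can be as simple as top-k recall over known twin pairs; here is a small sketch with made-up IDs.

```python
def top_k_recall(ranked_ids: list[list[str]], true_twin_ids: list[str], k: int = 3) -> float:
    """Fraction of queries whose known near-duplicate appears in the top-k retrieved items."""
    hits = sum(twin in ranked[:k] for ranked, twin in zip(ranked_ids, true_twin_ids))
    return hits / len(true_twin_ids)

# Two challenge queries: the first twin is found at rank 2, the second is missed entirely.
ranked = [["p_0441", "p_2012_variant", "p_0988"], ["p_0107", "p_0301", "p_0555"]]
twins = ["p_2012_variant", "p_0999"]
print(top_k_recall(ranked, twins, k=3))  # 0.5
```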

What is your forecast for AI models tackling Olympiad-level math over the next five years?

I expect steady but uneven gains, with the steepest improvements coming from structure-aware training, reliable visual reasoning, and stronger low-resource language support. On a benchmark of 6,400 problems where leaders now average around 69.3%, I’d project high-70s to low-80s with disciplined retrieval (+12 when relevant) and tighter control to avoid the ~22% irrelevant drag. Visual problems will remain the last holdout until models ground their chains of thought in checkable geometric predicates; once that lands, you’ll see a step-change on figure-inclusive subsets. The most encouraging shift will be inclusivity: as curated non-English corpora shore up gaps that produced 0% on languages like Mongolian, models will learn broader mathematical “accents,” and that diversity will make them better reasoners, not just better test takers.
