I'm a mathematician relying heavily on AI as an association engine of massive scope, to organize and expand my thoughts. One doesn't get best results by "testing" AI.
A surfboard is also an amazing tool, but there's more to operating one than telling it which way to go.
Many people want self-driving cars so they can drink in the back seat watching movies. They'll find their jobs replaced by AI, with a poor quality of life because we're a selfish species. In contrast Niki Lauda trusted fellow Formula 1 race car driver James Hunt to race centimeters apart. Some people want AI to help them drive that well. They'll have great jobs as AI evolves.
Garry Kasparov pioneered "freestyle" chess tournaments after his defeat by Deep Blue, in which the best human players were paired with computers, coining the "centaur" model of human-machine cooperation. This is frequently cited in the finance literature, where it is recognized that AI-guided human judgement can outperform either humans or machines alone.
Any math professor knows how to help graduate students confidently complete a PhD thesis, or how to humiliate students in an oral exam. It’s a choice. To accomplish more work than one can complete alone, choose the former. This is the arc of human evolution: we develop tools to enhance our abilities. We meld with an abacus or a slide rule, and it makes us smarter. We learn to anticipate computations, like we’re playing a musical instrument in our heads. Or we pull out a calculator that makes us dumber. The role we see for our tools matters.
Programmers who actually write better code using AI know this. These HN threads are filled with despair over the poor quality of vibe coding. At the same time, Anthropic is successfully coding Claude using Claude.
I think you're misunderstanding the point this paper is trying to make. They're interested in trying to distinguish whether AI is capable of solving new math problems or only capable of identifying existing solutions in the literature. Distinguishing these two is difficult, because self-contained math problems that are easy enough for LLMs to address (e.g. minor Erdős problems) may have been solved already as subcomponents of other work, without this being widely known. So when an AI makes progress on such an Erdős problem, we don't know if it had a new idea, or correctly identified an existing but obscure answer. This issue has been dogging the claims of AI solving Erdős problems.
Instead, here you get questions that extremely famous mathematicians (Hairer, Spielman) are telling you (a) are solvable in <5 pages (b) do not have known solutions in the literature. This means that solutions from AI to these problems would perhaps give a clearer signal on what AI is doing, when it works on research math.
I find it hard to believe that they can't settle this question for themselves, without posting this, simply by asking the AI enough novel questions. I myself have little doubt that they can solve at least some novel questions (of course, similarity of proofs is a spectrum, so it's hard to draw the line at how original they are).
I settle this question for myself every month: I try asking ChatGPT and Gemini for help, but in my domains it fails miserably at anything that looks new. But, YMMV, that's just the experience of one professional mathematician.
New doesn't have to mean "the hardest thing yet", but for humans who have mastered their subdomain, the two are often the same.
Centaurs are a transient phenomenon. In chess, the era of centaur supremacy lasted only about a decade before computers alone eclipsed human+computer. The same will be true in every other discipline.
You can surf the wave, but sooner or later, the wave will come crashing down.
They are transient only in those rare domains that can be fully formalized/specified. Like chess. Anything that depends on the messy world of human-world interactions will require humans in the loop for translation and verification purposes.
> Anything that depends on the messy world of human-world interactions will require humans in the loop for translation and verification purposes.
I really don't see why that would necessarily be true. Any task that can be done by a human with a keyboard and a telephone is at risk of being done by an AI - and that includes the task of "translation and verification".
Sure, but at the risk of running into completely unforeseen and potentially catastrophic misunderstandings. We humans are wired to use human languages to interact with other humans, who share our human experience, which AIs can only imperfectly model.
Sure, but in pure mathematics there are a lot of well-specified problems which no one can solve.
From a human, to a centaur, to a pegasus, as it were.
How is chess not fully specified?
They said chess was an example of something that is fully specified.
I'm guessing that they were referring to the depth of the decision tree able to be computed in a given amount of time?
In essence, it used to be (I have not stayed current) that the "AI" was limited in how many moves into the future it could search to determine which move was optimal.
That limit means it is impossible to enumerate all the possible continuations and determine which is guaranteed to lead to a win. (The "best" that can be done is to have a machine learning algorithm choose the most likely set of moves that a human would take from the current state, and which of that set would most likely lead to a win.)
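For concreteness, the depth limit described above is just the cutoff in a standard minimax search with alpha-beta pruning. Here is a minimal Python sketch, where evaluate, legal_moves, and apply_move are hypothetical stand-ins for a real engine's components:

```python
import math

def alphabeta(state, depth, alpha, beta, maximizing,
              evaluate, legal_moves, apply_move):
    """Depth-limited alpha-beta search: stop after `depth` plies and fall back
    to a heuristic evaluation instead of searching all the way to mate."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)  # heuristic guess, not a proven outcome
    if maximizing:
        best = -math.inf
        for move in moves:
            score = alphabeta(apply_move(state, move), depth - 1, alpha, beta,
                              False, evaluate, legal_moves, apply_move)
            best = max(best, score)
            alpha = max(alpha, best)
            if alpha >= beta:  # prune: the opponent already has a better option
                break
        return best
    best = math.inf
    for move in moves:
        score = alphabeta(apply_move(state, move), depth - 1, alpha, beta,
                          True, evaluate, legal_moves, apply_move)
        best = min(best, score)
        beta = min(beta, best)
        if alpha >= beta:
            break
    return best
```

At the depth cutoff the search falls back to the heuristic evaluate rather than a proven game outcome, which is exactly the limitation being described.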
How transient depends on the problem space. In chess, centaurs were transient. In architecture or CAD, they have been the norm for decades.
Agreed. But I don't think the time scale will be similar.
Chess is relatively simple in comparison, as complex as it is.
On the other hand, chess is not very financially rewarding. IBM put some money into it for marketing briefly, but that’s probably equal to about five minutes of spend from the current crop of LLM companies.
Last I heard, which was last year, human + computer still beat either by themselves. You got a link about what's changed?
I'm curious what you heard exactly. As far as I can tell, centaur chess looks completely dead.
Nobody ever wins anymore in the ICCF championships (which I believe is the most prestigious centaur chess venue, but am not sure).
This is not an exaggeration. See my comment from several months ago: https://news.ycombinator.com/item?id=45768948
As far as I can tell based on scanning forums, to the extent humans contribute anything to the centaur setup, it is entirely in hardware provisioning and allocating enough server time before matches for chess engines to do precomputation, rather than anything actually chess related, but I am unsure on this point.
I have heard anecdotally from non-serious players (and therefore I cannot be certain that this reflects sentiment at the highest levels, although the ICCF results seem to back this up) that the only ways to lose in centaur chess at this point are to deviate from what the computer tells you to do (either intentionally, or unintentionally by accidentally submitting the wrong move) or simply to be at a compute disadvantage.
I've got several previous comments on this because this is a topic that interests me a lot, but the two most topical here are the previous one and https://news.ycombinator.com/item?id=33022581.
You're the one claiming "Last I heard" so you're the one who owes a link.
Why do you nitpick his illustrative example and entirely ignore his substantive one about finance?
I'm highly worried that you are right. But what gives me hope is that people still play chess, I'd argue even more than ever. People still buy paper books and vinyl records. People still appreciate handwritten greeting cards over printed ones, pay extra to listen to live music when a recording is free and will likely sound much better, pay an order of magnitude more for a seat at a theater for a live play, and pay a premium for handmade products over their almost-impossible-to-distinguish knockoffs.
> Anthropic is successfully coding Claude using Claude.
Claude is one of the buggiest pieces of shit I have ever used. They had to BUY the creators of bun to fix the damn thing. It is not a good example of your thesis.
You and the GP are conflating Claude, the company or its flagship model Claude Opus, with Claude Code, a state-of-the-art coding assistant that admittedly has a slow and buggy React-based TUI (output quality is still very competitive).
> At the same time, Anthropic is successfully coding Claude using Claude.
Is that why everyone keeps complaining about the quality getting worse?
I think that’s more about model performance degrading due to less computational resources being assigned to them over time.
This is beautifully written, thank you for writing it.
Typing out solutions to problems was only part of the job description because there was no other way to code. Now we have a far better way.
> I'm a mathematician relying heavily on AI as an association engine of massive scope, to organize and expand my thoughts.
Can you share more about your architecture & process? I'm also a researcher involved in math research (though not strictly speaking a mathematician, but I digress). I've often thought about using AI on my notes, but they are messy, and even then I can't quite figure out what to ask: prioritization, connecting ideas, lit search, etc.
I'd love to hear what you do.
You didn't need to make this claim about driving. Coding requires robust metacognition. Driving doesn't; it can be drilled repetitively, and it also benefits from having superhuman senses and instant reaction times. It's somewhat more amenable to AI.
Very well written. Thank you for putting down your thoughts so succinctly; I'm often at a loss for words when I try to express the same thoughts in a coherent manner.
That centaurs can outperform humans or AI systems alone is a weaker claim than "these particular AI systems have the required properties to be useful for that". Chess engines consistently produce strong lines, and can play entire games without human assistance: using one does not feel like gambling, even if occasionally you can spot a line it can't. LLMs catastrophically fail at iterated tasks unless they're closely supervised, and using LLMs does feel like gambling. I think you're overgeneralising.
There is definitely a gap in academic tooling, where an "association engine" would be very useful for a variety of fields (and for encouraging cross-pollination of ideas between fields), but I don't think LLMs are anywhere near the frontier of what can be accomplished with a given amount of computing power. I would expect simpler algorithms operating over more explicit ontologies to be much more useful. (The main issue is that people haven't made those yet, whereas people have made LLMs.) That said, there's still a lot of credit due to the unreasonable effectiveness of literature searches: it only usually takes me 10 minutes a day for a couple of days to find the appropriate jargon, at which point I gain access to more papers than I know what to do with. LLM sessions that substitute for literature review tend to take more than 20 minutes: the main advantage is that people actually engage with (addictive, gambling-like) LLMs in a way that they don't with (boring, database-like) literature searches.
I think developing the habit of "I'm at a loose end, so I'll idly type queries into my literature search engine" would produce much better outcomes than developing the habit of "I'm at a loose end, so I'll idly type queries into ChatGPT", and that's despite the state-of-the-art of literature search engines being extremely naïve, compared to what we can accomplish with modern technology.
We're in agreement. I understand how much harder it is to "think with AI"; the last year of my life has been a brutal struggle to figure this out.
I also agree that neural net LLMs are not the inevitable way to implement AI. I'm most intrigued by the theoretical underpinnings of mathematical proof assistants such as Lean 4. Computer scientists understand the word problem for strings as undecidable. The word problem for typed trees with an intrinsic notion of induction is harder, but constructing proofs is finding paths in this tree space. Just as mechanical computers failed in base ten while at the same time Boole had already developed base two logic, I see these efforts merging. Neural nets struggle to simulate recursion; for proof assistants recursion is baked in. Stare at these tree paths and one sees thought at the atomic level, begging to be incorporated into AI. For now the river runs the other way, using AI to find proofs. That river will reverse flow.
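To make the "recursion is baked in" point concrete, here is a minimal, self-contained Lean 4 sketch (no mathlib needed): an inductive type, a structurally recursive definition, and a proof by induction, all of which live in the same tree-shaped term language:

```lean
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

-- Structural recursion on the first argument; the equation compiler
-- checks that this terminates.
def add : MyNat → MyNat → MyNat
  | MyNat.zero,   n => n
  | MyNat.succ m, n => MyNat.succ (add m n)

-- The proof itself is a tree, built by induction on the same inductive type.
theorem add_zero_right (m : MyNat) : add m MyNat.zero = m := by
  induction m with
  | zero => simp [add]
  | succ m ih => simp [add, ih]
```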
Lean 4 is not a theoretically-interesting proof assistant. If you're interested in such things, look into Rocq (which uses CoIC, like Lean, but is more rigorous about it), the HOL logic, Isabelle/HOL's automation suite (though Isabelle proper is fairly mediocre, apart from being the thing everyone's standardised around), Lean-auto (https://arxiv.org/abs/2505.14929), and whatever SAT solvers are state-of-the-art this week. Like the tools for symbolic integration and frequentist statistics, there isn't any magic: the power comes from handling enough uninteresting special-cases that we get broad coverage. (Personally, I think there's still a lot of power being left on the table by using overly-general algorithms: sledgehammer is used to crack a lot of nuts, even when that takes quadratic time or longer.)
While CoIC has recursion "baked in", HOL does not. It turns out that we can treat structural recursion as a derived property, even over coinductively-defined types. We don't even need a notion of ordinals for this! (See https://www.tcs.ifi.lmu.de/staff/jasmin-blanchette/card.pdf and https://matryoshka-project.github.io/pubs/bindings.pdf.)
Lean 2 used HoTT, which was theoretically interesting, but not enough was known about HoTT at the time (in particular, whether it was a constructive logic – I think we have all the pieces for an explicit construction via cubical type theory now, but I don't know that anyone's put the pieces together), so that direction has been mostly abandoned. I think there's useful work to be done in that direction, but with the current state of HoTT pedagogy, I doubt I'd ever be able to keep on top of it enough to contribute; and with Lean 4 taking so much of the funding, I don't think we'll see much work in this direction until HoTT is easier to learn.
I still think you're overgeneralising. What actual thing does your poetic tree / thought / river analogy correspond to?
We made those in the 80s. Much was learned about why probabilistic stochastic parrots are a far better model.
Those were "let's get experts to manually code every single document according to a schema defined in advance". Nowadays, we have techniques for automatically-generating explicit pseudo-semantic ontology representations from large datasets (see, for example, https://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang... for image classification tasks). Getting a machine learning model to identify field-specific heuristics, map conventions from one field to another, and then constructing an index that allows us to quickly produce a search / proximity metric from an arbitrary specification, was not really possible in the 80s.
"Throw a massive neural network at it" is an extremely inefficient way to get results, and doesn't generalise well – for instance, there's no easy way to get online learning for a transformer model, whereas that capability just falls out of most search engine database systems. (The underlying relational database engines had a lot of work put in to make online CRUD work reliably, but that work has been done now, and we can all build on top of it without a second thought.)
Fair enough.
What a beautifully articulated take!
These are very serious research level math questions. They are not “Erdős style” questions; they look more like problems or lemmas that I encountered while doing my PhD. Things that don’t make it into the papers but were part of an interesting diversion along the way.
It seems likely that PhD students in the subfields of the authors are capable of solving these problems. What makes them interesting is that they seem to require fairly high research level context to really make progress.
It’s a test of whether the LLMs can really synthesize results from knowledge that require a human several years of postgraduate preparation in a specific research area.
So these are like those problems that are “left for the reader”?
Not necessarily. Even the statements may not appear in the final paper. The questions arose during research, and understanding them was needed for the authors to progress, but maybe not needed for the goal in mind.
No, results in a paper are identified to be "left for the reader" because they are thought to be straightforward to the paper's audience. These are chosen because they are novel. I didn't see any reason to think they are easier than the main results, just maybe not of as much interest.
Very serious for mathematicians - not for ML researchers.
If the paper would not have had the AI spin, would those 10 questions still have been interesting?
It seems to me that we have here a paper that is solely interesting because of the AI spin -- while at the same time this AI spin is really poorly executed from the point of AI research, where this should be a blog post at most, not an arXiv preprint.
I’m confused by this comment. I’m pretty sure that someone at all the big labs is running these questions through their models and will report back as soon as the results arrive (if not sooner, assuming they can somehow verify the answers).
The fact that you find it odd that this landed on arXiv is maybe a cultural thing… mathematicians kinda reflexively throw work up there that they think should be taken seriously. I doubt that they intend to publish it in a peer reviewed journal.
Yes, but people at those labs may be running those problems because a Fields Medalist is in the paper, and it got hype.
Not because of the problems, and not because this is new methodology.
And once the labs report back, what do we know that we didn't know before? We already know, as humans, the answers to the problems, so that is not it. We already know that LLMs can solve some hard problems and fail on easy ones, so that is not it either.
So what do we really learn?
The last "unsolved Erdős problem" proof generated by LLMs that hit the news was so uninteresting that a paper published by Erdős himself already stated the proof.
Aaaaaand no one had cared enough to check.
So I think the question is: are these interesting by themselves, or are they just uninteresting problems no one will ever care about, except that solving them would indicate LLMs are good at solving complex novel problems that do not exist in their training set?
The timed-reveal aspect is also interesting.
How is that interesting from a scientific point of view? This seems more like a social experiment dressed as science.
Science should be about reproducibility, and almost nothing here is reproducible.
Deepmind’s Nobel Prize was primarily for its performance in CASP which is pretty much exactly this. Labs solve structures of proteins, but don’t publish them until after all the computational teams predict structures.
So I’m not sure where you’re coming from claiming that this isn’t scientific.
It wasn't like this in any way.
CASP relies on a robust benchmark (not just 10 random proteins), and has clear participation criteria, objective metrics for how the eval plays out, etc.
So I stand by my claim: This isn't scientific. If CASP is Japan, a highly organized & civilized society, this is a banana republic.
Reproducibility is just one aspect of science; logic and reasoning from principles and data is the major aspect.
There are some experiments which cannot be carried out more than once.
> There are some experiments which cannot be carried out more than once
Yes, in which case a very detailed methodology is required: which hardware, runtimes, token counts etc.
This does none of that.
Not an expert but #7 is -almost- elementary.
This is exciting as a reality check of our expectations from the current level of AI. I expect AIs to solve at least 2-3 of them in a week. I expect one “easy” problem that multiple models solve. And I expect at least one solution to be “interesting” and different than the human solutions. I also expect human researchers to solve more than AIs in a week (globally, by total) but I don’t know what happens if they publish their results during the week. We’ll see results soon.
Yeah, I pointed a custom thing and Claude at #6, and it solved it in Lean, aside from needing to axiomatize one theorem not in mathlib. Only about four of the problems have enough foundations formalized in mathlib for this approach, though.
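For anyone wondering what "axiomatize one theorem not in mathlib" amounts to in practice, it is roughly the following Lean sketch, where missing_result is a made-up placeholder statement rather than the actual theorem in question:

```lean
import Mathlib

-- Hypothetical placeholder for a known result with no mathlib proof yet;
-- we assume it instead of proving it.
axiom missing_result (n : ℕ) (hn : 2 ≤ n) : ∃ p : ℕ, Nat.Prime p ∧ p ≤ n

-- Downstream proofs can use it like any theorem, but they now depend on the
-- axiom; `#print axioms` on the final result will list `missing_result`.
example : ∃ p : ℕ, Nat.Prime p ∧ p ≤ 10 := missing_result 10 (by norm_num)
```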
Can someone explain how this would work?
> the answers are known to the authors of the questions but will remain encrypted for a short time.
Ok. But humans may be able to solve the problems too. What prevents Anthropic or OpenAI from hiring mathematicians, having them write the proof, and passing it off as LLM-written? I'm not saying that's what they'll do. But shouldn't the paper say something about how they're going to validate that this doesn't happen?
Honest question here. Not trying to start a flame here. Honestly confused how this is going to test what it wants to test. Or maybe I'm just plain confused. Someone help me understand this?
This is not a benchmark. They just want to give people the opportunity to try their hand at solving novel questions with AI and see what happens. If an AI company pulls a solution out of their hat that cannot be replicated with the products they make available to ordinary people, that's hardly worth bragging about and in any case it's not the point of the exercise.
Hey, sorry, totally out of context but I've always wanted to ask about the username. I keep reading it as "yoruba" in my mind. What does it mean, if I'm not being indiscreet?
You're not the first to have wondered: https://news.ycombinator.com/item?id=20730027
Well, now that I read that comment I remembered having read it before. My mind is going.
They could solve the problems and train the next models with the answers; as such, future models could "solve" these.
The authors mention that before publication they tested these questions on Gemini and GPT, so they have been available to the two biggest players already; they have a head start.
Looks like very sloppy research.
I don't think it's that serious...it's an interesting experiment that assumes people will take it in good faith. The idea is also of course to attach the transcript log and how you prompted the LLM so that anyone can attempt to reproduce if they wish.
If you want to do this rigorously, you should run it as a competition like the guys at the AI-MO Prize are doing on Kaggle.
That way you get all the necessary data.
I still think this is bro science.
If this were a competition, some people would try hard to win it. But the goal here is exploration, not exploitation. Once the answers are revealed, it's unlikely a winner will be identified, but a bunch of mathematicians who tried prompting AI with the questions might learn something from the exercise.
Nothing prevents them, and they are already doing that. I work in this field and one can be sure that now, because of the notoriety this preprint got, the questions will be solved soon.
It's possible but unlikely given the short timeline, diverse questions that require multiple mathematicians, and low stakes. Also, they've already run preliminary tests.
> It's possible but unlikely given the short timeline
Yep. "possible but unlikely" was my take too. As another person commented, this isn't really a benchmark, and as long as that's clear, it seems fair. My only fear is that some submissions may be AI-assisted rather than fully AI-generated, with crucial insights coming from experienced mathematicians. That's still a real achievement even if it's human + AI collaboration. But I fear that the nuance would be lost on news media and they'll publish news about the dawn of fully autonomous math reasoning.
Because LLM inference can be made deterministic, they could provide the model files, prompt, and seed used.
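For open-weights models this is easy to demonstrate. Here is a minimal sketch with Hugging Face transformers, using greedy decoding so the output is fully determined by the weights and the prompt (hardware-level nondeterminism aside); the model name and prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)  # only matters if sampling were enabled
name = "gpt2"         # stand-in: any open-weights model id works the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Question 7 asks whether ..."  # hypothetical prompt
inputs = tok(prompt, return_tensors="pt")
# Greedy decoding: rerunning with the same weights and prompt reproduces the output.
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```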
That was exactly my first thought as well. All those exercises are pointless and people don't seem to understand it, it's baffling.
Even if it's not Anthropic or OpenAI paying for the solutions, maybe it'll be someone solving them "for fun" because the paper got popular and posting them online.
It's a futile exercise.
I'm realizing I don't know if it's currently harder for an LLM to:
* come up with a formal proof that checks out according to a theorem prover
* come up with a classical proof that's valid at a high-level, with roughly the same correctness as human-written papers
Is this known?
The advantage of the formal proof is that the LLM in a loop can know that it failed and keep trying.
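A minimal sketch of that loop, assuming a mathlib-enabled lake project on the machine and with generate_candidate as a hypothetical stand-in for whatever LLM call is used:

```python
import subprocess
import tempfile
from pathlib import Path

def check_with_lean(proof_source: str) -> tuple[bool, str]:
    """Write a candidate Lean proof to a file and ask Lean to check it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(proof_source)
        # Assumes this script runs from inside a mathlib-enabled lake project.
        result = subprocess.run(["lake", "env", "lean", str(path)],
                                capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

def prove_in_a_loop(statement: str, generate_candidate, max_attempts: int = 10):
    """Keep asking the model for a proof until the checker accepts one."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate_candidate(statement, feedback)  # hypothetical LLM call
        ok, log = check_with_lean(candidate)
        if ok:
            return candidate  # verified by the checker, not by the model's confidence
        feedback = log        # feed the error messages into the next attempt
    return None
```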
https://1stproof.org/#about
Anything special about these questions? Are they unsolved by humans? I am not working in mathematics research, so it's hard to tell the importance.
The abstract of the article is very short, and seems to answer both of your questions pretty clearly.
This is what is special about them:
> a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now;
I.e. these are problems of some practical interest, not just performative/competitive maths.
And this is what is known about the solutions:
> the answers are known to the authors of the questions but will remain encrypted for a short time.
I.e. a solution is known, but is guaranteed to not be in the training set for any AI.
> I.e. a solution is known, but is guaranteed to not be in the training set for any AI.
Not a mathematician and obviously you guys understand this better than I do. One thing I can't understand is how they're going to judge if a solution was AI written or human written. I mean, a human could also potentially solve the problem and pass it off as AI? You might say why would a human want to do that? Normal mathematicians might not want to do that. But mathematicians hired by Anthropic or OpenAI might want to do that to pass it off as AI achievements?
Well, I think the paper answers that too. These problems are intended as a tool for honest researchers to use for exploring the capabilities of current AI models, in a reasonably fair way. They're specifically not intended as a rigorous benchmark to be treated adversarially.
Of course a math expert could solve the problems themselves and lie by saying that an AI model did it. In the same way, somebody with enough money could secretly film a movie and then claim that it was made by AI. That's outside the scope of what this paper is trying to address.
The point is not to score models based on how many of the problems they can solve. The point is to look at the models' responses and see how good they are at tackling the problem. And that's why the authors say that ideally, people solving these problems with AI would post complete chat transcripts (or the equivalent) so that readers can assess how much of the intellectual contribution actually came from AI.
> these are problems of some practical interest, not just performative/competitive maths.
FrontierMath did this a year ago. Where is the novelty here?
> a solution is known, but is guaranteed to not be in the training set for any AI.
Wrong, as the questions were posed to commercial AI models and they can solve them.
This paper violates basic benchmarking principles.
> Wrong, as the questions were posed to commercial AI models and they can solve them.
Why does this matter? As far as I can tell, because the solution is not known this only affects the time constant (i.e. the problems were known for longer than a week). It doesn't seem that I should care about that.
Because the companies have the data and can solve them -- so once the questions have been provided to a company with the necessary manpower, one can no longer guarantee that the solution is not known, and not contained in the training sample.
We need more of these kinds of tightly time controlled challenges for LLMs.
February 13th is a pretty close deadline. They should at least have given a month.
February 13 seems right to me. I mean, it's not like LLMs need to manually write out a 10-page proof. But a longer deadline would give human mathematicians time to solve the problem and write out a proof. A close deadline advantages the LLM and disadvantages humans, which should be the goal if we want to see if LLMs are able to solve these.
Interesting questions. I think I'll attempt #7.
Tried all ten with Claude, then had Codex take a look at the work -- Codex thinks number 7 has the lowest chance of being correct, a 1-out-of-10 rating. None of them were rated higher than a 7/10 chance of being right so far, as done by Claude Opus 4.6 and evaluated by Codex 5.3 on its highest setting.
Not going to spend too many more tokens on this.
I don't think either of these is the best choice for this. ChatGPT 5.2 Pro and Gemini 3 Pro Deep Thinking are, I believe, the strongest LLMs at "pure thought", i.e. things like mathematical reasoning.
No, this is not a proof because not using Mizar ;-) https://mizar.uwb.edu.pl/
Would something be a proof in that sense even if it did use Mizar? As far as I can tell, Mizar has no complete reference for its language semantics, except for the single closed-source implementation. In general, information about the system itself (outside of the library) seems very scarce, or at least scarcely advertised.
Mizar source was "available upon request" for maybe 30-40 years. It got completely open-sourced under GPL some 3 years ago (maybe earlier, not sure); see [1], and also [2] and [3] about an alternative implementation in Rust. Mizar is indeed "scarcely advertised", but all the information is publicly available; whoever wants to know, knows. As for Mizar semantics, see for example [4].
[1] https://github.com/MizarProject/system [2] https://github.com/digama0/mizar-rs [3] https://arxiv.org/pdf/2304.08391v2 [4] https://link.springer.com/article/10.1007/s10817-018-9479-z
I wonder how many of these the authors privately know to be false.
I'll patiently wait for the "goalpost moving olympics" after this is published.
The goalposts have been on wheels basically since the field was born. Look up "AI effect". I've stopped caring what HN comments have to say about whether something is or isn't AI. If its useful to me, I'm gonna use it.
> Conflicts of interest. No funding was received for the design or implementation of this project. None of the authors of this report was employed by or consulted with AI companies during the project, nor will they do so while contributing to it
As it should. Good.
This is a totally independent test, not conducted by or in collaboration with any of the AI companies or their employees, so that no bias is introduced at all [0].
[0] Unless the researchers have undisclosed ownership of shares in private AI companies.
As mathematically interesting as the 10 questions the paper presents are, the paper is --sorry for the harsh language-- garbage from the point of view of benchmarking and ML research: just 10 questions, few descriptive statistics, no interesting points other than "can LLMs solve these uncontaminated questions", and no broad bench of LLMs evaluated.
The field of AI4Math has so many benchmarks that are well executed -- based on the related work section, it seems the authors are barely familiar with AI4Math at all.
My belief is that this paper is even being discussed solely because a Fields Medalist, Martin Hairer, is on it.
A paper not about benchmarking or ML research is bad from the perspective of benchmarking. Not exactly a shocker.
The authors themselves literally state: "Unlike other proposed math research benchmarks (see Section 3), our question list should not be considered a benchmark in its current form"
On the website https://1stproof.org/#about they claim: "This project represents our preliminary efforts to develop an objective and realistic methodology for assessing the capabilities of AI systems to autonomously solve research-level math questions."
Sounds to me like a benchmark in all but name. And they failed pretty terribly at achieving what they set out to do.
> And they failed pretty terribly at achieving what they set out to do.
Why the angst? If the AI can autonomously solve these problems, isn't that a huge step forward for the field?
It's not angst. It's intense frustration that they 1) are not doing the science correctly, and 2) that others (e.g. FrontierMath) already did everything they claim to be doing, so we won't learn anything new here, but somehow 1stproof gets all the credit.
Are they really trying to do science, or are they just trying to determine pragmatically whether or not current AI is useful for a research mathematician in their day to day job?