radioactivist 7 hours ago

The data set quality seems really spotty based on looking at a few random problems (I looked at about a dozen in the "Physics" subcategory). Several problems had no clear question (or answer) and seemed to be clipped from some longer resource, and thus had back-references to Sections and Chapters that the models clearly couldn't follow. Worse, the verification of the answer seems to be via an LLM and not all that reliable; I saw several where the answer was marked correct when it clearly wasn't, and several that were correct but not in the precise form given as "the" answer and thus were labelled as incorrect.

  • rosstaylor90 4 hours ago

    Thanks for the feedback! Yes, we’re looking to improve quality in the coming months. A couple of notes:

    - The initial use of the data is distillation, so we’re less bound by question quality (anything that elicits output diversity is good).

    - But moving on to RL, we’ll need stronger quality. We have much better things planned, both on data filtering and verification!

    - Surprisingly, a lot of ML datasets actually look like this when you look under the hood. We’re hoping that having more eyeballs on it will help improve quality in the long run over the less transparent status quo!

akomtu 8 hours ago

The output of a reasoning model must be an algorithm, a formula, or something similar in a formal language that leaves no room for ambiguity. I can "reason" all day about the P=NP problem, but I won't be able to come up with something verifiable. A language model may translate the formal language of a reasoning model into English or Chinese, for example.

Once this stage is reached, once we can throw piles of data at a reasoning model and get formal algorithms that explain or predict that data, a new era will begin.

westurner 12 hours ago

"Can Large Language Models Emulate Judicial Decision-Making? [Paper]" (2025) https://news.ycombinator.com/item?id=42927611 ; awesome-legal-nlp, LexGLUE, FairLex, LegalBench, "Who hath done it?" exercise : {Thing done}, ({Gdo, You, Others, Unknown/Nobody} x {Ignorance, Malice, Motive, Intent}) ... Did nobody do this?

Can LLMs apply a consistent procedure for logic puzzles with logically disjunctive possibilities?

Enter: Philosoraptor the LLM

carimura 16 hours ago

nginx not happy.

  • rosstaylor90 15 hours ago

    Happier now, upgraded the backend :) (co-creator here)

emorning3 15 hours ago

LLMs cannot reason, they can only say things that sound reasonable, there's a difference. Duh.

  • rosstaylor90 15 hours ago

    What's your AIME 2025 score? https://gr.inc/RJT1990/AIME2025/

    • nyrikki 14 hours ago

      That is the point of the AIME: it is a 3-hour, closed-book examination in which each answer is an integer from 0 to 999 and should only depend on pre-calc... for a human with no calculator, notes, or internet access.

      The concepts are heavily covered in the training corpus, and if people were allowed to take it more than once, with even a book, let alone internet access, it wouldn't be very hard.

      Examples:

      1) Find the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$

      In the corpus: https://www.quora.com/In-what-bases-b-does-b-7-divide-into-9...
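
      For what it's worth, that one is also trivially checkable by machine: in base $b$, $17_b = b + 7$ and $97_b = 9b + 7$, so a quick brute-force scan (my own sketch, not part of the dataset) settles it instantly.

          # Brute-force check: find every base b > 9 for which
          # 17_b (= b + 7) divides 97_b (= 9b + 7).
          # Since 9b + 7 = 9(b + 7) - 56, b + 7 must divide 56, so b <= 49;
          # scanning a little past that bound is plenty.
          bases = [b for b in range(10, 100) if (9 * b + 7) % (b + 7) == 0]
          print(bases, sum(bases))  # [21, 49] 70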

      And one more:

      3) https://artofproblemsolving.com/wiki/index.php/2025_AIME_I_P...

      That one is just the number of ways to distribute k indistinguishable balls (players) into n distinguishable boxes (flavors), without exclusion, in such a way that no box is empty (a standard stars-and-bars count; sketched below).

      Thus it's in the corpus for any course that needs to cover combinatorial problems, including physics, discrete math, logistics, etc.
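
      The count itself is the textbook stars-and-bars formula, C(k-1, n-1); a minimal sketch (illustrative only, covering just the count described above, not the full contest problem):

          from math import comb

          def nonempty_distributions(k, n):
              # Stars and bars: the number of ways to put k indistinguishable balls
              # into n distinguishable boxes with no box left empty is C(k-1, n-1).
              return comb(k - 1, n - 1)

          print(nonempty_distributions(9, 3))  # example values: 9 balls, 3 boxes -> 28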

      IMHO these concept classes from a typical AIME are so common that the scores you gave demonstrate those models are doing no "general reasoning" at all and are actually failing at approximate retrieval.

      • rosstaylor90 4 hours ago

        I disagree. Ten years ago, AIs nailing these types of competitions would have been seen as very impressive. The fact that the goalposts can move on this now shows how much AI has progressed.

        (Also, the term “approximate retrieval” is a bad one - reasoning is inherently a process of chaining together associations. What matters is whether the reasoning reaches the right conclusions. Still some way to go, but already very impressive on tasks traditionally considered bastions of human reasoning!)

  • nh23423fefe 15 hours ago

    Joke's on you: they can't even speak, so obviously your sentence is meaningless. Arguing about definitions is very fruitful!

  • perching_aix 14 hours ago

    emorning3 cannot reason, he can only say things that sound reasonable, there's a difference. Duh.

    good luck. as a reminder, there are people who, with varying degrees of certainty, think their loved ones have been replaced by actors, as well as people who think they're actually the god of the world around them, for it is just their imagination.

    • leptons 12 hours ago

      > as a reminder, there are people who, with varying degrees of certainty, think their loved ones have been replaced by actors, as well as people who think they're actually the god of the world around them, for it is just their imagination.

      None of that is any proof at all that LLMs or computers in general can reason.

      "some humans are dumb, so LLMs are smart" is not a valid argument here.

      • perching_aix 3 hours ago

        > None of that is any proof at all that LLMs or computers in general can reason.

        It was never meant to be a proof that LLMs or computers in general can reason [or rather, that they can reason in a general way]. Instead, it was a demonstration of how their argument looks when mapped onto other situations, illustrating that it isn't really bringing anything to the table in the way of proofs, evidence, definitions, or logic/arguments, nor does it enable others to do so.

        > some humans are dumb, so LLMs are smart

        That kind of reading misses the point just about completely, so it shouldn't be surprising that you can't find an argument in there.

        The point was and is that simply pointing at an AI model and saying "nah it's cappin" is awfully lacking, and is suspiciously similar to how certain people with certain mental conditions view their world. It is neither insightful nor reasonable. It's just an assertion of disbelief, following a - as you can hopefully agree - dubious logic that cannot be disproven or substantially argued against, as it was never designed to enable that in the first place.

      • fragmede 11 hours ago

        how do we test for reasoning? if A -> B and B -> C, then something that can reason could conclude A -> C. If I give A -> B and B -> C to an LLM, and ask it about the relationship between A and C, it'll tell me about the transitive property of implication, graph theory, transitivity. As for the objection that there are no qualia behind that, that it doesn't really reason or think or breathe or love: we have to go back and ask what reasoning is. There are some definitions of reasoning that LLMs can pass, and some they can't. If they're able to outperform dumb humans whom we assume do reason, why does that not mean that LLMs have some capacity to reason?
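
        For concreteness, the symbolic version of that test is just a transitive closure over implication rules; a toy sketch (not a claim about how any model does it internally):

            # Toy forward-chaining closure: keep chaining X -> Y and Y -> Z into X -> Z
            # until nothing new can be derived.
            def implication_closure(rules):
                closure = set(rules)
                changed = True
                while changed:
                    changed = False
                    for (a, b) in list(closure):
                        for (c, d) in list(closure):
                            if b == c and (a, d) not in closure:
                                closure.add((a, d))
                                changed = True
                return closure

            print(("A", "C") in implication_closure({("A", "B"), ("B", "C")}))  # True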

        • vhantz 8 hours ago

          > how do we test for reasoning? if A -> B and B -> C, then something that can reason could conclude A -> C. If I give A -> B and B -> C to an LLM, and ask it about the relationship between A and C, it'll tell me about the transitive property of implication, graph theory, transitivity.

          Not true.

          An LLM might give you that answer x% of the time, x being a number less than 100. However, any thinking person answering your question will give you the same answer, no matter how many times you ask it. That's the fundamental difference between thinking and statistically mapping and reproducing the structure of human language.

          • perching_aix 3 hours ago

            > any thinking person answering your question will give you the same answer, no matter how many times you ask it

            Oh, will they? Will they really?!

          • sharemywin 7 hours ago

            I'm pretty sure if you set the temp to 0 it will produce the exact same output every time. It's the sampling that produces the output variation.
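
            Roughly, a minimal sketch of why (temperature 0 collapses to a deterministic argmax, while any positive temperature samples from the softmax):

                import numpy as np

                def next_token(logits, temperature, rng):
                    # temperature == 0: greedy argmax, same token id on every call.
                    # temperature > 0: softmax sampling, varies from run to run.
                    if temperature == 0:
                        return int(np.argmax(logits))
                    z = logits / temperature
                    p = np.exp(z - z.max())
                    p /= p.sum()
                    return int(rng.choice(len(logits), p=p))

                logits = np.array([2.0, 1.5, 0.3])
                rng = np.random.default_rng()
                print([next_token(logits, 0.0, rng) for _ in range(5)])  # always [0, 0, 0, 0, 0]
                print([next_token(logits, 1.0, rng) for _ in range(5)])  # varies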

  • CamperBob2 12 hours ago

    My next-token predictor said you would say that next.