How do machine learning models work? And do they really “think” or “reason” the way we understand those terms? It's a philosophical question as well as a practical one, but a new paper making the rounds on Friday suggests the answer is, at least for now, a pretty clear “no.”
A group of AI research scientists at Apple posted a paper for comment on Thursday, “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” Although the deeper concepts of symbolic learning and pattern replication are a bit in the weeds, the basic idea of their work is easy to grasp.
Suppose you asked a model to solve a simple math problem like this one:
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?
Obviously, the answer is 44 + 58 + (44 * 2) = 190. Although large language models are actually a bit hit-or-miss at arithmetic, they can solve problems like this fairly reliably. But what if you add a little random extra information, like this:
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
It's the same math problem, right? And of course even a grade-schooler can tell you that a small kiwi is still a kiwi. But as it turns out, this extra data point confuses even state-of-the-art LLMs. Here's how GPT-o1-mini handled it:
…On Sunday, five of those kiwis were smaller than average. You need to subtract them from Sunday's total: 88 (Sunday kiwis) – 5 (small kiwis) = 83 kiwis
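To be concrete about the arithmetic, the added clause changes nothing about the count. Here's a minimal Python check (mine, not the paper's) contrasting the correct total with what the model effectively computed:

```python
# Kiwi counts from the two versions of the prompt above.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number he did on Friday" = 88

correct_total = friday + saturday + sunday
print(correct_total)  # 190 for both versions; small kiwis are still kiwis

# What GPT-o1-mini effectively did: subtract the five "smaller than average" kiwis.
model_total = friday + saturday + (sunday - 5)
print(model_total)  # 185, because an irrelevant detail changed the answer
```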
This is just one simple example out of the hundreds of questions the researchers lightly modified, nearly all of which led to significant drops in the success rates of the models attempting them.
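The general approach is to take familiar grade-school problems and regenerate them with different names, different numbers, and optional irrelevant clauses. As a rough sketch only, not the authors' code, here is what that kind of perturbation harness might look like (the template, the names, and the idea of wiring it to a model API are all my own illustration):

```python
import random

# A hypothetical template in the spirit of the paper's modified questions
# (an illustration only, not the authors' benchmark code).
TEMPLATE = (
    "{name} picks {a} kiwis on Friday. Then {name} picks {b} kiwis on Saturday. "
    "On Sunday, {name} picks double the number of kiwis picked on Friday."
    "{distractor} How many kiwis does {name} have?"
)
DISTRACTOR = " Five of Sunday's kiwis were a bit smaller than average."

def make_variant(with_distractor: bool) -> tuple[str, int]:
    """Build one question variant along with its ground-truth answer."""
    a, b = random.randint(10, 90), random.randint(10, 90)
    name = random.choice(["Oliver", "Mia", "Ravi"])
    question = TEMPLATE.format(
        name=name,
        a=a,
        b=b,
        distractor=DISTRACTOR if with_distractor else "",
    )
    # The distractor clause never changes the arithmetic.
    return question, a + b + 2 * a

if __name__ == "__main__":
    # Print a clean/perturbed pair; evaluating a model would mean sending each
    # question to its API and comparing the parsed answer to the second value.
    for flag in (False, True):
        question, answer = make_variant(flag)
        print(question, "->", answer)
```

Scored over many such variants, small wording changes like the distractor line are what produce the accuracy drops the researchers report.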
Image credit: Mirzadeh et al.
Now, why should this be? Why would a model that understands the problem be thrown off so easily by a random, irrelevant detail? The researchers propose that this reliable mode of failure means the models don't really understand the problem at all. Their training data allows them to respond with the correct answer in some situations, but as soon as the slightest bit of actual “reasoning” is required, such as whether to count small kiwis, they start producing weird, unintuitive results.
The researchers state in their paper:
[W]e investigate the weaknesses of mathematical reasoning in these models and demonstrate that their performance degrades significantly as the number of clauses in a question increases. We hypothesize that this decline is due to the inability of current LLMs to perform genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.
This observation is consistent with the other qualities often attributed to LLMs because of their facility with language. When, statistically, the phrase “I love you” is followed by “I love you too,” an LLM can easily repeat that, but it doesn't mean it loves you. And although it can follow complex chains of reasoning it has been exposed to before, the fact that this chain can be broken by even superficial deviations suggests that it doesn't actually reason so much as replicate patterns it has observed in its training data.
One of the co-authors, Mehrdad Farajtabar, explains the paper very well in this thread on X.
OpenAI researchers praised Mirzadeh et al.'s work but disputed their conclusions, saying that all of these failures could likely be corrected with a bit of prompt engineering. Farajtabar (replying with the typical but admirable friendliness researchers tend to employ) pointed out that while better prompting may work for simple deviations, the models may require exponentially more contextual data to counter complex distractions, the kind that, again, a child could trivially point out.
Does this mean that LLMs don't reason? Maybe. That they can't reason? No one knows. These are not well-defined concepts, and the questions tend to come up at the bleeding edge of AI research, where the state of the art changes daily. Perhaps LLMs do “reason,” but in a way we don't yet recognize or know how to control.
It makes for a fascinating frontier in research, but it's also a cautionary tale about how AI is being sold. Can it really do the things its makers claim, and if it does, how? As AI becomes an everyday software tool, these kinds of questions are no longer merely academic.