A smarter way for big language models to think about difficult problems | MIT News

December 20, 2025

To make large language models (LLMs) more accurate when answering more difficult questions, researchers can let the model spend more time thinking about potential solutions.

But common approaches that give this capability to LLMs set a fixed computational budget for each problem, regardless of its complexity. This means that the LLM might waste computing resources on simpler questions or be unable to solve complex problems that require more reasoning.

To solve this problem, MIT researchers developed a smarter way to distribute computational effort as the LLM solves a problem. Their method allows the model to dynamically adjust its computational budget based on the difficulty of the question and the probability that each partial solution leads to the correct answer.
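
To make the contrast concrete, here is a minimal Python sketch of the two budgeting strategies. Everything in it is illustrative rather than taken from the paper: `generate_candidate` and `prm_score` are hypothetical stand-ins for sampling a solution from an LLM and scoring it, and the early-stopping rule is just one simple way an adaptive budget could work.

```python
import random

def generate_candidate(question: str) -> str:
    """Stand-in for sampling one solution attempt from an LLM."""
    return f"candidate answer to {question!r} ({random.random():.3f})"

def prm_score(question: str, candidate: str) -> float:
    """Stand-in for a reward model's estimated probability of success."""
    return random.random()

def fixed_budget_solve(question: str, budget: int = 16) -> str:
    """Fixed budget: always spends `budget` samples, easy or hard."""
    candidates = [generate_candidate(question) for _ in range(budget)]
    return max(candidates, key=lambda c: prm_score(question, c))

def adaptive_solve(question: str, max_budget: int = 16,
                   confidence: float = 0.9) -> str:
    """Adaptive budget: stop sampling once a candidate looks good enough,
    so easy questions consume only a few samples."""
    best, best_score = "", -1.0
    for _ in range(max_budget):
        candidate = generate_candidate(question)
        score = prm_score(question, candidate)
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= confidence:  # confident enough: stop early
            break
    return best
```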

The researchers found that their new approach allowed LLMs to use about half the computation of existing methods, while achieving comparable accuracy across questions of varying difficulty. Additionally, their method allows smaller, less resource-intensive LLMs to perform as well as or better than larger models on complex problems.

By improving the reliability and efficiency of LLMs, particularly when tackling complex reasoning tasks, this technique could reduce the energy consumption of generative AI systems and enable the use of LLMs in higher-stakes, more time-sensitive applications.

“The computational cost of inference has quickly become a major bottleneck for frontier model providers, and they are actively looking for ways to improve computational efficiency per user query. For example, the recent GPT-5.1 release highlights the effectiveness of the ‘adaptive reasoning’ approach proposed in our paper. By equipping models with the ability to know what they don’t know, we can enable them to devote more computation to the most difficult problems and the most promising solutions, and to spend far fewer tokens on easy ones. This makes reasoning both more reliable and much more efficient,” says Navid Azizan, the Alfred H. and Jean M. Hayes Career Development Assistant Professor in the Department of Mechanical Engineering and the Institute for Data, Systems, and Society (IDSS), a principal investigator in the Laboratory for Information and Decision Systems (LIDS), and senior author of a paper on this technique.

Azizan is joined on the paper by lead author Young-Jin Park, a graduate student in LIDS and the Department of Mechanical Engineering; Kristjan Greenewald, a research scientist at the MIT-IBM Watson AI Lab; Kaveh Alim, a graduate student in IDSS; and Hao Wang, a research scientist at the MIT-IBM Watson AI Lab and the Red Hat AI Innovation team. The research is being presented this week at the Conference on Neural Information Processing Systems.

Compute for contemplation

A recent approach called inference-time scaling allows a large language model to take more time to reason about difficult problems.

With inference-time scaling, the LLM can generate several solution attempts at once, or explore different reasoning paths, and then choose the best of these candidates to pursue.

A separate model, called a process reward model (PRM), scores each potential solution or reasoning path. The LLM uses these scores to identify the most promising ones.
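
The article doesn’t specify how per-step PRM scores are combined, but a common convention in the literature is to treat a reasoning path as only as strong as its weakest step. The hypothetical sketch below assumes a `prm` callable that scores a (question, partial reasoning) pair, and uses the minimum prefix score to rank candidate paths.

```python
from typing import Callable, List

# A reasoning path is a list of step strings; `prm` maps (question,
# steps so far) to an estimated probability of reaching a correct answer.
PRM = Callable[[str, List[str]], float]

def path_score(question: str, steps: List[str], prm: PRM) -> float:
    """Score every prefix of the path and take the minimum, so a single
    weak step drags down the whole path."""
    scores = [prm(question, steps[:i + 1]) for i in range(len(steps))]
    return min(scores) if scores else 0.0

def select_most_promising(question: str, paths: List[List[str]],
                          prm: PRM) -> List[str]:
    """Rank candidate reasoning paths by PRM score and keep the best one."""
    return max(paths, key=lambda p: path_score(question, p, prm))
```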

Typical inference-time scaling approaches give the LLM a fixed amount of computation to decompose the problem and reason through its steps.

Instead, the researchers’ method, known as instance-adaptive scaling, dynamically adjusts the number of potential solutions or reasoning steps based on their probability of success as the model works through the problem.

“This is how humans solve problems. We find partial solutions, then decide: Should I go further with one of them, or stop and revise, or even go back to my previous step and continue solving the problem from there?” Wang explains.

To do this, the framework uses the PRM to estimate the difficulty of the question, which helps the LLM decide how much computation to budget for generating and reasoning about potential solutions.

At each step of the model’s reasoning process, the PRM examines the question and the partial answers, and estimates how likely each is to lead to the correct solution. If the LLM is more confident, it can reduce the number of potential solutions or reasoning paths it pursues, saving computational resources.
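
One plausible way to realize this stepwise behavior is a beam search whose width shrinks as PRM confidence grows. The sketch below is a guess at such a mechanism under that assumption, not the paper’s algorithm; `expand` and `prm` are hypothetical stand-ins for the LLM’s step generator and the reward model.

```python
from typing import Callable, List

def adaptive_beam_step(question: str,
                       beams: List[List[str]],
                       expand: Callable[[str, List[str]], List[List[str]]],
                       prm: Callable[[str, List[str]], float],
                       max_width: int = 8,
                       min_width: int = 1) -> List[List[str]]:
    """One reasoning step whose beam width adapts to PRM confidence:
    wide while the problem still looks hard, narrow once a partial
    solution looks very likely to succeed."""
    # Propose continuations of every surviving partial solution.
    candidates = [c for b in beams for c in expand(question, b)]
    # Score and sort them, best first.
    scored = sorted(candidates, key=lambda c: prm(question, c), reverse=True)
    if not scored:
        return []
    # Shrink the beam in proportion to the best candidate's confidence.
    top_score = prm(question, scored[0])
    width = max(min_width, round(max_width * (1.0 - top_score)))
    return scored[:width]
```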

But the researchers found that existing PRMs often overestimate the model’s probability of success.

Overcoming Overconfidence

“If we were to simply trust current PRMs, which often overestimate the chance of success, our system would reduce the computational budget too aggressively. So we first had to find a way to better calibrate the PRMs to make inference-time scaling more efficient and reliable,” says Park.

The researchers introduced a calibration method that allows PRMs to generate a range of probability scores rather than a single value. In this way, the PRM creates more reliable uncertainty estimates that better reflect the true probability of success.

With a well-calibrated PRM, their instance-adaptive scaling framework can use these probability scores to cut computation while maintaining the accuracy of the model’s outputs.
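
The calibration procedure itself isn’t described in the article. One simple way to obtain a range of scores instead of a point estimate is to query several independently trained PRM heads and read off low and high percentiles; the sketch below assumes such an ensemble as a hypothetical stand-in for the paper’s method, and counters overconfidence by cutting the budget only when even the conservative lower bound clears the threshold.

```python
from typing import Callable, List, Sequence, Tuple

PRM = Callable[[str, List[str]], float]

def prm_interval(question: str, partial: List[str],
                 ensemble: Sequence[PRM]) -> Tuple[float, float]:
    """Return a (low, high) success-probability range by taking roughly
    the 10th and 90th percentiles over an ensemble of PRM scorers."""
    scores = sorted(p(question, partial) for p in ensemble)
    n = len(scores)
    return scores[int(0.1 * (n - 1))], scores[int(0.9 * (n - 1))]

def confident_enough_to_prune(question: str, partial: List[str],
                              ensemble: Sequence[PRM],
                              threshold: float = 0.8) -> bool:
    """Cut compute only when even the pessimistic lower bound clears
    the threshold, guarding against PRM overconfidence."""
    low, _ = prm_interval(question, partial, ensemble)
    return low >= threshold
```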

When the researchers compared their method to standard inference-time scaling approaches on a series of mathematical reasoning tasks, it solved each problem with less computation while achieving similar accuracy.

“The beauty of our approach is that this adaptation happens on the fly, as the problem is solved, rather than happening all at once at the beginning of the process,” says Greenewald.

In the future, the researchers want to apply this technique to other applications, such as code generation and AI agents. They also plan to explore other uses of their PRM calibration method, such as in reinforcement learning and fine-tuning.

“Human employees learn on the job – some CEOs even started as interns – but today’s agents remain largely static probabilistic software. Work like this is an important step in changing that: helping agents understand what they don’t know and building mechanisms for continuous self-improvement. These capabilities are essential if we want agents to operate safely, adapt to new situations, and deliver consistent results at scale,” says Akash Srivastava, Director and Chief Architect of Core AI at IBM Software, who was not involved in this work.

This work was funded in part by the MIT-IBM Watson AI Lab, the MIT-Amazon Science Hub, the MIT-Google Program for Computing Innovation, and MathWorks.
