Researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models
![Researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models](https://i0.wp.com/techcrunch.com/wp-content/uploads/2025/01/GettyImages-1287582736.jpg?resize=780%2C470&ssl=1)
Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. While the brainteasers are written to be solvable without too much foreknowledge, they’re usually challenging even for skilled contestants.
That’s why some experts think they’re a promising way to probe the limits of AI’s problem-solving abilities.
In new research, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, and Northeastern University, among others, built an AI benchmark using riddles from Sunday Puzzle episodes. The team says its test uncovered surprising insights, such as that so-called reasoning models (OpenAI’s o1, among others) sometimes “give up” and provide answers they know aren’t correct.
“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors of the study, told TechCrunch.
The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks, even benchmarks released relatively recently, are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased such that models can’t draw on “rote memory” to solve them, explained Guha.
“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it — that’s when everything clicks together all at once,” said Guha. “That requires a combination of insight and a process of elimination.”
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, models may have been trained on them and can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.
“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”
On the researchers’ benchmark, which consists of about 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperformed the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take longer to arrive at solutions, typically seconds to minutes longer.
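To make the evaluation setup concrete, here is a minimal sketch of how a puzzle benchmark like this could be scored with simple exact-match grading. The sample puzzles and the stand-in answer function are hypothetical illustrations, not the researchers’ actual data or harness, which the article doesn’t detail.

```python
def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Frustrated!' matches 'frustrated'."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ").strip()

def score(puzzles, answer_fn):
    """Return the fraction of (question, gold answer) pairs answered correctly."""
    correct = 0
    for question, gold in puzzles:
        if normalize(answer_fn(question)) == normalize(gold):
            correct += 1
    return correct / len(puzzles)

# Toy example: two made-up riddles and a "model" that always says "silent".
puzzles = [
    ("Rearrange the letters of 'listen' to get another word.", "silent"),
    ("Name a five-letter word that means 'not loud'.", "quiet"),
]
print(score(puzzles, lambda q: "silent"))  # 0.5 on this toy set
```

A real harness would call a model API instead of a lambda and would need fuzzier answer matching, but the core loop, asking each question and comparing against a gold answer, stays the same.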
At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random — behavior this human can certainly relate to.
The models make other bizarre choices, like giving a wrong answer only to retract it immediately, attempting to tease out a better one, and failing again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.
“On hard problems, R1 literally says that it’s getting ‘frustrated,’” said Guha. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”
The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.
![NPR Benchmark](https://i0.wp.com/techcrunch.com/wp-content/uploads/2025/02/Screenshot-2025-02-06-at-12.31.38AM.png?resize=780%2C537&ssl=1)
“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” said Guha. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren’t, capable of.”