Let’s get to know the benchmarks AI companies use to compare each other’s models.
https://betterbench.stanford.edu is a repository of AI benchmark assessments for informed benchmark selection through quality evaluation and best practice analysis.
Language:
General safety:
Reasoning:
Science:
Math benchmarks:
Coding:
https://github.com/swe-bench/sb-cli/ provides the CLI to run the benchmarks.
The benchmark involves giving agents a code repository and issue description, and challenging them to generate a patch that resolves the problem described by the issue.
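As I understand the SWE-bench tooling, a run produces a predictions file of generated patches that sb-cli then evaluates; here is a hedged sketch of writing such a file (the field names below follow the SWE-bench docs as I recall them; verify against the sb-cli README):

```python
import json

# One record per benchmark task: which instance it answers, which model
# produced it, and the unified-diff patch the model generated.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",  # example task id; check the dataset
        "model_name_or_path": "my-model",          # hypothetical model label
        "model_patch": "diff --git a/... b/...\n", # truncated patch, for illustration only
    }
]

with open("predictions.json", "w", encoding="utf-8") as fh:
    json.dump(predictions, fh)
# Then evaluate or submit with sb-cli (see its README for the exact arguments).
```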
VIDEO: https://www.youtube.com/watch?v=aOjgPJ94-aM on the Hugging Face Accelerate library.
In June 2024, Leopold Aschenbrenner wrote his situational-awareness.ai blog with this illustration:
By 2025, AI companies have ingested nearly all of the information created by humans, so the next frontier is the generation of new information. New benchmark problems need to be defined because, in 2025, evaluations of how close a particular offering is to AGI (Artificial General Intelligence) are based on a relatively small number of challenges.
PROTIP: Enhance your resume: make YouTube videos about solving the problems. Suggest new problems. Find issues with the autograder. File a PR and get listed as a contributor.
ASI (Artificial Super Intelligence) will be reached when “proto-automated” researchers automate research (within massive AI datacenters).
When xAI unveiled its Grok-3 LLM on Feb 18, 2025, one analysis shows it ranking #1 across the various benchmarks (including Creative Writing, Instruction Following, etc.):
This table compares specific scores on specific benchmarks:
That prompted some to complain that xAI neglected to include OpenAI’s December o3 results in the comparison:
The above is from Anthropic’s Claude 3.7 Sonnet announcement on Feb 25, 2025. Dubbed the first hybrid AI reasoning model, it lets users control how long it “thinks” before responding, either delivering real-time answers or taking extra time to produce more complex, well-reasoned responses. It’s available to premium Claude users. Pricing: $3 per million input tokens and $15 per million output tokens, higher than OpenAI’s o3-mini.
Anthropic is also launching Claude Code, a tool that lets developers run AI-driven code edits directly from their terminal, analyze projects, and even push changes to GitHub.
There are several mathematics competitions: AIME, HMMT, Mandelbrot, ARML.
Introduced 16 Dec 2023 on arXiv (by Ruixiang Cui while working at Microsoft and during his PhD at the University of Copenhagen) as “A Human-Centric Benchmark for Evaluating Foundation Models” for AGI (Artificial General Intelligence) development.
AGIEval is called “human-centric” because its prompts are based on 20 official, public, high-standard admission and qualification exams taken by humans, such as the SAT, law school admission (LSAT) tests, and math competitions.
As of March 2025, the v1.1 version of the leaderboard for AGIEval shows:
An example is at https://github.com/ruixiangcui/AGIEval. JSONL (JSON Lines) is a lightweight, text-based data format for storing structured data records, where each line in the file is a valid JSON object. This format is particularly useful for handling large datasets efficiently, as it allows line-by-line processing without loading the entire file into memory.
Shown below: https://github.com/ruixiangcui/AGIEval/blob/main/data/v1_1/math.jsonl
{"passage": null, "question": "Let $\\lambda$ be a constant, $0 \\le \\lambda \\le 4,$ and let $f : [0,1] \\to [0,1]$ be defined by\n\\[f(x) = \\lambda x(1 - x).\\]Find the values of $\\lambda,$ $0 \\le \\lambda \\le 4,$ for which there exists an $x \\in [0,1]$ such that $f(x) \\neq x$ but $f(f(x)) = x.$", "options": null, "label": null, "answer": "(3,4]", "other": {"solution": "We have that\n\\[f(f(x)) = f(\\lambda x(1 - x)) = \\lambda \\cdot \\lambda x(1 - x) (1 - \\lambda x(1 - x)),\\]so we want to solve $\\lambda \\cdot \\lambda x(1 - x) (1 - \\lambda x(1 - x)) = x.$\n\nNote that if $f(x) = x,$ then $f(f(x)) = f(x) = x,$ so any roots of $\\lambda x(1 - x) = x$ will also be roots of $\\lambda \\cdot \\lambda x(1 - x) (1 - \\lambda x(1 - x)) = x.$ Thus, we should expect $\\lambda x(1 - x) - x$ to be a factor of $\\lambda \\cdot \\lambda x(1 - x) (1 - \\lambda x(1 - x)) - x.$ Indeed,\n\\[\\lambda \\cdot \\lambda x(1 - x) (1 - \\lambda x(1 - x)) - x = (\\lambda x(1 - x) - x)(\\lambda^2 x^2 - (\\lambda^2 + \\lambda) x + \\lambda + 1).\\]The discriminant of $\\lambda^2 x^2 - (\\lambda^2 + \\lambda) x + \\lambda + 1$ is\n\\[(\\lambda^2 + \\lambda)^2 - 4 \\lambda^2 (\\lambda + 1) = \\lambda^4 - 2 \\lambda^3 - 3 \\lambda^2 = \\lambda^2 (\\lambda + 1)(\\lambda - 3).\\]This is nonnegative when $\\lambda = 0$ or $3 \\le \\lambda \\le 4.$\n\nIf $\\lambda = 0,$ then $f(x) = 0$ for all $x \\in [0,1].$\n\nIf $\\lambda = 3,$ then the equation $f(f(x)) = x$ becomes\n\\[(3x(1 - x) - x)(9x^2 - 12x + 4) = 0.\\]The roots of $9x^2 - 12x + 4 = 0$ are both $\\frac{2}{3},$ which satisfy $f(x) = x.$\n\nOn the other hand, for $\\lambda > 3,$ the roots of $\\lambda x(1 - x) = x$ are $x = 0$ and $x = \\frac{\\lambda - 1}{\\lambda}.$ Clearly $x = 0$ is not a root of $\\lambda^2 x^2 - (\\lambda^2 + \\lambda) x + \\lambda + 1 = 0.$ Also, if $x = \\frac{\\lambda - 1}{\\lambda},$ then\n\\[\\lambda^2 x^2 - (\\lambda^2 + \\lambda) x + \\lambda + 1 = \\lambda^2 \\left( \\frac{\\lambda - 1}{\\lambda} \\right)^2 - (\\lambda^2 + \\lambda) \\cdot \\frac{\\lambda - 1}{\\lambda} + \\lambda + 1 = 3 - \\lambda \\neq 0.\\]Furthermore, the product of the roots is $\\frac{\\lambda + 1}{\\lambda^2},$ which is positive, so either both roots are positive or both roots are negative. Since the sum of the roots is $\\frac{\\lambda^2 + \\lambda}{\\lambda^2} > 0,$ both roots are positive. Also,\n\\[\\frac{\\lambda^2 + \\lambda}{\\lambda} = 1 + \\frac{1}{\\lambda} < \\frac{4}{3},\\]so at least one root must be less than 1.\n\nTherefore, the set of $\\lambda$ that satisfy the given condition is $\\lambda \\in \\boxed{(3,4]}.$", "level": 5, "type": "Intermediate Algebra"}}
TODO: Utility to display jsonl files for human consumption.
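Pending such a utility, here is a minimal sketch using only the Python standard library (the file name and record limit are arbitrary choices):

```python
import json
import sys

def pretty_print_jsonl(path, max_records=None):
    """Read a .jsonl file line by line and pretty-print each record."""
    with open(path, "r", encoding="utf-8") as fh:
        for i, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            print(f"--- record {i} ---")
            print(json.dumps(record, indent=2, ensure_ascii=False))
            if max_records and i >= max_records:
                break

if __name__ == "__main__":
    # Example: python jsonl_view.py math.jsonl
    pretty_print_jsonl(sys.argv[1], max_records=3)
```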
Alternatively, upload a .json file and download it as .jsonl using online converters such as Code Beautify and Konbert.com.
The American Invitational Mathematics Examination (AIME) is administered by the Mathematical Association of America each year as the second exam in the series used to challenge high-school mathletes competing to represent the US at the International Mathematical Olympiad (IMO). The series draws over 300,000 students in 50 states and over 30 countries.
Students are invited to take the AIME based on their scores on the AMC 10 (for students in grade 10 or below) and AMC 12 (for students in grade 12 or below), offered each November.
The questions test knowledge of algebra, geometry, counting and probability, and number theory. Both exams cover material typically taught in the first few years of high school math. Topics such as trigonometry, complex numbers, and logarithms are needed only for the AMC 12. Calculus is not required for either exam. Challenges include fundamentals such as the Pigeonhole Principle, mathematical induction, inequalities, Diophantine equations, and functional equations.
All answers are a single integer between 0 and 999. Click the “Solution” link for explanations.
In 2025, the AIME was held February 6th, with problems and answers published immediately afterwards on various YouTube channels, forums, and blogs:
BLOG: Annie Cushing (author of Making Data Sexy), notes that “The MathArena team … worked against the clock to run evaluations using the … problems before models could start training on it.” because the challenging math problems “makes for an excellent benchmark to see how well these models reason through more complex problems, with less opportunity to get the answer correct by chance since the test isn’t multiple choice like many benchmarks.”
For use by AI, the LaTeX source for the first of 15 problems in AIME 2025 II is at:
https://github.com/eth-sri/matharena/blob/main/data/aime/aime_2025_II/problems/1.tex
QUESTION: How to print properly formatted LaTeX (.tex) files to the terminal (e.g., cat ???.tex)?
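One partial answer: fetch and print the raw LaTeX source directly. A minimal sketch, assuming the raw.githubusercontent.com path mirrors the GitHub URL above:

```python
import urllib.request

# Assumed raw-file mirror of the GitHub URL shown above:
RAW_URL = ("https://raw.githubusercontent.com/eth-sri/matharena/main/"
           "data/aime/aime_2025_II/problems/1.tex")

with urllib.request.urlopen(RAW_URL) as resp:
    tex_source = resp.read().decode("utf-8")

print(tex_source)  # raw LaTeX; render with pdflatex or a MathJax-capable viewer
```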
https://matharena.ai publishes how well various LLMs reason about mathematics challenges, in terms of accuracy and cost of compute.
As of Feb 20, 2025:
Each green box indicates the model solved the problem in more than 75% of 4 runs (repeated passes with the same prompt); red boxes indicate the problem was solved in less than 25% of passes; yellow boxes indicate the problem was solved 25-75% of the time.
Stats: 150 is the highest score.
The HMMT (Harvard-MIT Mathematics Tournament, at hmmt.org) is a math competition founded in 1998 by students at Harvard, MIT, and schools near Boston, Massachusetts. It remains organized by students.
Each tournament draws close to 1000 students from around the globe.
WIKIPEDIA: The HMMT February tournament is generally considered to be more difficult than the American Invitational Mathematics Examination (AIME). However, difficulty varies by tournament and by round.
The top 50 scorers in the February tournament are invited to compete in the HMIC (Harvard MIT Invitational Competition), a five question proof contest.
The November tournament is easier than the February tournament, with problems similar to the AMC 10 and 12, and the AIME.
Calculus is not required for most of the problems, but it may be needed to solve some of the more difficult problems.
HMMT hosts staff exchange programs with the Princeton University Mathematics Competition (PUMaC), Carnegie Mellon Informatics and Mathematics Competition (CMIMC), and Stanford Math Tournament (SMT) to further collaboration between the competitions’ organizers. During exchanges, participants ranging from first-year members to more senior officers spend the weekend proctoring, grading, and otherwise volunteering at the host competition day-of.
GPQA (Google-Proof Q&A) is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
“Google Proof” means that the answer is not discoverable by a query on Google.com (or Perplexity.ai). The answer requires “reasoning” through several intermediate queries to a panel of “experts”.
PDF of https://arxiv.org/abs/2311.12022 says “We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are “Google-proof”).
“The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.”
The GPQA Leaderboard at https://klu.ai/glossary/gpqa-eval
The GPQA Dataset at https://github.com/idavidrein/gpqa was created by David Rein while he was a researcher at New York University; he is now part of METR / FAR.AI.
https://sofworld.org/pattern-questions-and-marking-scheme
GAIA: Real-World AI Assistant Assessment
GAIA (General AI Assistant Benchmark) evaluates AI systems on practical, real-world tasks that encompass reasoning, multi-modal processing, web browsing, and tool utilization. Despite being conceptually simple for humans, who achieve 92% accuracy, GAIA poses significant challenges for AI, with GPT-4 (with plugins) scoring only 15%. This stark performance gap underscores GAIA’s effectiveness in benchmarking AI systems’ robustness and adaptability across diverse, everyday scenarios, emphasizing the need for AI to match or exceed average human performance on practical tasks.
BASIS: Frontier of Scientific AI Capabilities
BASIS (Benchmark for Advanced Scientific Inquiry Systems) pushes the boundaries of AI evaluation in scientific domains, surpassing even GPQA in complexity. Tailored for assessing AI systems expected to perform at or beyond human expert level, BASIS focuses on tasks demanding advanced scientific inquiry and reasoning. This benchmark is crucial for developing and evaluating AI systems capable of contributing meaningfully to cutting-edge scientific research and problem-solving, potentially accelerating breakthroughs across various scientific disciplines.
DeepSeek-Coder at https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base
Over time, a larger fraction of difficult problems is introduced as model capabilities improve, so a drop in measured performance in the later months is expected.
https://livecodebench.github.io says LCB (LiveCodeBench) collects problems from periodic contests on platforms such as Codeforces (which ranks contestants by ELO rating).
VIDEO: Build a game using ChatGPT o3-mini. o3-mini achieved a perfect 10/10 on pylint for a Hangman game project.
LCB uses these problems to construct a holistic benchmark for evaluating Code LLMs across a variety of code-related scenarios, continuously over time.
The LiveCodeBench runner is maintained by Naman Jain, a CS Ph.D. at UC Berkeley, and Shangdian (King) Han, who lives in Berkeley, California and was previously at Microsoft Research.
OpenAI o3 scored among the top 10 contestants in Codeforces.com competitive programming, solving complex problems under time constraints (2.5 hours).
LiveCodeBench publishes four leaderboards. Each leaderboard provides a time slider. As of this writing:
Code Execution: 479 problems
Models submitted for evaluation are at https://github.com/LiveCodeBench/submissions
For a more nuanced evaluation of LLM performance across different difficulty levels, scores are broken out by metric:
“Pass@1” measures the percentage of problems a model can solve correctly on its first attempt across all difficulty levels.
“Easy Pass@1” refers to the Pass@1 performance on problems categorized as “Easy”.
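Pass@1 is the k = 1 case of the unbiased pass@k estimator popularized by the HumanEval paper (Chen et al., 2021). A minimal sketch of that estimator (my own illustration, not LiveCodeBench’s code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total generated samples per problem
    c: number of samples that passed the tests
    k: samples the user is allowed to try
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples generated, 2 passed the tests.
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
print(round(pass_at_k(10, 2, 5), 3))  # ~0.778
```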
See https://openreview.net/forum?id=chfJJYC3iL
At https://github.com/LiveCodeBench/LiveCodeBench/blob/main/lcb_runner/prompts/code_execution.py, prompts begin with this system message:
system_message = "You are an expert at Python programming, code execution, test case generation, and fuzzing."
``
You are given a Python function and an assertion containing an input to the function. Complete the assertion with a literal (no unsimplified expressions, no function calls) containing the output when executing the provided code on the given input, even if the function is incorrect or incomplete. Do NOT output any extra information. Execute the program step by step before arriving at an answer, and provide the full assertion with the correct output in [ANSWER] and [/ANSWER] tags, following the examples. ```
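A hypothetical illustration (my own, not one of LiveCodeBench’s few-shot examples) of the kind of completion that instruction asks for:

```python
# Function and assertion stub given to the model:
def rotate(s, n):
    return s[n:] + s[:n]

# assert rotate("benchmark", 2) == ??
#
# Expected model behavior: trace the execution ("benchmark"[2:] is "nchmark",
# "benchmark"[:2] is "be") and answer with a literal inside the tags:
# [ANSWER]assert rotate("benchmark", 2) == "nchmarkbe"[/ANSWER]
```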
GPQA’s dataset of 448 multiple-choice questions is in the password-protected 2.2 MB dataset.zip file at https://github.com/idavidrein/gpqa/blob/main/dataset.zip
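A minimal sketch of reading that archive with Python’s standard zipfile module. The placeholder password and CSV handling are assumptions (the actual password is documented in the repository’s README, and this assumes classic ZipCrypto encryption, which zipfile supports):

```python
import csv
import io
import zipfile

PASSWORD = b"<password from the GPQA README>"  # hypothetical placeholder

with zipfile.ZipFile("dataset.zip") as zf:
    print(zf.namelist())  # listing the members does not require the password
    # Open the first CSV member (exact file names are an assumption; check namelist()):
    first_csv = next(name for name in zf.namelist() if name.endswith(".csv"))
    with zf.open(first_csv, pwd=PASSWORD) as fh:
        reader = csv.DictReader(io.TextIOWrapper(fh, encoding="utf-8"))
        for i, row in enumerate(reader):
            if i >= 3:
                break
            print({k: str(v)[:60] for k, v in list(row.items())[:3]})  # peek at a few columns
```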
VIDEO: Here’s a question about the use of hourglasses that Grok 3 cannot solve within 3 minutes.
VIDEO: “Write a chess engine using the UCI (Universal Chess Interface) protocol”
SWE-bench (Software Engineering Benchmark) at https://www.swebench.com has become the standard way to compare LLM offerings on their ability to automatically solve GitHub issues, using a dataset containing 2,294 Issue-Pull Request pairs from 12 popular Python repositories:
The 10 Oct 2023 arXiv article describes the unit-test verification, which uses post-PR behavior as the reference solution.
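A minimal sketch of inspecting the benchmark tasks, assuming the Hugging Face datasets library and the princeton-nlp/SWE-bench dataset name under which the benchmark is commonly distributed:

```python
from datasets import load_dataset  # pip install datasets

# The test split holds the issue/PR task instances.
swebench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swebench))

task = swebench[0]
print(task["instance_id"])              # task identifier, e.g. "<org>__<repo>-<PR number>"
print(task["repo"])                     # source repository
print(task["problem_statement"][:300])  # the GitHub issue text given to the agent
```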
Those working on it include carlosej@princeton.edu and johnby@stanford.edu
Coming soon.
Coming soon.
https://www.youtube.com/watch?v=a6bPt9oyoa8&t=1m32s “retirement will come for most people sooner than they think”. Brundage: Ex OpenAI Employee Gives Warning About The Economy TheAIGRID
https://www.youtube.com/watch?v=REjFL9hkkL4 Anthropic’s Chilling 18-Month Warning: AI Apocalypse in 18 Months TheAIGRID
https://www.youtube.com/watch?v=379s4W_EaTk
https://www.youtube.com/watch?v=379s4W_EaTk&t=9m3s LLM Engineer’s Handbook (from Packt) by Paul Iusztin.
OmniParser https://microsoft.github.io/OmniParser/
https://www.youtube.com/watch?v=kkZ4-xY7oyU&t=2m11s PersonaQA for Hallucination Evaluation
COMET, BLEU, and CHRF are widely used metrics for evaluating machine translation (MT) quality.
https://www.perplexity.ai/search/what-is-the-comet-score-for-tr-9RkzS6rsRr6R9oyBwYZvag
Which metric to use depends on what you are trying to achieve.
Generation tasks are measured using mean squared error (MSE).
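For BLEU and chrF specifically, the sacrebleu Python package is a common choice; a minimal sketch assuming it is installed (COMET requires a separate neural model via the unbabel-comet package and is omitted here):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["The cat sat on the mat."]           # MT system output, one string per segment
references = [["The cat is sitting on the mat."]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)

print(f"BLEU: {bleu.score:.1f}")  # n-gram precision with a brevity penalty
print(f"chrF: {chrf.score:.1f}")  # character n-gram F-score
```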