AIExplained

'Humanity's Last Exam' - I Doubt It

Less than 24 hours ago we got the claim that a multi-million dollar 'final test' for AI was being put together. But I ask questions about what it will achieve, drawing on evidence from 3 papers, Simple Bench, and my own analysis. Hopefully, this video will show you why 'o1 = AGI' claims leave a lot to be desired.

Link for Offline Viewing/Download: https://drive.google.com/file/d/1YZ7jsU_CAV3Rf9b7KvzjgENaTEPuEkCR/view?usp=drive_link

Reuters Headline: https://www.reuters.com/technology/artificial-intelligence/ai-experts-ready-humanitys-last-exam-stump-powerful-tech-2024-09-16/?s=09

Final Benchmark Page: https://agi.safe.ai/submit

ARC Leaderboard: https://arcprize.org/leaderboard

https://x.com/mikeknoop/status/1835091832398815235

Human Performance on ARC, studied: https://arxiv.org/pdf/2409.01374

Safety Statement, CAIS: https://www.safe.ai/work/statement-on-ai-risk

LLMs Can't Quite Get the Joke: https://arxiv.org/pdf/2406.10522

Math Traps Paper: https://arxiv.org/pdf/2405.06680

MMLU Errors (1 year ago): https://www.youtube.com/watch?v=hVade_8H8mE

Comments

Having spent 12 years at CERN, I can say that a CERN for AI would be an incredible place to be. It would likely attract (or would have attracted) a lot of talent, including from the big labs (as CERN did in physics), especially among those who I believe might feel betrayed over the openness vision. That said... I also believe it is way too late, unfortunately. Getting member states to agree on setup, budget, and locations, and to start the hires... would take years. At this point I just hope that the US wins the race, tbh.

Robert Gomez-Reino

Hi Phil, just pondering whether improved tool usage during training could be a game changer. Just as adding reasoning steps improved o1, one could add steps that allow the model to apply fact checks (e.g. temporal, common sense, social) or use specialized algorithms (block world). It seems like a way to open up another scaling dimension: improved tool usage during inference. I cannot see a fundamental difference between training for reasoning steps and training to apply tools.

Peter Biela

Give any of these models a Rubik's Cube configuration to solve and they will spit out complete, utter BS. Moreover, give them a sides configuration and ask them to solve it: they also fail. Ask them to write and run code to solve it, and they spit out something they don’t understand and can’t explain. Doesn’t that tell you guys anything?! They can almost understand and achieve a solution better than a human, but they are lossy integrated and brittle.

Youssef Mohamed

Thanks this is exactly what I needed

Gilad

Great idea

Philip

There were some lower-hanging 'tricks' in Simple, but in many categories it falls flat. On Orion 1, I honestly don't know, but your figures don't sound impossible. I still think Orion with o1 would score less than humans though, unless a whole new paradigm is unveiled (think pixel by pixel?)

Philip

Something we will be testing at some point. But contrary to popular belief, you can actually tell the model 'this is a trick' and the needle moves much less on performance than people think, and on some questions it moves backwards.

Philip

Try this! https://support.patreon.com/hc/en-us/articles/212052266-Getting-Discord-access

Philip

Guys I can't seem to find a link to the discord

Gilad

Yeah, I didn't even think about the Blackwell GPUs, although the model does get bigger proportionally with GPU power, so I'm not entirely confident about increased inference speed. Did he say that the inference capability of the Blackwell GPUs would receive a much greater jump in power compared to the performance jump in training?

Gilad

Yah, and by then, we could also see more baseline reasoning chains given better distillation and more Blackwell GPUs used for inference vs H100s (Jensen said they're 50x better at inference for o1 use cases). Also, we could see a full menu of o2-minis specialized for different verticals with a router picking the best one per prompt or per reasoning chain (o1-mini is only specialized for STEM). I wouldn't be surprised if simplebench is basically saturated by summer.

Brian Crabtree

I wonder if anyone has tried fine-tuning a model on trick questions? What implication would that have for these tests?

Barnaby Golden

Don't you think that an o1-type system, based on Orion would be pretty close to 80 in simple bench? If o1 preview is ~45, full o1 might be ~50-60, next gen (o2?) might be >70?

Gilad

I absolutely agree with you that any test which can maximize the delta between the average human and these models is the optimal test we should be using to discern if the model is reasoning or merely memorizing and parroting back answers. Great insight. This is why I subscribed and I really hope this type of thinking becomes the dominant conversation in the world of model evaluation.

Joshua Davis

Let's say you want to test the model with a question it can't visualize. Potentially the only solution would be to run a Sora prompt in the backend with the scenario from the question, play it out, and use what happened to answer. Obviously Sora 1.0 won't be sufficient if you are asking what kinds of sounds, sensations, smells, or tastes arise from a scenario, so you really need a complete world model. Is this the only true way to answer at least a large class of questions that are technically in text form but refer to unlearned concepts? I suppose this would be taking the maximizing of test-time compute (à la o1) to its logical conclusion.

John Merkowsky

I do not look at benchmarks - I test models firsthand for my real use cases and compare outputs, choosing the one that gives me the best results.

Michal Babula

@AI Explained, you should set up a question for it like this one for GPQA! https://www.metaculus.com/questions/22056/highest-gpqa-diamond-scores/

Alexis Olson

Re 2 years, I think you're underestimating true RL. Here's Karpathy on Aug 8: " No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of win the game) is really difficult in the open-ended problem solving tasks. It's all fun and games in a closed, game-like environment like Go where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or rewriting some Java code to Python? Going towards this is not in principle impossible but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving. " https://x.com/karpathy/status/1821277264996352246 I think the big question now is: Did o1 crack the open-domain reward problem? Or did they just pound closed-domains (math, coding) so hard that it somewhat generalized into adjacent domains?

Brian Crabtree

Working on it!

Philip

Haha I am, indeed

Philip

2 years would be kinda scary!

Philip

I think there is a place for both SIMPLEBENCH and "Final Benchmark" testing. They aren't necessarily trying to measure the same thing. My wild guess is that SIMPLEBENCH will be saturated within two years or so and it will be nice to have harder benchmarks to keep measuring against. I'm also skeptical that this "Final Benchmark" thing will end up being the best for this purpose but it's still worth trying. Crowdsourcing questions will result in a large spread of quality and difficulty but I bet there will be some good ones amongst the mediocre. You should submit some questions that o1 (and other frontier models) get horribly wrong. You don't need to worry if most humans can solve them or not as long as they're unambiguous.

Alexis Olson

I do agree more with you. Testing a model's reasoning abilities is more important than its knowledge of obscure facts.

Mike D

I agree. The new 'last exam' benchmark sounds like GPQA reloaded. Just take a look at the GPQA Diamond set (https://huggingface.co/datasets/Idavidrein/gpqa?row=3): it is incredibly difficult. I hope the wisdom of the crowd will find categories of questions in the 'final exam' that are truly helpful. For now I stick with the following signals when evaluating an LLM:
- reasoning gap (https://arxiv.org/abs/2402.19450v1)
- Elo difference in Chatbot Arena
- ability for efficient in-context learning
- ARC AGI
- Simple Bench

SteveHaupt

Interesting in your testing of o1 that you found it did well in certain classes of questions and not others. Suggests that there might be an intelligence taxonomy that is different from normal intelligence classifications according to, say, discipline?

Sean Gallagher

Thank you, Philip, for putting out the good word and for your analysis here. Spot on.

Devin Pellegrino

Completely agree, hard questions for SOTA models are completely different from what humans would consider hard questions. Though I think Humanity’s last exam will be multi-modal, just as I hope Simple-Bench will become!

Trenton Dambrowitz

I don’t know man, I think you may be biased 😤 /s

Anouar Mansour

