AIExplained

'Humanity's Last Exam' - I Doubt It

Added 2024-09-17 18:12:45 +0000 UTC

Less than 24 hours ago we got the claim that a multi-million dollar 'final test' for AI was being put together. But I ask questions about what it will achieve, drawing on evidence from 3 papers, Simple Bench, and my own analysis. Hopefully, this video will show you why 'o1 = AGI' claims leave a lot to be desired.

Link for Offline Viewing/Download: https://drive.google.com/file/d/1YZ7jsU_CAV3Rf9b7KvzjgENaTEPuEkCR/view?usp=drive_link

Reuters Headline: https://www.reuters.com/technology/artificial-intelligence/ai-experts-ready-humanitys-last-exam-stump-powerful-tech-2024-09-16/?s=09

Final Benchmark Page: https://agi.safe.ai/submit

ARC Leaderboard: https://arcprize.org/leaderboard

https://x.com/mikeknoop/status/1835091832398815235

Human Performance on ARC, studied: https://arxiv.org/pdf/2409.01374

Safety Statement, CAIS: https://www.safe.ai/work/statement-on-ai-risk

LLMs Can't Quite Get the Joke: https://arxiv.org/pdf/2406.10522

Math Traps Paper: https://arxiv.org/pdf/2405.06680

MMLU Errors (1 year ago): https://www.youtube.com/watch?v=hVade_8H8mE

Comments

Having spent 12 years at CERN, I can say that a CERN for AI would be an incredible place to be. It would likely attract (or would have attracted) a lot of talent, including from the big labs (as CERN did in physics), especially among those who I believe might feel betrayed over the openness vision. That said, I also believe it is unfortunately way too late. Getting member states to agree on setup, budget, and locations, and to start hiring, would take years. At this point I just hope that the US wins the race, tbh.

Robert Gomez-Reino

2024-10-27 18:29:34 +0000 UTC

Hi Phil, just pondering whether improved tool usage during training could be a game changer. Just as adding reasoning steps improved o1, one could add steps that allow the model to apply fact checks (e.g. temporal, common sense, social) or use specialized algorithms (block world). It seems like a way to open up another scaling dimension: improved tool usage during inference. I cannot see a fundamental difference between training for reasoning steps and training to apply tools.

Peter Biela

2024-10-11 09:02:48 +0000 UTC

Give any of these models a configuration of a Rubik's cube to solve and they will spit out complete, utter BS. Moreover, give them a sides configuration and ask them to solve it: they also fail. Ask them to write and run code to solve it: they spit out something they don’t understand and can’t explain. Doesn’t that tell you guys anything?! They can almost understand and achieve a solution better than a human, but they are lossy integrated and brittle.

Youssef Mohamed

2024-09-21 12:41:22 +0000 UTC

Thanks this is exactly what I needed

Gilad

2024-09-20 19:10:49 +0000 UTC

Great idea

Philip

2024-09-20 18:01:33 +0000 UTC

There were some lower-hanging 'tricks' in Simple, but in many categories it falls flat. On Orion 1, I honestly don't know, but your figures don't sound impossible. I still think Orion with o1 would score less than humans though, unless a whole new paradigm is unveiled (think pixel by pixel?)

Philip

2024-09-20 18:01:23 +0000 UTC

Something we will be testing at some point. But contrary to popular belief, you can actually tell the model 'this is a trick' and the needle moves much less on performance than people think, and on some questions it moves backwards.

Philip

2024-09-20 17:59:33 +0000 UTC

Try this! https://support.patreon.com/hc/en-us/articles/212052266-Getting-Discord-access

Philip

2024-09-20 17:58:33 +0000 UTC

Guys I can't seem to find a link to the discord

Gilad

2024-09-20 14:01:03 +0000 UTC

Yeah, I didn't even think about the Blackwell GPUs, although the model does get bigger proportionally with GPU power, so I'm not entirely confident about increased inference speed. Did he say that the inference capability of the Blackwell GPUs would receive a much greater jump in power compared to the performance jump in training?

Gilad

2024-09-20 08:20:41 +0000 UTC

Yah, and by then, we could also see more baseline reasoning chains given better distillation and more Blackwell GPUs used for inference vs H100s (Jensen said they're 50x better at inference for o1 use cases). Also, we could see a full menu of o2-minis specialized for different verticals with a router picking the best one per prompt or per reasoning chain (o1-mini is only specialized for STEM). I wouldn't be surprised if simplebench is basically saturated by summer.

Brian Crabtree

2024-09-20 03:34:20 +0000 UTC

I wonder if anyone has tried fine-tuning a model on trick questions? What implication would that have for these tests?

Barnaby Golden

2024-09-19 16:52:31 +0000 UTC

Don't you think that an o1-type system, based on Orion would be pretty close to 80 in simple bench? If o1 preview is ~45, full o1 might be ~50-60, next gen (o2?) might be >70?

Gilad

2024-09-19 16:08:26 +0000 UTC

I absolutely agree with you that any test which can maximize the delta between the average human and these models is the optimal test we should be using to discern if the model is reasoning or merely memorizing and parroting back answers. Great insight. This is why I subscribed and I really hope this type of thinking becomes the dominant conversation in the world of model evaluation.

Joshua Davis

2024-09-19 02:08:51 +0000 UTC

Let's say you want to test the model with a question it can't visualize. Potentially the only solution would be to run a Sora prompt in the backend with the scenario from the question, play it out, and use what happened to answer. Obviously Sora 1.0 won't be sufficient if you are asking what kind of sounds, sensations, smells, or tastes arise from a scenario, so you really need a complete world model. Is this the only true way to answer at least a large class of questions that are technically in text form but refer to unlearned concepts? I suppose this would be taking maximizing test-time compute (à la o1) to its logical conclusion.

John Merkowsky

2024-09-19 01:09:25 +0000 UTC

I do not look at benchmarks - I test models firsthand for my real use cases and compare outputs, choosing the one that gives me the best results.

Michal Babula

2024-09-18 22:10:28 +0000 UTC

@AI Explained, you should set up a question for it like this one for GPQA! https://www.metaculus.com/questions/22056/highest-gpqa-diamond-scores/

Alexis Olson

2024-09-18 15:29:38 +0000 UTC

Re 2 years, I think you're underestimating true RL. Here's Karpathy on Aug 8: "No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of win the game) is really difficult in the open-ended problem solving tasks. It's all fun and games in a closed, game-like environment like Go where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or rewriting some Java code to Python? Going towards this is not in principle impossible but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving." https://x.com/karpathy/status/1821277264996352246 I think the big question now is: Did o1 crack the open-domain reward problem? Or did they just pound closed-domains (math, coding) so hard that it somewhat generalized into adjacent domains?

Brian Crabtree

2024-09-18 14:31:40 +0000 UTC

Working on it!

Philip

2024-09-18 08:54:58 +0000 UTC

Haha I am, indeed

Philip

2024-09-18 08:54:43 +0000 UTC

2 years would be kinda scary!

Philip

2024-09-18 08:54:13 +0000 UTC

I think there is a place for both SIMPLEBENCH and "Final Benchmark" testing. They aren't necessarily trying to measure the same thing. My wild guess is that SIMPLEBENCH will be saturated within two years or so and it will be nice to have harder benchmarks to keep measuring against. I'm also skeptical that this "Final Benchmark" thing will end up being the best for this purpose but it's still worth trying. Crowdsourcing questions will result in a large spread of quality and difficulty but I bet there will be some good ones amongst the mediocre. You should submit some questions that o1 (and other frontier models) get horribly wrong. You don't need to worry if most humans can solve them or not as long as they're unambiguous.

Alexis Olson

2024-09-18 02:28:32 +0000 UTC

I do agree more with you. Testing a model's reasoning abilities is more important than its knowledge of obscure facts.

Mike D

2024-09-17 20:27:29 +0000 UTC

I agree. The new 'last exam' benchmark sounds like GPQA reloaded. Just take a look at the GPQA Diamond set (https://huggingface.co/datasets/Idavidrein/gpqa?row=3): it is incredibly difficult. I hope the wisdom of the crowd will find categories of questions in the 'final exam' that are truly helpful. For now I stick with the following signals when evaluating an LLM: reasoning gap (https://arxiv.org/abs/2402.19450v1), Elo difference in Chatbot Arena, ability for efficient in-context learning, ARC-AGI, and Simple Bench.

SteveHaupt

2024-09-17 19:40:59 +0000 UTC

Interesting in your testing of o1 that you found it did well in certain classes of questions and not others. Suggests that there might be an intelligence taxonomy that is different from normal intelligence classifications according to, say, discipline?

Sean Gallagher

2024-09-17 19:39:26 +0000 UTC

Thank you, Philip, for putting out the good word and for your analysis here. Spot on.

Devin Pellegrino

2024-09-17 19:06:11 +0000 UTC

Completely agree, hard questions for SOTA models are completely different from what humans would consider hard questions. Though I think Humanity’s last exam will be multi-modal, just as I hope Simple-Bench will become!

Trenton Dambrowitz

2024-09-17 18:59:30 +0000 UTC

I don’t know man, I think you may be biased 😤 /s

Anouar Mansour

2024-09-17 18:49:27 +0000 UTC
