Less than 24 hours ago we got the claim that a multi-million dollar 'final test' for AI was being put together. But I ask questions about what it will achieve, drawing on evidence from 3 papers, Simple Bench, and my own analysis. Hopefully, this video will show you why 'o1 = AGI' claims leave a lot to be desired.
Having spent 12 years at CERN, I can say that a CERN for AI would be an incredible place to be. It likely would (have) attract(ed) a lot of talent, including from big labs (as it did in physics), especially many of those who I believe might feel betrayed on the openness vision. That said... I also believe it is way too late, unfortunately. Getting member states to agree on setup, budget, locations, and start the hires... would take years. At this point I just hope that the US wins the race tbh.
Robert Gomez-Reino
2024-10-27 18:29:34 +0000 UTC
Hi Phil, just pondering if improved tool usage during training could be a game changer. Just like adding reasoning steps improved o1, adding some steps that would allow the model to apply fact checks (e.g Temporal, common sense, social) or use specialized algorithms (block world). It seems like a way to open up another scaling dimension, improved tool usage during inference.
I cannot see a fundamental difference between training for reasoning steps and applying tools.
Peter Biela
2024-10-11 09:02:48 +0000 UTC
Give any of these models a configuration of a Rubik's cube to solve and they will spit out complete utter BS
Moreover, give them a sides configuration and ask them to solve it
They also fail
Ask them to write and run code to solve it
They spit out something they don’t understand and can’t explain
Doesn’t that tell you guys anything?!
They can almost understand and achieve a solution better than a human
But they are lossy integrated and brittle
Youssef Mohamed
2024-09-21 12:41:22 +0000 UTC
Thanks this is exactly what I needed
Gilad
2024-09-20 19:10:49 +0000 UTC
Great idea
Philip
2024-09-20 18:01:33 +0000 UTC
There were some lower hanging 'tricks' in Simple Bench, but in many categories it falls flat. On Orion 1, I honestly don't know but your figures don't sound impossible. I still think Orion with o1 would score less than humans though, unless a whole new paradigm is unveiled (think pixel by pixel?)
Philip
2024-09-20 18:01:23 +0000 UTC
Something we will be testing at some point. But contrary to popular belief you can actually tell the model 'this is a trick' and the needle moves much less on performance than people think, and on some questions backwards.
Yeah I didn't even think about the Blackwell GPUs, although the model does get bigger proportionally with GPU power so I'm not entirely confident about increased inference speed.
Did he say that the inference capability of the Blackwell GPUs would receive a much greater jump in power compared to the performance jump in training?
Gilad
2024-09-20 08:20:41 +0000 UTC
Yah, and by then, we could also see more baseline reasoning chains given better distillation and more Blackwell GPUs used for inference vs H100s (Jensen said they're 50x better at inference for o1 use cases). Also, we could see a full menu of o2-minis specialized for different verticals with a router picking the best one per prompt or per reasoning chain (o1-mini is only specialized for STEM). I wouldn't be surprised if simplebench is basically saturated by summer.
Brian Crabtree
2024-09-20 03:34:20 +0000 UTC
I wonder if anyone has tried fine-tuning a model on trick questions? What implication would that have for these tests?
Barnaby Golden
2024-09-19 16:52:31 +0000 UTC
Don't you think that an o1-type system based on Orion would be pretty close to 80 in Simple Bench?
If o1 preview is ~45, full o1 might be ~50-60, next gen (o2?) might be >70?
Gilad
2024-09-19 16:08:26 +0000 UTC
I absolutely agree with you that any test which can maximize the delta between the average human and these models is the optimal test we should be using to discern if the model is reasoning or merely memorizing and parroting back answers. Great insight. This is why I subscribed and I really hope this type of thinking becomes the dominant conversation in the world of model evaluation.
Joshua Davis
2024-09-19 02:08:51 +0000 UTC
Let's say you want to test the model with a question it can't visualize. Potentially the only solution would be to run a Sora prompt in the backend with the scenario from the question, play it out, and use what happened to answer. Obviously Sora 1.0 won't be sufficient if you are asking for what kind of sounds, sensations, smells, or tastes arise from a scenario, so you really need a complete world model. Is this the only true way to answer at least a large class of questions that are technically in text form but refer to unlearned concepts? I suppose this would be taking maximizing test time compute (à la o1) to its logical conclusion.
John Merkowsky
2024-09-19 01:09:25 +0000 UTC
I do not look at benchmarks - I test models firsthand for my real use cases and compare outputs, choosing the one that gives me the best results.
Michal Babula
2024-09-18 22:10:28 +0000 UTC
@AI Explained, you should set up a question for it like this one for GPQA!
https://www.metaculus.com/questions/22056/highest-gpqa-diamond-scores/
Alexis Olson
2024-09-18 15:29:38 +0000 UTC
Re 2 years, I think you're underestimating true RL.
Here's Karpathy on Aug 8:
" No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale.
And intuitively, this is because getting actual rewards (i.e. the equivalent of win the game) is really difficult in the open-ended problem solving tasks. It's all fun and games in a closed, game-like environment like Go where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or rewriting some Java code to Python? Going towards this is not in principle impossible but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving. "
https://x.com/karpathy/status/1821277264996352246
I think the big question now is: Did o1 crack the open-domain reward problem? Or did they just pound closed-domains (math, coding) so hard that it somewhat generalized into adjacent domains?
Brian Crabtree
2024-09-18 14:31:40 +0000 UTC
Working on it!
Philip
2024-09-18 08:54:58 +0000 UTC
Haha I am, indeed
Philip
2024-09-18 08:54:43 +0000 UTC
2 years would be kinda scary!
Philip
2024-09-18 08:54:13 +0000 UTC
I think there is a place for both SIMPLEBENCH and "Final Benchmark" testing. They aren't necessarily trying to measure the same thing.
My wild guess is that SIMPLEBENCH will be saturated within two years or so and it will be nice to have harder benchmarks to keep measuring against.
I'm also skeptical that this "Final Benchmark" thing will end up being the best for this purpose but it's still worth trying. Crowdsourcing questions will result in a large spread of quality and difficulty but I bet there will be some good ones amongst the mediocre.
You should submit some questions that o1 (and other frontier models) get horribly wrong. You don't need to worry if most humans can solve them or not as long as they're unambiguous.
Alexis Olson
2024-09-18 02:28:32 +0000 UTC
I do agree more with you. Testing a model's reasoning abilities is more important than its knowledge of obscure facts.
Mike D
2024-09-17 20:27:29 +0000 UTC
I agree. The new 'last exam' benchmark sounds like GPQA reloaded. Just take a look at the GPQA Diamond set (https://huggingface.co/datasets/Idavidrein/gpqa?row=3) - it is incredibly difficult. I hope the wisdom of the crowd will find categories of questions in the 'final exam' that are truly helpful.
For now I stick with the following signals when evaluating an LLM:
- reasoning gap (https://arxiv.org/abs/2402.19450v1)
- elo difference in chatbot arena
- ability for efficient in-context learning
- ARC AGI
- simple bench
SteveHaupt
2024-09-17 19:40:59 +0000 UTC
Interesting in your testing of o1 that you found it did well in certain classes of questions and not others. Suggests that there might be an intelligence taxonomy that is different from normal intelligence classifications according to, say, discipline?
Sean Gallagher
2024-09-17 19:39:26 +0000 UTC
Thank you Philip for putting out the good word and for your analysis here. Spot on.
Devin Pellegrino
2024-09-17 19:06:11 +0000 UTC
Completely agree, hard questions for SOTA models are completely different from what humans would consider hard questions.
Though I think Humanity’s last exam will be multi-modal, just as I hope Simple-Bench will become!