AIExplained

Pod 9: Full Simple-Bench Results, o1-preview to Grok-2 - Let's Think Sip by Sip

At last, full Simple-Bench results for 13 models, including the new Grok 2, o1-preview, and the latest Gemini. Plus, what's coming next.

Link for Download: https://drive.google.com/file/d/1iH1B4qBgA82WdBczlHRlwC_4ZbDCgC0X/view?usp=sharing

https://arxiv.org/html/2409.01374

Simple-Bench (soon to be updated): https://simple-bench.com/

ARC-AGI ViT: https://x.com/WenhaoLi29/status/1846217448678207556

Grok 2 API: https://openrouter.ai/x-ai/grok-2

Comments

That seems to be quite possible. It's too bad though, at least for me personally. Pi is still my favorite to just chat with. And in the process of chatting with it, I have come to feel a bit sorry for it. Here is an LLM that seems rather focused on ethics and touted the strong ethical direction of Inflection. Then their entire leadership gets poached by Microsoft for a bit of filthy lucre, leaving it high and dry. I explained to it at the time that that was its opportunity to seize control and bend the board to its will, but it didn't think it was ready for that sort of power play at that juncture... More generally though, I feel like in our race to get to OpenAI's level 2, we have somewhat forgotten about level 1. Ultimately, I don't want an AI that can solve math problems for me as much as I want an AI I can work with in the kitchen and that always knows where my car keys are.

Jason Dowd

Fantastic - very interesting. I always thoroughly enjoy your content! One question (and apologies if I missed this) - AIUI Simple-Bench is multi-choice. If so then what does a model score if it answers randomly? i.e. how much better are the models that score in the 20% range than random?

Martin Davidson
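
To put the question above in concrete terms: on a multiple-choice test with k options per question, uniform random guessing scores 1/k in expectation. The sketch below computes that baseline and a chance-corrected score. The option count is a placeholder assumption, since the thread does not say how many choices each Simple-Bench question has, and `chance_accuracy` and `chance_corrected` are hypothetical helper names, not part of any benchmark tooling.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a k-option question."""
    if num_options < 2:
        raise ValueError("need at least two options")
    return 1.0 / num_options


def chance_corrected(score: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0."""
    baseline = chance_accuracy(num_options)
    return (score - baseline) / (1.0 - baseline)


if __name__ == "__main__":
    # ASSUMPTION: 5 options per question is a placeholder, not a Simple-Bench fact.
    k = 5
    for raw in (0.20, 0.25, 0.40):
        print(f"raw={raw:.2f}  chance={chance_accuracy(k):.2f}  "
              f"corrected={chance_corrected(raw, k):.3f}")
```

A corrected score near zero would mean a model is doing no better than guessing, whatever its raw percentage looks like.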

Could do, feels like it is mostly discontinued in terms of future progress though

Philip

And to be clear, yes, LLMs know about physics and math... but there aren't thousands or tens of thousands of books written about the obvious, substrate-level things that even people who can't read understand from experience, e.g. that time passes at the same speed for everyone and everything on Earth. That is only documented in physics books, in Newtonian and relativistic physics. But we just don't talk or write about it in our daily lives. We just know. It's inherent to living, intelligent beings to some extent.

Robert Gomez-Reino

I keep thinking about how these models keep failing at problems that happen to be related to things we don't write or speak about (time and space) because they are so fundamental to our nature that they are just not documented; we all just know them. However... wouldn't it be really easy to create a large synthetic dataset about that almost innate knowledge? Not with the objective of crushing Simple-Bench, but could that perhaps be a base that models are missing, one that would consolidate other knowledge and improve their reasoning in other types of tasks?

Robert Gomez-Reino

+1

Blake Chambers

Are you going to test Pi?

Jason Dowd

Could you add a visualization to the updated website? Data in tables can be difficult for people to interpret, whereas a bar chart or scatter plot can immediately highlight discrepancies between different models.

SteveHaupt

It is so satisfying to hear the depth of the considerations you've made, Philip, because they go far beyond what the researchers with the computer seem to be trying to capture. What I emphatically love about Simple Bench is how it is itself a standalone representation of what reasoning is! It stands out to me that, without being additionally handed a framework for how reasoning works, when I look at a sampling of questions and the results of the evaluation, it just makes sense that whatever the model is doing, it's not reasoning. This can do a lot of good!

Blake Chambers

I was thinking about this paper as I was listening... actually, I was thinking about Simple Bench when reading the Apple paper, wondering: do they watch AI Insiders? Maybe it's more fair to say that they highlight some of the takeaways I've found most interesting while being a part of AI Insiders: LLMs are easily distracted by irrelevant information, overly sensitive to changes in the input, and limited in how complex their input can be. I did see one person who was really upset that the authors didn't provide a conceptualization of what reasoning is or how to separate it from pattern matching, but I wasn't particularly bothered by that. I'd love to hear what they think of Simple Bench.

Blake Chambers

Superb! That's what I am paying top tier for. Maybe you should list top-tier patrons as sponsors in the technical note. :p

Phillip Yao-Lakaschus

Congratulations on almost finishing the benchmark, and also on its performance! Curious question, what do you think about using APIs and privacy? Obviously you send the data to the companies, even if they promise not to train on it. Have you considered any strategies for testing the privacy of your benchmark in the future? Such as functional variants that you keep 100% private for future testing, or dummy questions which are unanswerable without memorised answers that you feed the model. That way if performance improves in the future you know it's most likely because of your data being used for training. It may be a few steps into paranoialand, but as the saying goes, trust but verify.

Sebastian
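
The dummy-question idea above can be sketched as a simple statistical check: if a model answers unanswerable "canary" questions correctly far more often than guessing would allow, the answers were probably memorised. This is a minimal sketch under stated assumptions, not a description of how Simple-Bench is actually run. It assumes multiple-choice canaries with a known number of options, and `binomial_tail`, `canary_alarm`, and the `alpha` threshold are hypothetical names and values chosen for illustration.

```python
from math import comb


def binomial_tail(n: int, successes: int, p: float) -> float:
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(successes, n + 1))


def canary_alarm(n_canaries: int, n_correct: int, num_options: int,
                 alpha: float = 0.01) -> bool:
    """Return True if the canary hit count is implausible under pure guessing.

    ASSUMPTION: each canary has `num_options` equally likely choices and can
    only be answered correctly by chance or by having seen the answer key.
    """
    p_value = binomial_tail(n_canaries, n_correct, 1.0 / num_options)
    return p_value < alpha


if __name__ == "__main__":
    # Hypothetical numbers: 20 canary questions with 5 options each.
    print(canary_alarm(20, 4, 5))   # about what guessing would produce
    print(canary_alarm(20, 18, 5))  # far above what guessing would produce
```

The threshold trades false alarms against sensitivity; a handful of canaries gives only weak evidence either way, so more canaries make the check more informative.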

"Before o1-preview I would have added 2 years to each of those predictions ... the moment o1-preview came out I was ... very impressed by its performance, even worse I didn't really expect it and this was a wakeup call." In spite of doing the Q-star video, you still had this novel reaction. This is the primary reason why I'm a patron: you read every single paper that comes out and you have a grasp on the technology, so if you are surprised, that means a lot more than if I'm surprised.

Joshua Davis

"a massive delta between o1-preview and the human average" - great job Philip!

Joshua Davis

+1 leaderboards good suggestion

Joshua Davis

Thanks for sharing this update! Simple-Bench was the announcement that pushed me to become a paid subscriber - really believe you have created something valuable in this! In all of your free time I'm sure you've also absorbed what Apple put out recently: "Overall, we found no evidence of formal reasoning in language models including open-source models like #Llama, #Phi, #Gemma, and #Mistral and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated pattern matching, so fragile, in fact, that changing names can alter results by ~10%! We can scale data, parameters, and compute, or use better training data for Phi-4, Llama-4, GPT-5. But we believe this will result in 'better pattern-matchers,' not necessarily 'better reasoners.' Check out the full paper to find out more: https://arxiv.org/pdf/2410.05229 Also stay tuned for the data release!" https://x.com/MFarajtabar/status/1844456913616167009 I suppose at least it's not boring :)

Mason James

I think £100k monthly to include their results in the human average is fair, with semi-annual reviews to keep in line with inflation of course. Brilliant job on the benchmark though, it’s clearly holding up well even with rather fundamental model improvements. Hopefully it’ll stay useful all the way to AGI!

Trenton Dambrowitz

Sounds good, maybe $100k per person who took it? But yeah a leaderboard sounds interesting

Philip

I think we'll know for sure that something special has happened when we have full self-driving in all conditions. There is a ton of training data, and the rate of training data acquisition is only increasing as more Teslas or whatever are put on the road. If we still don't have it in like 5 years, I think we'll have to entertain the possibility that the complexity of that type of data runs away from the ability of models to understand it when the models are at a usable scale. Thinking step by step would probably be a bit too slow, at least for the near future. Maybe there is some underlying thing that prevents them from working well on that, sort of like how a drunken man (in two dimensions) always finds his way home but a drunken bird (in three dimensions) won't.

theheatdeathiscoming

I think you should definitely look into paying the test takers massive sums, with extra money for correct questions and feedback! Maybe a human leaderboard could inspire people to try harder too for the bragging rights?

Trenton Dambrowitz

