AIExplained


Claude 4 Simple SOTA Insights + Leaked System Prompt

What is Claude 4 Opus understanding that others aren't, to hit new records? And what is going on in that leaked system prompt? 10 highlights of the system prompt plus a quick update on Simple. Oh, and it looks like GPT-5 is finally arriving in July...

Full Leaked System prompt: https://raw.githubusercontent.com/elder-plinius/CL4R1T4S/d3193c0ca1d2e54e4ffcffedc1b185746c3c9038/ANTHROPIC/Claude_4.txt

Public System Prompt: https://docs.anthropic.com/en/release-notes/system-prompts#may-22th-2025

System Prompt Deletion: https://simonwillison.net/2025/May/25/claude-4-system-prompt/

Humanity's Last Exam: https://agi.safe.ai/

AI Job Loss: https://fortune.com/2025/05/28/anthropic-ceo-warning-ai-job-loss/

Tibor GPT-5 July (normally reliable): http://x.com/btibor91/status/1929241704873308253

Comments

Hey Kyle, should be quite quick! https://support.patreon.com/hc/en-us/articles/212052266-Getting-Discord-access

Philip

Hi, how do I join the Discord community, please?

Kyle Behrend

Got a documentary coming, but it got delayed due to editing issues! Hope you also consider the general support of AI Explained content, and I have some other announcements coming.

Philip

New content? Do consider I spent $10 this month for this video only!

Dane Wagenhoffer

Another awesome video. What kind of prompt management system do you use when running SimpleBench?

Paulius Mui

SimpleBench is one of the key benchmarks I look at, together with SWE-bench, Aider, ARC-AGI 1/2, Humanity's Last Exam, and a couple of the well-known math benchmarks. When SimpleBench saturates, reliability further increases by at least 3x, multi-needle recall (paired with complex information/needle combination) also increases, and pricing per use stays comparable to current levels, then entry-level white-collar jobs are in danger... at scale. I would love to see more in-depth multi-needle and complex recall benchmarks from the big AGI labs on their SOTA models. I think this is quite important... Google did it a bit for Gemini 2.5 Pro, but for multi-needle the published benchmarking could still be better!

Flo My

Excellent work as usual, Philip. Unfortunately, Claude's convo length is abysmal, and I don't have the money to go pro, nor the ability to API it.

Norfuer

Would you care to go deeper on your experiences where SOTA models lose track of context they were previously given? Here or on Philip’s Discord (assuming you’re Max M; tag me: palimondo).

Pavol Vaskovic

The more I use LLMs, the more I realize that they likely aren't going to replace a large portion of white-collar work for a very long time. It's one thing to solve simple logic puzzles and another to actually understand why you are doing something. Until there's some research on LLMs deeply understanding why they are doing what they are doing, the impact it has, and things greater than the sum of their weights, I see a lot of corporations riding the automation train into a total mess. One thing that should have been done years ago was giving an LLM the ability to query for missing context. Claude Opus 4, o3, whatever model you have will consistently hallucinate context when it is missing information. Sure, you may reduce some menial, simplistic version of hallucinations in a benchmark, but in the real world these menial issues with LLMs scale all the way up to a very high degree of sophistication. With any genuinely sophisticated applied problem, every SOTA LLM quickly hallucinates context you have literally provided it in earlier prompts. I have seen this happen time and time again with o3, Claude Opus 4, Gemini 2.5... Sure, these problems are more sophisticated in nature, but it really starts to show the cracks in the facade of LLMs. It's like I can see the model grasping at straws for context it has literally been given. Any human can just look at a file or document and know what's in it.

Maximillian Richard Mahlke

Thank you, Philip!

Daniel A Barbatti

I always thought that LLMs can only consider a few instructions at once. Now we have system prompts containing hundreds of instructions? I am genuinely surprised that this works. It seems I completely missed that LLMs have improved so much on this front.

Phillip Yao-Lakaschus

Regarding Trump in the system prompt: ask o3 about the current Trump administration and check its reasoning. You’ll see it answers as if it were being presented with an unlikely fictional scenario.

Erik

I'm going to be interested to see how Grok 3.5 does on SimpleBench's spatial reasoning questions. There are hints they are focusing a lot on physics, and I wonder if RL'ing on physics will capture more "common sense" stuff.

Joshua Sellers

Sonnet 4 underperforming Sonnet 3.7 in a couple of benchmarks makes me wonder if this is the reason for the release of Opus 4.

SteveHaupt

In my experience: Opus "understands" what you want from it much faster than any other model. It recognises the nuances and details, and can switch between different levels at any time (with a full context window), from the actual theme to the meta level and even the meta-meta level, and back again. It rarely gets "confused". If I use it with my reasoning prompts, I get even more interesting results. So far, this is my subjective experience of using it. Impressive! My favourite model at the moment! :)

Christopher Pollin

Great Video!

Marc-Leonard Overbeck

This is in line with my own experience. If the instructions remain in the context window for longer, it is usually more practical. But I would also be interested to know exactly how this would work technically if it were the case. Is the context that is closer to the next generated token given a higher score (which would make sense)?

Christopher Pollin

Question around 3:08: is it proven that LLMs pay extra attention to an instruction at the bottom of the prompt? I didn't know this.

jokmenen

