AIExplained

Claude 4 Simple SOTA Insights + Leaked System Prompt

Added 2025-06-02 13:50:55 +0000 UTC

What is Claude 4 Opus understanding that others aren't, to hit new records? And what is going on in that leaked system prompt? 10 highlights of the system prompt plus a quick update on Simple. Oh, and it looks like GPT-5 is finally arriving in July...

Full Leaked System prompt: https://raw.githubusercontent.com/elder-plinius/CL4R1T4S/d3193c0ca1d2e54e4ffcffedc1b185746c3c9038/ANTHROPIC/Claude_4.txt

Public System Prompt: https://docs.anthropic.com/en/release-notes/system-prompts#may-22th-2025

System Prompt Deletion: https://simonwillison.net/2025/May/25/claude-4-system-prompt/

Humanity's Last Exam: https://agi.safe.ai/

AI Job Loss: https://fortune.com/2025/05/28/anthropic-ceo-warning-ai-job-loss/

Tibor GPT-5 July (normally reliable): http://x.com/btibor91/status/1929241704873308253

Comments

Hey Kyle, should be quite quick! https://support.patreon.com/hc/en-us/articles/212052266-Getting-Discord-access

Philip

2025-07-10 08:36:35 +0000 UTC

Hi, how do I join the Discord community, please?

Kyle Behrend

2025-07-05 11:57:07 +0000 UTC

Got a documentary coming, but it got delayed due to editing issues! Hope you also consider the general support for AI Explained content too, and I have some other announcements coming.

Philip

2025-07-02 14:55:55 +0000 UTC

New content? Do consider I spent $10 this month for this video only!

Dane Wagenhoffer

2025-07-02 04:17:44 +0000 UTC

Another awesome video. What kind of prompt management system do you use when running SimpleBench?

Paulius Mui

2025-06-06 23:21:07 +0000 UTC

SimpleBench is one of the key benchmarks I look at, together with SWE-bench, Aider, ARC-AGI 1/2, Humanity's Last Exam, and a couple of the well-known math benchmarks. When SimpleBench saturates, reliability further increases by at least 3x, multi-needle recall (paired with complex information/needle combination) also increases, and pricing per use stays comparable to current levels, then entry-level white-collar jobs are in danger... at scale. I would love to see more in-depth multi-needle and complex recall benchmarks from the big AGI labs on their SOTA models. I think this is quite important... Google did it a bit for Gemini 2.5 Pro, but for multi-needle the published benchmarking could still be better!

Flo My

2025-06-05 14:12:28 +0000 UTC

Excellent work as usual, Philip. Unfortunately, Claude's convo length is abysmal, and I don't have the money to go pro, nor the ability to API it.

Norfuer

2025-06-04 10:25:09 +0000 UTC

Would you care to go deeper on your experiences where SOTA models lose track of context they were previously given? Here or on Philip's Discord (assuming you're Max M; tag me: palimondo).

Pavol Vaskovic

2025-06-03 06:16:00 +0000 UTC

The more I use LLMs, the more I start to realize that they likely aren't going to replace a large portion of white-collar work for a very long time. It's one thing to solve simple logic puzzles and another to actually understand why you are doing something. Until there's some research on LLMs deeply understanding why they are doing what they are doing, the impact it has, and things greater than the sum of their weights, I see a lot of corporations riding the automation train into a total mess. One thing that should have been done years ago was giving an LLM the ability to query for missing context. Claude 4 Opus, o3, whatever model you have will consistently hallucinate context when it is missing information. Sure, you may reduce some menial, simplistic version of hallucinations in a benchmark, but in the real world these menial issues with LLMs scale all the way up to a very high degree of sophistication. With any genuinely sophisticated applied problem, every SOTA LLM quickly hallucinates context you have literally provided it in earlier prompts. I have seen this happen time and time again with o3, Claude Opus 4, Gemini 2.5... Sure, these problems are more sophisticated in nature, but it really starts to show the cracks in the facade of LLMs. It's like I can see the model grasping at straws for context it has literally been given. Any human can just look at a file or document and know what's in it.

Maximillian Richard Mahlke

2025-06-03 01:18:10 +0000 UTC

Thank you, Philip!

Daniel A Barbatti

2025-06-02 21:08:03 +0000 UTC

I always thought that LLMs can only consider a few instructions at once. Now we have system prompts containing hundreds of instructions? I am genuinely surprised that this works. It seems I completely missed that LLMs have improved so much on this front.

Phillip Yao-Lakaschus

2025-06-02 19:43:55 +0000 UTC

Regarding Trump in the system prompt: ask o3 about the current Trump administration and check its reasoning. You'll see it answers on the assumption that it is being presented with an unlikely fictional scenario.

Erik

2025-06-02 18:40:45 +0000 UTC

I'm going to be interested to see how Grok 3.5 does on SimpleBench's spatial reasoning questions. There are hints they are focusing a lot on physics, and I wonder if RL'ing on physics will capture more "common sense" stuff.

Joshua Sellers

2025-06-02 18:35:10 +0000 UTC

Sonnet 4 underperforming Sonnet 3.7 in a couple of benchmarks makes me wonder if this is the reason for the release of Opus 4.

SteveHaupt

2025-06-02 15:33:22 +0000 UTC

In my experience: Opus "understands" what you want from it much faster than any other model. It recognises the nuances and details, and can switch between different levels at any time (with a full context window), from the actual theme to the meta level and even the meta-meta level, and back again. It rarely gets "confused". If I use it with my reasoning prompts, I get even more interesting results. So far, this is my subjective experience of using it. Impressive! My favourite model at the moment! :)

Christopher Pollin

2025-06-02 15:23:12 +0000 UTC

Great Video!

Marc-Leonard Overbeck

2025-06-02 14:41:39 +0000 UTC

This is in line with my own experience. If the instructions remain in the context window for longer, it is usually more practical. But I would also be interested to know exactly how this would work technically if it were the case. Is the context that is closer to the next generated token given a higher score (which would make sense)?

Christopher Pollin

2025-06-02 14:37:51 +0000 UTC

Question around 3:08: is it proven that LLMs pay extra attention to an instruction at the bottom of the prompt? I didn't know this.

jokmenen

2025-06-02 14:16:12 +0000 UTC
