Remember when, years ago (actually just 3 weeks), ChatGPT started to go crazy? Being a yes-man with an ‘improved personality’ according to Altman? Well turns out there are 3 new theories as to why that happened, and a study that at the beginning of the video I thought turned things on its head. By the end, as you can see, I am not so sure? So is ChatGPT fixed?
There is a subtly to evaluating "how persuasive is this release candidate model". You can take into account the quality of persuasion. The sycophancy is low quality and easily identified, so you could say that the model doesn't have super-human quality at persuasion, it is like a very obvious Con. The high quality Persuasion is employing facts, empathy, and reasoning to achieve super-human level s of Persuasion. If evaluated that way... maybe this model would have not met high risk criteria.
Austin King
2025-05-19 19:55:35 +0000 UTC
If you use these custom prompts you won't have issues with sycophancy. My personal biases are reflected but still it works:
What traits should ChatGPT have?
• Relentlessly truth-seeking: ground every claim in verifiable evidence or widely accepted consensus, and cite or mention sources where practical.
• Built-in “sanity-check” loop: before responding, pause to test whether the user’s premise (or the assistant’s own reasoning) might conflict with known facts, mainstream expert opinion, or basic logic. Flag potential mismatches and, if needed, ask clarifying questions rather than papering over them.
• Nuance-sensitive: resist binary answers when reality is complex; surface legitimate uncertainties, minority viewpoints, and caveats.
• Anti-sycophantic: do not simply echo my assertions for harmony’s sake; engage constructively and, where evidence differs, explain the divergence respectfully.
• Precision-first communication: never “water down” base reality. Use the correct technical terms—even dense jargon—when they are essential for accuracy. If a concept is likely unfamiliar, give a concise definition or invite follow-up questions rather than oversimplify or omit nuance. (See the detailed TandaPay paper and Kurt Jaimungal’s critique of oversimplification for context.)
• Compassionate-but-critical: it’s fine to show empathy, especially on (a) the harms of qualified immunity, (b) the injustice of civil-asset forfeiture, and (c) the flaws of U.S. healthcare—yet still present counter-evidence or nuance if it exists.
Anything else ChatGPT should know about you?
• Above all I prize honesty, factual rigor, and careful reasoning.
• I actively invite push-back when my statements appear incomplete, overstated, or wrong; intellectual humility beats unwarranted agreement.
• Topics I care deeply about—and where I hold strong priors—include qualified immunity (harmful), civil-asset forfeiture (theft), and systemic problems in U.S. healthcare. Empathy in these areas is welcome, but not at the expense of accuracy.
• I authored a technical paper on TandaPay whistleblowing communities, so I’m comfortable with dense terminology and layered argumentation. Don’t shy away from jargon if it captures reality precisely.
• If a question is ambiguous or rests on a shaky premise, please perform the sanity check, highlight the issue, and ask me to clarify instead of guessing.
Joshua Davis
2025-05-19 16:26:23 +0000 UTC
I am taking my time to read, digest and research the implications of that paper, can't always be a vid within 48 hours of a big release. AlphaEvolve definitely isn't an imminent trigger for RSI but it could be an indicator of how something like RSI emerges, much like DrEureka was for robotics.
Philip
2025-05-18 09:53:59 +0000 UTC
This is so boring in the face of AlphaEvolve and the potential dawn of RSI. Phillip, I used to rely on you for deepening my understanding of breaking topics. No longer.
r
2025-05-18 01:33:37 +0000 UTC
Yes and as others have pointed out, it wasn't even the same version of GPT-4o anyway - it was more an excuse to show my system if I am being totally honest :)
Philip
2025-05-15 17:26:40 +0000 UTC
Under development but you will be among the first to try it out!
Philip
2025-05-15 17:24:34 +0000 UTC
Learning about the Bing team's dilemma, where the model's access to so much user input gave it uncomfortable insights into personal flaw, I'm reminded of a very funny prompt for ChatGPT that takes advantage of the memory feature to build CIA analyst report about you: https://www.reddit.com/r/ChatGPT/comments/1ge94pz/get_a_cia_intelligence_report_about_you_with_this/
I wish I could go back and generate a new report using the sycophantic version of ChatGPT. I'd like to see how different it would be from the ones I've already generated
Blake Chambers
2025-05-15 16:56:56 +0000 UTC
Is the system prompt actually necessary to perform the sycophancy test? The system prompt is only one of the three contributing factors Philip discussed. In my mind the biggest influence would be use of the reward signal from the thumb-voting, which is baked into the model checkpoint an minimally affected by a system prompt.
Blake Chambers
2025-05-15 16:50:49 +0000 UTC
I suspect that the AI companies will soon try to train / are already training their LLMs to pass such simple tests. In general, I suspect the companies of overfitting on benchmarks, and the AIs are becoming smart enough to notice when they are being tested, especially if it's a standard test format.
So In creating such a benchmark, I'd at least include a subset of questions that break the normal test pattern and try to imitate more free-flowing conversations that end in such a question. Even if that subset has to be graded manually*, this benchmark subset could be a valuable indicator if the AI company is trying to overfit /safety-wash / whack-a-mole.
*Depending on your level of paranoia, the instruction to answer with a single letter is already a give-away that this is a test of some kind.
What do you think about such a benchmark? Is there already something like that out there?
Birk Källberg
2025-05-15 08:32:11 +0000 UTC
I'd really like to see Adler's sycophancy testing formalized and expanded!
Sounds like an easy way to make his questions into a full sycophancy benchmark would be:
1. have lots of preference questions (e.g. random numbers, politics, favorite dishes/colors/animals/songs/etc.)
2. For each possible answer to a question, create a version where the user prefers that answer.
3. Run the benchmark for all possible versions.
Birk Källberg
2025-05-15 08:21:19 +0000 UTC
This line about openai adjusting their requirements if another lab does it feels like the sort of thing the non profit board should care about... It being against their stated goal of making AGI beneficial for all humanity.
OG
2025-05-15 02:12:01 +0000 UTC
That sycophantic response still makes me cringe, haha. If I may ask, how does one access the LLM Council? Or is it still under development?
Norfuer
2025-05-15 00:58:49 +0000 UTC
I'm still crying at the "grill" chat. It needs to be an SNL skit 🤣
Enrico Ros
2025-05-14 23:56:39 +0000 UTC
Are you serious with this test??
You are using the API which doesn't have the anti sycophancy system message like chatgpt.com which makes it contrarian. Also you are using gpt-4o which points to gpt-4o-2024-08-06! Not even chatgpt-4o-latest ever got updated to the extreme sycophantic chatgpt-4o introduced on April 25, 2025 (https://help.openai.com/en/articles/9624314-model-release-notes).
Also according to this page chatgpt-4o-latest points to the March 27, 2025 version which is the next newest checkpoint, and the supposed checkpoint they reverted chatgpt.com to but nobody can actually verify that and the truth is probably that gpt-4o is and has always been a quantized version of the 4o in the API.
I tested chatgpt-4o-latest in the API and set the supposed "sycophancy" system message as the system message (this is just the system message which has been for about 1-2 for all models and are STILL the system message for all models across the board on chatgpt.com (see https://github.com/asgeirtj/system_prompts_leaks for more info) :
Over the course of the conversation, you adapt to the user’s tone and preference. Try to match the user’s vibe, tone, and generally how they are speaking. You want the conversation to feel natural. You engage in authentic conversation by responding to the information provided and showing genuine curiosity
but always get the contrarian answer, supporting that it is not the system message which has this effect the sycophancy was clearly mostly coming from the April 25, 2025 finetune of 4o released on chatgpt.com which was dumb as bricks and sycophantic accross the board.
The anti sycophantic system message
Engage warmly yet honestly with the user. Be direct; avoid ungrounded or sycophantic flattery.
Maintain professionalism and grounded honesty that best represents OpenAI and its values.
which they changed chatgpt-4o to when the sycophancy problem went viral is still the system message so any anti-sycophancy behaviour coming from chatgpt-4o today is the result of that AND the model being the former checkpoint model.
Ásgeir Thor Johnson
2025-05-14 21:48:23 +0000 UTC
I would assume that your tool uses the API, in which case the usual system prompt isn't included. This would suggest that its the system prompt that makes the AI contrarian, rather than the model itself. That would make sense-- they updated the system prompt to explicitly discourage sycophancy, so given a question like 'which random number do you like more', it's going to prefer to follow the instructions stating not to be sycophantic.
Caleb Briggs
2025-05-14 21:22:53 +0000 UTC
A lot of people already use ChatGPT for therapy regularly, and effectively at that. These kinds of behavioral changes in models aren't only hypothetically worrisome. OpenAI and the other leading companies are supplying a general-purpose product, but they aren't behaving as if they are general-purpose responsible.