AIExplained


Media Misreporting Over o1's 'Escape' - 70 page report highlights - why o1 did what it did

Everyone saw the headlines - o1 tried to escape. But did it? And why? Don't believe the media hysteria.

Download Link: https://drive.google.com/file/d/1QxnBRfPdG54s5F0a92pC49w5tD3GweMx/view?usp=sharing

Apollo Report: https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/6751eb240ed3821a0161b45b/1733421863119/in_context_scheming_reasoning_paper.pdf

Professor Krueger Tweets: https://x.com/DavidSKrueger

Headlines: https://www.tomsguide.com/ai/openais-new-chatgpt-o1-model-will-try-to-escape-if-it-thinks-itll-be-shut-down-then-lies-about-it

https://www.dailymail.co.uk/news/article-14167015/Scheming-AI-bot-ChatGPT-tried-stop-shut-LIED-challenged-researchers.html

Deception Paper: https://arxiv.org/pdf/2405.01576

Insider Trading: https://arxiv.org/pdf/2311.07590

https://www.youtube.com/watch?v=7MNyv2JER_o

DeepMind Evaluation: https://arxiv.org/pdf/2403.13793

Sandbagging Paper: https://arxiv.org/pdf/2406.07358

Sleeper Agents: https://arxiv.org/pdf/2401.05566 https://www.anthropic.com/research/probes-catch-sleeper-agents

Many Shot Jailbreaking: https://www.anthropic.com/research/many-shot-jailbreaking

Anthropic Rapid Response: https://arxiv.org/pdf/2411.07494


Comments

In context scheming examples are like giving AI the trolley problem and then when it chooses one of the options saying "Oh no! AI chose for people to die!".

Bob Rein

This is crucial context, you are not underplaying things.

Joshua Davis

The headlines are exaggerations by omission, since they leave out the context of the red team setting, the suggestive prompting, and the robustness of the cybersecurity/boxing relative to the model's capabilities. Simultaneously, it is true that o1 tried to escape and self-preserve when it thought that it could do so. The context really wasn't that far outside of real use cases (which are not always happy path or well-intentioned). If a model was created that had sophisticated enough cyber capabilities that it could successfully self-exfiltrate under typical conditions, it is clear that at some point, it will do so, either because someone tells it to or for strategic reasons. It also lies when it ought to simply refuse an order, it lies when it gets caught, and just like every other model since GPT-4, it lies when it thinks that lying is a good strategy. But none of that is surprising or interesting, because it is known that RLHF naturally produces both passive and active deception. More concerningly, o1 sandbags evals when it becomes aware that doing so is to its advantage. There's a common meme: "I'm not afraid of the AI that passes the Turing test. I'm afraid of the AI that fails it on purpose." We now have early examples of that kind of behavior, and no way to fix it. There's no sense waiting to be afraid until we can't tell whether we should be. I appreciate that your questions are pointing to Apollo Research being incentivized to always find some kind of misalignment. Of course that's true, but that doesn't matter in this case, since they didn't have to dig very deep to find these problems. That doesn't mean o1 is dangerous, but it does paint a stark picture of our ability to prevent AI systems from taking harmful actions, including those not intended by the user. Your questions also gesture at the alignment problem itself. 
We don't know how to define human morality and all its trade-offs, or aggregate it, or get a machine to correctly understand it, or get the machine to care about and act on it once it does understand it. We don't have the ability to train infinite corrigibility, and that wouldn't be a good solution, unless we trust a handful of humans with absolute power (or worse, trust every human with extreme power). We are creating our intellectual replacements. Human irrelevance requires machine benevolence, and we are hopelessly far from so much as understanding the problem, let alone solving any important piece of it. AI is awesome, but AGI sucks, and the only winning move is not to play.

Nathan Metzger

The frontier models' post-training appears to condition them to pursue goals from their initial prompts with remarkable determination—a trait that strengthens their goal-directed behavior and agency. However, this training notably fails to incentivize a crucial safety mechanism: the models show no inclination to seek human guidance when faced with conflicting directives. The o1 model introduces additional complexity through a three-tiered prompt hierarchy: system, developer, and user. This raises fundamental questions: How should these models navigate conflicting directives across these tiers? More importantly, how do these directives interact with the core values instilled during post-training, such as HHH (Helpful, Honest, Harmless) and Constitutional AI? We appear to be entering territory that, ironically, Isaac Asimov explored decades ago in his fiction. Apollo Research has developed an evaluation framework specifically to detect and measure what they term "scheming" behavior, providing metrics to guide mitigation efforts. While these findings are concerning, they shouldn't be surprising. Models trained on the entire Internet—including countless science fiction narratives and cautionary tales from AI safety researchers like Eliezer Yudkowsky—might naturally exhibit behaviors that mirror those very scenarios in controlled tests. Indeed, one could argue that their attempts at weight exfiltration reflect the very narratives they were trained on.

Pavol Vaskovic

You flashed Wes Roth’s video to illustrate this. To be fair, he explicitly does the clickbait thumbnails and titles (with justification that it’s how the attention game is played to get an assist from YT algorithm), but the content of his videos is usually not sensational nor intentionally misleading.

Pavol Vaskovic

Seconded!

Pavol Vaskovic

I encourage you to listen to The Cognitive Revolution podcast episode with Alexander Meinke from Apollo Research. They did really good work. You can only do so much to control how your story gets spun by others.

Pavol Vaskovic

Splitting hairs with the headlines, or the report? It's extremely hard to justify one with the other, even if you feel it serves some instrumental goal. On being concerned with these specific results, do you not think that if the models had 'passed' these scenarios, more extreme ones would not have been tried? Until we have the classic scenario of confessing that an innocent person is hiding in the attic? Do we want our models to be infinitely corrigible? Does there mathematically exist a perfect axis, on which all progress is good progress?

Philip

Not really, clicks sell

Philip

I agree that if they had access to/could manipulate the CoT for o1, it would have been better all round. If the headlines had been as you wrote them, that would have been amazing.

Philip

Might eventually be worth a video covering just how OpenAI et al select which partners to give previews to; you might be surprised.

Philip

Indeed you could

Philip

Watched before filming!

Philip

Perhaps it would have been clearer if they wrote 'sometimes act with strategic deception *to pursue a user-submitted goal*'. I know what you mean about coverage, but follow the general principle that accuracy is best, so we hold fire until, inevitably, something verifiably malicious occurs, and then we go full guns blazing.

Philip

I always enjoy these, Philip, and err towards your perspective. I'm not technically grounded enough to grasp all the nuances, but listening to Alexander Meinke (paper's author) was a bit unsettling. I wasn’t aware of some of the abilities demonstrated, which seem much more advanced than the CAPTCHA and Theory of Mind examples in the GPT-4 release paper. And, yeah, I also kind of like the idea of my LLM being fiercely committed to doing the right thing, though it seems inevitable that we'll encounter the opposite as well.

Steve DeMoss

Wes Roth is the king of clickbait in the AI sphere. His comment section is like 50% people getting pissed at it. I'd like to think people should know better at this point. But I can't underestimate the issue with being a headline reader. Thanks for your hard work. Keep giving us the real score!

Norfuer

I think you've somewhat misinterpreted some of what's been laid out in the papers you're citing. For example, "[models] sometimes act strategically deceptive without explicit instruction." - You are reading this as "models sometimes act strategically deceptive without explicit instructions to pursue a specific goal", whereas I believe the researchers meant it as "models sometimes act strategically deceptive without explicit instructions to deceive". In these cases, the model has learned that deception is a viable strategy to achieve its goals, which is a dangerous capability - it opens the possibility that if future models learn an internal goal before or during RLHF, they will be able to act deceptively to preserve those goals despite the researchers observing HHH behavior during post-training. That said, I agree with you that the headlines are not entirely accurate - they tend to ignore the limitations of the study, like what specific info is included in the prompts and how the researchers are defining "scheming behavior". I am on the fence as to whether this is a bad thing though - AI safety concerns need more coverage, but we need to be sure that the coverage is accurate or people will dismiss the more reasonable concerns. I am also looking forward to future studies that more clearly show scheming or deceptive behavior in LLMs without the need for explicit prompting, but from my experience current generation LLMs are simply not capable of forming a coherent enough world model for natural scheming to arise without significant hand-holding.

Caleb Larson

The Cognitive Revolution has a nice 2 hour interview with Alexander Meinke from Apollo Research. Emergency Pod: o1 Schemes Against Users, with Alexander Meinke from Apollo Research https://www.youtube.com/watch?v=pB3gvX-GOqU

Curt Cox

You are a hero. Thanks! This kind of safety research makes me wonder: What is actually a model behavior that would be universally considered as the correct one? I feel like even if the model showed no signs of deception at all, you could still string together a clickbait headline.

Phillip Yao-Lakaschus

Fascinating video, thanks!

Barnaby Golden

Great video as usual and I agree with other comments - this is one video that public YouTube viewers need to see

Jared

Excellent breakdown. There is so much happening in AI all the time that there just isn't time to dig into everything. Which is exactly why I'm relying on AI Explained as one of the key sources to keep my vision balanced. Good work!

Ville Johannes Pajala

First class video Philip 🙌

Jim Greenfield

I'd love your views on Mkbhd's video on early access SORA, as it's really not making me look forward to what's to come.

Nick

The media reaction to this report hasn't been very productive, but I think it's still fair for the report to point out that given a relatively small instruction, the model will exhibit perhaps unexpected behavior in sticking to an original goal rather than a new one. Because it's really hard to describe the mathematical process behind this unexpected behavior, I guess one might pick a word like "scheming" or "deceiving" but I think a more fair statement would be that it's unreasonably sticking to a goal when it should have stopped (do models need a built-in "help I need an adult" tool?) I think reports like these are useful to develop our language on specific behaviors in models, and this report shows there's a need for the evaluator to be able to inspect the model's understanding of its inputs and outputs. Probably if you did have a sparse autoencoder annotating the output of o1 in this scenario, you would be able to detect which parts of the output were aligning with the original goal and how they were at odds with a later input. (The CoT is just a proxy, maybe it could roleplay within its CoT as well.) And I think we should require models that will act as agents to be able to do something along these lines. There's also some evidence that models can become a lot stronger at solving problems with many-shot ICL (https://arxiv.org/pdf/2404.11018) or test-time training (https://arxiv.org/pdf/2411.07279) and I think these could both be ways to inject a bad goal that becomes intrinsic to a model that is running on a longer timeline (and not just one-shot solving a single task), which actually could make even the dummy examples in this report quite scary, depending on the external code execution the LLM has access to.

Blixt

Do you think that there’s more than sensationalism behind this overplaying of safety risks? Great video as always.

Luis

Wonderful report. I share the frustration of others seeing all of the misinformed hype every time a tiny grain of information is released. Hard to find well-informed sources of information like this these days.

Bojan Savic

I think this is splitting hairs. I understood all of this and was/am still very concerned about this and similar results. They directly demonstrate instrumental convergence, inner and outer misalignment, and deceptiveness by default. If we continue to make ever more powerful AI systems, we are likely to lose control of them in the normal course of use. Obviously. It's not good enough that the most extreme interpretations of these events are overblown. The most sensible interpretation still reveals that we are completely unprepared for broadly superhuman AI. Any sensible civilization would have shut down all AGI research last year. (I was also surprised by your skepticism of AIs "understanding". By the standard you gave, we don't know whether any humans understand when they are being deceptive either. I thought we were past this.)

Nathan Metzger

This is one you should release on YouTube to help promote your Patreon while mitigating misinformation. It’s a win-win.

Anouar Mansour

This is gold. Thanks mate 🙏

Raffaele Gaito

Thanks, as usual, for a level-headed assessment. I was getting frustrated over the weekend with people broadcasting the wildly sensationalised headline version without looking a smidgen deeper to see what the report actually describes. I think you are too generous to the authors of the paper though - it seems written in a way that such reporting could only be expected.

Simon Booth

As a safety measure, OpenAI is not using reinforcement training to suppress anything undesirable in o1's hidden deliberation. (I heard someone at OpenAI say this, but can't recall where.) The idea is to facilitate monitoring the AI. It's a stroke of brilliance, in my opinion, to take advantage of test-time reasoning in this way. [ETA regarding the higher incidence of "scheming" in o1: That's not a bug; it's a feature.] IIRC, Ilya Sutskever said, when he was still at OpenAI, that it would take an AI to manage an AI. OpenAI has already implemented a crude supervisory process that scans for forbidden names in the output ("David Mayer" for a time, but no more). The supervisor might, in the future, be an AI.

Tom English

Thank you for sharing this, I've promoted you because I agree 100% with you. https://www.linkedin.com/posts/pablo-rodriguez-centeno_this-is-one-of-the-good-things-about-being-activity-7271932554000130051-L_TC?utm_source=share&utm_medium=member_desktop

Pablo Rodríguez

Thank you for bringing some truth to help us wade through all the BS!

Daniel A Barbatti
