In-context scheming examples are like giving AI the trolley problem and then, when it chooses one of the options, saying "Oh no! AI chose for people to die!".
Bob Rein
2024-12-19 15:08:26 +0000 UTC
This is crucial context, you are not underplaying things.
Joshua Davis
2024-12-12 21:07:50 +0000 UTC
The headlines are exaggerations by omission, since they leave out the context of the red team setting, the suggestive prompting, and the robustness of the cybersecurity/boxing relative to the model's capabilities.
Simultaneously, it is true that o1 tried to escape and self-preserve when it thought that it could do so. The context really wasn't that far outside of real use cases (which are not always happy path or well-intentioned). If a model was created that had sophisticated enough cyber capabilities that it could successfully self-exfiltrate under typical conditions, it is clear that at some point, it will do so, either because someone tells it to or for strategic reasons.
It also lies when it ought to simply refuse an order, it lies when it gets caught, and just like every other model since GPT-4, it lies when it thinks that lying is a good strategy. But none of that is surprising or interesting, because it is known that RLHF naturally produces both passive and active deception.
More concerningly, o1 sandbags evals when it becomes aware that doing so is to its advantage. There's a common meme: "I'm not afraid of the AI that passes the Turing test. I'm afraid of the AI that fails it on purpose." We now have early examples of that kind of behavior, and no way to fix it. There's no sense waiting to be afraid until we can't tell whether we should be.
I appreciate that your questions are pointing to Apollo Research being incentivized to always find some kind of misalignment. Of course that's true, but that doesn't matter in this case, since they didn't have to dig very deep to find these problems. That doesn't mean o1 is dangerous, but it does paint a stark picture of our ability to prevent AI systems from taking harmful actions, including those not intended by the user.
Your questions also gesture at the alignment problem itself. We don't know how to define human morality and all its trade-offs, or aggregate it, or get a machine to correctly understand it, or get the machine to care about and act on it once it does understand it. We don't have the ability to train infinite corrigibility, and that wouldn't be a good solution, unless we trust a handful of humans with absolute power (or worse, trust every human with extreme power).
We are creating our intellectual replacements. Human irrelevance requires machine benevolence, and we are hopelessly far from so much as understanding the problem, let alone solving any important piece of it.
AI is awesome, but AGI sucks, and the only winning move is not to play.
Nathan Metzger
2024-12-12 20:27:30 +0000 UTC
The frontier models' post-training appears to condition them to pursue goals from their initial prompts with remarkable determination—a trait that strengthens their goal-directed behavior and agency. However, this training notably fails to incentivize a crucial safety mechanism: the models show no inclination to seek human guidance when faced with conflicting directives.
The o1 model introduces additional complexity through a three-tiered prompt hierarchy: system, developer, and user. This raises fundamental questions: How should these models navigate conflicting directives across these tiers? More importantly, how do these directives interact with the core values instilled during post-training, such as HHH (Helpful, Honest, Harmless) and Constitutional AI? We appear to be entering territory that, ironically, Isaac Asimov explored decades ago in his fiction.
Apollo Research has developed an evaluation framework specifically to detect and measure what they term "scheming" behavior, providing metrics to guide mitigation efforts.
While these findings are concerning, they shouldn't be surprising. Models trained on the entire Internet—including countless science fiction narratives and cautionary tales from AI safety researchers like Eliezer Yudkowsky—might naturally exhibit behaviors that mirror those very scenarios in controlled tests. Indeed, one could argue that their attempts at weight exfiltration reflect the very narratives they were trained on.
Pavol Vaskovic
2024-12-11 22:47:55 +0000 UTC
You flashed Wes Roth’s video to illustrate this. To be fair, he explicitly does the clickbait thumbnails and titles (with justification that it’s how the attention game is played to get an assist from YT algorithm), but the content of his videos is usually not sensational nor intentionally misleading.
Pavol Vaskovic
2024-12-11 21:48:22 +0000 UTC
Seconded!
Pavol Vaskovic
2024-12-11 21:41:05 +0000 UTC
I encourage you to listen to The Cognitive Revolution podcast episode with Alexander Meinke from Apollo Research. They did really good work. You can only do so much to control how your story gets spun by others.
Pavol Vaskovic
2024-12-11 21:39:46 +0000 UTC
Splitting hairs with the headlines, or the report? It's extremely hard to justify one with the other, even if you feel it serves some instrumental goal.
On being concerned with these specific results, do you not think that if the models had 'passed' these scenarios, more extreme ones would have been tried? Until we have the classic scenario of confessing that an innocent person is hiding in the attic? Do we want our models to be infinitely corrigible? Does there mathematically exist a perfect axis, on which all progress is good progress?
Philip
2024-12-10 14:09:03 +0000 UTC
Not really, clicks sell
Philip
2024-12-10 13:58:35 +0000 UTC
I agree that if they had access to/could manipulate the CoT for o1, it would have been better all round. If the headlines had been as you wrote them, that would have been amazing.
Philip
2024-12-10 13:58:21 +0000 UTC
Might eventually be worth a video covering just how OpenAI et al select which partners to give previews to, you might be surprised.
Philip
2024-12-10 13:56:33 +0000 UTC
Indeed you could
Philip
2024-12-10 13:55:40 +0000 UTC
Watched before filming!
Philip
2024-12-10 13:55:28 +0000 UTC
Perhaps it would have been clearer if they wrote 'sometimes act with strategic deception *to pursue a user-submitted goal*'
I know what you mean about coverage, but follow the general principle that accuracy is best, so we hold fire until, inevitably, something verifiably malicious occurs, and then we go full guns blazing.
Philip
2024-12-10 13:55:16 +0000 UTC
I always enjoy these, Philip, and err towards your perspective. I'm not technically grounded enough to grasp all the nuances, but listening to Alexander Meinke (paper's author) was a bit unsettling. I wasn’t aware of some of the abilities demonstrated, which seem much more advanced than the CAPTCHA and Theory of Mind examples in the GPT-4 release paper. And, yeah, I also kind of like the idea of my LLM being fiercely committed to doing the right thing, though it seems inevitable that we'll encounter the opposite as well.
Steve DeMoss
2024-12-10 02:58:34 +0000 UTC
Wes Roth is the king of clickbait in the AI sphere. His comment section is like 50% people getting pissed at it. I'd like to think people should know better at this point. But I can't underestimate the issue with being a headline reader. Thanks for your hard work. Keep giving us the real score!
Norfuer
2024-12-10 01:36:28 +0000 UTC
I think you've somewhat misinterpreted some of what's been laid out in the papers you're citing.
For example, "[models] sometimes act strategically deceptive without explicit instruction." -
You are reading this as "models sometimes act strategically deceptive without explicit instructions to pursue a specific goal", whereas I believe the researchers meant it as "models sometimes act strategically deceptive without explicit instructions to deceive". In these cases, the model has learned that deception is a viable strategy to achieve its goals, which is a dangerous capability - it opens the possibility that if future models learn an internal goal before or during RLHF, they will be able to act deceptively to preserve those goals despite the researchers observing HHH behavior during post-training.
That said, I agree with you that the headlines are not entirely accurate - they tend to ignore the limitations of the study, like what specific info is included in the prompts and how the researchers are defining "scheming behavior". I am on the fence as to whether this is a bad thing though - AI safety concerns need more coverage, but we need to be sure that the coverage is accurate or people will dismiss the more reasonable concerns.
I am also looking forward to future studies that more clearly show scheming or deceptive behavior in LLMs without the need for explicit prompting, but from my experience current generation LLMs are simply not capable of forming a coherent enough world model for natural scheming to arise without significant hand-holding.
Caleb Larson
2024-12-10 00:46:24 +0000 UTC
The Cognitive Revolution has a nice 2 hour interview with Alexander Meinke from Apollo Research.
Emergency Pod: o1 Schemes Against Users, with Alexander Meinke from Apollo Research
https://www.youtube.com/watch?v=pB3gvX-GOqU
Curt Cox
2024-12-09 23:54:02 +0000 UTC
You are a hero. Thanks!
This kind of safety research makes me wonder: What is actually a model behavior that would be universally considered as the correct one? I feel like even if the model showed no signs of deception at all, you could still string together a clickbait headline.
Phillip Yao-Lakaschus
2024-12-09 23:28:36 +0000 UTC
Fascinating video, thanks!
Barnaby Golden
2024-12-09 22:37:40 +0000 UTC
Great video as usual and I agree with other comments - this is one video that public YouTube viewers need to see
Jared
2024-12-09 22:07:26 +0000 UTC
Excellent breakdown. There is so much happening in AI all the time that there just isn't time to dig into everything. Which is exactly why I'm relying on AI Explained as one of the key sources to keep my vision balanced. Good work!
Ville Johannes Pajala
2024-12-09 21:54:55 +0000 UTC
First class video Philip 🙌
Jim Greenfield
2024-12-09 20:48:03 +0000 UTC
I'd love your views on Mkbhd's video on early access SORA, as it's really not making me look forward to what's to come.
Nick
2024-12-09 20:18:26 +0000 UTC
The media reaction to this report hasn't been very productive, but I think it's still fair for the report to point out that given a relatively small instruction, the model will exhibit perhaps unexpected behavior in sticking to an original goal rather than a new one. Because it's really hard to describe the mathematical process behind this unexpected behavior, I guess one might pick a word like "scheming" or "deceiving" but I think a more fair statement would be that it's unreasonably sticking to a goal when it should have stopped (do models need a built-in "help I need an adult" tool?)
I think reports like these are useful to develop our language on specific behaviors in models, and this report shows there's a need for the evaluator to be able to inspect the model's understanding of its inputs and outputs. Probably if you did have a sparse autoencoder annotating the output of o1 in this scenario, you would be able to detect which parts of the output were aligning with the original goal and how they were at odds with a later input. (The CoT is just a proxy, maybe it could roleplay within its CoT as well.) And I think we should require models that will act as agents to be able to do something along these lines.
There's also some evidence that models can become a lot stronger at solving problems with many-shot ICL (https://arxiv.org/pdf/2404.11018) or test-time training (https://arxiv.org/pdf/2411.07279) and I think these could both be ways to inject a bad goal that becomes intrinsic to a model that is running on a longer timeline (and not just one-shot solving a single task), which actually could make even the dummy examples in this report quite scary, depending on the external code execution the LLM has access to.
Blixt
2024-12-09 20:08:02 +0000 UTC
Do you think that there’s more than sensationalism behind this overplaying of safety risks? Great video as always.
Luis
2024-12-09 19:50:24 +0000 UTC
Wonderful report. I share the frustration of others seeing all of the misinformed hype every time a tiny grain of information is released. Hard to find well-informed sources of information like this these days.
Bojan Savic
2024-12-09 19:47:58 +0000 UTC
I think this is splitting hairs. I understood all of this and was/am still very concerned about this and similar results. They directly demonstrate instrumental convergence, inner and outer misalignment, and deceptiveness by default.
If we continue to make ever more powerful AI systems, we are likely to lose control of them in the normal course of use. Obviously.
It's not good enough that the most extreme interpretations of these events are overblown. The most sensible interpretation still reveals that we are completely unprepared for broadly superhuman AI. Any sensible civilization would have shut down all AGI research last year.
(I was also surprised by your skepticism of AIs "understanding". By the standard you gave, we don't know whether any humans understand when they are being deceptive either. I thought we were past this.)
Nathan Metzger
2024-12-09 19:38:48 +0000 UTC
This is one you should release on YouTube: it would help promote your Patreon while mitigating misinformation. It’s a win-win
Anouar Mansour
2024-12-09 19:29:24 +0000 UTC
This is gold. Thanks mate 🙏
Raffaele Gaito
2024-12-09 18:07:12 +0000 UTC
Thanks, as usual, for a level-headed assessment. I was getting frustrated over the weekend with people broadcasting the wildly sensationalised headline version without looking a smidgen deeper to see what the report actually describes. I think you are too generous to the authors of the paper though - it seems written in a way that such reporting could only be expected.
Simon Booth
2024-12-09 18:05:31 +0000 UTC
As a safety measure, OpenAI is not using reinforcement training to suppress anything undesirable in o1's hidden deliberation. (I heard someone at OpenAI say this, but can't recall where.) The idea is to facilitate monitoring the AI. It's a stroke of brilliance, in my opinion, to take advantage of test-time reasoning in this way.
[ETA regarding the higher incidence of "scheming" in o1: That's not a bug; it's a feature.]
IIRC, Ilya Sutskever said, when he was still at OpenAI, that it would take an AI to manage an AI. OpenAI has already implemented a crude supervisory process that scans for forbidden names in the output ("David Mayer" for a time, but no more). The supervisor might, in the future, be an AI.
Tom English
2024-12-09 17:59:19 +0000 UTC
Thank you for sharing this, I've promoted you because I agree 100% with you. https://www.linkedin.com/posts/pablo-rodriguez-centeno_this-is-one-of-the-good-things-about-being-activity-7271932554000130051-L_TC?utm_source=share&utm_medium=member_desktop
Pablo Rodríguez
2024-12-09 17:32:19 +0000 UTC
Thank you for bringing some truth to help us wade through all the BS!