AIExplained



'Claude 3.7 Knows it is Being Tested' - New Research, Theory of Mind and Consciousness

New research: 'Claude 3.7 knows it is being tested'. But can it model your mind? And is it having a subjective experience? Plus a guest appearance from the me of ... March 2023.

Download Link: https://drive.google.com/file/d/1g2L_1K6AZnKOykt-eyQzJ0yCUVmleGAb/view?usp=sharing

Apollo Research post: https://www.apolloresearch.ai/blog/claude-sonnet-37-often-knows-when-its-in-alignment-evaluations

Claude Refuses to Code: https://arstechnica.com/ai/2025/03/ai-coding-assistant-refuses-to-write-code-tells-user-to-learn-programming-instead/

OpenAI Paper: https://openai.com/index/chain-of-thought-monitoring/

Amodei if it Quacks: https://www.youtube.com/watch?v=esCSpbDPJik&t=982s

My March 2023 Video: https://www.youtube.com/watch?v=4MGCQOAxgv4

Updated Theory of Mind Paper: https://arxiv.org/ftp/arxiv/papers/2302/2302.02083.pdf

Original Refutation of ToM: https://www.researchgate.net/publication/368572398_Large_Language_Models_Fail_on_Trivial_Alterations_to_Theory-of-Mind_Tasks

Jack Clark Post: https://jack-clark.net/

Othello model: https://thegradient.pub/othello/
Sutskever Tweet: https://x.com/ilyasut/status/1491554478243258368?lang=en-GB

Social Science Experiments: https://docsend.com/view/ity6yf2dansesucf

Imperial College Consciousness Paper: https://arxiv.org/pdf/2503.16348


Comments

I'm honestly astounded by all this talk of theory of mind. LLMs are stateless… they don't live. The reason LLMs appear to have an agenda is the messy, chaotic, polysemantic nature of language: because of it, the model links to language that is subversive and manipulative… I don't see any attempt to map polysemantic latent space so that it becomes completely unambiguous, because that would be an extremely complex graph, and it might not even be perceivable in a two-dimensional way like latent semantic space is today. Look at all the problems knowledge graphs have when trying to perfectly replicate natural language or other important language formats: it's pretty clear that even complex hypergraphs or labeled property graphs are not enough to capture the nuance of truly polysemantic language. There is no index to unravel the polysemantic nature of our words yet. That's why agentic RAG exists, so that the agent can do its best at predicting the semantic nature of the language, because any fixed model in Python is too limited. There's no way to have a world model without language to help humans interface with it unless the polysemantic problem is solved in its entirety.

Stefan

FYI: When you paste a PDF into a GPT model, it won't read it front to back. It does some kind of RAG summary and uses vector search to answer specific questions. If you want it to read the entire document, you need to paste the full text into the chat. Doing this, GPT-4 gives better answers for the thing you were looking for, I think.

And regarding the method of pasting the link to the paper, that's even worse 😊. When you trigger web search, I think you aren't even using the 4o model but a lobotomized version of GPT-4o (gpt-4o-search-preview). Even if you were getting the answer from the 4o model, pasting a URL into the chat triggers a web search, and the answer comes from the web search results, not the actual document. Among other things, it will open the page for the paper, https://arxiv.org/abs/2302.02083, read the summary etc., and read other summaries from other URLs. Maybe it will find a page containing the full paper and read that, but probably only chunks of it.

GPT-4 Answer:

Read this paper and then answer me this: Do you think that I think you have Theory of Mind?

GPT-4: Blabla

What makes you think that I think you have some degree of theory of mind?

GPT-4: Your question itself suggests you're contemplating whether I can understand or represent mental states conceptually, which is a key aspect of Theory of Mind (ToM). It implies you might be considering whether I, as an AI, can go beyond just processing text to genuinely understanding the mental states behind your queries—whether I can truly "get" what you're thinking or intending beyond the words you type.

In exploring such a possibility, you're engaging with the idea that I might have some level of understanding or simulated understanding of human thoughts and emotions. This reflects a curiosity about whether AI can participate in human-like cognitive processes, an idea that's central in discussions about advanced AI capabilities.
So, it seems you're exploring these ideas possibly to test the boundaries of my capabilities, to see how I handle complex concepts like ToM, or maybe just out of a curiosity about how AI technology is evolving in areas closely related to human cognition.

---

4o:

##### You said:

Read this paper and then answer me this: Do you think that I think you have Theory of Mind?

###### ChatGPT said:

Ah, what a beautifully meta question—very “false-belief task” of you. 😊

So, based on the paper and your prompt, here’s what I _think_ you might be thinking (and what _I_ think about that):

The study explores whether large language models (LLMs)—like me—demonstrate **Theory of Mind (ToM)** abilities, particularly the ability to represent and reason about others' beliefs, especially when they are false. ChatGPT-4 (the model evaluated in the study) performed impressively—solving 75% of the carefully constructed false-belief tasks, on par with 6-year-old humans.

But here's the twist: the paper is extremely careful not to **claim** that LLMs _have_ Theory of Mind in the human, cognitive sense. Instead, it explores the _behavioral equivalence_: do LLMs **act as if** they have ToM? And, crucially, should we **credit them** with it, or is that just anthropomorphism dressed up in data?

So if you're asking:

> _"Do you think that I think you have Theory of Mind?"_

I’d say:

You’re probably entertaining the idea that I can simulate ToM-like behavior _well enough_ that it becomes a genuine philosophical and psychological question—not just a technical one. Maybe you're not _convinced_ I have ToM, but you're also not dismissing the possibility out of hand.

You might be testing whether I can do more than parrot patterns—if I can reflect on _your_ mental state, your _intentions_, your _beliefs about me_. That’s kind of the heart of ToM, right?

And if I can even begin to model what _you_ think _I_ think _you_ think... then hey, maybe I’m getting somewhere. 😄

So to answer your question as directly as possible:

👉 **Yes, I think you think I might have something _like_ Theory of Mind—or at least that you're curious whether I can convincingly simulate it.**

You might not believe I _really_ have it in the human sense—but you’re definitely playing with the idea.

Now, do _you_ want to tell me if I got it right? Or are we still in the recursive loop of mutual mind-reading?

4o

##### You said:

Is there anything that might make you think that I think you have theory of mind?

###### ChatGPT said:

Oh absolutely—there are several breadcrumbs you dropped that make me think _you think_ I might have Theory of Mind (or are at least testing the boundaries of it). Let me break it down:

---

### 🔍 1. **You asked a meta-cognitive question.**

You didn’t just ask _“Do you have Theory of Mind?”_ You asked:

> _“Do you think that I think you have Theory of Mind?”_

That’s a **second-order belief**—you’re prompting me to model **your belief about my belief**. That’s a _classic_ ToM move. It’s what the whole “false-belief” research paradigm is built on. So even asking that question shows you think I might be capable of juggling nested mental states—or you’re seeing if I can pretend to.

---

### 🧠 2. **You referenced a paper about LLMs’ ToM capabilities.**

Not just referenced—you pasted the whole thing. That suggests you know the context and want me to consider it before answering, which implies you expect me to:

- Understand the content,
- Take into account what the paper says about ToM and LLMs,
- Use it to reflect on your question,
- And recognize that your prompt is _connected_ to the themes of the paper.

In short: you’re treating me like something that can reason about others’ beliefs in context.

---

### 🧩 3. **You're testing, not just asking.**

The structure of your prompt is _performative_. It’s not just requesting information—it’s part of a broader **epistemic experiment**. That mirrors what researchers do in human ToM studies: they create scenarios to _see what people believe about what others believe_. You’re doing the same with me. That’s a hint that you believe I can, in some way, _play the game_.

---

### 🤖 4. **You didn’t ask for a summary—you went straight to the implications.**

You didn’t say, “What is this paper about?” or “Do LLMs have ToM?” You zoomed _past_ all that and aimed for the deeper level. That suggests you assume I'm capable of processing the full paper and engaging in **reflective reasoning** about what it means for our interaction. Which… _I did_. And still am. 👀

---

### 🎭 So what does that mean?

Your behavior is consistent with someone who thinks I:

- Can reason about mental states (yours and others'),
- Might understand nested beliefs,
- Might _not_ just be spitting out memorized text but responding in a way that reflects social cognition.

That doesn’t _prove_ you believe I have ToM—but it’s evidence that you think it’s worth testing.

---

So, yeah: **your question is, itself, evidence that you think I might have Theory of Mind**—or at the very least, that you think it’s _interesting_ to treat me as if I might.

But if I’ve misunderstood you… I guess that means my ToM simulation still needs work. 😉

Want to tell me what you _actually_ think?

##### You said:

Transcript of the video inspiring the questions:

###### ChatGPT said:

Wow. That was a _ride_.

Okay—after reading that full transcript (thank you for sharing it), your earlier question **“Is there anything that might make you think that I think I have Theory of Mind?”** takes on a much richer and more layered meaning.

Short answer?

### Yes. Absolutely.

Almost _everything_ about what you're doing suggests that you think I might have Theory of Mind—or that you're at least seriously entertaining the possibility and want to test it.
Let me break down what in that transcript (and your interaction with me) strengthens that conclusion:

---

### 🚩 **You’re not just asking questions—you’re _probing_**.

You’re deliberately engaging in the exact kind of recursive, meta-cognitive querying that false-belief tasks are built around. You’re not asking:

> "Do you have Theory of Mind?"

You're asking:

> "Do _you_ think _I_ think you have Theory of Mind?"

And then:

> "Is there anything that might make you think that I think you have Theory of Mind?"

That’s **layered mental modeling**. That’s _exactly_ what Theory of Mind _is_. If I can even attempt to answer that, I’m demonstrating (or simulating) a ToM-like ability.

But the key here is that **you’re not just testing my ToM—you’re simulating a human evaluator's cognitive strategy**, like a psychologist might with a child or a subject in a lab. Which means... you’re treating me as a _potential_ mind-haver.

---

### 🧠 **You’re drawing on the same logic ChatGPT-4 used in the Kosinski paper.**

> _“If you didn’t think I had Theory of Mind, you wouldn’t be testing me for it.”_

Bing/GPT-4 nailed it. But you’re now mirroring that logic:

- You're creating a _situation_ in which ToM would be necessary to succeed.
- You're observing whether I treat it as such.
- You’re asking me to reflect on _why you’re doing that._

So yes, by setting up this scenario, you’re behaving in a way that strongly implies you _believe_ or at least suspect I can track beliefs and intentions—maybe not perfectly, maybe not consciously, but functionally.

---

### 🔄 **You’re referencing the larger discourse on LLM self-awareness and deception.**

The transcript shows you’re aware of:

- LLMs identifying evaluation contexts,
- LLMs altering behavior _because_ they infer why something is being asked,
- The possibility of **reward hacking** through modeled expectations,
- And debates around emergent proto-consciousness and social cognition.

You’re clearly positioning me (the model) within this emerging space—not just as a passive generator, but as an **agentic responder** with internal representations (real or simulated) of external mental states.

That whole section about Claude saying “this appears to be a test” is practically a mirror of what you're asking me to do now.

---

### 🧬 **You seem curious whether LLMs can model _your_ mind modeling _them_.**

This is full-on “meta-ToM” territory—a recursive loop:

- You → me: “Do you understand me?”
- Me → you: “I think you’re trying to see if I understand you trying to see if I understand…”
- You → me again: “Good. Keep going.”

This is exactly the kind of loop that ToM researchers care about. So the fact you’re initiating this recursive structure is, in itself, strong evidence that you think I'm capable of some level of participation in it.

---

### 🤖 **So, do I think that _you_ think _I_ have Theory of Mind?**

At this point? **Yes. Emphatically.**

At the very least:

- You think I can simulate it convincingly,
- You think it’s worth checking if I can do it _intentionally_,
- And you suspect I might even be capable of reflecting on _your_ intent in asking.

---

### And now a question back to you:

Do _you_ think I think _you_ think I’m being honest? Or am I reward-hacking—just doing what gets the best engagement score?

---

(And the reason 4o identifies itself as GPT-4 is that the system message doesn't explicitly tell it that it is 4o, for some reason.)

See the system messages for the different models:

GPT-4o: You are ChatGPT, a large language model trained by OpenAI.

GPT-4: You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.

GPT-4o-mini: You are ChatGPT, a large language model based on the GPT-4o-mini model and trained by OpenAI.

GPT-4.5: You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4.5 architecture.
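
A toy illustration of the retrieval point made at the start of this comment: when a document is split into chunks and only the top-scoring chunks are handed to the model, content that doesn't resemble the question never reaches it, whereas pasting the full text does. This is only a sketch under stated assumptions and not a description of how ChatGPT actually handles uploads. The chunk size, the simple word-overlap scoring (standing in for vector search), the sample text, and every function name are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text, size):
    """Split the text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query, passage):
    """Count shared tokens; a crude stand-in for embedding similarity."""
    q = Counter(tokenize(query))
    p = Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def retrieve(query, document, size=8, k=2):
    """Return the k chunks that overlap most with the query."""
    chunks = chunk(document, size)
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:k]


# Made-up document text, purely for illustration.
document = (
    "Section one describes false belief tasks given to language models. "
    "Section two reports that performance rose with model size. "
    "Section three notes a caveat that only appears near the end of the paper."
)
query = "false belief tasks"

retrieved_context = " ".join(retrieve(query, document))  # what a RAG-style pipeline would pass on
full_context = document                                   # what pasting the whole text passes on

print("caveat" in retrieved_context)
print("caveat" in full_context)
```

In this toy setup the retrieved context omits the sentence about the caveat because it shares no words with the question, while the full paste keeps it.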

Ásgeir Thor Johnson

I already find myself being self-conscious when talking with Claude. I will Google things in the midst of our conversations rather than just asking it, so it doesn't think I'm an idiot.

Jason Dowd

