Before this paper from Anthropic, released on the 29th, I was a lot more skeptical about LLMs self-reporting their internal states. It is not proof that they can, but it is partial evidence of circuits with genuine introspective capability and, more than that, of an ability to map a question about an internal state onto those circuits.
Plus a big update to lmcouncil.ai.
https://www.anthropic.com/research/introspection
Full Paper: https://transformer-circuits.pub/2025/introspection/index.html#mechanisms
Earlier Work: https://www.anthropic.com/research/mapping-mind-language-model
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Release Post: https://x.com/AnthropicAI/status/1983584136972677319
Philip