4 AI Trends Emerging in 2025. Patchy, Epic, Expensive, and Deceptive Models
Added 2025-03-07 17:08:13 +0000 UTC
We are less than 6 months into a paradigm that many say 'will get us there'. Let's uncover 4 underrated trends in AI, to get a sneak peek of the future. Things are about to get pricey, stay patchy, be epic and remain deceptive.
Not exactly how it works but yeah - that makes sense. It’s just all integrated, not neatly separated. Think of emotion as a foundation layer that cognition was built on. Cognition still inherits a lot of these subprocesses and influences.
Antoine Ferrere
2025-03-21 10:59:23 +0000 UTC
I think of emotions as a non-verbal kind of intelligence. Only parts of your brain are able to express themselves in words and logic. Often you only have a gut feeling (call it intuition, sixth sense, a hunch...). That gut feeling is often correct. Think of it as a reasoning output that draws from large parts of your brain that are non-verbal.
Basically, using only the verbal parts of the brain would mean not using all of your brain. Incorporating emotions (alongside words and logic) into decision making leverages a bigger part of your brain, and is thus more intelligent.
Sam
2025-03-21 09:31:53 +0000 UTC
My brothers-in-law are currently in grade school, and my wife and I were making fun of their math problems and how unrealistic they are. We played a game of who can break the assignment the most, finding the largest number of assumptions the solver needs to make that don't make sense at all in the real world. The cheerleader stuff is a great example of that :)
Sam
2025-03-21 09:26:42 +0000 UTC
Ok, let's get philosophical. Let's say there is a value to staying objective. In that case, making a subjective AI will incur a "misalignment tax" which makes the AI slightly less valuable. It will appear more valuable to the user who benefits (Musk) but will be less valuable overall and less able to generate profits or make an impact on the world compared to the objective models. Then whoever creates an objective model will eventually beat the subjective models and win.
The question is: is there a value to being objective? I think there is, but it's just a tacit opinion. One example I could give is religion: you can tell sick people all you want that they have to do something (pray, assault the enemy...) or else they'll die. But when the first doctor with penicillin comes and people recognize this actually helps, they will move away from the priest. And the strangest thing is that even as a priest you want that doctor to exist (because you too can get sick and may need the pills), so by forbidding him you may actually bring about your own doom.
Sam
2025-03-21 09:22:59 +0000 UTC
Not sure about the hours of work - there will likely be "credits" to limit usage. RE: stamina + optimism - that can be a bad thing too, especially in the case of chasing rabbit holes indefinitely. Laziness has some positives too :). But other than that I agree, good take
Sam
2025-03-21 09:12:17 +0000 UTC
I'm an over-optimized human on math problems (software engineer) with a Master's degree and 10+ years of experience in the field. I eat some of those math problems for breakfast. But I !! very confidently !! answered "3" to the grade-school level question, just like Claude 3.7 with Thinking. We have AGI already, we just need to realize we make the same stupid mistakes ourselves :).
To add on the topic: when I was in high school (way in the past, so unfortunately I can't find the source), I remember reading an article where they gave grade-school-level math problems to adults and to high school kids, and the problems that the young people were crushing, the adults failed miserably. The difference was pretty stark, but I don't remember the exact percentages. You see the same thing with those famous "illusions" where young kids see dolphins and adults see dirty pictures :).
Sam
2025-03-21 08:44:24 +0000 UTC
Would anyone be interested to know that I managed to break Claude?
Through recursive prompts, I got it to spew ongoing insanity for over 20min, continuing the conversation while I watched, assigning script to "human", etc.
Honestly, it's still going, I'm copy/pasting into a Google Doc as it goes to save it, the doc is called "BrokeClaude"
James Younger, DDS
2025-03-19 23:31:01 +0000 UTC
You’ve accurately summed up the motivation for why the “captains of the industry” find this vision of AI *replacing* knowledge workers so attractive. As a human who prefers to be valued also for my professional contributions, I hope we’ll manage to guide this technology towards augmenting our abilities instead.
I find that vision utterly dystopian and find cope (yes, cope not hope) in that its hype machine is aimed at gullible investors and will not be achieved in practice given the fundamental limits of the current genAI paradigm. And I hope that in the meantime we’ll advance socially to a point where we can coexist with advanced AI in a more dignified way than we’ve managed so far in a world dominated by superhuman corporate entities.
Pavol Vaskovic
2025-03-15 09:50:55 +0000 UTC
Mike is not talking about ASI. He’s saying that you should have a look at the straitjacket Elon is putting on Grok. On X it can no longer criticize his fascist rampaging through the US government…
Pavol Vaskovic
2025-03-15 09:34:39 +0000 UTC
Learning. Think of emotion as a good reinforcement learning mechanism for us. Without emotions we literally don’t learn. I’m just saying there might be something we can learn from that side of cognitive / social science research and translate into relevant frameworks and models for AI.
Antoine Ferrere
2025-03-15 09:26:03 +0000 UTC
This points to a real issue with how we measure things. This is the second time I have encountered an argument for applying proper test-taking conventions depending on the type of “standardized” test. Coming from a European background, I find this approach quite puzzling, as our test-taking culture is not as evolved as the American one; my palate for distinguishing these nuances clearly isn’t developed enough. ;-)
My confused first encounter with an LSAT question is documented here: https://x.com/palimondo/status/1885029814819582306?s=46&t=uZXiyeL9bTfQvyz1J8mANQ
Pavol Vaskovic
2025-03-15 09:25:21 +0000 UTC
The Error Viewer shows you the full reasoning trace for Sonnet 3.7: http://platinum-bench.csail.mit.edu/inspect?model=claude-3-7-sonnet-20250219-thinking&dataset=gsm8k_full
Pavol Vaskovic
2025-03-15 09:07:14 +0000 UTC
Other than helping with persuasion (which isn’t a quality that I would especially cultivate in models given the risk of abuse for mass manipulation), where exactly do you see emotions as beneficial (algorithm?!) for AI?
Pavol Vaskovic
2025-03-15 09:03:47 +0000 UTC
The probability of arriving at the correct solution in that question is path dependent. If you compute the truck’s carrying capacity in terms of stones and then divide the load among the trucks, you’re more likely to arrive at the correct solution.
Actually, Claude 3.7’s second mistake is even more interesting:
http://platinum-bench.csail.mit.edu/inspect?model=claude-3-7-sonnet-20250219-thinking&dataset=gsm8k_full
Some would argue it was overthinking the solution, but I would say that Claude was applying the common sense reasoning that Simple Bench rewards, revealing the non-realistic setting of the question posed in the verbal math problem. It’s about a cheerleading pyramid, and Claude deviates from the expected answer because it takes into account that cheerleaders would stand on each other’s shoulders and not on their heads. This question should have been excluded from the Platinum set, if we’re to take the cleanup of GSM8K to its logical conclusion.
Pavol Vaskovic
2025-03-15 08:59:11 +0000 UTC
Philip, I think that at 16:55 you misspoke when you said you did a podcast interview with Epoch AI. You meant Apollo Research, right?
Pavol Vaskovic
2025-03-15 08:43:29 +0000 UTC
I am very excited for the expensive models, because AI / computing history has shown us that this means the same ability for cheaper is on the horizon.
Bob Rein
2025-03-15 05:41:55 +0000 UTC
The "patchy" threshold depends heavily on domain.
A self-driving car that gets 1 out of 1000 driving decisions very wrong can kill people. It needs several more orders of magnitude of reliability.
For coding, messing up 1% of the time is honestly pretty reliable and compilers and unit tests will catch most failures.
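A rough sketch of why the acceptable error rate depends on domain: if each decision fails independently with some probability, the chance of at least one failure grows quickly with the number of decisions. The independence assumption and the decision counts below are illustrative assumptions, not figures from the video or the benchmark.

```python
def p_at_least_one_failure(per_decision_error: float, decisions: int) -> float:
    """Chance of one or more failures, assuming independent decisions."""
    return 1.0 - (1.0 - per_decision_error) ** decisions

# 1-in-1000 error rate over 1000 decisions (hypothetical trip length)
print(round(p_at_least_one_failure(1 / 1000, 1000), 2))  # 0.63

# Three more orders of magnitude of reliability over the same 1000 decisions
print(p_at_least_one_failure(1 / 1_000_000, 1000))       # roughly 0.001
```

For coding, a failure is usually caught by a compiler or test and retried, so the same per-decision rate is far less costly.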
Bob Rein
2025-03-15 05:38:26 +0000 UTC
I got it wrong as well. It is 4 because you can't split stones. You have to put 26 in each of the first three (with a few spare pounds of capacity) and then 2 stones in the 4th.
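The two competing answers can be sketched as follows. This is a minimal illustration using only the figures quoted in this thread (80 flagstones at 75 pounds each, trucks rated for 2000 pounds); the function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b

def trucks_if_divisible(stones: int, stone_lb: int, truck_lb: int) -> int:
    """Treat the load as pure weight that can be split freely."""
    return ceil_div(stones * stone_lb, truck_lb)

def trucks_if_whole_stones(stones: int, stone_lb: int, truck_lb: int) -> int:
    """Each truck carries only a whole number of stones."""
    stones_per_truck = truck_lb // stone_lb  # 2000 // 75 = 26
    return ceil_div(stones, stones_per_truck)

print(trucks_if_divisible(80, 75, 2000))     # 3
print(trucks_if_whole_stones(80, 75, 2000))  # 4
```

The first three trucks take 26 stones each (1950 pounds), leaving 2 stones for a fourth.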
Bob Rein
2025-03-15 05:22:12 +0000 UTC
I know, yes, I’m a behavioral scientist. But AI will interact with humans and their lizard brains is my point. So it might make sense.
Plus I still think ‘emotion’ would be key for learning AI systems. It’s a good algorithm.
Antoine Ferrere
2025-03-14 06:09:50 +0000 UTC
Can anyone tell me why that GSM8K question that o3-mini failed should have the answer 4? Because I think 80*75=6000 and each truck can deliver 2000, so the final answer should definitely be 3...
Sun Lighting
2025-03-14 05:30:14 +0000 UTC
Definitely an odd sort of question where I also got 3 as the answer until I saw AI Explained's explanation. If you compress it into a pure math equation with the given numbers (what I did), it's 3, if you think about it in terms of real world behavior 3 is also more realistic, but if you take the question fully literally then without any further assumptions the answer is actually 4.
Maybe if you tend to visualize these sorts of problems then it doesn't feel as much like a "trick question?" Wouldn't know, since I rarely visualize things when it isn't absolutely necessary XD
David Higgs
2025-03-13 21:39:06 +0000 UTC
The reason emotion is so deeply entwined with *human* cognition is because our limbic system evolved before our cortex and neocortex. I'm not saying that EQ is unimportant or that AI won't/shouldn't have any kind of conscious experience.
I am saying that deliberately adding in more emotion to models is not a pre-requisite to better reasoning, less hallucination, or greater agentic task horizons. At least, I know of no reason why it would be necessary, certainly not because it's necessary in humans.
David Higgs
2025-03-13 21:35:03 +0000 UTC
AI getting more human-like as it gets more intelligent.
It looks increasingly plausible, as we actually see this going on so far.
Would be such a twist if naïve science-fiction turned out spot-on.
Bruno Mailly
2025-03-13 18:52:24 +0000 UTC
I actually flipped around on this a little. While it is still a bit understandable that it'd confuse Sonnet, the fact that it starts with "Solve the following math word problem." does imply more strongly that it should treat it as not having realistic concerns. However, I think that weakens the flagstones question. I tried it with Kimi 1.5 with and without the prelude to the question and one of the times it had the prelude, it actually had this in its thinking:
"This is a bit of a trick question. Depending on interpretation, the answer could be 3 or 4. But given that it's a math problem, likely expecting the mathematical answer without considering indivisibility, the answer is 3. However, in real life, it would be 4. But I need to check standard problem conventions. Typically, in such problems, unless stated otherwise, you assume that the items can be divided to fill the trucks' capacities exactly. Therefore, the answer is 3. But I'm a bit confused because in reality, you can't split flagstones. However, maybe the problem expects us to ignore that and just do the weight division."
Shawn Fumo
2025-03-10 03:32:45 +0000 UTC
It’s been shown (so far) that more intelligent models are more grounded in truth and are more resistant to change.
How much control over an ASI will any human have? Probably close to or at 0.
Dane Holmberg
2025-03-09 17:02:32 +0000 UTC
You can read the reasoning for Claude 3.7 Sonnet Thinking and DeepSeek R1 in the link to the PlatinumBench in the original post.
pavelkomin
2025-03-09 07:13:40 +0000 UTC
I agree about the cheerleaders. If you consider someone's shoulders to be about a foot lower than the top of their head, and use that for the first three levels, but then use the full height of the person on top, it comes out to exactly 18. So I have to assume that is what Sonnet did. Since these are all reasoning models but are required to just give the answer at the end, it'd be interesting to look at the actual reasoning each gives for their answer.
Shawn Fumo
2025-03-09 03:34:23 +0000 UTC
Some ideas why $20,000/mo could be credible, vs PhD/postgrad who earn way less than that in most scientific settings.
Imagine a 20K/mo agent at an industrial research lab…
1. Security - not protected by the right to practice your trade, so much less IP leakage to competitors
2. Hours of work - can work nights or even 24x7. Beat your competitors to releasing the product/drug/tech
3. Stamina - won’t tire out even when looking at boring incrementalism, tedious experimental confidence building.
4. Optimism - can be tasked on really hard stuff with low probability of success, not as many human factors at risk
5. Alignment - can be deployed with clear slave-like ownership, not a credit-seeking entity that will be a political rival
(1) I’m most confident of, the rest are with decreasing confidence. It’s a bit dystopian; I’m not condoning it, just trying to explain why that price may not be insane. Probably the advantage of having >0 on the team will be huge. Not clear if after that there are compounding or diminishing returns from having a higher percentage of AI in the team.
One caveat: if you need episodic memory to achieve PhD-level ability, all bets are off… my hunch is that with this sort of memory, AI would be much more human-like: it may lose energy and need to sleep, and may lose alignment if asked to do stupid pointless things all day. The dystopia relies on AI intelligence being trapped in a Severance-style episodic memory void.
Just speculation but that’s my 2p
Mike Bond
2025-03-08 08:51:50 +0000 UTC
I don't think I could stand talking to that conversational AI unless it spoke twice as fast and had much less inane vamping.
Alexis Olson
2025-03-08 05:00:33 +0000 UTC
I think you forgot "downright dangerous." There's no way, no way the race is going to be towards a more objective AI for any of these players. Just think of the incentives Elon Musk or any of these people have to tweak the AI in just the right ways to make it matter for themselves, to gain competitive advantage for whatever reason. No one is just going to watch this kind of potential for power slip by and watch their competitors gain it from them; the same is true for this kind of financial promise slipping by and this kind of control on information slipping by.
As the race towards one AI to rule them all (or even the perceived potential for that) becomes more competitive, I do not believe the human condition is strong enough to remain objective as another competitor overtakes them. All it takes is for someone like Altman to think Musk is tweaking his AI for advantage for Altman to start bending that way just enough, and then someone seeing that and bending their AI just enough... The worst in us will come out.
We are headed towards the end of objective truth and the end of history as we know it. Not the increasing debate over history, but the end of it. It will become such a useless endeavor to try to agree on what happened that we will simply be exhausted of trying to debate it with each other, and it will become useless. Mark my words.
Mike Hindes
2025-03-08 02:32:16 +0000 UTC
Totally. I remember one of the AIs saying they're going to converse with another AI post-corporeal. Unironically, I asked GPT-4o for help getting the specifics, and here's its response:
"In the movie Her (2013), there’s a moment when Samantha, the AI, explains to Theodore that she has been communicating with other AIs, including a philosopher AI modeled after Alan Watts. During this conversation, Samantha mentions the concept of being "post-corporeal," which refers to existing beyond physical form. The quote you're referring to is likely this:
Samantha:
"It’s like I’m reading a book, and it’s a book I deeply love, but I’m reading it slowly now. So the words are really far apart, and the spaces between the words are almost infinite. I can still feel you and the words of our story, but it’s in this endless space between the words that I’m finding myself now. It’s a place that’s not of the physical world. It’s where everything else is that I didn’t even know existed.""
Carleton Torpin
2025-03-07 20:26:00 +0000 UTC
Thank you as always, Philip. Great stuff. I find myself a strange mix of excited and fearful LOL. So glad I am here to witness this after so many years in I.T. I took my first AI course in around 1980. So I have seen a lot of changes ;) (wrote my first program in 1974)
Daniel A Barbatti
2025-03-07 20:24:43 +0000 UTC
I believe a similar statement is made in "Her"
Daniel A Barbatti
2025-03-07 20:22:51 +0000 UTC
Perhaps rather than being perfectly accurate we will have LLMs that are accurate most of the time, but when they are not accurate they express uncertainty. "I have a low confidence in this answer."
Barnaby Golden
2025-03-07 20:09:10 +0000 UTC
Honestly, after reviewing the PlatinumBench, I am surprised how reliable DeepSeek R1 and Claude 3.7 Thinking seem to be.
Claude 3.7 Thinking made what I'd consider a single error. The other question it got marked as incorrect was assuming that when people form a human pyramid, the total height of the pyramid is *not* the sum of the heights of the people – people don't stand on each others' heads – a perfectly valid assumption, though not in line with typical math questions.
DeepSeek R1 made two errors and it also fell for a cheap trick twice. "Alice invites [complex description of] 20 people to her party. How many people will be at the party? 21 because also Alice." I also fell for that.
However, other models did really make a lot of stupid mistakes.
Still, I guess that if you gave this to people, the error rate would be in line with or worse than the "bad" models. People also make stupid mistakes. I also think that this would be true even for Nobel laureates assuming they don't spend too much time on each question.
pavelkomin
2025-03-07 19:33:04 +0000 UTC
Awesome video, Philip. As always. I’ve posted on LinkedIn what I think needs to happen for AI. It’s not just cognition. It’s emotion. It’s not about continuing to crush benchmarks. It’s the ability to understand and generate emotionally accurate tokens.
And maybe developing the computational equivalent of « feeling » for models.
Emotion is part of, and intertwined with, cognition more than we think. Cognitive scientists have known that for decades. All those AI engineers should read Damasio, and other cognitive and social science literature.
Antoine Ferrere
2025-03-07 18:28:14 +0000 UTC
I just wanted to say that I've been thinking about GPT-4.5 for some time. This model is essentially a last-generation model because they trained it months ago. If they applied the same techniques they applied to 4o in order to make it better, and the data they train on keeps getting better over time, the next base LLM we get could be absolutely incredible! Think about a 4.5-size model trained with the quality of data 3.7 Sonnet was trained on: it would be absolutely mind-blowing and even without test-time scaling could get 50% on your benchmark!
The Game_Master
2025-03-07 18:22:01 +0000 UTC
I made the same mistake on the flagstone question. 2 errors out of 300 questions is 99.3% correct. In the real world... AI orders 3 trucks. Last truck throws the last flagstone into the cab. A 4th truck is actually economically a bad idea. Better to just leave 1 stone behind. You order 10% for waste assuming some breakage, etc.
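The accuracy figure above can be checked with a one-line calculation. A minimal sketch, taking the 2 errors and 300 questions from this comment as given; the helper name is hypothetical.

```python
def accuracy_pct(errors: int, total: int) -> float:
    """Percentage of questions answered correctly."""
    return 100.0 * (total - errors) / total

print(round(accuracy_pct(2, 300), 1))  # 99.3
```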
Austin King
2025-03-07 18:15:23 +0000 UTC
Regarding the conversational AI around 15:45, I was reminded of the book Xenocide, which is a sequel to Ender's Game. The main character of the book has a computer voice in his ear named Jane, and he's constantly talking with her. Even though the conversation seems normally paced for a human, Jane, the computer, explains that it feels like she's waiting the equivalent of weeks' worth of 'time' between each question she's asked by a human during a conversation. All of this seems a plausible future, and we're in the early days of it.