XaiJu
AIExplained

Can ChatGPT Do Task X? It’s Surprisingly Hard to Answer

‘Can any model do [insert task]?’ is a much harder question than it seems. I’m going to give you five vivid categories, with unambiguous examples, drawing on 6 new papers, of the kind of detail that is so often lost in 2024 debates on AI.

Link for Off-line Watching and Download: https://drive.google.com/file/d/1ep3Asw6_1LZRJoCKU1VGSaUcYMP6djBq/view?usp=sharing

Anthropic What LLMs Can’t Do: https://www.anthropic.com/news/a-new-initiative-for-developing-third-party-model-evaluations?s=09

Fool Your (Vision and) Language Model with Embarrassingly Simple Permutations: https://arxiv.org/pdf/2310.01651

LLMs are not robust selectors: https://openreview.net/pdf?id=shr9PXz7T0

Eliminating Position Bias of Language Models: A Mechanistic Approach - https://arxiv.org/pdf/2407.01100

DROP Benchmark: https://arxiv.org/pdf/1903.00161v2

Investigating the Robustness of LLMs on Math Word Problems: https://arxiv.org/pdf/2406.15444

AI Explained Hat-tip: https://x.com/vipul_1011/status/1808243451852644794

Changing Answer Order Can Decrease MMLU Accuracy: https://arxiv.org/pdf/2406.19470

AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf

AI Agents That Matter: https://arxiv.org/pdf/2407.01502

Functional benchmarks: https://arxiv.org/pdf/2402.19450v1

LLMs Cannot Plan: https://arxiv.org/pdf/2402.01817

MysteryWorld: https://arxiv.org/pdf/2302.06706

Comments

Wonder if models liking the first answer is because of triangular masking: the vector of the first answer doesn't have information about the other answers. Would be interesting to see if a BERT-style model fixes this.

Bryson Tang
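
To make the premise of the comment above concrete, here is a minimal sketch of a causal (lower-triangular) attention mask. It only illustrates which parts of a prompt each position is allowed to attend to in a decoder-only model; it does not show that this is the cause of the first-answer preference. The segment names and the one-position-per-segment layout are assumptions made for illustration.

```python
def causal_mask(n):
    """Lower-triangular mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Hypothetical prompt layout, simplified to one position per segment.
segments = ["question", "option_A", "option_B", "option_C", "option_D"]
mask = causal_mask(len(segments))


def visible_to(segment):
    """Return the segments that the given segment is allowed to attend to."""
    i = segments.index(segment)
    return [segments[j] for j in range(len(segments)) if mask[i][j]]


# The first option sees only the question and itself,
# while the last option sees every earlier segment.
print(visible_to("option_A"))
print(visible_to("option_D"))
```

A bidirectional (BERT-style) encoder would correspond to a mask in which every entry is allowed, which is the comparison the comment proposes.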

Much more to come, have been buried in creating a new benchmark for just this!

Philip

I am very optimistic, just unclear when. In a sense, logic should be an easier problem than language.

Philip

If you like, more to come, even a formal benchmark. Am working hard on it

Philip

It could well be yes, seems really fundamental to the architecture

Philip

Will try this prompt on a new benchmark I am creating, seems to work just occasionally. Fundamentally it often does not grok what is irrelevant.

Philip

Thanks Carlos!

Philip

Thank you Steven, will have to dwell on that one, hard to put a response in words!

Philip

Yeah this video actually inspired me to finally, and formally, create a benchmark. This is why I have been quiet the last week or so, and lots more work to go. Really could reveal weaknesses/unlock gains.

Philip

If models aren't only memorising patterns/answers but also thinking to some extent, what you've described here is analogous to human cognitive biases.

Machiel Reyneke

Great video as always! It seems that reasoning failures are a challenge for LLMs on the scale of hallucination. How optimistic are you that these things can be solved without some sort of new breakthrough, on the level of transformers?

Mark Levine

Sometimes I feel like this channel might lead to a skeptical revival (philosophical) - not cynical - where it's almost like a frontier AI position gets pummeled/stripped down by a battery of tests from logical skepticism and then limps back to a drawing board. I can see how textbook old schools of thought were founded; a societal/economic shift occurred and then some people crowded around an interrogative position. Asking what if / what then / why is it / enough times until it becomes degraded or untenable.

Oldsnakenewtrix

This guy did a great job explaining the mechanistic interpretability work of Anthropic. It is in Spanish, but really really good: https://youtu.be/VjZ51nIcy1I?si=dg8tNKmBHN55RNqU

Pablo Rodríguez

Very interesting! Throughout all my learning experiences, I've always felt frustrated with tests that required memorization since I prefer to understand things. Human tests rely heavily on memorization. LLMs are basically trying to compress information, and an effective way to do this is to extract axioms and understanding from all the information we receive. I wonder if, with enough training and scale, these patterns will emerge. However, I also think that a change in architecture is needed. Transformers are too granular, and we humans have the ability to understand and change some of these hyper structures at will (just realizing you are wrong and changing your logic flow). I think the work around sparse autoencoders for mechanistic interpretability is a step in the right direction, and feeding this to the model so it can be aware of that meta-cognition layer is crucial.

Pablo Rodríguez

Hello Philip, if you prime GPT-4o with a prompt like: "Please review each question closely and take special care to ignore any irrelevant information." GPT-4o tends to do better. Using your example with GPT-4o and this additional prompt produces the correct result every time (for me).

----GPT-4o Response----

To solve the problem of determining the least possible number of marbles that must be removed to guarantee that at least 10 marbles of any one color have been removed, we need to consider the worst-case scenario. In the worst-case scenario, we want to maximize the number of marbles removed without reaching 10 of any single color until it becomes inevitable.

The marble counts are:
8 green marbles
18 red marbles
17 blue marbles
21 purple marbles
5 silver marbles

The worst-case scenario would involve removing as many marbles as possible from each color without removing 10 of any color. Here’s the calculation:
Remove 8 green marbles (since there are only 8).
Remove 9 red marbles (one short of 10).
Remove 9 blue marbles (one short of 10).
Remove 9 purple marbles (one short of 10).
Remove 5 silver marbles (since there are only 5).

The total number of marbles removed in this scenario is: 8 + 9 + 9 + 9 + 5 = 40

To guarantee that at least 10 marbles of any one color have been removed, one more marble needs to be removed, making it inevitable to reach 10 of a particular color. Therefore, the least possible number of marbles that must be removed to guarantee that at least 10 marbles of any one color have been removed is 40 + 1 = 41

So, the correct answer is: 41

----End GPT-4o Response----

You can then ask the model for the irrelevant information:

----GPT-4o Response----

The irrelevant information in this problem is the weight of the marbles and the weight of the bag. These details do not impact the calculation of the number of marbles that need to be removed to guarantee that at least 10 marbles of any one color have been removed. Therefore, you can ignore the following information:
"3g weight" of green marbles
"2g weight" of red marbles
"4g weight" of blue marbles
"3g weight" of purple marbles
"2.5g weight" of silver marbles
"The bag has a weight of 100g when empty."

The relevant information is the number of marbles of each color.

----End GPT-4o Response----

Tony Coffman
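
The worst-case arithmetic in the quoted response can be checked with a short sketch. The function name and the error handling are hypothetical; the marble counts and the target of 10 come from the comment above, and the weights are left out because they do not affect the count.

```python
def min_draws_to_guarantee(counts, target):
    """Smallest number of draws that forces `target` items of some one color.

    Worst case: take up to target - 1 of every color (or all of a color that
    has fewer than that), then one more draw must complete a set of `target`.
    """
    if not any(c >= target for c in counts.values()):
        raise ValueError("no color has enough items to reach the target")
    worst_case = sum(min(c, target - 1) for c in counts.values())
    return worst_case + 1


# Counts taken from the comment above.
marbles = {"green": 8, "red": 18, "blue": 17, "purple": 21, "silver": 5}
answer = min_draws_to_guarantee(marbles, 10)
print(answer)  # 8 + 9 + 9 + 9 + 5 = 40, plus one more draw gives 41
```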

Great video! Reminds me of last week's Bill Gates comment: "The various actions to change the underlying reasoning algorithm from the trivial that we have today to more human-like meta-cognition; that's the big frontier." However, he also said, "I've seen that we will make progress on [meta-cognition] next year but we won't completely solve it for some time after that." This mirrors what Sam, Mira, Brad, John, and Kevin Scott have said about expecting better reasoning in the next generation. I tend to think they're not simply speculating but, instead, have seen notable reasoning progress behind closed doors. And since some of the flaws you point out intuitively feel like a little reasoning would go a long way (as opposed to, say, ARC), I tend to think the next generation will make significant progress on these -- and not at the expense of speed which users have come to expect and, more importantly, the voice-to-voice paradigm they're pushing relies on. Thoughts? It sounds like you think these are stubborn flaws that will materially hold back AI capabilities for several years.

Brian Crabtree

Excellent video, as always!

Carlos Galarza

Great video mate. I love the criticism of LLMs. What I appreciate the most is the level-headed position you often have. I did have some specific questions I would like to ask you directly. Do you respond to these comments for questions? Have any of these AI companies tried running them similar to an organic brain? Like having one AI talk to itself and forming “hemispheres” of thought? One for pure output, one for evaluating proper and quality outputs, one to evaluate the past (memory), in order to actually create a somewhat conscious machine? Also, the frequency of our brain is directly related to our thoughts and feelings, can we somehow get LLMs to think on a certain frequency? Thanks!

Steven

Philip: this is a really important video! I suppose the implication is that AI models would benefit from more randomness in their training. To help models truly “grok” the underlying abstract concepts and generalise better, it’s important to randomise various aspects of the training data. This could include shuffling the order of options in tests, varying the location and position of objects in images, and dressing up the concepts in different guises during learning. By “fuzzing up” these elements, models can learn to focus on the essence of the concepts rather than relying on superficial details. Incorporating such techniques should enable models to improve their ability to generalise to new contexts.

Jason Tangen
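
One of the ideas in the comment above, shuffling the order of options in tests, might look like the following sketch. The function name, the sample options, and the seeded generator are assumptions for illustration; this is not the procedure used in any of the linked papers.

```python
import random


def shuffle_options(options, correct_index, rng):
    """Shuffle answer options and return them with the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)                        # order[k] is the original index now at slot k
    shuffled = [options[i] for i in order]
    new_correct = order.index(correct_index)  # where the correct option ended up
    return shuffled, new_correct


# Hypothetical multiple-choice item; a seeded generator keeps runs repeatable.
options = ["39", "40", "41", "42"]
correct_index = 2
rng = random.Random(0)

shuffled, new_correct = shuffle_options(options, correct_index, rng)
print(shuffled, "correct option is now at position", new_correct)
```

Tracking the new position of the correct option is the part that matters: the content of the question is unchanged, so any change in a model's accuracy across shuffles points to sensitivity to position rather than to the answer itself.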

Great coverage as always! It's a breath of fresh air to hear about the flaws of these models and how much more work needs to be done rather than having hype and fear of replacement being firehosed at you. I think that the AI hype bubble needs to pop, so that the labs that are doing the real work can take their time to develop it without the pressure from higher-ups to release before it's due.

Jonathan Kirk
