AIExplained

Many-Shot Magic: 2 New Papers + 1 Failed Bet Show What Can Be Done with LLMs

Two recent papers (DeepMind + Anthropic tag-team) and a failed $10k bet have reminded people not to underestimate what models can learn from the data you give them in the prompt. Let me show you how this can be harnessed to get better results, even if you don’t have great demonstrations at hand. Then we’ll see the bet that lost a start-up founder $10k but hopefully won’t fool you. And we'll end with Anthropic’s jailbreaking revelations and how even they think such jailbreaks may not be stoppable, which is relevant for any user or business.

Downloadable Video File for Off-line Viewing: https://drive.google.com/file/d/1DVw-9c1h5BcFjJf_1EEMAo5GxEa1bx8l/view?usp=sharing

Many-shot In-Context Learning, Google DeepMind: https://arxiv.org/pdf/2404.11018.pdf

Many-shot Jailbreaking: https://cdn.sanity.io/files/4zrzovbb/website/af5633c94ed2beb282f6a53c595eb437e8e7b630.pdf
Anthropic Jailbreaking Post: https://www.anthropic.com/research/many-shot-jailbreaking

$10k Challenge: https://twitter.com/VictorTaelin/status/1776248021858111542

Winning Prompt: https://twitter.com/futuristfrog/status/1778109834509832462

https://gist.github.com/GameDevGitHub/ffd826a4296008ba10d0df893f6fde62

Comments

Sounds like one might test the limits of this approach with [Phi-3 mini instruct 128k](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) -- the last number is the supported context length.

Pavol Vaskovic

I think the code verifier best-of-4 benchmark in particular is insignificant. If the model has to say yes or no and you ask it four times, and it says no the first three times and yes the last time, is that considered a correct answer?! In what world?! There is around a 93.75% chance that a random generator producing four values would have at least one of them as YES. I wish I were wrong about the benchmark, though. But I don't get it: if best-of-four counts only four yeses or three yeses as correct and everything else as wrong, it should be as good as or slightly worse than a single yes or no from the model. If this is statistically significant, it would be weird!

Youssef Mohamed
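
The 93.75% figure in the comment above follows if you assume a coin-flip verifier, i.e. each of the four attempts independently says YES with probability 0.5. A minimal sketch of that arithmetic is below; the function name and the 0.5 default are illustrative assumptions, not anything taken from the paper, and the paper's actual scoring rule may differ.

```python
from math import comb


def prob_at_least_k_yes(n_attempts: int, k: int, p_yes: float = 0.5) -> float:
    """Probability of at least k YES answers in n independent attempts.

    Assumes every attempt says YES with the same probability p_yes
    (a random-guesser baseline, not a model of a real verifier).
    """
    return sum(
        comb(n_attempts, i) * p_yes**i * (1.0 - p_yes) ** (n_attempts - i)
        for i in range(k, n_attempts + 1)
    )


# A random guesser passes "any YES out of four" most of the time...
print(prob_at_least_k_yes(4, 1))  # 0.9375
# ...but passes "at least three YES out of four" far less often.
print(prob_at_least_k_yes(4, 3))  # 0.3125
```

So whether best-of-4 is inflated by chance depends entirely on which of these rules the benchmark applies.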

It seems more like the attention is noisy, and it gets improved by more examples. If we can solve this during training it would be great. I think training against perplexity as a measure is part of the problem.

Youssef Mohamed

I don't want to sound pessimistic, but how is generating examples from the model, then supplying them back as a prompt, supposed to be a "good thing"? Imagine I ask you something you don't know, then I ask you to give me similar examples about that thing, I tell you those examples back, and you kind of remember what you weren't remembering, maybe?! This is a fading-memory kind of thing. It is normal for biologically grown sentients, but something tells me that computers don't have, and kind of shouldn't have, fading memory.

Youssef Mohamed

I posted on LinkedIn an image I used in a presentation to Siemens during their AI week some weeks ago; it is here: https://x.com/BobbyGRG/status/1782079283223036379 Basically, it depicts the workflow. I must say that watching or even experiencing it is quite empowering. We haven't recorded this video yet, as the Taelin challenge led us to look into other ongoing research that points to how impactful using propositional logic in prompts can be, and I have been experimenting more than I should :)

Robert Gomez-Reino

Surely. Normally we post to both LinkedIn and YT, and lately sometimes X (privately, as myself).

Robert Gomez-Reino

Yeah it is interesting to think of as an alternative to fine-tuning, and an actual training method. The stuff described in the video still sounds fairly early, where you generate some pairs or just questions and feed that in and hope for the best. I haven’t read the paper yet, but I wonder how much they tried diff examples at the same number of shots, diff order of the examples, etc. I just wonder how fully automated and optimized it can be. Do you even need to filter for correct answers in the examples if it is part of a self-supervised process? Like have it generate question and answer pairs, and save those and select random subsets and run it on your benchmark and find which help the score the most, keep generating more, experiment with order, etc. Can even get more meta and have it experiment with prompts to generate the pairs. My guess is it’ll come up with stuff that may be somewhat unintuitive since it is geared toward the actual model. And when you have the best possible ICL input, does fine-tuning on those examples improve the results in the same way? Or does the ICL need slightly different things than the normal gradient descent? One thing that is kind of cool about the ICL is that since you aren’t updating the weights, you can’t interfere with other tasks that use different ICL inputs. It might even be that the optimal training/fine-tuning for ICL doesn’t work as well as a regular trained model when it doesn’t have the many-shot passed in, but then when you do, it really surpasses the normal results. So it’d be interesting to take it a step further and take a pretty small model and a bunch of sets of many-shot prompts you know work well on your benchmarks and do gradient descent on the combo (w/ diff questions than the actual benchmark of course, but related types). In theory it might further optimize the model by eliminating things in the weights that it’ll get from ICL and focus more on taking advantage of what is passed in.

Shawn Fumo
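
The "select random subsets, run them on your benchmark, keep what helps most" loop from the comment above could be sketched roughly as follows. Everything here is a hypothetical illustration: `search_shot_subsets` and `score_fn` are made-up names, and `score_fn` stands in for whatever benchmark run you would do with a given list of shots. Nothing in this sketch comes from the papers.

```python
import random
from typing import Callable, Sequence

Example = tuple[str, str]  # (question, answer) pair used as one shot


def search_shot_subsets(
    pool: Sequence[Example],
    score_fn: Callable[[list[Example]], float],
    shots: int,
    trials: int,
    seed: int = 0,
) -> tuple[list[Example], float]:
    """Random search over which examples to use and in what order.

    score_fn is an assumed callback: it should build a many-shot prompt
    from the given examples, run the benchmark, and return a score.
    """
    rng = random.Random(seed)
    k = min(shots, len(pool))
    best_subset: list[Example] = []
    best_score = float("-inf")
    for _ in range(trials):
        # random.sample picks k distinct examples in random order,
        # so each trial varies both the selection and the ordering.
        subset = rng.sample(list(pool), k)
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Random search is only the simplest baseline here; each trial costs a full benchmark run, so in practice the number of trials would be limited by budget.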

Haha, yep, with prompt optimisation and now, likely MS. Point is to hit SOTA intelligence, even if cost-effectiveness dips. And on your finetuning point, totally agree. Got me thinking now, about a combined approach, double-tuned so to speak...

Philip

Hey Robert, would the demo go on YT?

Philip

Hey Jon, you are totally right, and on your second point I feel like it's almost certain this occurs now, say with Gemini 1.5. Tricks I used to use for jailbreaking stop working almost instantly. It can still be done, though.

Philip

Great points and questions. For 1), I would come up with, say, 10 hand-crafted ones, then ask the model to generate 50 more. Then browse those and take only the ones that pass your given threshold of correctness. Then add those in, even if they are not perfect. For 2), interesting idea, but I suspect even creativity can be enhanced in a way. Actually it's quite deep, whether procedurally generated creativity is actually creative. Will think on that, but yes, it will get more deterministic, but possibly for the better.

Philip
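
The recipe in the reply above (a few hand-crafted examples, plus model-generated ones that pass a correctness threshold) might look something like this in code. This is a minimal sketch under stated assumptions: `grade_fn` is a hypothetical stand-in for however you judge an example (manual review, a unit test, another model), the 0.8 threshold is arbitrary, and the `Q:`/`A:` layout is just one plausible prompt format.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (question, answer)


def select_examples(
    hand_crafted: Sequence[Example],
    generated: Sequence[Example],
    grade_fn: Callable[[Example], float],
    threshold: float = 0.8,
) -> list[Example]:
    """Keep all hand-crafted examples plus generated ones graded >= threshold."""
    kept = [ex for ex in generated if grade_fn(ex) >= threshold]
    return list(hand_crafted) + kept


def build_many_shot_prompt(examples: Sequence[Example], question: str) -> str:
    """Lay out the shots as Q/A blocks and end with the unanswered question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

The resulting string would then be sent to whichever long-context model you are using; that call is left out because it depends on the provider.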

Great point. Almost a start-up idea that in one-click tells you how many/which shots to use per task.

Philip

I agree, Sean. We still discover things about GPT-3 today, 4 years on. GPT-5, say, given its size, may not be fully 'discovered' until 2030 or beyond!

Philip

Maybe it is getting smarter within the context of your smart prompts! It is learning, perhaps, to imitate the sophistication of the person it is dialoguing with. :)

Philip

That's a fantastic point Steve, I hadn't fully vocalised it like that. It is raw intelligence to be able to learn like that, not linked to a particular task or dataset. Will mull on that one for a while, I think you are right.

Philip

SmartGPT 3.0?? 👀 But seriously though, this is pretty incredible, and as you said only possible with extremely large context windows like we have now. I’d love to see if these findings carry over to multi-modal tasks… I have a lot of examples I can pass if it means we get high quality and accurate damage assessment! This is also a fantastic alternative to the headache that is fine-tuning, you can iterate quickly and cheaply without waiting for fine-tuning APIs or model weights to be available.

Trenton Dambrowitz

Btw, a funny experience that shows how strikingly good in-context learning is: we created a set of GPTs which use actions to call OAI API endpoints, including an Azure OpenAI GPT-4 fine-tune. This fine-tune was trained on an extensive proprietary programming language (that standard GPT-4 hardly knows). After a few calls from the GPTs to this backend fine-tune to get some pieces of code, the GPTs start spitting out pieces of code autonomously, and correcting pieces of code produced by the backend LOL. By sharing the context, by the way, after a while all the GPTs start to share capabilities, quite funny, just with different actions. It's quite interesting since it looks like a warm-up phase for the specific topic the engineer wants to work on, and suddenly you can choose whether you believe the GPTs can already do some subtask x (faster, then) or you want to push it to a backend CoT process and/or FT, depending on the difficulty of the task. I think, btw, this is a more pragmatic approach to SmartGPT-like setups for boosting productivity as of today. Will show a demo video online soon.

Robert Gomez-Reino

It is kind of like the way humans work. We have very smart people who know a lot. But sometimes these smart people do not use their brains, or the info they have, correctly. LLMs have all this info, but the many-shot prompting approach (aka post-training) allows them to use their capabilities much better. As for the hacking, I feel we need an independent AI plug-in that can monitor context inputs in real time, stopping the prompt before it goes into the LLM if it does not align with safety guidelines.

Jon Kurishita

That was a jam-packed video with lots of goodness. I have a few questions, if I may. Dumb question #1 from me: how would I go about coming up with 50, 100 or more examples to feed the LLM for many-shot ICL? That seems like a lot of work. Second (maybe less) dumb question: is many-shot ICL most applicable where we want the LLM to be more deterministic in its response? Or, asking it another way, should we avoid many-shot ICL when we want the LLM to be creative? Thanks

Sean Gallagher

This is very interesting. It would be pretty useful to have a tutorial with many shots that taught me how to use many shots to prompt the models. It'd be especially useful if this could be automated by just verifying the answer.

Carlos Baraza

Great video, thanks for sharing! I think we’ll look back in a few years and realise that many capabilities already exist in today’s LLMs, they’re just hidden in the huge number of weights in the models. It’s like there’s too much noise in the model. Maybe there’s a new training technique to be found that can counter this!

Sean Betts

It's definitely been my experience that Opus becomes smarter the longer the conversation. The longer-context chats I have are definitely of higher quality than the new chats. I've wondered if there is a way to hack this effect by asking Claude to compress a long chat and then seeding new chats with the compressed version.

Machiel Reyneke

I haven’t watched the video yet, so it may be covered in there, but I also find it interesting that apparently Claude Haiku has very good in-context learning. Have also heard people be impressed with Llama 3 on that front. So it seems like it might not be purely a size thing.

Shawn Fumo

Thank you for the deep dive into "in-context learning", an aspect that gets too little attention. The failed bet was still on my todo list to have a closer look, but thanks to you, I can cross it off. Your video makes me wonder if the ability to learn from context is the true intelligence test for LLMs. We observe that the larger the model, the better the ability to learn in-context, and we also observe that we humans are much better/more efficient at learning from context. We don't need 1000 examples; a few are enough.

SteveHaupt

