3blue1brown

Bayesian updating and probability density functions

Hey everyone,

Here's the next installment of the three-part sequence leading to the beta distribution.  It introduces Bayes' rule in a context beyond the usual binary outcome introductory examples, together with a primer on probability density functions.

As to feedback, one question I have right now where you could be helpful is whether or not the script is too redundant.  I think sometimes it can be helpful to hear one thing said a couple of times in different ways when you're learning something new, but taken too far it can also be bothersome.

And of course, share any other thoughts you have or errors you catch.

Thanks,
-Grant

P.S. The TEDx talk is finally live!  

P.P.S. Thanks for all the great suggestions!  There are a number of things I think I'll go back and rework before posting this more publicly.  That may be slightly delayed, as I'm also playing around with some SIR simulations that I may want to wrap together as a video by the end of this week.

Comments

Is this series just on indefinite hiatus?

Eammon Hart

Yeah, waiting for it

When's the third installment coming out? I can't wait for it!!

I think I see what you're getting at, but I think you phrased it awkwardly. The probability of the second toss landing Heads is the same, but the probability of landing both the first toss AND the second toss on Heads is affected. Previous tosses affect the probability of the whole sequence, I guess. But the elementary events (a single toss landing Tails, and a single toss landing Heads) define the (in this case binomial) distribution, so it undermines the whole model to assume their dependence.

Really nice video! The final simulation with coin tosses and Bayesian updating (from 9:30 on) prompted me to think about independence a bit. Even though the coin tosses are physically independent of one another, when we are estimating (or do not know) the probability of Heads, the probability that, for instance, the second coin lands Heads depends on the outcome of the first toss. This is illustrated by the probability of the data (the normalizing constant), which changes with each new update. This is in contrast to the usual presentation of coin tosses as being probabilistically independent from each other (which assumes that we know the probability of Heads). Just a thought that struck me. Thanks for the video!
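
The dependence described in this comment can be checked with exact arithmetic. The sketch below assumes a uniform prior on the unknown Heads probability s and uses the standard identity for the integral of s^h (1-s)^t over [0, 1]; the function name is made up for illustration.

```python
from fractions import Fraction
from math import factorial


def prob_of_sequence(heads: int, tails: int) -> Fraction:
    """P(one specific sequence with these counts) under a uniform prior on s.

    Uses the identity: integral of s^h (1-s)^t over [0, 1] = h! t! / (h+t+1)!
    """
    return Fraction(factorial(heads) * factorial(tails),
                    factorial(heads + tails + 1))


p_first_heads = prob_of_sequence(1, 0)                 # P(H) = 1/2
p_both_heads = prob_of_sequence(2, 0)                  # P(HH) = 1/3
p_second_given_first = p_both_heads / p_first_heads    # P(H2 | H1) = 2/3

print(p_first_heads, p_both_heads, p_second_given_first)
```

With s unknown, P(HH) is 1/3 rather than 1/2 * 1/2, so seeing one Heads raises the probability of the next from 1/2 to 2/3. The tosses are independent only once s is fixed.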

I subscribed to your Patreon just for this series, but I'm a long-time viewer. An idea: around 4:30, with the viz of the posterior changing as the likelihood changes, only the likelihood in the formula is highlighted. Might be confusing. Or maybe I'm just confused.

I'll work as fast as I can.

3blue1brown

hiii, I need video 3, please make it soon. You have done excellent work

Really great work on this. The "parallel universes" interpretation/explanation of the continuous Bayes updating was scrumptious. I remember having to spend some time with Fubini's theorem and lots of diagrams to convince myself of this formula when I first encountered it. Your explanation is parsimonious and beautiful. Did you come up with it yourself? I haven't seen it before.

Jacob Mirra

Hi Grant, nice video as usual! Just one suggestion: you could mention that computing the evidence, or the normalizing constant, is often infeasible in higher-dimensional models, since it requires integrating over all possible s. So people invented MCMC (Markov chain Monte Carlo) algorithms to sample from the posterior distribution. Btw, this can be a primer for talking about Markov chain processes and their applications...
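
To make the sampling idea concrete, here is a minimal Metropolis sketch for the coin example. It assumes a uniform prior and hypothetical data of 7 heads and 3 tails. In this one-dimensional case the normalizing integral is easy, so the sampler is only an illustration. All names and the step size are illustrative choices, not anything from the video.

```python
import random


def unnormalized_posterior(s: float, heads: int, tails: int) -> float:
    """Uniform prior times likelihood; zero outside [0, 1]."""
    if not 0.0 <= s <= 1.0:
        return 0.0
    return s ** heads * (1.0 - s) ** tails


def metropolis_samples(heads, tails, n_samples=50_000, step=0.1, seed=0):
    """Random-walk Metropolis; never needs the normalizing constant P(data)."""
    rng = random.Random(seed)
    s = 0.5
    current = unnormalized_posterior(s, heads, tails)
    samples = []
    for _ in range(n_samples):
        proposal = s + rng.uniform(-step, step)   # symmetric proposal
        candidate = unnormalized_posterior(proposal, heads, tails)
        # Accept with probability min(1, candidate / current).
        if rng.random() * current < candidate:
            s, current = proposal, candidate
        samples.append(s)
    return samples


samples = metropolis_samples(heads=7, tails=3)
estimate = sum(samples) / len(samples)
exact = (7 + 1) / (7 + 3 + 2)   # mean of the exact posterior, Beta(8, 4)
print(f"MCMC estimate {estimate:.3f} vs exact {exact:.3f}")
```

Only ratios of the unnormalized posterior appear in the acceptance step, which is why the evidence never has to be computed.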

:)

At 9:46, I think you could show `p(x|H) = p(x)P(H|x)/P(H)` first and then substitute "data" for `H`. Since `data` here might refer to `H` or `T`, it might be confusing in your example. One small thing: I found that it is kind of the same idea behind Microsoft's ranking system TrueSkill. Hopefully you could make a video about it (or about player skill ranking systems) in the future.

KiceQishi

Small thing: little audio blip around 2:16

Good catch, thanks!

3blue1brown

Indeed! The plan is to talk about something called the "Haldane" prior, and also about how, if you didn't trust early reviews (e.g. for worry that they are just spam), you could factor that in. I'm not sure if it'll be in the next video or elsewhere, but it would be nice to discuss the whole circle of thoughts around how to best choose a prior which encodes as few assumptions as possible, e.g. the principle of maximum entropy.

3blue1brown

There is a question which prior distribution to assume. Will it be explained in the next video?

tunaflsh

I guess because it's pretty foundational to this area - but also one of those videos was a proof, the other was intuition but this is actually calculating using Bayesian updates and showing why the Beta distribution is important

Nice, this video was so illuminating. It reminded me of an exercise that practically no one did. It contained the definition of the beta distribution, a kernel from Bernoulli to beta I think, and then some stuff to show. And a bonus exercise about programming something to simulate coin flips and show a priori and a posteriori distributions. I am now convinced that this exercise was trying to let us do some heavy math that your video series is showing visually and intuitively. I'm looking forward to the beta distribution.

Supreme

Little error: in 1:25 you switched the words "reviews" and "rating" :)

Nèstor Abad Viñas

At 3'41'' I think you should definitely remove the left bar, because the height of those parallel universes is entirely unrelated to a value on the probability scale. My first intuition while looking at the video at 3'54'' is that a small green square would be s=0.96 and probability=0.008, for example, which actually makes no sense. So it's strange that the different universes are represented at different heights while height already represents a probability. This gets even worse at 4'05'', where you have a subset of the universes which looks like a distribution curve. Actually, the only reason I paid attention and realized that something was strange was that I didn't understand why you had a probability distribution whose sum is far smaller than 1. To state the last point another way: at 4'05'' you show P(data|s), but you don't explicitly mention that you're no longer showing a probability distribution, but the product of two numbers, which does not itself lead to a probability distribution. (By the way, P(data|s) shows up too early, since it is displayed while the uniform distribution is still on screen.)

arthur milchior

At 6:09, you have all probabilities equal to 0, while the sum still shows infinity. It's quite strange to see that the sum is not updated, or at least hidden, until it gets its new value.

arthur milchior

One other thing I think is missing is an explanation of why you are making a new video about Bayes, while there is already "Bayes theorem", "making probability intuitive" and "The quick proof of Bayes' theorem".

arthur milchior

Around 6 minutes, I believe there is a problem with assigning values to each atomic probability. I don't know whether you're limiting yourself to probabilities with a finite number of digits (or more generally to rational numbers), in which case your assertion is false: if you enumerate all rational numbers r_1, r_2, ... and assign 2^(-i) to r_i, then the sum is 1. If you assign numbers to all reals, then the very notion of a sum is itself not well defined. So it's strange to state that it sums to infinity.

arthur milchior
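
For reference, the counterexample in this comment works with geometric weights that have a negative exponent. Assuming the rationals in [0, 1] are enumerated as r_1, r_2, ..., giving r_i the weight 2^{-i} yields a valid assignment of nonzero probabilities, since the weights form a geometric series:

```latex
\sum_{i=1}^{\infty} P(r_i) = \sum_{i=1}^{\infty} 2^{-i} = \frac{1/2}{1 - 1/2} = 1
```

So the claim that nonzero point probabilities must sum to infinity depends on the set of values being uncountable, as with the reals.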

You’re not alone in making that point, I’ll edit that section.

3blue1brown

The relationship has a special name: the beta is a “conjugate prior” to the binomial.

3blue1brown

Thanks for the kind words. As to other resources, the book I showed as the example purchase (by Hamming) is quite good for the actually-doing-calculations-and-problem-solving part of probability.

3blue1brown

Good point. I’m thinking of setting aside some time in part 3 to really talk about all the underlying assumptions at play.

3blue1brown

Having more examples is probably a good idea, depending on how long I want the video to be. As to Coronavirus, I’m hoping for these videos to be more evergreen. That is, keep in mind the viewer a few decades from now.

3blue1brown

Good to know, I’ll think about how to make that clearer.

3blue1brown

You’re not the only one to make this comment, so I’ll add some clarification.

3blue1brown

Sounds good. When in doubt, it never hurts to shut up and let the visuals speak for themselves for a bit!

3blue1brown

Oh, I forgot: great video and not repetitive at all.

Lionel Pöffel

For the final coin-flipping example it would have been lovely to see how the re-scaling factors "C" are calculated.

Lionel Pöffel

Outstanding work. Thank you.

Working on non-parametric Bayesian methods, so not much new for me. I think it's a nice pace. Oh, what helped for me for multiple variables at the same time and visualizing joint and conditional contributions was that the structure is the same. If a joint distribution can be represented by a 3D tensor then the conditional distribution is also a 3D tensor. That this is the same mathematical structure clarified a lot for me. Somehow my intuition was first that a conditional was of a lower dimensionality. Dumb perhaps, and now it's easy to manipulate them in my head. However, that was not always the case. :-)

I'm guessing that using XML-like tags like that works well for people knowing what they mean. Perhaps a bit confusing for anyone else. I assume they are meant to be in the video.

Really great video! There was definitely not too much redundancy for me, even though I've thought about these things a lot, only more from a machine learning perspective. I kept thinking that if I were to set the prior uniformly by simulating a case of each outcome, that might approximate the right probability, since the prior and the limit as N grows are what I want them to be. I was astonished in the previous video to learn that it is actually the «right» thing to do. I'm really appreciating these videos, this one was also a real eye-opener. I didn't imagine this being as easy as multiplying by x and (1-x) (and normalizing)! And now I even understand why! EDIT: Having watched this video, I'm intrigued by the relationship between the beta and binomial distributions. Hoping for more about that later!
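
The "multiply by x or (1-x), then normalize" step mentioned here can be sketched numerically on a grid. This is only an approximation under stated assumptions: a uniform prior, 1000 equal bins on [0, 1], and a made-up sequence of ten flips. The function and variable names are illustrative.

```python
def update(density, grid, saw_heads):
    """One Bayesian update on a grid: scale by the likelihood, then renormalize."""
    scaled = [d * (s if saw_heads else 1.0 - s) for d, s in zip(density, grid)]
    width = grid[1] - grid[0]
    area = sum(scaled) * width   # Riemann-sum approximation of P(data)
    return [v / area for v in scaled]


n = 1000
grid = [(i + 0.5) / n for i in range(n)]   # midpoints of n equal bins on [0, 1]
density = [1.0] * n                        # uniform prior

for flip in "HHTHHHTHHT":                  # hypothetical data: 7 heads, 3 tails
    density = update(density, grid, flip == "H")

mean = sum(s * d for s, d in zip(grid, density)) / n
print(round(mean, 3))
```

The resulting mean should land very close to 8/12, the mean of the Beta(8, 4) distribution, which is the exact posterior for 7 heads and 3 tails starting from a uniform prior.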

Thanks! Sometimes I'm a bit error-prone with the cuts on my first pass, appreciate the catch.

3blue1brown

Hmm...good point. Maybe being explicit that P(data) is the area of the P(s)*P(data|s) region would be helpful, and the idea that it normalizes should be more of a side note to that. You're not the only one to mention that part might be confusing, so I'll reword it.

3blue1brown

100% right, being a little too loose with language there.

3blue1brown

Turns out the answer is 100%. It really is there.

I heard a sound glitch at 2:16. I listened three times it is there each time. What's the chance it is there the fourth time?

10:39 There's a new head but the exponent doesn't increment for a few more seconds. Maybe that's a stylistic choice, but it added the tiniest bit of friction for me

Max Goldstein

Grant, thank you for this mini-series on Bayesian reasoning and PDF’s. Having blindly applied the formulas and discerned the input elements via rote repetition since I can remember, this and Friday’s first installment have opened up for me an intuitive sense for the deeper reasoning from which the mathematical formulas naturally emerge; what an epiphany! That said, there was one moment of cognitive discontinuity for me on the first pass: At around 10:03, while describing the observation-by-observation updating process, you explain that “… then you normalize. In this case, the normalizing constant would be 2, but let’s just write it as C in general.” Though you have the “P(data)” highlighted in the denominator, suggesting that the “C” in f(x) is connected with that element of the formula, the value of “P(data)”, and why it leads to a “C” of 2 on this iteration isn’t exactly clear. In the subsequent iteration, you explain that “…this constant will become something new”, though not how it’s arrived at. I think this could probably be remedied with a single sentence (or phrase) in the audio track – even if it’s “…which we’ll explore more deeply in Part 3” – and I believe would make this already masterful piece even that much clearer. So thanks once again, Grant. I am grateful for you, for the significant and meaningful contribution you’re making, and for the opportunity to help, even in a small way.
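
One way to see where the value 2 comes from, assuming a uniform prior and that C is the constant multiplying s^heads (1-s)^tails (my reading of the comment, not a transcript of the video): C is one over the area under that curve. A small exact sketch, with a hypothetical function name:

```python
from fractions import Fraction
from math import factorial


def normalizing_constant(heads: int, tails: int) -> Fraction:
    """C such that C * s^heads * (1-s)^tails integrates to 1 over [0, 1]."""
    # Area under s^h (1-s)^t on [0, 1] is h! t! / (h + t + 1)!
    area = Fraction(factorial(heads) * factorial(tails),
                    factorial(heads + tails + 1))
    return 1 / area


print(normalizing_constant(1, 0))   # one head: area under f(s) = s is 1/2, so C = 2
print(normalizing_constant(2, 0))   # two heads: area under s^2 is 1/3, so C = 3
print(normalizing_constant(2, 1))   # two heads, one tail: area is 1/12, so C = 12
```

The constant "becomes something new" at each step because every extra factor of s or (1-s) changes the area that has to be rescaled back to 1.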

2:51 You can see "prior distribution" a few seconds before it comes in

Max Goldstein

I've been looking forward to this video for quite a while, and it was worth the wait :). I've been studying Bayesian inference on my own for a while and so I "knew" all the material in this video going in and even so I felt that it was definitely not repetitive and that the pace was perfect for *me*, so I can imagine that it will definitely not feel repetitive to someone who is new to the concepts at hand. On that subject, I want to add that, as a teacher, I often find that what seems like "saying the same thing in different ways" to me actually looks like many different (necessary) pieces of information to my students. I notice in this video you restate the idea of scaling the probability (or probability density) at each point based on the data and then re-normalizing in a number of different ways (I wonder if this is the repetition you were worried about?) but each time you are helping the listener make different connections relating to the central idea of multiplying and re-scaling. I don't think any of these 'repetitions' were unnecessary. Out of curiosity, and you may have written about this elsewhere, do you have recommended texts for further learning about this topic? This is my first time commenting on here, but I wanted to say - your videos bring me so much joy as a mathematician and provide a push for me to do better as an educator so, seriously, thank you for this work.

WATCH GRANT'S TED TALK... He did an amazing job explaining how to motivate people to learn math. => https://youtu.be/s_L-fp8gDzY

Richard Hackathorn

Excellent! Your probability videos were well worth the wait :) And you couldn't have picked a better motivating example!

Another annoying "real world" comment. The video suggests that the order of the data doesn't matter, because of course, the probability formula ignores order. But, if an Amazon seller had 10 negative reviews followed by 90 positive ones vs. 90 positive ones followed by 10 negative ones, I suspect you'd be safer with the 1st seller (who presumably worked out an issue when they started) vs the 2nd (who is probably experiencing challenges).

Ron Goodman

I have a question about PDF. How to open and print PDF? (Sorry I’ll let myself out now)

DocScoot

I don't find it repetitive at all. "Hearing the same idea explained in several ways" is indeed important, but I also think teachers underestimate "saying the exact same words many, many times." If I'm convinced my students know what a phrase means, and they hear me say it multiple times, especially in unrelated-seeming situations, I can drive home my thinking process better because they know exactly what I will be thinking about. So my opinion is, don't be afraid of repetition at *all*.

Jason Taff

I think another example would be helpful to bring the equation shown at 5:20 home. To be current, there are many examples that could be made and help in a better understanding of Corona. At the risk of being wordy, I will provide examples that can be used to create an example problem showing the application of the equation. Probability someone has it but a test won't show it, number of contacts an asymptomatic person makes, number of cases given that testing does not show cases with too short of a gestation period, and many more lead to a modelling of the spread and eventually the slowing of the spread under numerous scenarios. I leave it to the audience to choose the example.

Personally, I find that your videos strike a good balance with repetition. I've never been bothered by encountering the same information multiple times in different ways that might give me a different perspective on it, and I've always felt that that's something that your explanations do really well. I do have some friends who I've shared your videos with who say that they find the explanations slow - but you're never going to find something that's within everyone's acceptable range. IMHO this style of explanation is what you're good at, and it's also an area where existing free videos are often lacking.

Luc Ritchie

Hey Grant, still watching it (catching up with the previous one as well!), and paused to come drop this comment that had you released it earlier I could totally have saved my current edX/MITx's cohort on Probability! :D (I sadly failed the exam!). Also, while I was studying for that, I plotted my charts having your classes on calculus in mind. It's a great experience now to see the real thing done, kind of cheesy to say: it feels like a "déjà vu"! So cool! You're the best! Thank you!

I love it. As to your question about redundancy, I don't think so at all. For me, your idea that "it can be helpful to hear one thing said a couple of times in different ways" is absolutely on the money and I think you have the balance just right. Looking forward to the next episode.

Eamonn

I hate to steal Grant's thunder here, but if you just want a method that works and is known to be mathematically rigorous, the folks on Stack Exchange already have instructions for doing that: https://rpg.stackexchange.com/a/70803 . They describe how to test a physical die, but the same method works just as well on a virtual die (just skip the parts about "making sure you always roll it the same way on the same table" etc.). Of course, reading those instructions is a lot less fun than watching a video explaining where the formula comes from...

Kevin

The term in the business is "Texas sharpshooter fallacy" - basically, when you keep slicing and dicing your data in many different ways until you eventually find a statistical analysis or subgroup that shows something "interesting." Statistically, you can almost always find an apparent "pattern" in random numbers, if you look hard enough. The analogy is to a Texan firing shots into the broad side of a barn, then painting a bullseye around the biggest cluster of hits and falsely claiming to be a sharpshooter.

Kevin

Laplace's rule of succession.

3blue1brown

Outstanding point! I plan to talk all about the Haldane prior in the third part.

3blue1brown

Nice, clear, video. Things I've noticed and already mentioned: the weird long arrow for 1.00 at 2:37 and the broken audio at 2:16. Also: tags (like html tags) at around 5:31 and 9:11 (Aside on pdfs); do you really want them in the video?

Edith Dubiner

This is good -- it doesn't feel redundant to me at all. My biggest concern is that the aside on PDF's may be long enough that people coming to this for the first time may have trouble keeping the prior (no pun intended) train of thought, so it may be worth shortening that, or even moving it to a separate video. Mostly, I wanted to see part 3! :)

Yonatan Zunger

Some minor thing to improve: 2:16 has some temporary audio problems (cuts out for a few ms). Also the zeros throughout the video have some odd jagged edges on the insides sometimes (more visible at 1440p). Overall good stuff, Grant :)

DomNomNom

I would like to see numeric examples for P(s|data) and P(data|s). To me, that is not clear for the viewer.

white beard geek

In 2:37 the arrow lengthens in a weird way - I guess that's an error :)

i really like that method of adding one positive and one negative review and calculating the percentage, is there a name for that?

kendall
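
The method asked about here is Laplace's rule of succession, as named in the reply above. As a minimal sketch, with hypothetical review counts and an illustrative function name:

```python
from fractions import Fraction


def rule_of_succession(positive: int, total: int) -> Fraction:
    """Estimate after pretending there was one extra positive and one extra negative."""
    return Fraction(positive + 1, total + 2)


print(rule_of_succession(10, 10))   # 10 of 10 positive -> 11/12
print(rule_of_succession(48, 50))   # 48 of 50 positive -> 49/52
```

A perfect 10 out of 10 comes out at about 92%, below the 94% or so for 48 out of 50, which matches the intuition that a small sample deserves less trust.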

Thank you. This is answering questions I've wondered at for a while, on how to calculate things that are too ridiculous to actually calculate (eg "if you toss a coin 1 billion times, how close should it come to 50-50?" and therefore "how far askew is within tolerance?"). In fact, one specific example where I would like to apply this is my D&D server's dice roller. To test its randomness, I ask it for some large number of simulated rolls of a 20-sided dice. Based on the number of times it comes up with each result, I want to be able to state how confident I am that it is fair - ie that the true probability of any specific roll is exactly 5%. Time to get mathy!

Rosuav
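
For the dice-roller question, one common check is Pearson's chi-square statistic. Note that this is a frequentist test rather than the Bayesian updating from the video, and it gives a reject/don't-reject answer rather than a probability that the die is fair. The counts below are invented for illustration, and the critical value is the standard 5% threshold for 19 degrees of freedom.

```python
def chi_square_statistic(counts):
    """Pearson's statistic for observed face counts against a fair die."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# 5% critical value of the chi-square distribution with 19 degrees of freedom.
CRITICAL_19_DF = 30.144

fair_looking = [50] * 20              # hypothetical: 1000 rolls, perfectly even
loaded_looking = [40] * 19 + [240]    # hypothetical: 1000 rolls, one face favored

for counts in (fair_looking, loaded_looking):
    stat = chi_square_statistic(counts)
    verdict = "reject fairness" if stat > CRITICAL_19_DF else "consistent with fair"
    print(stat, verdict)
```

A statistic below the threshold does not prove the roller is fair. It only means the observed counts are not surprising under fairness at the 5% level.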

At 9:30 you mention that a uniform prior represents no preexisting knowledge about the data, but this is kind of misleading, because the uniform prior actually already carries some previous knowledge, visible in its pseudocount representation (1,1). It would be awesome if you could do a section or an extra video about non-informative priors like the Haldane Prior ( https://en.wikipedia.org/wiki/Prior_probability#Uninformative_priors ) which would lead to a s^(-1)(1-s)^(-1) prior distribution for the binomial model or pseudocounts (0,0), which intuitively nicely compares to the uniform distribution with pseudocount (1,1) and your earlier mentioning of Laplace's rule of succession. E.T. Jaynes "Probability Theory" is a great reference if you want to go deeper into the topic. "Data Analysis: A Bayesian Tutorial" by D. S. Sivia is also great and not such a heavy read as E.T. Jaynes.
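
The pseudocount reading in this comment can be made concrete by comparing posterior means. The sketch assumes a Beta prior whose two parameters act as pseudocounts, as the comment describes, and hypothetical data of 9 heads and 1 tail. The function name is illustrative.

```python
from fractions import Fraction


def posterior_mean(prior_heads, prior_tails, heads, tails):
    """Mean of the Beta posterior, reading the prior as pseudocounts."""
    return Fraction(prior_heads + heads,
                    prior_heads + prior_tails + heads + tails)


heads, tails = 9, 1   # hypothetical data
print(posterior_mean(1, 1, heads, tails))   # uniform prior, pseudocounts (1, 1): 5/6
print(posterior_mean(0, 0, heads, tails))   # Haldane prior, pseudocounts (0, 0): 9/10
```

With the Haldane prior the posterior mean is just the observed frequency, while the uniform prior pulls it toward 1/2, which is the sense in which (1, 1) already carries some information.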

I don't fully understand the connection between normalizing the distribution and Bayes' formula tbh. (I have watched your previous video on Bayes)

At 0:50 you call it confidence interval, but maybe this would be a great opportunity to minimize confusion by using the bayesian term credible interval here?

at 2:20 are those the same arrows from the Measure Theory vid from several years back?

Bpendragon

The animation of the bayesian updating is amazing. Feel free to take more time and slow down on those especially insightful visual learning moments. That is the moment when theory and intuition come together. I really love them! :)

IC, actually 12 min, but then just continues with music and black screen for another 19...

Right you are, I'll switch it to closed on both. Thanks for the catch!

3blue1brown

Indeed, I will mention conjugate priors in part 3.

3blue1brown

Love the Nikola car factory!

Wow, long video! (Good! Not a complaint.) I wonder if an opposite approach could also be shown or explored, in contrast to the "totally unknown" initial prior (and then refining as data comes in). Could this be compared to the opposite process or initial approach: A strong prior (e.g. of a "fair coin," at 50/50) and then updating that if the data indicates otherwise.

Nice video; script did not feel redundant to me. At 6:30 you use open intervals: P(0.8 < s < 0.85). At 8:17 you use closed intervals: P(0.60 <= s <= 0.80). Is there a reason for the difference? Intuitively it feels like they should be half-open intervals, but I am not sure. Having open and closed intervals at different places without explanation seems confusing.

Pradeep Madhavarapu
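
On the open versus closed question: assuming s has a probability density f, the choice does not change the value, because any single point carries zero probability.

```latex
P(a \le s \le b) = P(a < s < b) = \int_a^b f(s)\,ds,
\qquad \text{since } P(s = a) = \int_a^a f(s)\,ds = 0
```

So using one convention throughout is a matter of consistency on screen rather than correctness.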

I support the use of the term 'confidence interval' to mean this Bayesian concept. it is both closer to what people intuitively think that phrase means, and also a more powerful formal reasoning tool than the frequentist interpretation.

john kraemer

my jaw literally dropped when I saw the iterative rescaling thing in the opening. at first I thought, wait, it can't be that simple, then I thought it through for a second and I was like, oh, yep, it is that simple. wouldn't have occurred to me in a hundred years – possibly the most on point use of your math animation system I've seen so far.

john kraemer

Concerns as I go: Confidence intervals are frequentist, not Bayesian. The general public's understanding of frequentist confidence intervals is already rather poor, and using the specific phrase "confidence interval" to describe an entirely different object might be confusing (it's an "entirely different object" because frequentists do not use prior distributions - a frequentist confidence interval represents something like "the range in which the p value is less than 1 - x" where x is the confidence level, but then you also have to explain p values, which is even harder). I liked the geometric explanation of Bayes rule and PDFs. The script does then get a little redundant when you go into Bayesian updating, but I agree with you that it may be helpful to a student who struggled to follow the previous example.

Kevin

This is great! Re: redundancy - seems fine to me, the different visuals help. Also I'm surprised you didn't mention "conjugacy" once, but maybe that's for part 3. I would love to see you go into approximate inference methods in the future, though these are active areas of research. A lot of linear algebra concepts come into play when you get into these methods too; matrix decompositions, Hessians, etc etc and it's cool to see how they play with the notion of stochastic functions. Getting a spatial intuition of what stochastic optimization is doing esp. in higher dimensions would also be really cool to see from this channel.

jpchen

Somewhere in listening to this I had a random thought about conspiracy theories and what logical error they're making, and whether it could be expressed as some kind of function.

Mario Nigrovic

I had to pause several times to take it in. I definitely didn't think it was too redundant.

I'm listening at 0.75 speed. Is it just me or is this really fast?

Doug Fort

