Andy Matuschak


Quantum Country’s suspiciously flat forgetting curves

Since 2019, I’ve been running a series of randomized controlled experiments on Quantum Country readers. The main reason I haven’t published these is that I don’t understand what’s going on. Readers are forgetting very slowly. Too slowly! I’ve been crossing off theories with each experiment, but the data seem to defy a core assumption of memory systems: that we forget more and more over time, along a steep curve. This has left me awfully confused all year, but I now have one theory which might explain what’s going on. If I’m right, it’ll mean that general-purpose memory systems won’t be able to reuse the approaches which traditional vocabulary-focused memory systems have used to evaluate and improve themselves.

Please note: this is an informal discussion of data from Quantum Country. The analysis is preliminary and shouldn’t be cited or excerpted in other work. I’m working with the garage door up here.

Quantum Country’s ecological niche

Quantum Country is a sort of observatory for human memory—or at least a very limited slice of it. There have of course been countless studies of human memory, but Quantum Country lets us observe an underserved niche:

This niche is important because it represents the sort of meaningful learning which most adults must do in the course of their own creative work.

Of course, some professionals use tools like SuperMemo, Anki, and Mnemosyne in the course of their work, but analyses of those data have an important limitation: they have only one data point per item per repetition, because items are (generally) authored by each user. Developers must rely on significant model assumptions to make sense of this sparse data. With Quantum Country, we can (I hope!) analyze how large cohorts of readers perform on the same set of questions, and largely avoid these model assumptions. Duolingo and Quizlet can make the same move, but mostly for vocab/fact-oriented material, rather than conceptual topics. Meanwhile, datasets from academic studies are almost exclusively restricted to artificial and classroom contexts—though, it should be noted, their data is often much cleaner and better controlled!

For me, the point of all this data is to learn something about how memory systems and human memory work, so that we can design better systems, so that we can give people superpowers. I’m not studying Quantum Country data to understand how people learn Quantum Country, but to indirectly understand how people might learn with these systems in general. At a high level, we want to answer questions like:

These of course shatter fractally. Two related sub-questions I’ve been trying to answer:

  1. What’s the counterfactual? Without the review mechanisms, to what extent do people remember the text’s key details?
  2. Is Quantum Country’s current schedule too short? Too long? For which readers and questions?

These turn out to be much more difficult to answer than I’d expected!

Schedule experiments: the basics

In this post I’ll focus on the most recent experiment I’ve run, since at least for the questions I’ve described above, it’s better controlled.

Each new reader is assigned to one of four schedules, having initial intervals of one week, two weeks, one month, and two months. That is, a reader in the “two week” condition will be prompted to review a question two weeks after they initially answer it in the essay. I’m over-simplifying a bit here, but this is enough to get us started.
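As a rough sketch of how readers might be split across the four conditions: the hashing approach, the `reader_id` parameter, and the exact day counts for "one month" and "two months" are all assumptions made for illustration, not a description of how Quantum Country actually assigns readers.

```python
import hashlib

# Initial intervals, in days, for the four conditions described above.
# 30 and 60 days for "one month" and "two months" are assumed values.
INITIAL_INTERVAL_DAYS = {
    "1 week": 7,
    "2 weeks": 14,
    "1 month": 30,
    "2 months": 60,
}


def assign_condition(reader_id: str) -> str:
    """Map a (hypothetical) reader ID to one of the four conditions.

    Hashing gives a stable, roughly uniform assignment without stored state.
    """
    digest = hashlib.sha256(reader_id.encode("utf-8")).digest()
    conditions = sorted(INITIAL_INTERVAL_DAYS, key=INITIAL_INTERVAL_DAYS.get)
    # 256 is divisible by 4, so the first byte splits evenly across conditions.
    return conditions[digest[0] % len(conditions)]


def first_review_delay_days(reader_id: str) -> int:
    """Days between answering a question in-essay and its first review."""
    return INITIAL_INTERVAL_DAYS[assign_condition(reader_id)]
```

The property that matters for the experiment is that assignment is independent of anything about the reader, so differences between conditions can be attributed to the interval.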

These conditions, taken together, should help us find a “sweet spot” for an initial review. Additionally, the two month condition should tell us something about the counterfactual: what happens if a reader doesn’t review for two whole months?

And so, here are the median reader’s accuracy rates at their first delayed review, by condition (25th and 75th %ile reader accuracies in parentheses):

These data are constrained to the first essay of Quantum Country—where we have the most data—and represent readers who have collected at least 50 questions and reviewed 90%+ of those they collected. (You’ll notice I’m intentionally eschewing models and statistical tests in this article. That’s because we’re discussing effects which I expect to be large enough to be obvious at a glance!)

Only a handful of readers in the 2-month condition have fully completed their first review, so I can’t report per-reader statistics for that condition yet, but we can get some sense of what’s going on by lumping all reviews in each condition into one big bucket and looking at the fraction of remembered questions in each bucket:
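The two summaries used in this section, per-reader quartiles and pooled accuracy, can be sketched as follows. The `(reader_id, condition, remembered)` record layout and the function names are hypothetical stand-ins, not Quantum Country's actual schema.

```python
from collections import defaultdict
from statistics import quantiles


def is_eligible(collected: int, reviewed: int) -> bool:
    """Inclusion rule from the text: 50+ questions collected, 90%+ reviewed."""
    return collected >= 50 and reviewed * 10 >= collected * 9


def per_reader_accuracy(reviews):
    """Return {condition: {reader_id: fraction of reviews remembered}}."""
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [remembered, total]
    for reader_id, condition, remembered in reviews:
        tally = tallies[condition][reader_id]
        tally[0] += int(remembered)
        tally[1] += 1
    return {
        condition: {reader: hit / total for reader, (hit, total) in readers.items()}
        for condition, readers in tallies.items()
    }


def reader_quartiles(accuracies):
    """25th, 50th, and 75th percentile reader accuracy (needs 2+ readers)."""
    return tuple(quantiles(list(accuracies.values()), n=4, method="inclusive"))


def pooled_accuracy(reviews):
    """Lump all reviews in each condition into one bucket: remembered / total."""
    tallies = defaultdict(lambda: [0, 0])
    for _reader_id, condition, remembered in reviews:
        tallies[condition][0] += int(remembered)
        tallies[condition][1] += 1
    return {condition: hit / total for condition, (hit, total) in tallies.items()}
```

The per-reader view weights every reader equally, while the pooled view weights every review equally, so the two can diverge when some readers have completed far more reviews than others.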

We can add one more data point by adding readers from early 2019, when the first review came after just one day. This isn’t a clean comparison, both because there may be cohort effects and because these users didn’t have a “retry” feedback mechanism, but just to get a sense:

This is an almost unbelievably gentle forgetting curve: from 89% to 81% across two months! This shallow slope is at the heart of my confusions, but first let’s talk about the parts which make sense.
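To put a number on how gentle that slope is, one can ask what half-life a simple exponential would need in order to pass through both points. This is only a back-of-the-envelope sketch: the exponential form is an assumption, "two months" is taken as 60 days, and the 1-day point comes from a different cohort, as noted above.

```python
import math


def implied_half_life_days(r_early, t_early, r_late, t_late):
    """Half-life of R(t) = A * exp(-k * t) passing through two (t, R) points."""
    k = math.log(r_early / r_late) / (t_late - t_early)  # decay rate per day
    return math.log(2) / k


# 89% at 1 day, 81% at roughly 60 days (figures quoted in the text).
half_life = implied_half_life_days(0.89, 1, 0.81, 60)
print(f"Implied half-life: {half_life:.0f} days")
```

Under these assumptions the implied half-life comes out at a bit over 430 days, that is, more than a year for recall to halve.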

Initially-forgotten questions should get tighter schedules

The data get more predictable if we look exclusively at delayed recall accuracy of questions which readers forgot when first answering the question in the essay. Such questions will first be assigned again one day later, repeatedly if necessary, until the reader remembers; after that, the question will be assigned again after a delay. Recall accuracies, by delay:

This data paints a more familiar picture, and it suggests a fairly clear course for memory system authors. If the goal is to ensure that recall rates stay above, say, 90%, then when a reader forgets a question initially, we should assign it to be reviewed again quite soon. The automatic one-day-later “retry” session is not enough to support lengthy subsequent intervals.

In fact, this effect compounds. If readers forget a question in this first delayed review, then it’s assigned to them again one day later. Readers with longer initial intervals were less likely to recover in that subsequent session—that is, they were more likely to forget yet again. Recovery rates by first review session delay (note that these sample sizes are now getting small):

OK, so that’s one pretty clear implication for memory system designers that we’ve extracted: the initial schedule should contract decisively when a prompt is forgotten in-essay.
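A minimal sketch of what such a contraction rule could look like. The one-day retry comes from the text; the two interval constants are hypothetical placeholders, since the text does not say how short "quite soon" should be.

```python
from datetime import date, timedelta

RETRY_DELAY = timedelta(days=1)                # from the text: retry one day later
NORMAL_FIRST_INTERVAL = timedelta(days=14)     # hypothetical value
CONTRACTED_FIRST_INTERVAL = timedelta(days=3)  # hypothetical value for "quite soon"


def first_delayed_interval(forgotten_in_essay: bool) -> timedelta:
    """Use a shorter first interval for prompts that were forgotten in-essay."""
    return CONTRACTED_FIRST_INTERVAL if forgotten_in_essay else NORMAL_FIRST_INTERVAL


def next_due(today: date, remembered: bool, forgotten_in_essay: bool) -> date:
    """Due date after an in-essay answer or a one-day retry session."""
    if not remembered:
        # Keep retrying daily until the reader remembers.
        return today + RETRY_DELAY
    return today + first_delayed_interval(forgotten_in_essay)
```

The only design decision encoded here is that the in-essay outcome is carried forward, so a successful retry does not put the prompt back on the normal schedule.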

The trouble begins: when questions are initially remembered

But questions aren’t often forgotten in-essay. Aggregated across readers in these four conditions, in-essay accuracy rates are 91-92%. So what about the common case where questions are initially remembered? Here’s where the trouble comes in.

First review recall rates for questions remembered in-essay:

As in the previous section, we can irresponsibly use data from 2019 readers to add a data point at 1 day: 90% (N = 2207 readers, 109031 reviews). As before, note that this isn’t a well-controlled comparison.

That forgetting curve is unreasonably shallow! Sure, if we want to target a 90% recall rate, this data suggests we should schedule the first review after less than one week. But each review has a cost; if readers can skip an initial review or two in exchange for a few percentage points lower accuracy, I suspect most would take that trade. After all, each full repetition of the first essay’s 112 questions takes about 25 minutes. How should we think about this?

One consideration is the “recovery” effect we saw in the previous section. Do the people who forget after longer periods struggle more to recover in the following session, relative to those who forget after one week? Here are the recovery rates (i.e. accuracy rates one day after forgetting in the first delayed session, after remembering in-essay):
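For concreteness, here is one way the recovery rate defined above could be computed from per-item histories. The dictionary keys are a hypothetical record layout; sample sizes are returned alongside each rate because these buckets get small.

```python
from collections import defaultdict


def recovery_rates(histories, remembered_in_essay=True):
    """Accuracy in the retry session one day after a forgotten first review.

    histories: iterable of dicts with (hypothetical) keys
      'condition'    - label of the initial interval
      'in_essay'     - True if remembered while reading
      'first_review' - True/False result of the first delayed review
      'retry'        - True/False result one day later, or None if not yet done
    Returns {condition: (recovery_rate, sample_size)}.
    """
    tallies = defaultdict(lambda: [0, 0])  # [recovered, total]
    for h in histories:
        if (
            h["in_essay"] == remembered_in_essay
            and h["first_review"] is False
            and h["retry"] is not None
        ):
            tallies[h["condition"]][0] += int(h["retry"])
            tallies[h["condition"]][1] += 1
    return {c: (hit / total, total) for c, (hit, total) in tallies.items()}
```

Passing `remembered_in_essay=False` gives the corresponding rates for the initially-forgotten questions discussed in the previous section.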

This doesn’t look very compelling. Maybe there’s trouble at 2 months, but I’d like to see more samples first. It sure looks here like we can defer the first review for a month without real penalty.

Another reason to delay initial reviews is to invoke the spacing effect, but I’ll skip discussing that in this post. Suffice it to say that (with sparse data) I don’t yet observe a spacing effect in the interactions between first and second session intervals.

What about slicing by question? Looking at the ten questions in the first essay with the lowest initial accuracies, but where readers remembered the answers to those questions while reading, we still see a sharp forgetting curve in that first delayed review:

This is pretty compelling, but the curve rapidly disappears. Here are the accuracies for the next ten “hardest” questions, by initial delay:

We don’t have enough data to extract trustworthy forgetting curves on a per-question basis, but the flat curves continue with increasing intercepts for the remaining groups of ten. The median ten are flat at 82%; the easiest ten are flat at 95%.
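The slicing described here, ranking questions by in-essay accuracy, grouping them in tens, and pooling each group's first-review results by delay, might look roughly like this. The input shapes and function names are assumptions made for illustration.

```python
from collections import defaultdict


def difficulty_groups(in_essay_accuracy, size=10):
    """Sort question IDs from hardest (lowest in-essay accuracy) to easiest,
    then chunk them into consecutive groups of `size`."""
    ordered = sorted(in_essay_accuracy, key=in_essay_accuracy.get)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def accuracy_by_delay(first_reviews, question_ids):
    """Pooled first-review accuracy per delay for one group of questions.

    first_reviews: iterable of (question_id, delay_label, remembered), limited
    to first delayed reviews of questions that were remembered in-essay.
    Returns {delay_label: (accuracy, sample_size)}.
    """
    wanted = set(question_ids)
    tallies = defaultdict(lambda: [0, 0])  # [remembered, total]
    for question_id, delay_label, remembered in first_reviews:
        if question_id in wanted:
            tallies[delay_label][0] += int(remembered)
            tallies[delay_label][1] += 1
    return {d: (hit / total, total) for d, (hit, total) in tallies.items()}
```

Pooling ten questions per group is a compromise: it trades per-question resolution for enough samples in each delay bucket to see a slope at all.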

So questions vary in difficulty, but don’t seem to decline in recall rates as more time passes. What should we take from this? Sure, we could schedule “hard” prompts earlier, but would that actually do anything? Except for the ten hardest prompts, this data shows no improvement in recall with shorter intervals.

One way to interpret this is that what matters most is simply that people practice; the timing is not terribly important. Indeed, we previously found that once the median reader remembers an answer after a delay (of any length), their recall rate over the following year of reviews is 95%!

But I find myself simply not believing this data. The forgetting curves are too flat. This just doesn’t reflect my experience. If I don’t practice something I’ve learned for two months, I’m much less likely to remember it than I would be one week later. Our data suggest that after the first successful delayed recall, we could safely delay subsequent reviews for many months. I just don’t buy it.

What’s going on here?

My theory: cuing effects

I think the picture gets clearer if you look at a specific question. Consider this question (which has ~75th %ile in-essay recall accuracy):

This task strongly shapes the retrieval you perform: it makes you look for connections between the normalization condition and measurement probabilities. You might have instant access to this answer; but you might also consider the question on the spot and infer that this is the only reasonable answer.

The accuracy rates we collect don’t distinguish between these two possibilities. But the difference matters! If we asked you instead to solve some problem which indirectly relies on this property, you might not make a leap you need to make.

What we really care about here is fluency: your readiness to think interesting thoughts, to solve interesting problems, to notice connections and apply your knowledge creatively. You want to train a richly patterned reasoning apparatus.

My hunch is that even though cued recall doesn’t seem to dip substantially between 1 week and 2 months, free recall and transfer tasks would show a sharper curve. The kind of fluency I just described does decline. And if you could see that decline, you might want to schedule the next review earlier.

If this theory is right, it means Quantum Country and general-purpose memory systems need to follow a very different path than most prior work in this space. Following SuperMemo’s lead, most systems generally think about scheduling in terms of a simple threshold: schedule a review when the estimated probability of recall drops to 90%. That way, your expected recall rate at any given moment should remain at least 90%.
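The threshold heuristic can be illustrated with a deliberately simplified model. The exponential form R(t) = exp(-t / S) and the "stability" parameter S are assumptions made here for illustration; SuperMemo's actual algorithms are more elaborate.

```python
import math


def recall_probability(days_elapsed: float, stability_days: float) -> float:
    """Assumed forgetting curve: R(t) = exp(-t / S)."""
    return math.exp(-days_elapsed / stability_days)


def interval_for_target(stability_days: float, target: float = 0.9) -> float:
    """Days until estimated recall falls to `target`: solve exp(-t / S) = target."""
    return -stability_days * math.log(target)


# Example: an item with an assumed stability of 30 days.
interval = interval_for_target(30)
print(f"Review after about {interval:.1f} days")
```

With S = 30 days, the 90% threshold is reached after about 3.2 days. The whole loop depends on the recall estimate tracking something real, which is exactly what is in doubt if cued recall stays flat while fluency declines.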

I think this is a reasonable heuristic for language learning, facts, and term-definition pairs. You can’t usually re-derive those answers on the spot. The goal is to produce the answer from memory. The effect of explicitly cuing the retrieval should be much smaller than we’d observe for conceptual material like Quantum Country’s.

If we can’t approximate a conceptual detail’s depth of encoding with cued recall rates, we can’t use the traditional scheduling heuristics. We need to establish some other way to drive the control loop.

Response time seems like an interesting proxy for fluency, but I’ve had surprisingly little success finding patterns in Quantum Country readers’ response times.

A more intrusive approach would be to insert questions which require readers to indirectly use knowledge in some new context. If it’s true that cuing effects are particularly significant for conceptual knowledge, we should see a decline in transfer performance over time even if recall accuracies remain steady. I’d like to do something like this anyway, to establish the flexibility of the knowledge reinforced by the review system.

Another way to test this theory is to consider questions which I expect to be more “rote” and less conceptual. These should have a more pronounced forgetting curve. For example, here are the recall rates for the questions asking for the matrix values of the X, Y, Z, and H gates:

Not a lot of samples here, but this data doesn’t support my theory. The flat curve between 1 week and 1 month still strikes me as highly implausible. I suppose that people could be re-deriving these values from what they remember of these gates’ desired effects, but I don’t find that especially plausible.

One simple answer here, and perhaps for this whole confusion, is that people are simply lying. Quantum Country is self-graded. Maybe people are inappropriately marking answers as remembered? I don’t find this plausible. Remember, the median reader has a self-reported accuracy of 85-87% from 1 week to 1 month. That median user is still marking plenty of questions as forgotten. What’s confusing is why the median 1-month user doesn’t mark more questions as forgotten than the median 1-week user.

Another important factor distorting my data is survivorship bias. Readers who come back and review after 2 months are probably more conscientious than readers who reviewed after 1 week. They probably care more about the topic and read more closely. This effect is probably inflating the performance of the later intervals, but I don’t have a good way to establish by how much.
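A toy simulation can show how this kind of selection could flatten an observed curve. Every number below is invented to illustrate the mechanism: the "diligence" trait, the accuracy formula, and the return probability are all hypothetical, and nothing is fitted to Quantum Country data.

```python
import random


def simulate_condition(delay_days, n_readers=20000, seed=0):
    """Return (true_mean_accuracy, observed_mean_accuracy) for one condition."""
    rng = random.Random(seed)
    everyone, returners = [], []
    for _ in range(n_readers):
        diligence = rng.random()  # hypothetical latent trait in [0, 1)
        # Assumed: diligent readers recall more, and everyone forgets with delay.
        accuracy = 0.60 + 0.35 * diligence - 0.0015 * delay_days
        # Assumed: the longer the delay, the more returning depends on diligence.
        p_return = diligence ** (delay_days / 7)
        everyone.append(accuracy)
        if rng.random() < p_return:
            returners.append(accuracy)  # only returners are ever observed
    return sum(everyone) / len(everyone), sum(returners) / len(returners)


true_1w, seen_1w = simulate_condition(7)
true_2m, seen_2m = simulate_condition(60)
print(f"true drop:     {true_1w - true_2m:.3f}")
print(f"observed drop: {seen_1w - seen_2m:.3f}")
```

In this toy setup the true mean accuracy falls by about eight points between one week and two months, while the accuracy among readers who actually return stays nearly flat. That says nothing about the size of the real effect, only that selection of this shape is capable of hiding a genuine decline.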

I think my next step here is to dig deeper into the literature, which does include numerous memory experiments focused on conceptual knowledge and transfer learning. Perhaps some of those methods or discussions can help me here.

————————

Thanks to Gary Bernhardt for helpful discussions about this topic. And thank you to you all for your ongoing support, which makes it possible for me to conduct long-term studies like this. We’re about 3/4 of the way to the equivalent of an NSF CAREER grant now, and I’m continuously shocked that such a thing might be possible. Happy holidays!

Comments

I reread this article today and it reminded me of a theory from SuperMemo: https://supermemo.guru/wiki/Two-component_model_of_memory_stability. The flat curve is related to "stability tends to increase dramatically with review early in the process." It results from "a memory finds its place in the overall knowledge structure of the individual." The initially forgotten items, by contrast, failed to do so. Maybe it will be helpful to you.

Jarrett Ye

I guess those questions initially remembered are known to readers before they read the essay. In other words, the readers' storage memory about those questions has already formed. So the forgetting curve is flatter. I'm working for a language learning app, and those words initially remembered have flatter forgetting curves, too. Maybe it is related to the Expert response heuristic.

Jarrett Ye

Got it—thanks, Rob!

Andy Matuschak

Sort of, depending on what's being measured. For an end of lesson survey, we try to count percent respondents as well as whatever questions we want to ask. Those who don't fill out a survey are important, and less likely to be hitting whatever goals we want, since those track with engagement / attention. Not exactly a 'fix' for the selection effect, but treating both numbers (e.g. % respondents and NPS) as important indicators means we're less likely to overindex on the selection-biased NPS score. A common frame for me is to view the completion data from a series of tasks like you would for a really long marketing funnel. At each task, there's some number of students that drop off, and the goal of the analysis is to try to model the contribution from the previous tasks towards the churn. This is pretty imprecise on its own, but combined with actually looking at the tasks and asking where students got confused / demotivated / frustrated, it can be very useful in practice. For things where the whole population had to successfully complete the task, we found time to complete was a nice indicator, since a lot of the tasks didn't have clean 'remembered / didn't remember' outcomes. Caveat -- all this is noisy data, not research-quality, so it's super likely that it doesn't generalize and that we drew unfounded assumptions. Lack of controls, small n, and likely basic stats mistakes too...

Robert Cobb

Yeah, totally. How have you gotten around this? Institutional contexts which assure everyone does everything?

Andy Matuschak

> Another important factor distorting my data is survivorship bias. Readers who come back and review after 2 months are probably more conscientious than readers who reviewed after 1 week. They probably care more about the topic and read more closely. This effect is probably inflating the performance of the later intervals, but I don’t have a good way to establish by how much.

This was where my mind jumped when I was reading. More than just caring more about the topic and reading more closely, missing questions feels bad, and creates a bit of an 'ick' factor that would steer folks who forget away from returning to the tool. I don't know what fraction of questions users ought to get 'right' to optimize for retaining users, but I'm super wary of selection effects in measuring "curriculum", broadly speaking.

Robert Cobb

It is also my experience that two months is way too long. After two months I would probably feel detached from the material under review. I'd be more inclined to answer approximately. I'd rate nebulous answers as successful, which otherwise I would have considered failures. Regarding the model fitting project. I'm grateful for the opportunity, but unfortunately right now I don't have the bandwidth.

Yes, that's a very interesting possibility!

Andy Matuschak

"I keep getting troubled by a core instinct: that this just doesn't align very well with my experience! I really do experience forgetting over time, and at a fairly steep rate! Of course, item difficulty matters a lot too, but my sense is that two months is an awfully long time for a first review!" Is it possible that this is because you are juggling thousands of cards on many different topics? To someone whose only spaced repetition project is quantum country, every single card from that project might wind up more vivid and lasting in memory.

Unfortunately, I don't yet have enough data to extract reliable per-card curves, but one commonality I noticed is that the "hardest" ten cards, taken together, exhibit a clear curve; whereas the others do not. As I briefly described in the piece, I've tried bucketing less-cue-ish prompts together to see if they exhibit a clearer curve, but no luck there yet.

Andy Matuschak

Thanks for this, Giacomo!

To your first point: you're definitely right. Some initial analysis here: https://notes.andymatuschak.org/z2GQAjbUCiSiru4tTFXBBeH3UDK9xPdmLvhW

To your second point: yes! I enjoyed this paper and corresponded a bit with Lindsey. You're right that these results point in the same direction: variations in ability and item difficulty dominate gross memory-over-time-effects. (Of course, their models which account for both perform even better!) This also aligns with my simplistic notion that "practice is all that matters; timing's overrated." And so I suppose the next thing to do here is to try to fit models to this data and see where the residuals land. I've just been resisting this kind of model-making as long as possible! (Incidentally, if you, or anyone else, feel like collaborating on that front, this seems like a good multi-person project.)

I keep getting troubled by a core instinct: that this just doesn't align very well with my experience! I really do experience forgetting over time, and at a fairly steep rate! Of course, item difficulty matters a lot too, but my sense is that two months is an awfully long time for a first review!

Andy Matuschak

Thank you, Stian!

  1. To your first point, see my reply to Glenn above.
  2. Yes, this might be! Incomplete data from user interviews suggests this isn't terribly common (I'd say ~25% of interviewees were looking at other material), but it may explain a subset of users' experiences.
  3. Yes, this is a great point! In fact, Pan and Rickard suggest that something like this is generally what's going on in many transfer pathways. Check out the very interesting figure 5 in https://rickardlab.ucsd.edu/pdf/PR_2018.pdf

I dig the midterm design!

Andy Matuschak

Thank you, Glenn!

  1. This is possibly true! And indeed, what we see is roughly what you'd expect if it were: real forgetting curves for initially-forgotten items; flat curves for initially-remembered items. Playing this out further: how should we interpret the fact that harder questions have recall rates of, say, 75% across all categories? If the information is "newish" to some subset of people (close to the borderline of retrievability), shouldn't more people cross that line as time passes? This data suggests there's a cluster of 75% of readers far above the line (far enough above not to decay below) and 25% far below the line (far enough below that there's no difference between 1 week and 2 months in performance). Maybe that's plausible! I think I'd believe it more if I had strong 1-day data (surely some of that lower cluster should move above the line?). I can make that happen.
  2. Right, yes! But for me, this still isn't satisfying. The median user marks ~15% of prompts as forgotten, so they're not totally delusional. Why don't they mark a greater fraction as forgotten when more time passes? One theory is similar to the one I proposed above: bifurcation. 15% of prompts are structured such that people *cannot delude themselves* as to whether they remembered. The others are more amenable to delusion.

Andy Matuschak

Yes, I also had the same thought; QC doesn't ask readers what they know pre-essay and so it may be that the curves are flat because the items are already committed to memory, not starting from scratch

Do the cards that have the steepest forgetting curves have anything in common with each other?

Hi Andy! Thanks for sharing this. It would be interesting to run the same experiment with no in-essay prompts and compare the results. Answering prompts while reading introduces a desirable difficulty. It might be affecting the forgetting curve. Something similar was observed in the more traditional context of language learning. In Mozer, Lindsey, 2016 - "Predicting and Improving Memory Retention: Psychological Theory Matters in the Big Data Era" (Section 5), they compare the accuracy of several models in predicting the student's recall success at the first review (3 min - 27 days delay since studying). The data aggregates multiple items and multiple students. I was surprised to see that an IRT model, which does not account for the time lag, makes way more accurate predictions compared to a power law forgetting curve (fitted homogeneously across items and students). The experimental setting is different but, I believe, the results point in a similar direction.

Thanks a lot for sharing these early thoughts, very fascinating. I had two thoughts, along the lines of Glenn Willen - first that this kind of resource probably attracts a lot of people who are interested in quantum mechanics anyway. This could play out in one of two ways. First, as Glenn said, they might already have known some of the facts/concepts before reading the essay. This would of course change "everything" - seems like your analysis depends on the idea that everything in the essay was "new" to people. (It seems important to account for people's knowledge structure, even if every single fact that you do SRS on is technically "new", people's underlying background in physics/maths etc would make certain things much easier/harder to understand/recall - they would make more/less connections to existing material etc). Second, it might be that some of the people reading the essay are inspired and keep learning about quantum mechanics! Indeed, one would hope so :) So perhaps they've actually come across these concepts more often than your review schedule can account for?

The final explanation that you offer is the cueing effect of the prompts. Although there could also be a cueing effect of poorly written prompts when recalling specific facts, that still seems much easier to get right, but when you are trying to get at concepts, not only might the cueing effect be stronger, but also I wonder if you are in fact "exercising the entire concept" in your mind when you are doing a repetition... If I forgot the capital of Angola, and get a question, even if I get it wrong, I'm much more likely to get it right the next time... But if I forgot what a qubit is, and a question is "how would a qubit interact with a quazzar", and I answer wrongly, and see that the answer is "it would quibricate", I might say "Oh right, I now remember all the fundamental properties of qubits", but you might also say "hm, OK... no idea what that means, but I'll try to remember it" - which will not only be much harder since you're trying to remember a disconnected fact, but is also not what we want - since you are not remembering the underlying concept - in fact you'd probably have to click on a link to go back to the original learning material to "relearn" (probably quicker the second time around) the concept...

A professor at UofT did something like this relearning with a psychology midterm where students getting a wrong answer had the option of clicking on a link and jumping to the part of the lecture recording/notes/slides where that concept was explained. They could then attempt the same question again - if they got it right the second attempt, they got 50% of the score. I thought that design was really neat - coming out of an exam knowing more than when you entered... https://www.insidehighered.com/blogs/steve-joordens

A few thoughts:

- Probably some fraction of people working through Quantum Country have existing familiarity with the material. If those people are answering questions on material they already know going in, that means they'll be on a very different part of the forgetting curve than you expect. I don't know how big a subpopulation they might be.
- While you prompt people to think about the answer before clicking to see if they were right, you don't really have a way to enforce that they do so. They may not be _intentionally_ lying about whether they remembered, but they may be lying to themselves about it. They'll click "no" if they truly had no idea, but they might get a vague sense of the answer, click to reveal it, then click "yes" if it seems familiar / if they believe they could have produced it, even if they couldn't really (or not without a struggle.)

For the latter, it might be instructive to give people a box to type the answer into, before clicking to reveal. It doesn't need to be mechanically graded; it just helps them admit to themselves whether they actually knew it.

