Andy Matuschak

patreon


Too easy to be effortless

Now that a few Orbit experiments are in flight, I’ve spent much of the last month digging back into data from Quantum Country. I’m struck by a surprising problem: basically everyone remembers basically everything, basically all the time.

Feelings-driven optimization

How effortless can memory be?

At the limit, we can imagine automatically remembering everything we perceive. We might not want that—savants like Shereshevsky often report curse-like symptoms of their perfect memory. Perhaps we’d settle for the ability to remember or forget something as easily as moving a muscle. What would be true of such a world? Certainly schools would not exist as we know them, but what of workplaces and studios? What of relationships? Borges, Chiang, the Wachowskis, and other great science fiction authors have dramatized these implications, but I’m also interested in the mundane: shifts in the give-and-take of workplace collaborations; coincidences and contradictions suddenly more salient.

(Of course, effortlessness is just one of many useful lenses! A contrary lens points out that maybe effortfulness is exactly what you want from your interactions with memory. You want to constantly be questioning things you think you “know”; you want everything to stay molten so that you can form new connections and see things in new ways; etc etc…)

Even with today’s systems, memory is far from effortless. How close can we get? The usual approach is to treat this as an optimization problem, but I find it generative to recognize that effortlessness is a feeling. Powerful technologies feel like an extension of the body. The edges melt away; the space between intention and action closes. Strap a brick to your pencil, though, and it ceases to feel like part of your hand. Likewise, learning can seem effortless in an energetic discussion with friends, but in a boring study hall, the same ideas may demand more effort than you can muster.

This lens gives us a different way to think about how we might “optimize” tools for thought. What kinds of interactions create a sense of separation, of dutifulness, of boredom?

In any kind of computerized learning system (including spaced repetition systems), one reliable source of boredom is material which feels too easy. This material isn’t the good kind of effortless. Flipping through this stuff feels almost like speed-running a license agreement prompt in a software installer. “Yeah, yeah, I know, I know.” I don’t really have to think; I’m not really engaged; I resent being asked. Sometimes the problem is that I don’t actually care about the material, in which case I should really remove it (perhaps fuzzily). Quite often, though, I do really care about the material. I’d engage more seriously if it felt less trivial in that moment.

This observation devolves into a classic problem in learning technology: correctly estimating the state of a student’s knowledge to optimize a study plan. The difference is that if we hold onto our feelings-based lens, we don’t see optimization itself as the problem to be solved. Our central goal is a feeling of effortlessness. Model optimization is an instrumental lever for that feeling. But there are other levers. You can’t play The Witness without memorizing many complex rules, but you’ll do that naturally as you interact with the environment: memorization itself is not the effortful part.

Quantum Country’s over-easy effortfulness

Having paid this lofty penance, let’s turn our attention to the performance of an unusual memory augmentation system: Quantum Country.

Please note: this is an informal discussion of data from Quantum Country. The analysis is preliminary and shouldn’t be cited or excerpted in other work. I’m working with the garage door up here.

On the one hand, Quantum Country delivers on its promise to help people remember what they read. After the fifth repetition, most readers have been able to recall 95%+ of questions across intervals of more than a month. That’s pretty remarkable. In my past experiences reading textbooks, I’d be lucky to remember a fraction of the details after a month.

Another way to look at this is “maintenance cost.” To maintain the first essay’s 112 questions for the first year, the median reader performs 567 reviews, consuming ~1.5 hours. Readers report that the first essay takes 2-4 hours to read, so we can frame the first year’s reviews as a ~50% extra time cost these readers could choose to pay to durably remember all the key details from that essay. I expect the second year to have roughly half the time cost, but we don’t have the data for that yet.

The problem, I suppose, is that Quantum Country works “too well.” Basically everybody remembers basically everything basically all the time.

The trouble we’ll discuss begins at the start of what I call the “maintenance” phase. For a given reader and question, histories are generally clustered into two phases: an initial (usually short) “learning” phase, in which readers absorb the material enough to remember it across sessions; followed by a (much longer) “maintenance” phase, in which repetitions mostly serve to combat the erosion of forgetting. You can approximate the delineation pretty well by saying that people transition to the maintenance phase after their first successful repetition.

After the first successful repetition of a given question—once they’re in the “maintenance phase”—the median reader answers 95% of subsequent repetitions correctly. In fact, 82% of all first-year question histories contain zero forgotten answers after that point (which is indeed what you’d expect from the typical first-year repetition count given a binomial variable with p=0.95).
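As a rough check on that binomial remark, here is a minimal sketch. It assumes about four maintenance reviews per question in the first year, which is what the median reader's figures in this post imply (448 maintenance reviews across 112 questions), and it treats each review as an independent trial with p=0.95. Both are simplifying assumptions, not part of the underlying analysis.

```python
# Back-of-envelope check: probability of zero lapses across n independent reviews.
P_RECALL = 0.95                   # per-review success rate quoted in the post
REVIEWS_PER_QUESTION = 448 / 112  # assumption: median reader's reviews spread evenly

def p_no_lapse(p: float, n: float) -> float:
    """Chance that all n reviews succeed, if each succeeds independently with probability p."""
    return p ** n

print(round(p_no_lapse(P_RECALL, REVIEWS_PER_QUESTION), 3))
```

Under those assumptions the result is a little over 0.81, which sits close to the observed 82%.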

That’s a bit abstract. To make it more concrete: after their first successful repetition, the median reader forgets just 15 times out of 448 reviews over the following year, across the 112 questions in the first essay.

A whole year of diligent reviewing and just 15 misses! 433 successful recollections! The problem here isn’t exactly one of efficiency. Talking to readers, plenty of them would be (and have been) happy to pay a 50% time cost to thoroughly internalize the material. It’s not that 448 is too many reviews, or that it takes too long. The problem is that it feels tedious, like wasted time, to review material that you already know perfectly well. And that’s mostly what people are doing.

But actually, the forgetting is even more skewed than I’ve let on. If those 15 misses were drawn with equal probability from all the questions, it might not feel so bad: any question might be the one you miss today! As it happens, though, half of all long-term lapses come from just 12% of questions. Emotionally speaking, those are the questions which generate “oh, no, that question again…”. By contrast, the median question produces only one lapse for every ten readers across the entire first year of the “maintenance phase.” For 95% of questions, the median reader never forgets in the first year of the maintenance phase. So most reviews probably feel tedious and unnecessary.
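To make the "half of all lapses come from a small share of questions" statistic concrete, here is a small sketch of how such a figure can be computed. The function name and the sample counts are hypothetical, chosen only for illustration; they are not Quantum Country data.

```python
def share_of_questions_for_half_lapses(lapses_per_question: list[int]) -> float:
    """Smallest fraction of questions (worst first) that accounts for at least half of all lapses."""
    total = sum(lapses_per_question)
    if not lapses_per_question or total == 0:
        return 0.0
    running = 0
    # Walk questions from most-forgotten to least-forgotten.
    for rank, lapses in enumerate(sorted(lapses_per_question, reverse=True), start=1):
        running += lapses
        if running * 2 >= total:
            return rank / len(lapses_per_question)
    return 1.0

# Hypothetical lapse counts for ten questions, pooled across readers.
example = [10, 5, 2, 1, 1, 1, 0, 0, 0, 0]
print(share_of_questions_for_half_lapses(example))
```

In this made-up example, a single question out of ten accounts for half of the lapses.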

We might worry that perhaps everything’s fine for the median reader, but many less-capable readers are struggling. After all, questions are highly power-law distributed in the lapses they produce. But readers are not nearly so sharply distributed. Our 25th percentile reader forgets 35 times in 483 repetitions over the first year of maintenance. The 10th percentile reader forgets 59 times in 516 repetitions. And again, this forgetting is localized in a relatively small pool of questions. The vast majority of questions produce no forgetting, even for relatively less successful readers.

When forgetting does happen, it’s usually not that bad. One way to look at this is to ask how often questions are forgotten multiple times back to back, so that the reader fails to recall a prompt across an interval they could previously span. This happens almost never: on about 2% of first-year reader/question histories. So our “demonstrated retention” progress metric is a pretty good one. Once you’ve demonstrated a given interval of retention, you’re very unlikely to lose it if you keep reviewing. And if a lapse does occur, it has only a 7% chance of “backsliding” to the point that a reader can no longer span five days. As a reminder, Quantum Country roughly halves the review interval when a question is forgotten. Anki’s default behavior of resetting the interval to zero upon every lapse seems particularly inappropriate in our context given this data.
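The two lapse policies contrasted above can be sketched as follows. This is an illustration only: the function name, the default growth factor, and the one-day floor are assumptions, and neither Quantum Country's nor Anki's actual scheduler is this simple.

```python
def next_interval(current_days: float, remembered: bool,
                  growth: float = 2.0, lapse_policy: str = "halve",
                  floor_days: float = 1.0) -> float:
    """Return the next review interval in days.

    growth and floor_days are illustrative defaults, not values from either system.
    """
    if remembered:
        return current_days * growth
    if lapse_policy == "halve":
        # Roughly what the post describes for Quantum Country: back off by half.
        return max(floor_days, current_days / 2)
    if lapse_policy == "reset":
        # A reset-style policy: start over from the shortest interval.
        return floor_days
    raise ValueError(f"unknown lapse_policy: {lapse_policy!r}")

print(next_interval(60, remembered=False, lapse_policy="halve"))
print(next_interval(60, remembered=False, lapse_policy="reset"))
```

With a 60-day interval, halving leaves the reader at 30 days while a reset sends them back to the floor. Since only about 7% of lapses backslide below five days, the reset discards far more demonstrated retention than the data suggests is necessary.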

The implication here is that we should probably be much more aggressive with our expanding review schedule. Yes, this would make the experience more efficient; but what I really care about is that it would probably make the experience feel much less tedious.

What should the schedule be, exactly? Many papers suggest dynamic and complex models for these schedules, and perhaps I’ll implement one at some point. An ideal schedule would weigh tedium-avoidance with other important feeling-variables: connectedness to the material, the frustration of forgetting the same thing repeatedly, predictability of session timing. In terms of low-hanging fruit, it’s amazing how far simple heuristics could go. For instance, when readers begin by successfully answering a question both while reading the essay and in their first review session, 96% of those histories include zero lapses in the next year. It’s probably safe to stretch them out a great deal.

Just by focusing on too-easy questions, it’s pretty easy to imagine cutting the number of repetitions necessary for the first year of maintenance down by half, or perhaps more. Halving the median reader’s 448 maintenance-phase reviews would cut the total number of reviews in the first year from 567 down to 343, a 40% reduction. The marginal time cost for the first year of retention would drop from 50% to 30%.
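The arithmetic behind those figures can be reproduced with a short sketch. It assumes that only the 448 maintenance-phase reviews are halved, that review time scales linearly with review count, and that reading takes 3 hours (the midpoint of the reported 2-4 hours). The last two are my assumptions for illustration.

```python
TOTAL_REVIEWS = 567        # median reader, first year (from the post)
MAINTENANCE_REVIEWS = 448  # reviews after the first successful repetition (from the post)
REVIEW_HOURS = 1.5         # time spent on those 567 reviews (from the post)
READING_HOURS = 3.0        # assumption: midpoint of the reported 2-4 hours

def first_year_after_cut(maintenance_cut: float = 0.5):
    """Return (reviews, fractional reduction, time cost relative to reading time)."""
    reviews = TOTAL_REVIEWS - MAINTENANCE_REVIEWS * maintenance_cut
    reduction = 1 - reviews / TOTAL_REVIEWS
    # Assumes every review takes the same amount of time.
    time_cost = (REVIEW_HOURS * reviews / TOTAL_REVIEWS) / READING_HOURS
    return reviews, reduction, time_cost

reviews, reduction, time_cost = first_year_after_cut()
print(reviews, round(reduction, 2), round(time_cost, 2))
```

Halving the maintenance reviews leaves 343 reviews in total, roughly a 40% reduction, and brings the time cost to about 30% of reading time.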

The data I’ve presented don’t have much to say about the counterfactual. If the intervals had been twice what they are, would we see only a bit more forgetting, or would we see bedlam? I’ve been running controlled experiments along these lines, and they’ve been producing very interesting and confusing results… which will have to wait for another time.

Scheduling for the mnemonic medium versus existing SRS modalities

Almost all work around spaced repetition systems—both academic and commercial—has focused on definitions: vocabulary for language learners, terminology for medical students, people and events for history classes, etc. This kind of knowledge tends to be arbitrary and disconnected, and so I suspect it’s forgotten much more rapidly.

Quantum Country’s schedule is pretty aggressive. We start at a five-day interval and grow by 2-3x on each repetition. By default, Anki starts at a one-day interval and grows by 2.5x. And yet we’re still seeing very little forgetting. I don’t think the problem is that Anki’s wildly conservative: I think it’s that conceptual knowledge, introduced in a narrative arc and thoroughly connected to prior knowledge, has very different memory dynamics from vocabulary words. Scheduling for the mnemonic medium should probably look quite different from scheduling for traditional spaced repetition systems.
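To get a feel for what an expanding schedule means in review counts, here is a minimal sketch that counts how many reviews fall within a year for a given starting interval and growth factor. It assumes every review succeeds and that the interval grows by a constant factor, which is a simplification of any real scheduler. The function name is hypothetical.

```python
def reviews_within(first_interval_days: float, growth: float,
                   horizon_days: float = 365) -> int:
    """Count reviews up to horizon_days, assuming no lapses and constant growth."""
    if first_interval_days <= 0 or growth < 1:
        raise ValueError("need a positive first interval and growth >= 1")
    day = first_interval_days       # day of the first review
    interval = first_interval_days
    count = 0
    while day <= horizon_days:
        count += 1
        interval *= growth          # expand the gap before the next review
        day += interval
    return count

# A five-day start with the low and high ends of the 2-3x growth range.
print(reviews_within(5, 2), reviews_within(5, 3))
```

With a five-day start, 2x growth yields six reviews in the first year and 3x growth yields four. The same function can be called with any other starting interval and growth factor for comparison.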

SuperMemo models something like the effect I’m describing with “item complexity,” but because each user makes their own databases, it must estimate each item’s complexity from just a few point samples. The mnemonic medium’s shared questions create an interesting opportunity: item complexities can be estimated by pooling many prior users’ attempts, and a new user’s pre-existing proficiency with the material can be estimated by comparing their in-essay performance to that of prior students. This type of approach has been used in a model for scheduling Spanish vocabulary practice, and I’m interested to explore how it might fare on more conceptual topics. One distinguishing challenge for mnemonic essays (unlike vocabulary lists) is that the questions are highly interdependent. Reviewing one question makes readers more likely to be able to answer various other related questions. So I’ll probably need to mix a model like the one I’ve described with something like deep knowledge tracing, which can account for inter-item interactions.
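As a much simpler stand-in for the pooled estimation described above, the sketch below estimates a per-question recall rate from many readers' attempts, shrinking sparse questions toward a prior. It is not the Spanish-vocabulary model or deep knowledge tracing, and it ignores the interdependence between questions. The function name, the data shape, and the prior (centered on 95% recall with a weight of ten attempts) are all assumptions.

```python
from collections import defaultdict

def pooled_recall_rates(attempts, prior_successes: float = 9.5,
                        prior_failures: float = 0.5) -> dict:
    """Estimate each question's recall rate from pooled (question_id, remembered) pairs.

    The prior acts like ten imaginary attempts at 95% recall, so questions
    with few real attempts are pulled toward 0.95.
    """
    counts = defaultdict(lambda: [0, 0])  # question_id -> [successes, failures]
    for question_id, remembered in attempts:
        counts[question_id][0 if remembered else 1] += 1
    return {
        qid: (s + prior_successes) / (s + f + prior_successes + prior_failures)
        for qid, (s, f) in counts.items()
    }

# Hypothetical pooled attempts from prior readers.
attempts = [("q1", True)] * 90 + [("q1", False)] * 10 + [("q2", True)] * 20
rates = pooled_recall_rates(attempts)
print({qid: round(rate, 3) for qid, rate in rates.items()})
```

A scheduler could then stretch intervals for questions with high pooled recall and tighten them for the small set that produces most lapses.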

I’m not yet sure how deep I want to go on such optimization. There are so many opportunities to explore in this space, and my hours are so few! In fact, there are many simple levers for a feeling of effortlessness which don’t involve actually reducing the number of repetitions. For example, Quantum Country readers felt reviews were much less burdensome when we “batched” them so that small review sessions on adjacent days were combined into a single full-length session.
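One way the batching idea could work is sketched below: small sessions on consecutive days are merged into the later day until a batch would exceed a full session. This is an assumption about the mechanism, not Quantum Country's implementation, and the function name and the 50-prompt session size are hypothetical.

```python
def batch_sessions(due_by_day: dict[int, int], full_session: int = 50) -> dict[int, int]:
    """Merge sessions on adjacent days while the combined size fits in one full session.

    due_by_day maps a day index to the number of prompts due that day.
    Merged prompts are pushed forward to the later day of each run.
    """
    batched: dict[int, int] = {}
    previous_day = None  # most recent day that currently holds a session
    for day in sorted(due_by_day):
        count = due_by_day[day]
        if (previous_day is not None and day - previous_day == 1
                and batched[previous_day] + count <= full_session):
            # Carry the pending batch forward onto this day.
            batched[day] = batched.pop(previous_day) + count
        else:
            batched[day] = count
        previous_day = day
    return batched

# Hypothetical due counts: three small sessions on consecutive days, one isolated.
print(batch_sessions({1: 10, 2: 15, 3: 20, 7: 5}))
```

Here the three consecutive small sessions collapse into one 45-prompt session on day 3, while the isolated session on day 7 is left alone. Pushing prompts to the later day slightly lengthens their intervals, which the lapse data above suggests is usually safe.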

In a future post, I’ll explore how multiple experiments are struggling to measure any appreciable forgetting-over-time at all on Quantum Country. Until next time, thank you as always for your support.

Comments

The binary "yeah I remember that" (to which most people would say yes after seeing the answer) might not give you complete insight. Maybe instead of clicking yes/no you click somewhere on a scale of your comfort with the question. Behind the scenes you could quantify it with a number, but I don't think you have to show a scale of, say, 5 different buttons. That would just increase the friction. Then you could see if their answer improved over time. I don't know...just a thought.

Jim Beaver

Thanks for making this point. You helped me realize that my current approach can de-bias generous self-grading, but only within a given time slice. I don't have a good way to de-bias it across time. Hmmm…

Andy Matuschak

I know from my experience of going through Quantum Country, there is a real temptation to say you remember something when you don’t. I think the fact that people self-report probably skews the results. It’s also a way to get the system to quit asking you questions because it makes you feel bad that you don’t know it. I do think humans have a tendency to say they understand something but when asked to explain it, they become aware that they don’t. There are definitely questions I consistently can’t remember :0

Jim Beaver

So interesting! I haven't read such case studies from players such as Anki, SuperMemo, Readwise where they talk about the effectiveness of their SRS. So reading this opens up a whole new set of issues to be tackled: how to not make remembering too effortless. +1 on Cooper's comment above about making questions adaptively harder (something that English language tests like GRE implement). Kudos on the work, Andy! Look forward to reading more.

1. Possibly, yes! How many questions are necessary to "span" the space of the content? We don't know how to help mnemonic essay authors assess this yet. 2. This is a great question. I'd love to eventually have objective data as you suggest, but we do have some strong directional cues that people aren't just making everything up. For instance, the median user will mark most answers as remembered… but they won't mark *all* answers remembered, and the ones which they admit to forgetting are fairly consistent across users (i.e. they're somewhat objectively harder to remember). They may still be overly lenient on themselves, but at least it seems to be a directional bias rather than a blanket blindfold.

Andy Matuschak

I'm very interested in an approach like this! There's a field called "intelligent tutoring systems" that tries to do something like what you describe. The notion here would be to combine that more detailed knowledge model with a spaced-over-time practice regime and model of memory.

Andy Matuschak

Hey Andy, I wonder about the effect of increasing question difficulty with time. For example, perhaps the in-text questions regard simple definitions, and, with each review session, the review questions become more complex application questions that require synthesis of multiple definitions and concepts. To me, this seems like a nice way to counter feelings of tediousness. Of course, a downside of this would be a higher overhead for the author. Also, the system would have to be more complex and so the models one would have to employ would also need to be more complex (for example, the question of what should happen when someone gets a question wrong becomes harder to answer).

That's really interesting. It leaves me wondering a couple of things: 1. Are there too many "easy" questions? 2. How reliable is the self-report of having remembered? I know there's no easy way around that with this kind of tool, but am wondering if it might be possible to do some experiments with an occasional objective test to go along with the subjective one. That might give the writer some feedback to help hone their prompts, and might give the learner a check on the tendency to tell themselves "I basically had that one right". As a teacher, I can tell you most learners fool themselves without even realizing it.

Yes, I've noticed something similar! https://notes.andymatuschak.org/Spaced_repetition_review_sessions_often_become_boring_and_detached_without_a_steady_stream_of_new_prompts I suspect this is a real barrier to adoption for the mnemonic medium, since there's relatively little content available now.

Andy Matuschak

That's an interesting piece of feedback, thank you! On Quantum Country, readers seemed to much prefer it when we limited new prompts to 50/day (and 25 in their first session).

Andy Matuschak

Thank you. This was actually a good reminder to me to complete my last review session from a while back.

Thanks for this! I wonder whether the ingredient of a few fresher, recently-written prompts sometimes acts as leaven for the whole batch? Seems to save the entire review session from the feeling of dutiful “maintenance”. And of course, with respect to our learning and developing thinking as a whole, the tedium of maintenance during reviews can also sometimes signal that we’re stagnating slightly!

Great post. A note on tedium: across a few different SRSes I use regularly, I rarely feel tedium from seeing a particular prompt too often. But a long review session definitely demoralizes me. I think this often arises from many "first learned" items on the same day all getting scheduled together in the future. It's like trying to get a large bolus down your esophagus. In these cases, I'd much rather have the platform rate limit how many prompts I get per day, assuming there's enough unused capacity on adjacent days.

Well done. I probably wrote a dozen notes from this, Andy. Probably a measure of generativity in there, somewhere.

