Andy Matuschak


In praise of the particular, and other lessons from 2023

Publicly available URL: https://andymatuschak.org/2023 

See also previous annual reflections: on 2022, 2021, 2020.

Systematic output; one-off methods

It’s so tempting to conflate aims with methods.

I’m trying to invent systems that work for lots of people in lots of contexts. My instinct, then, is to focus the process of invention on scalability and generality. If I’ve got an idea, I’ll usually try to make something linkable—something I can send to a wide range of people, for use in a wide range of contexts. I’ll send out a survey; I’ll pick some telemetry variables to measure and analyze; maybe I’ll run a randomized controlled trial; I’ll interview users about their experiences.

Possibly the most important thing I’ve learned in recent years is that these instincts are often a mistake. To invent a systematic solution, I’ll often need to synthesize insights from many non-systematic experiments. In other words, my aims—general solutions—will often require methods (and intermediate outputs) with very different properties.

Paul Graham has famously advised startups to “do things that don’t scale.” Things like talking to customers door-to-door instead of buying a national ad campaign. He mostly justifies that advice in terms of business, marketing, and brand reasons; I think that’s what most people remember. But he does briefly discuss non-scalability as a method of creation, and although he’s focused on startups, his arguments capture some of my experience as well. I’d like to elaborate with some observations on “doing things that don’t scale” in the context of discovery and invention.

Single-user evaluation

Late last year, I felt that my experiments had somehow distanced me from my aim: enabling people to acquire rich understanding of complex ideas that really matter to them. I’d been building systems and running big experiments, and I could tell you plenty about forgetting curves and usage patterns—but very little about how those things connected to anything anyone cared about. I knew that I was making some progress, but I was mostly flying blind, without the feedback I needed to drive my iteration.

So in 2023, I switched gears to emphasize intimacy. Instead of statistical analysis and summative interviews, I sat next to individuals for hours, as they used one-off prototypes which I’d made just for them. And I got more insight in the first few weeks of this than I had in all of 2022.

Of course, I got a different kind of insight. I didn’t learn anything about what would generalize to a wide population. I couldn’t measure fine differences in memory performance as I changed some parameter. Instead I could see, in great detail, the texture of the interaction between my designs and the broader learning context—my real purpose, not some proxy. I could see things going as planned in my prototype, and then going decidedly off the rails when that knowledge was put into use. I could see the emotional beats of that experience as they were happening. I could see a hundred questions I hadn’t thought to ask in a survey or post-hoc interview.

Is this “rigorous” research? Not as that word is typically used. But working this way, I feel better-equipped to make progress on my ideas than I have in quite some time. Single-user experiments like this emphasize problem-finding and discovery, not precise evaluation. In fact, my work with one student uncovered a startling hypothesis that consumed much of my year: what seems like a problem of forgetting is sometimes a problem of reading comprehension—never having understood in the first place—and we can’t reliably tell the difference.

As far as evaluation goes, negative results in single-user experiments do seem quite instructive. If I make a bespoke solution for one particular student, and it doesn’t work well, I rarely feel the need to double-check that negative finding with a larger trial. Some failures might be personal or idiosyncratic, but my designer’s instincts can sort out much of that; I’m not too worried about false negatives. And, unlike a controlled experiment with formal measures, this kind of qualitative observation will usually tell me a great deal about why my solution’s not working well. Practically speaking, a good heuristic for evaluating my work seems to be: try designs 1-on-1 until they seem to be working well, and only then run more quantitative experiments to understand how well the effect generalizes.

Single-context designs

My aim is to invent augmented reading environments that apply to any kind of informational text—spanning subjects, formats, and audiences. The temptation, then, is to consider every design element in the most systematic, general form. But this again confuses aims with methods. So many of my best insights have come from hoarding and fermenting vivid observations about the particular—a specific design, in a specific situation. That one student’s frustration with that one specific exercise.

Fred Brooks keenly distills Christopher Alexander:

The only way to achieve good fit between any design and its requirements is to find misfits and remove them; there is no direct way to derive form from requirement. Good fit is the absence of all possible misfits.

It’s often hard to find “misfits” when I’m thinking about general forms. My connection to the problem becomes too diffuse. The object of my attention becomes the system itself, rather than its interactions with a specific context of use. This leads to a common failure mode among system designers: getting lost in towers of purity and abstraction, more and more disconnected from the system’s ostensible purpose in the world.

I experience an enormous difference between “trying to design an augmented reading environment” and “trying to design an augmented version of this specific linear algebra book”. When I think about the former, I mostly focus on primitives, abstractions, and processes. When I think about the latter, I focus on the needs of specific ideas, on specific pages. And then, once it’s in use, I think about specific problems, that specific students had, in specific places. These are the “misfits” I need to remove as a designer.

I think often of Ted Nelson’s quixotic efforts with Xanadu, originally motivated by his ambitions as a screenplay writer. Decades of abstraction and infrastructure aspiring to encompass all possible forms of writing, but (as far as I can tell) none of that involved him actually trying to write anything meaningful using those designs. And so, interesting as those systems’ ideas were, most remain unrealized even today. It’s not a matter of young technologists ignoring the ancients’ wisdom: many of these designs simply aren’t developed adequately to solve the problems they aspire to solve, or to suggest whether they can solve those problems. By contrast, consider Douglas Engelbart’s contemporaneous NLS, which was developed for his lab to use as a collaboration tool. Every key element of that design has long been realized in our tools today.

Of course, I do want my designs to generalize. That’s not just a practical consideration. It’s also spiritual: when I design a system well, it feels like I’ve limned hidden seams of reality; I’ve touched a kind of personal God. On most days, I actually care about this more than my designs’ utilitarian impact. The systems I want to build really do require abstraction and generalization. Transformative systems really do often depend on powerful new primitives. But more and more, my experience has been that the best creative fuel for these systematic solutions often comes from a process which focuses on particulars, at least for long periods at a time.

The feeling of the particular

Also? The particular is often a lot more emotionally engaging, day-to-day. That makes the work easier and more fun.

In a one-hour interview or observation session, I can build some emotional connection with a person. They’re more vivid than an email thread or a row in a spreadsheet. But then the session ends, and I generally move on, influenced but not transformed. By contrast, if I work with a particular student for many hours across many sessions, I develop quite a strong desire to help that particular person flourish, at whatever really matters to them. I find myself naturally thinking about them as I work, again and again. I feel vicarious joy when something goes especially well. When they struggle, I really want to solve those problems. (I can understand the appeal of being a teacher!)

All this moves my motivation from the spacious, timeless theoretical to the sharp, focused interpersonal. This helps with my understanding, as I’ve described, but it’s also creatively energizing and much more immediately motivating.

Throughout my career, I’ve struggled with a paradox in the feeling of my work. When I’ve found my work quite gratifying in the moment, day-to-day, I’ve found it hollow and unsatisfying retrospectively, over the long term. For example, when I was working at Apple, there was so much energy; I was surrounded by brilliant people; I felt very competent; it was clear what to do next; it was easy to see my progress each day. That all felt great. But then, looking back on my work at the end of each year, I felt deeply dissatisfied: I wasn’t making a personal creative contribution. If someone else had done the projects I’d done, the results would have been different, but not in a way that mattered. The work wasn’t reflective of ideas or values that mattered to me. I felt numbed, creatively and intellectually.

By contrast, when I’m doing work that I find gratifying and meaningful over the long term, the day-to-day experience is usually frustrating and unpleasant. The work is gratifying because it’s deep and personal and unique. Unfortunately, in my projects, those same attributes also mean that progress tends to be inconsistent and hard to discern; it’s rarely clear what to do next; there’s rarely anyone I can ask for help; I usually feel incapable.

Of course, these qualities don’t need to produce suffering, and I’m slowly getting better at handling them skillfully. But I’ve also noticed that when I focus my work on particular people in particular contexts, that more immediate emotional connection sometimes overpowers the day-to-day frustration that comes with being lost in the woods. For several long stretches this year, I found the work really gratifying, both in the moment, and retrospectively over the long term. Even if focusing on the particular didn’t help the creative process in the other ways I’ve described, this emotional effect would be well worth pursuing on its own.

Mignardises

Progress often doesn’t look like progress

It often feels like I’m not making any progress at all in my work. I’ll feel awfully frustrated. And then, suddenly, a tremendous insight will drive months of work. This last happened in the fall. Looking back at those journals now, I’m amused to read page after page of me getting so close to that central insight in the weeks leading up to it. I approach it again and again from different directions, getting nearer and nearer, but still one leap away—so it looks to me, at the time, like I’ve got nothing. Then, finally, when I had the idea, it felt like a bolt from the blue.

When the insight arrived, I didn’t notice the connection to the trail I’d laid on the preceding pages. My experience was of making no progress, and then, finally, making some. In hindsight, I can see that I had been making plenty of progress over those weeks; I just couldn’t tell at the time. I suspect this is pretty common in my work. So, “I feel like I’m not making progress” is probably not a good local heuristic for guiding my work. Alternately, the lesson might be that I need to become more sensitive to the many subtler flavors of progress in this kind of work.

Stillness is trainable

I’ve written that one of the hardest challenges for my work is: “how to cultivate deep, stable concentration in the face of complex, ill-structured creative problems?” I now have several years of data on self-reported focus and energy levels, and it’s comforting to see that this does get easier with practice.

I work in one big block with no interruptions; in 2022, this was usually around 7:30 AM–1:30 PM, and I’d get quite restless towards the end. By late 2023, I usually didn’t start feeling antsy until 2:30. Slowly, over 2023, I increased the amount of time I spent “highly focused” by 38%, while simultaneously lengthening my total working time by 12%.

Some of this is the accumulation of many small tweaks to my practices, but I think much of it is me slowly—still—becoming less reactive to the discomfort of sitting in stillness and confusion.

Help can come without domain expertise

This year in particular, I’ve benefited enormously from friends who are willing to talk in great detail about challenges in our respective creative lives. To my surprise, it often doesn’t matter if these friends are experts in my domain, or even that they have much context about my projects. I shouldn’t have been so surprised. In late 2021, when I sought out a coach, I wanted to find someone who had experience with research and invention. I imagined that we’d want to talk at great length about my projects and their problems. Two years later, my coach has helped me enormously, but he still has only a hazy idea of what I’m researching.

It turns out that my coach hasn’t needed a detailed understanding of my work, for roughly the same reason that my helpful friends haven’t. Much of the time, whether I’m aware of it or not, the conversation I most need to have about my creative work is about my emotional relationship to the work and to the process. I can get that from sensitive, creative people who care. They don’t need to deeply understand my work as a domain expert. Of course, help from domain experts has often been extremely valuable. But I’ve underrated help from highly creative people outside my field, to my detriment.

New patterns of collaboration

My collaborations have mostly fallen into two forms: 1) everyone’s driving the project creatively, full-time, together; 2) someone’s driving the project creatively, and someone else is mostly “taking tickets”—doing well-defined piecework. This year, I had two happy collaborations in different modes.

Since last February, I’ve spent an afternoon each week with Matthew Siu, the winner of last year’s research fellowship. Our project isn’t an idea that either of us was thinking about when we met; it’s not on the “critical path” of any of my projects; instead, it’s something we discovered was fertile ground over many hours of conversations. Matthew is very much the primary driver, while I mostly advise, sketch, and synthesize during our weekly meetings. (More on this project soon! It turns out six months was a very unrealistic fellowship length…)

In parallel, Derrek Chow and I had a different sort of collaboration. I’d gotten to know him over many long walks, and I’d come to admire the artist’s temperament that he brings to human-computer interaction problems. Derrek was looking for raw material for his creative practice, and for a good excuse to collaborate, so he asked if I had any project ideas he might be able to work on.

I suggested the prompt which became BookBridge, and we began to meet weekly to discuss an ongoing stream of prototypes. As in my collaboration with Matthew, Derrek gave the project much more time than I did between our meetings. But this collaboration had a different texture because it needed me to do more “driving”: this was a prompt I’d suggested based on my experiences as a reader, not a need Derrek felt viscerally himself, so my role involved more actively steering iteration and synthesizing a whole from the parts.

Both of these collaborations were exciting for me because I felt newly able to scale myself creatively: with a few hours of participation each week (plus intermittent bursts of lots more), I could cause a new branch of idea space to be meaningfully explored. In Matthew’s case, it’s a problem I wouldn’t have considered alone; in Derrek’s case, it’s a problem I’d wanted to explore but for which I hadn’t made the time. One very rewarding element of my collaboration with Matthew has been investing in his growth as a researcher and designer. It would be good for my “field” if I could do that repeatedly. Meanwhile, in my collaboration with Derrek, I benefited from complementarity: his skills and mindset are very different from mine, in ways which are interesting and additive. I’d like to embark on more collaborations of both these types—and others!—in the future.

Crowdfunding continues more-or-less steadily

2023 was my fourth year of crowdfunding my research. Year over year, my member count increased by 18%, which is lovely. At the same time, my income actually declined 5%. The disparity comes from a shift in the distribution of contribution amounts, from the higher to the lower tiers. That’s okay, I think: it smooths out the effects of inevitable churn.
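As a rough check on that disparity, here is a small sketch of what the two figures above imply for the average contribution per member. It assumes (my assumption, not something stated above) that the income in question comes entirely from member contributions; the variable names are made up for illustration.

```python
# Year-over-year changes quoted in the text.
member_growth = 0.18   # member count rose 18%
income_change = -0.05  # income fell 5%

# Assumption: income = members * average contribution per member,
# so the average scales by (1 + income_change) / (1 + member_growth).
avg_contribution_change = (1 + income_change) / (1 + member_growth) - 1

print(f"Implied change in average contribution per member: {avg_contribution_change:.1%}")
```

Under that assumption, the average contribution per member fell by roughly a fifth, which fits a shift from the higher tiers toward the lower ones.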

The fundamental dynamics remained the same as in past years: similar churn rates and conversion rates. The biggest lever available remains getting more people into the “top of the funnel”, and I don’t want to spend any real attention on that. I experimented with talking about crowdfunding and my Patreon more actively on Twitter this year, and that helped a bit, but reaching new audiences drives much more growth: my appearance on Dwarkesh’s podcast led to most of my new subscribers this year. All in all, my income is still somewhere between that of a postdoc and that of junior faculty. Not great, but OK, for now. If I’d like to sponsor more collaborations, I’ll need to seek more outside funding.

At the start of 2019, I wrote some five-year goals for myself, and then ignored them completely until a few weeks ago, when I exhumed them for a review. It’s amazing to me how fixated I was on hopes and concerns around funding. I suppose it shouldn’t be surprising: my fundraising experiences at Khan Academy were still raw at the time. My greatest wish was to have a source of funding steady enough that I could pursue whatever creative projects I found most interesting. My greatest concerns were that I’d be incredibly distracted by fundraising, or that I’d allow my work to be distorted by funders’ interests. It’s pretty ironic, then, that I actually hit these goals in 2021, and didn’t even realize that I was several years ahead of the ambitious five-year schedule I’d set for myself.

It’s an unbelievable privilege to wake up each day and fly by my own creative compass. A small group of people made that possible: all patrons, past and present, could fit in a typical Broadway theater. That’s a stirring image to visualize whenever I need a hit of gratitude. So, thank you, to all who have made this very unusual life possible.

Demo: an early look at BookBridge

There's a funny trouble in my work: I've been doing all these experiments around augmenting the reading experience with dynamic media. But, as it happens, I hate doing serious reading on screens! Most readers I interview feel the same way, to varying degrees.

Derrek Chow and I have been experimenting with bringing the digital into physical reading experiences, and the physical into digital reading experiences. This talk briefly summarizes one branch we've been exploring. There's lots more to be said about the design principles which have emerged in this project, but it's still early days—more as the work develops.

Happy holidays, all! I don't think I can really say it enough: I'm so, so grateful to you all for the opportunity to do the work that I do. Thank you for your support.

Initial results from highlight-driven prototype

I’ve been working on a new augmented reading environment centered around highlighting as the core interaction. The idea is to give readers a magic wand with two unusual “powers”:

  1. You can point the wand at anything and say “make sure I know this.”
  2. You can wave the wand over a section and ask “did I miss anything important?”

Of course, those are aspirational framings. My current design instantiates those powers like this:

  1. You have a special purple highlighter. When you mark text with it, spaced repetition prompts about those ideas will be added to future review sessions.
  2. At the end of each section, you can press a button to take a second pass over “suggested highlights”. This button marks phrases corresponding to all the details which the author thought were important, but which didn’t semantically intersect your own highlights. You can scroll through to quickly check for gaps in your reading comprehension.

It’s easy to imagine more elaborate instantiations! These are intended as a meaningful first step towards the more aspirational powers. For more background, see my introductory letter on the concept.

An initial prototype

This month, I tested a prototype of this concept, adapting a linear algebra primer by Jim Hefferon. You can see the prototype in action in this new demo video (6m17s). I hosted study sessions with 14 people who had some authentic reason to study linear algebra, and who had some experience with spaced repetition memory systems. We met in person, one-on-one—that always helps me form richer impressions of a new prototype.

After a short background interview, and an explanation of the interface, participants read the first section of the book, marking it with both a normal yellow highlighter and the special purple highlighter however they liked. At the end of the section, we used the “suggested highlights” tool to make a quick second pass. I asked readers to comment on each extra highlight: was it something they understood but didn’t feel was worth marking, or was it something they skimmed over? Finally, we reviewed all the prompts corresponding to their purple highlights. I probed readers about how it felt to review these prompts, and whether they felt their highlights were faithfully represented.

Before we dig into what I observed, I should explain that this prototype involved some significant smoke and mirrors. Readers imagined that I’d implemented some elaborate machine learning system. But no—not yet, anyway. Here’s how it worked:

  • Before meeting with any readers, I manually “curated” highlights corresponding to what I thought were all the important details in the section we read.
  • For each of those highlights, I wrote one or more practice prompts.
  • Then, in realtime while each participant read, I manually mapped each of their highlights onto the curated highlights (if any) which pointed at the same underlying idea. This sometimes required fluid judgment!
  • The “suggested highlight” feature then displayed all my curated highlights which had no corresponding reader highlights.
  • The review session displayed the prompts associated with all the curated highlights I’d mapped to the reader’s highlights.

That may look fairly baroque in writing, but in practice it created a remarkably transparent experience. Readers didn’t perceive manual steps; they often innocently asked if they could keep using the tool to read on their own after our session. This “Wizard of Oz”-style testing let me focus on the interaction design concepts, rather than on potentially unbounded problems of language model pipelines. That’s the right trade to be making for now.
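For readers who think in code, the bookkeeping behind the manual steps listed above can be sketched roughly as follows. This is only an illustration of the data flow, not the prototype's actual implementation: the class and method names are hypothetical, the sample highlights and prompts are placeholders, and the judgment call of matching a reader's highlight to curated ones (done by hand in the sessions) is simply passed in as data.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable


@dataclass(frozen=True)
class CuratedHighlight:
    """A passage marked as important before any session (hypothetical type)."""
    id: str
    text: str
    prompts: tuple[str, ...]  # one or more practice prompts written in advance


@dataclass
class Session:
    curated: list[CuratedHighlight]
    # Reader highlight text -> ids of curated highlights pointing at the same idea.
    # In the sessions this mapping was a human judgment; here it is just recorded.
    mapping: dict[str, set[str]] = field(default_factory=dict)

    def record_highlight(self, reader_text: str, curated_ids: Iterable[str] = ()) -> None:
        self.mapping[reader_text] = set(curated_ids)

    def _covered(self) -> set[str]:
        # Ids of every curated highlight matched by at least one reader highlight.
        return set().union(*self.mapping.values())

    def suggested_highlights(self) -> list[CuratedHighlight]:
        """Curated highlights with no corresponding reader highlight."""
        covered = self._covered()
        return [c for c in self.curated if c.id not in covered]

    def review_prompts(self) -> list[str]:
        """Prompts attached to curated highlights the reader's marks were mapped to."""
        covered = self._covered()
        return [p for c in self.curated if c.id in covered for p in c.prompts]

    def unmapped_reader_highlights(self) -> list[str]:
        """Reader highlights with no curated counterpart, so no prompt exists for them."""
        return [text for text, ids in self.mapping.items() if not ids]


# Placeholder data, invented purely for illustration.
session = Session(curated=[
    CuratedHighlight("a", "idea A", ("Prompt about idea A",)),
    CuratedHighlight("b", "idea B", ("Prompt about idea B",)),
    CuratedHighlight("c", "idea C", ("Prompt about idea C",)),
])
session.record_highlight("reader's wording of idea A", {"a"})
session.record_highlight("a detail with no curated match")

print([c.id for c in session.suggested_highlights()])
print(session.review_prompts())
print(session.unmapped_reader_highlights())
```

The last method makes one limitation of this setup visible: a reader highlight with no curated counterpart produces no prompt at all.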

What I learned from readers

Pre-screening is always imperfect in user research. For 3 of my 14 participants, the book (intended as a first course for undergraduates) was too difficult to read comfortably. Another 2 readers didn’t actually want to understand the topic in detail; they just wanted “the big picture”. In discussing what I learned, I’ll focus on the 9 readers who aligned with my target audience.

Mapping highlights to prompts seems very promising

Readers broadly loved the concept of the augmented highlighting interaction. Most of them already had a habit of highlighting texts, though all readily admitted that they didn’t think it actually affected how well they learned the material. Instead, readers described highlighting as a fidgeting behavior, a way to stay more engaged, and an ad-hoc bookmarking method. One reader didn’t end up using the special highlighter; he self-described as hypermnesic and felt he didn’t need practice support for the section’s material. The rest used the highlighter extensively.

Most readers were extremely happy with the retrieval practice prompts they were given. One said: “[The prompts] captured my own intent to the point where it took me some time to realize that they weren’t written by me”. Another said: “These are the kinds of things I wish I could actually have with a highlighting system. … It’s not just throwing things back at me verbatim. … For the concepts I highlighted, it asked me about the important logical relationships.” Most readers spontaneously asked if they could use “the magic highlighter” with other books.

Interestingly, readers had such positive perceptions of my highlight-to-prompt mappings despite the fact that I hadn’t always prepared a corresponding prompt for each of their highlights. Most readers highlighted a couple points which I didn’t feel were important enough to mark. No one noticed the absences, but I worry that this kind of silent omission would erode trust in the tool over time. Fully fixing this problem would require reliable machine-generated prompts; absent that, we could provide a fallback workflow for readers to write their own prompts for these “missing” highlights.

As I hoped, this design also mostly eliminated two failure modes I saw routinely in past mnemonic medium prototypes. Because each prompt (in principle) corresponded to something which the reader “said” they wanted to know, readers were much less likely to experience the review sessions as unpleasantly authoritarian or “school-like”. And for the same reason—in conjunction with the reading comprehension support mechanism—readers were less often outright confused by a question or its answer. Of course, readers could (and did) highlight passages without really understanding them, but when that happened, readers didn’t complain of those prompts as feeling arbitrary and capricious as they did in previous prototypes. The review interface offers a “View Source” button which shows the connection to the source material they’d highlighted, for any prompt. I think this generally created a feeling that readers had “asked for” their confusion, rather than that the confusion was “being done to” them.

These sessions weren’t a rigorous experiment; I was only aiming for high-level qualitative evaluation. But my early impression is that the prompts-from-highlights design concept seems fundamentally quite promising, and is well worth pushing further.

Suggested highlights diagnosed some gaps, felt lightweight

This prototype’s second big idea was “suggested highlights” as a lightweight reading comprehension support intervention. Here results were somewhat more equivocal. The readers I worked with varied enormously in their pace and diligence. Some muttered every word under their breath, stopped routinely to ask and answer questions of the text, and re-read passages multiple times to clarify misunderstandings. Others breezed through in less than half the time, skipping passages which seemed repetitive or obvious. The “testing” context created distortions, too: some readers confessed that they were reading much more carefully than they would if I weren’t present—even though I explicitly asked them not to as part of their initial instructions.

Of the 9 readers matching my intended target user, 4 had meaningful reading comprehension gaps. The “suggested highlights” interaction quickly identified places where these readers hadn’t attended to some important point, and gave them a straightforward opportunity to fill that gap. Sometimes readers felt the details they missed weren’t so important, but often readers colored the “suggested highlights” with their special purple highlighter once they’d re-read the passage. I take that as a sign that the interaction identified something meaningful.

These readers were quite enthusiastic about the design. One said: “This is the tool that I want!” Another: “This is insanely cool! Man, I wish I had this everywhere.” One hesitation I have is that if these readers had a few straightforward gaps which my tool could identify, they probably had some other subtle gaps which will remain. Maybe it’s fine. Maybe these are the kinds of details which will get easily ironed out during a problem set. And the interaction at least ensured that readers weren’t being asked to do retrieval practice on material they hadn’t understood—a key goal. I’ll need to run more focused experiments to better understand the effects of my intervention on reading comprehension.

The other 5 “target” users had no overt comprehension gaps; the “suggested highlights” were all false positives. I spontaneously probed these readers’ understandings with extra questions, and they all performed quite well. So these readers didn’t need extra reading comprehension support, at least in the test context. Happily, 3 of these 5 liked the idea of the tool, and said that they didn’t mind the false positives; they found the interaction lightweight enough that they would want to use it anyway: “I still think it’s helpful. It gives me a safety net—guardrails.” The other 2 weren’t sure.

Purple highlights as to-do’s

Several readers used the special purple highlighter in a surprising way: they marked passages which they didn’t yet understand. These readers wanted to move on with the reading, but they wanted to make sure that they eventually understood the detail they’d marked. They were effectively leaving themselves a “to-do for understanding.”

This makes a lot of sense! After all, I told them that if they mark a passage with their purple highlighter, the system will make sure that they internalize those ideas. The current mechanism sort of accomplishes this goal. These readers received retrieval practice prompts about their “to-do” markings. They predictably didn’t know the answer, and they used the “View Source” button to return to the original passage for a re-reading. In several cases, the explanation made more sense on a second pass, now that they’d seen how the ideas fit into later parts of the text.

So at least sometimes, the retrieval practice prompts indirectly accomplished these readers’ “to-do” intention, insofar as it provoked them to re-read the relevant passages. But in some cases, the passage was still confusing, and the reader needed some conversation to make sense of it. In other cases, the confusing passage didn’t correspond to any retrieval practice prompt—for instance, one reader was confused by a particular step in an example problem—and so the to-do was effectively dropped. It would be interesting to consider how one might support the “to-do” workflow more directly.

Transparency in highlight-to-prompt mapping

One of my “target” readers felt that the highlight-to-prompt mapping was uncomfortably “magical”. When I asked to what extent he felt the review prompts represented his highlights, he said that he really didn’t know: he couldn’t easily see the correspondences, so he couldn’t tell how well his intent had been reflected. The whole system felt like a black box.

This makes sense! I’m surprised more readers didn’t feel this way. Technologists like to describe their products as “magical”, but we really want “magic” in the sense of “astounding capacity, ease, expressivity”, not in the sense of “ineffable, inscrutable, mythical, eldritch.” My favorite paper on AI in interface design is Jeffrey Heer’s 2019 “Agency plus automation”. In it, he argues that such interfaces benefit from shared representations. You want the automated system to clearly surface its proposals in forms you yourself can create and manipulate, and you want to clearly see the connection between those proposals and the inputs which influenced them.

All that is missing from my current design. You can’t write your own prompts or modify those which the system provides. During review, you can “View Source” on each prompt to see which highlight it “came from”, but that’s a pretty cumbersome way to get an overview of the connections. And there’s no equivalent available while reading: that is, when you make a purple highlight, you’re given no hint of what the system understands you to mean—what prompts will result. Ideally, those representations should enable an interactive feedback loop. That is, you should be able to say “oh, no, that’s not what I meant; focus on this part.”

One naive solution: whenever a reader makes a purple highlight, we display a preview of the associated prompts and allow readers to intervene if they like. But I want to be careful to avoid re-introducing problems I encountered in my prototypes late last year. In those designs, curated prompts were presented in the margin alongside associated content. This approach made the interaction very clear: if you click to save a prompt, you’ll get exactly what you see; you can edit it in place to adjust as you like. But it also created quite a distracted reading experience. Those marginal prompts tugged readers’ eyes away from the body of the text. Watching them, I could see them constantly jumping back and forth, losing their place, finding it again, eyes skipping down the page to the next spot in the text with a marginal prompt. I think most ended up spending far too much attention evaluating and making decisions about prompts.

One of my big motivations for this new highlight-centric design was to solve that problem. I wanted to make it easy for readers to remain immersed in the text, while still benefiting from selective augmentation. I think this prototype performed quite well in that regard. But I’m not yet sure how to sustain that success while creating a more transparent and shared representation for the prompts.

Tailorable prompt mapping: emphasis notes and feedback

This prototype’s reading interface treated the highlight-to-prompt mapping as a black box, but as an experiment, it did offer a way to “steer” the prompts. When readers used their purple highlighter, they could optionally write a note to clarify what, specifically, they’d like to emphasize. For example, maybe you’ve highlighted a definition, but it’s the notation which you want to make sure you remember; or you want to make sure you internalize the contrast between this definition and some earlier concept. So you can highlight the definition of (say) linear equations, and jot a note that says “contrast with linear combinations.”

Most users didn’t use this feature heavily, but did use it at least once. And I can imagine that they might tend to use it more over time, as they build a mental model for how the system maps their highlights onto prompts, and for how that mapping sometimes isn’t exactly what they’d want.

In practice, I was only able to honor about a third of these requests with my pre-made curated prompts. I could map another third onto broader prompts which included or indirectly reinforced the detail they mentioned. The rest of the requests were idiosyncratic enough that they probably can’t be satisfied without machine-generated prompts. Only one reader noticed that his requests weren’t exactly being granted, and he didn’t express much concern. But I think this would become more troubling over time, and I wouldn’t want to include an “emphasis note” feature if it isn’t reliable.

The “emphasis note” framing front-loads user guidance. Another way to think about this kind of control is through iterative feedback. For example, one reader highlighted a theorem which says that if a linear system is transformed through one of three listed operations into another system, the second system has the same set of solutions as the first. In his review session, he got a prompt about this theorem’s role in the safety of Gauss’s method. He was confused about this, and once he clicked “View Source” to see where the prompt came from, he said “Oh, no, I don’t really care about proving correctness here—I wanted to make sure I know the three ‘safe’ operations.” Ideally, he should be able to just tell the system that, during review: “Just make sure I know the safe operations.”

Another reader wished that he could make the prompts less formal: more verbal explanation and examples; tone down the notation and abstraction. This kind of feedback should influence not only the current prompt but probably all the prompts in the book, and maybe all prompts in general.

My brief experiments suggest that tailoring pre-existing prompts is a much more viable task for large language models than asking them to generate prompts anew. For example, consider the prompt: “Q. What is the leading variable of a row in a linear system? A. The first variable with a nonzero coefficient.” One of my readers wanted to see this kind of abstract answer in the context of an example. GPT-4 was able to rewrite the prompt appropriately for that request. Of course, this example wouldn’t be hard to do by hand, so the tradeoffs here may not make sense unless they can apply to many prompts, or unless the machine generation is extremely reliable.
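To make the scoping idea concrete, here is a minimal sketch of how reader feedback might be stored and gathered into a rewrite request for an existing prompt. Everything in it is hypothetical: the names `Scope`, `Feedback`, `Prompt`, `applicable_feedback`, and `build_rewrite_request` are invented for illustration, the three scopes are just one reading of "the current prompt, the book, or all prompts", and no language model is actually called. It only shows how the pieces could fit together.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    PROMPT = "prompt"   # applies to one prompt only
    BOOK = "book"       # applies to every prompt in one book
    GLOBAL = "global"   # applies to all of the reader's prompts


@dataclass(frozen=True)
class Feedback:
    text: str
    scope: Scope
    target: Optional[str] = None  # prompt id or book id; None for GLOBAL


@dataclass(frozen=True)
class Prompt:
    prompt_id: str
    book_id: str
    question: str
    answer: str


def applicable_feedback(prompt, feedback):
    """Return the feedback items that apply to this prompt, broadest scope first."""
    order = [Scope.GLOBAL, Scope.BOOK, Scope.PROMPT]
    matches = [
        f for f in feedback
        if f.scope is Scope.GLOBAL
        or (f.scope is Scope.BOOK and f.target == prompt.book_id)
        or (f.scope is Scope.PROMPT and f.target == prompt.prompt_id)
    ]
    return sorted(matches, key=lambda f: order.index(f.scope))


def build_rewrite_request(prompt, feedback):
    """Compose the text one might send to a language model (no model is called here)."""
    lines = [
        "Rewrite this review prompt to reflect the reader's feedback.",
        "Keep the underlying idea unchanged.",
        f"Q. {prompt.question}",
        f"A. {prompt.answer}",
        "Reader feedback:",
    ]
    lines += [f"- {f.text}" for f in applicable_feedback(prompt, feedback)]
    return "\n".join(lines)
```

Listing the broadest feedback first means that a prompt-specific request (such as "show this in an example") reads as the final, most specific instruction in the request.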

Next steps

Speaking more personally for a moment: this last round of testing was quite exciting for me! The new design seems to have solved many of the problems I’ve observed with my various memory system designs over the years. And—very tentatively—it also appears to support some readers’ reading comprehension. It’s a great sign that I really want to use this tool every day in my own reading.

All that said, this past round of testing was pretty shallow. I wanted to see how a variety of people reacted to the design ideas, so I met with 14 people for around an hour each. Because we were reading the first section of the book, there wasn’t much opportunity for the ideas to really build on each other and put heavy demands on memory and comprehension. And because I observed just a single session, I didn’t have a chance to see how the memory and comprehension support fared over time, as forgetting became more relevant.

So, starting this week, I’ll switch to a depth-first approach. Like I did earlier this year, I’ll be meeting with one student weekly for a few hours. We’ll continue more deeply into the book, where the material will start to compound in more demanding ways. We’ll also work through some problem sets during those sessions, to observe how the augmentation interacts with practical capacity.

Those observations will still be qualitative. Meanwhile, I’d like to start working towards more systematically understanding the impact of my design on reading comprehension. How often does it help readers identify meaningful gaps? Are there kinds of gaps which it tends to ignore? Do readers who use this intervention understand the material appreciably better? Feel appreciably more capable or engaged with the material?

Right now, my prototype requires me to manually map reader highlights to curated highlights. And I have to do that with little delay, so that they can click the “suggested highlights” feature once they finish reading the section. If I’d like to run an experiment with a few dozen users, this would consume a huge amount of time. So, in parallel with my depth-first work with one student, I’ll attempt to automate the mapping between readers’ highlights and the curated highlights.

My initial tests suggest that this is a much more tractable task to automate than the two other currently-manual tasks in my design: identifying the most important elements in a text, and writing good prompts for each of them. It’s nice that these tasks are somewhat separable, so that I can make some progress by automating just the highlight-to-prompt mapping.
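As a rough illustration of what the simplest automated mapping might look like, here is a sketch that matches a reader's highlight to curated highlights purely by character-range overlap. This is an assumption for illustration, not a description of the prototype: the names `Span`, `overlap`, and `map_highlight`, the example ids, and the 50% coverage threshold are all invented, and a real mapping would probably also need to weigh meaning (for instance, an emphasis note attached to the highlight).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two spans (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def map_highlight(reader: Span, curated: dict[str, Span], min_fraction: float = 0.5) -> list[str]:
    """Return ids of curated highlights of which the reader's highlight covers at least min_fraction."""
    matched = []
    for curated_id, span in curated.items():
        length = span.end - span.start
        if length > 0 and overlap(reader, span) / length >= min_fraction:
            matched.append(curated_id)
    return matched


# Hypothetical curated highlights, keyed by made-up ids.
curated = {"def-linear-eq": Span(100, 180), "gauss-ops": Span(400, 520)}
print(map_highlight(Span(90, 170), curated))   # ['def-linear-eq']
print(map_highlight(Span(500, 600), curated))  # []
```

The threshold is measured against the curated highlight's length rather than the reader's, so a long, sweeping reader highlight can still match several short curated ones.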

And if I automate that mapping, I can make this prototype publicly available, albeit constrained to this one book. That’s a nice milestone to work towards, and I’m sure that—as always—public use would surface many unexpected insights. From there, yes, it would be great to automate the curatorial and prompt-writing tasks. But I’m also interested in the prospect of using one large book as a depth-first laboratory to explore other reading augmentation ideas, many of which I’ve discussed in these essays.

One idea that’s animated my work is the claim that “books don’t work”. That is: in order to actually understand, internalize, and remember ideas from an explanatory text, a reader has to employ all kinds of tacit and often unreliable strategies, and the medium of the book does surprisingly little to help. What if the experience of engaging with a text naturally assured all this extra work gets done? I want to create an alien sense of capacity and ease when I engage with explanatory texts.

Breaking down some of the things which you may need for a book to really “work” the way you might hope:

  • Comprehension. You need to actually process the words on the page, and notice when you’ve failed to do that. This is surprisingly difficult for most readers, much of the time! Most interventions are quite obtrusive; the current prototype is my attempt at a more functional augmentation.
  • Memory. You need to remember what you read. This motivated Quantum Country and the mnemonic medium. But I find that today’s memory systems often produce brittle memory, and I’d like to explore ideas like varying prompts and escalating their complexity to improve that.
  • Elaboration. You need to understand not just what the text says, but what it means, why that matters. You need to connect the text’s ideas to prior knowledge and experience. Discussion is one method I like; thoughtful writing (ideally for some authentic purpose) is another. I’d like to find more, and I’d like to find ways to better connect those activities to the reading experience.
  • Fluency. You need to practice using what you’ve read so that it becomes automatic. Pattern induction; schema acquisition; knowledge compilation. In technical topics, this often means problem-solving practice—see projects like Mathigon and Execute Program. Project-based learning is another common approach, and I’m interested in ideas like “doing-centric explanatory mediums” to that end.
  • Intervention. You need to diagnose and resolve confusions and misconceptions that you may have. Procedurally-focused problems often fail to clearly identify conceptual issues. Teaching others is a classic approach here, inspiring integrated interventions like AutoTutor. Frontier LLMs are often surprisingly good at resolving confusions, when the user can articulate them. I’m interested in integrated, lightweight methods for identifying and acting on confusion.
  • Integration. Much of the time, you read a book not just to acquire knowledge but because you want it to change you somehow—change the way you think, or the way you view the world, or the way you act or feel in a situation. To make a book real in this way, you have to carry its ideas with you into your life. For a few ideas, see salience prompts and timeful texts. Reading clubs are great for this; I’d like to explore more ideas at the intersection of new media and social convening.

That’s enough research agenda for several lifetimes, of course. But I articulate all this here as a way of helping myself resist the cultural forces surrounding me in San Francisco. Those forces demand that if I find an idea—like highlight-driven memory prompts—I should focus aggressively on scaling it to apply to as many places and people as possible. There’s merit in that, of course! I certainly want to use this special highlighter everywhere. But I also need to weigh that impulse against the prospect of uncovering more foundational ideas, of solving the problem I care about more completely.

————————

Thanks to all the students who worked with me on this first round of tests, and thanks to Benjamin Reinhardt for helpful discussion about my next steps.


(Recording) Discussion: AI-generated ad-hoc UIs and malleable software; Saturday, November 25th @ 9AM PST

Thank you all for a very interesting discussion! See original post for premise.

Please don't share this recording publicly.

And here's the chat transcript with some links which were mentioned:

00:20:19.962,00:20:22.962

Taylor Rogalski: quip had a nice chat-as-changelog-alongside-artifact pattern https://www.smartsheet.com/sites/default/files/2020-02/5cba0c3ca6e1a17be8045cc0.jpg


00:27:12.684,00:27:15.684

Andy Matuschak: This is my #1 favorite UI + AI paper! https://www.pnas.org/doi/10.1073/pnas.1807184115


00:27:18.199,00:27:21.199

Andy Matuschak: (the Jeff Heer one)


00:45:41.614,00:45:44.614

Maggie Appleton: I forget if mentioned in the call notes, but Adept's latest release very relevant here: https://www.adept.ai/blog/experiments


00:46:44.369,00:46:47.369

Maggie Appleton: At the moment it's essentially keyboard maestro in the browser, but the unreleased stuff they're working on much more interesting and includes lots of on the fly UIs: https://www.youtube.com/watch?v=PAy_GHUAICw


00:46:55.387,00:46:58.387

Andy Matuschak: ^ this is new to me—thank you, will watch!


00:50:26.475,00:50:29.475

Andy Matuschak: https://dynamicland.org/links/2023-04-27


00:55:15.232,00:55:18.232

Tim Riherd: https://www.sidefx.com/tutorials/foundations-overview/


00:55:17.112,00:55:20.112

Taylor Rogalski: https://en.wikipedia.org/wiki/5D_Chess_with_Multiverse_Time_Travel


00:57:02.054,00:57:05.054

Maggie Appleton: @Geoffrey have you written anywhere about those hypothetical primitives of a world where we're generating UIs? Want to think about that more


00:57:30.416,00:57:33.416

Geoffrey Litt: hmm i dont think so;  maybe soon :)


00:57:45.575,00:57:48.575

Maggie Appleton: 👍 please do


00:57:51.926,00:57:54.926

Tim Riherd: +1


00:58:20.816,00:58:23.816

David D: https://github.com/inkandswitch/cambria-project


01:00:59.230,01:01:02.230

Maggie Appleton: Walking treadmill is a sad solution to that dream ;)


01:01:29.100,01:01:32.100

Tim Urian: for real!


01:01:53.136,01:01:56.136

Geoffrey Litt: Currently 1% of people make software; getting to 10% would be huge progress. 100% not necessary!


01:01:58.290,01:02:01.290

Andy Matuschak: ++


01:03:11.572,01:03:14.572

Gabriel Nunes: The problem of adapting data between interfaces for malleable software was the subject of my master’s thesis (not AI though): https://www.kronopath.com/blog/atypical-a-differently-optimized-type-system/


01:03:17.299,01:03:20.299

Maggie Appleton: ++ thanks for hosting


01:03:18.279,01:03:21.279

Daniel Dosen: Thanks!


The org-mode-for-sermons video Taylor was referring to: https://www.youtube.com/watch?v=Q8AqHdZTgNI


Discussion: AI-generated ad-hoc UIs and malleable software; Saturday, November 25th @ 9AM PST

tl;dr: Join me (via Google Meet) this Saturday, November 25, at 9 AM PST [GCal] for a discussion of the intersections between two fascinating research agendas for the future of personal computing: malleable software, which aspires to let users compose and tailor interfaces; and AI-generated ad-hoc UIs, which aspires to let users generate interfaces anew, on-demand.


One of my favorite current lines of human-computer interaction research pursues the dream of malleable software. Rather than being "stuck" within the silos of traditional apps, these systems imagine that users could spontaneously recombine favorite parts of different pieces of software and, when necessary, modify those pieces to suit their idiosyncratic preferences. This dream stretches back to Smalltalk and the Dynabook, but Klokmose's Webstrates inspired much recent work. Here's a 2021 video overview of that lab's work; see also Potluck for related ideas in the context of text editors, Mirrorverse for newer work with malleable video interfaces, and Dynamicland for a physicalized approach.

Recently, though, large language models have focused new attention on another old idea: AI-generated ad-hoc UIs. If the central idea of malleable software is that different people in different situations need different software, one way to achieve that is through flexible composition and tailoring; but another way is to generate the appropriate software and UIs as needed. "Apps" become throwaway objects, like an afternoon's arrangements of papers on a table. For past work, see e.g. programming by example. More recently, we've seen ad-hoc UIs generated in the context of chats (Dot, at least aspirationally), search (Perplexity), and whiteboards (tldraw). None of these really substitutes for the full aspirations of the malleable software projects… yet.

Some questions I'd like to understand better:

  • How might users refine and expand the behavior of AI-generated UIs?
  • How might AI-generated UIs express their interpretation of user intent; or, how might users understand how the AI-generated UIs will behave?
  • What unique challenges will we face in "mixed-initiative" UI programming, in which part of the UI is machine-generated, and part is hand-created or modified by an expert?
  • Is the range of AI-generated ad-hoc UIs bounded by the intelligence of the underlying LLMs? Are there more fundamental limitations?
  • If AI-generated UIs are successful, what happens to the malleable software agenda? Is it fully subsumed? Or do its goals shift?—how? Do its ideas somehow enable part of the AI-generated UI agenda?

Please read/view some of the links above, and bring your own favorites to share, as well as questions and ideas! I'll record the discussion and share it here afterward.


From my notes: first impressions of Humane's Ai Pin

I spent some time this weekend writing notes around the Ai Pin and recent related announcements in AI-centric personal computing. This isn't a proper essay—just some rough notes—but I thought enough of you might be interested that I'd share.

Humane 

From Imran and Bethany and Ken Kocienda and friends, a new personal computer which de-emphasizes screens and direct interaction in favor of Ubiquitous computing ideas and AI-centrism. The company’s first product is the “Ai Pin”, announced 2023-11-09.

Ai Pin

The device is a wearable pin with a camera, microphone, speaker, touch pad, and laser projector, along with various other smaller sensors and components. The intent is to enable a mostly screen-free computing experience, driven primarily by voice.

The primary interaction pattern with the device is to press and hold its face plate, and to speak a query aloud. What distinguishes this interaction from Siri and its ilk? The query is evaluated not via GOFAI-style heuristic trees, but rather via GPT-4 (or maybe sometimes 3.5-turbo?), with Retrieval-augmented generation through appropriate contextual information like location and user data (calendar, messages, contacts, recent queries, etc). It also appears to implement something like the Reason+Act pattern, so that for instance information about music or specific locations can be supplied to response generation; the model can output action plans which include things like sending a message, playing a song, and so on.

The device has a camera, but (as of 2023-11-09) its integration into queries appears to be limited to questions about nutrition and food (“How much protein is in this?”). Otherwise, it can take impromptu photos, though of course the composition will be somewhat unpredictable.

The device’s primary output modality is audio, but it also features limited visual output via a monochromatic “laser ink” system, projected onto the user’s hand. We’ve been shown a minimal framework of gestural interaction: you can pinch your hand to actuate, tilt your hand to select a secondary function via a pie-menu-like design, and move your hand in Z to access system menus. I presume that it would be uncomfortable to use this for more than a few seconds at a time, but that’s compatible with the device’s intent to keep users present in their environment.

Queries, actions, photos, reminders, and other key items are made available as a stream of memories via the Humane.Center web site.

“We don’t do apps”

Actions begin with natural-language queries, so there’s no traditional home screen or app switcher. Instead, the device appears to use something closer to the blackboard pattern: various services can supply information and actions to the query, as contextually appropriate (again, presumably through a combination of Retrieval-augmented generation and something like the Reason+Act pattern).

One can imagine retrieving appropriate information based on one’s context, so that (for instance) a transit service could supply location-appropriate timing information, or information about a museum you’re visiting. The blackboard pattern also permits distributed sourcing of information and actions: perhaps your local restaurant could have a small Bluetooth device which broadcasts its menu so that it would be available for appropriate local queries.
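For readers unfamiliar with the blackboard pattern, here is a minimal sketch of the general idea: independent services each inspect a shared query and context, and post whatever they can contribute, and the results are assembled into the text handed to a language model. This is purely illustrative and says nothing about how Humane's platform actually works. The `Blackboard` class, the two example services, the context keys, and the sample data they post are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    query: str
    context: dict  # e.g. {"location": ...}; these keys are hypothetical
    entries: list = field(default_factory=list)  # (source, text) pairs

    def post(self, source: str, text: str) -> None:
        self.entries.append((source, text))


def transit_service(board: Blackboard) -> None:
    # Contributes only when the query and context make it relevant.
    if "bus" in board.query.lower() and "location" in board.context:
        # Placeholder data; a real service would look up live timings.
        board.post("transit", f"Next bus near {board.context['location']}: 4 min")


def restaurant_menu_service(board: Blackboard) -> None:
    if "menu" in board.query.lower():
        # Placeholder data standing in for a locally broadcast menu.
        board.post("restaurant", "Today's special: mushroom risotto")


def assemble_model_input(query: str, context: dict, services: list) -> str:
    """Let every service look at the board, then format what was posted."""
    board = Blackboard(query, context)
    for service in services:
        service(board)
    notes = "\n".join(f"[{source}] {text}" for source, text in board.entries)
    return f"Context:\n{notes}\n\nQuestion: {query}"


print(assemble_model_input(
    "When is the next bus?",
    {"location": "Market St"},
    [transit_service, restaurant_menu_service],
))
```

The appeal of this arrangement is that services never need to know about one another: adding a new source of information means adding one more function to the list.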

But “we don’t do apps” clearly isn’t the full story. For example, the music player has an interface on the laser ink display. What defines that interface? And where does the music come from? In this instance, the answer appears to be Tidal, which has partnered with Humane. But I have a Spotify subscription; how might that be made to work? Perhaps Spotify (and other music players) could expose a suitable API which would permit the device to query and play music. Perhaps that API could even transmit declaratively-specified interfaces for the laser ink display.

Likewise, consider the text messaging service. Many of my friends use WhatsApp or Signal. Perhaps on this device we think of those systems as API endpoints which expose suitable information and actions to the device’s LLM queries, and perhaps declarative interface specifications for service-specific presentations and interactions. I see from Humane’s web site that they have a partnership with Slack; I wonder if that necessitates extensive first-party work on their part, or if it’s more of a permission-granting relationship. That is: is it more like the original iOS YouTube app, or like iOS VOIP apps (which must be Apple-blessed IIRC)?

Finally: suppose I want to use the device to think aloud about my research as I’m walking around. It would be natural to query my research notes, which are stored on my laptop as a big folder of Markdown files. Perhaps this information could be synced into Humane’s cloud so that appropriate content could be supplied to conversation in the future.

I’m extremely curious to learn more about the architectural design of the platform. I would guess that for now, in the name of shipping, it’s fairly ad-hoc and special-cased—but that they have some more systematic plans in mind.

The power of context and deixis

One naive way to look at the Ai Pin is to ask: how is this different from Siri? One first and important difference, of course, is that it’s powered by GPT-4, which is capable of open-ended natural reasoning, while Siri is more like one of those hideous phone tree systems that tries (and often fails) to use simple heuristics and decision trees to map your request onto some limited action space it supports.

But a much more interesting difference is the way you can use words like “this”. “How much protein is in this?” “What’s the best way to get to the meeting this afternoon?” “When was this park built?” You can point to things, or times, or places, and the device’s sensors and prior activity surface the context necessary to make answers possible via Retrieval-augmented generation.

The device’s responses to queries which include the word “I” or “my” are likewise enriched: “Have I been there before?” “Don’t my notes include something about this?” I imagine that many people will end up giving basically unlimited context about their lives to these models—e.g. via Rewind, every piece of text I’ve ever viewed on a screen.

This same logic is what makes Dot interesting to me.

On the astounding luck of Humane’s timing

Humane was founded in 2018, a year before GPT-2 showed that first glimmer of promise which most technologists (including me) still largely ignored. At the time, when I heard about the company’s ambitions, the main goal seemed to be to reimagine the personal computer without screens, so that people could remain connected and present in their worlds. The patent and fundraising hype emphasis was on the laser projection display and the wearable pin form factor. Insofar as AI featured in job reqs, the emphasis seemed to be on the computer vision which enabled the laser ink gestures. We heard talk of voice-based inquiries, but the aspirations seemed closer to something like “Siri++” than to a weak AGI.

But now, in 2023, what’s interesting about this device—to me—bears little resemblance to any of that. As far as I can tell, this device will succeed or fail because it deploys a Siri which actually works, a Siri which uses a frontier LLM and supplies it with gobs of appropriate context. The laser ink display isn’t nearly as central as the early hype suggested. I haven’t used an Ai Pin, of course, but my impression is that if the laser ink projector weren’t present at all, the device would be worse off, but its ultimate success or failure would not change. Likewise, the computer vision doesn’t seem terribly essential (yet—perhaps deixis will extend more substantially to vision in the future). And the wearable pin form factor doesn’t seem very important either; it’s easy to imagine a very similar device as an earbud.

All this to say: the extraordinary work of many brilliant designers and engineers notwithstanding, I believe this device will owe its success (if it succeeds) to the shocking capabilities of GPT-4. If we were still in 2018’s NLP days, and they could ship something only roughly Siri-level, the Ai Pin would amount to something like an extraordinarily expensive and cumbersome AirPod alternative. And so, my gosh: the timing! They got so, so incredibly lucky! They can’t have predicted in 2018 that this much progress was going to happen by 2023; and if they did make that prediction, it would have been a pretty irresponsible bet.

How long might it have been clear that these kinds of results were possible? Maybe a fine-tuned GPT-2 could have achieved a few of their demos; they could have tried that as soon as November 2019. But probably not. GPT-2 had a 1024 token context window, which would have been too small for the contextual awareness which most of these demos rely upon. Maybe they could have made it work with many-shot GPT-3 prompting as soon as its invite-only availability in June 2020. But I doubt they would have gotten far before InstructGPT at the earliest, in January 2022. That’s awfully recent! And prior to that release, very few technologists had internalized the astonishing acceleration of language transformers’ capabilities. I doubt the Humane folks had.

In summary: what unbelievably fortunate timing for the Humane team! If InstructGPT had taken two years longer, would they have been able to sustain their funding? Would they have released some “Siri++”-like thing instead? Could they have survived that?

Versus AirPods and an Apple Watch

One problem for the Ai Pin is: if what I find compelling is that it offers an intelligent, context-laden, voice-driven AI assistant… then does that justify a significant hardware purchase? Is it a defensible moat?

I often keep an AirPod in all day while walking around. I can access a voice-driven AI assistant via an AirPod. Of course, an AirPod can’t show me visuals—there’s no laser ink display—and only I can hear the audio responses, which you could argue cuts me off from people around me in a way that the Ai Pin does not. But, OK: let’s consider the combination of the AirPods with an Apple Watch. Now I’ve got a visual display with similar I/O limitations: you only want to use it a few seconds at a time, it’s a bit anti-social to consult, and it accepts only limited gestural input. And my watch has a speaker which could emit responses to friends if I’m in a social setting. (The remaining important difference in basic functionality is a camera—but the Ai Pin’s camera doesn’t seem to expand its capacity much. Maybe this difference will become decisive in time; I can certainly imagine that.)

The main problem with the AirPod and Apple Watch combination is that they’re made by Apple, and Apple isn’t presently participating in frontier language model applications. Apple strikes me as exceedingly unlikely to partner with OpenAI, given its privacy stance. I expect that Apple will continue to lose in competition for top-tier ML talent. Maybe they can replicate something like GPT-4 with their internal teams, but given that Anthropic and Google haven’t managed to do so after eight months, I expect this will take some time. Meanwhile, Apple won’t let third-party apps like Dot have enough access to the hardware to create interactions which have as little friction as the Ai Pin’s.

But if that litany weren’t true—if Apple were competitive in the AI race, or if its platforms were more open—it’s hard to imagine that I’d buy an Ai Pin if I already had AirPods and an Apple Watch.

What if I didn’t have those devices already? Well, I’d want Bluetooth headphones of some kind. In Humane’s advertisements, they show people playing music on the Pin’s speakers. I hate when people play personal audio in public places. So I’d want to buy something like the AirPods anyway. As far as the Watch versus the Pin, I suppose it’s mostly a question of form factor. I find the watch form factor to be a less obtrusive placement for a wearable, but I can imagine that others would prefer the pin’s placement. It’s great that the pin’s placement enables vision-based workflows. But I assume that I’d continue to carry my iPhone, even if I had an Ai Pin. And so I can imagine a workflow where (wearing an AirPod), I say “Siri, what’s this?” and then pull my phone out of my pocket to hold it up to some object for a moment. A little more friction, perhaps, but it doesn’t seem like a clear dealbreaker to me. And of course, in this configuration, I can take photos with more intentional composition and a vastly more capable camera.

Taken together, my rough take is that the Ai Pin is compelling as an independent hardware purchase only due to Apple’s cultural problems and platform policies. I don’t think the device hardware itself is really necessary beyond the existing hardware already available to consumers—the unique capabilities are mostly a matter of integrated design execution and software. That’s a scary position to be in!

Does Humane succeed in its stated goal?

Humane aspires to help people remain more connected and present—with each other, in the world, not staring at their screens. Does it succeed? I’ll need to try the device to say, really. My high-level impression is that it offers a less obtrusive avenue to input and output for many standard tasks than a smartphone, but it’s not obvious to me that it performs better in that regard than an AirPod. Or, if it does, that’s in large part only because of Apple’s AI limitations (see previous section). The laser ink display doesn’t strike me as a victory for staying present: it looks uncomfortable and anti-social. Likewise, public audio input and output seems pretty obtrusive to me. For most applications, I’d prefer private audio output and—ideally—a Silent speech interface for input.

I look forward to updating this once I’ve had a chance to try the device!

2023-11-09 marketing impressions

I’m honestly quite shocked by the contrast between Humane’s 10 minute marketing video and Dot’s marketing story. Humane’s is quite technology-centric, thing-centric. The first sentences? “It’s a standalone device and software platform built from the ground up for AI. It comes in three colorways… There’s two pieces, a computer and a battery booster. Now, the battery booster powers a small battery inside the main computer… This is a perpetual power system… Built right in, our own Humane network, connected by T-Mobile… It runs a Qualcomm Snapdragon chip set…” What?!

Tell me stories! Tell me about how this helps me live a better life! Contrast that with Dot’s warm and human story of Mei and her journey. And gosh: the delivery is bafflingly low-affect, low-energy. Imran, friend, are you OK? And why are we looking at this device in your extremely cold and sterile lab, rather than out in the world, the one it’s supposed to help us remain connected with? I’m a fan of Humane’s ideas, and of Imran and Bethany and Ken, but I can’t fathom why they made their headline introductory video this way, or why they thought this was an attractive way to present all their hard work.

The marketing web page likewise begins with technology: laser ink display, gestures, touch-and-hold, etc. It improves from there, and ends up in a better place than the 10 minute video, showing some authentic scenarios in which the Ai Mic and other related features would be useful in life.

Humane’s 1 minute video is much better by contrast, much more centered on reality and life. There are some very good bits, particularly a “catch me up” moment, a mother taking a photo of a child, and a bilingual interpreter. It does come off as a bit of a grab bag of features, rather than a unified vision, and I don’t always understand the curatorial direction. For instance, it strikes me as odd that “nutritional facts” is the first thing emphasized in the film debuting this device. High wind alerts?

One of the demonstrations is “what are some fun things to do nearby?” I’m not sure that the Ai Pin’s form factor really shines in this query. This task lends itself to laterality and visuals—it’s best answered by quickly scanning a big list, with imagery and information density, and pointing at items of interest to dig into. That’s not something that the voice and laser ink combo do all that well. I think the people in the video would be better off with a traditional smartphone display for this task.

On breadth vs. depth in learning

There’s a funny downside to doing all this research on memory and comprehension—a sort of loss of innocence. I’ve viscerally internalized just how poorly I’ll understand and retain complex ideas when I read in my default “casual” gear. And I’ve learned something about methods which produce deeper and sturdier understandings: memory systems, active reading practices, elaborative notes, and so on.

But now I have two new problems. The first is that part of me feels I “should” be reading in those slower gears all the time, or at least much more of the time. Casual non-fiction reading now carries a tinge of guilt. I’m keenly aware of how little I’m really internalizing.

The second problem is a new sense of finitude and scarcity. It takes a long time to study a text properly! In this mode, a page I might have casually read in a minute now consumes ten minutes—or twenty, or an hour. I’ll spend tens of hours carefully studying one monograph, but my reading list is miles long. At this rate, I can only read a tiny fraction of it. What if a critical insight is hiding halfway down the stack?

How should I contend with these problems? How should I think about the tradeoffs between breadth and depth—or quantity and quality—in my reading? My observations will be necessarily personal, focused on my own proclivities and priorities. But people often write me with similar questions, and I hope this will help.

Against a naive “efficiency mindset”

I want to begin by observing that there’s something very wrong with the way these discussions often go: they frame the issue as a simple optimization problem. “How can I read as many books as possible?” or “I have ten hours per week of study time; how can I learn as much as possible to a satisficing level of depth?”

If taken too seriously, this stance is corrosive to the kinds of authentic curiosity and creative engagement that produce good work, good insight, and—for me—a good life.

If I’m reading about ideas I find personally meaningful, and the explanation is captivating, I want to wallow in the text. I want to figure out how to go deeper, not how to “get through it” faster.

The real trouble is that much of my non-fiction reading is in a category I’d label “merely useful”. A given paper might not be worthy of devotional attention, but it does help, in some way, to accomplish my goals. We might improve upon the optimization framing by asking something like: “How to find the most beautiful sources of insight for a given aim? Is there a path which only traverses texts worth wallowing in?”

One problem with this new framing is that such a path often doesn’t exist. If I’m trying to reach the edge of human knowledge about a given question, I’ll find that many key ideas are buried haphazardly within piles of otherwise prosaic sources. Of course, there are enough beautiful texts about beautiful ideas to occupy a lifetime, but if I were to read exclusively from that list, learning a little about a thousand topics, I would be sacrificing my personal creative interests.

Another problem is that I’m often not quite sure what my aim is—I’m exploring, following my nose, looking for ideas that resonate with an inchoate interest. I’m not ready to wallow in anything yet. Sometimes that’s because I’m reading in off hours, when I have the energy to sniff around but not for careful study.

Both of these problems require making trade-offs in favor of breadth. Now that I’ve grounded the concerns in meaning, I’m more comfortable exploring how we might “optimize”. If I’ve exhausted the most compelling sources for my present aims, or if I’m not sure what my aim is, or if I’m reading with low energy, how should I orchestrate my reading to support my creative practice and my search for beauty?

Mapping a space

When I dove into the reading comprehension intervention literature, I didn’t know enough about the space to know what I wanted to read carefully. For the first week or two, I was just sniffing around, my questions too vague to articulate well. In this time, I certainly wasn’t studying deeply. I was sifting through dozens of papers, making (mostly mental) notes of key words, people, and concepts. And I was paying attention to my internal reactions (“Boring!”; “Ooh…”; “Dubious.”; etc) to help me get a clearer sense of what I really wanted to know.

I mostly prioritized breadth during this time. I was honest with myself: I knew I wasn’t absorbing much. I wrote some notes and memory prompts here and there, almost as signposts, but my goal was to begin to roughly chart the space. I wanted to know what the main “territories” are, how they relate to each other, and where the main ideas might be found within those regions. Eventually, some questions cohered, though they were still quite broad:

  • What are the mechanistic causes of reading comprehension gaps?
  • How do these gaps impact understanding?
  • How do the processes of text comprehension connect to what I already know about the processes of conceptual learning?
  • What causes some people to read with more comprehension than others?
  • What sorts of interventions have people tried? How did they go?

I used these questions to structure a somewhat slower read of a smaller pile of papers and books which had emerged as “canonical” during my initial sweep. This pass helped me refine my questions and my reading list. For example, I found myself asking much more specific questions about methods for distinguishing reading comprehension gaps from issues of memory encoding and retrieval.

As my questions got sharper, and as I triangulated a better bibliography, I started reading much more carefully. I ensured that I could fully reconstruct the procedures of key experiments. I took care to understand the details of theoretical frameworks well enough to apply them to my own questions.

The full “dynamic range” of my speed across this exercise was quite wide. Initially, I’d just read abstracts—maybe a minute per paper. After a few weeks, I’d spend an hour or two on a paper. And after a month, I’d spend a day or two deeply studying a particularly good paper.

In my introduction, I mentioned two problems: I often feel like I “should” be reading in slower gears all the time; when reading slowly, I feel anxious about how little ground I can cover. My literature review example illustrates one way I think about resolving these problems. When I’m not sure what my aim is, I don’t need to feel like I should be reading in a slower gear. In fact, in this situation, I should mostly be reading quite shallowly, trying to build an index and to inform my sense of what matters. I can cover lots of ground in this mode, and I can cheaply triangulate a smaller pile of higher-quality works for more careful study. It’s also a mode I can use constructively when I’m low on energy. Critically, though, I don’t delude myself into thinking that I’m learning much at the object level. Once I have a clearer sense of my aims, my heightened emotional connection to those sharpened questions makes it easy to set aside quality time for slow, careful reading. Then, when I’ve exhausted the most compelling sources, I’m best served by speeding up again, grazing over wider swaths until I find ideas worth lingering over again.

No need to count cards

When I started using memory systems, I felt a chronic optimizer’s anxieties: which material should I add to my memory system, given the cost of review? In how much detail? Happily, I don’t worry much about this anymore. Now I view review time as more or less “free”. I’ll illustrate with a few examples.

The first essay of Quantum Country is about 25,000 words long and contains 112 prompts. Let’s call it one for every 225 words. A typical reader will take about four hours to read that essay. With the current schedule, they’ll spend about an hour reviewing those prompts in the first year, and a small fraction of that time thereafter.

I recently wrote prompts covering the first section of Jim Hefferon’s Linear Algebra textbook. This section is about 2,500 words long, and I wrote 28 prompts, roughly one every 90 words. My test readers take about 45 minutes to read this section. I don’t have long-term data on these prompts’ review time, but my own data from similar texts would suggest they’ll spend about 15 minutes reviewing in the first year.

Taken together, these examples suggest that memory practice adds an overhead of a quarter to a third of the original reading time. That’s really not enough for me to be worried about. My original optimizer anxieties were thinking about these review costs in terms of doubling—or quadrupling!—the effective time. That’s simply not the case. If I find some material striking or possibly useful, that extra quarter or third is well worth paying in almost all cases.
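To make the arithmetic behind that “quarter to a third” explicit, here is a small sketch that recomputes the ratios from the figures quoted above. The numbers come from the two examples; the dictionary layout and the `summarize` helper are just illustrative names, not part of any real tool.

```python
# Figures quoted in the two examples above; names are illustrative only.
examples = {
    "Quantum Country, first essay": {
        "words": 25_000, "prompts": 112, "read_min": 4 * 60, "review_min": 60,
    },
    "Hefferon, first section": {
        "words": 2_500, "prompts": 28, "read_min": 45, "review_min": 15,
    },
}


def summarize(words, prompts, read_min, review_min):
    """Return prompt density and first-year review time as a fraction of reading time."""
    return {
        "words_per_prompt": words / prompts,
        "review_overhead": review_min / read_min,
    }


for name, figures in examples.items():
    s = summarize(**figures)
    print(
        f"{name}: one prompt per {s['words_per_prompt']:.0f} words, "
        f"first-year review = {s['review_overhead']:.0%} of reading time"
    )
```

The first example works out to an overhead of 25% and the second to roughly 33%, which is where the quarter-to-a-third range comes from.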

By comparison, the problem set associated with that linear algebra section took me about 2.5 hours to complete: ten times the “overhead” of the memory practice. How many problems should I do, if they’re provided? If the problems are about ideas I really want to understand, they’re precious gifts; my rough policy is to continue as long as I find myself engaged or surprised in the course of answering them.

If I feel that I’d rather cover 25% more ground than reliably internalize whatever details I find important, then one of these things is usually true:

  1. I’m quite unsure of what I find important, and I’m skimming to map a space. This is fine, so long as it’s intentional!
  2. I’m reading in a low-energy state, and I don’t feel like exerting effort at the moment. This is fine when reading for entertainment.
  3. I’m fooling myself, imagining that I’m understanding and retaining the material “well enough” without extra support. This is rarely true for ideas I actually care about.

In fact, review costs are effectively much lower than we’ve discussed, because review time is rarely rival with deep reading time. I review in fragmentary time: waiting in line, on a bus, between appointments, and so on. I wouldn’t be reading seriously in that time, anyway. For me, review time is more directly rival with, say, Twitter time.

So I don’t worry about the time costs of review anymore. But two variants of this concern do remain relevant. The first is that while reviewing is cheap, writing good prompts can easily double or triple my reading time. Now, that’s a misleading observation—much of what’s happening is that the prompt-writing process forces me to a deeper level of comprehension, and to make connections and elaborations which wouldn’t have occurred without the prompt-writing. Assuming the material is meaningful, that’s all time very well spent. But a big chunk of the time is fairly mechanical, and this cost is much more impactful than the cost of review. (More on this later.)

Another much more salient version of the “time cost of review” concern is the emotional cost of review. For example, when I’m first engaging with a topic, I often have a poor sense of what’s important, and of how much reinforcement will be necessary. A few months later, I’ll often run into a glut of prompts written during that early naïveté, and I’ll find myself bored and emotionally disconnected, wanting to do something else. That’s certainly not how I want review sessions to feel. I’m not worried about the cost of reviewing a given prompt in terms of time; I worry much more in terms of the damage which unsuitable prompts can do to my relationship with review.

Shifting the gumption frontier

Educational technology designers often talk in terms of shifting a Pareto frontier in learning. That is, if you have some fixed amount of time, and you use the best methods available, you can learn a given topic to some level of depth. But if we develop new methods, we can expand that limit, so that you’ll be able to learn to a deeper level in the same amount of time. Conversely, if you want to learn some material to a fixed level, new technology may lower the time required for the best-available route.

I think this framing can often be improved by replacing the time axis with “gumption” or “energy” or “enthusiasm.”

Relative to studying with traditional flashcards, an algorithmically scheduled spaced repetition system lets me learn more material more deeply in a given amount of time. But I don’t think time is the most important factor limiting my use of traditional flashcards. The more important issue is that traditional flashcards rapidly drain my enthusiasm. After a few sessions, I’ll know most of the cards quite well, and I won’t need to review them. But the format still requires me to flip through the entire deck, answering every card, in order to ensure that I reinforce the few which give me trouble. The exponentially expanding schedule of spaced repetition systems reduces (but does not eliminate) this gumption tax. And because it focuses review on the prompts I’m more likely to miss, I feel the value of review more keenly, which generates gumption. These dynamics make me willing to use spaced repetition systems in more situations than I’d use traditional flashcards.
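A rough way to see the size of that gumption tax is to count card reviews under each regime. The sketch below is a toy model under loud assumptions: one session per day, every answer correct, and intervals that simply double (1, 2, 4, … days). Real schedulers adjust intervals based on your answers, so treat the numbers as illustrative only; the function names are made up for this example.

```python
def full_deck_reviews(num_cards: int, num_sessions: int) -> int:
    """Traditional flashcards: every session flips through every card."""
    return num_cards * num_sessions


def expanding_reviews(
    num_cards: int, horizon_days: int, first_interval: int = 1, factor: int = 2
) -> int:
    """Toy expanding schedule: each card comes up after 1, 2, 4, ... days.

    Assumes every answer is correct, so intervals only ever grow.
    """
    reviews_per_card = 0
    day = 0
    interval = first_interval
    # Keep scheduling reviews until the next one would fall past the horizon.
    while day + interval <= horizon_days:
        day += interval
        reviews_per_card += 1
        interval *= factor
    return num_cards * reviews_per_card


cards, days = 50, 30
print("full deck daily:", full_deck_reviews(cards, days))
print("expanding schedule:", expanding_reviews(cards, days))
```

In this toy setup, a 50-card deck costs 1,500 card reviews over 30 daily sessions when flipped through in full, versus 200 under the doubling schedule. Missed cards would add reviews back in, which is consistent with the tax being reduced rather than eliminated.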

I listen to most podcasts and talks at 2-3x speed. This isn’t primarily because I want to “get through them” faster. It’s because at slower speeds, these media usually aren’t insight-dense enough to hold my attention, energetically. At 1x, my attention will often wander off, or I’ll find that I want to listen to something else. The speed control is a technology which shifts my gumption frontier: it lets me get more out of these media at a given level of enthusiasm.

Personalized learning systems like Khan Academy aim to optimize learning by algorithmically ranking exercises so that each student works on the best-available task at a given time. This is usually framed in terms of learning more per unit time. But again, I think the more important issue is gumption. Nothing saps enthusiasm like a long string of problems you can already solve fluently. You may also quickly lose gumption if you’re asked to tackle problems significantly beyond your level of development. Now, when we reframe the issue in terms of gumption, rather than time, we can also see that a good solution will require more than estimating the probability of a successful answer. It makes us ask how we might orchestrate activities which are meaningful and rewarding for the student.

My enthusiasm for the mnemonic medium and for my recent highlighter-based memory system proposal draws on a similar logic. Yes, it takes time and expertise to write good memory prompts. But in many cases, the more decisive bottleneck is gumption. Comprehensive prompt-writing often feels disruptive and draining; in many situations, I don’t write nearly as many prompts as I would care to review. I’d love to shift that Pareto frontier—to reach a given level of depth with a lower tax on my gumption, or to increase my depth of study for a given level of enthusiasm.

Learning by doing vs. “book learning”

I grew up as a programmer. And, like most programmers, I absolutely loved “learning by doing”. Rather than reading a reference manual on a new language, I’d just dive in, try to implement something in it, and learn just-in-time when I encountered issues. This method was great for me in a number of ways. It gave me rapid feedback loops and rapid rewards. It kept my learning activities more closely connected to my actual goal (writing programs). When I did engage in “book learning”, I’d be especially motivated because I’d be doing so in response to some specific issue I’d encountered. And all the hands-on effort helped cognitively, too: to build fluency in a practical skill like programming, we must induce and learn patterns across many experiences.

What kinds of material can be learned in this way? What kinds of problems arise in this style of learning? I don’t yet have good answers to these questions. For now, I’ll share a few sketchy observations.

When I was a teenager, I’d been programming in C for years, but I often encountered inscrutable crashes due to “accessing bad memory”. I spent hours staring at my code, changing things semi-randomly, trying to track down the bugs. None of it made sense. It turned out that this was because I simply didn’t understand pointers and memory allocation on a conceptual level, at all. I was cargo-culting pointers, copying the relevant lines from other programs without any idea of what they were doing. No amount of “just trying to build programs” fixed this problem. I needed to sit down and actually learn the relevant conceptual material, which was covered in multiple books which I owned but hadn’t bothered to read. Unfortunately, at this point, I’d built bad habits around reading: I’d developed a dependence on learning by doing, and I didn’t have the attention span to study difficult conceptual material. It took me years to fully fill these holes. And then the same problems repeated soon thereafter when I “learned” OpenGL by copying and building upon code from tutorials. I made some awfully elaborate games despite my enormous conceptual gaps. But more and more often, I hit impenetrable walls. Then in university, I implemented various machine learning systems from scratch but didn’t understand how they worked. Again, I soon found myself stuck, and couldn’t exactly diagnose why.

The moral of this story is not that I should have read a textbook front-to-back before doing any serious projects myself. All the hands-on work made me incredibly enthusiastic about the domain, and did give me real skills. I would have benefited from a mentor who could tell me very directly: “Oh, you’re stuck because you don’t understand pointers. Read these couple of pages, then let’s chat about it, and I’ll suggest what you might look at next.” That would have saved me a few years of frustration.

My programming story was possible only because programming is a field where you can start engaging in meaningful activity with very little understanding. Cooking was this way, too. Topics with that property are particularly amenable to learning-by-doing approaches. The learning problem mostly becomes one of orchestrating secondary activities to fill conceptual gaps as they become salient. In my experience, the likeliest failure mode is that I spend too little time on conceptual learning, because I can’t discern the impacts of my gaps, or the possible value of acquiring certain understandings.

On the other hand, most topics do require a fair amount of “book learning” before I can dive into a meaningful project of my own. I have a few ways of dealing with these situations.

One approach is to treat the first phase of “book learning” like map-making, reading shallowly to understand what kind of initial project I might like to tackle, and what I should learn to make that possible. Then I can read deeply in those slices, confident that I’ll be rewarded with some meaningful activity.

Another approach is to find a way to fall in love with the ideas themselves. I grew up uninterested in math largely because I only had bad math books and bad math teachers. I saw math as instrumental, as broccoli to be swallowed to help me with my programming. Later in life, I was introduced to discussions of math filled with beauty and awe, and I found it easy to immerse myself in these without needing to have any particular project in mind. I dearly wish I’d found that kind of math earlier.

I’m often asked: Are memory systems really necessary? If the ideas are important, won’t they come up naturally in your work? And if they’re not important, isn’t it good that they be forgotten? A few of my standard answers echo the discussion above:

  1. Naturalistic environments often don’t reliably or clearly surface conceptual gaps. And if you have a nuanced conceptual understanding at the moment, the naturalistic environment often won’t reinforce all aspects of it, even though those aspects may in fact be useful later.
  2. Memory systems can help you more rapidly “bootstrap” yourself to the point where you can use that material naturalistically.
  3. When an idea is beautiful or striking, it’s often meaningful to deepen your engagement with it, even if your immediate projects have no use for it.

But I think these objections do point to important problems with memory systems. Review often feels boring and detached from things I actually care about. Too often, I’m captivated by a discussion of an idea, but when the associated prompts arise in my memory system, I’m left cold. And skill-related practice often feels like it’s much less effective than it could be because I’m not doing those tasks in the real environment where that knowledge will be used. I believe it’s possible to make progress on all these problems, both by becoming more skilled with the systems we already have, and by creating system-level improvements.

————————

Thanks to Michael Nielsen, Alec Resnick, and Gary Bernhardt for many past discussions which inform my views on these points. Thanks also to José Luis Ricón, whose article “Massive input and/or spaced repetition” nudged me to write on this topic. And finally, thanks to Nick Barr for the term “devotional learning”, which I’m roughly appropriating here.

Highlight-driven practice and comprehension support

I’ve been wrestling with a new insight this summer: when people struggle to recall and use what they’ve read after a few months, it’s often because they didn’t really understand in the first place. The lapse feels like forgetting, but people often can’t tell the difference. I’ve argued that “books don’t work” because people seem to rapidly forget almost all of what they read. But when those supposed memory failures are actually poor initial comprehension in disguise, memory augmentation probably isn’t the right solution.

So I’ve been exploring how reading environments might directly support comprehension, learning what we know about expert practice and interventions. None of the directions I’ve prototyped have seemed promising. The central problem seems to be obtrusiveness. Systems in prior research and in my own experiments intrude too much on the reading experience. These systems all try to offer feedback and support by determining what you’ve comprehended and what you haven’t. It’s tough to do that without demanding a whole lot of burdensome interaction.

Somewhat frustrated a few weeks ago, I stopped to reevaluate. Fine, so people have comprehension gaps; but ongoing practice is still quite helpful, right? How, specifically, have I observed practice fail in the face of comprehension gaps? What if I reframe around those problems, rather than treating reading comprehension as the end itself?

These are the (hypothesized) problems that set me on this path in the first place:

  1. Retrieval practice of poorly-comprehended conceptual material usually doesn’t actually work; you can parrot but can’t use the knowledge.
  2. When retrieval practice feels dogmatic—like “guessing the teacher’s password”—that’s often because of comprehension gaps. This is especially true when someone else writes the prompts.
  3. Retrieval practice (and problem-solving practice) are unpleasant and indirect ways to diagnose and fix comprehension gaps.

Very naively, one way to deal with these problems is to ensure that people only practice material which they comprehend. So: how might we do that? I found myself tying together some ideas I’ve described over the past few years. To my surprise, that path led me to a more promising opportunity for reading comprehension support.

Concept overview

Here’s the high-level design:

  1. As you read a text, you have a magic highlighter. You can use it to mark anything important, anything you want to make sure you understand and remember. Maybe you can jot a few extra words to clarify what specifically interests you.
  2. Future practice sessions will include tasks which reinforce and elaborate the ideas you highlighted.
  3. When you finish reading a section, you can press a button to highlight other important details which you didn’t mark (in a different color, say). These “extra” (“suggested”? “shadow”?) highlights let you quickly check whether you skimmed over something you might value.

The main design insight is that a highlight interaction can serve both as a way for the reader to choose what to practice and also as a (weak) indication of comprehension. That same highlight primitive can then be repurposed to draw attention—in a very lightweight way—to important details which the reader might have unknowingly missed.
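As a minimal sketch of how one highlight primitive could serve both roles, suppose highlights are stored as character ranges and the “extra” highlights are whatever important spans the reader never touched. Everything here is hypothetical: the `Highlight` type, the function names, and the rule that any overlap counts as “covered” are assumptions for illustration, and the list of important spans is assumed to come from somewhere else (an expert’s markup, say).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Highlight:
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)


def overlaps(a: Highlight, b: Highlight) -> bool:
    """True when the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def suggested_highlights(
    reader: list[Highlight], important: list[Highlight]
) -> list[Highlight]:
    """Return the important spans that no reader highlight touches."""
    return [
        span
        for span in important
        if not any(overlaps(span, mark) for mark in reader)
    ]


reader_marks = [Highlight(0, 20), Highlight(100, 130)]
important_spans = [Highlight(5, 15), Highlight(40, 60), Highlight(120, 140)]

# Only the span at 40-60 was missed entirely, so only it is suggested.
print(suggested_highlights(reader_marks, important_spans))
```

The reader’s own marks would feed later practice, while the returned spans would be drawn in the second color. Treating any overlap as coverage is the crudest possible rule; a partial overlap might well hide a detail the reader missed.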

Conceptual elements I like about this design

Subverting a natural (but ineffective) interaction. People naturally (and sometimes compulsively) highlight. It’s a favorite study practice. This makes sense: it feels good to point at things you feel are important. It’s an outlet for interest, a mark of your efforts, vibrantly reflected back on the page. And it’s very undemanding. The trouble, of course, is that highlighted material isn’t actually better understood or remembered in controlled studies, although students believe it is. It would be awfully nice if we could somehow “rescue” highlighting—could make it actually have the effect which we wish it had. In my proposed design, the act of highlighting would be as ineffective as it usually is; what’s different is that those highlights would trigger later practice (which we know can be quite effective) and comprehension feedback (efficacy to be determined).

Keeping the locus of control with the reader. This design continues my 2022 efforts to give readers control over what they practice. But it extends that goal to the comprehension support interaction. (I haven’t seen this elsewhere in the research literature; a typical intervention asks students to explain every sentence aloud after they read it.) If the first half of a section is totally familiar, a reader can simply skip it. After they reveal the “extra” highlights, a reader can scroll right by anything in that first half—no interaction necessary. Then, just by scrolling and looking, they can pay attention to details they neglected in the second half, where the material felt unfamiliar.

Non-throwaway comprehension support interactions. Last month I described a reading comprehension support system which works by asking readers to explain the text to themselves as they read. This sort of thing often feels unpleasantly hygienic. I think that’s in large part because these self-explanations have no enduring value. They’re throwaway work. I’m just writing them to make sure that I understand, in this moment. But the interaction feels like too much cost for too little benefit. I feel like I’m understanding just fine without all the ceremony—in part because I, like most people, underestimate how often I have comprehension gaps. Every focused reading comprehension intervention I’ve seen has the same “throwaway work” problem. By contrast, in the proposed design, when you use the magic highlighter, you’re teeing up future practice which will ensure that you understand and remember that detail. The interaction isn’t thrown away; it has enduring meaning and weight.

An idea-centric memory system. In my 2022 mnemonic medium designs, prompts are presented alongside the text; readers can choose which prompts they’d like to add to their collection. I learned from my user research that people don’t naturally think in terms of evaluating prompts; they react to ideas in the text—“Ooh, I’d better make sure I remember this!”—then look at the adjacent prompts. The prompt-saving interaction was an awkwardly indirect way of capturing that reaction. My impression is that what people really wanted was to be able to point at the idea they found important; the prompts are mostly just implementation details. The proposed design moves us towards such an idea-centric practice system, which I believe may have other benefits, like promoting fluid understanding through variation and escalation.

Smooth on-ramps to obligation. In my late 2022 user research, I observed an interesting tension: readers often weren’t initially sure how much they cared about a detail. They could see that it was important. But did they want to sign up for ongoing practice? It wasn’t clear; they had to read a little further, to get a sense of how that detail fit into the whole. Many readers asked for a highlighter; when I dug in, this uncertainty was often behind their request. People wanted to mark details as tentatively important, then to come back and “upgrade” those details by saving the adjacent prompts later, if it seemed appropriate. This makes sense! I often do something similar in my own memory practice. I’ll read through a section, highlighting what seems important. Then I’ll make a second pass, guided by my highlights, to write prompts for whichever details seem to deserve it. My proposed design lends itself naturally to these smoothly escalating interactions. You can have an ordinary yellow highlighter to mark details which seem tentatively important, and a purple “magic” highlighter to mark details you want to make sure get reinforced. Highlights can be “swapped” to the other color with a click. Readers would have a smooth slope from “mark as important” to “mark as to-be-reinforced.”

Conceptual challenges for this design

Highlights don’t encourage deep processing. Effective readers are demanding. They interrogate a text, interpret it, elaborate it, and connect it to prior knowledge. People can—and typically do—highlight without much of that happening. It’s easy to highlight text without even processing what it says. All this means that my system’s “comprehension support” is setting a very low bar. But if the goal is to resolve my three motivating problems, I think it’ll help a great deal. You’ll be much less likely to be given prompts about ideas you completely missed. And the prompts can be constructed to induce the elaboration and interpretation which might not yet have occurred.

Density and ambiguity. Prompt-writing has given me a great appreciation for just how many separate details can be conveyed in a single sentence. If the reader highlights a key sentence, they could be interested in many different details, or in all of them. Also, they might have comprehended only half those details. (I had this happen to me in testing.) I’ve found that it helps to make “minimal” highlights, i.e. to highlight a key adjective if that’s what you’re interested in, alongside perhaps other separate small highlights in the same sentence. It also helps in these cases to jot a few words about your specific interest.

Trees over forest. A highlight-centric interaction emphasizes locality and detail. But I usually want practice to include synthesis, too. Often the best questions are about getting to the heart of some idea, finding a one-sentence way of expressing it when you look from the right angle. Sometimes I want my practice to be about summarizing a long exposition.

Novices can’t reliably judge what matters. One advantage of the original mnemonic medium design is that a domain expert tells you exactly what you need to know. In the proposed design, we shift the locus of control quite decisively to the reader; an expert merely provides “hints”. For a reader who really does want to be authoritatively led, this new design has much more friction. The deeper problem is that readers often aren’t in a good position to judge what matters most in a text. Is the “extra highlights” interaction enough to mitigate that problem?

An initial test

I took the concept for a scrappy initial test drive, with Wizard-of-Oz help from my friend Elliott Jin (a computer science instructor at Bradfield). Continuing last month’s studies, I read section One.III.1 of Jim Hefferon’s Linear Algebra, highlighting the details I wanted reinforced as I went. This material was already familiar to Elliott; he separately and carefully marked all the important details in the section. Then, once I’d finished reading, he manually compared my highlights to his, and marked my copy with any ideas I’d skipped. Then I could review those extra highlights as proposed in the design.

First, and most crucially, the interaction helped me notice three important ideas which I had completely ignored. My eyes had simply slid right past them on the page. That’s a promising validation of the notion that this kind of “extra highlights” interaction can surface comprehension gaps.

The exercise also demonstrated that highlighting doesn’t necessarily imply comprehension. In one instance, I had highlighted a definition but had totally ignored a few key words. This turned out to be fine. Subsequent practice quickly revealed that I’d missed those words; and because I highlighted the definition, I’d indicated that I wanted to know them—so I was grateful to the practice for revealing that.

In another instance, I ignored an “extra” highlight because I thought it was subsumed in something else I highlighted. That judgment turned out to be wrong! Subsequent practice of some downstream ideas revealed the misconception. It turned out to be fairly easy to diagnose in this instance, but that wouldn’t be true in general.

Crucially, the interaction felt great. I already like to highlight as I read; this felt like it was working with my natural behavior and making it more powerful, rather than distorting my reading practices. It feels subtly rewarding to “color in” the text, and even better to make those markings have real meaning, both in terms of the comprehension check and in terms of subsequent practice. Scrolling through the “extra highlights”, I felt interested in checking them out but not inappropriately compelled to engage. Elliott had highlighted some details I’d skipped because they were familiar or didn’t seem interesting; it was easy to scan past those.

Alongside the highlights, Elliott found himself also wanting to mark “lowlights”—details which might be worth attention, but which seem more incidental. Perhaps highlights could display some mark of their importance, e.g. with color intensity? If a reader could mark something as lower priority, we could then arrange to show them relatively fewer tasks about that detail. Alternately, these levels could act as a kind of feedback for readers that they’re mostly highlighting relatively unimportant details, and not the central ideas.

The next section of the book (One.III.2) posed some interesting difficulties. This section revolved around proofs of some important statements made in the previous chapter. Along the way, some useful new properties and procedural strategies emerge. The latter category could be handled using the same highlighting interaction, but it was much less clear what to do with the proof material. I think that’s in part because I don’t have naturalistic highlighting habits when reading proofs, whereas my automatic highlighting behavior in explanatory prose aligns quite well with indicating what should be reinforced. Learning from others’ proofs seems to demand different patterns; unfortunately, I lack a rich theory of knowledge and learning here.

One final hesitation: how important are my comprehension gaps, really? This prototype revealed a few meaningful details I’d skipped. But as it turned out, those holes were diagnosed by the section’s problem set without much fuss. If I hadn’t had this fancy augmented reading environment, I would have been fine. But I worked with a student earlier this year who seemed quite blocked by reading comprehension problems. And in previous sections of Hefferon’s textbook, I had comprehension gaps which left me simply confused during the problems. Even worse, such gaps might not even be noticed. Auditing the problem sets for the material they cover, I notice that they’re focused on applied problem solving, and problem sets will often fail to reinforce conceptual details discussed in the text—or to reveal gaps in comprehension of those details.

My rough impression is that conceptual gaps are more likely to be ignored or poorly diagnosed by problem-solving practice than factual or procedural gaps. Confusion also seems to arise when knowledge is only covered by problems which involve some transfer. So, in principle, maybe we could just construct problems which ramp up smoothly enough to effectively identify comprehension gaps. But I notice that I don’t like answering such basic questions when my comprehension is actually fine. They feel boring and burdensome. Maybe the proposed design’s lightweight comprehension support is a reasonable compromise.

Evaluating with Quantum Country

Another way we can judge the proposed new design is to ask: what would Quantum Country have been like to read this way?

A first question we might ask: how many highlights would a reader need to make to “collect” every prompt? I mapped the 112 prompts in the first essay (QCVC) to representative highlight ranges and found that 78 highlights would cover all the prompts. For ~25,000 words, that doesn’t seem so unreasonable: it’s about 1 highlight every 320 words, or roughly for every screenful of text on my display. (Though obviously the density of the text’s ideas varies considerably.)

This exercise revealed that there are plenty of details in QCVC which seemed important and non-obvious to me, but for which we didn’t include questions. That’s a limitation of the original mnemonic medium design: because every user would receive every question (and in fact would receive every question immediately—we didn’t introduce them over time), we had to be somewhat conservative with prompts. We didn’t want to overwhelm people. As a result, a given person is probably asked to practice some details which they didn’t find meaningful (e.g. a “boat” metaphor for computational range) but not given practice for other details which they found important.

The overwhelming majority of highlights (57) mapped onto a single prompt. 16 mapped onto 2 prompts, 3 onto 3, and 1 each onto 4 and 5. Most of the one-to-many instances are places where we used several prompts to encode an idea from multiple angles, or through multiple examples, or emphasizing different aspects. Auditing all these grouped prompts, I feel that at least 80% would be better off practiced in separate sessions. The prompts are mutually reinforcing; practicing one will generally diminish retrieval demand for another. Also, such theme-and-variation prompts are especially apt to feel boring and dogmatic when presented in rapid succession.
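As a quick arithmetic check on the figures above, here is a small Python sketch. The variable names are mine, and the word count is the approximate “~25,000” figure, so the outputs are only rough.

```python
# Rough arithmetic on the QCVC highlight mapping described above.
# Figures come from the text; names are illustrative only.
approx_words = 25_000  # "~25,000 words" is itself an approximation

# prompts-per-highlight -> number of highlights with that many prompts
prompts_per_highlight = {1: 57, 2: 16, 3: 3, 4: 1, 5: 1}

total_highlights = sum(prompts_per_highlight.values())
words_per_highlight = approx_words / total_highlights
solo_share = prompts_per_highlight[1] / total_highlights

print(total_highlights)            # 78
print(round(words_per_highlight))  # 321, i.e. "about 320"
print(f"{solo_share:.0%}")         # 73%
```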

I feel this distribution of prompt counts also illustrates a limitation of the original mnemonic medium: most of those 57 “solo” details would have benefited from reinforcement from multiple angles. But again, we had to be conservative because all prompts were presented en masse to all users.

3 prompts had no direct source in the text; they ask the reader to draw an inference based on one or more details. These are a problem for my highlight interaction! One fix might be to assign these “synthesis / inference” prompts if a user’s highlights include the “inputs” to the expected inference. These sorts of prompts seem especially valuable to me, since they force the reader to go beyond the text. At the same time, because the whole point of these prompts is that you’re not retrieving the answer from memory, you’d probably want them to vary each time, demanding a new inference involving those ideas.
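One way to picture that fix, as a minimal sketch: each inference prompt lists the curated highlights that serve as its “inputs”, and it is assigned only once the reader’s highlights cover all of them. The names here (`InferencePrompt`, `eligible_inference_prompts`, the highlight ids) are hypothetical, and the sketch assumes reader highlights have already been matched to curated highlight ids.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferencePrompt:
    question: str
    input_highlights: frozenset  # ids of curated highlights the inference draws on

def eligible_inference_prompts(prompts, reader_highlight_ids):
    """Return the prompts whose inputs are all among the reader's highlights."""
    covered = set(reader_highlight_ids)
    return [p for p in prompts if p.input_highlights <= covered]

# Placeholder prompts; "A", "B", "C" stand for curated highlight ids.
prompts = [
    InferencePrompt("Hypothetical inference combining ideas A and B", frozenset({"A", "B"})),
    InferencePrompt("Hypothetical inference combining ideas B and C", frozenset({"B", "C"})),
]
assigned = eligible_inference_prompts(prompts, {"A", "B"})
print([p.question for p in assigned])
```

This only handles eligibility. Varying the inference each time, as described above, would need a generation step that the sketch leaves out.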

6 prompts are actually about details stated in the problem statements of optional exercises. These are a bit tricky. One way to look at these prompts is that even if you don’t do the exercises which are about showing why these statements are true, you should learn that the statements are true. If we take this perspective, the highlighting interaction would probably want the statements to be made “in the main text”, so that all readers encounter them. Another way to look at these prompts is that if you do an exercise proving some result, you probably want to remember that result. From this perspective, the highlighting interaction is probably fine, and in fact better than Quantum Country’s one-size-fits-all model.

This exercise also helped me see the catch quite clearly: as we’ve discussed, readers are not always the best judge of what’s important. The proposed design would result in people missing important details, relative to Quantum Country’s design. That’s a price we’d be paying to permit a more fluid and reader-centric experience. Of course, I don’t yet understand the true cost or the true perceived benefit. That will require more user research.

Interaction cost

More than two years ago, when I was just starting to dig into tensions around reader control in the mnemonic medium, I observed that if QCVC contains 112 prompts, a reader wouldn’t want to make 112 decisions about which prompts to save, or even to click “save this prompt” 112 times in the interface! That motivated the introduction of the “bulk” prompt interaction in last year’s prototypes.

And yet I notice that I don’t feel much concern about requiring a reader to make 78 highlights. 78 still seems like a lot of interactions. Why do I feel so differently?

One factor is that highlighting is a natural behavior for many readers. It feels like part of reading the text, not a separate decision or interaction. Spatially, it’s happening within the text, not on a separate interface surface.

It’s also important that readers wouldn’t be required to evaluate prompts. Choosing which of 112 prompts to save is much more burdensome: you’d have to read and consider all that text. But in the proposed design, you’re not deciding “which prompts to save”; you’re emphasizing a subset of the text you’ve already read. The “extra highlights” view will offer a lightweight way to quickly add anything important that you might have missed, and even this interaction will be less demanding than evaluating prompts, since you’re evaluating the main text, much or all of which you’ve already read.

Some of QCVC’s prompts are much less important than others. In Quantum Country and in last year’s mnemonic medium designs, all prompts had the same status, so a user would have to evaluate all 112 prompts on equal footing. But in the proposed design, a user might naturalistically not highlight some text representing a less important detail, and that’s no big deal. The text doesn’t impose a cost. And the cost of evaluating an “extra highlight”, while low, could be further mitigated by a visual indication of importance.

Implementation details and challenges

So far, I’ve focused on the interaction design and ignored how it would actually work. I think that’s the right emphasis, but I’ll briefly discuss implementation insofar as it bears on my next steps for the design.

We can factor this design’s implementation into three core problems:

  1. Text to curated highlights: Given a text, what are the most important details to understand, and what highlights would draw one’s attention most appropriately to those details?
  2. Highlights to tasks: Given a set of highlights-in-context (and, potentially, emphasis remarks), construct a set of practice tasks.
  3. Semantic highlight diff: Given a set of user highlights-in-context, determine which of the curated highlights’ conceptual matter is not “covered”.

Of course, there are other ways to factor this problem. We could map highlights onto a knowledge graph, and a knowledge graph onto tasks, to capture connections and dependencies. Rather than expressing and comparing to an “ideal” set of highlights, we could try to find a set of highlights to recommend based on the reader’s apparent interest and level of detail. We could allow readers to express why they’re reading the text—their goals and questions—and steer highlights appropriately. But I’ll set these elaborations aside for now.

Let’s start with a simplified implementation model which involves no cutting-edge machine learning.

  1. Text to curated highlights: Paralleling the original mnemonic medium, an expert constructs an “ideal” set of practice tasks and maps them (many-to-many) to an “ideal” set of highlights.
  2. Highlights to tasks: Given a user’s highlights, we use traditional NLP tools like latent semantic analysis to identify “semantically matching” expert highlights. Readers are given the corresponding tasks from the expert’s map.
  3. Semantic highlight diff: Compute the set difference between the expert’s “curated highlights” and the ones identified in (2).
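To make the three steps concrete, here is a minimal Python sketch of that simplified model. It is not an implementation of the system: the example highlights and tasks are invented, all names are hypothetical, and simple word-overlap (Jaccard) similarity stands in for latent semantic analysis so the sketch needs no dependencies. The 0.3 threshold is an arbitrary assumption.

```python
import re

# Step 1 (assumed input): an expert's curated highlights, mapped to practice tasks.
# Both the highlight text and the tasks are invented placeholders.
EXPERT_MAP = {
    "a linear system has zero, one, or infinitely many solutions":
        ["How many solutions can a linear system have?"],
    "swapping two equations does not change the solution set":
        ["Does swapping two equations change the solution set?"],
}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):
    """Jaccard word overlap; a crude stand-in for LSA."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_expert_highlights(user_highlights, threshold=0.3):
    """Step 2a: find expert highlights that 'semantically match' a user highlight."""
    matched = set()
    for user in user_highlights:
        best = max(EXPERT_MAP, key=lambda expert: similarity(user, expert))
        if similarity(user, best) >= threshold:
            matched.add(best)
    return matched

def tasks_for(matched):
    """Step 2b: look up the expert's tasks for the matched highlights."""
    return [task for h in EXPERT_MAP if h in matched for task in EXPERT_MAP[h]]

def extra_highlights(matched):
    """Step 3: curated highlights the reader did not cover (a set difference)."""
    return set(EXPERT_MAP) - matched

user = ["swapping two equations does not change the solution"]
matched = match_expert_highlights(user)
print(tasks_for(matched))         # the task attached to the swapping highlight
print(extra_highlights(matched))  # the curated highlight the reader skipped
```

In this sketch, a reader highlight with no close expert match simply produces no tasks.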

Apart from the heavy demand on expert labor, the main drawback of this model is that if the reader highlights a detail which the expert didn’t emphasize, there would be no tasks to reinforce it. Likewise, a reader couldn’t expect the system to create practice tasks around an original observation, or to heed any notes they make about their specific interest in a highlight. But a model like this would let me explore and refine the interaction design without extensive generative AI sidequests.

Of course, I don’t think I would have come up with this design idea in the first place were it not for the astonishing recent progress in large language models. The freeform nature of the highlight interaction cries out for the open-ended interpretation that is these models’ hallmark. And an idea-centric practice system requires high-quality task generation machinery. By shifting (or expanding) the system’s primitive from prompts to ideas-in-context, we would make it much easier for users to add their own ideas to their practice. The highlighting interaction can apply to your own notes. If you read a text and notice some important limitation in the author’s argument, you can jot a sentence about that and highlight your own words. Likewise, if you’re writing in your journal about a striking comment from a friend, you could simply highlight that remark to ensure you’d grapple with it in future sessions.

Back to reality. I’ve run many experiments using GPT-4 to perform all three of these tasks. My coarse impression so far has been: these systems are amazing, and I’m able to get remarkably far; I’ve not yet managed to make their output quite as good as it needs to be; but I expect they’ll get there with some combination of determined prompt engineering, fine-tuning, or patience for next year’s model.

  1. Text to curated highlights: A surprisingly good start. Usually includes 10-20% unimportant details (even when I ask the model to include an importance rating), and omits a handful of important elements. The endpoints of the highlight are often not quite in the best spot.
  2. Highlights to tasks: The most difficult of the tasks. I believe that much of this will come down to articulating a philosophy of instructional design, or a “pattern language” for review. The issue generally isn’t the model’s “intelligence”; it’s that you can’t describe the sorts of tasks you want (and don’t want) clearly enough. Still, for basic retrieval practice tasks, I can get usable results somewhat more than half the time. (more notes)
  3. Semantic highlight diff: Surprisingly difficult for the model. It particularly struggles with user highlights which aren’t in the set of “curated” highlights—it wants to make spurious mappings.

Next steps

Happily, I don’t need to solve those open-ended technical problems to evaluate and improve upon the core design idea here. I plan to conduct a round of Wizard-of-Oz user testing:

  1. Text to curated highlights: constructed by me, as described in the simple model above.
  2. Highlights to tasks: as described in the simple model above, but I’ll match reader highlights to my curated highlights by hand, rather than using LSA or similar.
  3. Semantic highlight diff: I’ll just do it by hand.

I’ll test initially with some experienced spaced repetition users, so that I can focus on the highlighting interaction design, and the concept of idea-centric practice. What I’d like to observe:

  • Whether readers feel they can trust mere highlighting to indicate the tasks they’ll be practicing. Are “emphasis remarks” necessary?
  • Whether the “extra highlight” visualization uncovers comprehension gaps, and how readers feel about that.
  • How the tasks mapped from their highlights feel, emotionally—is there still a sense of guess-the-password if they indirectly “signed up for” these tasks?

Ultimately, what excites me about this design is that it’s positioned to attack three distinct problems which have emerged in my experiments over the past few years:

  1. The mnemonic medium feels unpleasantly authoritarian in many contexts; the locus of control should move towards readers.
  2. Comprehension gaps are routine; practicing others’ prompts doesn’t work and feels oppressive when this occurs in conceptual material.
  3. “Mere” retrieval practice of conceptual material often produces brittle understanding which transfers poorly; more fluid practice would likely produce more fluid understanding.

It’s safe to assume that this new design will fail, too, but I’m feeling optimistic that it will fail in interesting and instructive ways.

————————

My thanks to Elliott Jin for facilitating my initial test and for extended discussion of these ideas! Thanks also to Joe Edelman for helpful discussion.

Studying myself studying linear algebra

After the past few months digging into research on problem-solving practice and reading comprehension, I felt lost in a fog of abstraction. I needed to ground all those ideas in something real and concrete. Are these all just threshold effects? Do these problems basically just go away with moderate reading strategy skills and appropriately-leveled problem sets? If so, what issues remain?

So this month, rather than observing another student, I decided to observe myself. What important barriers remain in my learning experience—even when I use my memory system, apply “expert” reading comprehension strategies, and solve every exercise? I want to create an alien sense of ease and confidence when learning from an explanatory text. Alex, the student I worked with a few months ago, had troubles which seemed to be rooted in reading comprehension and a need for more problem-solving scaffolds. If I make it past those issues, where do my systems and strategies fall short?

As it happens, I’ve been looking for a good excuse to study linear algebra. My last attempt was about 17 years ago at Caltech. I’d attended a liberal arts-focused secondary school, which left me wildly unprepared for Caltech’s theory-laden math program. I absorbed very little. These days, when I dig into computer graphics and machine learning papers, I’m often frustrated: my fragmentary understanding of linear algebra spreads like contagion into fragmentary understanding of ideas in graphics and machine learning. I’d really like to understand these topics more solidly!

It’s also nice for my experiments that linear algebra involves a good mix of declarative, conceptual, and procedural knowledge. That is, the subject involves: a zoo of notation and terminology; interrelated mathematical objects and their myriad properties; and essential methods for manipulating those concepts. Problem-solving practice emphasizes procedural knowledge; memory systems traditionally emphasize declarative knowledge, but I’ve been trying to stretch them into conceptual (and, to a lesser extent, procedural) domains.

I chose Jim Hefferon’s Linear Algebra because it’s both well-regarded and licensed permissively: I knew I might want to build experiments adapting whatever book I used. For my initial study period, I read the first 35 pages in a careful, interrogative fashion (like in my video with Dwarkesh), writing and reviewing 65 memory prompts along the way. I solved all 57 problems included in those sections, checked my answers against the solution set, and wrote notes about any errors or interesting differences in my solutions. That whole process took about 20 hours.

I emerged with what feels like a strong understanding of the material. But my internal experience was far from “an alien sense of ease and confidence”. I felt unsupported in a number of important ways by the learning environment I’d created. Happily, many of these observations point to interesting paths for future prototyping and exploration.

Comprehension support

Before I can worry about building rich understanding or reinforcing my long-term memory, I need to ensure that I simply comprehend what the text is saying.

I’ve built habits around reading comprehension strategies like questioning and elaborating the text. These put me in a better position than many readers, but I know from past experience that they’re not enough: I still often discover that I’ve missed important points from the text.

For this book, though, I had extra help from my memory system and the problem sets. Let’s look at the impact of those supports on my reading comprehension.

Prompt-writing and comprehension support

My reading comprehension is much more reliable when I’m writing thorough memory prompts about a text. The process puts me into an active state of mind. I’m on the lookout for anything that seems important; I’m less likely to gloss over key details. And in order to transform those details into retrieval tasks, I usually need to comprehend them in at least some basic way.

That’s all great. But I do notice a few important limitations.

Sometimes I don’t know how to write prompts. Learning to write good prompts is like learning to write good prose: you build up an enormous library of micro-strategies for dealing with different situations. But I don’t (yet) have strategies for writing memory prompts about some kinds of material. For instance, in this book, an explanation of a preferred form for solution sets begins by presenting an abstract symbolic representation; then it delves into several contrasting examples to demonstrate how that form plays out in practice and to give the reader a feel for why it’s expressed as it is. It’s easy to write some basic prompts exercising the abstract symbolic form. It’s much harder to write prompts which involve the subtle points which the examples demonstrate, and the various reasons why this form is preferred. If I really focus and get creative, I can generally figure something out. But often that feels too burdensome, so I’ll just keep moving, sometimes without feeling I’ve explicitly made that choice. In these cases, prompt-writing hasn’t really checked my comprehension. (And, of course, my memory of those details won’t be reinforced—more on that later.)

Prompts create obligations. I want to make sure that I comprehend everything that the author says. But I don’t necessarily want to sign up to repeatedly practice everything the author says. When I lean heavily on memory prompts to reinforce my comprehension, I’ll often feel burdened later by the density of prompts. I’ll find I don’t care about many of the finer details, or that those details are adequately reinforced by later synthesis prompts. I can delete the excess prompts, of course, but there’s a small cost to each deletion decision. And it takes a lot of work to write all those “unnecessary” prompts—much more than, for instance, just explaining the text aloud to myself as I read. Some of that effort produces a more elaborated understanding, but much of it feels like wasted energy.

Shallow prompts, shallow comprehension. When a key definition is provided, it’s easy to write a prompt by simply paraphrasing the definition as given in the text. And that’s a problem because it’s easy to paraphrase without comprehension, as the self-explanation literature has found. Now, prompts which simply paraphrase the text usually aren’t very good. Prompts often need to distill and elaborate to be effective. So one could respond to my complaint by saying: “just write better prompts!” Sure. But I’ll point out that the medium doesn’t help me do the right thing here. It’s easy to inadvertently write shallow prompts, and to avoid that, I need to both apply constant monitoring—difficult when learning new material—and also spend more effort on prompt-writing—but it’s not always obvious when it’s “worth it”.

The mnemonic medium and comprehension support

If prompt-writing is an important comprehension strategy, that seems to pose a problem for the mnemonic medium, where prompts are written by others. Now, the embedded memory prompts do (inadvertently) act as a kind of basic comprehension check. If a prompt about a passage you just read seems baffling, that’s unlikely to be an issue with forgetting. You probably skimmed over the relevant material. The embedded prompts have indirect effects, too: many readers have reported that after discovering that their comprehension was so poor, they start reading more carefully (as the adjunct question literature predicts).

But we designed the embedded review interface with memory practice in mind, not comprehension checks. I’ve observed enough mnemonic medium readers to see that memory failures and comprehension failures feel very different.

Memory failures usually feel like uncovering something you once knew but have since forgotten: “Ah… where did the imaginary term go again? (Reveal answer) Oops, it’s in the upper right cell. OK, better review that again.” Or: “I didn’t remember that, but I don’t care about remembering that. (Delete)” Readers are more comfortable declaring that they don’t care about something when they know what that something is.

By contrast, comprehension failures often feel arbitrary and capricious: “I don’t really know what this is talking about. (Reveal answer) O…kay? I don’t get it.” The interface presents two buttons: “Remembered” and “Forgotten.” Both of these feel bad to this reader. They feel they’re “supposed” to click “Forgotten”, but they also know that this will just make the exact same question reappear at the end of the session. They don’t want to review this question again: they know it won’t be any less confusing next time, and because they don’t understand the answer, they don’t value knowing it. They know that if they answer the prompt “correctly” when it’s repeated, it’ll only be because they’re parroting without understanding, and parroting feels bad. They can click “Remembered” to “make the prompt go away”, but that doesn’t feel good either: maybe it’s important? And they don’t necessarily want to sign up to answer it again in the future.

If we’re going to use adjunct questions as comprehension checks, I think we’ll want to differentiate them more from retrieval practice prompts, probably in both content and interface design.

Problem sets and comprehension support

With my recent literature reviews fresh in my mind, I noticed that problem sets (in this and other university-level texts) seem to serve three distinct goals:

  1. Check comprehension: ensure that you actually read and understood the text.
  2. Facilitate skill acquisition: practice applying what you’ve learned to induce patterns, reinforce long-term memory, and build procedural automaticity.
  3. Stimulate elaboration: induce you to draw connections, notice additional properties, think creatively. This promotes both richer understanding and higher memory stability.

Different problems mix these goals in different ratios, of course, but all problems implicitly involve some kind of comprehension check. You can’t solve the problem if you didn’t understand the part of the text it’s depending on. The advantage here is that you get a comprehension check by doing—by putting the material to some use. This tends to be much more enjoyable than directly answering rote comprehension (and memory) questions. That’s particularly true when the problems are authentically interesting for their own sake, rather than just drill questions.

But largely because of their overloaded goals, the problem sets in this book don’t offer as much comprehension support as I want. And I think these issues are true of problem sets in other similar books.

Bad failure modes. At times I found myself stuck, but it wasn’t necessarily clear that I was stuck because I glossed over some important point in the text. When comprehension isn’t the issue, stubborn perseverance is usually appropriate, so I’d wonder if I just needed to try harder. So I’d flail at the problem, and the flailing wasn’t constructive, because I simply lacked some relevant information. The trouble is that I often couldn’t tell which situation I was in. The best remedy for a comprehension gap is usually to re-read some explanation in the text, but even with the solution manual in hand, it wasn’t necessarily clear where I should focus. In these cases, I’d end up flipping through the chapter again, looking for something that might be relevant. This seems like a more or less universal issue with problems as comprehension checks.

Biased coverage. In the course of a chapter, you’ll often learn that something is true, why it is true, and why that matters. Problems emphasize application, analysis, and synthesis, so they’ll mostly check if you comprehended that the thing is true, but not so much the other discussions. For example, the text tells me that the solution set of a linear system can be expressed as the sum of a particular solution and a linear combination of vectors, weighted by the free variables, which together describe the solution set of the associated homogeneous system. Much of the chapter was spent discussing a proof of this property and some of its implications. But none of the problems checked my comprehension of the proof, and the interpretative remarks were only partially probed. I tested myself later to see if I could explain the proof, and I realized that I hadn’t understood a central move, though I had successfully completed the problem set.
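For concreteness, here is a small worked instance of that property, checked in Python. The system is my own illustrative example, not one taken from Hefferon’s text, and the helper name `mat_vec` is likewise just for this sketch.

```python
from fractions import Fraction

# Illustrative system (an assumption, not from the book):
#   x + y + z = 3
#       y - z = 1
A = [[1, 1, 1], [0, 1, -1]]
b = [3, 1]

particular = [2, 1, 0]   # one solution of A v = b
direction = [-2, 1, 1]   # spans the solutions of the homogeneous system A v = 0

def mat_vec(matrix, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

assert mat_vec(A, particular) == b
assert mat_vec(A, direction) == [0, 0]

# particular + t * direction is again a solution for each sampled value
# of the free variable t.
for t in [Fraction(-3), Fraction(1, 2), Fraction(7)]:
    v = [p + t * d for p, d in zip(particular, direction)]
    assert mat_vec(A, v) == b

print("all checks passed")
```

The check shows only that these vectors are solutions. Establishing that there are no others is the job of the proof in the text.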

You could argue that this is just a flaw in the textbook—that a problem set should exercise your understanding of everything the corresponding book section says. But I think problems are naturally predisposed to check comprehension of certain kinds of material and not others. If you push against that grain, you’ll end up with a different kind of activity, something that doesn’t feel like a problem.

Which problems cover new ideas? Some problems look very similar, but each has been cleverly constructed to exercise some different facet of the underlying material. But sometimes similar problems are just about repetition, to promote fluency. Often I’d feel like I didn’t need to, say, solve yet another system of linear equations: I felt fluent enough! But some of those problems hid subtle differences, novel comprehension checks. So I did every subproblem, and many of them felt unnecessary. Or, to put it another way: many redundant subproblems should have been “smeared” across the subsequent weeks, to support long-term memory. But I wouldn’t want to delay the comprehension checks.

Got the answer, missed the point. In a few cases, a problem was constructed so that it could be straightforwardly solved by applying some property that was discussed in the text. I’d solve it laboriously by some other route and get the same answer, but I’d have missed the point. To notice this happening, I needed not just to check my answers but to retrace the steps of every solution, comparing them to my own and watching out for important differences.

Skill acquisition

When we learn to do something new, a very interesting transition occurs. At first, we think explicitly about the various objects and actions at play. There’s often explicit verbal rehearsal: “To perform Gaussian elimination, there are three operations I can use…” But with practice, we learn patterns. Then we “just know” what to do, without the feeling of active retrieval: “Alright, halve this equation, then use it to knock out the first term of the other one…”[1]

In my experience, retrieval practice doesn’t cause this transition by itself, though it seems to make the transition easier. Maybe that’s because reliable long-term memory for the relevant declarative knowledge reduces working memory load. Problem-solving practice is the traditional way to produce this transition. That mostly worked quite well for me, but I’ll mention a few challenges which seem like interesting opportunities for improvement:

Problem-solving practice should be smeared over time. I finished reading each section, then did all the problems associated with that section. Many of those problems were about repeatedly practicing the same skill: say, finding the solution set of a linear system with free variables. That’s great—it’s what I need to do to get fluency—but it’s awfully unpleasant to solve ten problems like this in a row. Courses traditionally solve this by assigning only a fraction of the book’s problems, but what if I need lots of practice to become fluent? Emotionally and practically (per the spacing effect), it would be better to spread problem-solving practice out over time. In reality, that’s tough to orchestrate. You might want, say, one problem a day for a few days, then every other day, then twice a week, etc? This requires a kind of programmable attention.
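As a rough sketch of what "smearing" could look like mechanically, the snippet below turns a list of gaps into practice dates and pairs them with problems. The function names and the particular gaps are hypothetical, chosen only to mirror the "daily, then every other day, then about twice a week" pattern; nothing here is tuned against the spacing literature.

```python
from datetime import date, timedelta

# Hypothetical gaps in days between sessions: daily at first,
# then every other day, then roughly twice a week.
EXAMPLE_GAPS = [1, 1, 1, 2, 2, 3, 4]


def smeared_schedule(start, gaps_in_days):
    """Return the session dates: the start date, then one date per gap,
    with each gap counted from the previous session."""
    sessions = [start]
    for gap in gaps_in_days:
        sessions.append(sessions[-1] + timedelta(days=gap))
    return sessions


def assign_problems(problems, start, gaps_in_days=EXAMPLE_GAPS):
    """Pair each problem with one session date, in order.
    Problems beyond the number of available dates are left unassigned."""
    dates = smeared_schedule(start, gaps_in_days)
    return list(zip(dates, problems))
```

Calling `assign_problems(["3.1a", "3.1b", "3.1c"], date(2023, 9, 1))` would place one problem on each of the first three days. The harder part, which this does not address, is getting the right problem in front of me on the right day.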

How much practice should I do? Maybe I only needed to do half of the book’s problems. I’m not sure. “Mastery-based” learning systems will have rules like “keep practicing until you can do five in a row.” The trouble here is that it’s possible for me to achieve this criterion without having learned the kinds of patterns which let me “recognize directly what [I] formerly had to think through.” Then I might struggle on some more difficult skill which depends on this prior skill, because the prior skill still requires so much cognitive load, and it wouldn’t necessarily be clear why. “Khan Academy says I mastered all the prerequisites!” The literature on procedural knowledge has probably identified some useful answers here, but I haven’t yet read deeply in that topic. As an example, I’d guess that time pressure (e.g. Number Munchers?) would reveal a sharper difference.
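To make the worry concrete, here is a minimal sketch of a "five in a row" rule with an optional time limit. The function name and the time threshold are assumptions for illustration, not a description of how any real mastery system works.

```python
def mastered(attempts, streak_needed=5, max_seconds=None):
    """attempts: list of (correct, seconds) tuples, oldest first.

    Returns True once `streak_needed` consecutive attempts were correct.
    If `max_seconds` is given, an attempt only counts toward the streak
    when it was also answered within that time (a hypothetical proxy for
    fluency).
    """
    streak = 0
    for correct, seconds in attempts:
        fast_enough = max_seconds is None or seconds <= max_seconds
        streak = streak + 1 if (correct and fast_enough) else 0
        if streak >= streak_needed:
            return True
    return False


# Five correct answers, each taking a slow 90 seconds.
slow_but_correct = [(True, 90.0)] * 5
```

With these inputs, `mastered(slow_but_correct)` is satisfied, while `mastered(slow_but_correct, max_seconds=30)` is not. That gap is the difference I would guess time pressure might expose.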

How to practice proof problems? This book includes plenty of problems which require proofs. These are less rote, of course: you’re not learning to perform some consistent operation. Proofs are in part about pushing you to understand more deeply, which we’ll discuss in the next section. But these proof problems are also about learning to recognize certain patterns, about becoming fluent in exploiting specific properties of these mathematical objects. So, what should I do when I fail to recognize the pattern, or when I fail to see how to exploit the property? I can read the solution and write a note about the insight I missed—maybe even a memory prompt about it. But how to practice further? How to ensure I recognize that pattern? Proof problems aren’t as fungible as more routine problems; I can’t easily arrange to slip in another proof problem which exploits the same property.

Building deeper understanding

I don’t want to just know what the text says. I want to know what it means, and why it matters. I want to connect these ideas to my prior experiences, and to deploy them creatively in the future. All in all: I want to internalize a richly integrated and elaborated representation of the material.

Writing memory prompts helps with this a little: the process encourages me to generate examples, to infer details which the text omits, to consider implications, to clarify why I find a detail important. These are all relatively “expert” prompt-writing practices, not things most memory system users would experience. A mnemonic medium reader wouldn’t get this benefit through prompt-writing, but they could get this benefit if the author’s prompts were designed to cause them to think those thoughts.

This linear algebra book’s problem sets did that for me. Many of the problems were clearly designed to encourage deeper consideration of details in the main text. One simple example: “In the proof of lemma 3.6, what happens if there are no 0=0 equations?” I hadn’t thought about that case when I read the proof, despite my prompt-writing practice.

But the problem sets were mostly focused on problem-solving practice in service of skill acquisition. Perhaps more than straight memory support, I’d like to answer dozens more new questions which encourage elaboration and inference, perhaps one or two per review session, for weeks. Ideally, these questions would slowly increase the transfer distance involved.

Remembering what I’ve learned

Having gone to all this effort, I’d like to make sure that I remember what I’ve learned, for the long term. My memory practice will help with this. But there are a number of interesting issues, some of which we first discussed in “Fluid practice for fluid understanding.”

To what extent is explicit retrieval practice obviated by good problems? Many of my prompts aim to make me retrieve a single detail: e.g. what is the definition of singularity? But if I solve a problem which requires using that definition, then that problem also makes me retrieve it from long-term memory, which will make it more stable in a similar way—or better, via elaboration. The problem-solving task is likely to feel less rote, more authentically interesting. But it also takes more time and effort than a simple flashcard interaction. How should we think about this tradeoff? Which kinds of prompt are best practiced in focused isolation, rather than through integrative practice? Does the answer vary with familiarity?

Prompt clusters should be spread out. Given a concept like “singularity”, I should be able to give the definition. But I should also be able to: given the definition, provide the term; given an example, determine whether it’s singular, nonsingular, or neither; given that a matrix is nonsingular, conclude something about the solution set size of its associated linear system; answer some prompt which emphasizes the constraint that singular matrices must be square; etc. The trouble is that when I answer several of these questions consecutively, I don’t have to retrieve as much from long-term memory for the later prompts. Parts of the answer are still in my working memory from the earlier prompts. I think it would be better to spread highly related prompts like this out over multiple review sessions. And a right or wrong answer on one should probably influence the schedules of the others. This is one important way in which memory systems for conceptual knowledge seem to want to be organized around some different primitive from their declarative-focused forebears.
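One simple way to approximate this, sketched under the assumption that each prompt carries a hand-assigned cluster label: when building a session, take at most one due prompt per cluster and defer the rest. The names here are hypothetical, and the sketch ignores the harder question of how an answer on one prompt should shift the schedules of its siblings.

```python
def pick_session(due_prompts, max_per_cluster=1):
    """due_prompts: list of (prompt_id, cluster) in priority order.

    Returns (selected, deferred): prompt ids to review now, and prompt ids
    pushed to a later session because their cluster is already represented.
    """
    taken = {}
    selected, deferred = [], []
    for prompt_id, cluster in due_prompts:
        if taken.get(cluster, 0) < max_per_cluster:
            selected.append(prompt_id)
            taken[cluster] = taken.get(cluster, 0) + 1
        else:
            deferred.append(prompt_id)
    return selected, deferred
```

For example, if a definition prompt and a term-recall prompt about singularity are both due, only the first would be selected and the second would wait for another session.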

Some prompts want variation. I like prompts which involve examples: “is this matrix singular?” But you wouldn’t want the prompt to use the same matrix every time—you’d just memorize the answer. The point is to reinforce memory for the procedure. This sort of question seems like a good opportunity for randomization, or at least for choosing from a set of pre-made possibilities, as in Quantum mechanics in a nutshell’s “application prompts”.
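A minimal sketch of the randomized version, assuming 2×2 integer matrices are enough for an "is this matrix singular?" prompt. The function names and the choice to force a singular matrix about half the time are illustrative assumptions; purely random entries would almost always be nonsingular, which would make the answer too predictable.

```python
import random


def determinant_2x2(m):
    """Determinant of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    return a * d - b * c


def random_singularity_prompt(rng, low=-5, high=5):
    """Return (matrix, is_singular) for a freshly generated 2x2 integer matrix."""
    a, b, c, d = (rng.randint(low, high) for _ in range(4))
    if rng.random() < 0.5:
        # Make the second row a multiple of the first, which forces det = 0.
        k = rng.randint(low, high)
        c, d = k * a, k * b
    m = ((a, b), (c, d))
    return m, determinant_2x2(m) == 0
```

Passing in `random.Random()` (or a seeded instance for reproducibility) gives a different matrix on each review, so what gets rehearsed is the procedure rather than one memorized answer.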

Can problem-solving be integrated into review sessions? By the end of these problem sets, I felt quite able to apply what I’d learned in a range of situations. How can I ensure that this remains true? One approach would be to insert similar problems into my review sessions. But they do involve a different stance: many required a paper and pencil, and much more time than a typical prompt. I usually do my reviews on my phone, and I don’t necessarily have paper around. Maybe I need to set aside one or two sessions per week for more involved practice? Alternately, as an experiment, I was able to rewrite many of these problems into congruent forms I could solve “in my head”, without paper. How broadly can I apply that strategy? Are those “reduced” problems sufficient to maintain the skills I learned?

How to handle insights which arise during problem-solving? The most interesting problems in this book seem designed to produce specific insights as I solve them. Many of these insights are difficult to describe—a subtle emphasis or connection in the realm of symbols or abstractions. And yet I have the strong sense that I’m learning something important. How can I ensure that I retain that lesson? I often struggle to write prompts about these insights.

How to treat proofs? I’ll confess that I don’t have a strong theory of knowledge for proofs. What I want isn’t really just to know the proof, in the sense of being able to parrot it back—but rather, to develop enough intimacy with its key moves and insights that I could use them to prove some related claims. For example, the book presents a proof that reordering linear equations doesn’t change the system’s solution set. I was able to use analogous proof-strategies to demonstrate the validity of the scaling and combination Gaussian operations. But how to make sure that I retain that understanding? I feel unsure about which prompts to write, or how to write them[2]. Testing myself now, two weeks after my initial read, I find that I can reproduce the most complex proof discussed so far—but only with a fair amount of struggle.

Prompt-writing is a lot of work. I’d estimate that prompt-writing consumed around 2 of the 20 hours I spent on these sections. A 10% tax doesn’t seem so bad, given the comprehension and long-term memory benefits. But prompt-writing requires much more mental effort than mere reading, and more than much of the problem-solving. It felt like at least a third of the overall effort, and in some sections as much as half.

Some high-order bits

That was a lot of detail. Stepping back a moment, if I had a magic wand, these are my main wishes:

  • Some “quickly check my comprehension of this passage” interaction. Perhaps a series of lightweight questions can push me to distill and lightly interpret key ideas, without feeling like dull rehearsal. Problem solving and retrieval practice are much more enjoyable and effective in the absence of glaring comprehension gaps.
  • An easy way to orchestrate skill-acquisition-oriented problem-solving practice so that it’s smeared out over time.
  • 2-3x as many integrative, elaborative problems, dripped into my review practice over time.
  • To shift my review practice from retrieval practice to unique problems, insofar as the latter can subsume the former.
  • A better understanding of how to capture and reinforce inchoate insights which arise during problem-solving.
  • And, finally, to mostly not write my own memory prompts.

If I got these wishes, I think I’d feel something closer to an “alien sense of ease and confidence”. I expect I’ll embark on some prototypes in these directions in the coming weeks.

————————

Thanks to Gary Bernhardt, Elliott Jin, and Russel Simmons for helpful discussion of these topics.

————————

[1]: For a good summary of this process, see John R. Anderson’s “Learning and Memory”, chapter 9.

[2]: Michael Nielsen’s “Using spaced repetition systems to see through a piece of mathematics” is very stimulating here, but I’m not yet quite able to connect the dots myself.


(Recording) Seminar: the 2006/2007 constructivist academic flame war; Sunday, August 27 @ 1PM PDT

Thank you all for a very interesting discussion about Kirschner, Sweller, and Clark's attack on discovery/inquiry learning. I feel I understand these papers several notches better now!

(Please don't share this video publicly)


Seminar: the 2006/2007 constructivist academic flame war; Sunday, August 27 @ 1PM PDT

With the LLM genie out of the bottle, I've found myself often drawn into conversations about The Young Lady's Illustrated Primer and similar dreams of futuristic learning environments which emphasize inquiry-based learning, exploratory learning, problem-based learning, curiosity-driven learning, and so on.

These conversations have naturally led me back to a famously controversial paper: "Why Minimal Guidance During Instruction Does Not Work: An Analysis of the Failure of Constructivist, Discovery, Problem-Based, Experiential, and Inquiry-Based Teaching", by psychologists Kirschner, Sweller, and Clark (KSC).

The authors argue that all this talk about inquiry and discovery learning is nice, but the actual empirical record looks surprisingly poor, and there are some fundamental conflicts with our mechanical understanding of cognition.

I’m instinctively quite sympathetic to those schools of pedagogy, so I found this paper quite stimulating. It provoked many opposing responses; I’ll index a few highlights:

  • Hmelo-Silver et al. and Schmidt et al. accept KSC’s premises but argue that their preferred inquiry-based methods actually do include subtle guidance.
  • Kuhn makes the more interesting argument that content knowledge is less important than values, attitudes, meta-skills, etc.
  • KSC published a response to these three papers and maintained their positions.
  • Schwartz et al. argue that we’re testing the wrong thing: constructivism performs better if you ask which method prepares the learner for learning from new experiences in the future. KSC’s reply (included in the paper) suggests an adversarial collaboration, though it never happened.

Join me (via Google Meet) on Sunday, August 27 at 1PM PDT [GCal] for a discussion of these issues. Please read at least the original KSC paper; I’d suggest picking at least one follow-up that interests you to hear “the other side” too. Bring your noticings, wonderings, and ideas; I'll bring discussion questions.

I'll record our discussion and share it here afterward.


Initial experiments in self-explanation support

In 2019, I argued that “books don’t work” because people seem to forget a surprising fraction of what they read. No wonder it’s so hard to learn complex topics. But in the past few months, I’ve come to believe that what seems like “forgetting” in these cases is often “never having really understood in the first place.” And that when people find themselves unable to draw on what they’ve learned, they often can’t distinguish between having forgotten and never having understood, in that moment. If that’s true, and if I want to help people learn complex topics much more effectively, then I’ll need to move upstream of forgetting. That doesn’t mean ignoring memory—understanding depends heavily on memory, as we’ll discuss—but it does mean thinking about memory in a larger context.

So: what is “understanding”? How does it happen, mechanistically, when interacting with (say) a text? If someone has failed to understand something, what hasn’t happened, exactly? What factors affect those processes? What does any of that imply about possible interventions? In this letter, I’ll sketch what I’ve learned so far about these questions.

How do we understand text explanations?

What is “understanding”? Learning scientists start from high-level behavior in the world: “To understand is to be able to wisely and effectively use—transfer—what we know, in context; to apply knowledge and skill effectively, in realistic tasks and settings.”[1] They work downwards from there, asking what sort of learning activities tend to produce that high-level capacity. Meanwhile, cognitive psychologists start from the low-level mechanisms of the brain—perception, attention, processing, memory—and try to work upwards toward more complex phenomena. For them, “understanding” isn’t one thing, but an imprecise name we give to many processes working together.

The problem for designers like myself is that the two disciplines haven’t yet extended their findings far enough to meet in the middle. A principled understanding of “understanding” would help us invent new ways to bring it about. Absent that, we’ll follow the stalactites and stalagmites as far as we can go on each side; and then we’ll need imagination, intuition, and guesswork to bridge the gap.

As a first step, let me constrain the kind of understanding we’re talking about. Let’s focus on understanding an explanation from a text. Say you’re reading an explanation of the human circulatory system. Eventually, you might aim to “understand the circulatory system”. But for the moment let’s merely aim to “understand this explanation of the circulatory system.” That’s a lower bar, but it seems to be the bottleneck in many cases.

Concretely, you should be able to say what the author’s sentences mean; you should be able to explain the properties and relationships described and implied in the text; you should be able to make simple inferences relying on those details and any relevant prior knowledge. We won’t worry about automaticity or long-term retention for now. Let’s imagine that you’ve just read the explanation, and it’s fine if your answers are slow.

Cognitive psychologists call this kind of understanding “text comprehension.” Walter Kintsch’s well-studied construction–integration model divides this process into two parts[2]. First, you need to construct some kind of internal representation of the literal text on the page: read the words, parse the syntax, resolve noun references and ambiguous verb senses, etc. Kintsch calls this mental representation a textbase. If we ignore forgetting for a moment, a complete textbase will let you answer questions which only require manipulating the literal words of the author’s explanation. If you have a textbase for “Bargleborp, if left unchecked, causes hixitak”, then you can answer “Nobody noticed Bob’s bargleborp until it was too late. What happened?” (“Hixitak.”) You know what the words say but little about what they mean.

Then, you need to make meaning through integration, forming connections that aren’t on the page. You’ll link the terms and propositions in the text to things you already know, and you’ll infer details the author means but hasn’t explicitly written. Kintsch calls this integrated mental representation a situation model.

For example, in an explanation of the circulatory system, suppose you read this passage: “When a baby has a septal defect, the blood cannot get rid of enough carbon dioxide through the lungs. Therefore, it looks purple.”[3] Your textbase representation of these sentences isn’t enough to understand this explanation. Why does the blood look purple? To explain, you need to give these details meaning by connecting them to your prior knowledge about blood flow, as in this figure:

We can use this model to give a little more mechanistic color to commonplace intuitions about text comprehension and learning:

Better integration produces better memory. Propositions in both the textbase and the situation model will be encoded into long-term memory. But recall and memory consolidation are associative. If you can only “reach” a text’s elements through a single link from your prior knowledge, it will be difficult to retrieve, and it won’t get much reinforcement in later cognition. Kintsch’s experiments suggest that durable learning mostly happens by hooking new elements from a text to old elements in your knowledge base, to form a richer situation model. The more hooks, the better. You can get more hooks by processing the text more deeply or by having more prior knowledge. This accords with the intuition that it’s easier to remember new details about domains where you have a great deal of expertise.

Better memory produces better integration. As we saw in the example above, connections and inferences between nodes in the textbase often require following paths “through” nodes representing prior knowledge. This suggests a mechanistic role for memory reinforcement: you can’t traverse nodes you’ve forgotten. If chapter 2 depends on unfamiliar material from chapter 1, we need to ensure that you don’t forget details from chapter 1 before you use them to integrate explanations in chapter 2.

“Put it in your own words.” That’s common advice for students. But I’m tickled to notice that Kintsch’s model suggests you basically can’t learn without putting a text in your own words (though perhaps unconsciously). To form a well-integrated situation model, you need to connect the concepts referenced by the author’s words to the concepts which make up your prior knowledge. That is, you’re manipulating concepts here, not words. If one author talks about “getting rid of carbon dioxide through the lungs” and another about “exhaling carbon dioxide”, you’ll generally need to interpret those phrases as referring to the same conceptual nodes. If you don’t, you’ll fail to form connections across the situation models formed by the two texts.

Attention moderates comprehension. One trivial failure mode is all too familiar: sometimes your eye skips across a sentence so quickly that you never even decode the words in it. Or you read the words in a sentence, but your attention is so shallow that you never parse its thorny syntax. In these cases, you never form a textbase of the text. You might process the “surface features” of the text—for instance, noting that a particular key word was present in that part of the text—but you won’t be able to use any of that information, even on a verbatim basis. Text selection affects behavior here: in one of Kintsch’s experiments, when a text was too verbose or familiar, subjects tended to lean on their existing situation models and formed a lossy textbase representation.

Integration moderates transfer. If you find that you can only parrot what the author says verbatim, but that you struggle to put the ideas into your own words or put them to use, that’s a sign that your situation model is dominated by your textbase. That is, your mental representation consists mostly of nodes corresponding to the author’s words and phrases. So when cues come in which aren’t shaped exactly like the text, you can’t connect them to what you’ve learned. You’ll have to work more with the text, connecting it to what you know, asking questions, and making inferences.

Self-explanation helps understanding

Kintsch’s experiments mostly ignore issues of attention, self-regulation, and reflection. His studies typically assume an ideal case—the learner constructs the best possible model they can from the text, given their prior knowledge—and ask about properties of those models. In practice, though, my sense is that comprehension issues often arise because readers haven’t constructed the best possible model given their prior knowledge. Often that’s appropriate: careful reading takes time and effort, and most books don’t deserve an unlimited budget. The ideal reader would construct exactly as good an understanding as they intend. My job in “augmentation”, then, would be about expanding the Pareto frontier. In reality, as we discussed in last month’s essay, readers often don’t understand the text as well as they intend. They fail to notice large holes in their understanding—not just of the concept, but of the explanation of the concept.

What might we do about this? We’d like to help readers do things like draw connections, monitor their comprehension, and correct holes or misapprehensions. The best-studied intervention along these lines relies on self-explanation. Michelene Chi and colleagues asked one group of students to explain a text aloud as they read; a control group re-read the passage to hold total study time constant. The explaining group performed much better on post-tests, particularly on more difficult transfer questions which required new inferences.

Within the explaining group, some students naturally produced more voluble explanations than others. Interestingly, the “high explainers” learned much more, even after controlling for differences in initial domain knowledge and verbal ability. In an open-book test, the “high explainers” almost never referred to the text, while “low explainers” (and the control group) did so routinely, suggesting that the “high explainers” had internalized more of the material. The nature of the self-explanations also differed between these groups. The “high explainers”’ explanations were much more likely to draw connections across topics in the text, rather than just within topics. And they made more inferences in their explanations, which may explain why, on the post-test, they were much more able to explain details which were only implied in the text, rather than explicitly stated.

Chi and colleagues suggest that since the “low” and “high” explainers had the same pre-test scores, their differences may be due to skills and habits of self-explanation. For example, maybe the “high explainers” have previously picked up a general strategy of making connections across topics, and a general belief that more explanation is better. This view has led to a long line of research about explicitly teaching effective self-explanation strategies.

Teaching people these sorts of strategies will probably make them more likely to understand conceptual explanations. Speaking for myself, I became a much more effective learner as I expanded my repertoire of reading techniques. But I don’t think this is enough. I read these papers on self-explanation almost a decade ago. But I don’t reliably do it—sometimes because it feels like annoying drudgery, and sometimes because it simply hasn’t occurred to me. It’s all too easy to slide into “just reading”, assuring myself that I’ve understood just fine. Often, I haven’t!

A few prototypes

One way I think about this is: the medium of prose explanation doesn’t really encourage what I want here. The default action is to continue reading. The text won’t do much to help me monitor my understanding or draw appropriate inferences; I need to watch myself continuously and notice when I become confused. If I spend more time understanding something thoroughly, the text won’t react or surface that. Compare to a conversation. In a conversation, my partner will watch for apparent confusion. They’ll ask and answer questions about the explanation. They’ll encourage me when I make the effort to understand.

In the past year’s prototypes, I’ve demonstrated various combined mediums of text-plus-linked-memory-systems, and I realize now that those come much closer to encouraging the kind of behavior I want. I’ve been doing a lot of reading with a simple system where I can highlight any text and write associated memory prompts. The prompts float alongside the associated text, and the text, in turn, is shaded to indicate the presence of the prompt. The spatial markers give me an instant visceral indication of where I’ve given the text lots of attention, and where I’ve not. And there’s a funny thing that happens: the shading is somehow viscerally rewarding! It feels good to “fill up” the sidebar, to “color in” all the “good bits”. Obviously, this can be misleading; highlighting a text feels virtuous, but you don’t want to highlight the whole page. Still, I want to stress the contrast with traditional text: this is a medium which naturally helps me notice where I’ve “gone deep” in a certain sort of way, and encourages me to do more of it. The “default path” is deeper with this medium than when I read a normal book.

Memory prompts are much more difficult to write than self-explanations. Some of that additional difficulty probably translates into richer understanding, but I suspect that a lot of it is “waste heat” lost to the idiosyncratic requirements of rewriting an explanation as a task which will produce exactly the desired kind of retrieval. So, inspired by those observations about the text-plus-linked-memory-system medium, I prototyped a reading experience which “passively rewards” self-explanation in an analogous way. As I read, I can highlight important passages and explain the passage in my own words, adding connections and color as appropriate. I can explain aloud using dictation, if I wish. Those passages are shaded to give me a clear sense of which parts I’ve explained. And, as an added measure, I send the explanation to GPT-4 and display feedback if I’ve made an error or ignored something important.

Here’s a brief video (0m53s) of me writing a “successful” and a “mistaken” self-explanation in a linear algebra textbook. (Sorry, Patreon doesn't allow inline video embeds!)
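For readers curious about the feedback step, here is a rough sketch of how it might be wired up. Everything in it is hypothetical: the prompt wording, the function names, and the convention of replying "OK" are assumptions made for illustration, not the prototype's actual code. The model call is left as a caller-supplied function so that the sketch does not depend on any particular API.

```python
def build_feedback_prompt(passage, explanation):
    """Assemble the text sent to the model (wording is a hypothetical example)."""
    return (
        "A reader highlighted this passage:\n"
        f"{passage}\n\n"
        "They explained it in their own words:\n"
        f"{explanation}\n\n"
        "If the explanation contains an error or ignores something important, "
        "say so briefly. Otherwise reply with exactly: OK"
    )


def review_explanation(passage, explanation, ask_model):
    """ask_model: any callable that takes a prompt string and returns a reply string.

    Returns None when the model raises no concern, otherwise the feedback text
    to display next to the highlighted passage.
    """
    reply = ask_model(build_feedback_prompt(passage, explanation)).strip()
    return None if reply == "OK" else reply
```

Keeping the model behind a plain callable also makes the flow easy to exercise with a stub, for example `review_explanation(passage, text, lambda prompt: "OK")`.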

After a few iterations on this theme, the resulting prototypes are interesting, but not something I’ll develop further in quite this form. On the positive side, when I read with this tool, I feel confident that I understand the material quite well. But it’s also pretty unpleasant. Much of the time, typing out (or speaking) full explanations of the author’s points feels like unimportant busywork, irrespective of whether it’s helping me understand. Interestingly, I don’t always feel that way: when I find myself confused by something the author said, the exercise feels naturally rewarding.

Writing memory prompts might be much more difficult, but it also feels more valuable as an activity than written/spoken self-explanation. I think that’s because the memory prompts aren’t just chaff to be discarded when the reading experience is over. I’m not writing them for the benefit of writing them. I’m writing them so that they’ll be added to my memory practice. I’ll carry them with me after I finish reading, confident that I’ll remember all those details indefinitely.

So, I tried reframing the self-explanation activity as note-writing. What if all the self-explanations I write while reading get bundled up into a Markdown file in my note system? Then I wouldn’t be writing them just for the benefit of writing them; I’d carry them with me afterwards in my notes. Unfortunately, the results in my experiments just weren’t very compelling. I end up with a note file that’s a fragmentary and verbose summary of the original text, in my own words. But it’s not really the summary I would have written if I were trying to produce useful notes on the text. It’s not an output which really motivates its own creation.

Still, I think the two heuristics I’ve discussed are compelling. First, you want the reading medium to naturally “suggest” expert strategies. And second, those “extra” reading activities would ideally result in something you find viscerally valuable, rather than just feeling like virtuous busywork.

This relates to one way I think about my ecological niche in this space: I care, a lot, about how the reading experience feels. The educational psychology literature is littered with interventions and reading augmentation systems, but they’re more or less universally appalling experiences for readers, and the researchers seem utterly uninterested in that fact. My hope is that if I dig into the principles behind some of these systems, I can reconstitute them into something empowering and delightful.

————————

[1] Grant Wiggins and Jay McTighe, Understanding by Design (2005), page 7.

[2] See Kintsch’s monograph, Comprehension: A Paradigm for Cognition (1998). I’m over-simplifying here: construction isn’t just about the textbase and integration about the situation model; integration is needed to form a coherent textbase, too (e.g. to resolve ambiguous interpretations); but this is a workable approximation for our purposes.

[3] This example from Kintsch, W. (1994). Text comprehension, memory, and learning. American Psychologist, 49(4), 294–303. Unfortunately, I think it actually contains a misconception about blood flow: as I understand it, blood flowing to the lungs is dark red, not purple, and that’s because it is low on oxygen, rather than because it contains carbon dioxide. But it’s a nice figure, so we’ll use it for the sake of discussion.

(Recording) Design discussion: Squidgies (language learning tool); Wednesday 07/19, 3PM PDT

Here's the recording of our design discussion today with the creators of Squidgies, a language learning system. I want to congratulate the team on a compelling project. It's fascinating to see people so dramatically reinterpreting interaction models using large language models, not just as a chat interface but as behind-the-scenes infrastructure powering many elements of their design.

Live quantum mechanics study session with Dwarkesh Patel

I mentioned in my latest essay that my recent thinking about reading comprehension was inspired in part by recording a live study session with Dwarkesh Patel. That video is now live—enjoy! We also recorded a traditional interview, full of Dwarkesh's trademark probing questions.

The study video is unscripted. You'll see me making lots of mistakes, some of which I don't catch during our study session. You'll see us both get very tangled in an example problem for the last half hour or so. But I hope you'll also enjoy seeing some unusual study practices in action.

You'll notice the very demanding way that I'm constantly asking questions of the book and myself. This is a routine practice among scholarly readers—I'm not doing anything unique—but I've noticed that many professionals have had no exposure to this sort of close reading, and find it quite surprising. Something I find interesting about this is that it's hard for people to even know that they're unaware of others' reading practices: reading is usually invisible, private, silent.

Of course, what's less routine is the way that I'm writing memory prompts as I read. It's certainly a more overt difference from typical reading practices, but I think it's actually less important for capacity-building than the demanding interpretive work I'm doing. In any case, the two are entwined, as I described in my latest essay on reading comprehension and memory systems.

I'm already getting asked: is this prototype available for others to use? Sorry—not yet. It's held together by duct tape; a couple days of work would be necessary before a wider audience could use it; I'm not sure I want to sacrifice that time at the moment. But if I do devote more time to it, you all would get access long before a public audience.

Design discussion: Squidgies (language learning tool); Wednesday 07/19, 3PM PDT

Experimenting with a new type of event!

In April's office hours, Bill Roberts presented an unusual and interesting new language learning system called Squidgies. I offered to circle back for a longer discussion of some of his design ideas and problems.

I'm interested in the challenge of transmitting tacit knowledge like design methods, so I asked Bill if I could invite you folks to join in as peanut gallery; he graciously agreed.

Join us via Google Meet next Wednesday, July 19, at 3PM PDT [gcal]. I'll post a recording after the event.

Reading comprehension and memory systems

One surprising difficulty in “making it easy for people to remember what they read” is that often, for large swaths of the text, people never really knew what was said in the first place.

Long-term memory doesn’t enter the picture here. The problem is that the reader’s eyes skipped like stones across the surface of the page. They never processed those words beyond visual decoding, if that. The ideas were never represented in working memory. We could roll our eyes and chalk this up to “poor reading skills”, but I wouldn’t want to be too dismissive: I suspect most highly-paid knowledge workers routinely fail at basic reading comprehension in this way.

My reading has improved quite a lot as I’ve investigated reading as a research problem. Still, I’m humbled by basic failures, surprisingly often. I enrolled last year in the University of Chicago’s four-year program on the Great Books. It’s great fun: we meet weekly for deep discussions of challenging texts. But in most classes, at least one of the facilitator’s questions will make me realize that I simply hadn’t comprehended what the text said in some important passage—and I hadn’t noticed.

Getting the gist

Can it really be that educated adults so routinely encounter basic reading comprehension problems? Maybe the trouble is that the scenarios I’m thinking about are unrealistically demanding. In my research, I’ve been observing readers’ comprehension with a goal of helping people internalize a text in great detail. My University of Chicago course covers unusually challenging texts from antiquity, and perhaps the discussion questions are more probing than the ones you might ask of “normal” reading.

Besides, some say, most people are just “reading for the gist” anyway. They want the takeaways. They don’t feel the need to really understand—much less memorize—all the fine details of an author’s explanation. Fine; say I accept that for the moment. Well: do people “get the gist”? Are they aware of whether they got it?

In one of the seminal experiments on adult reading comprehension, Michael Pressley and colleagues asked university students to read short SAT-style text passages (188-520 words). They were instructed to read at a pace which would allow them to answer a question about its contents. Some of the questions were about “the gist”—e.g. to state the main idea, or its primary purpose; others were about details. After students finished reading each passage, they were given a question to answer. Then students were given the opportunity to re-read the passage and change their response, if they thought it might be wrong.

Now here’s the central finding: when readers gave an incorrect answer to a question about “the gist”, they were very unlikely to choose to re-read and try again—they did so only 20%/27% of the time (there were two relevant conditions). And even in those instances where students re-read, only 50%/57% ended up with a correct answer.

Maybe students are just lazy, and they’re choosing not to re-read because this is an artificial experiment? Perhaps, but when the question was about a specific detail in the passage, students with wrong answers did re-read and change their answer 60%/75% of the time. That difference is hard to square with the hypothesis that students just didn’t care about getting the right answer. To explore that hypothesis further, the authors ran a second experiment. In this round, students just rated their confidence in their answers; they didn’t get a chance to re-read and alter them. Students reported high confidence in 60%/64% of wrong answers to “gist” questions! In fact, they were almost as confident in their wrong answers as in their right answers.

The authors also wondered: is comprehension awareness just a matter of high “verbal ability”? Not quite, it turns out. They administered an abbreviated verbal SAT to the same subjects and found a moderate correlation between verbal performance and response accuracy, but no significant correlation with the accuracy of students’ confidence ratings. The authors’ interpretation here is that comprehension awareness is at least partially separate from traditional performance measures. Students who score well on SATs do so because their first attempt is more likely to be right; when they do give a wrong answer, they’re not more likely to notice that than their lower-scoring peers.

The same authors published a different study[1] with a charming title: “Being really, really certain you know the main idea doesn’t mean you do”. They ran another experiment like the ones we've discussed, this time asking one group of students to read and re-read each passage as many times as necessary to confidently answer a question about it. Compared to a control group which just read each passage once, the “high-certainty” group took more time and had more confidence… but didn't perform significantly better.

Now, we shouldn’t conclude that no one “gets the gist” from what they read. Skilled readers do exist—e.g. actively publishing professors are often quite sophisticated—and we can learn plenty by studying their behavior. But I’d guess that most knowledge workers would often exhibit the same comprehension problems that we repeatedly observe[2] in university students.

From “what it says” to “what it means and why”

I’ve been thinking about the problem of reading comprehension since my work with Alex a few months ago, but another recent experience forced it to the front of my mind.

Dwarkesh Patel is the host of The Lunar Society, a podcast focused on interviews with scientists and domain experts. Dwarkesh differentiates himself by asking probing, well-researched questions of his guests, to go beyond the usual shallow conversations. He had a few physicists scheduled to come on the show in the coming months, so he decided to embark on a serious study of more physics to prepare himself. In conversation about how he might do that most effectively, I suggested that Dwarkesh might enjoy watching me study the quantum mechanics book he’d already started reading. I’d verbalize my thought process, and he could pester me with questions[3].

Dwarkesh was quite surprised by my approach to the book. I moved at a pace of about fifteen minutes per page, while he had spent a few minutes or less. More importantly, I was constantly asking questions of the text and of myself. Some examples:

  • What does this sentence mean? Can I explain it in my own words?
  • Which ideas are particularly important here?
  • The author clearly thinks I should see why this claim is true—so why is it true?
  • The author’s emphasizing this detail—so why is it important?
  • The author seems to be setting up a contrast here—so what is it, exactly?
  • How does this detail relate to my prior knowledge in physics?
  • If I hide all but the beginning of this worked example, can I produce the rest myself?
  • I made a mistake a moment ago—do I understand why? Can I explain my misapprehension?
  • And of course: can I simply recall what was said on the previous page?

These questions will sound familiar to scholarly readers. In How to Read a Book, Adler and van Doren suggest that the essence of reading for understanding is asking questions of the book, and trying to answer them. An undemanding reader “asks no questions—and gets no answers.” I was being demanding, in fairly ordinary ways.

Now, Dwarkesh is a sophisticated, motivated thinker with a university education and a job which demands piles of careful reading. Yet these strategies—these ways of interrogating a text—were startlingly new to him. Not only had he not asked these questions while he read the text, and not only had he not fully understood the meaning of many of the phrases I was interrogating, but he hadn’t realized that he hadn’t understood what the author meant in those phrases. He wasn’t making a conscious choice not to dig deeper into those sentences (say, for the sake of time). Rather, it just wasn’t salient that he was making an implicit choice as a reader about how deep to go.

I don’t think this is unusual: I’d guess that most knowledge workers (particularly in STEM) read this way most of the time, unaware of the tradeoffs they’re making. I certainly did before I started my research on learning.

Adler and van Doren give us one over-simplified contrast between “reading for information” and “reading for understanding”: it’s the difference between being able to say what the author says (information), and being able to say what he means and why he says it (understanding). In the previous section, we focused mostly on problems with knowing what the author says. We looked at experiments where students were asked to state the text’s main idea, not to make sophisticated inferences. By contrast, Dwarkesh was missing details around meaning and implication.

These deeper levels are tougher to access because they demand that readers go beyond what’s printed on the page. If a sentence makes no sense to you, the words themselves will trip you up, so long as you’re paying attention. On the other hand, if you understand what a sentence says but don’t grasp its implications for the author’s explanation, the literal words won’t necessarily trigger confusion. You’ll only notice if some voice in your head is continuously demanding answers to questions like “how does this part fit into the whole?” The question isn’t on the page. The answer usually isn’t, either, at least overtly.

Interactions between reading comprehension and memory systems

So: reading comprehension is a bigger problem than many people expect, particularly at deeper levels of understanding. One implication is that if my goal is to help people reliably internalize difficult texts—not just what they say but what they mean and why—a direct focus on memory may put the cart before the horse.

I tried to help Alex by giving him a memory system stocked with all the important details from the physics chapters he was studying. But his review sessions were often unpleasant ordeals: for many prompts, he found the expected response confusing, or didn’t see why it mattered, or felt like he was parroting the answer without really understanding. Review sessions took longer, felt harder, and delivered less benefit than I’d expected. Understandably, Alex developed a somewhat aversive relationship with memory practice.

I’m increasingly inclined to see these issues as rooted in reading comprehension. When Alex found a prompt’s answer confusing, I think he also would have found it confusing immediately after reading the relevant explanation in the text. So these problems were probably not caused by forgetting, except insofar as Alex might have understood the reading less well because weak long-term memory of prerequisite concepts produced excess cognitive load as he read. That is, if you have unreliable recall for foundational details about electric fields, you’ll have trouble understanding explanations of Gauss’s Law in the next chapter. But we had problems with prompts about the first chapter, so this can’t fully explain what’s going on.

Very naively, we might say: first, understand the material; then, we’ll ensure you remember it. These are separate problems. Piotr Wozniak, the creator of SuperMemo, suggests as much. This is a good simplifying heuristic, but memory system prompts have—or can have—a more complicated relationship with the process of understanding.

In-text questions promote understanding

In mnemonic texts like Quantum Country, we interleave retrieval practice directly into the text, so that every few minutes of reading, you pause to answer questions about what you just read. Readers told us that this embedded practice dramatically altered the way they read. Prior research on “adjunct questions” in texts has isolated four distinct effects:

  1. specific backward effects: actively recalling information makes it more likely that you’ll be able to recall that information in the future
  2. general backward effects: the questions induce mental review and deeper processing of the surrounding text and related ideas; poor performance may cause you to re-read the text[4]
  3. specific forward effects: in the subsequent text, you’ll be more attentive to the kinds of things the questions ask about; you’ll have better recall for related later material[5]
  4. general forward effects: you’ll pay more attention in general, including to unrelated material; you may read more slowly and carefully if you learn you performed poorly

My understanding is that all of these effects occur, at least to some extent, even if you didn’t understand the material very well. In fact, encountering the embedded questions may reveal to you that you don’t understand the material and hence cause you to understand the material (e.g. by re-reading more carefully). These effects suggest a more complex model than a linear “understand, then remember” process.

Specific strategies aside, reading comprehension is largely about self-regulation. How quickly should you read? What should you focus on? What kinds of questions should you be asking of the text? How well is your current behavior producing the results you want? It’s hard to answer metacognitive questions like these while your mind is occupied with difficult material, especially if that’s not a habit you’ve already built. Embedded practice partially outsources the asking and answering of these questions. The prompts model (a certain type of) “successful” reading behavior, offer feedback, and create a natural pause for reflection and integration.

Review sessions promote understanding

Let’s look at the review sessions which occur over the days and weeks following your initial reading.

If you find a prompt utterly baffling at this stage, then recall practice is probably not going to help your understanding, except insofar as it causes you to re-read the relevant text. But this is an unpleasant way to discover that you didn’t understand. You’re not sitting in front of the book anymore; it may not be easily accessible at all; you must either interrupt review to re-read (awkward), or flag the concept for later study (unreliable). It’s too tempting to just mark the prompt as “forgotten” and move on. (There’s no button for “I don’t understand this”!) When I was working with Alex, I hadn’t embedded the questions into the text, as we did in the mnemonic medium, so this was often his experience—stumbling on comprehension issues days later.

But for less extreme examples, review sessions offer a good opportunity to understand more deeply. The first time you answered the question, while reading the text, the idea was still raw in your mind. But when you answer it again a few days or weeks later, you’ll probably look at it somewhat differently. Maybe you’ve read more material which depended on this idea, or you’ve used it to solve problems, or it’s come up in a conversation. Some of those experiences will re-surface alongside the original detail. They may help you notice new connections. Even when I’m the one writing the prompts, I often realize some important aspect of what the author means only after several rounds of review.

Retrieval enables activities which promote understanding

Say that you’ve just read about Gauss’s law. You feel you understand what the author is saying. You can explain it to another person. You follow how it’s used in the examples given, and see how it relates to Coulomb’s law. In other words, you know “what the author means and (some of) why he is saying it.” In the linear understand-then-memorize model, you’ve finished part one.

But then you try to use Gauss’s law in a simple practice exercise, and you find that you struggle. You’re constantly flipping back to look at the definition and examples. If you could solve this problem, and half a dozen more, you’d understand Gauss’s law much more deeply. For example, you’d viscerally grasp the consequences of the dot product inside the surface integral. But your wobbly memory of the material is making it hard to solve the exercise.

If you reviewed the relevant prompts over a few days, you’d consolidate those details into higher-level chunks, and you’d build relevant connections that would help you retrieve the right details at the right time during problem solving. After you solved a handful of practice problems, you’d build automaticity for some of the problem-solving processes surrounding Gauss’s law, which would make it easier to solve more problems, which in turn would facilitate more understanding.

In this story, we invert the understand-then-memorize model: memorizing helps you create certain kinds of understanding.

Prompt-writing promotes understanding

So far, I’ve adopted the frame of the mnemonic medium, in which an expert writes memory system prompts for you, the reader. But if you’re willing and able to write your own memory prompts, that process can have a profound effect on your understanding.

The process is very demanding. To write good prompts, you must constantly ask: Which parts of this text are crucial, and which not? Can I restate this idea in my own words? How does this idea connect to other things I know about? How does this connect to my interests? Can I find boundary conditions I should write prompts about? Can I generate examples to use in a practice question? Can I observe anything important about the author’s mental model or motivations? These prompt-generating questions overlap substantially with the set of questions one must ask to read for understanding. And in both cases, you’ll often find that you can’t answer the questions, because you haven’t read carefully enough. That gives you some of the feedback you need to regulate your reading.

The questions in the last paragraph are ones you’d ask of the text, but as you better understand the natural “grain” of the prompt-writing medium, you’ll notice that you’ll also ask questions of your prompts. You’ll look at a prompt you’ve drafted and think: does this distill the heart of the idea? You’ll try to polish prompts: are any of these details unnecessary, removable? You’ll wonder: have I captured all the important connections? You’ve written a prompt stating that something is the case; so now you naturally ask: can you write one about why it is the case? And in answering these questions, you’ll end up with a sharper picture of the material.

What’s odd about all this is that ostensibly, you’re making these prompts so that you can remember these details later. But in many cases, the process of creating the prompts may have much more impact on your learning experience than any subsequent memory practice. Maybe you could just throw out the prompts afterwards—treat them like a structured note-taking method. But Michael Nielsen points out that the downstream practice has a powerful motivational role. Note-taking can sometimes feel like abstract homework. You know you’re probably never going to look at the notes again, and that can erode your motivation. But if you’ve had experiences where memory practice has helped you effortlessly retain ideas in great detail, you start to attach a kind of automatic value to writing new prompts about interesting ideas. You adopt the belief: “Prompts do good things for me.” You know you’re going to see these things again. You have confidence that you’re going to durably remember all the details you’re writing about, so the process feels more real, more purposeful[6]. That’s potentially true even if many of the prompts are actually just “chaff”, kindling you needed to write and later discard. You used them to understand the material, and to compose a few prompts that you actually care to review.

Of course, this is a tremendously effortful and difficult process. That’s roughly why the mnemonic medium automates the prompt-writing process away completely. Some scaffolded middle ground might be interesting territory to explore.

Implications for my research

I opened with one simple way to frame my goal: “making it easy for people to remember what they read.” But in the context of learning and explanatory texts, I’m really more interested in “making it easy for people to internalize and make use of complex ideas.” It’s certainly a more interesting goal. But it’s also scope creep. Last month I wrote about how memory systems might want to expand into problem-solving practice. Here I’m writing about the possible need to expand into scaffolding reading comprehension.

It’s not obvious that a broader scope is a good idea. In fact, it’s probably a terrible idea. Countless entire careers have been spent on recall, problem-solving, and reading comprehension individually. So I’m not really planning to bite off that whole problem. But there are obvious interactions between these problems. By thinking hard at the unusual point of their intersection, I may end up with a more powerful solution to some subset of that space than I would by fixating on, say, memory alone.

Comprehension-centric in-text questions

Earlier, we discussed how the mnemonic medium’s embedded questions can help improve reading comprehension. But it’s worth noting that we weren’t exactly trying to do that. Speaking just for myself—Michael’s design goals may have differed—I was trying to test and reinforce readers’ memory for the specific material in the prompts, and I was creating an on-ramp for subsequent practice. We talked about the notion of giving people “feedback” while they read, and of modeling what good memory prompts looked like, but we didn’t adopt an overt frame of systematically facilitating reading comprehension.

It’s interesting to ask: if your primary goal were to enhance reading comprehension, what design would you produce? For example, you might not need to ask nearly so many questions to get the same metacognitive benefits. Maybe it would be better to save the detailed retrieval practice for the next day’s review session; instead, you might ask a higher-level interpretation question or two. The point wouldn’t be to remember the answer—the answer wouldn’t be in the text at all. Instead, such questions would be aimed at promoting deeper processing and reflection of the text. For recall prompts, our design discourages looking back at the source text; but for these comprehension prompts, we’d want the interface to encourage scrolling back up for another read.

There’s evidence that these kinds of questions can help reading comprehension, in controlled experimental settings. But I’m not terribly excited about this direction. Lots of books have interpretative questions like this already; they don’t seem to have a powerful impact; what’s different about my proposal? For that matter, lots of textbooks have retrieval practice-like review questions embedded in the text. What makes the mnemonic medium different is that when you answer a question which appears in a mnemonic essay, you’re setting yourself up to remember that answer forever.

I find that I mostly hate answering comprehension-oriented questions in books. They often feel condescending, boring; like unpleasant homework. I don’t care about answering them, and (rightly or wrongly) I generally don’t feel that doing so will help me in a way I care about.

Discussion questions

Interpretative questions in textbook exercise listings? Nope; boring; I don’t care. But when I show up at my University of Chicago class, the facilitator asks me interpretative questions, and I find I want to answer them. Sometimes that’s because they’re unusually interesting, but often they’re simple questions like: “What justification does Aristotle give for X? Do you believe it?”

I think the difference is mostly about social context. A real person I respect is asking the question; they’re going to genuinely engage with my answer; other students might build on my answer or have interestingly different answers; the facilitator will connect our answers to subsequent questions; etc. It’s also partially about the framing of my activity. I’ve shown up to have a discussion about the book, so that’s what I’m going to do. I’m “discussing”, not “answering boring comprehension questions.”

Of course, these simple “discussion” questions routinely reveal that I’d failed to understand what the author was saying. I want to make clear: I have trouble with reading comprehension too! In the scenario with Dwarkesh earlier, he saw me in “super careful expert reader mode”. I was “on” because I knew that’s what he wanted to see, and the social setting sharpened my engagement. And when I’m in “carefully processing a text into prompts mode”, my comprehension is solid. But in less extreme scenarios, I slip up pretty regularly. I’d love to have reading augmentation which helps me make sure that I’m understanding a text as deeply as I intend to—so long as that augmentation isn’t too burdensome.

These discussions really do reinforce my comprehension, albeit very slowly, and with spotty coverage. And I can’t easily orchestrate a well-facilitated discussion for everything I read. The technologist’s snap answer here is: use large language models! Have a bot ask me questions and give me feedback about my answers. Maybe it could work in some future where I have an ongoing relationship with the bot across time, but for the time being, I find these kinds of interactions leave me totally cold. I just don’t care about answering the bot’s question; I know that it doesn’t really “want” to know the answer; it doesn’t “care” about my answer; my answer isn’t “meaningful” to it; etc.

One more promising alternative might lie in something closer to “elaborative vocal rehearsal”. It's a pretty simple practice: after you finish a passage, close your eyes and explain it back in your own words. This reinforces basic comprehension in a similar way to question-answering, but it has a very different emotional feel. In particular, it doesn't make me feel like I'm answering a boring question I don't care about. There are some interesting opportunities for augmentation here. For example, the text could highlight important details which you didn't include in your explanation. It could scribble over a realtime transcript of your explanation, circling bits which seem to conflict with the text. Maybe lightweight badges could reflect that you captured what the text means but not why it matters.

Very pragmatically, there’s an important problem to be solved: it’s extremely unpleasant to be asked to remember details you never understood in the first place. One way to avoid that is to facilitate comprehension, as we’ve been discussing; another way is to avoid asking questions about non-comprehended material. Perhaps elaborative vocal rehearsals offer a way to orchestrate the latter: we could only ask you to remember details you included in your explanation.
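As a rough illustration of that last idea, here is a minimal Python sketch that keeps only the prompts whose answers overlap with what the reader said in their explanation. Everything in it is an assumption made for illustration: the function names, the example prompts, the 0.5 threshold, and especially the crude word-overlap matching, which stands in for whatever semantic comparison a real system would need.

```python
import re


def key_terms(text):
    """Lowercased words of 4+ letters: a crude proxy for a detail's content."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def covered_prompts(prompts, explanation, threshold=0.5):
    """Keep prompts whose answer terms mostly appear in the reader's explanation.

    `prompts` is a list of (question, answer) pairs. `threshold` is a
    hypothetical cutoff on the fraction of answer terms that must match.
    """
    said = key_terms(explanation)
    kept = []
    for question, answer in prompts:
        terms = key_terms(answer)
        if terms and len(terms & said) / len(terms) >= threshold:
            kept.append((question, answer))
    return kept


# Hypothetical prompts and a hypothetical spoken explanation.
prompts = [
    ("What does Gauss's law relate?",
     "electric flux through a closed surface and enclosed charge"),
    ("What operation appears inside the surface integral?",
     "the dot product of field and area element"),
]
explanation = "The flux through any closed surface depends only on the charge enclosed."

for question, _ in covered_prompts(prompts, explanation):
    print(question)
```

With these inputs, only the first prompt survives, because the explanation never mentions the dot product. The second prompt would wait until the reader's explanation shows they had grasped that detail.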

Open-book practice

My ideal memory system would not only reinforce my recall of an idea, but actually deepen my understanding of it over time. We’ve discussed a few ways it might do that: scaffolded problem-solving practice, reflection prompts, synthesis prompts, etc. These tasks aim to produce understanding from within—that is, by solving a certain kind of problem, you’ll acquire a certain kind of insight.

But I could probably also deepen my understanding by returning to the text two weeks later, with reinforced memory and a fresh perspective. I might better grasp the significance of one of the author’s points, or see some connection I missed the first time through. One limitation of the review session modality as it exists today is that it exists apart from the text; it basically assumes you “got” everything on your first read through the text. If you didn’t, it must be a simple failure of memory. And the review session can’t really include a probing question which would send you back through the text for a new interpretation.

I can imagine designing an alternative review session interface around “open-book” review. Imagine that a prompt floats along one edge of the screen, while the rest is given over to the original text. Structured, scaffolded, purposeful re-reading could be integrated directly into the experience. And it seems that open-book tests can produce comparable effects on long-term memory.

On its surface, this suggestion seems quite similar to the integrated comprehension questions I suggested (and dismissed) earlier. But I have two important differences in mind. First, I’m imagining that these re-reading prompts wouldn’t be simple comprehension questions—they’d be probing questions which would encourage the reader to see the text in a new way, or to make connections with subsequent material. Second, I think it matters that you’d be encountering the question in the context of a review session. You’ve signed up to practice, and here’s a practice question. When you’re reading, you might feel that you’ve signed up to read, not answer review questions.

————————

As should be obvious, I’m still figuring out what I’d like to do with my recent observations about reading comprehension and problem-solving practice. I suspect there’s some powerful synthesis with memory systems, waiting to be discovered.

————————

Thanks to Michael Nielsen and Russel Simmons for helpful conversations on this topic. Thanks also to Dwarkesh Patel and Alex for sharing their learning experiences.

Footnotes

[1] Thanks to Gwern for finding the fulltext here and for documenting that search as a case study.

[2] For more, see the bibliographic litany in this review, page 9, first paragraph.

[3] Video of this should be available in the coming weeks, alongside a more traditional interview. [edit: now available; more traditional interview here]

[4] The latter claim isn’t discussed in Hamaker’s review, but I’ve heard lots of reports like this from mnemonic medium readers. Interestingly, I haven’t yet found studies exhibiting this effect in the adjunct question or reading comprehension literature. Probably I just haven’t yet found the right term of art.

[5] The latter claim isn’t discussed in Hamaker’s review, and it’s not as well studied as some of the other effects I’ve mentioned, but it’s usually called “test-potentiated learning” or the “forward testing effect”. See e.g. Arnold and McDermott, “Test-Potentiated Learning: Distinguishing Between Direct and Indirect Effects of Tests” (2013).

[6] This belief-building process is part of why the experience of “just parroting back answers” is so harmful. It’s not just that it feels like you’re wasting time in that moment, or that you’re not understanding the thing you want to understand. It (rightly) undermines your belief in the value of memory practice.


(Recording) Seminar: Mark Weiser's "The computer for the 21st century" and ubiquitous computing (1991)

We were fortunate to have some very thoughtful participants with some very relevant knowledge. Thanks to them for an interesting discussion on Weiser's classic paper.

(Please don't share this video publicly)


Seminar: Mark Weiser's "The computer for the 21st century" and ubiquitous computing (1991); June 27 @4PM PDT

The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it. …
There is more information available at our fingertips during a walk in the woods than in any computer system, yet people find a walk among trees relaxing and computers frustrating. Machines that fit the human environment instead of forcing humans to enter theirs will make using a computer as refreshing as taking a walk in the woods.
— Mark Weiser, ibid

Join me (via Google Hangout) on Tuesday, June 27 at 4PM PDT [GCal] to discuss Weiser's classic paper introducing "ubiquitous computation". It's been on my mind recently—Weiser's vision has both significant overlaps and differences with the Vision Pro.

Please read the paper before attending: it's short, and originally published in Scientific American, like As We May Think. Bring your noticings, wonderings, and ideas; I'll bring discussion questions.

I'll record our discussion and share it here afterward.


Some initial rough notes on the Vision Pro

I don't normally send new working notes out here, but I figure this might be of interest to a number of you. The situation is still very much evolving, of course! This isn't a proper "Letter", so no audio…

Vision Pro 

The hardware seems faintly unbelievable—a computer as powerful as Apple’s current mid-tier laptops (M2), plus a dizzying sensor/camera array with dedicated co-processor, plus displays with 23M 6µm pixels (my phone: 3M 55µm pixels; the PSVR2 is 32µm) and associated optics, all in roughly a mobile phone envelope.

But that kind of vertical integration is classic Apple. I’m mainly interested in the user interface and the computing paradigm. What does Apple imagine we’ll be doing with these devices, and how will we do it?

Paradigm

Given how ambitious the hardware package is, the software paradigm is surprisingly conservative. visionOS is organized around “apps”, which are conceptually defined just like apps on iOS:

  • to perform an action, you launch an app which affords that activity; no attempt is made to move towards finer-grained “activity-oriented computing”
  • apps present interface content, which is defined on a per-app basis; app interfaces cannot meaningfully interact, with narrow carve-outs for channels like drag-and-drop
  • (inferred) apps act as containers for files and documents; movement between those containers is constrained

I was surprised to see that the interface paradigm is classic WIMP. At a high level, the pitch is not that this is a new kind of dynamic medium, but rather that Vision Pro gives you a way to use (roughly) 2D iPad app UIs on a very large, spatial display. Those apps are organized around familiar UIKit controls and layouts. We see navigation controllers, split views, buttons, text fields, scroll views, etc, all arranged on a 2D surface (modulo some 3D lighting and eye tracking effects). Windows, icons, menus, and even a pointer (more on that later).

These 2D surfaces are in turn arranged in a “Shared Space”, which is roughly the new window manager. My impression is that the shared space is arranged cylindrically around the user (moving with them?), with per-window depth controls, but I’m not yet sure of that. An app can also transition into “Full Space”, which is roughly like “full screening” an app on today’s OSes.

In either mode, an app can create a “volume” instead of a “window”. We don’t see much of this yet: the Breathe app spreads into the room; panoramas and 3D photography are displayed spatially; a CAD app displays a model in space; an educational app displays a 3D heart. visionOS’s native interface primitives don’t make use of a volumetric paradigm, so anything we see here will be app/domain-specific (for now).

Input

For me, the most interesting part of visionOS is the input part of the interaction model. The core operation is still pointing. On NLS and its descendants, you point by indirect manipulation: moving a cursor by translating a mouse or swiping a trackpad, and clicking. On the iPhone and its descendants, you point by pointing. Direct manipulation became much more direct, though less precise; and we lost “hover” interactions. On Vision Pro and its descendants, you point by looking, then “clicking” your bare fingers, held in your lap.

Sure, I’ve seen this in plenty of academic papers, but it’s quite wild to see it so central to a production device. There are other VR/AR devices which feature eye tracking, but (AFAIK) all still ship handheld controllers or support gestural pointing. Apple’s all in on foveation as the core of their input paradigm, and it allows them to produce a controller-free default experience. It reminds me of Steve’s jab at styluses at the announcement of the iPhone.

My experiences with hand tracking-based VR interfaces have been uniformly unpleasant. Without tactile feedback, the experience feels mushy and unreliable. And it’s uncomfortable after tens of seconds (see also Bret’s comments). The visionOS interaction model dramatically shifts the role of the hands. They’re for basically-discrete gestures now: actuate, flick. Hands no longer position the pointer; eyes do. Hands are the buttons and scroll wheel on the mouse. Based on my experiences with hand-tracking systems, this is a much more plausible vision for the use of hands, at least until we get great haptic gloves or similar.

But it does put an enormous amount of pressure on the eye tracking. As far as I can tell so far, the role of precise 2D control has been shifted to the eyes. The thing which really sold the iPhone as an interface concept was Bas’s and Imran’s ultra-direct, ultra-precise 2D scrolling with inertia. How will scrolling feel with such indirect interaction? More importantly, how will fine control feel—sliders, scrubbers, cursor positioning? One answer is that such designs may rely on “direct touch”, akin to existing VR systems’ hand tracking interactions. Apple suggests that “up close inspection or object manipulation” should be done with this paradigm. Maybe the experience will be better than on other VR headsets I’ve tried because sensor fusion with the eye tracker can produce more accuracy?

By relegating hands to a discrete role in the common case, Apple reinforces the 2D conception of the visionOS interface paradigm. You point with your eyes and “click” with your hands. One nice benefit of this change is that we recover a natural “hover” interaction. But moving incrementally from here to a more ambitious “native 3D” interface paradigm seems like it would be quite difficult.

For text, Apple imagines that people will use speech for quick input and a Bluetooth keyboard for long input sessions. They’ll also offer a virtual keyboard you can type on with your fingertips. My experience with this kind of virtual keyboard has been uniformly bad—because you don’t have feedback, you have to look at the keyboard while you type; accuracy feels effortful; it’s quickly tiring. I’d be surprised (but very interested) if Apple has solved these problems.

Strategy

Note how different Apple’s strategy is from the vision in Meta’s and MagicLeap’s pitches. These companies point towards radically different visions of computing, in which interfaces are primarily three-dimensional and intrinsically spatial. Operations have places; the desired paradigm is more object-oriented (“things” in the “meta-verse”) than app-oriented. Likewise, there are decades of UIST/etc papers/demos showing more radical “spatial-native” UI paradigms. All this is very interesting, and there’s lots of reason to find it compelling, but of course it doesn’t exist, and a present-day Quest / HoloLens buyer can’t cash in that vision in any particularly meaningful way. Those buyers will mostly run single-app, “full-screen” experiences; mostly games.

But, per Apple’s marketing, this isn’t a virtual reality device, or an augmented reality device, or a mixed reality device. It’s a “spatial computing” device. What is spatial computing for? Apple’s answer, right now, seems to be that it’s primarily for giving you lots of space. This is a practical device you can use today to do all the things you already do on your iPad, but better in some ways, because you won’t be confined to “a tiny black rectangle”. You’ll use all the apps you already use. You don’t have to wait for developers to adapt them. This is not a someday-maybe tech demo of a future paradigm; it’s (mostly) today’s paradigm, transliterated to new display and input technology. Apple is not (yet) trying to lead the way by demonstrating visionary “killer apps” native to the spatial interface paradigm. But, unlike Meta, they’ll build their device with ultra high-resolution displays and suffer the premium costs, so that you can do mundane-but-central tasks like reading your email and browsing the web comfortably.

On its surface, the iPhone didn’t have totally new killer apps when it launched. It had a mail client, a music player, a web browser, YouTube, etc. The multitouch paradigm didn’t substantively transform what you could do with those apps; it was important because it made those apps possible on the tiny display. The first iPhone was important not because the functionality was novel but because it allowed those familiar tools to be used anywhere. My instinct is that the same story doesn’t quite apply to the Vision Pro, but being generous for a moment, I might suggest its analogous contribution is to allow desktop-class computing in any workspace: on the couch, at the dining table, etc. “The office” as an important, specially-configured space, with “computer desk” and multiple displays, is (ideally) obviated in the same way that the iPhone obviated quick, transactional PC use.

Relatively quickly, the iPhone did acquire many functions which were “native” to that paradigm. A canonical example is the 2008 GPS-powered map, complete with local business data, directions, and live transit information. You could build such a thing on a laptop, but the amazing power of the iPhone map is that I can fly to Tokyo with no plans and have a great time, no stress. Rich chat apps existed on the PC, but the phenomenon of the “group chat” really depended on the ubiquity of the mobile OS paradigm, particularly in conjunction with its integrated camera. Mobile payments. And so on. The story is weaker for the iPad, but Procreate and its analogues are compelling and unique to that form factor. I expect Vision Pro will evolve singular apps, too; I’ll discuss a few of interest to me later in this note. Will its story be more like the iPhone, or more like the iPad and Watch?

It’s worth noting that this developer platform strategy is basically an elaboration of the Catalyst strategy they began a few years ago: develop one app; run it on iOS and macOS. With the Apple Silicon computers, the developer’s participation is not even required: iPad apps can be run directly on macOS. Or, with SwiftUI, you can at least use the same primitives and perhaps much of the same code to make something specialized to each platform. visionOS is running with the same idea, and it seems like a powerful strategy to bootstrap a new platform. The trouble here has been that Catalyst apps (and SwiftUI apps, though somewhat less so) are unpleasant to use on the Mac. This is partially because those frameworks are still glitchy and unfinished, but partially because an application architecture designed for a touch paradigm can’t be trivially transplanted to the information/action-dense Mac interface. Apple makes lots of noises in their documentation about rethinking interfaces for the Mac, but in practice, the result is usually an uncanny iOS app on a Mac display. Will visionOS have the same problem with this strategy? It benefits, at least, from not having decades of “native” apps to compare against.

Dreams

If I find the Vision Pro’s launch software suite conceptually conservative, what might I like to see? What sorts of interactions seem native to this paradigm, or could more ambitiously fulfill its unique promise?


Huge, persistent infospaces: I love this photo of Stewart Brand in How Buildings Learn. He’s in a focused workspace, surrounded by hundreds of photos and 3”x5” cards on both horizontal and vertical surfaces. It’s a common trope among writers: both to “pickle” yourself in the base material and to spread printed manuscript drafts across every available surface. I’d love to work like this every day, but my “office” is a tiny corner of my bedroom. I don’t have room for this kind of infospace, and even if I did, I wouldn’t want to leave it up overnight in my bedroom. There’s tremendous potential for the Vision Pro here. And unlike the physical version, a virtual infospace could contend with much more material than could actually fit in my field of view, because the computational medium affords dynamic filtering, searching, and navigation interactions (see Softspace for one attempt). And you could swap between persistent room-scale infospaces for different projects. I suspect that visionOS’s windowing system is not at all up to this task. One could prototype the concept with a huge “volume”, but it would mean one’s writing windows couldn’t sit in the middle of all those notes.

Ubiquitous computing, spatial computational objects: The Vision Pro is “spatial computing”, insofar as windows are arranged in space around you. But it diverges from the classic visions along these lines (e.g. Mark Weiser’s ubiquitous computing, Dynamicland) in that the computation lives in windows. What if programs live in places, live in physical objects in your space? For instance, you might place all kinds of computational objects in your kitchen: timers above your stove; knife work reference overlays above your cutting board; a representation of your fridge’s contents; a catalog of recipes organized by season; etc. Books and notes live not in a virtual 2D window but “out in space”, on my coffee table (solving problems of Peripheral vision). When physical, they’re augmented—with cross-references, commentary from friends, practice activities, etc. Some are purely digital. But both signal their presence clearly from the table while I’m wearing the headset. My memory system is no longer stuck inside an abstract practice session; practice activities appear in context-relevant places, ideally integrating with “real” activities in my environment.

Shared spatial computing: Part of these earlier visions of spatial computing, and particularly of Dynamicland, is that everything I’m describing can be shared. When I’m interacting with the recipe catalog that lives in the kitchen, my wife can walk by, see the “book” open and say “Oh, yeah, artichokes sound great! And what about pairing them with the leftover pork chops?” I’ll reserve judgment about the inherent qualities of the front-facing “eye display” until I see it in person, but no matter how well-executed that is, it doesn’t afford the natural “togetherness” of shared dynamic objects. Particularly exciting will be to create this kind of “togetherness” over distance. I think a “minimum viable killer app” for this platform will be: I can stand at my whiteboard, and draw (with a physical marker!), and I see you next to me, writing on the “same surface”—even though you’re a thousand miles away, drawing on your own whiteboard. FaceTime and Freeform windows floating in my field of view don’t excite me very much as an approximation, particularly since the latter requires “drawing in the air.”

Deja vu

A few elements of visionOS’s design really tickled me because they finally productized some visual interface ideas we tried in 2012 and 2013. It’s been long enough now that I feel comfortable sharing in broad strokes.

The context was that Scott Forstall had just been fired, Jony Ive had taken over, and he wanted to decisively remake iOS’s interface in his image. This meant aggressively removing ornamentation from the interface, to emphasize user content and to give it as much screen real estate as possible. Without borders, drop shadows, and skeuomorphic textures, though, the interface loses cues which communicate depth, hierarchy, and interactivity. How should we make those things clear to users in our new minimal interfaces? With a few other Apple designers and engineers[1], I spent much of that year working on possible solutions that never shipped.

You might remember the “parallax effect” from iOS 7’s home screen, the Safari tabs view, alerts, and a few other places. We artificially created a depth effect using the device's motion sensors. Internally, even two months before we revealed the new interface, this effect was system-wide, on every window and control. Knobs on switches and scrubbers floated slightly above the surface. Application windows floated slightly above the wallpaper. Every app had depth-y design specialization: the numbers in the Calculator app floated way above the plane, as if they were a hologram; in Maps, pins, points of interest, and labels floated at different heights by hierarchy; etc. It was eventually deemed too much (“a bit… carnival, don't you think?”) and too battery-intensive. So it's charming to see this concept finally get shipped in visionOS, where UIKit elements seem to get the same depth-y treatments we'd tried in 2012/2013. It's much more natural in the context of a full 3D environment, and the Vision Pro can do a much better job of simulating depth than we'd ever manage with motion sensors.

A second concept rested on the observation that the new interface might be very white, but there are lots of different kinds of white: acrylic, paper, enamel, treated glass, etc. Some of these are “flat”, while others are extremely reactive to the room. If you put certain kinds of acrylic or etched glass in the middle of a table, it picks up color and lighting quality from everything around it. It’s no longer just “white”. So, what if interactive elements were not white but “digital white”—i.e. the material would be somehow dynamic, perhaps interacting visually with its surroundings? For a couple months, in internal builds, we trialled a “shimmer” effect, almost as if the controls were made of a slightly shiny foil with a subtly shifting gloss as you moved the device (again using the motion sensors). We never could really make it live up to the concept: ideally, we wanted the light to interact with your surroundings. visionOS actually does it! They dynamically adapt the control materials to the lighting in your environment and to your relative pose. And interactive elements are conceptually made of a different material which reacts to your gaze with a subtle gloss effect! Timing is everything, I suppose…

---

Only some of the WWDC videos about the Vision Pro have been released so far. I imagine my views will evolve as more information becomes available.

---

[1] Something in the Apple omertà makes me uncomfortable naming my collaborators as I normally would, even as I discuss the project itself. I guess it feels like I’d be implicating them in this “behind-the-scenes” discussion without their consent? Anyway, I want to make clear that I was part of a small team here; these ideas should not be attributed to me.


Fluid practice for fluid understanding

Early, rough thinking, but perhaps useful for others interested in the design of learning environments. Assumes detailed familiarity with memory systems.

When everything goes right, my memory practice totally transforms the experience of diving into a new topic. I’ll feel like I can keep turning a crank, and my understanding will just ratchet up and up, durably and inevitably. But quite a lot of the time, everything doesn’t go right. I’ll find that I’ve ended up with brittle, parochial understanding of a new topic. Review questions will strike me as boring or alienating; I’ll feel like I’m parroting, rather than really understanding.

One way to address this problem is to improve as a practitioner of memory systems. I grow my armory of prompt-writing strategies; I monitor my emotional connection more attentively; I intervene more aggressively when something’s not working. Over-simplifying, the quality of my memory practice depends on the quality of my prompts, and on the processes I’ve put into place for monitoring and revising them over time. If I’ve produced a brittle understanding, that means I need better prompts and prompt-writing processes.

Prompt-writing ability is a useful framing. I’ve improved enormously at writing and maintaining prompts over the years; my memory practice is much more versatile and reliable as a result. Yet it’s also worth thinking about the limitations of this framing. Aspirationally, I’d like a memory practice which not only helps me build and maintain a flexible, reliable understanding, but which actually deepens my understanding over time.

I’ve been wondering: to what extent is achieving that goal about learning to write better prompts? Are individual prompts the best primitives for that task? Generalized flashcards are a convenient representation for declarative knowledge, but I’d like to help people learn complex conceptual topics. If we step back and look at what’s actually necessary to produce that sort of understanding, and at the mechanisms of practice, could we find a better-matched primitive?

Learning from Alex

This exploration is motivated in part by my experiences with Alex, a motivated adult student whom I helped study physics earlier this year[1]. Among other experiments, I wrote hundreds of memory prompts for Alex to help him internalize his readings and tutoring sessions. I learned a great deal by watching him grapple with these concepts both inside and outside his memory system. Being an observer, rather than a learner, really helped: when I’m writing and reviewing my own prompts, I often find it tough to create reflective distance in the moment, to see what’s going on and what’s breaking.

One key issue I saw in Alex’s experience was trouble with transfer. Often, he learned to answer the prompts I’d written, but his understanding was brittle: he struggled to draw on what he’d learned in slightly different circumstances. In my experiments, it seemed that what Alex really needed was much more variation in his practice with each idea.

For some concepts, I tried the standard memory system practice, writing a dozen (or more!) prompts about this one idea, accessing it from many different angles and with many different surface features. This may have produced more flexible understanding, but the impact on review sessions was quite negative: even when prompts are shuffled, it’s obnoxious to answer many variations of the same question in a single review session. Because these variations are so burdensome, I didn’t write them for as many concepts as I probably should have. And from a cognitive perspective, retrieval will be much less effortful after the first question, so the subsequent questions will probably not produce as much reinforcement as they could.

Emotionally and cognitively, I believe it would be much better to “smear” these variations out across many review sessions. Conceptually, I thought of myself as writing many variations on the same Platonic idea, using various strategies to solicit that idea in different ways. These prompts were a single thing in my mind—a “prompt cluster”—but represented as an unrelated set in the memory system, despite the high cognitive overlap.
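
One way to picture this “smearing” is a scheduler that treats the cluster, rather than the individual prompt, as the unit being scheduled, and surfaces a different variation each time the cluster comes due. The sketch below is only a hypothetical illustration, not a description of any existing memory system: `PromptCluster`, `next_variant`, `build_session`, the example prompts, and the simple round-robin rotation are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PromptCluster:
    """A set of prompt variations that all target one underlying idea (hypothetical)."""
    idea: str
    variants: list[str]
    times_reviewed: int = 0  # how many sessions have drawn on this cluster so far

    def next_variant(self) -> str:
        """Return one variation for this session, rotating through the cluster."""
        variant = self.variants[self.times_reviewed % len(self.variants)]
        self.times_reviewed += 1
        return variant


def build_session(due_clusters: list[PromptCluster]) -> list[str]:
    """Pick exactly one variation per due cluster, so no idea repeats within a session."""
    return [cluster.next_variant() for cluster in due_clusters]


# Example prompts are invented for illustration.
flux = PromptCluster(
    idea="electric flux",
    variants=[
        "Define electric flux in your own words.",
        "How does flux change if the surface tilts away from the field?",
        "What is the flux through a closed surface containing no charge?",
    ],
)

for session_number in range(1, 4):
    print(session_number, build_session([flux]))
```

A real scheduler would also need to decide when a cluster is due; the rotation here only shows how the variations of one idea could be spread across sessions instead of stacked within one.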

A related phenomenon: we saw a lot of pattern matching in Alex’s review experience. His comments would often include the telling phrase “this one”—i.e. “Oh, I know this one has a trick where…”. This is probably a sign of trouble for transfer, as we’ve discussed, but it’s also a marker of emotional friction. It feels bad to parrot. It makes you (rightly!) question whether your practice is a good use of time. It can create a feeling of rote drudgery. It seems to make Alex (and me) less likely to pay close attention, not only to that question but sometimes to the ones which follow. I believe that pattern matching creates emotional distance, a dulling of involved participation.

So it’s interesting to consider the design goal: how might we make pattern matching never occur? What if an idea is never accessed in exactly the same way twice? In fact, I have a strong instinct that this is a terrible idea as written. There’s something very powerful about the reflexive stimulus–response pattern that highly entrained memory practice establishes. And the feeling of flow in practice sessions is delicate: you don’t want rote boredom, but you also don’t want meaningless make-work. Still, I’d like to experience these tradeoffs more viscerally.

Alex also needed variation in the scope of the prompts over time. Immediately after reading an explanation, he most needed prompts which acted like a simple reading comprehension check—in many cases, he hadn’t actually understood or absorbed some of the important statements on the page. Then he needed help building a durable memory for key details: terms, notation, equations, relationships, etc. But that’s not enough to actually understand the topic. We’d meet to discuss the concepts and to solve problems together, and I’d write more prompts for him each time, often about progressively higher-level details.

Once the fundamentals were in place, what he really needed was problem-solving practice. The problems were tough for him, so they needed to start quite simple and become more complex. Intermittently we’d retrace some worked examples in great detail. After days of problem-solving practice in one chapter, he’d want to move onto the next chapter to keep motivated, but we needed to ensure that he’d swing back to earlier chapters and continue to solve problems about their ideas over time. At this point, the “reading comprehension” prompts and some of the basic prompts on notation felt burdensome to review, but more complex declarative prompts, like those on equations, still seemed necessary to keep those details fresh.

Eventually, Alex and I got to the point where we were connecting the material we were learning with new research papers he found interesting. It’s conceivable to me that review sessions could be systematically orchestrated to include that sort of activity: he showed me the papers he wanted to understand, and I said things like “oh, once you’ve got some experience with electric flux, we’ll be able to make sense of this figure together.”

Alex’s study sessions needed to consist of different activities at different stages of his learning process. He needed more complex practice activities to really understand the material, but those couldn’t be introduced immediately. Conversely, trivial “bootstrapping” tasks could feel obnoxious when he’d just finished an activity which used those same ideas in much more complex ways. If we want a memory practice to deepen our understanding over time, that will likely mean shifting the distribution of tasks over time to create progressively more complexity, elaboration, and creative opportunity.

In summary, my experience with Alex suggests that to produce fluid understanding, one would benefit from much more fluidity in review. A straightforward flashcard works if you want to remember the value of the electric constant. But to internalize complex conceptual matter, you want to be pushed to draw on each idea from different angles, with different wording, in different modalities, alongside different combinations of other ideas, at different levels of complexity over time.

Review sessions as generic vessel

Part of my hazy thesis here is that much of the potential of a memory practice is not about memory, per se: it’s about the daily sessions. A steady Anki user has designated a ~5-15 minute slice in their day for systematically improving their knowledge and understanding. That’s a powerful vessel. It can be filled with rote recall flashcards, but more experienced memory system users discover that prompts can be used to stimulate a wider range of thought.

My contention is that if you’re trying to learn a complex conceptual topic, we should fill that daily vessel with whatever activities would most effectively and enjoyably produce the understanding we desire. If the material is new, it might be best to reinforce your memory of key details, and ask you to predict the next steps of some partially-worked examples. In the next session, you might be asked to solve some simple problems; in the next session, some different, more complex ones. One week later, you might be offered an alternative derivation of a theorem you’d studied, and some questions relating the two approaches. Perhaps it would be helpful to ask you to synthesize an explanation, or to brainstorm a list of questions you have about the concept. Each session’s contents are completely different, unless some activity seems particularly desirable to repeat verbatim.
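
As a rough sketch of what filling the vessel differently over time might look like, the code below picks activities for a topic based on how many days have passed since it was first studied. Everything here is hypothetical: the stage boundaries, the activity names, and the `activities_for` function are invented for illustration, and a real system would presumably adapt to the learner rather than follow a fixed calendar.

```python
# Hypothetical stages: (minimum days since first study, activities to offer).
# The thresholds are arbitrary placeholders, not empirically derived.
STAGES = [
    (0, ["recall key details", "predict next step of a worked example"]),
    (1, ["solve simple problems"]),
    (2, ["solve more complex problems"]),
    (7, ["compare an alternative derivation", "synthesize an explanation"]),
]


def activities_for(days_since_first_study: int) -> list[str]:
    """Return the activities of the latest stage the learner has reached."""
    if days_since_first_study < 0:
        raise ValueError("days_since_first_study must be non-negative")
    chosen = STAGES[0][1]
    for min_days, activities in STAGES:
        if days_since_first_study >= min_days:
            chosen = activities
    return chosen


for day in (0, 1, 3, 10):
    print(day, activities_for(day))
```

The point of the sketch is only that the session is assembled from the learner's stage with each topic, so its contents change from day to day instead of repeating a fixed deck.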

In some sense, I’m arguing that “memory systems” can be thought of as “practice systems”. You can’t quite orchestrate what I’ve described for yourself with today’s tools. But experienced memory system users have internalized many strategies for filling their practice vessels—that is, for digesting ideas into interesting and effective practice tasks. These strategies are a reflection of several more fundamental (generally implicit) understandings:

  1. a theory of knowledge: what it means to understand something in a particular context;
  2. a theory of learning: how to construct concrete activities which produce and maintain those understandings; and
  3. a theory of the medium: how to instantiate those activities in an item as it will be presented by an existing memory system, and in the context of a review session (i.e. possibly on a phone, interleaved with many other tasks, often without pen and paper, etc.).

As Michael Nielsen has pointed out, it’s tough to directly communicate these strategies or these understandings. But I suspect there is a kind of “pattern language” hiding within these strategies we develop. I've now written memory system prompts for quite a lot of technical material; it would be interesting to work through those prompts and their associated context, looking for "vocabulary" and "grammar".

Prompt primitives for conceptual learning

The flashcard primitive captures declarative knowledge quite directly. If you’re studying Italian and want to remember the word “carciofo”, you can write it on a flashcard and write “artichoke” on the back. If you want to learn anatomy, your flashcard can be an arrow pointing to a part of an illustration. You can learn lead’s atomic number by writing “lead’s atomic number” on a flashcard and “82” on the back. Context makes the task obvious; in each case, the implied task is to answer: “What is this?”

But when you’re learning about the relationship between electric potential and electric potential energy, you can’t put that concept directly into a memory system. It’s multifaceted. There are lots of tasks you should practice to understand that relationship. You should make sure you know the symbolic relationship, sure. But you should also be able to apply that relationship in various scenarios; you should be able to see when it makes sense to use one of these concepts rather than the other; you should see the parallel between this relationship and the relationship between electric field strength and electric force. You’ll want to form new connections to this concept over time, as you learn more. For example, once you’ve been introduced to capacitance, you’ll be able to work with a new facet of this relationship.

One way to model all this is as a large collection of primitive prompts, each reinforcing a different “atom” of knowledge related to that concept. But as we’ve discussed, they’re all so highly related that it would be better to smear them over many sessions. Feedback on their scheduling should be at least somewhat linked. You’d want many of the task details to vary with each presentation. And some of these tasks should be added later, once you’ve digested the earlier ones or learned more.
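As a rough illustration of what "somewhat linked" scheduling feedback could mean, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Facet` type, the `review` function, and the `growth` and `link` parameters are hypothetical names, and the interval arithmetic is a toy rule, not the scheduler of any existing memory system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Facet:
    """One practice task attached to a concept (hypothetical structure)."""
    name: str
    interval_days: float = 1.0


def review(facets: List[Facet], reviewed: str, remembered: bool,
           growth: float = 2.0, link: float = 0.25) -> None:
    """Apply review feedback fully to one facet and partially to its siblings.

    `link` is an assumed coupling strength: 0 means the facets are scheduled
    independently, 1 means they all move together.
    """
    for facet in facets:
        weight = 1.0 if facet.name == reviewed else link
        if remembered:
            # Lengthen the interval; siblings get a fraction of the boost.
            facet.interval_days *= 1.0 + (growth - 1.0) * weight
        else:
            # Shorten the interval, never dropping below one day.
            facet.interval_days = max(1.0, facet.interval_days * (1.0 - 0.5 * weight))


facets = [Facet("symbolic relationship"), Facet("apply in a scenario")]
review(facets, "symbolic relationship", remembered=True)
```

With the defaults above, a successful review of one facet doubles its own interval and nudges the sibling's interval up by a quarter of that boost. Whether any coupling of this kind actually helps is an open question.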

Practically speaking, when I write prompts for a concept like this, the way I’m thinking about it is that there’s a single thing—the concept—and I’m turning it in the light to see it from many angles, to see how it interacts with other objects in the space. My implicit prompt-writing “pattern language” suggests a family of tasks, and as I start pulling on those threads, more patterns suggest themselves. Sometimes it’ll feel like I’m working with the focal concept and another adjacent concept at once; or I’ll feel like I’m zooming in on some facet of the concept so far that the facet itself becomes the object of my attention. But the concept itself retains a sort of integrity in my mind. For the most part, mentally, I’m “putting that concept” into my memory system, in richer and richer ways. The concept is the primitive noun, and the verb (“putting it into my memory system”) is a messy, complex thing which depends on my pattern language and my prior knowledge.

I can put the word “carciofo” directly into my memory system. There’s almost no distance between the object itself (“carciofo = artichoke”) and its representation on a flashcard (“Q: carciofo? A: artichoke”). That flashcard basically is the declarative knowledge atom. That’s its strength as a primitive. Memory systems were developed for learning declarative knowledge, so maybe we shouldn’t be surprised that their core primitive has a natural representational unity with that kind of information. But now we’re trying to stretch memory systems to work well for internalizing complex concepts, too. Reframing “memory systems” as “practice systems” for a moment, is there some other kind of primitive waiting to be created—one which would let me “add the concept” as an elementary operation, in the same way I can “add a vocabulary word” to one of today’s memory systems?

Language models and ideas as primitives

In the past few months, lots of people have been trying to use large language models to automate the creation of memory system prompts from explanatory text. This might be close to tractable now (1, 2), for simple declarative knowledge. But I’m more interested in memory systems for their potential to help me learn complex conceptual material, and in deeply internalizing ideas that are relevant to my creative work. This is a very different task, not obviously amenable to the same kind of direct automation. The language model doesn’t seem to have the pattern language we’ve been discussing—in part because it’s still something being actively discovered at the frontier of the memory system practitioner community.[2]

But if I could externalize that pattern language into something the model understands, maybe we could produce a memory system in which ideas are a primitive, alongside concrete tasks. Here, I’m using the word “idea” very broadly to cover a multifaceted-but-distinctly-coherent element—like the relationship between electric potential energy and electric potential. In such a system, perhaps you could “add a concept” directly, optionally with a comment about the nature of your interest, and the system would use the pattern language to create novel practice activities on the fly in each session.

What does it mean to “add a concept”? What specifically would you be adding? One natural route would be to add a passage from a book, perhaps with some markup designating the central idea within its context. In this world, the memory system library would consist not (or not just) of static flashcards, but rather of a set of marked-up references to texts or personal notes. This would also mean that if a task gave you trouble, you could easily navigate to the writings which inspired it.

This conceptual design isn’t “use a language model to generate prompts, then add them to a library”; it’s “add ideas to a library, then use a model to generate activities dynamically.” Those activities would vary in each session to help you encode a fluid grasp of the concept. They can become more complex to help deepen your understanding over time—for instance, using the pattern language to combine multiple ideas. If the pattern language is concrete enough that it can be reified in the interface, we can give different kinds of activities different visual identities (e.g. problem-solving practice, visualization, generating examples, explanation synthesis) and allow users to quickly flip between alternative activities for a given idea.
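To make the distinction concrete, here is a minimal sketch of a library that stores ideas rather than prompts and produces a different activity for each idea in each session. All names (`Idea`, `Activity`, `IdeaLibrary`, `template_generator`) are hypothetical, and the generator is a plain string template standing in for a language model steered by the pattern language. The four activity kinds are the ones listed above; cycling through them in a fixed order is an arbitrary placeholder for a real selection policy.

```python
from dataclasses import dataclass
from typing import Callable, List

# Activity kinds mentioned in the text; the identifiers are made up.
ACTIVITY_KINDS = [
    "problem_solving",
    "visualization",
    "generating_examples",
    "explanation_synthesis",
]


@dataclass
class Idea:
    source: str              # reference to a book passage or personal note
    highlight: str           # the marked-up central idea within that source
    interest_note: str = ""  # optional comment on why it matters to you
    sessions_seen: int = 0


@dataclass
class Activity:
    idea: Idea               # link back to the writing that inspired the task
    kind: str
    prompt: str


def template_generator(idea: Idea, kind: str) -> str:
    """Stand-in for a model call: builds a prompt from a fixed template."""
    return f"[{kind}] From {idea.source}: {idea.highlight}"


class IdeaLibrary:
    def __init__(self, generate: Callable[[Idea, str], str] = template_generator):
        self.ideas: List[Idea] = []
        self.generate = generate

    def add_idea(self, source: str, highlight: str, interest_note: str = "") -> Idea:
        idea = Idea(source, highlight, interest_note)
        self.ideas.append(idea)
        return idea

    def next_session(self) -> List[Activity]:
        """Generate one fresh activity per idea, varying the kind each session."""
        session = []
        for idea in self.ideas:
            kind = ACTIVITY_KINDS[idea.sessions_seen % len(ACTIVITY_KINDS)]
            session.append(Activity(idea, kind, self.generate(idea, kind)))
            idea.sessions_seen += 1
        return session

    def alternative(self, activity: Activity) -> Activity:
        """Flip to the next kind of activity for the same idea."""
        i = ACTIVITY_KINDS.index(activity.kind)
        kind = ACTIVITY_KINDS[(i + 1) % len(ACTIVITY_KINDS)]
        return Activity(activity.idea, kind, self.generate(activity.idea, kind))


library = IdeaLibrary()
library.add_idea(
    "a physics textbook (hypothetical source)",
    "the relationship between electric potential and electric potential energy",
)
first = library.next_session()
second = library.next_session()
```

The library never stores a prompt: `first` and `second` contain different activities for the same idea, and each activity keeps a reference to its idea so a reader could navigate back to the source passage.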

My claim here is that much more fluid practice will produce much more fluid understanding. But idea-as-primitive is appealing from an interaction design perspective, too[3]. My concrete emotional experience as a memory system user is: I hear or read or think something that excites me. I think, “ah! I want to bring this into my practice!” But I can’t quite bring “this” into my memory practice; I have to find a way to transform it into some other object which I can bring into my practice. Sometimes, that transformation task is a welcome exercise. It brings me closer to the idea that stimulated me, helps me pick it apart and understand it better. But much of the time, it just feels like a burden. I want to emphasize that this impulse is not about efficiency or “lowering the floor.” It’s about making the core verb feel better—making it more naturally aligned with my internal intent and my emotional interest. I think this explains the attraction of cloze deletions: they’re closer to “adding the thing itself.” The trouble is that they don’t really work for complex material, at least not when used so directly.

What next

Language models are shiny, but I don’t think that the best way to explore the ideas I’ve articulated here is through a leap to systematization and automation. I can experiment with “practice systems”, variation, problem-solving, escalation, and “pattern languages” manually, working with another student or perhaps with my own studies. Intimacy and rapid experimentation still feel like the right properties to prioritize for now.

One serious reservation I have about these ideas is: by expanding the scope of memory systems to include such a wide range of practice and learning activities, I dilute the specific powerful idea that they represent (efficient, systematic retrieval practice using a simple primitive). It might be better to let memory systems remain tightly scoped, better to play to their natural strengths. What I’ve been describing is instead closer to an intelligent tutoring system. I have a lot of reservations about those systems, but the last time I grappled with that body of research was years ago when I was at Khan Academy. If I continue down this path I expect I’ll need to return to that literature with fresh eyes, and to draw some clearer delineation between my ideas and those of that field.

————————

[1] Alex’s professional life has been undergoing some significant upheaval, so his studies have been on pause recently. We may resume in time, and/or I may start working with another student.

[2] This relates to a gripe I have about the recent gush of “AI tutors”: they lack a clear theory of instruction. Some of them are told to quiz you with boring problems; others seem simply told to “be a world-class tutor and answer the student’s questions.” What does it mean to be a world-class tutor? That’s hard enough to answer; it seems clear the model doesn’t have a strong opinion. What does it mean in this person’s context, specifically? A better system needs to describe a theory of instruction—what is supposed to happen in the course of a session, what does that mean cognitively, and how should the tutor bring that about, in interaction with the student’s behavior? If there were millions of tokens in the training set with good answers to those questions, the model might not need more steering. But like the memory system pattern language, this sort of knowledge is mostly tacit; where it is written out, accounts differ so widely as to offer little guidance.

[3] There’s an extended discussion of “ideas as primitives” in last October’s letter (“Lessons from summer 2022’s mnemonic medium prototype”), mostly from an interaction design perspective.

AI ethics letter now publicly available

A number of you mentioned to me that you wanted to be able to refer to my AI ethics essay in public conversation. I've revised it and made it publicly available here: https://andymatuschak.org/personal-ai-ethics

In case you haven't read it yet, I've also re-recorded the audio for the revised manuscript.

Ethics of AI-based invention: a personal inquiry

Hofstadter’s Law wryly captures my experience of difficult work: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.” He suggested that law in 1979, alongside some pessimistic observations about chess-playing AI: “…people used to estimate that it would be ten years until a computer (or program) was world champion. But after ten years had passed, it seemed that the day…was still more than ten years away.”

Ironically, my experience observing the last ten years of AI research has been exactly the opposite. The pace has been extraordinary. Each time I’m startled by a new result, I update my expectations of the field’s velocity. Yet somehow, I never seem to update far enough—even when I take that very fact into account. My own ignorance is partly to blame; AI has been a side interest for me. But my subjective experience is of an inverse Hofstadter’s Law.

No surprise, then: GPT-4’s performance truly shocked me. This is a system that can outperform a well-educated teenager at many (most?) short-lived cognition-centric tasks. It’s hard to think about anything else. Inevitably, I now find myself with an ever-growing pile of design ideas for novel AI-powered interfaces. But I’ve also found myself with a gnawing concern: what are my moral responsibilities, as an inventor, when creating new applications of AI models with such rapidly accelerating capabilities?

If today's pace continues, the coming decade’s models are likely to enable extraordinary good: scientific breakthroughs, creative superpowers, aggregate economic leaps. Yet such models also seem very likely to induce prodigious harm—plausibly more than any invention produced in my lifetime. I’m worried about mass job displacement and the resulting social upheaval. I’m worried about misuse: cyberattacks, targeted misinformation and harassment campaigns, concentration and fortification of power, atrocities from “battlefield AI.” I’m worried about a rise in bewildering accidents and subtle injustices, as we hand ever more agency to inscrutable autonomous systems. I’m not certain of any of this, but I don’t need much clairvoyance to be plenty concerned, even without the (also worrying) specter of misaligned superintelligence.

In sum, these systems’ capabilities seem to be growing much more quickly than our ability to understand or cope with them. I wouldn’t feel comfortable working on AI capabilities directly today. But I’m not an AI researcher; I’m not training super-powerful models myself. So until recently, the harms I’ve mentioned have been abstract concerns. Now, though, my mind is dreaming up new kinds of software built atop these models. That makes me a moral actor here.

If I worry that our current pace is reckless, then I shouldn’t accelerate that pace by my own actions. More broadly, if I think these models will induce so much harm—perhaps alongside still greater good!—then do I really want to bring them into my creative practice? Does that make me party to something essentially noxious, sullying? Under what circumstances? Concretely: I have some ideas for novel reading interfaces that use large language models as an implementation detail. What moral considerations should guide my conduct, in development and in publication? What sorts of projects should I avoid altogether? “All of them”?

One trouble here is that I can’t endorse any fixed moral system. I’m not a utilitarian, or a Christian, or a neo-Aristotelian. That would make things simpler. Unfortunately, I’m more aligned with John Dewey’s pragmatic ethics: there is no complete moral framework, but there are lots of useful moral ideas and perceptions. We have to figure things out as we go, in context, collaboratively, iteratively, taking into account many (possibly conflicting) value judgments.

In that spirit, this essay will mine a range of moral traditions for insight about my quandary. There’s plenty I dislike in each philosophy, so I’ll make this a moral buffet, focusing on the elements I find helpful and blithely ignoring the rest. And I’ve skipped many traditions which were less instructive for me. I’m not an expert in moral philosophy; I’ll be aiming for usefulness rather than technical accuracy in my discussion.

Before we begin, let me emphasize that this is a personal moral inquiry. This essay explores how I ought to act; it does not assert how you ought to act. That said, I do have one “ought” for you: if you’re a technologist, this is a serious moral problem which you should consider quite carefully. Most of the time, in most situations, I don’t think we need to engage in elaborate moral deliberation. Our instincts are generally fine, and most ethical codes agree in everyday circumstances. But AI is a much thornier terrain. The potential impacts (good and ill) are enormous; reasoning about them is difficult; there’s irreducible uncertainty; moral traditions conflict or offer little guidance. Making matters worse, motivated reasoning is far too easy and already far too pervasive—the social and economic incentives to accelerate are enormous. I think “default” behaviors here are likely to produce significant harm. My reflections here are confused and imperfect, but I hope they will help inspire your own deliberation.

Utilitarianism

Let’s warm up with a familiar moral tradition: the utilitarianism which surrounds me in San Francisco. When I tell people here about my moral confusion, I’m usually met with bewilderment. For most utilitarians, my problem seems straightforward. Add up the benefits; subtract the costs. As one person told me somewhat sheepishly: “Listen, Andy… you’re just not that important! Things are already moving so quickly that any acceleration you cause will be imperceptible. A speculative reading interface seems harmless.”

I’m not deluded. I think that assessment is basically right, in terms of my direct counterfactual impact. I also think that utilitarianism often produces terrible conclusions. But even utilitarianism has more to contribute than this.

Popularization, normalization, dissemination, investment

Trends, fashions, and flashy exemplars push around the computing world. Plenty of young technologists and designers look up to me. It’s easy to imagine myself expanding a young person’s understanding of how AI can be used to design interfaces, shifting their career to emphasize AI-based invention. I’ve already had that kind of influence in my prior work. Likewise, my work has inspired lots of copycats. Those copycats are actually part of my theory of change: I depend on others to productize and scale my research. But I certainly don’t expect a startup to adopt my ethics. Copycats also mean attracting more funding into the AI space. Venture capital is, by nature, an almost pure force of acceleration; few investors seem influenced by any ethical considerations of AI. All these indirect impacts add up to more counterfactual harm than might have seemed obvious.

I can damp my impact a little: zero hype; anti-marketing; make the work anti-flashy; focus on the conceptual design rather than the AI components; be proactive in public writing about harms; be reticent about capabilities. Still, I can’t fully mitigate my influence. There’s unavoidable cost here. So even utilitarianism isn’t as permissive as it initially seemed: any of my projects must pass some strong “benefit” bar to be worth pursuing, at least in public. I commit to not sharing AI-based work in progress until I feel confident it passes that bar.

I’ll also set a high bar for amplifying others’ AI-based tinkering; I won’t gush about exciting AI papers on social media. And I’ll set an even higher bar for associating myself with AI in public events or groups, for instance by being some kind of featured guest. These constraints may sound a bit frivolous—oh no, my follower count!—but they come with real costs: I use social media and public conversation to deepen my emotional connection with ideas, and to incrementally develop my thoughts. I’ll need to use private conversation for those purposes in this domain.

Economic dislocation

In terms of misuse and accidents, a user interface for reading seems innocuous. So far, we’ve asked about the ethics of specific acts: what would happen if I did this one design project? That kind of utilitarianism often produces reasoning like “I’m just one small drop in a big ocean.” In such cases it can be helpful to rephrase the question as a general rule, and to ask: what would happen if everybody obeyed it?

Let’s try this rule: “So long as you’re making something in an ostensibly low-risk domain (like reading), and the concept seems highly beneficial, AI-based interface invention is fine.”

If everyone followed this rule, we seem pretty likely to displace jobs at a startling, unprecedented rate over the coming decade. That’s a lot of suffering for our utilitarian calculus—particularly if the economic dislocation produces dangerous social unrest. A utilitarian might argue that this rule’s benefits would outweigh such costs. I think it’s quite hard to tell.

Yet, consider the inverse rule: “Do not invent tools which displace jobs.” Written so broadly, the cost seems far too high. I’m glad that the steam engine exists. I’m glad that the electric motor exists. I’m glad that the personal computer exists. How should I think about this, if only from a utilitarian perspective?

Some kinds of displacement seem to cause less suffering than others. One factor is clearly time. An invention adopted over a generation will cause much less disruption than one adopted overnight. Some people will retire unaffected, or a little early; others will slowly shift into other work; others will have enough warning to avoid that career before they begin it.

Another factor lies in the relationship between the displaced jobs and new jobs created by the invention. Desktop publishing displaced many jobs in the typesetting industry, but that knowledge might have transitioned gracefully to new roles in digital graphic design. On the other hand, I expect robotics in manufacturing has been less kind to most assembly line workers. Large language model-based customer support bots will probably create new jobs, but not ones which most existing customer service representatives can easily access. And, in these last two cases, I’d guess that the invention creates fewer new jobs than it displaces in those industries.

When displacement comes with greater economic productivity and higher aggregate incomes, we should see increased labor demand and new jobs in service sectors. This probably occurred during the era of mass production at the start of the twentieth century. Mass production sharply decreased middle-class families’ material costs, and many likely found themselves with much more disposable income. By contrast, automated customer support phone systems probably replaced countless human operators with minimal impact on aggregate income.

Finally, what kind of social safety nets are available for these displaced workers? If utilitarians decide that the benefit to society from economic productivity justifies the suffering from job displacement, how will we as a society help those harmed? Social technologies like unemployment insurance and universal basic income shift the utilitarian equilibrium here.

In summary, a good rule will depend on a pile of highly contingent parameters. Like: “Do not invent tools—even in a low-risk domain, even highly beneficial ones—which are likely to displace jobs much more rapidly than retraining, service sector demand, and social safety nets are likely to accommodate within a relatively short time.” The trouble is that this rule is impossible to evaluate, both in principle and for most particular instances. It requires clairvoyance, and interpretation of nebulous words like "accommodate." (This isn’t just a problem for AI impacts—it’s a problem with utilitarianism and analytical moral systems in general.)

Some personal conclusions; caution around job-displacing AI

I can still draw a few personal conclusions from that mess of a rule. Given the extraordinary current pace around AI, I’m nowhere near confident that the constraints in that rule are satisfied. So: I won’t work on any AI application which will plausibly cause meaningful direct job displacement, until social safety nets seem likely to become much stronger, or until some other regulation or analysis eases my concerns of widespread economic upheaval.

I bite this bullet: I wouldn’t work on self-driving cars at the moment. I don’t like that conclusion, given the auto accident death toll. But this doesn’t mean I accept that we never get autonomous vehicles. It means that I want more analysis, or a policy change, or something which will make me less worried about rapidly displacing 2-3% of the US workforce amid a wave of other AI-driven unemployment.

I often hear utilitarian AI arguments like: “The printing press caused wars and atrocities, but surely you’re glad we have it.” But we’re in a very different moral situation. First, I don’t expect that Gutenberg and his contemporaries had any idea of the suffering they would unleash. Second, even if they did, I can’t imagine what action I would have preferred they take. By contrast, in the case of autonomous vehicles, we know (some of) the harm we’ll cause, and there are plausible ways to mitigate that harm.

For example, these vehicles are expected to cause huge economic productivity gains; so let’s tax them, and create pensions and programs for displaced workers. There’s a halo of other revenue sources: maybe we’ll tax insurers a fraction of what they’ll save paying medical bills associated with accidents. Shape the taxes to phase in as the technology is adopted, and out after a generation. Yes, we’d overtax in some instances and under-compensate some of those affected. I see that as probably fine. Something half as ambitious as this would probably be fine; lots of other creative solutions would probably be fine. “Let’s wait and see” seems less fine. There’s essential uncertainty, but the finance world is used to that; clever risk-mitigating instruments and if-then regulation can soften the edges here.

Look: I want to live in a society where as much meaningless work as possible is automated, and where, as a result, people can spend most of their time doing whatever they find meaningful. But there’s a great deal of path dependence in getting to that world. We’re on a path with very high costs, and I think we can find our way to a much better one. Until then, I’m comfortable with a policy which would defer the benefits of my contributions to projects like autonomous vehicles.

We run into similar tensions with rules like “Don’t invent anything which could be turned into a weapon.” Wouldn’t that forbid much high-energy applied physics work, and much synthetic biology? “Don’t invent anything which could lead to endangering the human species.” Wouldn’t that forbid nanotechnology research? Like the unemployment rule, these rules can only be rescued with a litany of complicated parameters. These rules are much further from any project I’m contemplating, so I’ll simply leave them in a broken state, and commit to not touching any project along those lines for now.

Christianity

“Love thy neighbor as thyself” is pretty great moral advice. In this formulation, it’s not just a statement about how you should act. It’s a statement about how you should feel. That’s important because the classic formulation of the Golden Rule—“do unto others as you would have them do unto you”—doesn’t seem to constrain my actions in this space very much. I’m in a relative position of power. I’m quite happy to have large swaths of my work automated away; I’m confident that I can find something else to do. I don’t mind if my work is co-opted in a commercial data set. And so on.

But, no—love thy neighbors. One thing that bothers me about much discussion around the harms of AI is that it’s easy to treat people as abstractions. Which regulations should we adopt? How should we fix situations where the model gives undesirable outputs? Which people should we allow to use the model, and under what terms?

“Love thy neighbor” pushes away from abstractions, and towards the particular. It says: no amount of utilitarian number-crunching justifies cruel indifference. It makes me connect with individual lives. One practical consequence is that I’m now actively collecting individuals’ stories of AI-driven job displacement and misuse. When reading stories like this one about an artist who feels their work has become much less meaningful, I can easily summon love for that person. What effect does that have on my actions? I can’t produce a systematic rule, but the feeling profoundly shapes the way I think about embarking (or not) on projects in this space. Positive personal stories of impact also provide a helpful influence: the point of all this work, in my mind, is to create flourishing.

More contemporarily, this is something I like about the “ethics of care” proposed by Carol Gilligan: moral philosophy tends to focus on duty to abstract rules, or to theoretical people. But so much of what we actually find good in the world depends on individual people's relationships, their attention and care for particular other individuals they love. Part of the problem with making scalable software systems is that it automatically puts me in a stance of abstraction, of de-particularizing. Maybe this observation should push me toward Robin Sloan's notion of software as a home-cooked meal: I'm not very concerned about doing harm with an AI-based invention I create for the sole use of three friends.

Insofar as Christianity considers groups of people, rather than individuals, it focuses our attention on the least fortunate. Classic formulations of utilitarianism sum over all people, a kind of egalitarianism. But I’m sympathetic to Christianity’s emphasis here. So even when I do utilitarian calculations about the costs of my projects, I commit to weighting the least fortunate.

Another nice Christian maxim is: “Speak up for those who cannot speak for themselves”. Decisions about AI-based systems are mostly getting made by a small group of technologists and venture capitalists. I feel a moral impulse to represent the views and interests of the people who aren’t present. This essay is one small example of that.

Rounding out our Christian buffet: “If your right hand causes you to sin, cut it off and throw it away! It is better to lose one of your members than to have your whole body go into hell.” My AI-directed creative impulses are my right hand, here. If I find that those impulses are leading me to make moral decisions that I regret, I should just ditch them. Christianity emphasizes personal sacrifice for the good of others. It sometimes asks for more than I would endorse, but I commit to this particular sacrifice if it seems at all requisite.

Buddhism

Traditional Buddhism might not really have an ethical system, but I still find its ideas quite helpful in this deliberation. For example, in the Buddhist tradition, there are three “poisons” which keep us trapped in suffering: attachment, aversion, and ignorance.

Attachment is a hungry, clinging desire. In the present dilemma, I struggle with attachment to achieving, to “producing output”, to others’ validation (“wow, amazing project!”), to being perceived as competent and innovative, to novelty. These sorts of attachments can never really be sated, and they motivate “unwholesome” action.

Buddhism’s proposed antidote to attachment is to free myself from these cravings: notice them as percepts, without identifying with them; then act from equanimity. That’s the project of a lifetime, but I find it quite a useful frame when thinking about potential AI-based work. It’s easy to notice: oh, yes, I’m drawn to that idea in this moment because I want approval. And as soon as I pay attention to it in that way, the hunger loses much of its power.

Aversion is an impulsive negative reaction to painful or unpleasant things, a flinching away that gives rise to fear, resentment, and anger. Here, I feel aversion around “falling behind”, being “stuck” in my work, being perceived as “soft” or “obsolete”. I’ll confess that I also feel aversion to constraining my work at all—the whole project of this essay—alongside aversion to being judged as morally “bad”.

One proposed antidote to aversion is “non-aversion”, which, like non-attachment, involves cultivating perception and equanimity through mindfulness. It’s pretty easy for me to notice aversion’s hand guiding my impulses around AI-based design when I pay attention. Another effective antidote is “loving-kindness”. This is like Christianity’s “love thy neighbor”, but bigger, embodied: love everyone; love yourself; love every living thing; stoke that feeling viscerally and bodily; cultivate an earnest wish for all to experience happiness and freedom from suffering. It’s a joyful feeling, and it absolutely keeps aversion at bay.

The third poison, ignorance, refers to Buddhism’s claims about the nature of reality. These ideas do bear on moral questions, but I’ll skip them to keep us from getting too deep into “philosophy seminar” territory.

The “poisons” aren’t really a virtue ethic, as I understand them. The claim isn’t that an action is righteous if and only if it’s done without attachment, aversion, or ignorance. But paying attention to these ideas has helped me think much more clearly as I consider my AI-based projects, and, I think, produce more ethical conclusions. A simple way to think about these poisons is: they’re a lens which reveals places where I’m acting from selfishness.

Tantra, via David Chapman

I only understand the Tantric tradition of Buddhism second-hand, through David Chapman. But I understand it to suggest another consideration: my creative impulses are important, morally! By “impulses”, I don’t mean my grasping attachment-based urges to “produce”, but the wide-eyed, curious, playful excitement for creation. To ignore or unilaterally flatten these impulses is to engage in a kind of self-destruction. No, this doesn’t give my creative spirit unlimited license—but it’s a legitimate party to the moral deliberation.

Chapman’s interpretation is that “being ‘morally correct’ in an ordinary, unimaginative, conformist way may be an excuse for avoiding the scary possibility of extraordinary goodness, or greatness.” He describes the higher aspiration as nobility: “the aspiration to manifest glory for the benefit of others.” I think this is very beautiful, though words like “glory” and “benefit” must do a lot of work to guide appropriate action in difficult situations like the ones we’re discussing here.

Aristotelianism

For Aristotle, the ethical thing to do in a given situation is what a supremely virtuous person would do. And he proposes that a virtuous person is one who has found a happy mean between excess and deficiency; for example, gentleness (the mean), rather than irritability (excess) or servility (deficiency).

I want to mention Aristotle’s virtue of courage, rather than rashness (excess) or cowardice (deficiency). For a while, I felt so confused and overwhelmed by the ethics of this situation—so afraid that I’d make a harmful choice—that I felt like running away from the problem. Just throwing up my hands and having nothing to do with AI. But if I make that decision, I don’t want to do it out of fear or overwhelm; I want to make that choice explicitly, courageously. This is an ambiguous, unknowable situation. I’m not going to be able to reason my way to certainty. I need to take a stand.

Confucianism

One important idea I take from the Analects is that society itself has moral patienthood. This notion is surprisingly absent in most other ethical frameworks. I don’t endorse Confucianism’s proposed resolution—that we should maintain harmony by fulfilling our “natural social roles”—but I do think AI threatens society.

Threatening an evil tyrant can be just, so it can be just to threaten an evil society. But I don’t necessarily accept the premise that our society merits that threat, and even if I did, it’s far from clear that AI will reform society in a corrective direction. If I think of society as a person, what might it mean to love it as a neighbor? To want it to grow, sure—but to despise its suffering?

How does this play out for my proposed reading user interface? I really don’t know. I think it mostly pushes around my utilitarian calculus, and makes me apply the precautionary principle somewhat more strongly to potential harms.

John Dewey’s pragmatism

I first encountered John Dewey through his writing on education reform, but he’s also written some of my favorite moral philosophy. He argues that in a modern, dynamic society, there can be no fixed ethical code. Rather than searching for some kind of ultimate answer, we should focus on finding ways to improve our moral judgments. Then we can apply those methods iteratively. His proposed methods are rooted in democratic deliberation. We should draw together the value judgments and experiences of those affected by a decision, ensure that feedback will flow to decision-makers, and make moral decision-makers accountable to those they affect.

We’re far from these democratic ideals in the creation and deployment of AI-based systems. People affected by these systems have effectively no voice or recourse. Dewey emphasizes continuous democratic involvement through constant social interaction; instead, we have a small, insular group, mostly in San Francisco, making decisions which affect all.

I’d like to experiment with ways to make my work in this space better embody Dewey’s democratic ideals. Before deploying any AI-based systems of my own, I’ll create some channel for public participation in that decision. The channel will remain open, so that I can learn from feedback, and so that I can un-deploy the system if the public decides that’s the right course. I really don’t know what the appropriate details might be here, but I expect to develop them iteratively, in public.

Sprinkling the word “democratic” here doesn’t guarantee that my work won’t do harm. One problem for Dewey’s philosophy, particularly for AI, is that it emphasizes experimentation and learning from experience. But with sufficiently powerful systems, a single iteration can do tremendous damage. I see this lens as one piece of a larger approach to moral discovery, and not one I’d emphasize when a decision has large, irreversible consequences.

Democracy also requires informed participation. If people don’t understand what these systems are, or what they can do—both for good and for ill—it’ll be hard to involve their views in the moral deliberations. I’d like to find a way to help here, perhaps by making some good explanatory media.

Phenomenology

I’ll admit: I’ve struggled to deeply grasp Husserl and his successors. But I’m willing to mischaracterize phenomenology if it produces some insight that seems helpful. Applying its ideas here, my understanding is that I should ask: what is it like to be the one making this ethical decision? What are the qualities of that experience? Do I feel like I’m trying to “get away with” something? Do I feel a sense of pride and capacity? Those feelings, in those circumstances, contain real, meaningful moral cues.

A month ago, the prospect of working on an AI-based design of any kind made me feel internally quite contorted. After much deliberation and the tentative commitments I’ve outlined in this essay, I notice that my internal experience is much more sedate. I still feel the nagging hint of motivated reasoning, but it’s subtler, and a sense of generosity to myself and others is more central.

Positive moral obligations for helping with AI impacts?

Most of this essay has focused on moral constraint: actions I must avoid taking, motives I must avoid having. I know many people in this space who have reasoned themselves into moral obligation. If AI could cause such tremendous harm, and they could conceivably help avert that, they feel a duty to contribute. If I’m so worried, why aren’t I shifting my research focus to directly mitigate AI impacts?

I’m instinctively wary of ethics which create strong positive obligations to act. There are lots of meaningful things I could be doing. I feel I should spend my time working on something useful and good, which also aligns well with my interests and capabilities. I’ve been watching “AI safety” and its adjacent fields from the sidelines for years, and I haven’t yet spotted any opportunities which check those boxes. But it’s also not reasonable to expect those ideas to fall into my lap. If any are to be found, they’ll come from tinkering, reading, and discussion. I’ve ramped up the time I’m spending on such things, though still as a secondary activity for now.

One challenge here is that most “AI safety”-adjacent projects are more ambiguous, morally, than they might seem. Efforts to democratize and open-source models might help avoid oligarchic concentration, but they also exacerbate misuse threats. Technical projects which improve our understanding or control over large models also seem likely to accelerate those models’ capabilities. For example, RLHF was invented to help align large models with human preferences, but it also unlocked much more capable models. In the end I may find myself contributing to one of these projects, even if it might do some harm, but I want to highlight the issue: “AI safety” vs. “AI capabilities” is a pervasive and misleading dichotomy.

Conclusion

Where does all this leave me? Friends aware of my moral deliberations have asked for my “take-aways.” Well, one of the take-aways is that this issue does not compress neatly into take-aways. There is no systematic conclusion. I won’t abstain completely from inventing AI-based systems, but I’ve limited myself to a pretty narrow subset. I think my speculative AI-based reading interface is OK for now, with numerous caveats.

I’ve made some initial commitments:

  • I won’t publish or deploy AI-based systems unless I feel they’re likely to be of significant social benefit.
  • I won’t uncritically amplify others’ AI-based systems or their capabilities on social networks.
  • I won’t work on any AI-based system likely to cause meaningful direct job displacement, or which could be directly weaponized, or which could produce disastrous accidents.
  • I’m collecting a public list of personal stories of AI impact.
  • I’ll experiment with some channel of democratic participation for my first AI-related project’s dissemination.

And some broader resolutions:

  • I’ll bar myself from AI-related work if I believe it’s morally corroding me.
  • In dissemination: zero hype; anti-marketing; make the work anti-flashy; focus on the conceptual design rather than the AI components.
  • When using a utilitarian lens, weight impacts on the less fortunate.
  • I’ll cultivate awareness of my attachments and fears around projects in this space, and avoid actions which seem driven by those distortions.
  • I’ll ramp up my reading and conversations around AI impact mitigation projects, looking for opportunities to contribute.
  • I’ll also be looking for ways to contribute to public understanding of AI systems and their impacts, perhaps through some explanatory media project.

These points are all tentative: the goal, for me, has been to find a solid starting point for iteration and experimentation. This space is so dynamic that I’m sure my views will evolve rapidly as events unfold, and as I learn more.

————————

I’d like to thank Avital Balwit, Ben Reinhardt, Catherine Olsson, Danny Hernandez, Joe Edelman, Leopold Aschenbrenner, Matthew Siu, Nicky Case, Sara LaHue, and Zvi Mowshowitz for helpful discussion; and particularly David Chapman, Jeremy Howard, and Michael Nielsen for conversations which have substantially shaped my views here. None of these people should be understood to endorse my positions.


Recording of office hours from 2023-04-23

We discussed: conceptual understanding through connections; conjecture about metrics for feedback on memory system expertise; feedback on a novel language learning interface; feedback on a novel quantified self data collection interface.

It was nice to discuss work in progress!

(Please don't share the link to this video.)


Office hours / Q&A tomorrow, Sunday 04/23, 9AM PDT

Hi, all. It's been a while since I've hosted an office hours / Q&A, so I thought I'd offer one tomorrow [GCal]. Let's discuss your projects or any other questions you might have. Join via Google Meet here.

[Edit: recording here]

Repeating past details on office hours:

  • At least for now, there are no reservations. Just show up; we'll form a queue.
  • In the fashion of academic office hours, eavesdropping is encouraged. You may have to wait a while to ask your question, but listening in on others' questions may turn out to be more valuable than whatever motivated you to attend, anyway. Likewise, feel free to chime in if you have thoughts on a question someone else brings—just be graceful in sharing airtime.
  • To make sure everyone gets a chance to participate, I'll probably cut off any one line of discussion at a maximum of about ten minutes. If it feels we've gotten the most out of a topic after just a few minutes, I may switch us up sooner. Take that as a sign of success, rather than a critical judgment!
  • Rough work and ill-specified questions are very welcome. Several people have told me that they're waiting to show me design work until it's more polished. Honestly: that's silly!


Memory systems and problem-solving practice

One problem with most discussion around memory systems is: the real goal isn’t to remember answers on flashcards; it’s to expand your capacity to think and act in the world. Sure, your app says you can remember this set of cards for months. But what does that mean in terms of what you can do, thoughts you can think? The connection is far from clear. If our goal is to produce real-world capacity, rather than rote recall, how should memory systems be used, designed, redefined? I’ve wanted to dig into these questions for years, but I’ve found it quite tough to establish an effective experimental context. Happily, this month, I’ve been able to watch closely as a student struggles to transfer knowledge from his memory system practice to complex problem solving.

I’ve been acting as a “personal learning assistant” for “Alex”, an adult learner studying physics in service of a meaningful project. We talk every day or two about problems and progress in his learning journey; I listen in on his tutoring sessions; I coach him on handling challenges which arise; and I prototype interventions which might help. This month, Alex has been studying electrostatics from a classic textbook by Young and Freedman. I’ve written memory prompts to reinforce the content—about 70 for each chapter we’ve worked through, covering the material on declarative, procedural, and conceptual levels.

Now here’s the trouble. After Alex read each chapter and completed a few rounds of memory practice, he still found the book’s exercises very difficult. I’m sure the memory practice helped: he’s able to recall long equations and definitions from memory when solving problems. Yet detailed memory practice wasn’t enough, on its own, to let him solve complex practice problems independently. He got stuck and made significant errors.

Then, after many hours of problem-solving, Alex found that the answers came more easily. Something important happened when solving those exercises, above and beyond what occurred when he read the textbook and did detailed memory practice. It certainly wasn’t cheap: I’d guess he spent at least four times as long solving problems as he did reading and reviewing the text.

On some level, this isn’t surprising. Sure, of course, you can’t just read about a topic. You can’t even just answer lots of questions about a topic. You have to do a topic. Fine. But what’s happening, cognitively, during that problem set? Can we cause those changes more effectively or efficiently, e.g. through some kind of targeted practice? What are the implications for topics which don’t come ready-made with problem sets? And: how can we ensure that whatever insights are acquired during this period are retained, like the other material reinforced through memory practice?

Transfer-appropriate processing

For one set of answers, we can look to a theory called “transfer-appropriate processing”, which suggests that our ability to remember information depends in part on how well the processing involved in encoding matches what's involved in retrieval. It's not necessarily enough to just practice recalling some information: practice should require processing that information in the same way you expect to be processing it when you want to use it later.

I saw something like this firsthand with Alex. He could fluidly explain to me how electric fields relate to electric forces, but he struggled to apply that knowledge in a problem where he needed to find the force that a given field would exert on a given charge. Once I demonstrated how he could do that, he quickly saw the connection to the conceptual explanation he’d just given. But that connection was a separate thing from the explanation itself—not one and the same as the explanation.

Cognitive psychologist Garrett O’Day recently ran a relevant series of experiments. He aimed to explore the impact of transfer-appropriate processing on retrieval practice, in the context of complex problem-solving. In one experiment, he gave undergraduates a brief tutorial on Poisson processes. The “practice” group was tested on recalling the procedural steps to solving a kind of Poisson process problem. The “control” group spent the same amount of time reading more worked examples. Then they were both asked to solve problems like the ones they’d been studying. This is the sort of situation which would normally produce a “testing effect”—active retrieval usually produces better performance than passive re-reading. But it didn’t. The groups performed about the same—poorly. O’Day ran a follow-up experiment in which the “practice” group repeatedly solved similar practice problems with feedback, rather than just recalling the procedural steps. This time, on a post-test one week later, the practice group substantially prevailed.

Studying worked examples wasn’t enough; rehearsing the procedural steps wasn’t enough. Both of those groups performed poorly on a post-test. O’Day’s experiments suggest that to become good at solving problems, you need to practice solving problems, ideally with feedback. Theoretically, transfer-appropriate processing suggests that you don’t need to solve problems, per se; you just need to practice doing something which involves a similar kind of cognitive processing. I’m not immediately sure what that would be. Maybe it would be enough to set up a problem or to take a single “step” in a problem-solving sequence. I’m not yet aware of any experiments along those lines.

Schema acquisition: building problem-solving flexibility

O’Day’s findings highlight another challenge. When test problems were similar to practice problems, practice produced good performance. But when the test problems required small changes to the procedure, performance fell. The gains of practice didn’t transfer.

Cognitive psychologists Yeo and Fazio observed a similar result across several experiments using similar materials. But they observed something else of interest: in an experiment where students struggled with the practice problems (< 50% correct), students were better off studying worked examples, rather than solving practice problems. Then, when the materials were changed in later experiments to produce better problem-solving performance during the learning phase, the testing effect returned.

Yeo and Fazio suggest that we’re really observing multiple interacting processes. On the one hand, students need to move what they’re learning into long-term memory and build fluency. Practice and testing support that goal. But to tackle the transfer problems on the test, students also need to generalize what they’re learning. The literature calls this “schema induction”. The claim is that when experts solve problems, they lean heavily on schemas, structures “that permit problem solvers to categorize a problem as one which allows certain moves for solution”. To build flexible problem-solving capacity, you need to acquire these schemas. That’s often done through induction—noticing patterns across a set of problems, noticing that a particular set of moves seems to help.

Another cognitive psychologist, John Sweller, suggests that problem-solving practice poses an important trade-off. Difficult problems may create stronger memory encodings, but they also tax your working memory, creating “cognitive load”. You may end up with little remaining capacity for noticing patterns in problem structure, “acquiring schemas”.

In Yeo and Fazio’s first experiment, the problems were quite difficult, and students were better off studying worked examples than solving practice problems. The authors suggest that’s because students experienced less cognitive load when reading worked examples than when solving problems. And so those readers would have spare capacity for schema acquisition, though the penalty was that they forgot what they learned more quickly.

In their second experiment, Yeo and Fazio made the practice problems easier by making the problems’ surface features identical (only the numbers were changed). That helped: students who practiced solving problems were better off than students who studied example problems. But their transfer performance was poor, which makes sense because they practiced a bunch of identical problems. Students built durable memory of what was involved in the problems they practiced, but they didn’t acquire general schemas.

So, in a final experiment, Yeo and Fazio made the practice problems easier, but varied their surface structure. Students’ transfer performance improved enough that it no longer exhibited a statistically significant difference with performance on identical problems. The authors suggest that this experiment’s practice problems were varied enough to generalize over, and the cognitive load was light enough that schema acquisition was possible. Unfortunately, their experiments weren’t really designed to test this particular hypothesis, so we’re left making cross-experiment comparisons for now.

The rough implication here matches common sense. Schema acquisition and memory are at least somewhat independent. You can notice patterns, but fail to remember them durably; you can reinforce your memory of problem-solving, but in a brittle fashion which won’t transfer to unfamiliar problems. Cognitive psychologists often talk of desirable difficulties—productive struggle which induces more complex processing—but students do face a real possibility of undesirable difficulties. If you want both flexibility and durable fluency, then you should practice, but along a gentle slope. You don't want to overload your working memory so much that your mind can’t reorganize its representations of the relevant concepts. And, when you’re getting started, that will often mean that you’re better off carefully studying a worked example, rather than solving a problem yourself. To put it too reductively (because memory is of course involved in building flexibility), you want to be able to solve problems in the first place, and then you want to build fluency so that you can solve them for the long term.

None of the papers I’ve mentioned has anything to say about maintenance—that is, keeping this kind of transferable problem-solving performance durably accessible for a long time. My guess would be that once these flexible schemas are acquired, you could reinforce them through ordinary distributed retrieval practice. As we discussed in the last section, you’d want to actually solve problems, rather than just retrieve the procedure. Maybe you could set up problems with “friendly numbers” so that they could always be readily solved in your head. And to reinforce those flexible schemas, you’d want the problem to vary each time. One could probably use a language model to generate pretty good variations of that kind, but it’s probably not necessary to have a truly bottomless pool: if practice is widely distributed, you could probably get away with practicing a small handful of structural variations, particularly if the numbers were randomly generated.
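To make the "small handful of structural variations with random, friendly numbers" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the three templates, the value lists, and names like `next_problem` are hypothetical, and the single schema used (force on a charge in a uniform field, F = qE) is just a convenient example. Units are chosen so that microcoulombs times N/C gives micronewtons, which keeps every answer a whole number that can be worked out mentally.

```python
import random
from dataclasses import dataclass

# Hypothetical "friendly" values: small whole numbers, so answers can be
# computed in your head. Charge is in microcoulombs and field in N/C, so
# force = charge * field comes out directly in micronewtons.
CHARGES_UC = [1, 2, 4, 5]
FIELDS_N_PER_C = [100, 200, 500]


@dataclass
class Problem:
    text: str
    answer: int


def force_from_field(rng: random.Random) -> Problem:
    """Variation 1: given charge and field, find the force."""
    q, e = rng.choice(CHARGES_UC), rng.choice(FIELDS_N_PER_C)
    return Problem(
        f"A {q} microcoulomb charge sits in a uniform {e} N/C field. "
        "What force does it feel, in micronewtons?",
        q * e,
    )


def field_from_force(rng: random.Random) -> Problem:
    """Variation 2: given charge and force, find the field."""
    q, e = rng.choice(CHARGES_UC), rng.choice(FIELDS_N_PER_C)
    return Problem(
        f"A {q} microcoulomb charge feels a force of {q * e} micronewtons. "
        "How strong is the field, in N/C?",
        e,
    )


def charge_from_force(rng: random.Random) -> Problem:
    """Variation 3: given field and force, find the charge."""
    q, e = rng.choice(CHARGES_UC), rng.choice(FIELDS_N_PER_C)
    return Problem(
        f"A uniform {e} N/C field pushes on a charge with {q * e} micronewtons. "
        "How large is the charge, in microcoulombs?",
        q,
    )


VARIATIONS = [force_from_field, field_from_force, charge_from_force]


def next_problem(rng: random.Random) -> Problem:
    """Pick a structural variation at random, then fill in random numbers."""
    return rng.choice(VARIATIONS)(rng)


if __name__ == "__main__":
    problem = next_problem(random.Random())
    print(problem.text)
    print("Answer:", problem.answer)
```

A scheduler could call `next_problem` at each review, so the same schema is exercised with a different surface form and different numbers each time.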

It’s worth noting that at this point, we’re approaching the territory of intelligent tutoring systems (ITS), another branch of educational technology research. Heavily inspired by the same theory of cognitive load, these systems are laser-focused on finding the smoothest problem-solving slope to any desired destination. They’re usually less concerned with flexibility and durability as goals, but it’s interesting to consider how memory systems might be adapted with ITS techniques, or how ITS might be adapted to support flexibility and durability.

Incomplete absorption

In these past two sections, I’ve begun to address some of what’s happening, cognitively, during problem sets, and how we might make some of that activity more effective. But there’s another simple explanation for some of the struggle I’ve been seeing: Alex hadn’t actually understood parts of what the textbook was saying, and he hadn’t noticed.

Unfortunately, the exercises didn’t clearly reveal those holes. Problems relying on missing pieces just felt confusing and difficult, in a diffuse and undirected way. But in think-aloud video of his memory practice, quite a few questions elicited remarks like “I remember that the answer is X, but I don’t see why,” or “I’m confused that the answer is X.” Just as an example, one question was: “Why does the electric flux through a box containing a proton not change if you double the size of the box on each side?” There are many valid ways to answer that question, but the problem here was that no answer really made sense to Alex.
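As an illustration of one valid answer, here is a sketch via Gauss's law. It assumes the box fully encloses the proton and no other charge; the symbols are standard notation rather than anything quoted from the textbook.

```latex
\Phi_E \;=\; \oint_{\text{box}} \vec{E} \cdot d\vec{A} \;=\; \frac{q_{\text{enc}}}{\epsilon_0} \;=\; \frac{e}{\epsilon_0}
```

Nothing on the right-hand side depends on the box's dimensions, so doubling each side leaves the flux unchanged: the field at the larger surface is weaker, but there is more surface for it to pass through.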

With this question and the others like it, I wasn’t trying to help Alex generate new understandings or acquire general schemas. These questions are really just intended to be straightforward memory reinforcement. They directly recapitulate some important explanation from the text, with similar wording. If Alex can’t produce an answer, but the solution immediately makes sense once he reads it, then more retrieval practice will probably help. On the other hand, if the answer doesn’t make sense, rote repetition isn’t the right fix. If he doesn’t understand why the answer is what it is, then he doesn’t understand the concept we’re trying to reinforce. There’s not much point in just memorizing the answer.

This is an important distinction to understand, and I fear it’s one that memory system designers consistently ignore: the appropriate intervention for a “wrong” answer will vary enormously with the nature of the prompt. If you’ve forgotten something you’re supposed to memorize, then it’s probably fine to review it again until it sticks. If you never understood the answer in the first place, then you need some very different intervention. And there are other meaningful categories. If it’s a generative task (“hum a melody in pentatonic minor”), and you find it too difficult, you might want to browse some examples, or tone down the difficulty of a task (“hum a final bar for this pentatonic minor melody”). If it’s a problem-solving task, you might want to read a worked solution, then add focused prompts about important patterns or procedural steps. Or, ITS-style, you might want to read an explanation of the misconception which your answer suggests, and then queue up some simpler problems which focus on that confusion.
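The distinction above can be sketched as a small dispatch function. This is a hypothetical illustration, not a description of any existing memory system: the category names, the `answer_made_sense` flag (standing in for the learner's own report after seeing the answer), and the function itself are all assumptions made for the example.

```python
from enum import Enum, auto


class PromptKind(Enum):
    RECALL = auto()           # straightforward memory reinforcement
    GENERATIVE = auto()       # e.g. "hum a melody in pentatonic minor"
    PROBLEM_SOLVING = auto()  # e.g. a textbook-style exercise


class Intervention(Enum):
    REVIEW_AGAIN = auto()
    FLAG_FOR_REREADING = auto()
    SHOW_EXAMPLES_OR_EASIER_TASK = auto()
    WORKED_SOLUTION_THEN_FOCUSED_PROMPTS = auto()


def choose_intervention(kind: PromptKind, answer_made_sense: bool) -> Intervention:
    """Pick a follow-up for a missed prompt, based on what kind of prompt it was."""
    if kind is PromptKind.RECALL:
        if answer_made_sense:
            # Forgotten but understood: more retrieval practice will probably help.
            return Intervention.REVIEW_AGAIN
        # Never understood: repetition is the wrong fix, so send it to a
        # queue for rereading or discussion instead.
        return Intervention.FLAG_FOR_REREADING
    if kind is PromptKind.GENERATIVE:
        return Intervention.SHOW_EXAMPLES_OR_EASIER_TASK
    return Intervention.WORKED_SOLUTION_THEN_FOCUSED_PROMPTS


if __name__ == "__main__":
    print(choose_intervention(PromptKind.RECALL, answer_made_sense=False))
```

The point of the sketch is only that a single "wrong answer" signal is not enough; the system needs at least the prompt's kind and whether the revealed answer made sense before it can choose a sensible next step.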

In Alex’s case, I think the most appropriate next step is to simply reread the relevant passage from the textbook. I don’t think the book’s explanations were too confusing; probably his attention just lapsed during those passages. (I think this happens to everyone, more or less constantly, but we just don’t notice.) I’ve linked each prompt directly to a source location, so re-reading is relatively easy mechanically, but it’s more difficult practically. It feels bad to interrupt the smooth flow of a review session to go read a textbook. Also, Alex prefers to review on his phone, and it’d be pretty awkward to pull up textbook pages on that tiny screen. So it’s probably best to flag these questions and to establish a process to work through that queue during study sessions. If the book’s explanation still feels confusing, then the right next step is probably to discuss it together. Or, for some concepts, it may be better to just set it aside for a while and see if it makes sense later.

In Quantum Country, I think our design helped with this sort of situation. You first review every prompt in the context of reading the book, rather than in the context of a memory review session. So if you notice that you’re confused by some answer, it’s much less disruptive to scroll back up a little and re-read. And because the reviews occur every few minutes of reading, the relevant passage won’t be too far away. Readers we interviewed often remarked that they felt an unusual sort of confidence while reading, since they knew the interleaved reviews would ensure they absorbed everything they were “supposed to”.

Why didn’t we use a design like Quantum Country’s for Alex’s physics review? At first, it was because he was reading a physical copy of the book! He later switched to a digital edition. I’ve implemented a PDF reader with integrated memory prompts, but PDF rendering is obnoxious enough that I haven’t yet managed to integrate the interstitial reviews. Right now, prompts are presented in the margins, like last summer’s prototype.

In this discussion, confusing memory prompts play an unusual but important role. Instead of doing the job of reinforcing long-term memory, they act as backstops. They make sure that if you didn’t understand that part of the text, or if you weren’t paying attention, you’ll notice! I think this is quite valuable. The usual alternative is that you try to solve exercises and end up confused or stuck because you’re missing some important conceptual understanding. As I mentioned earlier, the trouble there is that it’s not clear what you’re missing. Even reading a worked solution may not make the missing concept nearly so obvious as the simple prompts we’ve been discussing.

Learning in conversation

In the introduction to his Lectures, Feynman writes:

Problems give a good opportunity to fill out the material of the lectures and make more realistic, more complete, and more settled in the mind the ideas that have been exposed.
I think, however, that there isn’t any solution to this problem of education other than to realize that the best teaching can be done only when there is a direct individual relationship between a student and a good teacher—a situation in which the student discusses the ideas, thinks about the things, and talks about the things. It’s impossible to learn very much by simply sitting in a lecture, or even by simply doing problems that are assigned.

Alex and I have met roughly once a week to discuss the material and to solve problems together. I’m not sure how true Feynman’s claim is, but Alex has told me that these sessions have felt extremely valuable. I don’t think these sessions replace the role of a memory system: we get more out of our time together because we don’t have to review the fundamentals, and he’s not looking things up all the time. And the memory system ensures that much of whatever progress we make will stick.

But I can imagine that in an ideal world, Alex would have me (or a more effective tutor) by his side during 100% of his problem solving time. This isn’t cheating; I’m not making his work easier by giving the answers away. I let him struggle, and when he’s stuck, I ask probing questions which might unblock progress. I can supply raw information where it’s needed. If I notice a misconception, I can ask a question which might create some revealing conversation. I’ll call attention to patterns or connections he might have missed.

It’s hard not to start wondering: could a large language model do some or all of this? I don’t think I’m doing anything terribly special, though I’m certainly drawing on my subject matter experience in both physics and education. A sufficiently good model could also help with the question of how to handle topics which don’t come with ready-made problem sets and discussion questions. There are now six zillion “AI tutor” startups in flight, but none I’ve seen have yet felt like they’re on the right track. I don’t have a clear sense of what my complaint is: too instructional, too interested in “right answers”, not interested enough in discussion and sense-making.

There is one element of our collaborative problem solving sessions which I’d guess these systems would have trouble providing: social energy. Alex is motivated and disciplined, but like all of us, his gumption ebbs and flows. On some days, it can be hard to muster the energy to pick up another chapter. But he’s observed that when we’re working together, it’s much easier to get and stay excited about the learning process. I’ve certainly had that experience when collaborating with others on… well, basically everything!

I’ll close by noting that while I’ve introduced a lot of problems in this essay, I’m thrilled about what I’m learning here. This feels like a basically ideal context for my research. Alex is highly motivated to deeply understand. There are strong pressures on that understanding. He’s struggling, so there’s real opportunity for augmentation. I’m getting an intimate view of his learning process, ample opportunity to intervene, and tight feedback loops on anything I try. It’s exhilarating! Now I just have to deliver.


Recording of today's seminar on GPT-4

Some topics of discussion:

  • things we noticed in the paper which seem under-discussed
  • attempted tea-leaf reading regarding model size, compute, scaling laws
  • implications for software design
  • what are AI systems, as a medium or a material?
  • moral challenges for accelerationism

Chat transcript, for URLs referenced in the discussion.

Please don't share this video or the chat transcript publicly.


Seminar: GPT-4; Mar 25 @ 9AM PDT

Join me (via Google Hangout) on Saturday, Mar 25th at 9AM PDT [GCal] for a discussion of GPT-4 and its implications for augmented cognition.

Please read the technical report before attending; bring your noticings, wonderings, and ideas.


Becoming a Wizard-of-Oz learning assistant

As I described in December, I’m experimenting with some unusual (for me) new research methods. Data analysis and stacks of user interviews have given me a ten-thousand foot view of learning in action. Now I’m taking the opposite approach. I’m diving in close, trying to understand the emotional and practical consequences of individual learning actions and design decisions, over time.

To restate my (admittedly ludicrous) aspiration: I’d like to invent an utterly transformative environment for learning and growth. I want to induce an uncanny, almost alien, feeling of effortlessness and proficiency. So far, I’ve chased that goal by building and iterating on scalable systems. But now, before systematizing or scaling anything else, I aim to produce that uncanny sensation for a single person. I want to make one person feel that they’ve been granted impossible superpowers. Then if I can do it again, and again, I hope I’ll see how to bottle that lightning in a system—a much more powerful system than I could create by iterating “in system space.” If nothing else, I expect this N-of-1 approach will produce some unusual insights along the way.

Meet Alex

So, for the past month, I’ve been acting as a “personal learning assistant / coach” for Alex (name and gender randomized). He’s a creative and driven adult, employed at a startup. Last year, through friends, he met some people working on an obscure problem in physics. Alex became absolutely obsessed. He decided that he had to try to contribute. There was just one “small” problem: he hadn’t studied much advanced physics.

In December, Alex embarked on a six-month quest of full-time study. He aims to understand the state of the art and to become a competent participant in conversations about this problem. He hired two tutors (one for math, one for physics) and began to work through related papers and textbooks.

I suggested that I’d ride alongside his learning process, like a little daemon sitting on his shoulder, and I’d help however we thought might be most useful. Of course, I expected to intervene with memory systems: Alex already had an Anki practice and some experience writing prompts for conceptual material. But I left the scope deliberately open, to make space for generative experiments—whatever seems appropriate to the problem at hand.

It’s an uncomfortable arrangement for me… and that’s good, I think. My practice, my experience, and my culture fixate on abstraction and systematization. Some part of me objects that N-of-1 insights are “fake”, that I’m not “doing real work” when I hack together bespoke one-offs for a single user. It’s been gratifying (and challenging) to confront these preconceptions, to stretch my practice in new directions. Happily, now four weeks into this experiment, I feel more richly connected to the design space in many ways than I have in years.

The force of a live project

As I listened to Alex’s plans and to his tutoring sessions, the first thing I noticed was the constant force exerted by his project—the papers he wants to understand but can’t, the arguments he can’t evaluate.

Here’s an illustrative example: Alex’s tutoring sessions are largely driven by a list of questions he’s brought, not by the tutor prescribing what he should learn next. Alex’s questions aren’t abstract or academic. They’re live blockers for something he cares about immensely. His project drives the learning loop: deciding the next thing to learn, assessing when it’s understood well enough, choosing to move on. And this particular project demands deep understanding at every step.

I’ve interviewed and observed dozens of learners over the past couple years, but this was a new pattern of learning behavior for me. And my instinct is that it provides exactly the sort of pressure my work needs.

The serious participants I’ve previously observed have mostly followed two consistent patterns of behavior.

The first pattern, which I’ll call Syllabus Learning, frames the goal as “learning Subject X.” Sometimes this means trying to pass a test or a job interview. Sometimes it’s a felt sense of obligation or “should”, like “I’ve always felt I ‘should’ learn statistics properly.” Sometimes there’s a project or interest in mind—“it’d probably help my data science work to brush up on probability theory”—but that project isn’t driving the learning loop, day-to-day. The common thread is that the learner’s emotional connection is abstract. An external structure (often a course or a textbook) mostly drives the learning loop: decides what to learn next, when it’s understood well enough, and when to move on. Most of these learners aren’t really trying to carefully internalize the material; they’re trying to “finish the course.” Syllabus Learning tends to be satisfied if the learner can “make it through” the readings and the exercises they’re “supposed to” do.

I’ll call the second pattern Exploration Learning. People reading Quantum Country because they’re “curious about quantum computing.” Sunday reading over coffee—grazing for new knowledge to spark joy or to inspire future action. A sense of novelty and interest drives the learning loop: it decides the next thing to learn, when it’s understood well enough, and when to move on. Because that curiosity-directed impulse often doesn’t extend to details, Exploration Learning often aims to internalize results, methods and ideas, but not necessarily the foundations needed to fully explain them.

It may sound like I’m disparaging these patterns of learning. That’s not my intent! These patterns are useful; people (including me) follow them for a reason. Syllabus Learning is a low-cost way to get a basic footing in some topic. Exploration Learning is great for introducing yourself to a wide swath of ideas. In both cases, initial exposure can guide future projects and deeper study.

But speaking now as a designer, I’m trying to create environments which can help people deeply internalize difficult topics. Such an environment would likely help both Syllabus and Exploration Learning. But these patterns don’t really supply the right pressures to help me create that environment. It’s like trying to design a race car by iterating with commuters when they’re driving on city streets. They’re like: “Yeah, it seems like it could go pretty fast.”

So the first big insight I’ve gotten from my work with Alex is: wow, okay, this is the pressure I want. Fiery Learning. He’s driven by a project which demands that he understand difficult material. His emotional connection to the project pushes him to truly understand the material, rather than just “get through it” as in Syllabus Learning. And unlike in Exploration Learning, this project insists on understanding in great detail. This is a demanding pattern of learning. Alex is truly struggling without augmentation. He would very much like help. He’s taking the race car out on a punishing track, trying to beat a formidable time. Any slight changes in performance become extremely salient; any small difficulties become major irritants. It’s an intensely high-signal, high-energy context for me as a designer.

Tutoring transcripts as powerful design inputs

One of the first interventions I proposed was: suppose I give you a special purple highlighter and pen. Now as you read, and as you write in your notepad, you get a new power. If something seems particularly important or interesting, just write with your purple pen or mark it with your purple highlighter. Then, magically (through my Wizard-of-Oz efforts), you’ll find that your memory system includes lots of prompts about that material, to ensure that you internalize it.

But at least when we started, the material which felt most salient to Alex wasn’t from a textbook or his own notes—it was from his tutoring sessions. Thankfully, he records every session with Otter, so he was able to send me audio with associated transcripts. What incredible material these are for a designer of a learning system! The conversational format externalizes much that would otherwise remain invisible to me, trapped inside Alex’s head.

Before I began this project, I was worried about how I’d get a clear picture of Alex’s evolving understanding. But through these transcripts, the tutors largely solve that problem for me. They ask questions and pose problems designed to interrogate Alex’s understanding. Through his responses, I receive a nuanced picture of confusion, confidence, frustration, interest, surprise, sluggishness, and facility. It’s richer material than I’ve gotten from any diaries, interviews, or observations I’ve done in the past. Part of the reason for that is that the tutor’s questions are about helping Alex, whereas when I’ve asked interviewees similar questions (“could you try to explain X for me?”) it’s about helping me, the researcher. There’s much more investment and connection in this context.

So one thing I’ve been doing is listening to these transcripts, and writing prompts to support anything important that comes up. The format makes “big a-ha moments” surprisingly clear. With fairly high accuracy, I can hear in Alex’s voice when he’s surprised, when he’s learning something new, when he finds something important. I mostly haven’t needed a purple marker cuing me to find the moments he’d want to internalize. But that doesn’t mean I can always tell what about the moment was surprising or exciting: in several cases, I identified the moment correctly but wrote prompts about the wrong aspect. And, of course, it’s time-consuming for me to find those moments through his tone of voice. We’d like to set up a sort of “clicker” that would let Alex mark, in real-time, important moments for me to review.

Tutoring sessions also create strong reference points to emotionally anchor the memory prompts. For example, in one case, Alex found himself confused in the middle of a difficult problem because he was mistaken about a fundamental property of matrix arithmetic. I wrote some abstract prompts to capture the relevant linear algebra, and he cleverly suggested: maybe I could include a “motivation” note with the prompts’ answers, to explain how the prompt connects to his confusion in the physics discussion he actually cared about. That seemed like a great idea. The questions I’d written were abstract and direct. It’s easy to imagine that a few months from now, one would come up, and he’d wonder: “why am I getting this random abstract math question about matrices? who cares?” I’ve certainly experienced that quite a lot in my own practice. But if I ground the question in a powerful discussion or experience or project, I suspect it wouldn’t be hard to reclaim my interest. Alex’s initial impression of the “motivation” notes has been quite positive. I’ll be curious to see how he feels in a few months.

One of my goals is to understand the connection between memory system prompts and practical fluency. In what circumstances does one support the other, and to what extent? Are certain kinds of prompts more important than others for that kind of transfer? In what ways does my current repertoire of memory practices fall short of producing practical fluency? I’ve gotten to see one interesting example of that last category in these tutoring sessions. Alex has lots of memory prompts about measurement of quantum systems, but there’s still a real sluggishness and hesitancy when he wants to actually write out the states of systems and manipulate them in the context of measurement. I think this may be in part because the memory prompts are all about laws and definitions and properties, but they don’t really practice applying that knowledge. I’ll be interested to experiment with some more “exercise”-oriented prompts.

The tutoring transcript format also suggests affordances which I might otherwise not have considered. For example, I noticed that sometimes when Alex brings up a question for discussion, the conversation ends up meandering onto another subject before the question seemed fully answered. I was able to mark these with comments like “Did this question get answered?” and “Are you still confused about this?” In some cases, Alex told me that he knew what was happening and decided it wasn’t important; in other cases, he found it useful to have pointed out that the confusion remained. He’d already established a queue of outstanding questions (partly, I imagine, as fodder for these tutoring sessions), so there’s a natural place to put questions which came up but remained unanswered. One can imagine a future learning environment surfacing these automatically.

One final note on these tutoring sessions. For Alex, the recordings were already quite transformative, long before I came along. Here’s a passionate ode to Otter, his audio transcription app, from one of our conversations: “Without Otter, I actually feel like I would be helpless. [laughter] … Reviewing the sessions… is like the core of being able to learn this stuff. … I just don’t retain enough on the first pass. [I also] don’t capture enough information to be able to make [memory system] cards with. I feel so held by Otter!” Properly humbling for me as a system designer.

Part of Alex’s comment there, by the way, is about the importance for him of reviewing his tutoring sessions. Before we started collaborating, he was already carefully going back through the transcripts, noting where he was still confused, and writing his own memory prompts. He expressed some initial suspicion of me writing the prompts for him: “Will that make me lazy?” Maybe. There really are trade-offs here. It’s complicated. Empirically, he’s been eager to use the prompts I’ve written for him. I’m hoping to understand some of these trade-offs better over time.

Talk-aloud review sessions

Another revelation for me came in the form of a fifteen-minute video clip. I wanted to understand how the prompts I was writing felt, a few weeks later. Where were they boring? Helpful? Off-target? Could Alex notice the way a single prompt had influenced his practical facility? So, at first, I asked for feedback and suggested he make some notes while he did his memory system reviews. These notes were somewhat helpful, but I wanted a lot more detail. I also worried they were a little too filtered, a little too detached. So I asked Alex if he’d be willing to use the iOS screen recorder to talk aloud while he did his review session.

And: absolute magic! I felt I was accessing a new level of insight about the practice of writing memory system prompts for someone else. Historically, I’ve gotten live per-prompt reactions from readers on their initial pass through a mnemonic essay. Those observations have been quite instructive, but they’re also limited—those readers don’t have enough distance on the prompts. From Alex, I get per-prompt reactions like:

  • “I know the answer, but I’ve noticed that I don’t really understand why this is the answer. I’m just parroting.”
  • “Whenever this one comes up, I find myself wanting to rush through it. Something about the wall of text.”
  • “I’m answering this question by rote. I’m not really thinking.”
  • “This feels like it’s basically the same as the question that came up a few prompts back.” (it wasn’t… but the fact that it feels that way suggests something interesting)
  • “This one has felt really useful—I’ve noticed it’s come up a few times in tutoring.”
  • “This was helpful at first, but now I feel like I really know it, and it’s annoying to continue being asked.”
  • “I feel like this actually wants to be two questions.”

In some cases, this feedback led me to rewrite the prompts, or to write new prompts. In other cases, the right prescription was for some questions to get added to Alex’s outstanding question queue. In still others, the problems seemed more with the system itself. In any case, it seems clear to me that this format makes a much more effective feedback loop possible.

One common theme in Alex’s feedback is just how hard it is to really nail the target of the prompt. Often I’d written a prompt that used all the right key terms, so it seemed right initially—but when Alex practices it, he notices that it’s not reinforcing quite the right aspect. Or I’ve written a prompt about something that had confused him, and now he can answer it… but he’s still confused; there’s something important I failed to capture. All this rhymes with my experiences trying to get large language models to write good prompts. They’ll write prompt-shaped text, and it has all the right words, but most of the time it’s subtly (or not so subtly) off-target. Worse: it’s usually not at all straightforward to evaluate whether the prompt is on-target, nor to articulate the way in which it’s off-target. In many of these cases, I feel it may not be possible (for a person or a language model) to hit the right target on the first try. The prompt-writing process may truly need the feedback pressure from subsequent review sessions, to shape the prompts appropriately.

On a more mundane note, these recorded sessions seem like a good way to make low-hanging improvements to memory systems, by eliciting and working through a fine-grained friction log. For example: with a number of prompts, Alex noticed that he felt an impulse to move on without really thinking about them, because there was an imposing wall of text. This makes total sense in hindsight. He’s using Anki, and—I think because it was designed for vocabulary words—it has truly awful default typography for prose. More structurally, there really should be a hierarchical separation between the “answer” one is supposed to check, and a longer explanation one could read for more detail if one wants. When both are presented with the same appearance, the answer appears enormous. (We noticed the same problem in Quantum Country and put explanations behind a disclosable section on the backs of cards.) So I did a quick typography polish pass, and I styled extended explanation differently. I’ve included a before/after below; Alex reports that these prompts now feel much better.

Stuff like that is easy. It’s not important, exactly. But a lot more iteration of that kind will help us see the actual boundary conditions of these systems much more clearly, without distortions from needless impediments.
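
To make the structural point above concrete, here is a minimal sketch of a prompt that keeps the short “answer” separate from the optional longer explanation. It is illustrative only: the `Prompt` class, its field names, and `render_back` are hypothetical, not Anki’s card format or the Quantum Country implementation. It assumes the card back is rendered as HTML, and it uses the standard `<details>` element so the explanation stays collapsed until the reader asks for it.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Prompt:
    """Hypothetical prompt model: names are assumptions for illustration."""
    question: str
    answer: str            # the short text the reviewer checks against
    explanation: str = ""  # optional longer detail, hidden by default


def render_back(prompt: Prompt) -> str:
    """Return HTML for the back of a card, keeping the answer primary."""
    parts = [f'<div class="answer">{escape(prompt.answer)}</div>']
    if prompt.explanation:
        # <details> is a standard HTML disclosure element, collapsed by
        # default, so the explanation no longer inflates the visible answer.
        parts.append(
            "<details><summary>Explanation</summary>"
            f"{escape(prompt.explanation)}</details>"
        )
    return "\n".join(parts)
```

With this shape, a prompt that has no explanation renders only the answer, and one that does never shows more than the answer until the disclosure is opened.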

Next

It’s only been a few weeks; I’m still getting my footing; we’re still figuring out the right way to work together. I certainly haven’t yet produced anything approaching a transformative augmentation. But I’m excited, I’m prototyping at a faster pace than I have in a long time, and I feel I’m learning a great deal, even if I can’t articulate much of that very concretely yet.

My immediate next challenge is to find a good way to “get inside” Alex’s learning loop. In our first couple rounds, he might articulate a question on Monday, have a tutoring session on Tuesday, get prompts about it from me on Wednesday, review them for the first time on Thursday, and get feedback in front of me by Friday. We’ve tightened that by a couple days, and daily conversations about his plans and barriers are helping me respond more quickly. Part of the trouble is that I’m building prototypes as I go, and of course that takes time. Maybe once my armory is a bit better established, it’ll feel easier to keep up.

One thing this experience is: intense. Intense, for both of us! Alex is trying to learn difficult material and contribute to an open question, on a very ambitious time frame. And I’m trying to make a transformative difference in his learning experience through second-party interventions. Both tasks feel awfully overwhelming. An air of commiseration helps the project, I think, but the strain is honestly quite palpable, alongside the constant sense of excitement. In any case, I’m looking forward to more.

————————

My thanks first and foremost to Alex, for so vulnerably opening up his learning journey to me. I’d also like to thank Michael Nielsen, Gary Wolf, Robert Ochshorn, Ben Reinhardt, Joe Edelman, and Nick Barr for helpful discussions about this effort.


Seminar on Dempster (1988): "The Spacing Effect: A Case Study in the Failure to Apply the Results of Psychological Research"; Feb 26 @ 9AM PST

tl;dr: join me (via Google Hangout) on Sunday, Feb 26th at 9AM PST to discuss Dempster's classic paper on why the spacing effect isn't deployed in educational settings. Please read the paper before attending; bring your noticings, wonderings, and ideas. [GCal]

In 1988, Frank Dempster tackled the question: the spacing effect is one of the best-established results of psychological science… so why hasn't it changed anything in education?

It's deliciously ironic that a) this is one of the most-cited papers for academics investigating memory systems; and b) despite being published the year I was born, roughly nothing has changed in the spacing effect's deployment in educational settings.

I think this paper gets a lot of things right—yet ignores what might be the most important barriers. I've been meaning for a while to write a "contra Dempster" essay, and that's always a nice reason to convene a discussion. Join me!


Three years of crowdfunded research

A public version of this letter is available here: https://andymatuschak.org/2022 

I’m an independent researcher. That’s an unusual position, for sure—but what’s even more unusual is that 2022 was my third year as a crowdfunded independent researcher. My primary income is, and has been, a membership program. Most researchers answer to a few grantmaking committees; I answer to hundreds of internet strangers.

The funny thing is that they don’t quite feel like strangers anymore. In 2020, I viewed my membership program simply as a funding source, a way to pay my bills without contorting myself through grant applications. But slowly, spontaneously, the program has grown into something much more central and vivid in my creative life. Now it’s a compelling rhythm, a context that scaffolds and energizes my work.

Three years in, the membership program is still evolving, and so are my feelings about it. But here’s a snapshot for the new year. How has crowdfunding shaped my work? How do I relate to it, emotionally? How might it evolve in the future? I’m sure my experience won’t generalize, but perhaps it’ll be of interest to others considering similar paths.

High-context listeners

Research often has a slow tempo; projects can span years. My membership program has created a faster loop inside the slower one, a space eager to observe the work as it unfolds. I’m mostly working on one enormous project, yet membership creates a context in which I can usefully publish quite regularly; this year: working prototypes, essays on projects and methods, scripted demo-talks of new designs, hours of audio.

I worried that writing for patrons would feel like making dutiful “reports”. In 2020, it did feel that way. But that was my fault. My emotional stance changed as I had more conversations with members and began to understand their mindsets. They aren’t program officers, looking for evidence of a productive grantee. Most are curious creatives, looking for an intimate view of an unusual life, a challenging creative process. I’d initially thought of members’ desire to “see behind the curtain” in a demanding, stressful way: people want me to finish prototypes they can use—sooner, faster! In reality, members’ comments generally suggest that they want to see what it’s like to do this strange thing I’m doing. They want to see the parts spread out on the table, see me move my hands, share in the moments when something finally clicks, however provisionally. Magic is exciting, but sometimes it’s even more exciting to dispel something magical, to glimpse the gears.

That’s fine for members. But what do I get out of this? The way I escaped the essay-as-duty frame is by recognizing: here’s a context which pushes me to think carefully about some aspect of my work. Here’s a context which delivers a meaningful hit of creative gratification—reliably, right now—while my long research project rolls unpredictably onward. With that frame, essays for members become a creative “move” in my toolbox. Choose a question, a detail, a practice, an idea; write what I think I think; discover much more in the process. These essays become part of actually doing the research, not an added burden.

All this sounds like typical (good) advice: writing helps you think! Write more, and you’ll think better! That’s true, but what makes the membership program different for me is that I’m writing for extremely high-context listeners. Most of my readers will have read tens of thousands of words about my projects. Probably more than a hundred of them have read a hundred thousand words about my recent work. (That’s… a bewildering sentence!) It means I can jump straight to the edge of the work, straight to what I’m thinking about. When I talk to a general audience about my research, I spend most of the time just providing background. Most of the response I get is stuff I’ve heard dozens of times before. By contrast, my patrons mostly want to meet me where I am, and that utterly changes how I relate to the writing.

Could you focus your writing on an extremely high-context audience without a membership program? I’m sure that many authors manage it, but it’s emotionally tough for me as a writer. Newsletters are ubiquitous these days. Subscribing just isn’t much of a signal. I’m on plenty of mailing lists I don’t really care about, and you probably are too. Most newsletter authors will need to assume a wide distribution of reader investment. Sure, one could choose to write for a tiny subset of that distribution at the far right tail, even if it means alienating most of your readership… but that requires steely grit as an author. On the other hand, if you’re one of my patrons, you’re one of a few hundred people who are directly funding my research. You’ve given me an unusually powerful signal of interest, and that makes it easy for me to communicate accordingly.

Here’s the bittersweet part. These regular high-context listeners are precious because they simulate some of what I’d have in a good university department, with a good lunch table and a good seminar series. I don’t have those things. I cobble together what I can through walks with peers around San Francisco, long email threads with distant colleagues, and these essays for my high-context members. I’m grateful for those venues, but I recognize that they’re a poor substitute for top-notch everyday social immersion. Still, all this is more than I had a couple years ago, and I’ll be experimenting with a few new mechanisms this year.

Another surprising way that my membership program simulates being at a university: I get the chance to help others grow. Many aspiring inventors have told me they’ve found it incredibly instructive and enabling to see my gears turn, to hear me reason about problems, to see me break down projects and make progress. I get it. My own growth has depended enormously on watching colleagues do those things in person. It’s gratifying to indirectly scale this sort of tacit knowledge, at least in some part. Sure, I hold office hours, and I write some explicitly didactic material—but upon reflection, I don’t think these are nearly as valuable as simply showing what I’m doing, in lots of detail.

Not being a “content creator”

You’ve probably read breathless essays about the “creator economy”. Thousands of writers earn an independent living through newsletter subscriptions. Many popular podcasts are available only to paying listeners. The membership platform I use, Patreon, began as a way to support musicians and video creators on a pay-per-creation basis. All these examples make a simple pitch: you pay a recurring fee, and in exchange you get access to a steady stream of exclusive new content.

Now, in the previous section, I discussed the regular essays I write for patrons. But unlike the author of a subscription newsletter: I’m not trying to be a “content creator.” Access to those essays is not what I’m selling—or at least not what I intend to sell. I’m a researcher, trying to invent tools to give us new cognitive and creative powers. For a newsletter author, published writing is the primary activity. But my essays are secondary, intermittent byproducts of the central work which consumes my days. I can’t (and don’t want to) compete with a full-time writer.

My pitch to members is less transactional. It’s more like patronage in a historical sense. My work is a public service, and my primary outputs are available for free. Becoming a member is like being a tiny grantmaker. It’s saying: “Yes, for the price of a monthly latte (or whatever), I’d like to help enable progress in the domains Andy’s pursuing.” Now, that’s a fairly pure relationship. It’s basically an elaborate donation box. But then I muddy the waters: as a bonus, I say, you’ll also receive regular behind-the-scenes essays, events, early prototypes, and so on.

In past surveys of patrons, the main motivation for becoming a member was overwhelmingly to enable my research. Only a small fraction listed access to exclusive content as a primary factor. But that exclusive content sure does seem to matter! When I framed my membership program as a purer donation box, visitors were several times less likely to sign up and stick around. How should we square this?

The best explanation I have comes from fellow independent, Craig Mod: “This membership program is, at its core, like a mini NPR — of course, there are perks, but the main reason to become a member should be: Craig, ya weird bird, I want to see more of your work in the world.” I remember watching PBS membership drives as a kid. There was a full-time team making this happen, and they clearly thought perks were essential. On stone tablets given from on high to all the presenters: perks! Gotta shill the perks! Yes, with the support of viewers like you… but—the boxed DVD sets and the commemorative caps, every ten minutes! Membership costs much more than the merchandise, so it’s still mostly a donation, yet the perks obviously helped close the deal.

But I’m also a member of my local modern art museum, SFMOMA. Membership costs about four times the price of a regular ticket; members get free tickets to the museum for themselves and a companion. So if you go at least twice with a partner, membership pays for itself. The museum bravely tries to frame this like NPR/PBS: please support the non-profit museum… and, as a perk, you’ll get these free tickets! I like SFMOMA a lot, but if I’m being honest, I’ll confess that “supporting the museum” represents 0% of my membership motivation. It’s too diffuse a public good. I visit a lot, and I’m really just purchasing tickets through membership.

So is my membership program like NPR or like SFMOMA? Mostly a donation or mostly purchasing the perks? I’m quite clear whenever I write about it that prospective members should think of it as primarily supporting my research. But I feel I’m fighting a rising cultural trend. Subscription-gated content is everywhere now, and growing. I’m in the minority here, running a weird NPR-style membership program. If someone pays for five content creators’ subscriber-only content, then joins my program, it’s hard to imagine that “content creator” expectations wouldn’t start leaking over to me, if only subconsciously. I feel this “content creator” pressure, in some inchoate way I can’t quite describe, and it worries me. I do my best to ignore it, but I have to imagine it leaves a subtle influence.

Maybe my best defense is a goofy one. If you think of me as a content creator, my monthly membership fee looks like a “bad deal” by comparison to other content creators. So you’ll self-select out. Ta-da!

Events

One benefit of running a niche membership program is that I’ve gathered together a bunch of people with overlapping niche interests. It seems worthwhile to help these people connect. I held about two dozen events for members this year, across a variety of formats. My motivation here has been partly selfish: maybe I can help grow the “scene” around my research, foster some future peers or collaborators?

Since I knew many members were working on novel user interfaces themselves, I began by hosting open office hours. I’d answer questions, generate ideas, host design crit, and so on. Perhaps a dozen members gamely brought their work in progress, and it was a delight to facilitate discussions of others’ projects. After ten sessions, though, this pool had mostly dried up. Just not that many people were working on big projects of their own appropriate to share in that venue.

I’ve had somewhat more luck with a series of research seminars focused on a specific notable paper or talk. These are great for me, since I choose a work I’d like to understand more deeply. I bring notes for discussion, but the other attendees always have lots of good questions and observations. Key to this is a strong norm: I ask people only to attend if they’ve actually read the paper or watched the talk. So the discussion ends up intense and high-signal. It’s also helped to invite attendees from outside my membership program who have special insight or interest in the work being discussed.

I’ve also held two “unconferences” for members. These don’t take much work for me to organize, since we hold them in Gather and members assemble the schedule. Like any unconference, sessions are informal and multi-format: demos, talks, workshops, problem-solving sessions, show-and-tell, etc. These events have been quite generative for me—particularly the second conference, for which I invited a few other small communities to join as well. Unfortunately, even in Gather, the all-important “hallway track” at these online conferences is much diminished. And since I was trying to help members connect, that’s a major problem. The COVID bloom of new remote social tools doesn’t seem to have solved it.

There’s one obvious related approach here I haven’t tried: creating an ongoing realtime discussion community, through tools like Discord, Zulip, or Discourse. I’ve joined lots of online communities on platforms like those, and they’ve never worked for me. They always end up feeling like a burden—one more inbox to check, another thing I have to keep track of—rather than a fount of joyful connection. Also, if I’m really trying to grow my “scene”, I don’t love the idea of a paywalled members-only discussion community. If we really care about good conversation, we want the community to include people who do good work and contribute good discussion. That set only partially overlaps the set of my patrons. I could add some layer of invitations, but I don’t want to be responsible for “playing host” on this scale. I continue chatting with others who have invested more heavily in these sorts of environments; maybe at some point I’ll find a convincing angle here.

Paying the bills

Friends aren’t quite sure how to frame the question. “So, how’s your crowdfunding… going?” What they really mean is: Andy, are you okay? Are you setting your life savings on fire?

The good news is that I’m able to pay my bills. Really—strangers on the internet are able to pay my bills. (I’ll pause for a moment to say: this is unbelievable.) The bad news is that the program’s growth has roughly plateaued.

My income sits somewhere between that of a grad student and a junior faculty member. This is… okay, though obviously not ideal. And because early-career people often write me, eager to follow my path, I want to caveat: this only feels okay to me because I worked in tech for a decade and saved well; we own our home; my supportive wife has a stable career; we have no kids and no debt besides our mortgage; etc etc. In other words, because of phenomenally fortunate prior circumstances.

In my 2020 reflection, I quoted Ivan Sutherland:

I find that I have only so much room for taking risks. When I can reduce the risk in some places in my life, I can more easily face risk in other areas. I provide myself the courage to do some things by reducing my need for courage in other areas.

This is still very true for me! Unfortunately, I’ll confess that crowdfunding consumes a bit more courage now than it did two years ago. I’ve become somewhat less optimistic about its long-term sustainability. Flat revenue at my current level is okay, but the second derivative is probably slightly negative, and I’m not far into the black.

The fundamental dynamics haven’t changed since 2020: there’s a low but steady churn rate (~2% per month), which must therefore be balanced by a steady rate of new members. The churn rate is surprisingly insensitive to my actions. New perks, changing my publication rate, making (what seems to me) faster or slower progress—none of these things has meaningfully affected churn over the past few years. It’s sort of comforting to know this is probably not a knob I can meaningfully change.

New people visiting the membership page sign up at the same rate year over year, but there were fewer new visitors in 2022 than in 2021. As I pointed out last year, this means that to “tread water”, I need to constantly expand my audience, get new people “into the top of the funnel.” At least for me, this mode of thinking seems awfully toxic to a research mindset. If my crowdfunding revenue falls too much, I’d rather pursue grants and other funding sources than try to fix the situation through “growth marketing”.

It’s also pretty clear that a membership program like mine isn’t likely to fund a team or an institution. In previous years, I’ve experimented with hiring teammates on a contract basis, using a separate tranche of funds a few donors generously provided for that purpose. I continued those experiments this year, with quite a lot more success. And in 2023, I’ll be collaborating with a full-time research fellow for six months. I’m excited about these collaborations! To push further in these directions, I expect I’ll need some separate source of funding, not my membership revenue.

That all sounds a bit gloomy, so let me close this section by saying: crowdfunding does still successfully pay my bills. It’s not likely to stop in the next year. And that’s really astonishing!

Wonder and puzzlement

Let me focus on that astonishment for a moment. Day to day, I often lose sight of just how weird my life is. After three years of this arrangement, it’s somehow become… almost mundane? Homeostasis in all things, I suppose.

But at practically every social event I attend, I receive a reminder in the form of this conversation, which plays verbatim on loop:

“What do you do?”
“I’m trying to invent tools that give people unusual cognitive or creative powers.”
“Cool! Is that a startup?”
“Er, no, it’s just research. I make prototypes and write about them.”
“Oh, at a university?”
“No, I’m independent.”
“So you’re like a freelancer, hired for contract research?”
“No, I work on my own ideas.”
“But… who pays for that?”
“Uh… a bunch of people on the internet.”

All this usually leaves my counterparty expressing some mix of wonder and puzzlement. I have to say: that’s an awfully good description of how I feel about the situation, too, when I’m paying attention.

Wonder—here is the gift of an endless open field, to explore as I see fit. Craig Mod described this as “feeling bestowed a permission to do the kind of work I believed I was capable of, but perhaps not strong enough to do entirely on my own.” I get that too, and I’m grateful for it, but much of the wonder I feel arises from a sense of permissionlessness: I don’t need to seek anyone’s approval to pursue whatever I find interesting. With no grant applications to write and no tenure committee to appease, I can square up to the true task (the much harder task!) of courageously chasing what I actually think is best. Not what I imagine the crowd wants me to do; not what will likely produce shiny short-term results; not what will let me rest comfortably in my sphere of competence. It’s a towering call to action, and I’m tremendously grateful for it, and (of course) I absolutely struggle to live up to it.

And so, now, the puzzlement—what are the terms of this glorious freedom I’ve been granted? I’ve made few explicit promises to my members, sure. But they’ve likewise made few promises to me. What pace of legible progress must I maintain? How far can my interests stray? If enrollment starts to decline, will I be able to do anything to reverse that trend? At what cost? These are hard questions to ask on a survey. I don’t think I could answer them reliably myself, for other people I support through membership. I certainly don’t expect members to give me an unlimited leash.

What’s harder about this puzzlement is: even if I could answer those questions, I must not act on them! Say that I knew project X would take a long time, and my patrons would be likely to lose patience. So what? Would that mean I shouldn’t do it? That’s a recipe for a terrible research practice.

So, day to day, doing good research in the context of my membership program means mostly paying attention to the wonder, and to the gratitude, and mostly ignoring the puzzlement. That’s not easy!

What makes it easier is: practically every conversation I have with a patron is wildly supportive, trusting, and generous. These messages gently un-ask all those puzzled questions. Here are hundreds of people who are simply excited to see what I come up with. And when I manage to let go of my grasping puzzlement, all that faith redoubles my wonder.

So, to all members, past and present—thank you, thank you, thank you! These past few years have been the most creatively fulfilling of my life, and you’re part of a small group which has personally made that possible.
