Andy Matuschak

Prospects for consumer silent speech interfaces

Added 2022-07-31 21:42:51 +0000 UTC

Relatively soon—maybe in our lifetimes, maybe next century—we’ll have full duplex input and output links from our brains to computers. We’ll have synthetic telepathy. We may find ourselves augmented with an effectively unlimited working memory. What will “collective intelligence” mean when the boundary between your thoughts and mine can be controlled by software?

The details here will depend enormously on understandings of the brain which we don’t yet have, and on the physical limits of neural interfaces which haven’t yet been developed. Skating to that particular puck might feel too outlandish—the fog of uncertainty just too thick. So, could I interest you instead in some more tractable neighbors? If meaningful brain-computer interfaces (BCIs) are decades out, consider: what “poor man’s BCIs” might we pursue in the meantime? What “apps” might we build with them?

With a true BCI, you could compose a manuscript silently, at the speed of thought. I don’t have a true BCI, but I do have wireless earbuds. I enjoy going on long walks with an audio recorder running continuously. I’ll talk through research problems while I roam about the city, or while I stare at the ceiling on my couch. I also use the earbud microphone when I’m curled up in an armchair with a paper book, where there’s often no good surface for my notepad. A running audio track can capture my thoughts as I read. My pipeline will even let me dictate spaced repetition prompts mid-recording. But when my wife’s at home, I stop my chatter to avoid being a nuisance. Likewise, I often want to linger at a café or library on my walks, but dictation would be unwelcome there.

What if I could talk to my computer without making sound? This is the premise of silent speech interfaces, a family of sensor systems for interpreting spoken input without audible vocalization. Details vary, but the general idea is that you go through the motions of talking without engaging your vocal cords. These interfaces don’t quite operate at the speed of thought, but ubiquitous, unobtrusive, screenless input—even at the mere speed of speech—strikes me as a plenty interesting “poor man’s BCI”. In this overview, I’ll offer an opinionated look at the field from the perspective of practical consumer design opportunities.

Reading through the literature, my sense is that silent speech interfaces are on the cusp of tractability. Now looks like a promising time for an inventive technologist to step in. In particular, I notice that most publications are focused on restoring communication to people with speech disabilities. That’s wonderful, of course. But it also means there’s ample space for creative designers to envision how healthy consumers might use these interfaces in everyday contexts.

More importantly, researchers have only recently begun to use modern machine learning techniques in silent speech interfaces. When I started my literature review, I was delighted to find a recent book-length overview: An Introduction to Silent Speech Interfaces. The book provided a helpful survey of the various sensing techniques which had been tried, but the error rates and form factors left me pessimistic. On a whim, I thought: well, the book’s from 2017; let’s check what’s happened since then. Wham! The field started using deep learning in earnest! As we’ve seen in domain after domain, deep learning excels at processing noisy signals with structured regularities—regularities like human language. State-of-the-art error rates suddenly look quite promising. And when fidelity improves, the design parameters of sensors become less constrained.

Sensing modalities: an overview

So you’re talking, but without talking. How can we possibly interpret this sort of speech? I’ll begin with a schematic overview, then we’ll dig into specifics in the modalities which seem more feasible.

Neural. Speech starts in the brain. We can intercept those cortical signals with intracranial implants, or with sensors arranged externally against the scalp. I don’t expect these systems to become relevant for consumers anytime soon. Implants require surgery, and scalp-based EEG sensors are cumbersome and too lossy at present for arbitrary speech (see e.g. Gonzalez-Lopez et al, 2020).

Muscular. From the brain, signals associated with speech travel to muscles in our jaw, lips, tongue, and throat. We can intercept the electrical activation of these muscles with electromyography (EMG). In some locations, we can use “surface” EMG sensors placed against the skin above those muscles. These are highly sensitive, but—since they need to be mounted on the skin—fairly obtrusive. Alternatively, we can measure the motion of these muscles with accelerometers, magnetic sensors, and piezoelectric sensors. Or we can measure that motion indirectly through imaging: video, ultrasound, radar, and so on.

Acoustic. Finally, all that muscle activity results in speech. Or at least, it would if you were speaking normally. If you don’t vibrate your vocal cords, your speech will be very quiet—but perhaps still interpretable by sensitive microphones or vibration sensors.

Let’s take a look at some specific systems which seem more promising for near-term consumer applications.

Visual speech recognition, a.k.a. lip reading

Neural networks can recognize gestures through solid walls (Li et al, 2019). By comparison, lip reading seems like a piece of cake! The technical term of art for this task is “visual speech recognition”. No surprise: the field has made rapid progress, most recently doubling accuracy against a standard benchmark over just three years (see review by Sheng et al, 2022).

One way to evaluate the accuracy of text input systems is with the “word error rate”, which is defined as the number of errors (substitutions, deletions, and insertions) divided by the number of words in the original speech. For example, suppose I speak ten words. My transcription software misses one completely and mis-identifies the word “total” as “too tally”. The word error rate in that example would be 0.3—one deletion, one substitution, and one insertion, divided by ten words. For reference, the Android dictation service has a word error rate of around 0.2-0.3 (Koenecke et al, 2020). Note that lower scores are better for this metric.
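
To make that arithmetic concrete, here's a minimal sketch of the calculation in Python. The function name and the two sentences are my own illustrative inventions, not taken from any of the papers cited here. It assumes words are split on whitespace and that errors are counted using the minimum number of word-level edits (a standard edit-distance alignment).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical ten-word utterance: "quite" is dropped entirely, and
# "total" is transcribed as "too tally".
spoken = "the total cost of the new sensor was quite low"
transcribed = "the too tally cost of the new sensor was low"
print(word_error_rate(spoken, transcribed))  # 0.3
```

Note that because insertions count as errors, this metric can exceed 1.0 when the transcript contains many spurious words.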

The current state-of-the-art model (Prajwal et al, 2021) achieved word error rates of 0.23-0.31 against a data set of subtitled footage from BBC programs and TED talks. So at least for professionally-produced footage, we’re already at roughly the accuracy of consumer speech recognition software. There seems to be plenty of room for improvement: this is a relatively small model by deep learning standards, trained in two weeks on 4 GPUs. And they’re using GPT-2 as an auxiliary language model to choose from candidate sentences. Presumably newer language models would perform even better.

The main technical limitation I see is that this model would need careful tuning and compression to run in realtime, rather than as an offline batch operation. But my impression is that this looks quite achievable.

Speaking more practically, suppose we have reliable lip reading. What does this mean, in terms of form factors and contexts? How would we actually deploy it?

One obviously relevant posture is the standard smartphone stance: one arm outstretched, face awash in the screen’s sickly glow. The front-facing camera is in a great position to read your lips. But if you’re holding your phone anyway, I’m not sure this buys us much relative to a software keyboard. One advantage is that you wouldn’t need to actually look at the screen. This seems fairly meager.

I can also strew cameras around my house, pointed at where my lips might be. If my couch or armchair faces a television, its built-in camera might do the job. But I certainly don’t like the aesthetic of constant surveillance. For what it’s worth, I don’t buy “smart devices” if I can help it, and I install camera covers on my devices’ built-in cameras. This system would need to be extremely valuable for me to accept an always-on camera in my home.

One more intriguing angle: what if the camera’s mounted on my glasses? Elgharib et al (2020) demonstrated a system which infers a front-facing video feed from a wide-angle camera mounted sideways on the arm of a pair of glasses. It works remarkably well, after per-subject training.

I’m already willing to put on a wireless earbud when I’d like to use dictation. It’s easy to imagine putting on a special pair of glasses, or an attachment for my glasses, when I’d like to use silent speech.

Hearing very quiet speech

Another promising route is much more boring: what if we make a system which listens very carefully—so carefully that it can hear what you’re saying, even when other humans can’t?

This isn’t a new idea. The company Jawbone got its name from the technique: their flagship product used bone conduction microphones to improve speech quality. The military uses throat-mounted microphones (akin to stethoscopes) to improve signal in noisy environments like helicopters. Unfortunately, most of these systems still require the wearer to speak audibly. They just improve the quality of the resulting audio.

What if you could whisper? It’s probably good enough for a library or café. In 2017, Grozdić and colleagues achieved high accuracies for whispered speech recognition with ordinary microphones—albeit in the ideal conditions of a recording booth. Throat-mounted microphones should help in noisier environments, and early adaptations for whisper recognition look promising (Jou et al, 2004).

A whisper would still be a nuisance in my tiny home if someone else is in the same room. Also, extensive whispering may actually harm your vocal folds (Robin et al, 2006). One interesting alternative is SilentVoice (Fukumoto, 2018), which recognizes ingressive speech. In this format, you speak as if you’re whispering, but you move the articulatory muscles while inhaling rather than exhaling, and you don’t vibrate your vocal cords at all. This style of speech is much quieter than a whisper—practically inaudible to another nearby person in a silent room. Fukumoto achieved error rates of 0.1-0.24 with roughly 30 minutes of per-speaker adaptation data.

What I like about these modalities is that they use boring, widely available hardware, deployed in a convenient and unobtrusive format. Machine learning systems for audio-based speech recognition already run in realtime and are widely deployed. I’m a little less willing to put on a throat microphone than an earbud, but it’s not a deal-breaker.

Other modalities

If you dig into the literature around silent speech interfaces, you’ll notice that discussion overwhelmingly focuses on modalities I’ve ignored, rather than the ones described above. As far as I can tell, this is mostly due to a cultural focus on restoring communication to patients with speech disabilities. Much of this work is quite interesting, though in my view it’s much less applicable in consumer contexts.

One celebrated example is AlterEgo (Kapur et al, 2018), which recognizes speech by detecting the electrical activation of muscles in the face, using surface-mounted sensors. Because it requires no involvement of the vocal tract, it may help patients with disabilities serious enough to prevent them from using microphones or lip-reading devices (Kapur et al, 2020). But AlterEgo, like other myographic sensors, requires obtrusive placement on the face, and its lexicon is limited to a small number of pre-set phrases.

Perhaps I can interest you in TongueBoard (Li et al, 2019), a device in the form of a dental retainer studded with electronics? It tracks the motions of your tongue using capacitive touch sensors. Its discriminatory capacity is quite limited, with a supported lexicon of 15 words.

Unlike lip-reading and microphone-based modalities, both AlterEgo and TongueBoard enable invisible speech. You can mouth the words without opening your mouth. Good for spies!

If you’re game for intracortical implants, you can get excellent results these days. The state-of-the-art system (Willett et al, 2021) manages 90 characters per minute (~18 English words, at the conventional five characters per word) at very high accuracies. This is great news for paralyzed patients, but less relevant to consumers.

There are of course many more types of sensor systems, but you get the picture. None of these seem promising for consumer applications anytime soon.

Opportunities for silent speech

What would we do with a silent speech system? What does it enable that normal dictation systems do not?

We’ve already discussed unobtrusive note-taking while reading on the couch or in crowded environments. What about unobtrusive note-taking in a social setting? I bring a paper notebook to meetings in part because using my phone or laptop feels rude. I can imagine that physicians would love to take silent notes while performing an exam. I’d love to jot silent notes while on a walk with a friend—my preferred meeting format.

Full BCIs will enable synthetic telepathy: you and I will be able to chat with only our minds. Silent speech interfaces would enable a simpler form of this. With wireless earbuds providing ubiquitous unobtrusive audio, we have hands-free, screen-free, silent, bidirectional communications. What will we do with that? Well: as Vernor Vinge depicts in Rainbows End, classrooms will be utterly out of control. More seriously, I think the march towards increasing fidelity and ubiquity will continue to blur the lines between individuals. When the group chat is always playing in my mind, things will get weird. As someone who already struggles with the chaos of text-based group chat systems, this doesn’t exactly appeal… but it certainly intrigues.

More broadly, I have a sort of blind faith here. Progress in personal computing is so often presaged by new input and output modalities. The mouse unlocked the desktop GUI; trackpads enabled mainstream laptops; the touch screen sparked the mobile revolution; haptics give us screenless turn-by-turn directions; e-ink screens give us digital books on the beach; commoditized projectors make Dynamicland possible; etc etc. We don’t know yet what devices like Leap Motion or laser eye trackers are for. Likewise, I don’t know exactly what we’ll do with silent speech interfaces. But I think it’s worth finding out.