Andy Matuschak

What does spatial computing want to become?

Added 2024-03-01 06:42:59 +0000 UTC

I spent February trying something new: a creative holiday of sorts, occasioned by the Apple Vision Pro’s release. I gave the month over to experimentation in this new medium. To free myself to play, I made a rule that I wasn’t allowed to work on any past ideas or projects. Very coarsely, I wanted to know: what new questions and ideas does this system provoke in me? What new space of mediums becomes possible with high-quality passthrough, world tracking, and gaze-based input? What other affordances must be added—or discarded—to produce a medium of my dreams? A few weeks later, I can’t yet answer these questions to my satisfaction, but I’ll share what I’ve observed so far as a newcomer to this area.

Masks, glasses, and ubiquitous computing

A sacrificial form factor for living in the future

The Vision Pro uses a bristling camera array to show the outside world on the inside. This technique is called “passthrough”. It’s a huge improvement on older headsets, which completely blocked out the world. Still, like many, I don’t like the idea of spending hours each day with an opaque computer on my face. That said: I do wear glasses every waking hour of every day. If all this technology could someday be compressed into normal-ish glasses, I’d happily wear them everywhere.

So, one way to view passthrough-based devices is as disposable stepping-stones to design and prototype for the glasses they’re supposed to become. Today’s transparent headsets, like Microsoft’s HoloLens, are so limited that I find it tough to get inspired as a designer. Passthrough lets me pretend to live in a possible future with ultra-high-quality transparent displays, and to explore what I might create in that future.

But there’s another possible path, arguably more compelling: the future computer’s not on your face, covering the world in private hallucinations. It’s out there in your environment, through some combination of pervasive chips, screens, sensors, projectors, and actuators. This is the vision of ubiquitous computing (“ubicomp”) and the agendas which sprang from it, with Dynamicland the most recent exemplar:

In this world, you’re surrounded by objects which can sense and respond. The blueprint lets you try and compare alternatives; the sticky notes remember their prior arrangements; your kitchen prep bowls show you what should go where. You’re interacting with physical objects, so you recover the subtle feedback of tactility and the pleasure of embodiment. This vision promotes social richness, not social isolation: the computational material is out there in the world, not hidden on your face, so it can be co-viewed and co-created just like any physical object. In this vision, computers recede into the background, like the electricity running invisibly through your walls, rather than a screen directly mediating your view of reality.

Those who favor the ubicomp agenda usually scorn headsets. In the ’91 paper originating that term, Mark Weiser writes:

Perhaps most diametrically opposed to our vision is the notion of virtual reality, which attempts to make a world inside the computer. Users don special goggles that project an artificial scene onto their eyes; they wear gloves or even bodysuits that sense their motions and gestures so that they can move about and manipulate virtual objects.

…

Even today, people holed up in windowless offices before glowing computer screens may not see their fellows for the better part of each day. And in virtual reality, the outside world and all its inhabitants effectively cease to exist. Ubiquitous computers, in contrast, reside in the human world and pose no barrier to personal interactions.

Mark’s talking about traditional virtual reality headsets, but much of his opposition still applies to augmented reality glasses. He wants a world where people move around and interact with each other and with physical objects: “only when things disappear in this way are we freed to use them without thinking and so to focus beyond them on new goals.” Tactility, sociality, and physical fidelity are all missing in the world of magic glasses.

It’s ironic, then, that one way to view the Vision Pro and its ilk is as a way to prototype for a future along the lines of ubicomp—one where computational surfaces are too cheap to meter, and no one’s wearing anything on their face.

Simulating ubicomp with the headset requires a little more imagination than simulating augmented glasses. Virtual objects have no tactility, for instance. This leads me in different design directions: instead of floating windows, prototypes in this mindset involve tracking physical objects and projecting simulated behavior onto them using the headset display. You could call that “mixed reality”, but the difference in framing really is powerful. I have different ideas when I think of my explorations in terms of future augmented reality glasses, versus a ubicomp future without headsets.

What passthrough can do that glasses and ubicomp can’t

Now, let’s reverse our examination. In what sense is the passthrough-based form factor not just a “worse-but-tractable” version of future augmented reality glasses or ubicomp environments?

One obvious answer is that only opaque headsets can block out the world completely, whether to render a fully virtual scene, or just to obscure your unpleasant surroundings when in a plane or open-floor-plan office.

More interesting, to me, is that passthrough-based headsets can distort reality. They’re capturing the external world and reprojecting it internally, and so they can change what is reprojected. The Vision Pro already does this in subtle ways: it relights your hands when using immersive environments; it simulates dimmed lighting when viewing media; it casts virtual emissive light from videos onto your walls.

How might it be useful to more dramatically warp the external world?

I play the piano. Recently I’ve been trying some honky tonk, a style which features lots of big jumps in the left hand. The trouble is that I need to look at my hand to aim those leaps while also reading the sheet music. I end up whiplashing my head back and forth. Often I’ll need to memorize a passage—so that I don’t need to see the sheet music anymore—before I can really work on the hand mechanics.

So I built this prototype, which “warps” a view of my hands on the keyboard, superimposing them just under the sheet music:

It’s not quite workable: the hand tracking is a little too imprecise and the world tracking a little too unstable. But it’s right on the edge. In this prototype, I use hand tracking data to render virtual “hands” and a virtual keyboard, but if I had access to the device’s cameras, I could superimpose the downward view directly into the forward view, simulating a strange sort of prism lens.

Now, some pianists would say that instead of helping myself see my hands, I should practice knowing where they are without looking. One of my teachers built a device for that purpose: a long board which would cover the keyboard, with side pieces to lift it a few inches above my hands. Playing with this really did force me to build stronger proprioception. The trouble was that every ten seconds or so, I’d lose my place on the keyboard. I’d need to reorient myself, but a five-foot board is cumbersome to shift for a quick peek. And of course, to move the board, I needed to use at least one hand, “losing” its position.

Another kind of distortion available to passthrough-based devices is subtraction: dynamically redacting portions of the scene. In this prototype, my view of the keyboard is obscured, but if I gaze at my hands, they become visible as I lean towards them:

I imagine that subtractive interactions like this could be useful for learning to operate other machines by touch, or by sound. One could also use this technique to create progressive scaffolding for physical objects. Suppose you bought a fancy digital camera with many knobs and buttons, but you don’t know how to shoot manual exposures. A training system could blur out everything but the shutter button and the focus dial. Then, once you’re comfortable, it could reveal the aperture dial, and so on.

Actually, subtraction seems more possible for augmented glasses than warping. Maybe you could implement it with an electrically switchable film, like those fancy conference rooms with glass walls that become opaque when you flip a switch. I don’t know if those films can be made optically transparent enough when sitting so close to your eyes.

What headsets can do that ubicomp can’t do

It’s tempting to think of all headsets—even futuristic glasses—as a worse version of ubicomp visions, which would replace elaborate private hallucinations with physical dynamic media that we can touch and share, together, out in the world. So it’s helpful to ask: in what ways is this not true? What good can headsets do that ubicomp systems can’t, even if we have holographic displays and computers too cheap to meter?

One obvious answer is privacy. Headsets enforce this as a default much more aggressively than I’d usually want, but privacy does have its place. One ed-tech founder has told me that when students are trying something for the first time, they feel much more comfortable when they can keep their confused work private from their classmates.

A more stimulating consequence of privacy is asymmetry. That is, the same environment can present you and me with different dynamic representations. In card games, you often hide your hand from others; with a headset, you can hide big objects from each other. Asymmetry can make for interesting collaboration: in Keep Talking and Nobody Explodes, one player needs to defuse a complex bomb; the others can’t see or touch it, but they have the information needed to defuse it in a big technical manual; shouting ensues. More practically, if you and I are collaborating on a scale model of an architectural plan, but we have different specialties, each of us might want to show or hide different layers of the plan (framing, plumbing, electrical, etc.), even as we work in the same physical space. If two trainees are collaborating on a procedure in an industrial plant, but one is more experienced than the other, it may be important to show different levels of scaffolded overlays on the machines.

Switching gears: gargantuan interface elements are more natural to headsets (though they can sometimes be implemented in ubicomp). For example, in the following sketch with Gray Crawford, we explored his idea of presenting a physical log of a multi-person conversation in an enormous strip stretching up through the ceiling and floor.

Another feature of that prototype is free-floating dynamic 3D elements. Those also seem difficult to achieve in a ubicomp system, unless Star Trek-style projected holograms become physically possible.

Headsets make it possible to anchor interface elements to a user’s head pose. These could be heads-up displays (a clock, a reminder of my next task), or more elaborate elements (keeping my sheet music visible no matter where I look). These kinds of interfaces often appear in science fiction designs, but I’m not sure how useful they actually are. A smartwatch can handle many of these needs, often more naturally. For other cases, in a ubicomp world, you could just make a fancy hat, like a bicyclist’s helmet with attached mirrors.

Interactions in space

Revolutions in computing often coincide with new methods for input (light pen, mouse, touch, gaze?) or output (teletype, bitmap display, smart earbuds?, head-mounted displays?). Spatial computing involves big changes on both sides. We’ve focused on output so far in this discussion; now let’s consider input.

Gaze and hand tracking

The detail which most surprised me at the Vision Pro launch announcement was Apple’s decisive dependence on gaze tracking: the system’s central interaction is look-and-pinch.

This model is more analogous to the mouse’s point-and-click than it is to touch: much of the time it feels like indirect interaction, like moving a pointer (with your eyes) and clicking it (with your hand), rather than iOS’s direct interaction, which feels like reaching out and directly manipulating an element. I think this is because we don’t directly act on objects with our eyes in the world. If I want to press a physical button, I reach out and press it. On iOS, I do the same. On visionOS, I look at it and, with my hand at my side, make a gesture. It’s a method of indirect action, like clicking a mouse at my side. Likewise, if I want to slide a physical sheet of paper, I might move it with my finger. I may or may not be looking at it. Scrolling on iOS works the same way, with 1:1 tracking. By contrast, if I want to scroll a visionOS view of a sheet of paper, I look at it and move my wrist, which forms a sort of loose elastic connection with the content—looser than scrolling on a trackpad.

In terms of creating a direct connection between intention and action, the indirection feels like a step backwards to me. Though, of course, I understand the decision ergonomically. The consequence is that even though the device is “hand controlled”, I’m not really using my hands in the rich sense that I use them in the physical world, or even in the dextrous sense that I use them on a smartphone. The feeling is more like an assistive peripheral, using gaze and hand tracking to implement a Bluetooth mouse. It’s astounding in its straightforwardness; it’s an astounding feat of engineering. Yet I’m left wondering what else we might do with these miraculous sensors.

Might gaze tracking unlock some more alien possibilities?

As a reader, I’d be very interested in leaving traces of my gaze on the pages. If I needed to re-read a sentence multiple times, that would jump right out. If I skipped over some material completely, that would also be clear. Perhaps ideas along these lines could be incorporated into the BookBridge project I’ve previously discussed.
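To make the gaze-trace idea concrete, here is a minimal sketch of the bookkeeping it might involve. It assumes a hypothetical stream of fixations that something else has already mapped to sentence indices; the function name and input format are illustrative, not part of any real gaze API.

```python
from collections import Counter

def reading_trace(fixations, sentence_count):
    """Summarize which sentences were re-read or skipped.

    fixations: sentence indices in the order the gaze landed on them
    (hypothetical input; mapping raw gaze points to sentences is assumed
    to happen elsewhere).
    """
    visits = Counter()
    previous = None
    for index in fixations:
        # Count a visit only when the gaze enters a sentence, so a long
        # dwell on one sentence is not mistaken for re-reading.
        if index != previous:
            visits[index] += 1
        previous = index
    reread = sorted(i for i, n in visits.items() if n > 1)
    skipped = [i for i in range(sentence_count) if visits[i] == 0]
    return reread, skipped
```

A renderer could then tint the re-read sentences and dim the skipped ones, so both jump out on the page.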

In many types of meditation, one is instructed to adopt a soft, unfocused gaze. I can imagine creating a biofeedback interface which would help meditators enter and remain in that posture. One could also use gaze information to provide biofeedback around certain states of distraction.

Or, imagine falling into a fractal: whenever your gaze holds still on any part of it for a moment, you zoom into that part. The more you look, the more you fall into that spot. Yet anywhere you look, there’s always more detail, always unfolding, forever.
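A rough sketch of how that dwell-to-zoom loop might work, frame by frame. Everything here is an assumption for illustration: the state layout, the normalized gaze coordinates, and the threshold values are guesses, not tuned or taken from a real system.

```python
import math

def step_zoom(state, gaze, dt, radius=0.05, dwell=0.4, rate=1.5):
    """Advance a dwell-to-zoom state by one frame.

    state: dict with 'anchor' (x, y), 'held' (seconds), 'zoom' (scale factor).
    gaze: current gaze point (x, y) in normalized view coordinates.
    radius, dwell, and rate are illustrative guesses, not tuned values.
    """
    ax, ay = state["anchor"]
    if math.hypot(gaze[0] - ax, gaze[1] - ay) > radius:
        # Gaze moved away: restart the dwell timer at the new spot.
        return {"anchor": gaze, "held": 0.0, "zoom": state["zoom"]}
    held = state["held"] + dt
    zoom = state["zoom"]
    if held >= dwell:
        # Exponential growth keeps the zoom speed feeling constant.
        zoom *= math.exp(rate * dt)
    return {"anchor": state["anchor"], "held": held, "zoom": zoom}
```

Calling this once per frame means the zoom only accelerates while the gaze stays inside a small radius, and pauses the moment the eyes wander.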

Ken Pfeuffer, lead author on the paper originating the core look-and-pinch interaction in 2017, has proposed a variety of other interesting multi-modal interactions. In “Gaze-Shifting”, he and co-authors suggest how a pen and gaze can be combined to permit complex indirect interactions without “losing” the pen’s location:

And in “PalmGazer”, he demonstrates a gaze-driven menu anchored in the user’s hand, freeing the other hand to perform simultaneous actions in space (like drawing or sculpting).

Head tracking

The headset makes continuous realtime estimates of the user’s head pose in 3D space. This information is mostly used to render virtual interface elements within the environment. If you turn your head, the rendering of the virtual elements updates along with you, so they appear to stay put in the room.

But your head motion is also itself an input channel on these devices. What interesting new interactions does that make possible?

I was tickled by Matt Webb’s 2022 suggestion that leaning might make a great interaction for headsets. We naturally lean forward when we want to see an object in detail. For virtual objects, that could mean not just optical zoom, as it does in the physical world, but also semantic zoom. That is, objects can change their form as you lean closer to show more or different channels of information.

When I came up with the piano keyboard cover concept I showed earlier, I was at first unsure how I would allow the user to toggle the keyboard cover. A system standard pinch-tap would mean “losing” one hand’s pose. I thought briefly about connecting it to a foot pedal I use to flip pages in sheet music. Then I realized: leaning! To see your hands through the cover, just lean towards them. It feels incredibly natural and direct—much more “native to the medium” than look-and-pinch (which is of course much more flexible).
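One way the lean-to-reveal mapping could be sketched: convert the distance between the head and the keyboard into an opacity for the hands. This is an illustrative guess rather than the prototype's actual code; the function names and the `far` and `near` thresholds are made up.

```python
def smoothstep(t):
    """Ease a value in [0, 1] so the fade starts and ends gently."""
    t = min(1.0, max(0.0, t))
    return t * t * (3.0 - 2.0 * t)

def hands_opacity(head_to_keys_m, far=0.70, near=0.45):
    """Map head-to-keyboard distance (meters) to hand opacity in [0, 1].

    far and near are made-up thresholds: at or beyond `far` the cover is
    fully opaque (hands hidden); at or within `near` the hands are fully
    visible.
    """
    lean = (far - head_to_keys_m) / (far - near)
    return smoothstep(lean)
```

Evaluating this every frame from the tracked head pose needs no explicit toggle at all: sitting upright hides the hands, and any amount of lean reveals them proportionally.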

Leaning is also a continuous interaction. As I move closer to the piano, my hands fade in. If I just need a subtle hint, I can subtly lean. If I’m totally lost, I lean more. I get continuous feedback at every frame. There’s something ideological here for me: I think continuous interactions are fascinating and underexplored. One of my favorite interface designs ever is Alessandro Sabatelli’s magnificent 2013 leveling tool, now integrated into the iPhone’s “Measure” app (see GIF below). Chan Karunamuni’s 2017 gesture-based multitasking system is a work of astounding originality and craftsmanship. Of my own collaborations at Apple, my favorite projects involved continuous interaction design: parallax, the back navigation gesture, the 3D page curl, etc. visionOS is almost entirely discrete, and I want more of the continuous.

The other subtle miracle of head tracking is that it trivializes locomotion in mixed reality spaces. Virtual reality designers and researchers have spent decades trying to figure out how to let users move around in virtual environments. If you tie motion to a joystick, like in a video game, most people quickly become nauseous: the disagreement between the eyes’ perception of motion and the body’s (lack of) physical motion is very uncomfortable. So there’s a zoo of discrete locomotion systems. Often you point to where you want to go, then teleport with a momentary fade-to-black for comfort. Some continuous solutions have managed to subdue nausea by linking motion of the arm to motion of the body (for example, see Gray Crawford’s “waft” locomotion).

My point here is that moving around in VR is a surprisingly non-trivial design problem. But: with a working passthrough-based headset, you can walk around by just physically walking around. Yes, there’s a huge amount of computational complexity hiding there. From a design perspective, though, this “trivial” solution gracefully resolves a decades-long research question for most cases I care about.

In the traditional VR framing, the goal is to allow exploration of an arbitrarily large virtual world. So your home’s walls quickly become a problem, unless you have a baroque omnidirectional treadmill. And you’re blocking out the real world, so you’ll probably trip over furniture. So most VR experiences have you sit physically still, or move only within a small radius, while using indirect interaction to let you virtually move further when necessary.

The miracle of passthrough is not that it solves these problems. Arbitrarily large virtual worlds still require elaborate solutions. But, it turns out, most of what I want to do with head-mounted displays can be happily confined to the bounds of my living room. I’m happy viewing and positioning dynamic media elements in my home: I don’t (usually) need to be transported to some fully virtual environment. I can adapt the virtual environment to the constraints of my physical environment. All this reframes the problem. I don’t need arbitrary locomotion. I just need to be able to move about my home, and position elements within it.

Then we can take advantage of the user’s location within the room as an input channel of its own. The prototype below is of an immersive choir experience.

If you stand in the center of the room, you’re surrounded by voices from all parts. (This, on its own, is a remarkable experience!) But as you move towards any one part, it will become louder and the others much quieter, as if you’re off in a corner with your fellow tenors learning the part together. If there’s a place where your part has an important duet with another, you can move towards them while singing your own part to hear how the harmony shifts. The whole thing is a continuous interaction: you can sing along with the full choir, then shift a little to hear the sopranos better for one bar, then move back, then leap over to the tenors for a tricky passage.

This is the kind of demo you’d often see in historical virtual reality papers. Because headsets were opaque, the designer would have to create a virtual environment for you to move around in. People called this “room-scale VR”, and you’d traditionally prepare for it by piling up your furniture against the walls of your living room, so you didn’t run into anything. But there’s no reason this demo needs to be in a virtual environment. By situating it in my real living room, I don’t need to move any furniture. I naturally avoid any obstacles in my path.

(Incidentally, this demo is another example of warping external reality. Earlier, in the prototype which displayed a view of my hands alongside my sheet music, I warped visual space. In this demo, I warp acoustic space: even though the parts are only a few feet from each other, I exaggerate their effective acoustic distance, so that when you’re near one, it’s as if the others are very far away.)
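The acoustic warping can be sketched as a per-part gain computed from the listener's position. This is not necessarily how the prototype does it: the inverse-distance falloff, the `exaggeration` factor, and the `min_distance` clamp are all assumptions chosen for illustration.

```python
import math

def part_gains(listener, parts, exaggeration=4.0, min_distance=0.3):
    """Return a gain in (0, 1] for each choir part, normalized so the
    nearest part plays at full volume.

    listener: (x, z) position in meters on the floor plane.
    parts: dict mapping part name to its (x, z) position.
    exaggeration: hypothetical factor that stretches real distances so
    parts a few feet apart sound much farther away.
    """
    raw = {}
    for name, (px, pz) in parts.items():
        d = math.hypot(listener[0] - px, listener[1] - pz)
        # Stretch the physical distance, then apply inverse-distance falloff.
        effective = max(min_distance, d * exaggeration)
        raw[name] = 1.0 / effective
    loudest = max(raw.values())
    return {name: g / loudest for name, g in raw.items()}
```

With the four parts placed in the corners of a room, standing in the center yields equal gains for everyone, while walking up to the tenors leaves them at full volume and pushes the others far into the background.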

Cameras and object tracking

Apple assiduously avoids the terms “virtual reality”, “augmented reality”, and “headset”, preferring “spatial computing”. What do Apple’s designers think that phrase means? I think the charitable interpretation is that they aren’t sure yet, or maybe that Apple simply hasn’t shown us yet.

What we see, so far, is that spatial computing is computing as you know it—except situated in your physical space. You can open iPad apps and position them as glowing panels around your home. There’s little meaningful relationship between the app and the physical environment, except the world-tracked position. We could think of this as “spatial” in the sense of “having a lot more space”, but in practice, with the display resolution and gaze precision we have today, it’s tough to productively arrange more interface real estate on this device than on my Mac.

What does “spatial computing” want to become? My provisional answer: I want to break down the walls between the dynamic world and the physical one. I want to imbue objects in my environment with the magic of computation. I want the opposite of arbitrarily-positioned floating windows with no semantic relationship to my environment. I want to interact with dynamic media in physical space, mostly by way of physical objects, using my full body and sensorium, alongside other people. That is, I more or less subscribe to the ubicomp vision, but I don’t have a strong opinion about whether I need to wear a headset to get there.

Very concretely, I’d like to create doing-centric explanatory mediums which involve objects and tasks in the physical world. I’d love to learn how to use machine tools by way of dynamic explanations which respond interactively to my actions, as in AdapTutAR:

Here’s where the cameras come in. Obviously, they’re what make the hand gestures and the world tracking possible. But they also enable the device—in principle—to understand the objects in my environment. They can provide the basic pose information I need to imbue physical objects with dynamic hallucinations. This is “object tracking”, and Apple doesn’t yet offer that functionality on the Vision Pro, though other devices do at various levels of sophistication. My sense is that frontier transformer models like Cutie are close to basically solving this problem, if not quite yet in realtime.

But the cameras can do more than track objects’ poses. In principle, they can see which keys I’m playing on the piano, infer what I’m cutting on my cutting board, spot which page I have open in my book. I want to be able to write on my whiteboard, with a marker, as a colleague writes on their whiteboard a thousand miles away, and to see their marks alongside mine as if we’re writing on the same wall.

For now, these deeper analyses all require purpose-built computer vision pipelines. That doesn’t align well with Apple’s privacy strategy, which tries to limit the need to expose users’ live camera data by offering generic models for analyzing the user’s hands, 2D image tracking, plane detection, and so on. Ideally, the kind of contentful sensing I’ve described would be possible without deep computer vision expertise. That way, we’ll get a more interesting range of experiments. Perhaps future large multimodal models will be able to perform realtime video analysis tasks with a developer prompt like “detect which key I’m touching on the piano.”

Given my interest in deeply fusing physical objects with dynamic media, the Apple Vision Pro is a surprisingly limited platform for experimentation. It just doesn’t expose the data I need to explore most concepts. Happily, I’ve learned enough this month to form a much clearer sense of my interests in this space. I find myself full of new ideas and curiosity—I’ll call that a successful creative holiday.

————————

My thanks to Bryan Clark, Gray Crawford, Laura Deming, Luke Miles, and Michael Nielsen for helpful conversation, and to the Dynamicland team for its deep influence on my thinking in this space (however much they might dislike headsets!).