After at least six years of rumours and countless billions spent on research and development, in early June Apple unveiled Vision Pro, its next-generation ‘augmented reality’ headset. While it reflects Apple’s unique design aesthetic, it has a long way to go before it becomes the ‘augmented reality’ device that Apple claims it to be.
The history of augmented reality stretches back nearly sixty years, to a brief, evocative white paper written by Ivan Sutherland, then the director of ARPA’s Information Processing Techniques Office. In ‘The Ultimate Display’, Sutherland – the father of all interactive computing – imagined reversing the relationship between the user of a computer and the computer’s display.
Rather than having the display be somewhere ‘out there’ in front of the user, Sutherland suggested, why not create an environment analogous to the one Alice experiences in Through the Looking-Glass – where everything, however imaginary, simply appears?
In that early paper Sutherland suggested a display that might be able to track a user’s eyes and head movements: wherever the user looked, the display would follow. It would always sit between the user and the world, augmenting their view of reality.
It took Sutherland three years of research to produce the first prototype of his Ultimate Display. A ceiling-mounted armature supported – and sensed the movements of – a pair of tiny television displays that projected, like a set of binoculars, through half-silvered mirrors, blending computer-generated drawings with the real world.
A user would put their eyes up to this ‘Sword of Damocles’ (so named because it hung ominously above its user’s head), look in, into the computer-generated world, and through, onto the real world.
Although crude by any estimation, Sutherland’s prototype captured the essence of the ‘head-mounted display’ – a display that can be everywhere because it sits directly over the eyes, tracking their movements in space. But the Sword of Damocles went far beyond the private universe we associate with ‘virtual reality’ toys; by mixing computer-generated and real-world inputs, it created the first – and profoundly influential – ‘augmented reality’ system.
With so much to measure, integrate and deliver to the user, augmented reality proved extraordinarily difficult in practice. Nearly half a century passed before another system rivalled Sutherland’s first effort. In 2015, Microsoft stunned the technology world with its first-generation Hololens, a fully realised, portable and thoroughly capable augmented reality device.
Fifty years had seen the mechanical armature of the Sword of Damocles disappear into gyroscopic sensors and a type of software known as SLAM – Simultaneous Localisation And Mapping. SLAM uses real-time input from multiple sensors to calculate the shape of a space and, with that data, can precisely position the user within it. SLAM can be done as a one-off – the Hololens will often scan a space it doesn’t recognise – but SLAM is also performed continually, because users of augmented reality continuously move within changing spaces.
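To make that loop concrete, here is a minimal sketch in Python of the predict-and-correct cycle SLAM repeats every frame. The sensor inputs, landmark matching and blending gain are all illustrative assumptions – a caricature of the technique, not Hololens code:

```python
import numpy as np

def predict(pose, gyro_delta, translation_delta):
    """Dead-reckon the next pose from inertial input (the 'Localisation' half)."""
    heading = pose[2] + gyro_delta                     # integrate rotation
    x = pose[0] + translation_delta * np.cos(heading)  # integrate motion
    y = pose[1] + translation_delta * np.sin(heading)
    return np.array([x, y, heading])

def correct(pose, predicted_landmarks, observed_landmarks, gain=0.3):
    """Nudge the pose toward agreement with what the cameras actually saw
    (the 'Mapping' half, reduced here to a single blending step)."""
    error = (observed_landmarks - predicted_landmarks).mean(axis=0)
    pose = pose.copy()
    pose[:2] += gain * error  # partial correction keeps the estimate stable
    return pose

# One step of the continuous loop: predict from motion sensors, then
# correct against landmarks re-recognised in the camera frames.
pose = np.array([0.0, 0.0, 0.0])  # x, y, heading
pose = predict(pose, gyro_delta=0.02, translation_delta=0.1)
pose = correct(pose,
               predicted_landmarks=np.array([[1.0, 2.0], [3.0, 1.0]]),
               observed_landmarks=np.array([[1.05, 2.02], [3.04, 1.01]]))
print(pose)
```

Run continuously, frame after frame, this cycle is what lets a headset stay anchored in a room even as both the user and the room keep changing.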
To perform this continuous SLAMing of the real world, Hololens used another Microsoft technology: Kinect. Originally developed as a video game controller for Microsoft’s wildly popular Xbox console, Kinect uses ‘time of flight’ – measuring the time photons take to bounce back to its cameras – to generate a highly accurate map of the distance to every object in view. Kinect, plus a few black-and-white video cameras, gave Hololens enough raw input to do an effective SLAM calculation.
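The arithmetic at the heart of time of flight is pleasingly simple – distance is the speed of light multiplied by the round-trip time, halved. A sketch, assuming per-pixel round-trip times have already been measured:

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def depth_from_round_trip(round_trip_seconds):
    """Convert per-pixel photon round-trip times into distances.
    Each photon travels out and back, so halve the path length."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# A toy 2x2 'image' of round-trip times (seconds) becomes a depth map (metres).
round_trips = np.array([[1.0e-8, 1.2e-8],
                        [2.0e-8, 2.1e-8]])
print(depth_from_round_trip(round_trips))  # roughly 1.5 m to 3.1 m
```

The hard part is not this conversion but doing it accurately, at video rates, across hundreds of thousands of pixels at once.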
All the sensing needed to support a realistic augmented reality system means that every augmented reality device acts as a sophisticated surveillance device – by design. Call it augmented reality, and we’ll happily work alongside someone wearing one of Microsoft’s more refined second-generation Hololens headsets. Reframe it as a data-hungry, privacy-invading surveillance system, and people react far more suspiciously. A live camera turned upon us demands that we edit ourselves.
We saw this pushback more than a decade ago: when the early and crude Google Glass came to market, its users found themselves suddenly – and sometimes violently – ejected from spaces whose occupants wanted to keep them private.
Both Google Glass and Microsoft’s Hololens suffer from a design weakness that has so far proven impossible to overcome: neither works well in broad daylight. Our Sun is so bright, our eyes so sensitive, and – even after half a century of development – our displays so crude that the imagery of both Hololens and Glass simply disappears in sunshine.
It’s rumoured that Apple tried to solve this problem, but – at least for the moment – no display technology can compete with sunlight. This means that instead of Sutherland’s vision of a system that overlays its augmentations onto the real world, both the computer-generated world and the real world have to be simulated: sampled, mixed together, and presented in a display that is, in fact, fully immersive.
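‘Sampled, mixed together’ reduces, at its simplest, to per-pixel blending of a camera frame with a rendered frame. A toy sketch of that compositing step – the frames and mask here are stand-ins, not anything from Apple’s pipeline:

```python
import numpy as np

def composite(camera_frame, rendered_frame, alpha_mask):
    """Blend a sampled camera frame with computer-generated imagery.
    alpha_mask is 1.0 where the augmentation fully covers reality,
    0.0 where the real world should show through untouched."""
    alpha = alpha_mask[..., np.newaxis]  # broadcast over RGB channels
    return alpha * rendered_frame + (1.0 - alpha) * camera_frame

# Toy 4x4 frames: the 'real world' is mid-grey, the augmentation pure red,
# and the mask paints the augmentation onto the top half only.
camera = np.full((4, 4, 3), 0.5)
render = np.zeros((4, 4, 3))
render[..., 0] = 1.0
mask = np.zeros((4, 4))
mask[:2, :] = 1.0
frame = composite(camera, render, mask)
print(frame[0, 0])  # [1. 0. 0.]    - an augmented pixel
print(frame[3, 0])  # [0.5 0.5 0.5] - a passthrough pixel
```

Every pixel the wearer sees has passed through a computation like this one; nothing reaches the eye unmediated.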
Although Apple’s Vision Pro does a good job of faking augmented reality, it is really nothing more than the very best virtual reality system yet built. The space between virtual reality and augmented reality ‘spectacles’ – see-through, looking more like sunglasses than opaque ski goggles – defines the gap between the scope of Apple’s vision and the massive firm’s best effort. That a three-trillion-dollar business found this problem too hard to solve tells us something about the deep difficulties at the core of augmented reality: easy to understand, hard to manufacture.
That doesn’t mean Vision Pro is a failure. Far from it. Apple has answered some of the design issues raised by Microsoft’s Hololens. For example, Vision Pro uses two cameras to carefully watch the position and action of the user’s hands and fingers, supporting an elegant and simple library of common ‘gestures’ – the equivalent of the poking and pinching we do when using smartphones and tablets – so that users can operate comfortably within the ‘spatial computing’ environment. We won’t need the weird full-body interface manipulation presented in films like Minority Report. Instead, a few subtle, quiet gestures will achieve nearly any task.
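A pinch, for instance, can reduce to watching the distance between two tracked fingertips. A sketch under that assumption – the threshold and joint positions are illustrative, not Apple’s actual recogniser:

```python
import numpy as np

# Illustrative threshold: register a pinch when fingertips are ~1.5 cm apart.
PINCH_THRESHOLD_METRES = 0.015

def is_pinching(thumb_tip, index_tip):
    """Detect a pinch from two tracked fingertip positions (metres)."""
    gap = np.linalg.norm(np.asarray(thumb_tip) - np.asarray(index_tip))
    return gap < PINCH_THRESHOLD_METRES

# Fingertips nearly touching register as a pinch; spread apart, they don't.
print(is_pinching([0.100, 0.200, 0.300], [0.105, 0.200, 0.300]))  # True
print(is_pinching([0.100, 0.200, 0.300], [0.160, 0.200, 0.300]))  # False
```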
To make that work, Apple had to multiply the number of cameras embedded within the device, ringing it with them inside and out. Inward-facing cameras track the position of the user’s ‘foveal centres’ – the areas of highest-definition vision – in order to selectively render those parts of the computer-generated image first, and at the highest quality.
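In caricature, foveated rendering spends quality where the gaze lands and economises everywhere else. A sketch, with the radii and quality tiers as assumed values rather than anything Apple has published:

```python
import numpy as np

def quality_map(width, height, gaze_x, gaze_y,
                fovea_radius=120, mid_radius=320):
    """Assign a render-quality tier to every pixel by its distance from
    the tracked gaze point: full quality in the fovea, less farther out."""
    ys, xs = np.mgrid[0:height, 0:width]
    dist = np.hypot(xs - gaze_x, ys - gaze_y)
    tiers = np.full((height, width), 0.25)  # periphery: quarter quality
    tiers[dist < mid_radius] = 0.5          # mid field: half quality
    tiers[dist < fovea_radius] = 1.0        # foveal centre: full quality
    return tiers

# Wherever the eye-tracking cameras say the user is looking gets rendered
# at full resolution first; the periphery tolerates a cheaper pass.
tiers = quality_map(1920, 1080, gaze_x=960, gaze_y=540)
print(tiers[540, 960], tiers[540, 1200], tiers[0, 0])  # 1.0 0.5 0.25
```

The payoff is enormous: the eye only resolves fine detail in a tiny patch of its field of view, so most of every frame can be drawn cheaply without the wearer ever noticing.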
Outward-facing cameras use SLAM to generate a model of the room, but they also watch the user’s hands – and the movements of any other people in the space, so that visitors can ‘interrupt’ a Vision Pro user through a gentle blending of their presence into the immersive environment.
It’s all beautifully done – pure Apple magic – and yet it all sits atop the most complex bit of surveillance kit ever developed.
When we get to the second or third iteration of Vision Pro – and Apple deems it suitable for its billions of customers – we’ll dive headlong into a culture of surveillance unlike anything we’ve ever imagined. Each of us will be capturing ourselves and others all the time. Will those captures stay inside these devices, or will they find their way to Apple? And from Apple, could they end up at Meta, Microsoft, Google – or Tencent?
Before we don these exciting new toys we need to think carefully about what data they collect, why, and how that data might be used – lest, on the other side of the looking glass, we find ourselves prisoners in a panoptic dystopia.