The things we want computers to do for us, they soon learn to do very well. Fifty years of personal computing have taught us that these devices can be portable, powerful, and compelling, a lesson we seem to re-learn with every major wave of innovation.
Thirty years ago, when virtual reality had become the hot new area for exploration and innovation in technology, systems that could generate the illusion of three-dimensional objects in space cost millions of dollars, took up as much space as a commercial refrigeration unit – and consumed ten times as much power. Those constraints meant that despite massive public interest, early VR systems could only be found in well-funded labs at NASA, Boeing, and deep in the US military. VR inspired, but remained inaccessible to almost everyone.
A modest but profound innovation in software transformed VR from an ‘over there’ technology to one now ‘everywhere’: independently, two software teams working in the UK developed ‘software-based’ rendering. No longer would VR require million-dollar computers to create and interact with a virtual world; instead, you could use a garden-variety PC from the mid 1990s. Within a few years that software had been incorporated into Microsoft Windows as ‘Direct3D’ – while the multi-billion-dollar builder of those million-dollar VR systems, Silicon Graphics, saw its market – and profits – collapse.
Today our smartphones draw virtual worlds to their screens with far more detail and precision than even the most powerful of those massive machines of a generation ago. Some of that achievement can be attributed to ‘Moore’s Law’ – a rule-of-thumb that predicted the doubling of computer power roughly every 24 months over the fifty years from 1965 to 2015. Yet Moore’s Law can’t explain all of the gains; the rest of that roughly million-fold improvement comes from a generation-long effort across both computer hardware and software to develop and optimise every part of the process of generating computer graphics – because those graphics had become essential to the allure of computing.
The original ‘first-person-shooter’ computer games – Wolfenstein 3D and DOOM – each harnessed highly optimised computer graphics to deliver real-time 3D to very ordinary PCs. Millions loved playing those games, and every commercial interest in the PC space – chip makers (Intel), PC builders (IBM), the operating system provider (Microsoft) and the games publishers (Activision, EIDOS and many others) – leaned into the opportunity. Computers transformed into true multimedia machines, capable of effortlessly blending sound, video, 3D computer graphics and rich interactivity to produce gaming experiences that continue to astound and delight billions who stare down into their smartphones to catch a Pokémon, call out to their teammates in Call of Duty, and fight to the last in Fortnite.
History never repeats – but it does rhyme. The sort of excitement that enveloped virtual reality thirty years ago has landed on another shiny new technology – ‘Generative AI’. In the middle of 2022, artificial intelligence startup OpenAI unveiled ‘DALL-E’, a piece of software that translated text ‘prompts’ into images, drawing upon millions of images scraped from every corner of the Internet to generate works that had never existed before, turning prompts into pictures. It seemed almost magical – when had a computer ever been so creative? – and produced a flurry of introspection about the nature of art, the role of the artist, and abuse of copyright. Those questions remain open – and grow more difficult to ignore.
Making DALL-E work requires two very substantial efforts: first, collecting and reducing all of those images into a ‘checkpoint’ model; second, translating a text prompt into a ‘path’ through that model – a mathematical process that generates a unique image. The creation of the DALL-E checkpoint model took many weeks and a massive array of cloud computers, each of them equipped with the very latest bits of silicon to accelerate the intense mathematical calculations required to add an image to the model. Translating a prompt into an image takes less time – but is no less computationally difficult. When OpenAI opened DALL-E to the public, they rented a massive amount of cloud computing resources from Microsoft to ensure they would be able to meet the demands of a public mesmerised by these new machine-generated images. Like those VR systems of the early 1990s, ‘Generative AI’ needed big, expensive systems to run – keeping it out of the hands of the public.
A modest but profound innovation in software transformed Generative AI from an ‘over there’ technology to one now ‘everywhere’: a small startup known as Stability AI created ‘Stable Diffusion’, a clever bit of code that brought the same prompt-to-image algorithm from DALL-E’s massive cloud network of computers to a slightly-better-than-average PC. No longer would anyone need a multi-million dollar network of cloud computers to explore this new frontier in creative computing; instead, you could use a $2000 PC. Stability AI released Stable Diffusion as open source – predictably, within a few weeks other clever programmers had further optimised their code. Before Christmas of 2022 – less than six months after DALL-E stunned the world – Generative AI apps for smartphones and tablets became widely available.
Fortunately, OpenAI had more to share than DALL-E. In early December 2022, the firm unveiled ChatGPT – a friendly, smart, and conversationally aware “chatbot”. In the few months since its release, it’s become incredibly popular – changing our expectations for how we interact with computers. However, ChatGPT is even more resource intensive than DALL-E. In another two-step process, OpenAI first ‘reads’ the whole of the Internet (not every last word, but the vast majority of them), and from this ‘trains’ a ‘Large Language Model’ or LLM. That training process – given the trillions of words we’ve written and posted – consumes huge computing resources over a period of weeks.
After all of the Internet has been ‘reduced’ to a large language model, a text ‘prompt’ from a user then initiates a ‘path’ through the LLM, which generates some ‘chatty’ output ‘tokens’ that get written back to the user. The LLM is so big, and that process of tracing a path through it to generate tokens so mathematically intensive, that it’s estimated to cost perhaps half a cent of cloud computing resources per response.
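That ‘path’ through the model is, at heart, a loop: starting from the prompt, the model repeatedly picks a likely next token, appends it to the output, and feeds it back in, until it decides to stop. A real LLM weighs billions of parameters at every step; the sketch below shrinks that idea down to a toy, hard-coded ‘model’ of word-pair frequencies. Everything here – the vocabulary, the weights, the function names – is invented purely for illustration, not drawn from OpenAI’s actual systems.

```python
import random

# A toy "language model": for each token, the tokens that may follow it,
# with weights standing in for what training would have learned.
# A real LLM replaces this small table with billions of parameters.
BIGRAMS = {
    "<start>": {"the": 3, "a": 1},
    "the":     {"computer": 2, "model": 1},
    "a":       {"computer": 1},
    "computer": {"answers": 2, "<end>": 1},
    "model":   {"answers": 1},
    "answers": {"<end>": 1},
}

def next_token(current, rng):
    """Sample the next token from the weighted choices after `current`."""
    choices = BIGRAMS[current]
    tokens = list(choices)
    weights = [choices[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(rng=None, max_tokens=10):
    """Trace a 'path' through the model, emitting tokens until <end>."""
    rng = rng or random.Random()
    token, output = "<start>", []
    for _ in range(max_tokens):
        token = next_token(token, rng)
        if token == "<end>":
            break
        output.append(token)
    return " ".join(output)
```

Each call to `generate` traces one path and so produces one response; the expense of a real system lies in the enormous amount of arithmetic hidden inside every single `next_token` step.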
A modest but profound innovation in software transformed Large Language Models from an ‘over there’ technology to one now to be ‘everywhere’…
Wait… am I repeating myself?
In the first weeks of 2023, a series of papers and technical innovations shared by some clever programmers showed that we’re rapidly tracing the same path we saw with real-time 3D graphics and generative AI. These techniques make it possible for pretty much anyone – with enough time and help – to create their own large language models, and that anyone – with a sufficiently powerful PC – should be able to converse with that model on their own computer. No cloud required, no millions of dollars of infrastructure, all of this technical wizardry reduced to a clever piece of software.
We haven’t quite reached this point yet, but given the trajectory we appear to be tracing – both with Generative AI and with Large Language Models – we can be confident that long before the end of 2023 we’ll have something remarkably similar to ChatGPT running as an app on our smartphones, tablets and PCs. That’s going to be extraordinary – each of us will be acquiring our own personal ‘friend’, like Samantha in Spike Jonze’s film Her. We’ll be engaging in a lifelong conversation with a piece of software that can be funny, profound – and profoundly wrong.
Given that this science fiction future is now ‘closer than it appears’, we need to have a deep think about how we can use these new ‘personalities’ to help us be more human, more connected, and live more meaningful lives. That’s the subtle subtext of Her – and I reckon Jonze was onto something.