ChatGPT and the Imagenet moment

Published on
December 12, 2022
Benedict Evans

A decade or so ago, systems based on something called ‘machine learning’ started producing really good results in Imagenet, a contest for computer vision researchers. Those researchers were excited, and then everyone else in AI was excited, and then, very quickly, so was everyone else in tech, as it became clear that this was a step change in what we could do with software that would generalise massively beyond cool demos for recognising cat pictures.

We might be going through a similar moment now around generative networks. 2m people have signed up to use ChatGPT, and a lot of people in tech are more than excited, and somehow even more excited than they were about using the same tech to make images a few weeks ago. How does this generalise? What kinds of things might turn into a generative ML problem? What does it mean for search (and why didn’t Google ship this)? Can it write code? Copy? Journalism? Analysis? And yet, conversely, it’s very easy to break it - to get it to say stuff that’s clearly wrong. The wave of enthusiasm around chat bots largely fizzled out as people realised their limitations, with Amazon slashing the Alexa team last month. What can we think about this, and what’s it doing?

The conceptual shift of machine learning, it seems to me, was to take a group of problems that are ‘easy for people to do, but hard for people to describe’ and turn them from logic problems into statistics problems. Instead of trying to write a series of logical tests to tell a photo of a cat from a photo of a dog, which sounded easy but never really worked, we give the computer a million samples of each and let it do the work to infer patterns in each set. Instead of people trying to write rules for the machine to apply to data, we give the data and the answers to the machine and it calculates the rules. This works tremendously well, and generalises far beyond images, but comes with the inherent limitation that such systems have no structural understanding of the question - they don’t necessarily have any concept of eyes or legs, let alone ‘cats’.
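
As a rough illustration of that shift from logic to statistics, here is a minimal Python sketch. Everything in it is an assumption made for the example: the two numeric features, the toy data, the hand-picked thresholds, and the nearest-centroid method, which is only a stand-in for the far larger models used in practice. The first function is a rule a person wrote; the second derives its 'rule' from labelled examples.

```python
from statistics import mean

def rule_based(features):
    """Hand-written logic: a person guesses the thresholds (hypothetical)."""
    a, b = features
    return "cat" if a > 0.5 and b < 0.5 else "dog"

def fit_centroids(samples):
    """'Calculate the rules': average the labelled examples for each label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }

def predict(centroids, features):
    """Pick the label whose average example is closest to the input."""
    def squared_distance(centroid):
        return sum((x - y) ** 2 for x, y in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Made-up training data: (feature pair, answer supplied by a person).
training = [
    ((0.9, 0.2), "cat"), ((0.8, 0.3), "cat"), ((0.7, 0.6), "cat"),
    ((0.2, 0.8), "dog"), ((0.3, 0.9), "dog"), ((0.4, 0.7), "dog"),
]
centroids = fit_centroids(training)

query = (0.75, 0.55)
print("learned from data:", predict(centroids, query))
print("hand-written rule:", rule_based(query))
```

Note that nothing in the learned version knows what a 'cat' is: it only knows which cluster of numbers an input sits closest to.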

To simplify hugely, generative networks run this in reverse - once you’ve identified a pattern, you can make something new that seems to fit that pattern. So you can make more pictures of ‘cats’ or ‘dogs’, and you can also combine them - ‘cats in space suits’ or ‘a country song about venture capitalists rejecting founders’. To begin with, the results tended to be pretty garbled, but as the models have got better the outputs can be very convincing.

However, they’re still not really working from a canonical concept of ‘dog’ or ‘contract law’ as we do (or at least, as we think we do) - they’re matching or recreating or remixing a pattern that looks like that concept.

I think this is why, when I ask ChatGPT to ‘write a bio of Benedict Evans’, it says I work at Andreessen Horowitz (I left), worked at Bain (no), founded a company (no), and have written some books (no). Lots of people have posted similar examples of ‘false facts’ asserted by ChatGPT. It often looks like an undergraduate confidently answering a question for which they didn’t attend any lectures. It looks like a confident bullshitter that can write very convincing nonsense. OpenAI calls this ‘hallucinating’.

But what exactly does this mean? Looking at that bio again, it’s an extremely accurate depiction of the kind of thing that bios of people like me tend to say. It’s matching a pattern very well. Is that false? It depends on the question. These are probabilistic models, but we perceive the accuracy of probabilistic answers differently depending on the domain. If I ask for ‘the chest burster scene in Alien as directed by Wes Anderson’ and get a 92% accurate output, no-one will complain that Sigourney Weaver had a different hair style. But if I ask for some JavaScript, or a contract, I might get a ‘98% accurate’ result that looks a lot like the JavaScript I asked for, but the 2% error might break the whole thing. To put this another way, some kinds of request don’t really have wrong answers, some can be roughly right, and some can only be precisely right or wrong, and cannot be ‘98% correct’.
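
One way to make the ‘98% correct’ point concrete is the sketch below. It uses Python rather than JavaScript purely for convenience, and the `total` function and its contents are invented for the example. It compares a working snippet with a copy that is missing a single character: by character similarity the two are nearly identical, but only one of them parses at all.

```python
import ast
import difflib

# A hypothetical snippet someone might ask for...
wanted = "def total(prices):\n    return sum(p * 1.2 for p in prices)\n"
# ...and a near-copy with the closing bracket missing.
almost = "def total(prices):\n    return sum(p * 1.2 for p in prices\n"

# Character-level similarity between the two strings, from 0.0 to 1.0.
similarity = difflib.SequenceMatcher(None, wanted, almost).ratio()

def parses(source):
    """Return True if the text is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(f"similarity: {similarity:.1%}")
print("wanted parses:", parses(wanted))
print("almost parses:", parses(almost))
```

For a picture, a similarity score like this would be a fair measure of success. For code, the only score that counts is the pass/fail at the end.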

So, the basic use-case question for machine learning as it generalised was “what can we turn into image recognition?” or “what can we turn into pattern recognition?” The equivalent question for generative ML might be “what can we turn into pattern generation?” and “what use cases have what kinds of tolerance for the error range or artefacts that come with this?”  

This might be a useful way to think about what this means for Google and the idea of ‘generative search’ - what kind of questions are you asking? How many Google queries are searches for something specific, and how many are actually requests for an answer that could be generated dynamically, and with what kinds of precision? If you ask a librarian a question, do you ask them where the atlas is or ask them to tell you the longest river in South America?

But more generally, the breakthroughs in ML a decade ago came with cool demos of image recognition, but image recognition per se wasn’t the point - every big company has deployed ML now for all sorts of things that look nothing like those demos. The same today - what are the use cases where pattern generation at a given degree of accuracy is useful, or that could be turned into pattern generation, that look nothing like the demos? What’s the right level of abstraction to think about? Qatalog, a no-code collaboration tool, is now using generative ML to make new apps - instead of making a hundred templates and asking the user to pick, the user types in what they want and the system generates it (my friends at Mosaic Ventures are investors). This doesn’t look like the viral generative ML demos, and indeed it doesn’t look like ML at all, but then most ML products today don’t ‘look’ like ML - that’s just how they work. So, what are the use cases that aren’t about making pictures or text at all?

There’s a second set of questions, though: how much can this create, as opposed to, well, remix?

It seems to be inherent that these systems make things based on patterns that they already have. They can be used to create something original (‘a cat in a space suit in the style of a tintype photo’), but the originality is in the prompt, just as a photograph can be art, or not, depending on where you point the camera and why. But if the advance from chatbots to ChatGPT is in automating the answers, can we automate the questions as well? Can we automate the prompt engineering?

It might be useful here to contrast AlphaGo with the old saying that a million monkeys with typewriters would, in time, generate the complete works of Shakespeare. AlphaGo generated moves and strategies that Go experts found original and valuable, and it did that by generating huge numbers of moves and seeing which ones worked - which ones were good. This was possible because it could play Go against itself and see what was good. It had feedback - automated, scalable feedback. Conversely, the monkeys could create a billion plays, some gibberish and some better than Shakespeare, but they would have no way to know which was which, and we could never read them all to see. Borges’s Library is full of masterpieces no human has ever seen, but how can you find them? What would the scoring system be?
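
The difference between the two cases can be sketched as a generate-and-score loop. This is a toy, not a description of how AlphaGo works: the `generate`, `score` and `best_of` functions are hypothetical, and the objective (reward sequences of digits that are in ascending order) is an arbitrary stand-in for ‘did this win the game?’. The generator is the same in both runs; the only thing that changes is whether there is an automated way to tell good output from bad.

```python
import random

def generate(rng, length=8):
    """The 'monkeys': produce a random candidate sequence of digits."""
    return [rng.randint(0, 9) for _ in range(length)]

def score(candidate):
    """Stand-in for automated feedback, such as win/loss in a game.
    Arbitrary objective: count adjacent pairs that are in ascending order."""
    return sum(1 for a, b in zip(candidate, candidate[1:]) if a <= b)

def best_of(n, scorer, seed=0):
    """Generate n candidates and keep the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=scorer)

# With feedback, volume helps: more candidates means a better best one.
with_feedback = best_of(10_000, score)

# Without feedback every candidate looks the same, so we are left holding
# whichever one happened to come first.
without_feedback = best_of(10_000, lambda candidate: 0)

print("with feedback:   ", with_feedback, score(with_feedback))
print("without feedback:", without_feedback, score(without_feedback))
```

Generation is the cheap part. The hard part is the one line that defines `score`, and for plays or music it is not obvious what to write there.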

Hence, a generative ML system could make lots more ‘disco’ music, and it could make punk if you described it specifically enough (again, prompt engineering), but it wouldn’t know it was time for a change and it wouldn’t know that punk would express that need. When can you ask for ‘something raw, fresh and angry that’s a radical change from prog rock?’ And when can a system know people might want that? There is some originality in creating new stuff that looks like the patterns we already have, but the originality that matters is in breaking the pattern. Can you score that?

There’s a joke that AI stands for ‘Anonymous Indians’, because before you can give an image recognition system a million pictures of dogs and a million pictures of cats as data for automated training, actual humans in an outsourcing company have to label all those images. There are people in the loop. But then, every billion-scale system we use today relies on people in the loop. Google Search analyses how people interact with the internet just as much as it analyses the content itself. Instagram can recommend things to you by comparing what you seem to like with what a billion other people seem to like, not by knowing what those things are themselves. Image recognition could move that to a different level of abstraction, but who, again, labels the images?

If there are always humans in the loop - if these things are all mechanical Turks in some way - then the question is how you find the right point of leverage. Yahoo tried paying people to catalogue the entire web one site at a time, and that was unscalable. Google, on one side, is based on the patterns of aggregate human behaviour of the web, and on the other side it gives you ten results and makes you pick one - manual curation by billions of users. The index is made by machine, but the corpus it indexes is made by people and the results are chosen by people. In much the same way, generative networks, so far, rely on one side on patterns in things that people already created, and on the other on people having new ideas to type into the prompt and picking the ones that are good. So, where do you put the people, at what point of leverage, and in what domains?

One of the ways I used to describe machine learning was that it gives you infinite interns. You don’t need an expert to listen to a customer service call and hear that the customer is angry and the agent is rude, just an intern, but you can’t get an intern to listen to a hundred million calls, and with machine learning you can. But the other side of this is that ML gives you not infinite interns but one intern with super-human speed and memory - one intern who can listen to a billion calls and say ‘you know, after 300m calls, I noticed a pattern you didn’t know about…’ That might be another way to look at a generative network - it’s a ten-year-old that’s read every book in the library and can repeat stuff back to you, but a little garbled and with no idea that Jonathan Swift wasn’t actually proposing, modestly, a new source of income for the poor.

What can they make, then? It depends what you can ask, and what you can explain to them and show to them, and how much explanation they need. This is really a much more general machine learning question - what are domains that are deep enough that machines can find or create things that people could never see, but narrow enough that we can tell a machine what we want?

Benedict Evans is a Venture Partner at Mosaic Ventures and previously a partner at a16z. You can read more from Benedict here, or subscribe to his newsletter.