In Depth

Exploring LLM Weirdness: How We Built It & What I Learned

Josh Watzman
  • Engineering
Exploring LLM Weirdness: A Quiz Game. 9 questions. 1 advanced AI. How hard could it be?

A couple of months ago, my coworker Jackson and I were messing around with GPT-4, and we got to wondering: could we make a game with an LLM teammate, and could we make the game actually fun? You may have played the resulting game, “Exploring LLM Weirdness: A Quiz Game” – we think we came up with something pretty interesting, and from the feedback we’ve received, a number of other folks do too!

He and I both learned a lot from building this, and in the spirit of continuing that learning, we are releasing the source code to the AI integration – hopefully useful whether you want example code for how to build an AI chatbot, or you’re just curious how the game works behind the scenes. It’s not always beautiful code, but that’s part of the benefit of seeing a real app and not some example snippet.

Moreover, we are releasing a new server-side SDK for creating AI chatbots with Cord. Our team has built a number of these AI experiments over the last few months, and we always found ourselves copy-pasting the same code over and over. This SDK is a distillation of everything you need to hook up OpenAI or any other chatbot to Cord’s persistent messaging backend. (So it’s not just another ChatGPT UI replacement, but a way to hook up an AI chatbot to a full-featured, multi-user, persistent chat and messaging system!)

Although Jackson had a bunch of LLM experience, I did not, so I discovered a number of fascinating things along the way. Here are some of the most interesting things I learned!

An LLM is just fancy autocomplete

This obviously isn’t news for anyone who knows anything about how LLMs work, but it’s pretty interesting for those of us who don’t. I was very surprised by the extent to which an LLM is just a really, really, really fancy autocomplete. You can see this most easily in some of OpenAI’s deprecated APIs, where you just send an entire block of text and they send you back a completion of that text, exactly like autocomplete. The modern APIs add a bit of metadata, but in the end, the LLM isn’t doing anything more fancy or more complex than what your phone’s autocomplete does when you’re typing a text message. (Though of course the LLM computes that completion in a massively more fancy and complex manner!)
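To make that concrete, here’s a minimal sketch of that older completion style using OpenAI’s Node SDK – the model name and prompt are just placeholder assumptions, but the shape of the call is the whole point: text goes in, a continuation comes back.

```ts
// A sketch of the "plain autocomplete" style of API, using OpenAI's Node SDK
// and the legacy completions endpoint. The model name is an assumption – any
// completion-style model works the same way.
import OpenAI from 'openai';

const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment.

async function completeText(prompt: string): Promise<string> {
  const response = await openai.completions.create({
    model: 'gpt-3.5-turbo-instruct', // assumption: any legacy completion model
    prompt, // just a raw block of text...
    max_tokens: 64,
  });
  // ...and back comes the text the model thinks should follow it.
  return response.choices[0].text;
}
```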

LLM APIs are shockingly simple

But the above does mean that the APIs to LLMs tend to be really simple. You give it a list of messages, annotated with who said what, and a set of instructions (the “system prompt”), and it gives you back what it thinks should be the next message in the sequence. That’s really all there is to it. There are a couple of other parameters you can tweak (max response length, “temperature”, a few others) but you can get a long way without understanding them at all. It was fascinating for me to see the degree to which such a complex system could have such a tiny API – something all of our systems should strive for.
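For example, here’s roughly what a chat call looks like with OpenAI’s Node SDK – the model, the prompts, and the parameter values are stand-ins for illustration, not what the game actually sends:

```ts
// A sketch of the chat-style API: a system prompt, a list of messages
// annotated with who said what, and the model's guess at the next message.
import OpenAI from 'openai';

const openai = new OpenAI();

async function nextMessage(): Promise<string | null> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: 'You are a quizmaster. Keep answers short.' },
      { role: 'user', content: 'What is the capital of France?' },
    ],
    // Optional knobs you can get a long way without touching:
    temperature: 0.7,
    max_tokens: 256,
  });
  return response.choices[0].message.content;
}
```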

Anything that isn’t just linear text is a huge problem

This is another implication of an LLM just being a fancy autocomplete, which took me a bit to really get my head around. Because it’s behaving like an autocomplete, finding the “most likely” continuation of a bit of text according to its training data, it performs best on tasks which closely fit that format. An LLM is really good at translating a bit of text, for example. But, even if it can easily tell you the rules of tic-tac-toe, it can’t really play the game, since an ASCII game board doesn’t nicely fit the format of a bit of text to be continued. Nor can it read downwards. We had a lot of fun adding questions to the quiz which hit on these areas!

Streaming is really important

Although the LLM APIs are mostly really simple, one of their more complex areas is streaming. Instead of just giving you the entire completion at once, they can (optionally) give you that completion bit-by-bit. When I was first getting things working, I figured that I’d deal with this complexity later, turn off the streaming, and just get something simple together. And, well, it did work – but it felt terrible! Absolutely unusably so. If you aren’t using streaming, you have to wait for the entire response to be computed and sent back to you at the end – and this takes a long time! Seeing the text stream in bit-by-bit helps decrease the perceived latency, even if it takes the same total amount of time for the whole message to get onto the screen. Even though I knew how important perceived latency was, I was still surprised by the staggering degree to which it affected the UX feel in this case. So I immediately wired together the machinery we needed to support streaming in the game.
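Turning streaming on is barely more code – here’s a sketch with OpenAI’s Node SDK, where the hypothetical onChunk callback stands in for whatever pushes text into your UI as it arrives:

```ts
// A sketch of streaming a chat completion with the OpenAI Node SDK's
// `stream: true` option. Instead of one big response at the end, you get an
// async iterable of chunks you can show to the user as they arrive.
import OpenAI from 'openai';

const openai = new OpenAI();

async function streamReply(onChunk: (text: string) => void): Promise<void> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: 'Explain tic-tac-toe in one sentence.' }],
    stream: true,
  });

  for await (const chunk of stream) {
    // Each chunk carries a small delta of the eventual message.
    onChunk(chunk.choices[0]?.delta?.content ?? '');
  }
}
```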

Many LLM SDKs only deal with streaming text

When I was looking around at the LLM SDK landscape, I was excited to see that efficiently dealing with streaming text was central to the most popular ones. However, I was surprised at the degree to which that was basically all some of them did – get a blob of text from the LLM into React, maybe with a tiny bit of UI. When I think about actual uses of LLMs in industry – and not just ChatGPT clones – I imagine an AI assistant that you can bring into a conversation to help out a group of humans. From building Cord, I knew that this would mean that you need persistent threads, a nice UI on top of them, a solid composer to send messages, live synchronization across both the LLM and the various humans, etc. And none of the popular SDKs seemed to provide every part of this, which I found really surprising! (And it’s why we decided to release our own chatbot SDK.)

LLMs and humans are good at very different things

In retrospect, maybe I shouldn’t have been surprised by this – humans and computers in general are good at different things. But, in the original version of the game, we wanted the human and LLM to work together to solve a problem, and we had tremendous difficulty finding a set of problems where this would actually be fun. We needed something that both humans and LLMs are a little bit good at, so that both sides would need to contribute to the solution. However, we couldn’t find much of anything in that category – everything was either too easy for one or the other, or just impossible for both. So we ended up doubling down on the idea that humans and LLMs are good at different things, resulting in the current game.

NextJS and Vercel are magical

Although I had very little experience with NextJS and Vercel, my coworker did, and he suggested we use them to get started quickly. I was surprised by – and really impressed by – how easy NextJS made it to get started, think about our React UI and our server endpoints as a cohesive whole, and then deploy the whole thing to Vercel. It saved us a ton of time over doing these things by hand. On the flip side, there were definitely a number of things I found way too magical – for example, abstracting away servers is great, but it makes handling “code that needs to run once at server startup” harder than I’m used to. And setting an endpoint’s timeout by just exporting a specially-named const still feels weird!
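For example, here’s roughly what that looks like in an App Router route handler – the route path, timeout value, and request handling are made up for illustration, but maxDuration is the specially-named const I mean:

```ts
// app/api/chat/route.ts (hypothetical path) – a sketch of the
// "specially-named const" pattern. `maxDuration` is the Next.js/Vercel route
// segment config that caps how long the serverless function may run.
export const maxDuration = 60; // seconds – value chosen for illustration

export async function POST(request: Request): Promise<Response> {
  const { prompt } = await request.json();
  // ...call the LLM here; slow LLM responses are exactly why the timeout matters.
  return Response.json({ ok: true, prompt });
}
```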

The game is actually pretty neat

And, in the end, we shipped something that – if I do say so myself – is pretty neat. It wasn’t where I expected us to go when we set out, but I’m proud of what we ultimately ended up with.

As above, if you haven’t played the game, it’s still at quiz.cord.com, its source code is on GitHub, and the server-side chatbot library we used to build it is available from npm.