Navigating GPT 3.5 vs GPT 4

October 07, 2023

Chippy, inspired by Microsoft’s Clippy, is a chrome extension that is powered by OpenAI’s GPT 3.5 and GPT 4 models. If you’ve tinkered with GPT 3.5 and 4, you’ll notice pretty quickly that GPT 4 produces better results. In fact, when someone complains about an invalid answer or hallucination, most likely they were using GPT 3.5. As you can see below:

GPT 3.5 hallucinates about a fake wikipedia article

GPT 4 does not hallucinate

So why not just use GPT 4? Well, GPT 3.5 is an order of magnitude cheaper and significantly faster than 4. So if you are building an app that leverages GPT 3.5, read on to hopefully avoid some of the pitfalls I ran into.

GPT 3.5 struggles with irrelevant information

GPT 3.5 has a notable drawback: it’s easily sidetracked by irrelevant information, especially when faced with complex queries. For example, let’s say you're navigating through a Github pull request and you prompt GPT 3.5 with a request to "write a sample PR based on the changeset". Given the context of a busy page, GPT 3.5 often falters. Whereas GPT 4 filters out the noise, and is able to write an accurate PR description:

GPT 3.5

GPT 4

Another unexpected complication was the relatively simple request of asking GPT 3.5 to reply to an email thread. While single-threaded chains were relatively straightforward, threads that branched off and introduced multiple recipients, BCC’s, forwards, and inline replies posed a much greater challenge.

GPT 3.5 routinely misunderstood the provided context and would, for example, not take into account the final reply in a thread when crafting a response.

We got around this by no longer sending a blob of text, but instead using Gmail specific library that can generate a list of emails including the timestamp they were sent. GPT 3.5 can accurately identify the last email after we provided structured context

Wrangling GPT 3.5 with OpenAI Functions

Contextualized follow-up questions is one of our most beloved features; not only does it help our users refine and explore the topic they were initially interested in, it also lets us demonstrate the utility of a Chippy in a more natural way. We started with a simple system prompt:

But to get any sort of consistent result as JSON, you will need to utilize OpenAI functions. The simple prompt above turned into this behemoth:

We added the following based on this guide from OpenAI:

Open AI Function to get structured JSON response back
constraints: “limit to 8 words”
reminders: “Remember, follow-up questions should NEVER be from the perspective of the bot and ALWAYS from the perspective of the user!”.
and few-shot: ‘Example label is: “Interested in PowerPoint. Can I help?”

Based on my research, OpenAI functions do not support few-shot examples. If it did, you could move a lot of this logic into it, but I had to duplicate to get consistent results.

You can see (and tweak) the final result here. Even after adding all these constraints, GPT 3.5 will still return more than three questions around 5% of the time. GPT 3.5 functions used to not return JSON consistently too, but that no longer appears to be the case as of a month ago according to my Sentry alerts.

As you can see, GPT 3.5 has its quirks. But with the right approach, you can achieve impressive results. Interested in experiencing it firsthand? Give Chippy a try.