While anticipation builds for GPT-4, OpenAI quietly releases GPT-3.5


Released two years ago, OpenAI’s GPT-3 was perhaps among the first models to show that AI can write convincingly, if not flawlessly, like a human. Its successor, most likely to be called GPT-4, is expected to be unveiled in the near future, perhaps as soon as 2023. But in the meantime, OpenAI has quietly rolled out a series of AI models based on “GPT-3.5,” a previously unannounced, improved version of GPT-3.

GPT-3.5 made its public debut Wednesday with ChatGPT, a fine-tuned, general-purpose chatbot built on the model. Launched in public beta yesterday afternoon, ChatGPT can engage with a variety of topics, including programming and scientific concepts.

According to OpenAI, GPT-3.5 was trained on a blend of text and code published prior to Q4 2021. Like GPT-3, it learned the relationships between sentences and words from large amounts of text from the internet, including hundreds of thousands of social media posts and news articles.

Rather than release the fully trained GPT-3.5 itself, OpenAI used it as the basis for several systems fine-tuned for specific tasks, each available through the OpenAI API. One, text-davinci-003, can handle more complex instructions than models built on GPT-3, according to the lab, and is measurably better at both long-form and “high-quality” writing.

According to OpenAI data scientist Jan Leike, text-davinci-003 is similar but not identical to InstructGPT, a family of GPT-3-based models released by OpenAI earlier this year that are less likely to generate problematic (e.g., toxic and highly biased) text while more closely aligning with a user’s intent. Text-davinci-003 — and by extension GPT-3.5 — “scores higher on human preference ratings” while suffering from “less severe” limitations, Leike said in a tweet.

The model still has important limitations! But they should be less severe and less frequent than what we described in our previous blog post. https://t.co/r8zWWYSIJ6

— Jan Leike (@janleike) November 28, 2022

This seems to be the case anecdotally. Data scientists at Pepper Content, a content marketing platform, report that text-davinci-003 “performs better in understanding the ‘context’ behind a request and then using that to produce better content” while “hallucinating” less than GPT-3-based models. (Where it concerns text-generating AI, hallucination refers to an AI writing inconsistent, factually incorrect statements.)

In a test on OpenAI’s Playground website, which provides a UI frontend for the models, the Pepper Content team fed several prompts to text-davinci-003 and a model based on GPT-3 (text-davinci-002). Given “What is the philosophy behind WeWork?,” the GPT-3.5-based text-davinci-003 generated this:

WeWork is a global workspace provider that believes people are the most important asset in any organization. WeWork’s philosophy is to create a collaborative environment that allows people to work together in an efficient and flexible way. The company promotes collaboration and productivity while creating a welcoming and inspiring environment.

WeWork also believes in being a socially responsible company. WeWork seeks to reduce its environmental impact by providing meaningful work experiences and promoting diversity, inclusion, and sustainability. Through its unique community-based programming and events, WeWork strives to create meaningful experiences. The company believes that people can achieve more when they work together in a supportive and collaborative environment.

It’s not perfect — notice the repetitiveness and excessive commas. But the copy’s certainly more engaging than what the GPT-3-based text-davinci-002 produced:

At WeWork, we believe in openness, collaboration, and community. WeWork has reimagined the concept of a workspace by bringing together entrepreneurs, creators, and startups from around the globe — anyone who wants to make a difference and change the world. We are proud to be your neighbor, friend, and coworker.

GPT-3.5 is also more effective at creating blog posts. Here’s what the Pepper Content team got when they prompted text-davinci-003 to write a post about picking a sofa:

If you are looking to give your green living room a fresh look, adding the the best sofa for the space is a great way to start. There is more to selecting the right sofa than simply finding one that looks great. You want a piece that will be comfortable and complements your green living space decor.

In this blog, we’ll discuss how to choose a sofa for your green living space. We’ll discuss the various types of sofas, the design considerations to make when choosing a sofa, and how to find the right sofa for you. We will also give tips on how to care for your new sofa. This blog will provide all the information you need in order to choose the right sofa for your green living space.

It isn’t foolproof, though:

Sofa is one of the most basic requirements in a living room. It’s more than a piece furniture. It is an integral part of the decor in your living room. What should you look for in a sofa? You might be wondering what the criteria are for choosing the right sofa.

Experiments beyond Pepper Content’s show that GPT-3.5 tends to be more sophisticated and thorough in its responses than GPT-3. For example, when the YouTube channel All About AI prompted text-davinci-003 to write a history of AI, the model’s output mentioned key luminaries in the field, including Alan Turing and Arthur Samuelson, while text-davinci-002’s did not. All About AI also found that text-davinci-003 tended to have a more nuanced understanding of instructions, for instance providing details such as a title, description, outline, introduction and recap when asked to create a video script.

That’s no accident: a hallmark of text-davinci-003/GPT-3.5’s outputs is verbosity. (This writer can sympathize.) In an analysis, scientists at the startup Scale AI found that text-davinci-003/GPT-3.5 generates outputs roughly 65% longer than text-davinci-002/GPT-3 given identical prompts.

Perhaps less useful for most potential users but nonetheless entertaining, text-davinci-003/GPT-3.5 is better than its predecessor at composing songs, limericks and rhyming poetry. Ars Technica reports that commenters on Y Combinator’s Hacker News forum used text-davinci-003 to write a poem explaining Albert Einstein’s theory of relativity and then rewrite the poem in the style of John Keats. See:

If you want to understand Einstein’s thought
It’s not that hard if you give it a shot
General Relativity is the name of the game
Where space and time cannot remain the same
Mass affects the curvature of space
Which affects the flow of time’s race
An object’s motion will be affected
By the distortion that is detected
The closer you are to a large mass
The slower time will seem to pass
The farther away you may be
Time will speed up for you to see

The Scale AI team even found that text-davinci-003/GPT-3.5 has some notion of meter, like iambic pentameter. See:

O gentle steeds, that bear me swift and

Through green fields and pathways so

My heart swells with pride to be with

We ride the world afresh to view

The wind blows through our hairs so

And stirs a passion deep within me.

My soul lifts, my spirits soar high,

To ride along with you, my truest buddy, am I

Your strength, grace, courage and your fire,

Inspire us to go beyond our sires.

No earthly bonds can keep us, only fate,

To gallop on, our wond’rous course create

Relatedly, GPT-3.5 is wittier than GPT-3, at least from a subjective standpoint. Asking text-davinci-002/GPT-3 to “tell a joke” usually yields this:

Why did the chicken cross the road? To cross the road.

Text-davinci-003/GPT-3.5 has cleverer responses:

Q: What did the fish say when it hit the wall? A: Dam!

Q: What did one ocean say to the other ocean? A: Nothing, they just waved.

Scale AI had the model explain Python code in the style of Eminem, a feat which text-davinci-002/GPT-3 simply couldn’t accomplish:

Yo, so I’m loopin’ through this list

With each item that I find

I’m going to print every letter in each of them

Dog, Cat, Banana, Apple, I’m gonna get’em all with this rhyme
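The snippet behind that rap would look something like the following reconstruction. Scale AI didn’t publish the exact code it asked the model to explain, so the list contents here are inferred from the lyrics:

```python
# Reconstructed from the rap's description: loop through a list
# and print every letter of each item.
items = ["Dog", "Cat", "Banana", "Apple"]

letters = []
for item in items:       # "loopin' through this list"
    for ch in item:      # "print every letter in each of them"
        letters.append(ch)
        print(ch)

print("".join(letters))  # DogCatBananaApple
```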

So why is GPT-3.5 better than GPT-3 in these specific areas? Without additional information from OpenAI, we can’t know for sure; an OpenAI spokesperson declined to comment. But GPT-3.5’s training approach likely plays a part. Like InstructGPT, GPT-3.5 was trained with the help of human trainers who ranked and rated the model’s responses to prompts. This feedback was then fed back into the system, which tuned its answers to match the trainers’ preferences.
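That ranking-and-feedback loop can be illustrated with a toy, Bradley-Terry-style preference loss, the kind of objective commonly used to train reward models from human rankings. The scores below are made up for illustration; this is a sketch of the general technique, not OpenAI’s actual code:

```python
import math

def preference_probability(score_preferred, score_rejected):
    """Bradley-Terry style probability that the trainer-preferred
    response outranks the rejected one, given scalar reward scores."""
    return 1.0 / (1.0 + math.exp(-(score_preferred - score_rejected)))

def ranking_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the human preference; minimizing this
    pushes the reward model to score preferred responses higher."""
    return -math.log(preference_probability(score_preferred, score_rejected))

# A reward model that already agrees with the trainers incurs low loss...
low = ranking_loss(2.0, -1.0)
# ...while one that disagrees incurs high loss.
high = ranking_loss(-1.0, 2.0)
print(low < high)  # True
```

Minimizing this loss over many ranked pairs is what steers the system toward the trainers’ preferences.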

However, this does not make GPT-3.5 immune from the pitfalls that all modern language models fall prey to. Because GPT-3.5 merely relies on statistical regularities in its training data rather than a human-like understanding of the world, it’s still prone to, in Leike’s words, “mak[ing] stuff up a bunch.” It also has limited knowledge of the world after 2021 because its training data is more sparse after that year. And the model’s safeguards against toxic output can be circumvented.

But GPT-3.5 and its derivative models show that GPT-4, when it arrives, won’t necessarily require an enormous number of parameters to be the best text-generating system available. (Parameters are the parts of a model learned from historical training data; they essentially define the model’s skill on a particular problem.) While some have predicted that GPT-4 will contain over 100 trillion parameters, nearly 600 times as many as GPT-3, others argue that emerging techniques in language processing, like those seen in GPT-3.5 and InstructGPT, will make such a jump unnecessary.
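Some back-of-envelope arithmetic puts the rumored figure in perspective. The two-bytes-per-parameter figure below assumes fp16 weight storage and ignores optimizer state and activations, so it is a deliberate underestimate:

```python
gpt3_params = 175e9             # GPT-3's reported parameter count
rumored_gpt4_params = 100e12    # the speculative 100-trillion figure

# Ratio underlying the "nearly 600 times" claim
ratio = rumored_gpt4_params / gpt3_params
print(round(ratio))             # 571

# Rough memory needed just to store the weights in fp16 (2 bytes each)
fp16_bytes = 2
terabytes = rumored_gpt4_params * fp16_bytes / 1e12
print(terabytes)                # 200.0
```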

One of those techniques could involve browsing the web for greater context, a la Meta’s ill-fated BlenderBot 3.0 chatbot. John Schulman, a research scientist and co-founder of OpenAI, told MIT Tech Review in a recent interview that OpenAI is continuing work on WebGPT, a language model it announced late last year that can look up information on the web (via Bing) and cite sources for its answers. At least one Twitter user appears to have found evidence of the feature undergoing testing for ChatGPT.

OpenAI also has financial reasons to pursue lower-parameter models as it continues to improve on GPT-3. Large models are expensive to build: a 2020 study from AI21 Labs pegged the cost of developing a text-generating model with only 1.5 billion parameters at as much as $1.6 million. OpenAI has raised over $1 billion to date from Microsoft and other backers, and it’s reportedly in talks to raise more. Investors, whatever their size, will expect to see returns eventually.
