Demystifying ChatGPT: A Deep Dive into Reinforcement Learning with Human Feedback
A comprehensive guide to the inner workings of OpenAI’s ChatGPT, including a deep dive into the Reinforcement Learning with Human Feedback (RLHF) algorithm that powers it.
ChatGPT has revolutionized the field of conversational AI. Its ability to generate human-like responses has sparked interest from industries around the world, and it’s no wonder why. But how does ChatGPT work so well? Have you ever wondered about the training process behind it? Did you know that part of it was trained in a supervised manner? How does it avoid “Algorithmic Bias”? How does it perform well without relying on conventional metrics? Let us explore the inner workings of this revolutionary AI model and discover what truly makes it tick!
First, let’s start with a problem we have long faced in large AI models: “Algorithmic Bias”. It refers to systematic errors or unjustified outcomes produced by a model. Large AI models are trained on tremendous amounts of data, and real-world data is often biased. For example, a model may be more likely to associate certain career titles, like CEO, with men, or certain domestic tasks, like cooking, with women. This type of bias can perpetuate harmful stereotypes and result in unfair outcomes in areas such as employment and lending.
The challenge in using a language model for diverse applications, such as creative storytelling, delivering informative and accurate facts, or generating executable code, lies in finding a loss function that can effectively capture all the required attributes. Given the limits of our mathematical tools, most language models fall back on a simple next-token prediction loss, such as cross-entropy. To compensate for its shortcomings, metrics like BLEU or ROUGE were introduced to better reflect human preferences in the generated outputs. While these metrics provide some improvement, they are still limited: they only compare the generated text to reference texts using simple rules. So, if not through loss functions or metrics, how can we effectively train a language model to perform well across these varied applications?
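To make the limitation concrete, here is a minimal sketch (in PyTorch, with made-up tensor shapes) of the next-token cross-entropy loss most language models optimize. Note that it only rewards matching the reference token and says nothing about style, factuality, or human preference:

```python
import torch
import torch.nn.functional as F

# Made-up shapes: a batch of 2 sequences, 5 positions each, vocabulary of 100 tokens.
vocab_size = 100
logits = torch.randn(2, 5, vocab_size)           # model's predicted distribution per position
targets = torch.randint(0, vocab_size, (2, 5))   # the actual "next token" at each position

# Standard next-token prediction loss: cross-entropy between the predicted
# distribution and the single reference token. Nothing in this objective
# captures creativity, factual accuracy, or code correctness.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())
```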
If we can’t mathematically derive a loss function that handles the diverse range of attributes required by applications like creative writing, factual information delivery, and code generation, why not involve humans in the training loop? By incorporating human feedback as a performance measure, or even as a loss to optimize the model against, we can achieve better results. This is the idea behind Reinforcement Learning with Human Feedback (RLHF).
RLHF was first introduced in “Deep reinforcement learning from human preferences”, a collaboration between OpenAI and DeepMind. That work focused on using RL to teach an agent to perform a backflip in a virtual environment. Since then, OpenAI has consistently employed human feedback in various research projects, including “Fine-Tuning Language Models”, “Learning to Summarize”, “Recursively Summarizing Books”, “WebGPT”, and “Training language models to follow instructions” (ChatGPT is the new kid). OpenAI has built an optimized ecosystem that puts humans at the center of its training strategy, and it has proved fruitful.
A typical RLHF system, shown in the flowchart below, involves an “Agent” (the RL algorithm) observing the environment and taking actions. Ordinarily, the environment rewards the agent for taking the correct actions toward its designated goal. In RLHF, however, the rewards are computed from human feedback instead of coming from the environment.
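As a rough illustration of that shift, below is a toy, self-contained Python sketch (not any real library’s API; the agent and preference function are invented for illustration) where the learning signal comes from a preference comparison rather than from an environment reward:

```python
import random

class ToyAgent:
    """A toy 'policy': it proposes actions around its current guess and nudges
    the guess toward whichever proposal the feedback prefers."""
    def __init__(self):
        self.guess = 0.0

    def act(self):
        return self.guess + random.uniform(-0.1, 0.1)

    def update(self, preferred_action):
        self.guess += 0.5 * (preferred_action - self.guess)

def human_preference(a, b, goal=0.7):
    """Stand-in for a human comparing two behaviours and picking the better one.
    (A real RLHF system would show the two behaviours to an annotator.)"""
    return a if abs(a - goal) < abs(b - goal) else b

agent = ToyAgent()
for step in range(200):
    a, b = agent.act(), agent.act()      # two candidate behaviours
    better = human_preference(a, b)      # feedback is a preference, not an environment reward
    agent.update(better)

print(round(agent.guess, 2))             # ends up close to the 'goal' of 0.7
```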
The video presented in the first RLHF paper shows an AI agent learning to perform a backflip. Humans are shown two clips of the agent’s behavior and select the one closer to achieving the goal. Through this feedback, the agent gradually improves its behavior and eventually succeeds in completing the backflip.
Before moving on to ChatGPT, let’s examine another OpenAI paper, “Learning to Summarize from Human Feedback”, to better understand how the RLHF algorithm works in the Natural Language Processing (NLP) domain. The paper proposed a language model guided by human feedback on the task of summarization, improving the quality of the generated summaries and reducing the influence of bias present in the input texts. Below is the flowchart of the proposed methodology,
In short, a long-form text is presented to the agent, which generates multiple summaries of it. Humans rank these summaries, and the reward model is optimized on the generated texts and the human rankings so that it mimics the human reward. Once the reward model is trained, a new training cycle starts: a different long-form text is presented to the agent, which generates fresh summaries, and the trained reward model now ranks them in place of humans. This process is repeated over many cycles until the goal is achieved. Below is an example illustrating the same,
RLHF in ChatGPT:
Now, let’s delve deeper into the training process, which relies heavily on Large Language Models (LLMs) and Reinforcement Learning (RL). ChatGPT largely replicates the methodology of the “Learning to Summarize” paper. The whole training process can be broken down into three steps, shown in the figure below
Step I: Pretraining Task
As with almost every NLP task conquered by Transformers, OpenAI built ChatGPT on top of the GPT architecture, specifically the new GPT-3.5 series. The data for this initial task was generated by human annotators (called AI trainers) playing both sides of a conversation, the user and the AI assistant. For example, an initial prompt was presented to multiple AI trainers, who answered as the AI assistant and then asked follow-up questions as the user. This process was repeated by different trainers, and the collected dialogues were combined to fine-tune GPT-3.5 through traditional supervised learning. The result is an LLM capable of answering prompts in a conversational style.
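For intuition, here is a minimal sketch of this supervised fine-tuning step using the Hugging Face Transformers library. Since the GPT-3.5 weights are not public, “gpt2” stands in as an illustrative base model, and the demonstration dialogue is made up:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in for GPT-3.5
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Made-up demonstration written by an AI trainer playing both sides.
demonstrations = [
    {"prompt": "User: Explain photosynthesis to a child.\nAssistant:",
     "response": " Plants use sunlight to turn air and water into food."},
]

model.train()
for example in demonstrations:
    text = example["prompt"] + example["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Ordinary supervised next-token objective over the demonstration text.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```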
Step II: Preparing the Reward Model
It’s time to add the key ingredient: human feedback in the training process. The aim is to turn a sequence of text into a scalar reward that mirrors human preferences. Just like the summarization model, the reward model is constructed from comparison data. The initial language model is fed a randomly chosen prompt and produces multiple outputs. Human annotators then rank these outputs by quality, creating a comparison dataset, and a ranking strategy such as an Elo system is used to turn the rankings into a scalar reward signal for training. The model trained on this ranked dataset is called the “reward model”. At this point, we have both an initial language model that generates text and a preference model that scores any given text by how well humans would perceive it.
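A common way to train such a reward model (this pairwise ranking loss is what the summarization work used; ChatGPT’s exact recipe has not been published) is to push the score of the human-preferred output above the rejected one. Below is a minimal sketch in which random embeddings stand in for encoded model outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a text representation to a single scalar score.
    In practice this head sits on top of a full language model encoder."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.score_head = nn.Linear(embed_dim, 1)

    def forward(self, text_embedding):
        return self.score_head(text_embedding).squeeze(-1)

reward_model = RewardModel()

# Made-up embeddings of human-preferred and less-preferred responses (batch of 4).
chosen_emb = torch.randn(4, 768)
rejected_emb = torch.randn(4, 768)

# Pairwise ranking loss: maximise the margin between the preferred and rejected
# scores, turning relative human rankings into a scalar reward signal.
loss = -F.logsigmoid(reward_model(chosen_emb) - reward_model(rejected_emb)).mean()
loss.backward()
```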
Step III: RL-based Fine-tuning of the LLM
Fine-tuning language models with RL had long been a dream until the arrival of the Proximal Policy Optimization (PPO) algorithm, which makes it possible to fine-tune some or all of the parameters of a language model. The language model acts as the agent: it takes a prompt as input and outputs a sequence of text. The action space consists of all the tokens in the model’s vocabulary, while the observation space encompasses all possible input token sequences. The reward function combines the preference model with constraints on the agent’s behavior.
Given a prompt from the dataset, two texts are generated: one from the initial LM and one from the current iteration of the fine-tuned agent. The text from the current iteration is passed to the reward model trained in Step II, which produces a scalar reward reflecting human preferences. The same text is also compared to the text from the initial LM to compute a penalty on the difference between them, which prevents the agent from simply fooling the reward model. The most widely used penalty is the Kullback-Leibler (KL) divergence: the KL term keeps the updated policy close to its original pretrained state, avoiding drastic changes during each training batch and ensuring the model continues to produce coherent text. The reward function is therefore a combination of the scalar reward and this penalty.
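A minimal sketch of that shaped reward is below, assuming we already have the reward model’s scalar score and per-token log-probabilities from both the current policy and the frozen initial LM (all numbers are made up):

```python
import torch

def shaped_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.02):
    """Combine the reward model's score with a KL-style penalty that keeps the
    fine-tuned policy close to the frozen initial LM.

    reward_score:    scalar from the reward model for the generated text
    policy_logprobs: log-probs the current policy assigned to the generated tokens
    ref_logprobs:    log-probs the initial LM assigned to the same tokens
    beta:            penalty strength (illustrative value)
    """
    kl_penalty = (policy_logprobs - ref_logprobs).sum()
    return reward_score - beta * kl_penalty

# Made-up values for a single generated response of 6 tokens.
r = torch.tensor(1.3)
policy_lp = torch.randn(6)
ref_lp = torch.randn(6)
print(shaped_reward(r, policy_lp, ref_lp))
```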
The update rule maximizes this reward over the current batch of data. PPO is a trust-region optimization algorithm that constrains the policy update so that a single step does not destabilize the learning process. As the RL policy updates, humans can continue ranking its outputs against the agent’s earlier versions.
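For reference, the constraint PPO applies looks roughly like the clipped surrogate objective sketched below (a generic illustration with made-up per-token values, not ChatGPT’s actual training code):

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective: the probability ratio between the new
    and old policy is clipped so a single update cannot move the policy too far."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Made-up per-token log-probs and advantages for one small batch.
new_lp = torch.randn(8, requires_grad=True)
loss = ppo_clipped_loss(new_lp, torch.randn(8), torch.randn(8))
loss.backward()
```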
In conclusion, by involving people in the training process, models can learn human-like behavior. OpenAI has created an environment for this and will continue to improve it in the future. ChatGPT is a big success, but it raises questions about the role of AI in affecting human creativity and learning. Many people ask, “Is there a way to differentiate AI-generated text from human text, the way we try to detect deepfakes in computer vision?”, and OpenAI’s answer is watermarking AI-generated text. Kindly follow me to get notified when a detailed coverage of that research is released.
Disclaimer: OpenAI has not released a paper for ChatGPT; this blog is written based on hints provided in their website article and some of their earlier research articles.
Further Reading: To read more on RLHF and related techniques, below are some articles that inspired me,
Thanks for reading my article! Happy reading!