Notes on Secrets of RLHF in Large Language Models Part I: PPO

This is a summary of an important research paper, written at roughly a 21:1 time savings. It was made interactively by a human and several AIs. The goal is to save time and curate good ideas.

Link to paper: https://arxiv.org/abs/2307.04964

Paper published on: 2023-07-11

Paper's authors: Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan Hua, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang

GPT3 API Cost: $0.06

GPT4 API Cost: $0.15

Total Cost To Write This: $0.21

Time Savings: 21:1

The Power and Challenges of Large Language Models

Large Language Models (LLMs) have been making waves in the AI community, and for good reason. They hold the potential to bridge the gap between words and percepts, enabling more meaningful interactions with the real world. This capability brings us a step closer to realizing the dream of artificial general intelligence (AGI). But as with any powerful tool, LLMs come with their own set of challenges.

Aligning LLMs with Human Values

One of the primary concerns with LLMs is their potential to exhibit unintended behaviors and generate harmful content. It's essential that these models align with human values, which can be encapsulated in the three H's: helpful, honest, and harmless.

Reinforcement Learning from Human Feedback (RLHF) is a promising approach to align LLMs with user intent. However, successful RLHF training requires an accurate reward model, careful hyperparameter exploration, and a robust Proximal Policy Optimization (PPO) algorithm.

RLHF training is a complex process involving three main phases: supervised fine-tuning (SFT), reward modeling, and reinforcement learning via PPO.

In the SFT phase, a pretrained model learns to mimic human-annotated dialogue examples. This lays the groundwork for the subsequent phases.

Next, a reward model is trained on human feedback to score which of several responses is preferred. The reward modeling loss combines a pairwise comparison loss over preferred and rejected responses with an autoregressive language model (LM) loss. During the RL phase, the reward also includes a penalty based on the Kullback-Leibler (KL) divergence between the RL policy and the initial supervised model.
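As a rough illustration of the reward modeling loss described above, here is a minimal sketch in plain Python. It assumes the pairwise term is the usual negative log-sigmoid of the score margin; the function names and the `lm_coef` weighting parameter are hypothetical and not taken from the paper's code.

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(margin)), where margin = chosen score - rejected score."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def reward_model_loss(score_chosen, score_rejected, lm_loss, lm_coef=1.0):
    """Pairwise comparison loss plus a weighted LM loss (lm_coef is an assumed name)."""
    return pairwise_preference_loss(score_chosen, score_rejected) + lm_coef * lm_loss
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and it equals log 2 when the two scores are tied.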

In the reinforcement learning phase, the agent receives a state from the environment (dialogue history) and generates the next token based on its policy. The goal here is to find an optimal behavior strategy that maximizes the cumulative reward over a trajectory. Policy gradient methods are employed to optimize the agent's policy.

To estimate the advantage function, which represents the difference between the Q function and the value function, the Generalized Advantage Estimation (GAE) method is used. GAE uses a parameter λ to interpolate smoothly between the one-step temporal-difference estimator (high bias, low variance) and the Monte Carlo estimator (low bias, high variance), which makes the trade-off tunable.
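A minimal sketch of GAE in plain Python may make the interpolation concrete. It assumes a single trajectory, per-step rewards, and a list of value estimates with one extra bootstrap entry at the end; the default `gamma` and `lam` values are illustrative, not the paper's settings.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    values must have len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` each advantage is just the one-step TD error, and with `lam=1` it becomes the discounted return minus the value estimate.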

PPO and TRPO are two RL techniques that aim to improve a policy without destabilizing training. TRPO constrains each policy update by requiring the KL divergence between the old and new policies to stay within a fixed limit. PPO comes in two variants: PPO-Penalty and PPO-Clip. PPO-Penalty replaces TRPO's hard constraint with a KL penalty term in the objective, which turns the problem into an unconstrained optimization, while PPO-Clip uses a clipped version of the policy ratio in its objective to keep the new policy close to the old one.
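The PPO-Clip objective can be sketched in a few lines of plain Python. This is an illustrative version that works on lists of per-token log-probabilities; the function name and the default `clip_eps=0.2` are assumptions, not values reported in the paper.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Average clipped surrogate objective (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the ratio is 1 the objective reduces to the mean advantage. A large ratio paired with a positive advantage is capped at `1 + clip_eps` times that advantage.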

The value function, which estimates the expected returns for each state, plays a crucial role in the PPO algorithm. Mixing pretraining gradients can help mitigate potential degradation in the model's language skills and knowledge retention during PPO.

Reward Model Training and Human Preferences

A reward model is trained on human preference annotations and then serves as the reward signal when the policy is fine-tuned with reinforcement learning. The trained reward models show alignment with human preferences, especially the model trained on Chinese data. However, accuracy alone is insufficient as a criterion for the reward model.

PPO is the core algorithm for achieving alignment with human preferences. However, stabilizing RLHF training with language models is still an open question. Metrics that reflect the quality of PPO training are needed, because PPO training can exhibit pattern collapse, where the model is over-optimized and exhibits biased behavior.

Various metrics, such as reward score, training loss, perplexity, and KL divergence, can be used to monitor the training process. Implementing specific strategies and tricks in PPO training can help stabilize optimization and improve results.
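Two of the monitoring metrics above can be computed directly from token log-probabilities. The sketch below is an assumption about how one might do this, not the paper's code: perplexity is the exponential of the mean negative log-probability, and the KL divergence is estimated as the mean log-probability gap on tokens sampled from the policy.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability of the generated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def sampled_kl(logp_policy, logp_ref):
    """Monte Carlo estimate of KL(policy || reference) on policy-sampled tokens."""
    gaps = [p - r for p, r in zip(logp_policy, logp_ref)]
    return sum(gaps) / len(gaps)
```

Logging both values every few steps makes it easier to spot a policy that is drifting far from the reference model.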

The paper reports experiments on hyperparameter tuning and implementation tricks in PPO training for RLHF. It groups reward-level and advantage-level tricks under the term "score reparameterization" and studies their impact on training stability. Reward scaling alone was not found to guide proper policy optimization. Reward normalization and clipping can contribute to training stability, and advantage normalization and clipping can have effects similar to reward clipping.
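Reward normalization and clipping might look like the following sketch. For simplicity it assumes statistics computed over a single batch, and the function name and the default `clip_value` are hypothetical rather than the paper's settings.

```python
import statistics

def normalize_and_clip(rewards, clip_value=5.0, eps=1e-8):
    """Standardize rewards with batch statistics, then clip to [-clip_value, clip_value]."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    normalized = [(r - mean) / (std + eps) for r in rewards]
    return [max(-clip_value, min(clip_value, r)) for r in normalized]
```

Clipping bounds the influence of a single outlier reward on the policy update, which is one plausible reason it helps stability.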

Policy Constraints

Different policy constraints are explored, including token level KL-penalty, importance sampling, and entropy bonus. Pretrained initialization of the policy and critic models is discussed, with the policy model requiring supervised fine-tuning. The initialization of the critic model does not significantly affect PPO training.
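A token-level KL penalty is commonly implemented by subtracting a scaled log-probability gap from the reward at every token and adding the reward model's score at the final token. The sketch below follows that common pattern; the function name and the `kl_coef` default are assumptions, and the paper's exact formulation may differ.

```python
def kl_penalized_rewards(reward_score, logp_policy, logp_ref, kl_coef=0.1):
    """Per-token rewards: a KL penalty at each token, plus the sequence score at the end."""
    rewards = [-kl_coef * (p - r) for p, r in zip(logp_policy, logp_ref)]
    rewards[-1] += reward_score  # the reward model scores the whole response once
    return rewards
```

Tokens where the policy assigns much higher probability than the reference model receive a negative reward, which discourages drift away from the supervised model.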

The paper recommends constraining the instability of policy optimization at the reward level and using a KL penalty as the policy constraint. The researchers fine-tuned the policy model with PPO-max, their refined PPO variant that combines the most effective of these tricks. They also pre-trained the critic model before PPO and found that this improved training stability by providing better advantage estimation.

Comparing RLHF Models with SFT Models and ChatGPT

The researchers compared RLHF models with SFT models and found that RLHF models received higher preference ratings from human evaluators. They also compared their models with ChatGPT and found that RLHF training reduced how often the models lost to ChatGPT.

The researchers conducted natural language understanding (NLU) evaluations and found that PPO-ptx, the PPO variant that mixes in pretraining gradients, mitigated the decline in NLU capabilities caused by PPO. Dialogue examples showed that RLHF-trained models generated responses with higher informational content and demonstrated better judgment in addressing harmful prompts.

The research is a significant step forward in understanding and improving the alignment of LLMs with human preferences. As we continue to explore and refine these models, we move closer to a future where AI not only understands our language but also respects our values. The journey is long, and the challenges are many, but the potential rewards are immense.