In this video, Minqi Jiang (https://minch.co/), a research scientist at University College London & Meta AI, discusses the capabilities of language models like GPT-3 and the effects of Reinforcement Learning from Human Feedback (RLHF) on these models. He explains how RLHF makes language models more user-friendly by providing a more reliable interface to specific parts of the model, but also reduces the diversity of their outputs, a trade-off that can hurt creative tasks.

Minqi explains that a base language model, like GPT-3, is essentially trained to model the whole internet of text. This vast distribution covers both good and bad content, creating a chaotic and enormous model that can provide a wide range of responses. When prompted with a task, it’s difficult to anticipate how the model will complete it.
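To make "modelling the whole internet of text" concrete, here is a minimal sketch of the underlying pretraining objective: predict the next token at every position under a cross-entropy loss. Everything in it (the tiny LSTM standing in for a transformer, the random tokens, the dimensions) is a placeholder assumption for illustration, not the actual GPT-3 setup.

```python
import torch
import torch.nn as nn

# Toy sketch of the pretraining objective: next-token prediction with
# maximum likelihood. Vocabulary size, model, and data are placeholders.
vocab_size, hidden_dim, seq_len, batch = 1000, 64, 16, 4

embed = nn.Embedding(vocab_size, hidden_dim)
lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)  # stand-in for a transformer
head = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # random stand-in for web text
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden, _ = lstm(embed(inputs))
logits = head(hidden)                                      # (batch, seq_len - 1, vocab)

# Maximum-likelihood training: the model learns the distribution of the corpus,
# good and bad content alike.
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```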

RLHF fine-tunes the model against a reward signal learned from human preference data. This biases the model toward outputs that the humans providing the preference data favored, yielding more reliable answers, but at the cost of diversity in the model's output.
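As a concrete illustration of "a reward signal learned from human preference data", below is a minimal PyTorch sketch of training a reward model on pairwise preferences with a Bradley-Terry style loss. The `RewardModel` class, its dimensions, and the random tensors standing in for pooled (prompt, response) representations are assumptions made for this sketch, not code discussed in the video.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a pooled (prompt, response) representation to a scalar reward."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, pooled_repr: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_repr).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style pairwise loss: push the preferred response's reward
    # above the rejected response's reward.
    return -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# One toy training step on random "pooled representations" standing in for the
# language model's encodings of human-labelled (chosen, rejected) pairs.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

chosen_repr = torch.randn(8, 128)    # responses humans preferred
rejected_repr = torch.randn(8, 128)  # responses humans rejected

optimizer.zero_grad()
loss = preference_loss(model(chosen_repr), model(rejected_repr))
loss.backward()
optimizer.step()
```

The learned reward model is then used as the training signal when the language model itself is fine-tuned with reinforcement learning.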

RLHF can be thought of as a pruning process: the aim is to cut away the bad or undesired parts of the probability distribution and concentrate on the good ones. This acts as a form of robustification, but it also potentially reduces the model's creativity, since its outputs become more convergent.
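One way to see this pruning intuition numerically: the KL-regularized objective commonly used in RLHF has a well-known closed-form optimum in which the tuned policy is the base distribution reweighted by exp(reward / beta). The toy sketch below applies that reweighting to a made-up candidate distribution and compares entropies; all numbers, shapes, and rewards are illustrative assumptions, not anything from the video.

```python
import torch

# Illustrative sketch only. The KL-regularized objective
#   max_pi  E[r] - beta * KL(pi || pi_base)
# has the closed-form optimum  pi*(y) proportional to pi_base(y) * exp(r(y) / beta).
# Reweighting a toy base distribution by a toy reward shows how probability mass
# concentrates on favoured outputs, i.e. how diversity gets "pruned" away.

torch.manual_seed(0)
num_candidates = 10
base_logits = torch.randn(num_candidates)      # hypothetical base-model scores
base_probs = torch.softmax(base_logits, dim=-1)

rewards = torch.randn(num_candidates)          # hypothetical learned rewards
beta = 0.2                                     # strength of the KL penalty

# Responses the reward model dislikes are suppressed; favoured ones are amplified.
tuned_probs = base_probs * torch.exp(rewards / beta)
tuned_probs = tuned_probs / tuned_probs.sum()

def entropy(p: torch.Tensor) -> float:
    return float(-(p * p.clamp_min(1e-12).log()).sum())

print(f"base entropy:  {entropy(base_probs):.3f}")
print(f"tuned entropy: {entropy(tuned_probs):.3f}")  # typically lower, i.e. less diverse
```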

In conclusion, RLHF is a useful way to fine-tune language models for more reliable and user-friendly outputs, but it can also reduce the diversity and creativity of those outputs. This trade-off between reliability and diversity is important to consider when applying language models to different tasks and applications.

Credit for shoggoth meme: https://twitter.com/anthrupad
