Understanding Generative Pre-Training: From Unsupervised Learning to Supervised Fine-Tuning
Let's start by diving into the inner workings of Generative Pre-Training, the method used to train Generative Pre-trained Transformers. The generative pre-training framework comprises two stages:
- Unsupervised Pre-training
In unsupervised pre-training, the goal is to train a large language model on a massive corpus of tokens without any labeled data. The training objective is to maximize the likelihood of each token given its surrounding context: the model learns to predict every token from the tokens that precede it within a context window. The model itself is a neural network whose parameters are adjusted using stochastic gradient descent.
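Written as a formula, this is the standard language-modeling objective from the original GPT paper, where U = {u1, …, un} is the unlabeled token corpus, k is the size of the context window, and Θ denotes the model's parameters:

```latex
% Unsupervised pre-training objective: maximize the log-likelihood of each
% token u_i given the k tokens that precede it.
L_1(U) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```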
The Unsupervised Pre-training stage operates as follows:
Context Prediction: The model is trained to predict each token from a window of the tokens that precede it. For example, to predict the next word after “The cat sat on the…”, the model treats “The cat sat on the” as the context and learns to assign high probability to plausible continuations such as “mat”. Through maximum likelihood estimation, the model adjusts its internal parameters so that the actual next word in each training sequence becomes as likely as possible given its context.
Neural Network Utilization: The architecture is a Transformer decoder, a type of neural network whose parameters are learned with stochastic gradient descent, as sketched in the code below.
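To make this concrete, here is a minimal, illustrative training-step sketch. It assumes PyTorch is available; the vocabulary size, layer sizes, and the random “corpus” are hypothetical placeholders, and the decoder is built the usual PyTorch way, from encoder layers plus a causal mask, rather than being the actual GPT architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, context_len, embed_dim = 100, 8, 32
# Hypothetical "corpus": in practice this would be billions of unlabeled tokens.
tokens = torch.randint(0, vocab_size, (1000,))

class TinyDecoder(nn.Module):
    """A tiny decoder-style language model: embed tokens, contextualize them
    with causally masked self-attention, and score every vocabulary entry
    as a candidate next token."""
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(context_len, embed_dim)
        block = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=2)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):                      # x: (batch, context_len) token ids
        positions = torch.arange(x.size(1))
        h = self.tok_embed(x) + self.pos_embed(positions)
        # Causal mask: each position may only attend to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(h, mask=mask)
        return self.lm_head(h)                 # logits over the vocabulary

model = TinyDecoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # stochastic gradient descent
loss_fn = nn.CrossEntropyLoss()                          # negative log-likelihood

for step in range(200):
    # Sample random windows: the input is a context of k tokens, the target
    # at each position is the very next token in the corpus.
    starts = torch.randint(0, len(tokens) - context_len - 1, (16,)).tolist()
    x = torch.stack([tokens[s:s + context_len] for s in starts])
    y = torch.stack([tokens[s + 1:s + context_len + 1] for s in starts])
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, vocab_size), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()          # maximizing likelihood == minimizing this loss
    optimizer.step()         # adjust the parameters a small step at a time
```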
- Supervised Fine-Tuning
After training on a massive, unlabeled corpus (stage 1), we switch to a smaller, task-specific labeled dataset for fine-tuning. Each example in this dataset consists of two parts (a small illustration follows below):
Input Sequence: A sequence of tokens representing the text or data being processed.
Label (y): The desired outcome for the specific input sequence.
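As an illustration, a couple of labeled fine-tuning examples could look like the following; the task (sentiment classification) and the label set are hypothetical choices, not something prescribed by GPT itself.

```python
# Hypothetical labeled fine-tuning examples: each pairs an input token
# sequence with the desired label y (a sentiment class, in this toy case).
labeled_dataset = [
    {"input_tokens": ["the", "movie", "was", "wonderful"], "label": "positive"},
    {"input_tokens": ["the", "plot", "made", "no", "sense"], "label": "negative"},
]
```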
The input sequence from the labeled dataset is fed into the pre-trained GPT model obtained in stage 1. As the sequence passes through the model, the final transformer block produces an embedding that captures the input's meaning and context.
Next, we add a new linear output layer on top of the pre-trained model (a minimal sketch follows the list below):
This layer has its own set of parameters (Wy).
It takes the embedding from the final transformer block as input.
It uses its parameters (Wy) to predict the label (y) for the specific input sequence.
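Here is a minimal sketch of that output layer, assuming PyTorch; the embedding size, the number of labels, and the randomly generated vector standing in for the final transformer block's embedding are all illustrative assumptions.

```python
import torch
import torch.nn as nn

embed_dim, num_labels = 768, 2            # hypothetical sizes

# The new linear output layer: its weight matrix plays the role of W_y.
output_head = nn.Linear(embed_dim, num_labels)

# `h` stands in for the embedding produced by the final transformer block
# for one input sequence (random here, purely for illustration).
h = torch.randn(1, embed_dim)

logits = output_head(h)                   # h @ W_y^T + bias
probs = torch.softmax(logits, dim=-1)     # P(y | input sequence)
predicted_label = probs.argmax(dim=-1)    # the model's predicted label
```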
The entire process aims to maximize an objective function: the likelihood of the correct labels under the model. Equivalently, it minimizes a loss that measures the gap between the model's prediction (ŷ) and the actual label (y) in the labeled dataset. By reducing this gap, the model adjusts the parameters (Wy) of the new linear output layer (and, in practice, the pre-trained weights as well) to improve its prediction accuracy over time.
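In the notation of the original GPT paper, with h_m denoting the final transformer block's embedding of the input sequence x^1, …, x^m and C the labeled dataset, this reads:

```latex
% The new linear layer (W_y) turns the final embedding into a label distribution:
P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}\left(h_m W_y\right)

% Fine-tuning objective: maximize the log-likelihood of the correct labels
% over the labeled dataset C (equivalently, minimize the cross-entropy loss).
L_2(C) = \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
```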
What is a Generative Pre-trained Transformer (GPT)?
A GPT combines three key elements:
Generative: This refers to the model's ability to generate new data, like text, code, or translations, based on the input it receives.
Pre-trained: This signifies that the model has been trained on a massive dataset beforehand. This pre-training allows it to learn general language patterns before being fine-tuned for specific tasks.
Transformer: The GPT model leverages a powerful neural network architecture called the Transformer. Within the Transformer, one component, the decoder, plays a crucial role: it generates output sequences, token by token, from the internal representations it builds of the input, as sketched in the toy loop below.
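To illustrate only the generation loop itself, here is a toy autoregressive sketch; the stand-in “model” is just a fixed random lookup table conditioned on the last token rather than a real trained decoder, and the vocabulary size and prompt token ids are arbitrary.

```python
import torch

torch.manual_seed(0)
vocab_size = 100
# Stand-in for a trained decoder: a fixed random table mapping the last token
# to one logit per vocabulary entry (a real model conditions on the whole context).
logit_table = torch.randn(vocab_size, vocab_size)

def next_token_logits(context):
    return logit_table[context[-1]]

context = [5, 17, 42]                     # a hypothetical prompt, as token ids
for _ in range(10):
    probs = torch.softmax(next_token_logits(context), dim=-1)
    next_id = torch.multinomial(probs, 1).item()   # sample the next token
    context.append(next_id)               # feed it back in and keep going

print(context)                            # prompt followed by 10 generated ids
```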
Here are some key expert definitions of Generative Pre-trained Transformers:
OpenAI: A transformer-based machine learning model that can generate human-like text.
Ian Goodfellow: A type of deep learning model capable of generating coherent and contextually relevant text based on a given prompt.
Sebastian Ruder: A class of autoregressive language models that use transformer architectures and are pre-trained on large text corpora.
Jay Alammar: A powerful generative model capable of producing coherent and contextually relevant text by leveraging the vast amount of knowledge encoded in its pre-training data.