Pre-Training vs. Fine Tuning: Understanding the Difference

  1. Pre-training
  2. Fine tuning
  3. PEFT
    1. Adapter Tuning
    2. LoRA
    3. Quantization
    4. Prompt Modifications
      1. Hard Prompt
      2. Soft Prompt
        1. Prompt Tuning
        2. Prefix Tuning
        3. P-tuning

Pre-training

Pre-training is when you take the entire architecture of a neural network model and train it on a huge dataset from scratch. We can also call it self-supervised training because no separate labels are assigned to the text data. We can use this unlabeled data for training by simply framing it as a next-word prediction task, where the next word is already available to us in the text itself. Large language models are generally pre-trained this way and then used for a variety of tasks such as creative writing, writing emails, and text summarization, among others.
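
As a minimal sketch of the next-word-prediction objective (a toy embedding plus output head in PyTorch, standing in for a full transformer), the labels are simply the input tokens shifted by one position:

```python
import torch
import torch.nn as nn

# Toy setup: vocabulary of 100 tokens, tiny "model" just for illustration.
vocab_size, hidden = 100, 32
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size)

# "Unlabeled" text: the targets are the same tokens shifted by one position.
tokens = torch.randint(0, vocab_size, (1, 16))     # (batch, seq_len)
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict the next token

logits = lm_head(embed(inputs))                    # (batch, seq_len - 1, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # in pre-training, every parameter of the model is updated
```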

Fine tuning

We can think of fine tuning as a process in which we freeze the parameters of all but the last few layers of the pre-trained model. We then train the parameters of those last layers on a downstream task, such as sentiment analysis or classification, using a labeled dataset that is much smaller than the data used for pre-training.
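
A minimal PyTorch sketch of this idea, using a hypothetical stand-in for a pre-trained model with a two-class head on top:

```python
import torch
import torch.nn as nn

# Hypothetical "pre-trained" backbone plus a 2-class downstream head.
model = nn.Sequential(
    nn.Embedding(10000, 128),   # stands in for the pre-trained layers
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 2),          # downstream classification head
)

# Freeze everything except the last layer.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```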

PEFT

Parameter-efficient fine tuning (PEFT) covers techniques that use a far smaller memory footprint than training the base pre-trained model or a generically fine-tuned model. The general idea behind PEFT remains the same as fine tuning: freeze most of the model's parameters and train only a few for a downstream task. The difference is that the parameters trained with PEFT are far fewer than in generic fine tuning.

There are, however, exceptions (Hard Prompts) as well.
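
A quick way to see the "train only a few parameters" idea in numbers is to compare trainable and frozen parameter counts; this small helper (a sketch that works with any PyTorch model) does exactly that:

```python
def count_parameters(model):
    """Report how many parameters are trainable vs. total."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} "
          f"({100 * trainable / total:.2f}%)")
```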

Adapter tuning

Adapters are pairs of blocks that are added to the transformer network. Each adapter block contains two plain fully connected (FC) layers with a non-linear activation between them. The adapter block first contracts the input and then expands it back to its base dimension. This bottleneck pattern is a common trick in neural networks for reducing the number of parameters.
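
A minimal sketch of such an adapter block in PyTorch (the bottleneck size is an illustrative choice, and the residual add around the bottleneck is the usual convention for adapters):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: contract, non-linearity, expand, residual add."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # contract
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # expand back

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))          # residual connection

# Example: a 768-dimensional transformer hidden state.
block = Adapter(hidden_dim=768)
out = block(torch.randn(2, 16, 768))   # (batch, seq_len, hidden_dim)
```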

Adapter tuning is often used in conjunction with LoRA and quantization. Say we use 4-bit quantization while fine-tuning a pre-trained model. The parameters to be tuned are decomposed using LoRA (an SVD-style low-rank factorization), which can also be seen as a category of adapters. The model is then tuned on the fine-tuning dataset, and the result, which is essentially a small plug-in adapter, is stored in an online repository called AdapterHub. Because this plug-in adapter is small, others in the community can simply download it for inference on the same downstream task rather than training their own adapter-tuned model for that task.
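
A hedged sketch of that workflow using the Hugging Face transformers and peft libraries (the model name and hyperparameters below are illustrative placeholders, not a recipe from this post):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit quantized weights (requires bitsandbytes).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "some-org/some-base-model",          # placeholder model name
    quantization_config=bnb_config,
)

# Attach small LoRA adapters to the attention projections.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# ... fine-tune on the downstream dataset ...

# Only the small adapter weights are saved; this is the "plug-in" artifact
# that can be shared on a hub and reused with the same base model.
model.save_pretrained("my-task-adapter")
```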

LoRA

LoRA (Low-Rank Adaptation) is a technique built on the idea behind SVD (singular value decomposition): a high-rank matrix can be approximated by a combination of low-rank matrices. This decomposition is applied in the attention layers of the transformer model, and the parameters belonging to these low-rank matrices are the ones optimized during training for the downstream task.

We do not change any parameters of the pre-trained model. Instead, we train only the low-rank matrices, which is relatively fast because there are far fewer parameters. The weights are additive: for inference, we simply add the product of the low-rank matrices to the pre-trained weights, so there is no additional latency.
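
A minimal sketch of a LoRA-augmented linear layer in PyTorch (the rank and initialization scale are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # trainable
        self.lora_B = nn.Parameter(torch.zeros(out_dim, rank))        # trainable

    def forward(self, x):
        # Training: frozen path plus the low-rank path.
        return x @ self.weight.T + x @ (self.lora_B @ self.lora_A).T

    def merge(self):
        # Inference: fold the low-rank update into the base weight,
        # so there is no extra latency at inference time.
        self.weight.data += self.lora_B @ self.lora_A
```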

Quantization

The key idea here is to reduce the precision (datatype) of the parameters of a pre-trained model so the model fits on low-memory hardware. A small fraction of the overall parameters, the ones used for tuning, is kept at higher precision, since they are updated during training and reducing them to low precision might hamper the learning objective.

However, an important point to note: with 4-bit quantization, the parameters that are not being tuned are scaled up to higher precision during the training process and scaled back to their lower-precision values when training is over. In other words, they are stored at lower precision but match the precision of the tunable parameters during computation. This lets the trainable parameters learn from their surroundings while the stored model keeps a low memory footprint.
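
A toy sketch of this store-low-precision, compute-high-precision idea, using simple absmax scaling to 4-bit integer levels (real 4-bit schemes such as NF4 are more involved):

```python
import torch

def quantize_4bit(w: torch.Tensor):
    """Map float weights to integers in [-7, 7] plus a scale factor."""
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate high-precision copy for computation."""
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_4bit(w)      # stored: tiny integers plus one scale value
w_hat = dequantize(q, scale)     # used during the forward pass in training
print((w - w_hat).abs().max())   # small quantization error
```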

Prompt Modifications

Hard Prompts aka Prompt Engineering

Hard prompts are manually written prompts that are passed along with the input data. Hard prompting does not update any parameter of the model and relies solely on the capabilities of the pre-trained model. The basic idea of a hand-written prompt is to influence the next word predicted by the LLM.
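
For illustration, a hard prompt is just text wrapped around the input; the model itself is untouched (the wording below is a made-up example):

```python
review = "The battery barely lasts two hours."

# Hand-written (hard) prompt: no model parameters change, only the input text.
prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    f"Review: {review}\n"
    "Sentiment:"
)
# `prompt` is then sent to the pre-trained LLM as-is.
```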

Soft Prompts

Soft prompts are not manually written. They are numerical (embedding-like) prompts that are injected into the model, either at the input layer or into all layers of the network. The right values of these numbers are learned for the downstream task during the model's training process.

Prompt Tuning

Prompt tuning should not be confused with prompt engineering (hard prompting). In prompt tuning we create a trainable numerical tensor that is prepended to the input embeddings, and this tensor is optimized during training on the downstream task.
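
A minimal sketch of prompt tuning in PyTorch: a small trainable tensor of "virtual token" embeddings is prepended to the input embeddings produced by the frozen model (sizes are illustrative):

```python
import torch
import torch.nn as nn

hidden_dim, num_virtual_tokens = 768, 20

# The only trainable parameters: one embedding per virtual token.
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_dim) * 0.02)

def prepend_soft_prompt(input_embeds: torch.Tensor) -> torch.Tensor:
    """input_embeds: (batch, seq_len, hidden_dim), produced by the frozen model."""
    batch = input_embeds.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([prompt, input_embeds], dim=1)  # (batch, 20 + seq_len, hidden)
```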

Prefix Tuning

We prefix virtual tokens to all the model layers (transformer blocks). These virtual tokens are optimized for the downstream task during training. The tokens are produced and optimized via a separate feed-forward (FF) network that is added to the overall transformer architecture.

The parameters of the FF network that embeds the prefix into the transformer model are trainable, while the parameters of the rest of the model are frozen, making this a parameter-efficient method.
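
A sketch of this reparameterization idea: a small trainable embedding is passed through a feed-forward network that produces a key/value prefix for every layer, while the transformer itself stays frozen (all dimensions below are illustrative):

```python
import torch
import torch.nn as nn

num_layers, num_prefix_tokens, hidden_dim = 12, 10, 768

class PrefixEncoder(nn.Module):
    """Maps trainable prefix embeddings to per-layer prefixes via an FF network."""
    def __init__(self):
        super().__init__()
        self.prefix_embed = nn.Embedding(num_prefix_tokens, hidden_dim)
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_layers * 2 * hidden_dim),  # key + value per layer
        )

    def forward(self):
        ids = torch.arange(num_prefix_tokens)
        out = self.ff(self.prefix_embed(ids))      # (tokens, layers * 2 * hidden)
        return out.view(num_prefix_tokens, num_layers, 2, hidden_dim)

# Prefix states to prepend to the keys/values of each frozen transformer layer.
prefixes = PrefixEncoder()()
```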

P-tuning

In P-tuning, virtual tokens are added only at the input layer of the transformer, not to all layers of the network. They are optimized for the downstream task during the training process.
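
A hedged sketch using the Hugging Face peft library, which implements P-tuning via PromptEncoderConfig, learning the input-layer virtual tokens through a small prompt encoder (the model name and sizes below are placeholders):

```python
from transformers import AutoModelForSequenceClassification
from peft import PromptEncoderConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "some-org/some-base-model",      # placeholder model name
    num_labels=2,
)

# P-tuning: trainable virtual tokens at the input layer only.
config = PromptEncoderConfig(
    task_type="SEQ_CLS",
    num_virtual_tokens=20,
    encoder_hidden_size=128,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the prompt encoder is trainable
```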
