Differences in Padding Strategies Between Decoder-only and Encoder-only Models
📌 What is Padding?
In Large Language Models (LLMs), padding is a method used to standardize sequence lengths for batch processing.
For example:
```
Sentence 1: "I love NLP"
```

Using the `<pad>` token for alignment:

```
"I love NLP <pad> <pad> <pad>"
```
📌 Padding Positioning: Left vs Right
There are two common padding strategies:
Right padding:

```
"I love NLP <pad> <pad>"
```

Left padding:

```
"<pad> <pad> I love NLP"
```
Typically:
- Decoder-only models (e.g., GPT, Llama): Use Left padding.
- Encoder-only models (e.g., BERT): Use Right padding.
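In the Hugging Face `transformers` library, this choice is exposed as the tokenizer's `padding_side` attribute; a minimal sketch, using BERT and GPT-2 tokenizers as illustrative examples:

```python
from transformers import AutoTokenizer

# Encoder-only tokenizer (BERT): pads on the right by default.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.padding_side)  # "right"

# Decoder-only tokenizer (GPT-2): switch to left padding for batched use.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2_tok.pad_token = gpt2_tok.eos_token   # GPT-2 defines no pad token by default
gpt2_tok.padding_side = "left"

print(gpt2_tok(["I love NLP", "Hi"], padding=True)["input_ids"])
```

GPT-2 ships without a dedicated pad token, so reusing the EOS token as padding is a common workaround.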
Transformers can be categorized into three main architectures:
| Model Type | Representative Models | Characteristics | Common Applications |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, ALBERT, ELECTRA | Bidirectional attention | NLP tasks like text classification, named entity recognition |
| Decoder-only | GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral | Causal attention (autoregressive) | Text generation, chatbots, writing assistance |
| Encoder-Decoder | Transformer (original), T5, BART, mT5, PEGASUS | Encoder: bidirectional; Decoder: autoregressive | Machine translation, summarization, dialogue systems |
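The difference between bidirectional and causal attention in the table can be sketched as toy masks in PyTorch (not tied to any specific model's implementation):

```python
import torch

seq_len = 4

# Encoder-only (bidirectional): every token may attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Decoder-only (causal/autoregressive): token i attends only to tokens 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(bidirectional_mask)
print(causal_mask)
```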
📌 Why Do Encoder-only Models (e.g., BERT) Use Right Padding?
- Encoder-only models (like BERT) aim to obtain representations for each token.
- These models use bidirectional attention, meaning each token attends to both past and future tokens.
- Slight shifts in token position do not significantly impact model performance.
- Special tokens (e.g., `[CLS]`) in BERT maintain a fixed position for tasks like classification, making Right padding more natural.
Example:

```
[CLS] Hello I love NLP [SEP] <pad> <pad>
```
- Right padding keeps `[CLS]` and `[SEP]` in stable positions, allowing the model to focus on meaningful tokens.
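A minimal sketch with a BERT tokenizer (the sentences are only illustrative) shows the special tokens staying in place while the shorter sequence is padded on the right:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = ["Hello I love NLP", "Hi"]
encoded = tokenizer(batch, padding=True)

for ids in encoded["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(ids))
# Each row starts with [CLS] and closes its real content with [SEP];
# the shorter sequence is filled with [PAD] tokens on the right.
```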
📌 Why Do Decoder-only LLMs Use Left Padding?
Decoder-only models (like GPT) are autoregressive, meaning each token is generated based only on previous tokens, and future tokens are masked.
- Positional Encoding Stability: Left padding ensures that the meaningful tokens keep a consistent relative position at the end of the sequence, preventing positional encoding misalignment.
  - When using absolute positional encoding, every token (including `<pad>`) gets a unique position index.
  - Padding tokens at the beginning do not affect the model's attention mechanism, because they are masked out.
Example:

```
Tokens:         <pad> <pad> Hello  I   love  NLP
Position Index: [  1    2     3    4    5     6 ]
Attention Mask: [  0    0     1    1    1     1 ]
```

The model only attends to tokens where the mask is 1, ignoring the padding tokens.
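A minimal sketch of this setup, assuming a GPT-2 tokenizer configured for Left padding; deriving position indices from the attention mask (a common recipe, not the only one) keeps the real tokens' positions consistent:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

encoded = tokenizer(["Hello I love NLP", "Hi"], padding=True, return_tensors="pt")
mask = encoded["attention_mask"]
print(mask)  # zeros on the left mark the <pad> positions

# Derive position ids from the mask so that every row's first real token
# starts counting from the same value, regardless of how much padding it has.
position_ids = mask.cumsum(dim=-1) - 1
position_ids = position_ids.clamp(min=0)
print(position_ids)
```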
- Attention Masking: Combined with the attention mask, Left padding ensures that `<pad>` tokens do not interfere with the attention computation or the position encoding of the meaningful tokens.
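As a toy illustration (plain PyTorch, not any library's exact attention code), this is how the mask removes `<pad>` positions from the attention computation:

```python
import torch
import torch.nn.functional as F

scores = torch.randn(1, 6, 6)                        # raw scores (batch, query, key)
attention_mask = torch.tensor([[0, 0, 1, 1, 1, 1]])  # 0 = <pad>, 1 = real token

# Masked key positions receive -inf, so softmax assigns them ~zero weight.
scores = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)

print(weights[0, -1])  # the last query token places no weight on the padding
```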
Illustration:
| Padding | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Left | `<pad>` | `<pad>` | Hello | I | love | NLP |
| Right | Hello | I | love | NLP | `<pad>` | `<pad>` |
- With Left padding, the last valid token always sits at the same (final) position.
- With Right padding, the position of the last valid token shifts from sequence to sequence, affecting positional encoding stability and next-token prediction.
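This is why batched generation relies on Left padding: the next token is predicted from each row's last position. A minimal sketch, assuming GPT-2 as an illustrative model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # keep every row's last real token at the right edge

model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer(["Hello I love", "Hi"], padding=True, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5, pad_token_id=tokenizer.pad_token_id)

print(tokenizer.batch_decode(out, skip_special_tokens=True))
```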
📌 Padding Differences in Training vs Inference
| Phase | Padding Strategy | Reason |
|---|---|---|
| Training | Left padding for decoder-only; Right padding for encoder-only | Optimized for batch processing efficiency |
| Inference | Typically no padding for single sequences; Left padding for batched inference | Ensures stable positional encoding |
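A hedged sketch of the inference side (GPT-2 as an illustrative decoder-only model): a single sequence needs no padding at all, while a batch switches to Left padding:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Single-sequence inference: no padding is needed.
single = tokenizer("I love NLP", return_tensors="pt")
print(single["input_ids"].shape)

# Batched inference: switch to left padding so every row ends on a real token.
tokenizer.padding_side = "left"
batch = tokenizer(["I love NLP", "Hi"], padding=True, return_tensors="pt")
print(batch["input_ids"])
print(batch["attention_mask"])
```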
📌 Summary & Key Takeaways (TL;DR)
- Padding standardizes sequence lengths for batch processing.
- Decoder-only models (GPT, Llama) use Left padding so that every sequence ends on a real token, which keeps positional encoding and next-token generation stable; the Left-padding tokens themselves are masked out.
- Encoder-only models (BERT, RoBERTa) use Right padding, since they employ bidirectional attention and rely on stable special-token positions (e.g., `[CLS]`).
- Although padding tokens occupy positions in the positional encoding, attention masks effectively filter them out, ensuring they do not affect model predictions.