Differences in Padding Strategies Between Decoder-only and Encoder-only Models
📌 What is Padding?
In Large Language Models (LLMs), padding is a method used to standardize sequence lengths for batch processing.
For example:
1 | Sentence 1: "I love NLP" |
Using the <pad> token for alignment:
1 | "I love NLP <pad> <pad> <pad>" |
📌 Padding Positioning: Left vs Right
There are two common padding strategies:
Right padding:
```
"I love NLP <pad> <pad>"
```
Left padding:
```
"<pad> <pad> I love NLP"
```
Typically:
- Decoder-only models (e.g., GPT, Llama): Use Left padding.
- Encoder-only models (e.g., BERT): Use Right padding.
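In the Hugging Face transformers library (one common implementation of these strategies; the checkpoint name is just an example), the choice is a single tokenizer attribute, `padding_side` — a quick sketch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 ships without a dedicated pad token

sentences = ["I love NLP", "Transformers are amazing models"]

tok.padding_side = "right"
print(tok(sentences, padding=True)["attention_mask"][0])  # zeros at the end

tok.padding_side = "left"
print(tok(sentences, padding=True)["attention_mask"][0])  # zeros at the front
```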
To understand why these defaults differ, recall that Transformer models fall into three main architectures:
| Model Type | Representative Models | Characteristics | Common Applications |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, ALBERT, ELECTRA | Bidirectional attention | NLP tasks like text classification, named entity recognition |
| Decoder-only | GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral | Causal attention (Autoregressive) | Text generation, chatbots, writing assistance |
| Encoder-Decoder | Transformer (original), T5, BART, mT5, PEGASUS | Encoder: bidirectional, Decoder: autoregressive | Machine translation, summarization, dialogue systems |
📌 Why Do Encoder-only Models (e.g., BERT) Use Right Padding?
- Encoder-only models (like BERT) aim to obtain representations for each token.
- These models use bidirectional attention, meaning each token attends to both past and future tokens.
- Slight shifts in position do not significantly impact model performance.
- Special tokens (e.g., `[CLS]`) in BERT maintain a fixed position for tasks like classification, making right padding more natural.
Example:
```
[CLS] Hello I love NLP [SEP] <pad> <pad>
```
- Right padding keeps `[CLS]` at a fixed position (index 0) and `[SEP]` immediately after the real tokens, allowing the model to focus on meaningful tokens.
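A small sketch with the Hugging Face transformers tokenizer (the checkpoint name is assumed; any BERT-style model behaves the same way) shows that right padding leaves `[CLS]` at index 0 for every sequence, with `[PAD]` tokens trailing at the end of the shorter ones:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # defaults to right padding

batch = tok(["Hello I love NLP", "Hi"], padding=True)
for ids in batch["input_ids"]:
    # [CLS] always appears first; [PAD] tokens fill out the shorter sequence.
    print(tok.convert_ids_to_tokens(ids))
```

Because `[CLS]` stays at index 0, a classification head can always read the hidden state at that position (e.g., `last_hidden_state[:, 0]`), regardless of how much padding each sequence needs.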
📌 Why Do Decoder-only LLMs Use Left Padding?
Decoder-only models (like GPT) are autoregressive, meaning each token is generated based only on previous tokens, and future tokens are masked.
- Positional Encoding Stability:
  Left padding ensures that meaningful tokens have a consistent relative position, preventing positional encoding misalignment.
  - When using absolute positional encoding, every token (including `<pad>`) gets a unique position index.
  - Padding tokens at the beginning do not affect the model's attention mechanism, because they are masked out.
Example:
```
Tokens:          <pad> <pad> Hello   I   love  NLP
Position Index: [  1     2     3     4     5    6  ]
Attention Mask: [  0     0     1     1     1    1  ]
```
The model only attends to positions where the mask is 1, so the padding tokens are ignored.
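A minimal PyTorch sketch (shapes and values chosen purely for illustration) of how the attention mask removes padded positions from the softmax:

```python
import torch
import torch.nn.functional as F

# One sequence of length 6: two leading <pad> tokens, then four real tokens.
attention_mask = torch.tensor([0, 0, 1, 1, 1, 1])

# Pretend these are the attention scores of the last real token against all 6 positions.
scores = torch.randn(6)

# Set scores at padded positions to -inf so softmax assigns them zero weight.
masked_scores = scores.masked_fill(attention_mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)        # the first two entries are exactly 0.0
print(weights.sum())  # 1.0 — all attention mass goes to the real tokens
```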
- Attention Masking:
  Combined with the attention mask, left padding ensures that `<pad>` tokens do not interfere with the encoding of the real tokens.
Illustration:

| Padding | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Left | `<pad>` | `<pad>` | Hello | I | love | NLP |
| Right | Hello | I | love | NLP | `<pad>` | `<pad>` |
- With Left padding, the last valid token always sits at the end of the sequence, in the same position for every sequence in the batch.
- With Right padding, the last valid token lands at a different position in each sequence and is followed by padding, which shifts token positions and undermines positional encoding stability.
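A sketch of batched generation with the Hugging Face transformers library (the model name and prompts are placeholders): with left padding, every prompt's last real token sits at the end of the input, so generation continues directly from it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tok.pad_token = tok.eos_token  # GPT-2 has no dedicated pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["I love", "Transformers are amazing because"]
inputs = tok(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,                 # input_ids + attention_mask
        max_new_tokens=10,
        pad_token_id=tok.pad_token_id,
    )

print(tok.batch_decode(out, skip_special_tokens=True))
```

With right padding, the shorter prompt would have `<pad>` tokens sitting between its last word and the newly generated tokens, which is exactly the situation left padding avoids.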
📌 Padding Differences in Training vs Inference
| Phase | Padding Strategy | Reason |
|---|---|---|
| Training | Left padding for decoder-only; Right padding for encoder-only | Optimized for batch processing efficiency |
| Inference | Typically, no padding for single sequences; Left padding for batched inference | Ensures stable positional encoding |
📌 Summary & Key Takeaways (TL;DR)
- Padding standardizes sequence lengths for batch processing.
- Decoder-only models (GPT, Llama) use Left padding so that the last real token stays at the end of the sequence and positional encoding remains stable during batched generation; the leading padding tokens are masked out.
- Encoder-only models (BERT, RoBERTa) use Right padding since they employ bidirectional attention and rely on stable special token positions (e.g., `[CLS]`).
- Although padding tokens occupy positions in the positional encoding, attention masks effectively filter them out, ensuring they do not affect model predictions.