Differences in Padding Strategies Between Decoder-only and Encoder-only Models
📌 What is Padding?
In Large Language Models (LLMs), padding is a method used to standardize sequence lengths for batch processing.
For example:
1 | Sentence 1: "I love NLP" |
Using the <pad> token for alignment:
1 | "I love NLP <pad> <pad> <pad>" |
📌 Padding Positioning: Left vs Right
There are two common padding strategies:
Right padding:
```
"I love NLP <pad> <pad>"
```
Left padding:
```
"<pad> <pad> I love NLP"
```
Typically:
- Decoder-only models (e.g., GPT, Llama): Use Left padding.
- Encoder-only models (e.g., BERT): Use Right padding.
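In the Hugging Face transformers library (one common implementation of these strategies; the checkpoint name is just an example), the choice is a single tokenizer attribute, `padding_side` — a quick sketch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 ships without a dedicated pad token

sentences = ["I love NLP", "Transformers are amazing models"]

tok.padding_side = "right"
print(tok(sentences, padding=True)["attention_mask"][0])  # zeros at the end

tok.padding_side = "left"
print(tok(sentences, padding=True)["attention_mask"][0])  # zeros at the front
```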
To understand why these defaults differ, recall that Transformer models fall into three main architectures:
| Model Type | Representative Models | Characteristics | Common Applications |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, ALBERT, ELECTRA | Bidirectional attention | NLP tasks like text classification, named entity recognition |
| Decoder-only | GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral | Causal attention (Autoregressive) | Text generation, chatbots, writing assistance |
| Encoder-Decoder | Transformer (original), T5, BART, mT5, PEGASUS | Encoder: bidirectional, Decoder: autoregressive | Machine translation, summarization, dialogue systems |
📌 Why Do Encoder-only Models (e.g., BERT) Use Right Padding?
- Encoder-only models (like BERT) aim to obtain representations for each token.
- These models use bidirectional attention, meaning each token attends to both past and future tokens.
- Slight shifts in position do not significantly impact model performance.
- Special tokens (e.g., `[CLS]`) in BERT maintain a fixed position for tasks like classification, making right padding more natural.
Example:
```
[CLS] Hello I love NLP [SEP] <pad> <pad>
```
- Right padding keeps `[CLS]` at a fixed position (index 0) and `[SEP]` immediately after the real tokens, allowing the model to focus on meaningful tokens.
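A small sketch with the Hugging Face transformers tokenizer (the checkpoint name is assumed; any BERT-style model behaves the same way) shows that right padding leaves `[CLS]` at index 0 for every sequence, with `[PAD]` tokens trailing at the end of the shorter ones:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # defaults to right padding

batch = tok(["Hello I love NLP", "Hi"], padding=True)
for ids in batch["input_ids"]:
    # [CLS] always appears first; [PAD] tokens fill out the shorter sequence.
    print(tok.convert_ids_to_tokens(ids))
```

Because `[CLS]` stays at index 0, a classification head can always read the hidden state at that position (e.g., `last_hidden_state[:, 0]`), regardless of how much padding each sequence needs.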
📌 Why Do Decoder-only LLMs Use Left Padding?
Decoder-only models (like GPT) are autoregressive, meaning each token is generated based only on previous tokens, and future tokens are masked.
- Positional Encoding Stability:
  Left padding ensures that meaningful tokens have a consistent relative position, preventing positional encoding misalignment.
  - When using absolute positional encoding, every token (including `<pad>`) gets a unique position index.
  - Padding tokens at the beginning do not affect the model's attention mechanism, because they are masked out.
Example:
```
Tokens:          <pad> <pad> Hello   I   love  NLP
Position Index: [  1     2     3     4     5    6  ]
Attention Mask: [  0     0     1     1     1    1  ]
```
The model only attends to positions where the mask is 1, so the padding tokens are ignored.
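A minimal PyTorch sketch (shapes and values chosen purely for illustration) of how the attention mask removes padded positions from the softmax:

```python
import torch
import torch.nn.functional as F

# One sequence of length 6: two leading <pad> tokens, then four real tokens.
attention_mask = torch.tensor([0, 0, 1, 1, 1, 1])

# Pretend these are the attention scores of the last real token against all 6 positions.
scores = torch.randn(6)

# Set scores at padded positions to -inf so softmax assigns them zero weight.
masked_scores = scores.masked_fill(attention_mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)        # the first two entries are exactly 0.0
print(weights.sum())  # 1.0 — all attention mass goes to the real tokens
```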
- Attention Masking:
  Combined with the attention mask, left padding ensures that `<pad>` tokens do not interfere with the encoding of the real tokens.
Illustration:

| Padding | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Left | `<pad>` | `<pad>` | Hello | I | love | NLP |
| Right | Hello | I | love | NLP | `<pad>` | `<pad>` |
- With Left padding, the last valid token always sits at the end of the sequence, in the same position for every sequence in the batch.
- With Right padding, the last valid token lands at a different position in each sequence and is followed by padding, which shifts token positions and undermines positional encoding stability.
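A sketch of batched generation with the Hugging Face transformers library (the model name and prompts are placeholders): with left padding, every prompt's last real token sits at the end of the input, so generation continues directly from it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tok.pad_token = tok.eos_token  # GPT-2 has no dedicated pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["I love", "Transformers are amazing because"]
inputs = tok(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,                 # input_ids + attention_mask
        max_new_tokens=10,
        pad_token_id=tok.pad_token_id,
    )

print(tok.batch_decode(out, skip_special_tokens=True))
```

With right padding, the shorter prompt would have `<pad>` tokens sitting between its last word and the newly generated tokens, which is exactly the situation left padding avoids.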
📌 Padding Differences in Training vs Inference
| Phase | Padding Strategy | Reason |
|---|---|---|
| Training | Left padding for decoder-only; Right padding for encoder-only | Optimized for batch processing efficiency |
| Inference | Typically, no padding for single sequences; Left padding for batched inference | Ensures stable positional encoding |
📌 Summary & Key Takeaways (TL;DR)
- Padding standardizes sequence lengths for batch processing.
- Decoder-only models (GPT, Llama) use Left padding so that the last real token stays at the end of the sequence and positional encoding remains stable during batched generation; the leading padding tokens are masked out.
- Encoder-only models (BERT, RoBERTa) use Right padding since they employ bidirectional attention and rely on stable special token positions (e.g., `[CLS]`).
- Although padding tokens occupy positions in the positional encoding, attention masks effectively filter them out, ensuring they do not affect model predictions.