📌 What is Padding?

In Large Language Models (LLMs), padding extends shorter sequences with a special <pad> token so that every sequence in a batch has the same length, allowing them to be processed together.

For example:

Sentence 1: "I love NLP"
Sentence 2: "Padding is useful in LLM training"

Using the <pad> token for alignment:

"I love NLP <pad> <pad> <pad>"
"Padding is useful in LLM training"

📌 Padding Positioning: Left vs Right

There are two common padding strategies:

  • Right padding:

    "I love NLP <pad> <pad>"
  • Left padding:

    "<pad> <pad> I love NLP"

Typically:

  • Decoder-only models (e.g., GPT, Llama): Use Left padding.
  • Encoder-only models (e.g., BERT): Use Right padding.
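
A minimal sketch of how the padding side is configured, assuming the Hugging Face tokenizers for GPT-2 and BERT (GPT-2 ships without a pad token, so the EOS token is reused here):

```python
from transformers import AutoTokenizer

# Decoder-only: pad on the left; GPT-2 has no pad token, so reuse EOS.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2_tok.pad_token = gpt2_tok.eos_token
gpt2_tok.padding_side = "left"

# Encoder-only: pad on the right (this is already the default).
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tok.padding_side = "right"

left = gpt2_tok("I love NLP", padding="max_length", max_length=6)
right = bert_tok("I love NLP", padding="max_length", max_length=8)

print(left["input_ids"])    # pad ids appear at the front
print(right["input_ids"])   # pad ids appear at the end
```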

Transformers can be categorized into three main architectures:

| Model Type | Representative Models | Characteristics | Common Applications |
| --- | --- | --- | --- |
| Encoder-only | BERT, RoBERTa, ALBERT, ELECTRA | Bidirectional attention | NLP tasks like text classification, named entity recognition |
| Decoder-only | GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral | Causal attention (autoregressive) | Text generation, chatbots, writing assistance |
| Encoder-Decoder | Transformer (original), T5, BART, mT5, PEGASUS | Encoder: bidirectional; Decoder: autoregressive | Machine translation, summarization, dialogue systems |

📌 Why Do Encoder-only Models (e.g., BERT) Use Right Padding?

  • Encoder-only models (like BERT) aim to obtain representations for each token.
  • These models use bidirectional attention, meaning each token attends to both past and future tokens.
  • Slight shifts in position do not significantly impact model performance.
  • Special tokens (e.g., [CLS]) in BERT maintain a fixed position for tasks like classification, making right padding more natural.

Example:

[CLS] Hello I love NLP [SEP] <pad> <pad>
  • Right padding leaves [CLS] and [SEP] in their natural positions, unaffected by the appended <pad> tokens, so the model can focus on the meaningful tokens.
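
A minimal sketch, assuming the Hugging Face bert-base-uncased tokenizer, showing that right padding leaves [CLS] at the very first position:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("Hello I love NLP", padding="max_length", max_length=10)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# [CLS] stays at index 0, [SEP] directly follows the real tokens,
# and every [PAD] token is appended at the end.
```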

📌 Why Do Decoder-only LLMs Use Left Padding?

Decoder-only models (like GPT) are autoregressive, meaning each token is generated based only on previous tokens, and future tokens are masked.

  • Positional Encoding Stability:
    Left padding ensures that meaningful tokens have a consistent relative position, preventing position encoding misalignment.
    • When using absolute positional encoding, every token (including <pad>) gets a unique position index.
    • Padding tokens at the beginning do not affect the model’s attention mechanism due to masking.

Example:

Position Index:  [     1,     2,     3,     4,     5,     6 ]
Token:           [ <pad>, <pad>, Hello,     I,  love,   NLP ]
Mask:            [     0,     0,     1,     1,     1,     1 ]
  • The model only attends to tokens where the mask is 1, ignoring padding tokens.

  • Attention Masking:
    The attention mask assigns 0 to every <pad> position, so the <pad> tokens are ignored during attention and cannot interfere with the representations of the real tokens.

Illustration:

| Token Position | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Left padding | <pad> | <pad> | Hello | I | love | NLP |
| Right padding | Hello | I | love | NLP | <pad> | <pad> |
  • With Left padding, the last valid token of every sequence sits at the final position, keeping positions consistent across the batch and placing it exactly where the next token will be generated.
  • With Right padding, the last valid token lands at a different position in each sequence, and newly generated tokens would have to be appended after the <pad> tokens, destabilizing the positional encoding.
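
A minimal sketch of batched generation with Left padding, assuming the Hugging Face transformers and torch libraries with GPT-2 standing in for a decoder-only LLM; the attention mask tells the model to ignore the <pad> tokens, and the last real token of every row sits in the final column, exactly where the next token is generated:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"          # pads go to the front of short prompts

model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(
    ["Hello I love NLP", "Padding is useful in LLM training"],
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    out = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],  # zeros mask out the <pad> tokens
        max_new_tokens=10,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.batch_decode(out, skip_special_tokens=True))
```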

📌 Padding Differences in Training vs Inference

| Phase | Padding Strategy | Reason |
| --- | --- | --- |
| Training | Left padding for decoder-only; Right padding for encoder-only | Optimized for batch-processing efficiency |
| Inference | Typically no padding for single sequences; Left padding for batched inference | Ensures stable positional encoding |
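
For the training side, a minimal sketch of dynamic per-batch padding, assuming Hugging Face's DataCollatorWithPadding; the collator pads each batch to its longest example on whichever side the tokenizer is configured to use:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"          # decoder-only setup, as described above

collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

features = [tokenizer(text) for text in ["I love NLP",
                                         "Padding is useful in LLM training"]]
batch = collator(features)

print(batch["input_ids"].shape)    # padded to the longest sequence in the batch
print(batch["attention_mask"])     # 0s mark the <pad> positions
```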

📌 Summary & Key Takeaways (TL;DR)

  • Padding standardizes sequence lengths for batch processing.
  • Decoder-only models (GPT, Llama) use Left padding so that the last real token always sits at the end of the sequence, which keeps positional encoding stable during batched generation; the Left padding tokens themselves are masked out.
  • Encoder-only models (BERT, RoBERTa) use Right padding since they employ bidirectional attention and rely on stable special token positions (e.g., [CLS]).
  • Although padding tokens occupy positions in positional encoding, attention masks effectively filter them out, ensuring they do not affect model predictions.