In this post, I will share the steps to run the fine-tuned Gemma-2-2b-it model using vLLM. This guide will cover the installation process, environment configuration, and common troubleshooting tips.

Installation and Verification of vLLM

First, install vLLM version 0.5.3 and verify the installation.

  1. Install vLLM:

    !pip install vllm==0.5.3
  2. Verify the installation:

    import vllm
    print(vllm.__version__)
    # Output: 0.5.3

Installing FlashInfer

Follow these steps to install FlashInfer, ensuring compatibility with your torch and CUDA versions.

  1. Check the torch version and CUDA compatibility:

    import torch
    print(torch.__version__) # Should output: 2.3.1+cu121
    print(torch.version.cuda) # Should output: 12.1
  2. Install FlashInfer:
    According to the documentation, running Gemma 2 with this vLLM version requires FlashInfer v0.0.8 (refer to the vLLM and FlashInfer documentation for details); a quick verification snippet follows this list.

    !pip install flashinfer==0.0.8 -i https://flashinfer.ai/whl/cu121/torch2.3/
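
As with vLLM, it is worth confirming that the pinned FlashInfer version is actually installed. A minimal check, assuming the distribution name matches the pip package name flashinfer (importlib.metadata works regardless of whether the package exposes a __version__ attribute):

import importlib.metadata as metadata

# Confirm the installed FlashInfer distribution matches the pinned version
print(metadata.version("flashinfer"))
# Output: 0.0.8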

Updating Environment Variables for vLLM Backend

Set FlashInfer as the attention backend for vLLM. This environment variable must be set before the LLM engine is constructed so that it is picked up at initialization:

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

Testing vLLM

Here is the test code to generate text using vLLM:

from vllm import LLM, SamplingParams
import random

# Fine-tuned Gemma-2-2b-it model (local path or Hugging Face repo id)
llm = LLM(model="gemma-2-2b-model", trust_remote_code=True)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=512,
    top_p=0.95,
    top_k=1,
)

# Example test data
test_data = [{"text": "Input test text 1"}, {"text": "Input test text 2"}]

# Pick one example at random to use as the prompt
prompts = [random.choice(test_data)["text"]]

outputs = llm.generate(prompts, sampling_params)
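
llm.generate returns one RequestOutput per prompt, each holding the original prompt and its completions. To inspect the generated text, you can iterate over the results, for example:

# Print the prompt and the first completion for each request
for output in outputs:
    print("Prompt:   ", output.prompt)
    print("Generated:", output.outputs[0].text)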

By following these steps, you should be able to successfully run the fine-tuned Gemma-2-2b-it model.

Common Errors and Solutions

Here are some common errors you might encounter and their solutions:

  1. RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257

    • Cause: An incompatible FlashInfer version is installed.
    • Solution: Reinstall FlashInfer 0.0.8 using the command above.
  2. TypeError: 'NoneType' object is not callable

    • Cause: FlashInfer is not installed.
    • Solution: Install FlashInfer following the steps above.
  3. ValueError: Please use Flashinfer backend for models with logits_soft_cap (i.e., Gemma-2). Otherwise, the output might be wrong. Set Flashinfer backend by export VLLM_ATTENTION_BACKEND=FLASHINFER.

    • Cause: The FlashInfer backend is not set.
    • Solution: Set the environment variable VLLM_ATTENTION_BACKEND to FLASHINFER before initializing the LLM (a combined sanity check is sketched after this list).
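
If you run into any of these errors, a quick check of the installed versions and the backend setting often narrows down the cause. The following is a minimal sketch; the expected values are the ones pinned earlier in this post:

import os
import importlib.metadata as metadata

# Errors 1 and 2: wrong or missing FlashInfer installation
print("vllm:      ", metadata.version("vllm"))        # expected: 0.5.3
print("flashinfer:", metadata.version("flashinfer"))  # expected: 0.0.8

# Error 3: backend not selected (must be set before the LLM is constructed)
print("backend:   ", os.environ.get("VLLM_ATTENTION_BACKEND"))  # expected: FLASHINFER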

By following these detailed steps and solutions, you should be able to successfully run and debug the fine-tuned Gemma-2-2b-it model. If you encounter any issues, refer to the relevant documentation or seek help from the community.