vLLM is a high-performance inference engine that provides outstanding throughput, efficient memory management, robust batch processing, and deep performance optimizations.
Before deployment, please review the official vLLM documentation to verify hardware compatibility.

Supported Models

This guide applies to the MiniMax-Text-01 and MiniMax-M1 models; you only need to update the model name during deployment. The following examples use MiniMax-M1-40k.

Environment Requirements

  • OS: Linux
  • Python: 3.9 – 3.12
  • GPU:
    • Compute capability ≥ 7.0
    • Memory requirements:
      • Model weights: 495 GB
      • Each 1M tokens of context: 38.2 GB
    • Recommended configurations (adjust based on workload; see the sizing sketch below):
      • 8 × 80 GB GPUs: Supports up to 2M tokens of context
      • 8 × 96 GB GPUs: Supports up to 5M tokens of context
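For a rough sanity check of these recommendations, the sketch below estimates the usable context from the figures above. It assumes vLLM's default gpu_memory_utilization of 0.9 and ignores activation and fragmentation overhead, so treat the numbers as estimates only.
# Back-of-the-envelope context budget, based on the figures above.
# Assumption: vLLM reserves ~90% of each GPU (default gpu_memory_utilization=0.9);
# real capacity also depends on activations and fragmentation.
WEIGHTS_GB = 495.0          # model weights
GB_PER_M_TOKENS = 38.2      # memory per 1M tokens of context
GPU_MEM_UTILIZATION = 0.9   # vLLM default

def max_context_m_tokens(num_gpus: int, gpu_mem_gb: float) -> float:
    """Estimate the largest context (in millions of tokens) that fits."""
    usable_gb = num_gpus * gpu_mem_gb * GPU_MEM_UTILIZATION
    return max(0.0, (usable_gb - WEIGHTS_GB) / GB_PER_M_TOKENS)

print(f"8 x 80 GB: ~{max_context_m_tokens(8, 80):.1f}M tokens")  # ~2.1M
print(f"8 x 96 GB: ~{max_context_m_tokens(8, 96):.1f}M tokens")  # ~5.1M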
Supported versions:
  • Text01: vLLM ≥ 0.8.3
  • M1: vLLM ≥ 0.9.2
    • Versions 0.8.3 – 0.9.1 may cause unsupported model errors or precision loss. See details: vLLM PR #19592
To fix the “unsupported model” error, edit the model’s config.json and change the architectures field to MiniMaxText01ForCausalLM; see MiniMax-M1 Issue #21 for details.
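If you need to apply that workaround, a minimal sketch of the edit is shown below; the model path is a placeholder for wherever your local copy of config.json lives.
# Minimal sketch of the workaround above: point "architectures" in the
# model's config.json at MiniMaxText01ForCausalLM.
# The path below is a placeholder; adjust it to your local model directory.
import json
from pathlib import Path

config_path = Path("/path/to/MiniMax-M1-40k/config.json")  # placeholder
config = json.loads(config_path.read_text())
config["architectures"] = ["MiniMaxText01ForCausalLM"]
config_path.write_text(json.dumps(config, indent=2))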

Deploy with Python

We recommend using a virtual environment (venv, conda, or uv) to avoid dependency conflicts. Install vLLM in a clean Python environment:
# Use CUDA 12.8
# Install with pip
pip install "vllm>=0.9.2" --extra-index-url https://download.pytorch.org/whl/cu128

# Or install with uv
uv pip install "vllm>=0.9.2" --torch-backend=auto
Run the following command to start the vLLM server. vLLM will automatically download and cache the MiniMax-M1 model from Hugging Face.
SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=0 vllm serve MiniMaxAI/MiniMax-M1-40k \
    --trust-remote-code \
    --quantization experts_int8 \
    --dtype bfloat16
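Downloading and loading the weights can take a while. As an optional convenience (not part of vLLM itself), the small polling sketch below queries the standard /v1/models endpoint, assuming the default port 8000, and reports when the server is ready.
# Optional helper: poll the OpenAI-compatible API until the server is ready.
# Assumes the server started above is listening on localhost:8000.
import json
import time
import urllib.request

URL = "http://localhost:8000/v1/models"

while True:
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            models = json.load(resp)
            print("Server is up, serving:", [m["id"] for m in models["data"]])
            break
    except Exception:
        print("Server not ready yet, retrying in 10s...")
        time.sleep(10)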

Deploy with Docker

Docker ensures a consistent and portable environment. First, fetch the model with the Hugging Face CLI:
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# The model is cached under $HOME/.cache/huggingface
# If network issues occur, set a mirror and re-run the download
export HF_ENDPOINT=https://hf-mirror.com
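If you prefer Python over the CLI, huggingface_hub's snapshot_download does the same job; a minimal sketch:
# Equivalent download from Python using huggingface_hub.
# Files land in the same $HOME/.cache/huggingface cache that the
# docker run command below mounts into the container.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="MiniMaxAI/MiniMax-M1-40k")
print("Model downloaded to:", local_path)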
Pull and run the vLLM Docker image:
docker pull vllm/vllm-openai:latest

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "SAFETENSORS_FAST_GPU=1" \
    --env "VLLM_USE_V1=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model MiniMaxAI/MiniMax-M1-40k \
    --trust-remote-code \
    --quantization experts_int8 \
    --dtype bfloat16

Verify Deployment

Once the server has started, test the OpenAI-compatible API with the following request:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMaxAI/MiniMax-M1",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
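The same request can be sent with the official openai Python client; the api_key below is an arbitrary placeholder, since the local server does not check it unless you start vLLM with an API key.
# Same verification using the openai client against the local vLLM server.
# The api_key value is a placeholder; it is ignored unless the server
# was started with an API key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M1-40k",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(response.choices[0].message.content)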

Experimental: Enable vLLM V1

Benchmarks show that V1 delivers 30–50% better latency and throughput under medium to high concurrency, but slightly worse performance for single-request workloads (due to missing Full CUDA Graph support, which will be fixed in future releases). This support has not yet shipped in a release, so vLLM must be built from source.
# Build and install vLLM from source
git clone https://github.com/vllm-project/vllm
cd vllm
pip install -e .
After installation, you’ll need to set additional environment variables and disable prefix caching when starting the service.
VLLM_ATTENTION_BACKEND=FLASHINFER VLLM_USE_V1=1 \
    vllm serve MiniMaxAI/MiniMax-M1-40k \
    --trust-remote-code \
    --quantization experts_int8 \
    --dtype bfloat16 \
    --no-enable-prefix-caching

Troubleshooting

Hugging Face network issues

If network errors occur when downloading models, set a mirror:
export HF_ENDPOINT=https://hf-mirror.com

No module named ‘vllm._C’

The following error means a local folder named vllm is shadowing the installed package, which commonly happens after cloning the repo to run the scripts in examples/. Rename or move the folder to fix it. See vLLM Issue #1814 for more details.
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
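To confirm this is the cause, a quick generic Python check locates which vllm the interpreter would import, without actually importing it:
# If the reported origin is inside your working directory (the cloned repo)
# rather than site-packages, a local folder is shadowing the installed
# package; rename or move it.
import importlib.util

spec = importlib.util.find_spec("vllm")
print(spec.origin if spec else "vllm is not installed")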

MiniMax-M1 model not supported

Your vLLM version is too old; please upgrade to v0.9.2 or later. For versions 0.8.3 – 0.9.1, see the workaround under Environment Requirements above.

Getting Support

If you encounter issues while deploying MiniMax models, let us know. We continuously improve the deployment experience and welcome your feedback!