Devin
January 26, 2026
Download from GitHub

Project Overview

Robot Agent is built on the MiniMax M2.1 text model and the Pi05 VLA (Vision-Language-Action) model, enabling natural language-driven robotic arm manipulation in the LIBERO simulation environment.
  • MiniMax M2.1: understands the user's natural language instructions, decomposes complex tasks into executable steps, and coordinates multi-step task execution.
  • MiniMax MCP: invokes visual understanding capabilities to analyze scene images, verify task execution results, and enable closed-loop feedback control.
  • Pi05 VLA: a vision-language-action model based on PaliGemma that generates precise robotic arm control actions from scene images and task instructions.
  • LIBERO: executes various robot manipulation tasks in a simulation environment powered by the MuJoCo physics engine.

System Architecture

User Command → MiniMax LLM (Task Planning) → Pi05 VLA (Action Execution) → LIBERO Sim
                    ↑                                                          ↓
             MCP Visual Understanding ← ────────── Scene Image ←───────────────┘

Module               | Technology      | Description
Task Planning        | MiniMax M2.1    | Understands user intent, decomposes tasks
Visual Understanding | MiniMax MCP     | Scene analysis, result verification
Action Execution     | Pi05 VLA        | Vision-Language-Action model
Simulation           | LIBERO / MuJoCo | Robot manipulation simulation

Quick Start

1. Clone Repository

git clone https://github.com/MiniMax-OpenPlatform/MiniMax-Agent-VLA-Demo.git
cd MiniMax-Agent-VLA-Demo

2. Configure API Keys

# MiniMax API Key (for LLM and MCP visual understanding)
# Get it at: https://platform.minimax.io/
export ANTHROPIC_API_KEY="your-minimax-api-key"

# HuggingFace Token (for downloading Pi05 model)
# Get it at: https://huggingface.co/settings/tokens
export HF_TOKEN="your-huggingface-token"
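Before moving on, it can help to confirm both variables are actually visible to Python. The short check below is a convenience sketch, not part of the repository:
import os

# Fail early if either credential is missing from the environment.
for var in ("ANTHROPIC_API_KEY", "HF_TOKEN"):
    if not os.environ.get(var):
        raise SystemExit(f"{var} is not set; export it before continuing")
print("API credentials found in the environment")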

3. Download Pi05 Model

# Install huggingface_hub
pip install huggingface_hub

# Login to HuggingFace
huggingface-cli login

# Download Pi05 LIBERO fine-tuned model
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='lerobot/pi05_libero',
    local_dir='./models/pi05_libero_finetuned'
)
"
Default model path: ./models/pi05_libero_finetuned. To change it, edit the MODEL_PATH variable in agent_mode.py.
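
If the Agent later fails to load the model, a quick directory check narrows the problem down. This snippet is a convenience sketch that assumes the default path above:
from pathlib import Path

# Confirm the snapshot landed where agent_mode.py expects it (default MODEL_PATH).
model_dir = Path("./models/pi05_libero_finetuned")
if not model_dir.is_dir():
    raise SystemExit(f"Model not found at {model_dir}; re-run the download step")
n_files = sum(1 for p in model_dir.rglob("*") if p.is_file())
print(f"Found {n_files} files under {model_dir}")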

4. Install Dependencies

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install all dependencies
pip install -r requirements.txt
Dependencies:
Dependency | Description
LeRobot    | HuggingFace robotics library (included)
MuJoCo     | DeepMind physics simulation engine
LIBERO     | Robot manipulation simulation benchmark
MCP        | Model Context Protocol client
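
After installation, a quick import check confirms the environment is complete. This is a convenience sketch; the libero import name is an assumption based on the upstream benchmark and may differ in your setup:
# Smoke test: these imports should succeed inside the activated virtualenv.
import mujoco    # DeepMind physics engine
import lerobot   # HuggingFace robotics library
import libero    # LIBERO benchmark (import name assumed; adjust if it differs)
import mcp       # Model Context Protocol client
print("All core dependencies import cleanly")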

5. Run the Agent

# Set display (VNC environment)
export DISPLAY=:2

# Run Agent
python agent_mode.py
Select a task scenario after startup:
  • libero_object - Object generalization
  • libero_spatial - Spatial relationship understanding
  • libero_goal - Different action goals (recommended)

Supported Tasks

In the LIBERO Goal scenario, the Agent supports the following 10 manipulation tasks:
#  | Task Command                                | Description
1  | open the middle drawer of the cabinet       | Open the middle drawer of the cabinet
2  | put the bowl on the stove                   | Place the bowl on the stove
3  | put the wine bottle on top of the cabinet   | Place the wine bottle on top of the cabinet
4  | open the top drawer and put the bowl inside | Open the top drawer and put the bowl inside
5  | put the bowl on top of the cabinet          | Place the bowl on top of the cabinet
6  | push the plate to the front of the stove    | Push the plate to the front of the stove
7  | put the cream cheese in the bowl            | Put the cream cheese in the bowl
8  | turn on the stove                           | Turn on the stove
9  | put the bowl on the plate                   | Place the bowl on the plate
10 | put the wine bottle on the rack             | Place the wine bottle on the rack
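
One way to keep the planner on this task set is to embed the canonical commands in the system prompt. The sketch below is an illustration, not the repository's actual prompt:
# Canonical LIBERO Goal instructions (mirrors the table above).
LIBERO_GOAL_TASKS = [
    "open the middle drawer of the cabinet",
    "put the bowl on the stove",
    "put the wine bottle on top of the cabinet",
    "open the top drawer and put the bowl inside",
    "put the bowl on top of the cabinet",
    "push the plate to the front of the stove",
    "put the cream cheese in the bowl",
    "turn on the stove",
    "put the bowl on the plate",
    "put the wine bottle on the rack",
]

# Hypothetical system prompt that constrains planning to exactly these commands.
system_prompt = (
    "You control a robot arm in the LIBERO Goal scene. "
    "Map the user's request to exactly one of these task commands:\n"
    + "\n".join(f"- {t}" for t in LIBERO_GOAL_TASKS)
)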

Core Code Analysis

Agent Tool Definitions

The Agent interacts with the environment through two core tools:
AGENT_TOOLS = [
    {
        "name": "execute_task",
        "description": "Execute a manipulation task using the robot arm.",
        "input_schema": {
            "type": "object",
            "properties": {
                "task": {
                    "type": "string",
                    "description": "A manipulation task instruction"
                }
            },
            "required": ["task"]
        }
    },
    {
        "name": "get_scene_info",
        "description": "Capture camera image and use VLM to analyze the current scene.",
        "input_schema": {
            "type": "object",
            "properties": {},
            "required": []
        }
    }
]

MiniMax M2.1 Task Planning

Using the Anthropic-compatible interface to call MiniMax M2.1:
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.minimax.io/anthropic",
    api_key=api_key,
    default_headers={"Authorization": f"Bearer {api_key}"}
)

response = client.messages.create(
    model="MiniMax-M2.1",
    max_tokens=4096,
    system=system_prompt,
    messages=conversation_history,
    tools=AGENT_TOOLS
)
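Tool calls returned by the model still have to be routed to local handlers. The loop below is a minimal sketch in which execute_task() and get_scene_info() stand in for the repository's actual implementations:
# Route tool_use blocks from the model's response to local handlers and
# collect tool_result entries to send back on the next turn.
tool_results = []
for block in response.content:
    if block.type != "tool_use":
        continue
    if block.name == "execute_task":
        output = execute_task(block.input["task"])  # rolls out Pi05 VLA in LIBERO (assumed helper)
    elif block.name == "get_scene_info":
        output = get_scene_info()                   # captures an image and queries MCP (assumed helper)
    else:
        output = f"Unknown tool: {block.name}"
    tool_results.append({
        "type": "tool_result",
        "tool_use_id": block.id,
        "content": str(output),
    })

conversation_history.append({"role": "assistant", "content": response.content})
conversation_history.append({"role": "user", "content": tool_results})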

MCP Visual Understanding

Using MCP to invoke MiniMax visual understanding for task verification:
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="uvx",
    args=["minimax-coding-plan-mcp", "-y"],
    env={
        "MINIMAX_API_KEY": api_key,
        "MINIMAX_API_HOST": "https://api.minimax.io"
    }
)

async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        result = await session.call_tool("understand_image", {
            "image_source": image_path,
            "prompt": "Verify if the task was completed successfully."
        })
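Because the MCP client is asynchronous, the verification call needs an event loop when invoked from the synchronous Agent loop. The wrapper below is a sketch; the verify_scene name and the result handling are assumptions:
import asyncio

async def verify_scene(image_path: str, prompt: str) -> str:
    # Open a fresh MCP session, call the visual understanding tool, and
    # return the first text block of the result (assumed result shape).
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("understand_image", {
                "image_source": image_path,
                "prompt": prompt,
            })
            return result.content[0].text

# Example: run the check synchronously after an execute_task call.
verdict = asyncio.run(verify_scene("scene.png", "Verify if the task was completed successfully."))
print(verdict)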

Technical Details

Pi05 Model Parameters

Parameter         | Value
Base Model        | PaliGemma
Input             | 2 camera images + robot arm state + language instruction
Output            | 7-dim action (end-effector position delta + orientation delta + gripper)
Control Frequency | 10 Hz
Max Steps         | 280 steps/task
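
The 7-dimensional action layout and the step budget translate directly into the shape of the control loop. The snippet below only illustrates the numbers in the table; the exact index ordering is an assumption:
import numpy as np

# Illustrative layout of one Pi05 action vector, applied at 10 Hz:
#   action[0:3] -> end-effector position delta
#   action[3:6] -> end-effector orientation delta
#   action[6]   -> gripper command
action = np.zeros(7, dtype=np.float32)

MAX_STEPS = 280   # per-task step budget from the table above
CONTROL_HZ = 10
print(f"Worst-case rollout time: {MAX_STEPS / CONTROL_HZ:.0f} s per task")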

Agent Workflow

  1. User Input: Receive natural language instructions
  2. Task Planning: MiniMax M2.1 understands intent and maps to supported tasks
  3. Action Execution: Pi05 VLA generates robotic arm control sequences
  4. Result Verification: MCP visual understanding analyzes the scene to confirm task completion
  5. Feedback Loop: If verification fails, automatically retry the task
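
Put together, the workflow is a plan-act-verify loop with retry on failure. The sketch below reuses the verify_scene wrapper from above and assumes hypothetical plan_task() and execute_task() helpers plus an arbitrary retry budget:
import asyncio

MAX_RETRIES = 2  # retry budget is an assumption, not a repository setting

def run_instruction(user_input: str) -> bool:
    # Steps 1-2: MiniMax M2.1 maps the free-form request onto a supported task command.
    task = plan_task(user_input)
    for attempt in range(1 + MAX_RETRIES):
        # Step 3: Pi05 VLA rolls out the control sequence in LIBERO.
        execute_task(task)
        # Step 4: MCP visual understanding checks the resulting scene image.
        verdict = asyncio.run(verify_scene(
            "scene.png", f"Verify whether this task was completed: {task}"))
        if "yes" in verdict.lower():
            return True
        # Step 5: verification failed, retry the same task.
    return False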

FAQ

  • API authentication fails: check that ANTHROPIC_API_KEY is correctly set to your MiniMax API Key.
  • Pi05 model fails to load: check that HF_TOKEN is configured and that MODEL_PATH points to the downloaded model.
  • MCP visual understanding is unavailable: make sure mcp is installed (pip install mcp) and the uvx command is available.
  • The simulation window does not appear: set export DISPLAY=:2 (VNC) or ensure you have a working X11 environment.

Application Extensions

Based on the current architecture, developers can explore the following directions:
  • Multi-task Chaining: Implement automatic decomposition and sequential execution of complex tasks
  • Failure Recovery: Enhance error detection and automatic recovery capabilities
  • Real-world Deployment: Transfer simulation policies to physical robotic arms
  • Multi-modal Interaction: Combine speech recognition to enable voice-controlled robots

Summary

In this tutorial, we demonstrated how to build an intelligent robot using MiniMax M2.1 and MCP visual understanding:
  • MiniMax M2.1 understands user intent and converts natural language instructions into specific manipulation tasks
  • MiniMax MCP provides visual understanding capabilities to verify task execution results and enable closed-loop control
  • Pi05 VLA serves as the underlying executor, generating precise robotic arm actions based on visual input
  • LIBERO/MuJoCo provides a realistic physics simulation environment
This LLM + VLM + VLA collaborative architecture demonstrates the potential of large models in the field of robot control.