LLM-orchestrated gestures and real-time human-robot interaction, built on a Unitree G1 humanoid platform.

A major 5G infrastructure vendor sought to demonstrate how connectivity enables real-time intelligent robotic systems, using a Unitree G1 humanoid as the centerpiece of a live lab demo. The existing prototype repositories provided a starting point, but were fragmented and unstable, making them unsuitable for a live audience. The system needed to handle voice interaction, gesture recognition, and physical robot behaviors in a coordinated and reliable way, with low enough latency to feel genuinely responsive. Achieving that required building a clean control layer to serve as the interface between the robot's hardware and the AI orchestration layer running on top of it.
Focus built the robot control layer for the Unitree G1: a FastAPI server with REST endpoints for gestures and physical actions, plus WebSocket streams for real-time video and audio. The full stack ran containerized via Docker Compose, with RealSense camera integration for live visual input. To connect hardware to AI orchestration, Focus implemented an MCP server using fastmcp, giving an LLM direct access to trigger gestures, read live multimodal context, and coordinate behaviors in real time. The result: a humanoid that detected participant gestures, responded with matching physical actions, held conversations, and drew on the client's knowledge base to answer relevant questions.
The client demonstrated live that connectivity powers intelligent robotics, with a humanoid that the public genuinely wanted to approach and engage.
An MCP server gave the LLM a unified interface to trigger physical gestures and read live multimodal context, eliminating brittle point-to-point integrations between perception, language, and action.
Real-time gesture detection made the interaction genuinely bidirectional: the robot recognized when participants were waving or extending a hand, and responded with the appropriate physical action.
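The case study does not describe the detection method, but a wave detector of this kind can be sketched as a simple heuristic over pose-estimator output. Everything here is a hypothetical illustration: it assumes a time series of normalized wrist x-positions is already available and flags a wave when the wrist reverses direction enough times.

```python
# Hypothetical wave-detection heuristic, assuming wrist x-positions from
# an upstream pose estimator. Thresholds are illustrative, not tuned.
def detect_wave(wrist_x: list[float],
                min_reversals: int = 3,
                min_amplitude: float = 0.05) -> bool:
    """Return True if the wrist trajectory oscillates like a wave."""
    reversals = 0
    direction = 0  # +1 moving right, -1 moving left, 0 unknown
    for prev, cur in zip(wrist_x, wrist_x[1:]):
        delta = cur - prev
        if abs(delta) < min_amplitude:
            continue  # ignore jitter below the amplitude threshold
        new_dir = 1 if delta > 0 else -1
        if direction and new_dir != direction:
            reversals += 1  # count each change of horizontal direction
        direction = new_dir
    return reversals >= min_reversals
```

In the demo pipeline, a positive detection like this would trigger the matching gesture endpoint on the control layer, closing the perception-to-action loop.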
A Docker Compose deployment packaged the entire stack, including RealSense camera streaming and WebSocket audio/video, making the system reproducible, portable, and ready for live demo environments.
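A Compose file for a stack like this might look as follows. Service names, ports, build paths, and the camera device path are all assumptions for illustration, not the client's actual configuration.

```yaml
# Hedged sketch of the demo stack's docker-compose.yml; names, ports,
# and paths are illustrative assumptions.
services:
  control-layer:
    build: ./control-layer        # FastAPI REST + WebSocket server
    ports:
      - "8000:8000"
    devices:
      - /dev/video0:/dev/video0   # RealSense camera passthrough (assumed path)
  mcp-server:
    build: ./mcp-server           # fastmcp bridge for the LLM
    depends_on:
      - control-layer
    environment:
      CONTROL_API: http://control-layer:8000
```

Packaging the camera passthrough and service wiring in one file is what makes the demo reproducible: the same `docker compose up` works in the lab and on the show floor.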
