A research team from the University of Illinois Urbana-Champaign, along with collaborators from Columbia University and UT Austin, has introduced a new system that allows robots to learn complex tool-use skills simply by watching video clips. The approach, called “Tool-as-Interface,” marks a departure from traditional robotics, which relies heavily on manual programming or sensor-intensive training setups.
The system enables robots to observe tasks—such as hammering, scooping, or flipping food—and reproduce them using only visual input from two camera angles. The method removes the need for motion capture suits, specialized tools, or remote human control.
According to TechXplore, at the core of the framework is a visual model called MASt3R, which converts two frames from ordinary videos into a 3D reconstruction of the scene. Using a technique known as 3D Gaussian splatting, the system then generates multiple synthetic viewpoints, allowing the robot to analyze the task from different angles.
To focus the robot’s learning on the interaction between the tool and its environment, the human is digitally removed from the scene using a segmentation model known as Grounded-SAM. This tool-centric view allows the system to understand the function and motion of the tool itself, rather than mimicking the human operator. As a result, learned skills are more easily transferred between different robotic platforms with varying hardware configurations.
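The pipeline described above — two frames in, a 3D reconstruction, synthetic viewpoints, and a human-free, tool-centric view out — can be sketched at a high level. This is a minimal illustrative outline only: every function below is a placeholder stub, and the real system's components (MASt3R, 3D Gaussian splatting, Grounded-SAM) have their own APIs that differ from these hypothetical names.

```python
# Hypothetical sketch of the "Tool-as-Interface" data pipeline.
# All functions are placeholder stubs standing in for the real models;
# none of these names come from the actual codebase.

def reconstruct_scene(frame_a, frame_b):
    """Stand-in for MASt3R: turns two ordinary video frames
    into a 3D reconstruction of the scene."""
    return {"frames": (frame_a, frame_b)}

def render_novel_views(scene, n_views=4):
    """Stand-in for 3D Gaussian splatting: synthesizes extra
    viewpoints so the task can be analyzed from different angles."""
    return [f"view_{i}" for i in range(n_views)]

def remove_human(views):
    """Stand-in for Grounded-SAM: segments out and removes the
    human operator, leaving only the tool and its environment."""
    return [v + "_tool_only" for v in views]

def tool_centric_training_data(frame_a, frame_b):
    """End-to-end: frames -> 3D scene -> novel views -> tool-only views."""
    scene = reconstruct_scene(frame_a, frame_b)
    views = render_novel_views(scene)
    return remove_human(views)
```

The key design idea the sketch mirrors is the ordering: the human is removed *after* novel views are rendered, so the learning signal that remains is purely the tool's motion relative to the scene.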
The research team tested the method on five distinct tasks, including hammering nails, scooping meatballs, and kicking a soccer ball. The robots performed these actions reliably, achieving success rates 71% higher than traditional teleoperation-based training while reducing training time by 77%.
While promising, the system does have some limitations. It currently assumes the tool is rigidly fixed to the robot’s gripper, and it can occasionally misestimate camera poses when reconstructing viewpoints. Still, the team sees this as a key step toward enabling robots to learn from widely available video content such as online tutorials or home recordings.
The research was recognized with a Best Paper Award at ICRA 2025 and is available as a preprint on arXiv.