Researchers from Stephen James’s Robot Learning Lab in London have developed a groundbreaking system named Genima that leverages AI-generated images to train robots. By fine-tuning Stable Diffusion, Genima visualizes robots’ movements, guiding them in both simulation and the real world.
Genima functions as a behavior-cloning agent, transforming the traditional approach to robotic training. It “draws joint-actions” on RGB images, which a controller then maps to a sequence of joint positions. This method allows robots, from mechanical arms to humanoids and driverless cars, to learn complex tasks more efficiently.
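The two-stage pipeline described above can be sketched in miniature. Note the hedging: in Genima, stage 1 is a fine-tuned Stable Diffusion model and stage 2 is a learned controller network; the function names, sphere colors, and the analytic decoding below are purely illustrative stand-ins to show the data flow, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of a Genima-style two-stage pipeline.
# Stage 1 "draws" future joint targets as colored spheres on an RGB
# observation; stage 2 maps the drawn image back to joint targets.
# All names and details here are hypothetical.

COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # one color per joint

def draw_joint_targets(rgb, targets_px, radius=4):
    """Stage 1 stand-in: paint a colored disc at each 2D joint target.

    In Genima this image would come from a fine-tuned diffusion model;
    here we rasterize the targets directly to show the image format.
    """
    img = rgb.copy()
    h, w, _ = img.shape
    for i, (u, v) in enumerate(targets_px):
        ys, xs = np.ogrid[:h, :w]
        mask = (xs - u) ** 2 + (ys - v) ** 2 <= radius ** 2
        img[mask] = COLORS[i % len(COLORS)]
    return img

def controller(drawn_img):
    """Stage 2 stand-in: recover each sphere's pixel center by color.

    The real controller is a learned network mapping visual targets to
    joint positions; this decodes the colors analytically instead.
    """
    recovered = []
    for c in COLORS:
        ys, xs = np.nonzero(np.all(drawn_img == c, axis=-1))
        if len(xs):
            recovered.append((int(xs.mean()), int(ys.mean())))
    return recovered

rgb = np.zeros((64, 64, 3), dtype=np.uint8)  # stand-in camera frame
targets = [(10, 20), (32, 32), (50, 12)]     # hypothetical joint targets
drawn = draw_joint_targets(rgb, targets)
print(controller(drawn))  # → [(10, 20), (32, 32), (50, 12)]
```

The key design point the sketch illustrates is that the action representation lives entirely in image space, so an internet-pretrained image model can be repurposed to propose actions without ever outputting joint angles directly.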
According to Interesting Engineering, the researchers conducted extensive studies using Genima across 25 tasks from the RLBench simulation and 9 real-world manipulation scenarios. They discovered that by lifting actions into image-space, internet pre-trained diffusion models could generate policies that outperform state-of-the-art visuomotor approaches. Notably, Genima demonstrated robustness against scene perturbations and adaptability to novel objects, making it competitive even with 3D agents that utilize additional data such as depth and keypoints.
However, the study highlights that Genima has its limitations. As with all behavior-cloning agents, it focuses on distilling expert behaviors rather than discovering new ones. The system relies on camera calibration to render targets accurately, assuming that the robot is always visible from a specific viewpoint.
The researchers validated Genima’s effectiveness by benchmarking it against the ACT neural network on real-robot setups. Using a Franka Emika Panda robot equipped with external and wrist cameras, they trained multi-task agents from scratch across nine diverse tasks, including handling dynamic and deformable objects. The system visualizes desired actions—such as opening a box or hanging a scarf—as colored spheres drawn atop images, indicating future joint movements.
In trials, Genima completed the 25 simulated and nine real-world tasks with average success rates of 50% and 64%, respectively, according to MIT Technology Review. The potential of pre-trained diffusion models to transform robotics parallels their revolutionary impact on image generation, signaling a promising future for robotic training techniques.