Executive Summary
In this report, the ML6 Robotics & AI team presents our experience preparing, executing, and evaluating two imitation learning use cases with different levels of complexity on custom datasets using the LeRobot framework. These learnings inspired us to join the 2025 LeRobot Hackathon and create a podium-finishing submission (team 297).
The Evolution of Robotics
Robotics has always spoken to the imagination. Even though the word “robot” did not yet exist, references to “automatons” can be found in Greek mythology dating back millennia. Throughout history, they expressed power, wonder, and the desire for human transcendence. Slowly but surely, curious attempts at automatic machines became increasingly common. By the end of the Victorian era, humanity had seen some truly uncanny creations under the mantle of “clockwork automatons.”
These early ventures eventually evolved to fill practical needs. The first industrial robot arm, called Unimate, was developed in the late 1950s. From then on, further advancements introduced the world to mobile robots capable of reasoning. In that category, Shakey is widely considered the first. Developed by the Stanford Research Institute in the 1960s, Shakey could process images and text as input and navigate a physical space as a result.
So where are we now? The recent, prominent advances in generative AI have given rise to an explosion in the capabilities of digital systems that generate anything from text to video and more. Thanks to progress in algorithms, hardware, and data availability, artificial intelligence has taken the world by storm.
Imitation Learning
While generative AI has revolutionized media creation, other fields, most notably physical AI, have taken note. For a long time, the hope and expectation was that reinforcement learning (RL) could propel this field forward. Though it cannot be discounted and has produced impressive demos of quadrupeds and humanoids performing parkour or breakdancing, it has not yet proven itself in a broader, practical setting. Barring rare exceptions, the robots in production today are rule-based systems that have to be carefully programmed or configured by hand for a limited set of well-defined scenarios.
But an alternative method is on the rise. Imitation learning (IL) has been around for a while but, despite some early successes, it never really took off. Today, thanks to improvements in generative techniques, it has surged forward: transformers, diffusion models, and foundation vision-language models can now be leveraged to produce instructions for mechanical actuators. These models aim to replicate expert demonstrations by evaluating their own predicted actions against ground-truth data points.
The architectural differences between reinforcement learning and imitation learning are stark, but the overall concept can be framed in a similar way. The policy still maps observations to actions, but in imitation learning the observations are pre-recorded instead of captured from a live environment. The predicted action is compared to the expert action for that observation, and the resulting difference serves as the loss, i.e. the optimization objective. This has two main benefits: no need to engineer a reward function, and no live environment required during training.
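To make that objective concrete, here is a minimal sketch of such a behavior-cloning update in PyTorch; the network shape, the dimensions, and the stand-in demonstration tensors are illustrative placeholders rather than LeRobot code.

```python
import torch
import torch.nn as nn

# Illustrative policy: maps a flattened observation vector to joint-space actions.
policy = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 6))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def training_step(observations: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One imitation learning update: predict actions and regress onto the expert's."""
    predicted = policy(observations)
    # The loss is simply the distance to the recorded expert actions:
    # no reward function to engineer, no live environment to run.
    loss = nn.functional.mse_loss(predicted, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stand-in batch of pre-recorded (observation, expert action) pairs.
demo_obs, demo_actions = torch.randn(24, 32), torch.randn(24, 6)
print(training_step(demo_obs, demo_actions))
```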

Leveraging the benefits of imitation learning introduces a trade-off, however. If the policy cannot explore environments independently, it will need data, and lots of it. Data gathering usually happens through teleoperation: an expert steers the robot to complete set goals, while video streams and actuator positions are recorded as input for the IL algorithm.
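In practice, a teleoperated recording session boils down to a loop like the sketch below. The leader/follower arm objects, the camera objects, and their methods are hypothetical placeholders for real hardware drivers, not the LeRobot API.

```python
import time

def record_episode(leader_arm, follower_arm, cameras, fps=30, duration_s=15):
    """Hypothetical teleoperation recording loop: the expert drives the leader arm,
    the follower mirrors it, and we log camera frames plus follower joint
    positions at a fixed rate."""
    frames = []
    for _ in range(int(fps * duration_s)):
        t0 = time.perf_counter()
        follower_arm.set_goal(leader_arm.read_joint_positions())  # mirror the expert
        frames.append({
            "images": {name: cam.read() for name, cam in cameras.items()},
            "joint_positions": follower_arm.read_joint_positions(),  # training target
        })
        # Keep a steady sampling rate so episodes are temporally consistent.
        time.sleep(max(0.0, 1.0 / fps - (time.perf_counter() - t0)))
    return frames
```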
But there’s a problem: scale. Unlike text or image generation, robotics does not have internet-scale data. And since manual data gathering through teleoperation is expensive and time-consuming, alternative sources are needed. Two common methods involve generating synthetic data in simulation environments and performing pose estimation on videos of humans, animals, and robots.
Open-Source
All these developments have both sparked the interest of and been driven by companies and enthusiasts globally, including Hugging Face. They created LeRobot, an open-source framework for training and deploying robot learning models. To complement the broad availability of these models, LeRobot also streamlines access to low-cost robotics hardware like the Standard Open Arm 100 (SO100), for which they collaborated with TheRobotStudio. The arm is designed to be 3D-printable and powered by low-cost, off-the-shelf electric motors, with accessibility as the main focus. For this reason we employed these arms during our tests. The leading and open nature of LeRobot’s efforts has made their framework a de facto standard for robot learning policies.
The growing set of IL models available on or compatible with LeRobot falls into two main categories:
- Narrow models, which specialize in one simple task without pre-training on robotics data. They’re efficient and easily deployable but lack generalization.
- Foundation models, generally VLA (Vision-Language-Action) models that combine a vision-language backbone, pre-trained for “world knowledge”, with a custom action head. They undergo further training on large-scale robotics data.
For more generalist use cases, then, there is a large need for data. Leading organizations like Physical Intelligence, NVIDIA, and Google invest heavily in generating real, synthetic, and inferred robotics data. But as Ilya Sutskever once said, (internet) data is like fossil fuel: non-renewable and running out quickly. Robotics might be the key to creating a renewable data source, by continuously interacting with the world and feeding novel experiences back into datasets. For now, though, robotics data is more akin to precious metals: scarce and expensive to gather.
ML6 Field Report using LeRobot
At ML6 we put state-of-the-art imitation learning models to the test. Armed with two SO100 arms, a couple of cameras, some ordinary objects, and the LeRobot framework, we tested both narrow and foundation models on our own custom datasets.
To optimize model performance, our setup followed these guidelines during real-world data collection via teleoperation:
- Minimize visual noise: Remove irrelevant objects and occlude distracting backgrounds.
- Clear operational area: Use an open space for both teleoperation and action areas.
- Optimize camera setup: Place cameras so they capture as much useful information as possible.
Failure or Success?
As for model evaluation, there are no real standardized methods or benchmarks in place, because evaluating in a strict, objective manner is genuinely difficult. Since loss is the optimization metric, it seems like an obvious candidate, but in practice it does not work. Loss represents the difference between the predicted actions and the expert actions. It can be very low and yet the model might still fail because of a tiny positional error in the arm. In object manipulation, a millimeter might mean the difference between failure and success.
Some have turned to simulated evaluation. This has two pitfalls:
- Sim-to-real gap: Insights gained in simulation do not translate reliably to reality, making the evaluation at least partially unreliable.
- Automatic evaluation: Properly judging an episode involves multiple aspects, which takes us back to the complexity of writing good reward functions.
That leaves one option: real-world human evaluation. This is the most accurate form of evaluation, but it’s labour-intensive and time-consuming. Reliable results require constant attention and focus from human evaluators. Additionally, it is often difficult to draw a line between failure and success. A model might exhibit aggressive motion during an episode but still achieve the goal of moving an object to a certain position. Do we accept this behavior and call it a success? Do we accept higher wear of the workspace while successfully manipulating the target? Do we accept that the object is very slightly off the mark on task completion? All this makes evaluation a considerably subjective process.
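One pragmatic way to keep human evaluation at least consistent is to score every episode against a fixed checklist and decide afterwards which criteria count as failure. The sketch below is a hypothetical illustration of that idea, not a tool we actually used.

```python
def ask(question: str) -> bool:
    """Prompt the human evaluator for a yes/no judgement."""
    return input(question + " [y/n] ").strip().lower() == "y"

def score_episode() -> dict:
    """Record the subjective judgement calls alongside the binary goal,
    so 'success' can be reinterpreted later under stricter criteria."""
    return {
        "goal_reached": ask("Did the object end up in the target position?"),
        "aggressive_motion": ask("Was the motion aggressive or jerky?"),
        "workspace_wear": ask("Did the episode cause unnecessary workspace wear?"),
        "off_the_mark": ask("Was the final placement slightly off the mark?"),
    }

if __name__ == "__main__":
    episodes = [score_episode() for _ in range(10)]
    lenient = sum(e["goal_reached"] for e in episodes)
    strict = sum(e["goal_reached"] and not e["aggressive_motion"] for e in episodes)
    print(f"lenient success rate: {lenient}/10, strict: {strict}/10")
```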
Testing Narrow Models: ACT
We began with the Action Chunking Transformer (ACT). It outputs a sequence of actions for each input frame, enabling smoother and faster control than step-by-step models.
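To illustrate what action chunking buys us, the sketch below contrasts one inference call per chunk with one call per control step; the dummy policy and random observation features are stand-ins, not the actual ACT implementation.

```python
import numpy as np

CHUNK, ACTION_DIM = 10, 6  # ACT predicts a whole chunk of future actions per observation

def dummy_policy(observation: np.ndarray) -> np.ndarray:
    """Stand-in for an ACT forward pass: returns CHUNK actions in one call."""
    return np.tile(observation[:ACTION_DIM], (CHUNK, 1))

def run_chunked(policy, steps: int = 30) -> int:
    """Execute predicted chunks instead of querying the model every step;
    returns how many (slow) inference calls were needed."""
    calls, t = 0, 0
    while t < steps:
        chunk = policy(np.random.rand(32))  # placeholder camera/joint features
        calls += 1
        for _action in chunk:               # replay the whole chunk before re-planning
            t += 1
            if t >= steps:
                break
    return calls

print(run_chunked(dummy_policy))  # 3 inference calls for 30 control steps
```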
We began our experiments in the simplest way possible, with the ‘Hello World’ equivalent of robotics: ‘Pick & Place’. The goal of this task is to pick up an object and place it in a predefined position or container. In our case we used a brick and a small container placed beside the base of the arm. We defined success simply as the brick ending up in the container from an in-distribution situation. The setup provided three video streams: a top-down, a frontal, and a gripper-mounted camera view. To tackle the tests in a structured manner, we recorded modular datasets, creating 20 episodes for each individual position of the brick. For reproducibility during data collection and policy evaluation we used a fixed grid as the background. We left the hyperparameters at their defaults for simplicity and consistency, but maximised VRAM usage by training with the largest batch size that fit (24 in our case); we recommend increasing the batch size up to that limit.
Starting with a sanity test, we trained an initial model on a small dataset (10k frames) consisting only of the most central position in the workspace. It contains 20 episodes with a total duration of five and a half minutes.
Brick sampling position (L) and Dataset sample episode (R)
Motion was jittery, but with an accuracy of 60% the policy showed clear potential for reliable performance. A likely reason for the immature behavior was the training duration: the checkpoint we tested was taken when the loss started plateauing. Tony Zhao, author of ACT, and others recommend training for much longer, well after the loss has stabilized, as this can improve the success rate and motion smoothness.
Successful vs. Unsuccessful episode
With our first model evaluation under our belt, we expanded the scope of the task by introducing brick positions along a single axis. The brick was placed at 5 positions on the vertical centre line of the workspace. Combined, the dataset totals 100 episodes and is 46k frames long, or 25 minutes in duration.
Brick sampling positions (L) and Dataset sample episode (R)
The resulting accuracy: 90%. This time the model displayed far more control, presumably because of the extended training duration. This accuracy, however, only held for in-distribution tests. The model was unable to generalize beyond the fixed positions in the dataset by interpolating or extrapolating known positions. And because camera dropouts and background variations were absent from the training data, the model was also unable to handle such conditions during inference.

Next, we introduced an extra dimension along the horizontal axis (137k frames). This brought the total duration up to 1 hour and 16 minutes, across 340 episodes.

Brick sampling positions (L) and Dataset sample episode (R)
Successful vs. Unsuccessful episode
This time the model was successful in 79% of the in-distribution episodes and was also able to generalize somewhat reliably across brick positions around the centre point of the workspace.
Takeaway: ACT
ACT showed us the potential of narrow models. We experienced smooth control without any interruptions, thanks to its computational efficiency and action chunking. But it is important to keep its limitations in mind: it requires a lot of high-quality data and exhibits minimal generalization.
Testing Foundation Models: GR00T-N1
To try a different approach, we moved on to GR00T-N1, a VLA model by NVIDIA. We kept the default hyperparameters and set the batch size to the maximum of 32. Considering the size of the model, we set up a remote training and inference server on Runpod. Inference on models of this size currently causes significant delays between action sequences, which means that demonstrations of GR00T-N1 inference episodes are not shown at full speed: the pauses between action sequences due to inference latency are cut out.
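The latency problem comes from the control loop having to wait for every round trip to the remote GPU. A rough sketch of that loop is shown below; the endpoint, request format, and helper callbacks are assumptions for illustration, not the actual GR00T-N1 serving interface.

```python
import time
import requests  # hypothetical HTTP interface to the remote GPU server

SERVER_URL = "http://<remote-gpu-host>:8000/predict"  # placeholder endpoint

def remote_control_loop(get_observation, send_action, steps=100):
    """Sketch of remote chunked inference: every round trip to the GPU server
    inserts a visible pause between the executed action sequences."""
    t = 0
    while t < steps:
        observation = get_observation()
        t0 = time.perf_counter()
        # The arm idles here while the observation travels to the server,
        # the VLA runs a forward pass, and the action chunk travels back.
        response = requests.post(SERVER_URL, json={"observation": observation}, timeout=10)
        print(f"inference + network latency: {time.perf_counter() - t0:.2f}s")
        for action in response.json()["actions"]:
            send_action(action)
            t += 1
            if t >= steps:
                break
```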
Our first attempt at fine-tuning the foundation model with the complete Pick & Place dataset did not yield a single successful episode. It would follow the general task movements to some extent but would never succeed at grabbing the brick. It seemed to lack precision, as well as knowledge of how to continue when early errors put the arm position out of distribution. To address this, we tried surgically injecting ‘grabbing’ data into the dataset: very short episodes of the locating and grabbing motion in which the gripper started already in the vicinity of the brick. This attempt also failed; the model became even more confused than before.
For the next task we explored the model’s boundaries with textile manipulation. Robots notoriously struggle here due to the stochastic behavior of deformable fabric. Academic teams from across the globe compete every year to push the frontiers of the field; in 2024, the competition was organized by UGent’s AIRO lab. Our approach to the challenge consisted of a bimanual setup with three cameras: two gripper-mounted cameras provided close-up perspectives for precise manipulation, and one top-down camera delivered a global overview.
Unfolding is the most challenging step in robotic textile manipulation due to the enormous variability in the initial configurations of the fabric. To stress-test the model, we decided to give it a shot. Our dataset consisted of 100 variations of spreading the towel across the workspace, totalling 53k frames or 29 minutes.
Dataset sample episode
Fine-tuning resulted in an accuracy of 60%. Considering the notorious difficulty of such tasks in robotics, this was a very good start. In terms of model characteristics, we noticed strong global task awareness but a lack of precision and subtask awareness. The model would occasionally fail to produce the correct next step, but its continued attempts indicated that it recognized the task remained unfinished.
Successful vs. Unsuccessful episode
Encouraged, we moved on to a slightly easier task: folding the towel neatly. The dataset, again, consisted of 100 episodes (76k frames), taking 42 minutes in total.
Dataset sample episode
The model impressively achieved 80% accuracy. Problems only occurred when the first folding step was imprecise, often leading the model to stop or execute the second step poorly. Additional demonstrations covering those situations should drastically decrease the error rate.

Successful vs. Unsuccessful episode
Takeaway: GR00T-N1
GR00T-N1 demonstrated the ability to handle more complex tasks while maintaining data requirements comparable to those of a narrow model. Extensive pre-training on large-scale datasets appears to primarily enhance the model’s capacity to execute more complex tasks, but contributes less to reducing the data volume or diversity needed for effective fine-tuning. A notable limitation of this generation of VLA models is the stuttering motion they exhibit, caused by high inference and network latency.
Overall Learnings
From these tests, we conclude that the data we feed these models is the primary factor, and that it should meet the following criteria for optimal results:
- Accuracy should be as high as possible. The model learns to replicate the expert’s behavior, which means that carelessly recorded data will translate to suboptimal model behavior.
- Controlled, sequential movements simplify action production because the model doesn’t have to account for many different, critical actuator positions and actions at the same time.
- Comprehensive datasets expose the model to every possible variation of the task it may encounter. This ensures familiarity and know-how to tackle rarer positions.
- Robust datasets equip the model to handle unforeseen situations. An object might, for example, slip out of the gripper’s grasp. If such situations are not sufficiently represented in the training data, the model might not know how to react.
Data isn’t the only factor. It’s also important to operate in highly controlled environments where distracting interactions are avoided.
As we neared the end of our deep dive into imitation learning, LeRobot organized a worldwide hackathon. We participated, focusing on the issue of camera instability during training and deployment. Our approach, which tackled the problem with Gaussian splatting, propelled us into the top 10 and earned us 3rd place overall (team 297) through community voting.
Current Improvement Avenues
In working with imitation learning models and actively participating in the open-source community we identified the following needs:
- Improved model architectures: While employing generative AI techniques to produce actions is a good start, further iteration and experimentation are needed to attain methods that are highly accurate, data-efficient, and able to keep inference latency low enough for high-speed control.
- Stronger compute: Current edge devices, and even on-premises hardware, cannot handle the compute requirements of current models. To reduce latency as much as possible, powerful local compute devices must become broadly accessible.
- Standardized evaluation: Fair, efficient comparison requires a standard evaluation protocol.
- Renewable data stream: This is somewhat of a chicken-and-egg problem. Physical AI has the potential for continuous data gathering, but the shortage of the large-scale data needed for model development remains acute.
Rapid Progression
All these challenges, however, are actively being worked on. In the short time between our research into imitation learning and the writing of this blog post, many new and improved models have already been released. Hugging Face’s SmolVLA and Physical Intelligence have both introduced asynchronous inference methods that allow for faster and smoother control. In addition, iterations of state-of-the-art VLA models like GR00T-N1.5 and Pi0.5 have made strides in overall performance and generalization capabilities.
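The idea behind asynchronous inference is to compute the next action chunk while the current one is still being executed, so latency is hidden rather than felt as a pause. The sketch below shows that overlap with a background thread; the policy and robot callbacks are placeholders, and this is our own illustration of the principle, not the SmolVLA or Physical Intelligence implementation.

```python
import queue
import threading

def async_control(policy, get_observation, send_action, steps=100):
    """Overlap inference with execution: a background thread keeps one action
    chunk ready while the main loop plays back the previous one."""
    chunks = queue.Queue(maxsize=1)

    def infer_forever():
        while True:
            chunks.put(policy(get_observation()))  # blocks while a chunk is pending

    threading.Thread(target=infer_forever, daemon=True).start()

    t = 0
    while t < steps:
        for action in chunks.get():  # the next chunk is usually already waiting
            send_action(action)
            t += 1
            if t >= steps:
                break
```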
With NVIDIA recently releasing its newest Jetson generation, Thor, edge compute can now more easily run the largest VLA models.
The need for data is also actively being addressed. An effort led by Google DeepMind to create the “ImageNet of robotics” has culminated in the Open X-Embodiment dataset: over 1 million real-world episodes spanning 22 different embodiments. Continuous updates make it a major data source for pre-training. The main focus of the Open X-Embodiment project is to scale real data and enable cross-embodiment generalization.
More recently, BEHAVIOR-1K represents a similar attempt at providing data on a large scale and proposing an evaluation benchmark. This release makes over 1,200 hours of simulated data available. The goal of BEHAVIOR-1K is to advance general household intelligence with long-horizon reasoning capabilities and benchmark embodied AI models using simulated evaluation.
And thus we see that the needs and shortcomings of the field are rapidly being addressed, suggesting that today’s small-scale, resource-intensive experiments will soon give way to full-fledged, production-ready implementations.
Physical AI is moving fast from labs to real-world deployment. If you have repetitive, structured tasks in a controlled environment, now is the time to explore automation opportunities. Cost-competitive robotics powered by imitation learning is closer than most expect.
At ML6, we continue exploring how imitation learning and foundation models can power the next generation of practical, cost-competitive robotics solutions.