We won South Park Commons' Embodied AI Hack

In just 2 days, we built from scratch a system of robots that assists you at your desk. You can have it clean or organize your desk just by talking to a voice agent.

For example, you can tell the robot to "put the screwdriver away" and it will pick up the screwdriver at one end of the table and put it in the bin at the other end.

Behind the scenes, several systems work together.

First, the robots are orchestrated by a voice agent running on LiveKit. The agent has access to an overview of the table and uses it to plan around the user's commands.

If the user gives a task like "clean the table", the agent first checks what objects are on the table and then uses the tools it has to accomplish the goal.

The agent has access to 3 tools.

The first tool is move_to, a proportional (P) control loop that drives the robot's slider to any object on the table. The robot is localized using an AprilTag, while the object is localized using a VLM.
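To make the P loop concrete, here is a minimal sketch of the idea, not our actual controller. The SliderSim class, the gain kp, the speed limit, and the tolerance are all hypothetical values picked for illustration. On the real robot the current position would come from the AprilTag and the target from the VLM.

```python
from dataclasses import dataclass


@dataclass
class SliderSim:
    """Hypothetical stand-in for the real slider: integrates velocity commands."""
    position: float = 0.0  # metres along the rail (assumed unit)

    def step(self, velocity: float, dt: float) -> None:
        self.position += velocity * dt


def p_step(target: float, current: float, kp: float, max_speed: float) -> float:
    """One proportional step: command is kp * error, clamped to the speed limit."""
    command = kp * (target - current)
    return max(-max_speed, min(max_speed, command))


def move_to(slider: SliderSim, target: float, kp: float = 2.0,
            max_speed: float = 0.2, tolerance: float = 0.005,
            dt: float = 0.05, max_steps: int = 1000) -> bool:
    """Drive the slider toward target; return True once within tolerance."""
    for _ in range(max_steps):
        # On hardware, slider.position would be the AprilTag estimate (assumption).
        if abs(target - slider.position) <= tolerance:
            return True
        slider.step(p_step(target, slider.position, kp, max_speed), dt)
    return False  # did not converge within max_steps
```

Clamping the command keeps the slider from lunging when the object is far away, and the tolerance gives the loop a clear stopping condition.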

The second tool is run_policy, which lets us run any of our trained ACT policies on demand.

We collected data and trained 2 policies at SPC:

Pick up is an ACT policy trained on 200 episodes of picking random stuff up at SPC. It is used when the agent wants to pick something up from the table. We also trained a sparse reward model so the agent knows when the pick-up policy has finished and can stop it.
Put down is another policy, trained on 50 episodes of putting stuff down. It didn't work well, though, so we just sample trajectories from the dataset at random instead. Just dropping stuff doesn’t require much intelligence anyway.
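The two policies above can be sketched roughly like this. It is an illustration under assumptions, not our real code: the callables, the Observation and Action types, the 0.5 threshold, and the step limit are all hypothetical. The first function stops the policy once the reward model says the task is done, and the second picks a recorded put-down episode at random.

```python
import random
from typing import Callable, Sequence

Observation = dict        # hypothetical: camera frames plus arm state
Action = Sequence[float]  # hypothetical: joint targets for the arm


def run_until_done(policy: Callable[[Observation], Action],
                   reward_model: Callable[[Observation], float],
                   get_observation: Callable[[], Observation],
                   send_action: Callable[[Action], None],
                   threshold: float = 0.5,
                   max_steps: int = 500) -> bool:
    """Step the policy until the sparse reward model scores the scene as done."""
    for _ in range(max_steps):
        obs = get_observation()
        if reward_model(obs) >= threshold:
            return True  # success detected, so stop the policy
        send_action(policy(obs))
    return False  # gave up after max_steps


def sample_trajectory(dataset: Sequence[Sequence[Action]],
                      rng: random.Random) -> Sequence[Action]:
    """Pick one recorded put-down episode uniformly at random."""
    return rng.choice(dataset)
```

The step limit matters in practice: without it, a reward model that never fires would leave the policy running forever.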

The third tool is run_molmo, which uses MolmoACT2, a VLA that has shown strong generalization across various robot embodiments. We fine-tuned it on all the data we gathered at SPC so the model can better adapt to our embodiment. Results show that the model can reliably localize and pick up objects that were not in the training dataset, which demonstrates its ability to generalize. However, the model itself is jittery and takes a long time to converge. We suspect this is due to our naive method of remote inference and state sampling on the SO-101 arm.
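Put together, the three tools can be thought of as a small dispatch table that the agent calls by name. The sketch below is plain Python and does not use the LiveKit Agents API. The tool names come from this post, but the single string argument, the return values, and execute_plan are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def move_to(target: str) -> str:
    # Placeholder: the real tool drives the slider to the named object.
    return f"moved to {target}"


def run_policy(name: str) -> str:
    # Placeholder: the real tool runs a trained ACT policy.
    return f"ran policy {name}"


def run_molmo(instruction: str) -> str:
    # Placeholder: the real tool sends the instruction to the fine-tuned VLA.
    return f"molmo: {instruction}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "move_to": move_to,
    "run_policy": run_policy,
    "run_molmo": run_molmo,
}


def execute_plan(plan: List[Tuple[str, str]]) -> List[str]:
    """Run (tool_name, argument) steps in order, rejecting unknown tools."""
    results = []
    for tool_name, argument in plan:
        if tool_name not in TOOLS:
            raise ValueError(f"unknown tool: {tool_name}")
        results.append(TOOLS[tool_name](argument))
    return results
```

Under these assumptions, "put the screwdriver away" could become a plan like [("move_to", "screwdriver"), ("run_policy", "pick_up"), ("move_to", "bin"), ("run_policy", "put_down")].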

Under the hood, the entire system is powered by our own arbitration and network infrastructure. The robot is not controlled from a single computer.

The voice agent lives on one laptop. The ACT policies live on another. The slider control lives on a third. MolmoACT2 lives on an H200 instance in Finland. This is the power of our prior: LiveKit Portal.

The datasets, models, and source code for this project can be found here: https://github.com/livekit-examples/embodied-ai-hackathon. The priors used in this project are as follows:

LiveKit Agents: Our voice agent orchestration framework.
LiveKit Portal: Our library for teleoperation and inference on any robot. It lets operators and policies connect to a robot from different hosts, eliminating network configuration or orchestration hassle. This allowed us to build independent systems for the robot in parallel.
LeSlider: A simple hardware extension to the SO-101 that we designed for fun.

Through this project, we showcase how robotics is a systems problem, how humans can interact with robots, how robots should be deployed in the wild, and what the future looks like.

Lastly, a big thank you to SPC for hosting this awesome hackathon. We’re really grateful for the help and hospitality that the partners and members there have shown us.