New AI system helps robots grasp unfamiliar objects with open-ended language prompts

Imagine you’re a guest at a friend’s place in a foreign country, and you decide to check out what’s in their fridge for a delicious breakfast. Many of the items inside look unfamiliar, packaged in containers and wrappings you’ve never seen before. Yet, just as humans can adapt to handling new objects, robots may soon be able to do the same: a team at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed a system called Feature Fields for Robotic Manipulation (F3RM). The system combines 2D images with foundation model features to build 3D scenes, helping robots identify and grasp objects in their vicinity. F3RM can also understand and act on open-ended language prompts from humans, making it a valuable tool for navigating real-world environments filled with countless objects, such as warehouses and homes.

F3RM empowers robots to interpret natural language instructions, allowing them to manipulate objects effectively. This means that robots can understand less specific requests from humans and still accomplish the desired tasks. For instance, if a user tells the robot to “fetch a tall mug,” the robot can locate and pick up the item that best fits that description.

Creating robots that can adapt and generalize in real-world scenarios is an immensely challenging task. As Ge Yang, a postdoc at the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions and MIT CSAIL, explains, “We really want to figure out how to do that, so with this project, we try to push for an aggressive level of generalization, from just three or four objects to anything we find in MIT’s Stata Center. We wanted to learn how to make robots as flexible as ourselves, since we can grasp and place objects even though we’ve never seen them before.”

Learning ‘what’s where by looking’

This method could revolutionize the way robots handle item retrieval in vast fulfillment centers, dealing with the inherent chaos and unpredictability of such environments. In these warehouses, robots are typically provided with textual descriptions of the items they need to locate. Regardless of variations in packaging, these robots must accurately match the provided text to the actual objects to ensure customers’ orders are packed correctly.

Consider the fulfillment centers of major online retailers, which may house millions of distinct items, many of which a robot has never seen before. To operate effectively at such a massive scale, robots must possess the ability to comprehend the geometric and semantic aspects of various items, some of which may be stored in tight spaces. F3RM’s advanced spatial and semantic perception capabilities empower robots to become more proficient at pinpointing objects, placing them in bins, and facilitating efficient order packaging, ultimately streamlining the work of factory employees responsible for shipping customers’ orders.

Ge Yang emphasizes the versatility of F3RM, noting, “One thing that often surprises people with F3RM is that the same system also works on a room and building scale and can be used to build simulation environments for robot learning and large maps. But before we scale up this work further, we want to first make this system work really fast. This way, we can use this type of representation for more dynamic robotic control tasks, hopefully in real-time, so that robots that handle more dynamic tasks can use it for perception.”

The MIT team envisions F3RM’s potential applications beyond warehouses, suggesting it could be beneficial in urban and household settings. For instance, this approach could empower personalized robots to recognize and pick up specific items, enhancing their perceptual and physical understanding of their surroundings.

Senior author Phillip Isola reflects on the significance of this advancement, saying, “Visual perception was defined by David Marr as the problem of knowing ‘what is where by looking.’ Recent foundation models have become adept at recognizing thousands of object categories and providing detailed text descriptions of images. Simultaneously, radiance fields excel at representing the spatial distribution of objects within a scene. The fusion of these two approaches results in a comprehensive representation of ‘what is where’ in 3D space. Our work demonstrates that this combination is particularly valuable for robotic tasks, especially those involving the manipulation of objects in a 3D environment.”

Creating a ‘digital twin’

F3RM begins by capturing about 50 photos with a camera mounted on a selfie stick. Taken from various angles and poses, these images serve as the building blocks for a neural radiance field (NeRF), a deep learning technique that uses 2D images to construct a 3D scene. The result is akin to a “digital twin” of the robot’s immediate environment, presenting a comprehensive 360-degree representation of what surrounds it.
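To make the NeRF idea concrete, here is a minimal conceptual sketch, not the F3RM codebase: an MLP maps a 3D point and viewing direction to a color and a density, and pixel colors are produced by compositing samples along each camera ray so the network can be fit against the captured photos. All names, sizes, and the placeholder training data are illustrative assumptions.

```python
# Minimal NeRF-style volume rendering sketch (illustrative, not the F3RM implementation).
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),   # input: 3D point + view direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # output: RGB (3) + density (1)
        )

    def forward(self, points, dirs):
        out = self.mlp(torch.cat([points, dirs], dim=-1))
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3])
        return rgb, sigma

def render_rays(model, origins, dirs, n_samples=64, near=0.5, far=4.0):
    """Sample points along each ray and alpha-composite their colors into a pixel."""
    t = torch.linspace(near, far, n_samples)                            # depths along the ray
    points = origins[:, None, :] + t[None, :, None] * dirs[:, None, :]  # [rays, samples, 3]
    rgb, sigma = model(points, dirs[:, None, :].expand_as(points))
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                             # opacity per sample
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha[:, :-1]], dim=1), dim=1)
    weights = alpha * trans                                             # contribution per sample
    return (weights[..., None] * rgb).sum(dim=1)                        # composited pixel color

# One training step against ground-truth pixel colors (random placeholders here)
model = TinyNeRF()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
origins = torch.zeros(1024, 3)
dirs = torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1)
target_rgb = torch.rand(1024, 3)            # would come from the ~50 captured photos
pred_rgb = render_rays(model, origins, dirs)
loss = ((pred_rgb - target_rgb) ** 2).mean()
loss.backward()
optimizer.step()
```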

However, F3RM’s capabilities don’t stop there. Alongside the highly detailed neural radiance field, it also generates a feature field enriched with semantic information. This is done using CLIP, a vision foundation model trained on hundreds of millions of images, enabling it to grasp a wide array of visual concepts. F3RM takes the 2D CLIP features extracted from the images snapped by the selfie stick and lifts them into a 3D representation, enhancing the system’s understanding of both the geometry and the semantics of its environment.
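The sketch below illustrates the distillation idea under stated assumptions: a small network maps 3D points to CLIP-like feature vectors, and rendered features along each ray are supervised by the 2D CLIP features of the corresponding pixels. The feature dimension, loss, and placeholder inputs are assumptions for illustration, not F3RM’s actual training code.

```python
# Conceptual sketch of distilling 2D CLIP features into a 3D feature field.
import torch
import torch.nn as nn

FEATURE_DIM = 512  # typical CLIP embedding size (assumption)

class FeatureField(nn.Module):
    """Maps a 3D point to a CLIP-like feature vector."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, FEATURE_DIM),
        )

    def forward(self, points):
        return self.mlp(points)

def render_features(field, points, weights):
    """Composite per-point features along each ray using the NeRF's rendering weights."""
    feats = field(points)                              # [rays, samples, FEATURE_DIM]
    return (weights[..., None] * feats).sum(dim=1)     # [rays, FEATURE_DIM]

field = FeatureField()
optimizer = torch.optim.Adam(field.parameters(), lr=1e-3)

# Placeholders standing in for ray samples, NeRF rendering weights,
# and 2D CLIP features extracted from the corresponding image pixels.
points = torch.randn(1024, 64, 3)
weights = torch.softmax(torch.randn(1024, 64), dim=-1)
clip_target = torch.randn(1024, FEATURE_DIM)

pred = render_features(field, points, weights)
loss = 1.0 - torch.nn.functional.cosine_similarity(pred, clip_target, dim=-1).mean()
loss.backward()
optimizer.step()
```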

Keeping things open-ended

After a few initial demonstrations, the robot puts its knowledge of geometry and semantics into action, allowing it to grasp objects it has never encountered before. When a user submits a text request, the robot searches a large space of candidate grasps to identify the most promising ones for picking up the requested item. Each candidate is scored on its relevance to the user’s prompt, its resemblance to the demonstrations the robot has learned from, and whether it avoids any collisions. The grasp with the highest score is then selected and executed.
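A minimal sketch of how such a ranking could be combined is shown below. The weights, helper names, and data structures are assumptions for illustration; F3RM’s actual scoring function is described in the paper.

```python
# Illustrative ranking of candidate grasps by language relevance, similarity to
# demonstrations, and collision avoidance (not the actual F3RM scoring function).
from dataclasses import dataclass
import torch

@dataclass
class Grasp:
    pose: torch.Tensor      # candidate 6-DoF grasp pose
    feature: torch.Tensor   # feature queried from the 3D feature field at the grasp

def score_grasp(grasp, text_embedding, demo_features, collides):
    """Higher is better: text relevance plus similarity to the closest demonstration;
    colliding grasps are rejected outright."""
    if collides(grasp.pose):
        return float("-inf")
    relevance = torch.nn.functional.cosine_similarity(grasp.feature, text_embedding, dim=0)
    demo_sim = max(torch.nn.functional.cosine_similarity(grasp.feature, d, dim=0)
                   for d in demo_features)
    return (relevance + demo_sim).item()

def best_grasp(candidates, text_embedding, demo_features, collides):
    return max(candidates,
               key=lambda g: score_grasp(g, text_embedding, demo_features, collides))

# Toy usage with random stand-ins for real embeddings and a trivial collision check
candidates = [Grasp(torch.randn(6), torch.randn(512)) for _ in range(100)]
text_embedding = torch.randn(512)
demo_features = [torch.randn(512) for _ in range(4)]
chosen = best_grasp(candidates, text_embedding, demo_features, collides=lambda pose: False)
```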

To showcase the system’s ability to understand open-ended human requests, the researchers challenged the robot to pick up Baymax, a character from Disney’s “Big Hero 6.” Despite F3RM not having direct training to pick up this specific cartoon superhero toy, the robot relied on its spatial awareness and vision-language features derived from the foundation models to determine which object to grasp and how to do it.

F3RM also empowers users to specify the object they want the robot to manipulate with varying levels of linguistic detail. For example, if there are both a metal mug and a glass mug, the user can request the “glass mug.” If the robot spots two glass mugs and one is filled with coffee while the other contains juice, the user can further specify by asking for the “glass mug with coffee.” The embedded foundation model features within the feature field make this level of flexible and open-ended understanding possible.
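To show how such a query could be resolved, here is a short sketch using the open-source CLIP text encoder to embed the request and compare it against features queried from the 3D scene; the object-feature lookup is a placeholder assumption, not F3RM’s actual interface.

```python
# Matching an open-ended text query against 3D scene features via CLIP's text encoder.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Encode the user's request into the same embedding space as the feature field.
tokens = clip.tokenize(["a glass mug with coffee"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(tokens).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Placeholder: features queried from the 3D feature field at candidate object locations.
object_features = torch.randn(5, text_emb.shape[-1], device=device)
object_features = object_features / object_features.norm(dim=-1, keepdim=True)

# The candidate whose 3D features best match the text is the one the robot targets.
similarity = object_features @ text_emb.T      # [num_candidates, 1]
target_index = similarity.argmax().item()
```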

William Shen, an MIT Ph.D. student, CSAIL affiliate, and co-lead author, emphasizes the significance of this achievement, saying, “If I showed a person how to pick up a mug by the lip, they could easily transfer that knowledge to pick up objects with similar geometries such as bowls, measuring beakers, or even rolls of tape. For robots, achieving this level of adaptability has been quite challenging. F3RM combines geometric understanding with semantics from foundation models trained on internet-scale data to enable this level of aggressive generalization from just a small number of demonstrations.”

The research paper titled “Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation” is available on the arXiv preprint server.

Source: Massachusetts Institute of Technology
