Language Understanding for Robotics
In order for robots to work seamlessly alongside people, they must be able to understand and successfully execute natural language instructions. For example, someone using a voice-commandable wheelchair might direct it to “take me to the room across from the kitchen” or a person may command a robotic forklift to “pick up the pallet of tires and put it on the truck in receiving.” While natural language provides a flexible means of command and control, interpreting free-form utterances is challenging due to their ambiguity and complexity, differences in the amount of detail that may be given, and the diverse ways in which language can be composed.
Language Understanding as Probabilistic Inference
We have developed methods that frame language understanding as a problem of inference over a probabilistic model that expresses the correspondence between phrases in the free-form utterance and a symbolic representation of the robot’s state and action space (Tellex et al., 2011; Chung et al., 2015; Arkin et al., 2017). These symbols represent actions that the robot can perform, objects and locations in the environment, and constraints for an optimization-based planner. Underlying these methods is a grounding graph, a probabilistic graphical model that is instantiated dynamically according to the compositional and hierarchical structure of natural language. The grounding graph makes explicit the uncertainty associated with mapping linguistic constituents from the free-form utterance to their corresponding groundings (symbols) in the world. This model is trained on corpora of natural language utterances paired with their associated groundings, enabling the framework to automatically learn the meaning of each word in the corpora. This has the important advantage that the complexity and diversity of the language that these methods can handle are limited only by the rules of grammar and the richness of the training data. As these models become more expressive and the robot’s skills more extensive, the nature of the underlying representation becomes critical: representations that are overly rich may be too complex to evaluate, while those that are overly simplified may not be sufficiently expressive for a given task and utterance. We have developed algorithms that enable robots to learn the appropriate fidelity and complexity of these models from multi-modal data, resulting in representations that are “as simple as possible, but no simpler” [1] according to the instruction and task.
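To make the idea concrete, the following Python sketch illustrates the kind of factored MAP inference a grounding graph performs: each linguistic constituent contributes a factor that scores candidate groundings, factors are linked according to the parse structure, and inference searches for the most probable joint assignment. The parse tree, candidate symbols, weights, and function names here are toy placeholders, not the learned models from the cited work.

```python
import itertools
import math

# Hypothetical toy example: ground the phrases of "pick up the pallet"
# against a small symbol space. Each parse-tree node has candidate
# groundings (robot actions or environment objects); one factor per node
# scores how well a grounding matches the node's words and its children's
# groundings. This mirrors the factored structure of a grounding graph,
# but the features and weights below are made up.

PARSE = {
    "pick up":    {"children": ["the pallet"], "candidates": ["LIFT", "MOVE"]},
    "the pallet": {"children": [],             "candidates": ["pallet_1", "truck_1"]},
}

WEIGHTS = {  # toy log-linear weights; learned from paired data in practice
    ("pick up", "LIFT"): 2.0, ("pick up", "MOVE"): 0.3,
    ("the pallet", "pallet_1"): 1.5, ("the pallet", "truck_1"): 0.1,
    ("LIFT", "pallet_1"): 1.0, ("LIFT", "truck_1"): -0.5,
    ("MOVE", "pallet_1"): 0.2, ("MOVE", "truck_1"): 0.2,
}

def log_factor(phrase, grounding, child_groundings):
    """Log potential linking a phrase to a grounding, conditioned on its children."""
    score = WEIGHTS.get((phrase, grounding), 0.0)
    for child in child_groundings:
        score += WEIGHTS.get((grounding, child), 0.0)
    return score

def most_probable_grounding(parse):
    """Exhaustive MAP inference over the joint assignment of groundings."""
    phrases = list(parse)
    best, best_score = None, -math.inf
    for assignment in itertools.product(*(parse[p]["candidates"] for p in phrases)):
        grounding_of = dict(zip(phrases, assignment))
        score = sum(
            log_factor(p, grounding_of[p],
                       [grounding_of[c] for c in parse[p]["children"]])
            for p in phrases
        )
        if score > best_score:
            best, best_score = grounding_of, score
    return best

print(most_probable_grounding(PARSE))  # {'pick up': 'LIFT', 'the pallet': 'pallet_1'}
```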
These models capture the expressivity and compositionality of free-form language and allow people to command and control robots simply by speaking to them as they would to another person. People have used these methods to communicate with a variety of robots including smart wheelchairs, voice-commandable forklifts, micro aerial vehicles, manipulators, and teams of ground robots. These models and their related inference algorithms were a core component of the Army Research Laboratory’s Robotics Collaborative Technology Alliance (RCTA), a multi-year, multi-institutional project to develop ground robots as capable teammates for soldiers (Arkin et al., 2020; Walter et al., 2022; Howard et al., 2022). By explicitly modeling the uncertainty in the grounding of a natural language utterance, the method enables novel, effective mechanisms for resolving ambiguity, e.g., by allowing the robot to engage the user in dialogue.
Language Understanding in Unknown Environments
The aforementioned language understanding methods ground free-form language into symbolic representations of the robot’s state and action space (e.g., in the form of a spatial-semantic map of the environment). These “world models” are assumed to be known to the robot. While we have developed weakly supervised methods that enable robots to efficiently learn these models from multimodal cues, robots must be capable of understanding natural language utterances in scenarios for which the world model is not known a priori. This is challenging because it requires interpreting language in situ in the context of the robot’s noisy sensor data (e.g., an image stream) and choosing actions that are appropriate given an uncertain model of the robot’s environment.
Our lab developed an algorithm that enables robots to follow free-form object-relative navigation instructions (Duvallet et al., 2014), route directions (Hemachandra et al., 2015), and mobile manipulation commands (Patki et al., 2019; Walter et al., 2022), without any prior knowledge of the environment. The novelty lies in the algorithm’s treatment of language as an additional sensing modality that is integrated with the robot’s traditional sensor streams (e.g., vision and LIDAR). More specifically, the algorithm exploits environment knowledge implicit in the command to hypothesize a representation of the latent world model that is sufficient for planning. Given a natural language command (e.g., “go to the kitchen down the hallway”), the method infers language-based environment annotations (e.g., that the environment contains a kitchen at a location consistent with being “down” a hallway) using our hierarchical grounding graph-based language understanding method (Chung et al., 2015). An estimation-theoretic algorithm then learns a distribution over hypothesized world models by treating the inferred annotations as observations of the environment and fusing them with the robot’s other sensor streams. The challenge is then to choose actions that are consistent with the free-form utterance given only this learned distribution over world models. The method formulates this problem as a Markov decision process and learns a belief-space policy from human demonstrations that reasons directly over the world model distribution to identify suitable actions. Together, these components enable robots to follow natural language instructions in complex environments without any prior knowledge of the world model.
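As a rough illustration of the estimation step (not the published formulation), the sketch below maintains the world model distribution as a set of weighted hypotheses, treats a language-derived annotation as an observation that reweights them, and selects an action by reasoning over the full distribution. The observation likelihood, world-model representation, and action selection are simple stand-ins for the learned models described above.

```python
import random

# Illustrative sketch: represent the distribution over hypothesized world
# models as weighted samples, treat a language-derived annotation (e.g.,
# "a kitchen lies down the hallway") as an observation that reweights
# them, and pick an action by reasoning over the whole distribution.

def sample_world_model():
    """Hypothesize a world model: here, just a possible kitchen location."""
    return {"kitchen_at": random.uniform(0.0, 20.0)}  # meters down the hallway

def annotation_likelihood(world, annotation):
    """How consistent is this hypothesis with the language annotation? (placeholder)"""
    if annotation == "kitchen down the hallway":
        # Favor hypotheses that place the kitchen farther along the hallway.
        return max(world["kitchen_at"] / 20.0, 1e-6)
    return 1.0

def update_belief(particles, annotation):
    """Reweight and renormalize the world-model distribution."""
    weights = [w * annotation_likelihood(m, annotation) for m, w in particles]
    total = sum(weights)
    return [(m, w / total) for (m, _), w in zip(particles, weights)]

def select_action(particles):
    """Belief-space policy placeholder: drive toward the expected kitchen location."""
    expected = sum(w * m["kitchen_at"] for m, w in particles)
    return f"navigate_to({expected:.1f} m)"

belief = [(sample_world_model(), 1.0 / 100) for _ in range(100)]
belief = update_belief(belief, "kitchen down the hallway")
print(select_action(belief))
```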
The algorithm that explicitly learns a distribution over the latent world model has been used by voice-commandable wheelchairs and ground robots to efficiently follow natural language navigation instructions in unknown, spatially extended, complex environments (Walter et al., 2022).
Alternatively, language understanding in unknown environments can be formulated as a machine translation problem. We developed a neural multi-view sequence-to-sequence learning model that “translates” free-form language into action sequences (analogous to a robot “language”) based upon images of the observable world (Mei et al., 2016). The encoder-aligner-decoder architecture takes as input the natural language instruction as a sequence of words along with a stream of images from a robot-mounted camera, and outputs a distribution over the robot’s action sequence. More specifically, the encoder, which takes the form of a recurrent neural network, automatically learns a suitable representation (an embedding) of the language, while the decoder converts this representation into a probabilistic action model based upon the history of camera images. The intermediate aligner allows the model to focus on the sentence “regions” (groups of words) that are most salient to the current image and action. As with the grounding graph-based methods, language understanding then follows as inference over this learned distribution. A distinct advantage of this approach is that the architecture uses no specialized linguistic resources (e.g., semantic parsers, seed lexicons, or re-rankers) and can be trained in a weakly supervised, end-to-end fashion, which allows for efficient training and generalization to new domains.
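The sketch below outlines the information flow of such an encoder-aligner-decoder in PyTorch: words are embedded and encoded with a recurrent network, the aligner attends over the words conditioned on the decoder state and the current image features, and the decoder emits a distribution over actions at each step. The dimensions, image featurizer, and training loop are placeholders; this is a schematic of the architecture described above, not the published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderAlignerDecoder(nn.Module):
    """Schematic encoder-aligner-decoder; sizes and names are illustrative."""

    def __init__(self, vocab_size, num_actions, img_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.align = nn.Linear(2 * hidden + img_dim, 1)      # aligner (attention) scores
        self.decoder_cell = nn.LSTMCell(hidden + img_dim, hidden)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, word_ids, image_feats):
        # word_ids: (batch, num_words); image_feats: (batch, steps, img_dim)
        enc_out, _ = self.encoder(self.embed(word_ids))       # (batch, words, hidden)
        batch, steps, _ = image_feats.shape
        h = enc_out.new_zeros(batch, enc_out.size(-1))
        c = torch.zeros_like(h)
        action_logits = []
        for t in range(steps):
            img = image_feats[:, t]
            # Aligner: score each word against the current state and image.
            query = torch.cat([h, img], dim=-1).unsqueeze(1).expand(-1, enc_out.size(1), -1)
            scores = self.align(torch.cat([enc_out, query], dim=-1)).squeeze(-1)
            attn = F.softmax(scores, dim=-1)                  # (batch, words)
            context = (attn.unsqueeze(-1) * enc_out).sum(dim=1)
            # Decoder: update the state and predict an action distribution.
            h, c = self.decoder_cell(torch.cat([context, img], dim=-1), (h, c))
            action_logits.append(self.action_head(h))
        return torch.stack(action_logits, dim=1)              # (batch, steps, num_actions)

model = EncoderAlignerDecoder(vocab_size=1000, num_actions=4)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randn(2, 5, 64))
print(logits.shape)  # torch.Size([2, 5, 4])
```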
Our neural sequence-to-sequence framework established state-of-the-art performance on a dataset that serves as a benchmark evaluation in the community, outperforming methods that rely on specialized linguistic resources. As a testament to the method’s generalizability, we and others have demonstrated that this architecture can be adapted to a variety of language understanding and synthesis tasks.
Notes
[1] A simplified version of a statement by Albert Einstein.