The Robot Intelligence through Perception Lab develops intelligent, perceptually aware robots that are able to work effectively with and alongside people in unstructured environments.
Our research focuses on the development of advanced perception algorithms that endow robots with a rich awareness of their surroundings. We develop methods that allow robots to learn models of objects, locations, and people, so that they can act usefully within and interact with their environments. We are particularly interested in algorithms that take as input multi-modal observations of a robot's surroundings (e.g., laser range data, image streams, and a user's natural language speech) and infer properties of the objects, places, people, and events that comprise a robot's environment.
The following provides an outdated summary of some of the projects that we have worked on.
We are developing a neural sequence-to-sequence model that enables robots to follow natural language route instructions in a priori unknown environments. Our alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) translates natural language instructions to action sequences based upon a world state representation. We introduce a multi-level aligner that empowers our model to focus on sentence "regions" salient to the current world state by using multiple abstractions of the input sentence. In contrast to existing methods, our model uses no specialized linguistic resources (e.g., parsers) or task-specific annotations (e.g., seed lexicons). This enables our model to generalize without sacrificing performance.
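As a rough illustration of the alignment idea described above, the following minimal sketch attends over two views of each word in an instruction: the raw word vector and an encoder state. It is not our trained model. The function names, the toy vectors, and the use of plain dot-product scoring are all assumptions made for the example; a learned model would use trained parameters for scoring.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_level_align(state, word_vecs, hidden_vecs):
    """Attend over word-level and encoder-level views of each token.

    Hypothetical helper: scores are plain dot products here, whereas a
    trained model would learn the scoring function.
    """
    # Each annotation concatenates the raw word vector with its encoder state.
    annotations = [w + h for w, h in zip(word_vecs, hidden_vecs)]
    scores = [sum(s * a for s, a in zip(state, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    # The context vector is the attention-weighted sum of the annotations.
    context = [sum(wt * ann[i] for wt, ann in zip(weights, annotations))
               for i in range(dim)]
    return weights, context


# Toy instruction "turn left hallway" with made-up 2-d vectors.
word_vecs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
hidden_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
state = [0.0, 2.0, 1.0, 0.0]  # stands in for the current world/decoder state
weights, context = multi_level_align(state, word_vecs, hidden_vecs)
print(weights.index(max(weights)))  # index of the most salient word
```

Because the annotation keeps both abstractions of each word, the context vector passed to the decoder carries word-level and sentence-level information at the same time.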
We are interested in endowing autonomous agents with the ability to extract salient information from large knowledge bases and to then share this information with people using natural language. For example, a robotic weather forecaster needs to reduce an extensive collection of meteorological data to pertinent records (bottom-right) and then synthesize a forecast (top-right). We are developing an end-to-end, domain-independent neural encoder-aligner-decoder model for this so-called selective generation problem, i.e., the joint task of content selection and surface realization. Our model first encodes the full set of over-determined database event records via a memory-based recurrent neural network (LSTM), then utilizes a novel coarse-to-fine (hierarchical), multi-input aligner to identify the small subset of salient records to talk about, and finally employs a decoder to generate free-form descriptions of the aligned, selected records.
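One way to picture a coarse-to-fine aligner is as two stages: a coarse gate that scores each record once, independently of the word being generated, and a fine attention step that is recomputed at every decoding step. The sketch below multiplies the two and renormalizes. The function name, the logits, and the record labels are hypothetical values chosen for illustration, not outputs of our model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_to_fine(preselect_logits, refine_scores):
    """Combine a per-record gate with per-step attention weights.

    preselect_logits: one logit per record (coarse stage, step-independent).
    refine_scores: one score per record at the current decoding step.
    """
    gate = [sigmoid(x) for x in preselect_logits]
    fine = softmax(refine_scores)
    joint = [g * f for g, f in zip(gate, fine)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical records: temperature, wind chill, cloud cover.
preselect_logits = [4.0, -4.0, 0.0]  # coarse stage nearly rules out record 1
refine_scores = [1.0, 2.0, 0.5]      # fine stage alone would prefer record 1
alignment = coarse_to_fine(preselect_logits, refine_scores)
print(alignment.index(max(alignment)))  # record chosen after gating
```

In this toy example the gate overrides the fine stage's preference, which is the behavior one wants when most records in the database are irrelevant to the description.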
Navigational Instruction Generation as Inverse Reinforcement Learning with Neural Machine Translation, In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2017.
What to talk about and how? Selective Generation using LSTMs with Coarse-to-Fine Alignment, In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2016.
Localization is critical in order for self-driving cars and other robotic vehicles to navigate autonomously. Most vehicles determine their position using GPS receivers, which suffer from limited precision and are prone to multi-path effects (e.g., in the so-called "urban canyons" formed by tall buildings), or by using cameras and other sensors to estimate their location relative to a prior map. We are developing robots that are able to accurately localize themselves by exploiting the availability of satellite images that densely cover the world. Our approach takes a ground image as input, and outputs the location from which it was taken on a georeferenced satellite image. We perform visual localization by estimating the co-occurrence probabilities between the ground and satellite images based on a ground-satellite feature dictionary. The method is able to estimate likelihoods over arbitrary locations without the need for a dense ground image database.
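The co-occurrence idea can be sketched with discrete "visual words". Assume, purely for illustration, that ground and satellite features have already been quantized into dictionary indices and that training pairs of (ground word, satellite word) are available. The helper names, the Laplace smoothing, and the independence assumption across ground words are choices made for this sketch rather than details of our system.

```python
import math
from collections import Counter, defaultdict


def learn_cooccurrence(pairs, n_ground_words, alpha=1.0):
    """Estimate P(ground word | satellite word) with Laplace smoothing."""
    counts = defaultdict(Counter)
    for ground_word, sat_word in pairs:
        counts[sat_word][ground_word] += 1

    def prob(ground_word, sat_word):
        c = counts[sat_word]
        return (c[ground_word] + alpha) / (sum(c.values()) + alpha * n_ground_words)

    return prob


def localize(ground_words, candidate_sat_words, prob):
    """Posterior over candidate locations, assuming a uniform prior."""
    log_like = [sum(math.log(prob(g, s)) for g in ground_words)
                for s in candidate_sat_words]
    peak = max(log_like)
    unnorm = [math.exp(l - peak) for l in log_like]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical training pairs: (ground word, satellite word).
pairs = [(0, 0), (0, 0), (1, 0), (2, 1), (2, 1), (1, 1)]
prob = learn_cooccurrence(pairs, n_ground_words=3)

# Query image quantized to two ground words; two candidate map locations
# whose satellite patches quantize to words 0 and 1.
posterior = localize([0, 0], [0, 1], prob)
print(posterior)
```

Any satellite location can be scored this way, since the likelihood depends only on the satellite patch at that location and the learned table, not on a ground image having been collected there.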
Natural language offers an intuitive and flexible means for humans to communicate with the robots that we will increasingly work alongside in our homes and workplaces. Recent advancements have given rise to robots that are able to interpret natural language manipulation and navigation commands, but these methods require a prior map of the robot's environment. We have developed a novel learning framework that enables robots to successfully follow natural language route directions without any previous knowledge of the environment. The algorithm utilizes spatial and semantic information that the human conveys through the command to learn a distribution over the metric and semantic properties of spatially extended environments. Our method uses this distribution in place of the latent world model and interprets the natural language instruction as a distribution over the intended behavior. A novel belief space planner reasons directly over the map and behavior distributions to solve for a policy using imitation learning. By learning and performing inference over a latent environment model, the algorithm is able to successfully follow natural language route directions within novel, extended environments.
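To make the idea of acting under a distribution over worlds concrete, the sketch below keeps a small set of weighted environment hypotheses and picks the action with the lowest expected cost across them. This is a deliberate simplification: it swaps the imitation-learned policy for a hand-written expected-cost rule, and the one-dimensional hallway, the weights, and the cost function are all hypothetical.

```python
def choose_action(hypotheses, actions, cost):
    """Pick the action with the lowest expected cost over world hypotheses.

    hypotheses: list of (weight, world) pairs approximating the map distribution.
    cost: function (action, world) -> float, supplied by the caller.
    """
    total = sum(weight for weight, _ in hypotheses)

    def expected_cost(action):
        return sum(weight * cost(action, world)
                   for weight, world in hypotheses) / total

    return min(actions, key=expected_cost)


# Toy 1-D hallway: the robot is at position 0 and was told "go to the kitchen".
# Each hypothesis is a guess at the kitchen's position, weighted by how well
# it agrees with the instruction (weights are made up).
hypotheses = [(0.7, 5), (0.3, -2)]
steps = {"forward": 1, "back": -1}


def distance_after(action, kitchen_pos):
    """Distance to the hypothesized kitchen after taking one step."""
    return abs(kitchen_pos - steps[action])


action = choose_action(hypotheses, list(steps), distance_after)
print(action)
```

As the robot observes more of the environment, the hypothesis weights would be updated and the choice recomputed, so early commands can be followed before the map is known.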
Knowledge representations that model the metric, topological, and semantic properties of a robot's environment are integral to effective human-robot collaboration. For example, most techniques for natural language understanding assume knowledge of a semantic map that expresses the spatial-semantic properties of the environment (e.g., the existence, location, and extent of different rooms, their colloquial names, and the objects that they contain). We have developed a framework that enables robots to efficiently learn rich spatial-semantic environment models from natural language descriptions, such as those conveyed during a guided tour. Previous approaches either require that the maps be hard-coded by system designers or are limited to augmenting metric maps with higher-level properties (e.g., place type, object locations) that can be inferred from the robot's sensor data, but do not use this information to improve the metric map. The novelty of our algorithm lies in fusing high-level knowledge that people can uniquely provide through speech with metric information from the robot's low-level sensor streams. Our method jointly estimates a hybrid metric, topological, and semantic representation of the environment. This semantic graph provides a common framework in which we integrate information that the user communicates (e.g., labels and spatial relations) with metric observations from low-level sensors. Our algorithm efficiently maintains a factored distribution over semantic graphs based upon the stream of natural language and low-level sensor information. We find that the incorporation of information from free-form descriptions increases the metric, topological and semantic accuracy of the recovered environment model.
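The fusion of spoken descriptions with sensor-based inference can be illustrated on a single region's label. The sketch below applies Bayes' rule to combine a prior over room types with a likelihood from a user's utterance and a likelihood from a scene classifier, assuming the two sources are conditionally independent given the true label. It covers only the semantic layer of one node; the function name and all probabilities are hypothetical, and the full method also estimates the topology and metric map.

```python
def fuse_label_evidence(prior, language_likelihood, sensor_likelihood):
    """Posterior over a region's label from two independent evidence sources.

    Each argument maps label -> probability (or likelihood). Labels missing
    from a likelihood are treated as uninformative (factor of 1.0).
    """
    unnorm = {
        label: p
        * language_likelihood.get(label, 1.0)
        * sensor_likelihood.get(label, 1.0)
        for label, p in prior.items()
    }
    total = sum(unnorm.values())
    return {label: value / total for label, value in unnorm.items()}


# Hypothetical numbers for one region visited during a guided tour.
prior = {"kitchen": 1 / 3, "office": 1 / 3, "hallway": 1 / 3}
language = {"kitchen": 0.8, "office": 0.1, "hallway": 0.1}  # "this is the kitchen"
classifier = {"kitchen": 0.5, "office": 0.4, "hallway": 0.1}  # scene classifier

posterior = fuse_label_evidence(prior, language, classifier)
print(max(posterior, key=posterior.get))
```

Keeping the label distribution per region, rather than a single hard label, is what allows later utterances or observations to revise an earlier guess.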
Learning Spatially-Semantic Representations from Natural Language Descriptions and Scene Classifications, In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2014.
Learning Semantic Maps Through Dialog for A Voice-Commandable Wheelchair, In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on Rehabilitation and Assistive Robotics, 2014.