The TTIC Young Researcher Seminar Series features talks by Ph.D. students and postdocs whose research is of broad interest to the computer science community. The series provides an opportunity for early-career researchers to present recent work to and meet with students and faculty at TTIC and nearby universities.
To receive announcements regarding the seminar series, please subscribe to the mailing list.
For additional information, please contact Shiry Ginosar(shiry@ttic.edu)
Title: Towards Intelligent Robots Beyond Human Capability
Date: October 8, 2024 at 11 AM
Speaker: Ruoshi Liu, Columbia University
Host: Anand Bhattad (bhattad@ttic.edu)
Abstract: Humans are much smarter than robots today, with the ability to adapt quickly to a new environment and learn a new skill from a few examples, among other amazing things. However, human intelligence also comes with many limitations. Human perception is limited by the physical constraints of our sensors. Human’s prior knowledge is limited by the amount of data we are exposed to in our lifetime. Humans are limited by our physical capabilities, such as muscle strength and reaction time. In this talk, we will explore key directions where intelligent systems can achieve superhuman performance and reflect on the lessons learned from these pursuits.
Video Link: via Panopto
Title: Comparing deep learning-based autonomous driving algorithms with the brain activity of human drivers
Date: October 23, 2024 at 11 AM
Speaker: Kaylene Stocking, UC Berkeley
Host: Matt Walter (mwalter@ttic.edu)
Abstract: Understanding how cognition and learned representations give rise to intelligent behavior is a fundamental goal in both machine learning and neuroscience. However, in both domains, the most well-understood behaviors are passive and open-loop, such as image recognition or speech processing. In this work, we compare human brain activity measured via functional magnetic resonance imaging with deep neural network (DNN) activations for an active taxi-driving task in a naturalistic simulated environment. To do so, we used DNN activations to build voxelwise encoding models for brain activity. Results show that encoding models for DNN activations explain significant amounts of variance in brain activity across many regions of the brain. Furthermore, each functional module in the DNN explains brain activity in a distinct network of functional regions in the brain. The functions of each DNN module correspond well to the known functional properties of its corresponding brain regions, suggesting that both the DNN and the human brain may partition the task in a similar manner. These results represent a first step towards understanding how humans and current deep learning methods agree or differ in active closed-loop tasks such as driving.
Video Link: via Panopto
Title: Reverse Engineering CLIP
Date: November 6, 2024 at 11 AM
Speaker: Yossi Gandelsman, UC Berkeley
Host: Shiry Ginosar (shiry@ttic.edu)
Abstract: In this talk, I reverse engineer the computation inside CLIP, one of the most commonly used computer vision backbones. I analyze how individual model components affect the final CLIP representation. I show that the image representation can be decomposed as a sum across individual image patches, model layers, neurons, and attention heads, and use CLIP’s text representation to interpret the summands.When interpreting the attention heads, each head role can be characterized by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches uncovers an emergent spatial localization within CLIP. Finally, the automatic description of the contributions of individual neurons shows polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars).
The gained understanding of different components allows three main applications: First, the discovered head roles enable the removal of spurious features from CLIP. Second, emergent localization is used for a strong zero-shot image segmenter. Finally, the extracted neuron polysemy allows the mass production of “semantic” adversarial examples by generating images with concepts spuriously correlated to the incorrect class. The results indicate that a scalable understanding of transformer models is attainable and can be used to detect model bugs, repair them, and improve them.Video Link: via Panopto
Title: AI Safety Through Interpretable and Controllable Language Models
Date: November 20, 2024 at 11 AM
Speaker: Peter Hase, University of North Carolina at Chapel Hill
Host: Karen Livescu (klivescu@ttic.edu)
Abstract: In a 2022 survey, 37% of NLP experts agreed that “AI decisions could cause nuclear-level catastrophe” in this century. This survey was conducted prior to the release of ChatGPT. The research community’s now-common concern about catastrophic risks from AI highlights that long-standing problems in AI safety are as important as ever. In this talk, I will describe research on two core problems at the intersection of NLP and AI safety: (1) interpretability and (2) controllability. We need interpretability methods to verify that models use acceptable and generalizable reasoning to solve tasks. Controllability refers to our ability to steer individual behaviors in models on demand, which is helpful since pretrained models will need continual adjustment of specific knowledge and beliefs about the world. This talk will cover recent work on (1) open problems in interpretability, including mechanistic interpretability and chain-of-thought faithfulness, (2) fundamental problems with model editing, viewed through the lens of belief revision, and (3) scalable oversight, with a focus on weak-to-strong generalization. Together, these lines of research aim to develop rigorous technical foundations for ensuring the safety of increasingly capable AI systems.
Video Link: via Panopto
Date: December 4, 2024
Speaker: Meghal Gupta, Berkeley
Host: Avrim Blum (avrim@ttic.edu) and Siddharth Bhandari (siddharth@ttic.edu)
Abstract: Estimating the median of a stream is one of the most basic problems in data sketching. Precisely, given a stream x1, x2, x3, …, xn of elements from some universe of size U and an accuracy ε, the goal is to give a space/time-efficient algorithm that outputs an element with rank between (½-ε)n and (½+ε)n. Estimating arbitrary quantiles has the same space/time complexity as estimating the median.
The simplest approach to quantile (or median) estimation stores every element individually, requiring nlogU memory. However, numerous algorithms have been developed that achieve significantly better bounds. Despite these results, the optimal algorithm was not known prior to this work. In this talk, we will describe an optimal deterministic quantile sketch that uses O(ε-1·(logn+logU)) bits of memory. This improves upon the previous best deterministic algorithm (Greenwald and Khanna 2003) and even upon the best randomized algorithm (Karnin, Lang, and Libery 2016).
This is joint work with Mihir Singhal and Hongxun Wu.
Title: Feature Learning in Infinite-Depth Neural Networks
Date: December 6, 2023
Speaker: Dingli Yu, Princeton University
Host: Zhiyuan Li (zhiyuanli@ttic.edu)
Abstract: By classifying infinite-width neural networks and identifying the optimal limit, Tensor Programs IV and V demonstrated a universal way, called μP, for widthwise hyperparameter transfer, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for depthwise parametrizations of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinite-width-then-depth limits. In recent where each block has only one layer, we identify a unique optimal parametrization, called Depth-μP that extends μP and show empirically it admits depthwise hyperparameter transfer. We identify feature diversity as a crucial factor in deep networks, and Depth-μP can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (such as modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as Megatron transformer trained on Common Crawl.
Video Link: via Panopto
Title: Statistical physics insights into the dynamics of learning algorithms
Date: May 24, 2023
Speaker: Francesca Mignacco, Princeton University
Host: Nathan Srebro (nati@ttic.edu)
Abstract: In this talk, I will consider prototypical learning problems that are amenable to an exact characterization. I will show how dynamical mean-field theory from statistical physics can be used to derive an effective low-dimensional description of the network performance and the learning dynamics of multi-pass SGD. I will discuss how different sources of algorithmic noise affect the performance.
Video Link: via Panopto
Title: Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles
Date: October 25, 2023
Speaker: Noah Golowich, MIT
Host: Ali Vakilian (vakilian@ttic.edu)
Abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map ϕ(x,a) that maps state-action pairs to d-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the “kitchen sink” approach and hope that the true features are included in a much larger set of potential features. In this talk I will discuss our recent work revisiting linear MDPs from the perspective of feature selection. In a k-sparse linear MDP, there is an unknown subset S⊂[d] of size k containing all the relevant features, and the goal is to learn a near-optimal policy in only poly(k,logd) interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible. Based on joint work with Ankur Moitra and Dhruv Rohatgi. Paper link: https://arxiv.org/abs/2309.09457.
Title: TBD
Date: January 25, 2023
Speaker: Enric Boix, MIT
Host: Nathan Srebro (nati@ttic.edu)
Title: A Journey on Interpretability Methods in NLP: Inside-Out and Back
Date: January 11, 2023
Speaker: Lena Voita, University of Edinburgh
Host: Karen Livescu (klivescu@ttic.edu)
Abstract: In this talk, I will illustrate several approaches, motivations and ideas behind analysis of NLP models. As an example, we will take the standard Transformer model trained for the Machine Translation task and will look at it from different points of view. We will start our journey from examining model components and seeing how, and whether, model’s inductive biases facilitate performing the task. Next, we turn to the kinds of information the model relies on to make its predictions. Here we will use attribution methods and will see how some potentially pathological behaviours look on the inside. Finally, we will turn our view inside-out, i.e. from inner workings of the model to analysing its outputs. In this part, we will discover that the way the model learns to translate is very similar to how humans do it: from learning word-by-word translation first to becoming more fluent later. Most importantly, the two views (from the inside and from the outside) show the same process, and we will see how this process is reflected in these two types of analysis.
Title: Diffusion-LM Improves Controllable Text Generation
Date: November 16, 2022
Speaker: Lisa Li, Stanford
Host: David McAllester (mcallester@ttic.edu)
Abstract: Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work.
Video Link: via Panopto
Date: November 7, 2022 at 3:00pm
Speaker: June Vuong, Stanford
Host: Madhur Tulsiani (madhurt@ttic.edu)
Abstract: We design fast algorithms for repeatedly sampling from strongly Rayleigh distributions, which include as special cases random spanning tree distributions and determinantal point processes. For a graph G=(V, E), we show how to approximately sample uniformly random spanning trees from G in $O(| V | log ^2 |V|) time per sample after an initial $O(|E| polylog |E|)$ time preprocessing. This is the first nearly-linear runtime in the output size, which is clearly optimal. For a determinantal point process on k-sized subsets of a ground set of n elements, defined via an nxn kernel matrix, we show how to approximately sample in $\tilde{O}(k^\omega)$ time after an initial $\tilde{O}(nk^{\omega-1})$ time preprocessing, where $\omega<2.372864$ is the matrix multiplication exponent. The time to compute just the weight of the output set is roughly $k^\omega$, a natural barrier that suggests our runtime might be optimal for determinantal point processes as well. As a corollary, we even improve the state of the art for obtaining a single sample from a determinantal point process, from the prior runtime of $\tilde{O}(\min\{nk^2, n^\omega\})$ to $\tilde{O}(nk^{\omega-1})$.
In our main technical result, we achieve the optimal limit on domain sparsification for strongly Rayleigh distributions. In domain sparsification, sampling from a distribution $\mu$ on $\binom{[n]}{k}$ is reduced to sampling from related distributions on $\binom{[t]}{k}$ for $t \leq n$. We show that for strongly Rayleigh distributions, the domain size can be reduced to nearly linear in the output size $t=\tilde{O}(k)$, improving the state of the art from $t= \tilde{O}(k^2)$ for general strongly Rayleigh distributions and the more specialized $t=\tilde{O}(k^{1.5})$ for spanning tree distributions. Our reduction involves sampling from $\tilde{O}(1)$ domain-sparsified distributions, all of which can be produced efficiently assuming approximate overestimates for marginals of $\mu$ are known and stored in a convenient data structure. Having access to marginals is the discrete analog of having access to the mean and covariance of a continuous distribution, or equivalently knowing ``isotropy” for the distribution, the key behind optimal samplers in the continuous setting based on the famous Kannan-Lov\‘asz-Simonovits (KLS) conjecture. We view our result as analogous in spirit to the KLS conjecture and its consequences for sampling, but rather for discrete strongly Rayleigh measures.
Joint work with Nima Anari and Yang P. Liu. To appear in FOCS 2022.Video Link: via Panopto
Title: Efficient Convex Optimization Requires Superlinear Memory
Date: May 18, 2022
Speaker: Annie Marsden, Stanford University
Host: Nathan Srebro (nati@ttic.edu)
Abstract: How much memory is necessary to efficiently optimize convex functions? In this talk I will present a result which establishes that superlinear memory is required to be competitive with the best known quadratic memory algorithms. Specifically, any first-order algorithm must use either O(d^{1.25 – delta}) memory or make at least Omega(d^{1 + (4/3) delta}) queries. This resolves (a part of) a COLT 2019 open problem of Woodworth and Srebro.
Title: Hypercontractivity and Small-Set Expansion on High Dimensional Expanders
Date: April 27, 2022
Speaker: Max Hopkins, UCSD
Host: Madhur Tulsiani (madhurt@ttic.edu)
Abstract: Hypercontractivity is one of the most powerful tools in Boolean function analysis. Traditionally studied on the Boolean cube, recent years have seen a number of exciting applications of hypercontractivity on extended domains, most famously including the resolution of Khot’s 2-2 games conjecture. Unfortunately, beyond a few known examples our general understanding of hypercontractivity actually remains remarkably poor, severely limiting further avenues of application.
In this talk, we discuss the first steps towards a unified theory of hypercontractivity based on high dimensional expanders (HDX), a broad class of hypergraphs that have recently seen a series of breakthrough applications in coding theory and approximate sampling. Throughout the talk, we’ll pay special attention to the motivating application of characterizing small-set expansion in graphs, and briefly discuss how the line of work could lead to new insights towards resolving the unique games conjecture.
Based on joint work with Mitali Bafna, Tali Kaufman, and Shachar Lovett to appear at STOC 2022.
Title: Optimization Landscapes in Deep Learning
Date: November 6, 2019
Speaker: Kenji Kawaguchi, Massachusetts Institute of Technology
Host: Nathan Srebro (nati@ttic.edu)
Abstract: Deep learning has provided high-impact data-driven methods in various applications. However, theoretical guarantees in deep learning tend to provide too pessimistic insights with a gap from practical observations, often because of hidden special properties. Identifying such special properties can provide novel theoretical insights, and is potentially helpful for understanding and designing practical methods. In this talk, I will discuss special properties on nonconvex optimization landscapes of deep neural networks, as well as their implications on gradient descent methods and a few results on real-world applications.
Title: Exploring Surrogate Losses for Learning Neural Networks
Date: December 4, 2019
Speaker: Surbhi Goel, University of Texas
Host: Nathan Srebro (nati@ttic.edu)
Abstract: Developing provably efficient algorithms for learning commonly used neural network architectures continues to be a core challenge in machine learning. The underlying difficulty arises from the highly non-convex nature of the optimization problems posed by neural networks. In this talk, I will discuss the power of convex surrogate losses for tackling this underlying non-convexity. I will focus on the setting of ReLU regression and show how the convex surrogate allows us to get approximate guarantees in the challenging agnostic model. I will further show how these techniques give positive results for simple convolutional and fully connected architectures.
Title: Making contextual bandits as easy as regression
Date: February 19, 2020
Speaker: Dylan Foster, Massachusetts Institute of Technology
Abstract: A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than classical supervised learning tasks such as classification and regression. Algorithms based on regression have shown promising empirical success, but theoretical guarantees have remained elusive except in special cases. We provide the first universal and optimal reduction from contextual bandits to online regression. We show how to transform any oracle for online regression with a given value function class into an algorithm for contextual bandits with the induced policy class, with no overhead in runtime or memory requirements. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the oracle obtains the optimal rate for regression. Compared to previous results, our algorithm requires no distributional assumptions beyond a well-specified model, and works even when contexts are chosen adversarially.
Title: Learning From Sub-Optimal Data
Date: May 1, 2019
Speaker: Bradly Statie, University of Toronto
Host: Gregory Shakhnarovich (gregory@ttic.edu)
Abstract: Learning algorithms typically assume their input data is good natured. If one takes this input data and trains an agent with it, then the agent should, given enough time and compute, eventually learn how to solve the intended task. But this is not always a realistic expectation. Sometimes, the data given to an agent is flawed or fails to fully convey the correct problem. In other words, the input data is sub-optimal. In this talk, we will discuss two recent advances for overcoming sub-optimal data.
First, we consider the problem of imitation learning from sub-optimal demonstrations. In this setting, a robot receives failed or flawed demonstrations of a task. It must learn to infer, and subsequently complete, the intended task from only these failed demonstrations. Results are presented on a variety of robotics problems such as door opening and pick and place.
Second, we consider the problem of learning from sub-optimal reward functions. Often, the reward functions provided to reinforcement learning agents are derived by combining low level primitives such as agent position and velocity. For example, the reward for a robot learning to walk might be its forward velocity plus the position of its head. These reward functions are first and foremost intended for human consumption, not the consumption of an RL algorithm. Consequently, it might be possible to learn a better intrinsic reward function that it is easier for the RL algorithm to optimize against. We provide a new algorithm for learning such intrinsic reward functions. Optimizing against these learned intrinsic rewards leads to better overall agent performance than optimizing against the raw hand-designed reward function. Crucially, these reward functions can be learned on the fly without significant extra computational costs. Results are presented on a variety of MuJoCo tasks and some hard robotics problems such as block stacking.
Title: Non-Interactive Agreement and Dimension Reduction for Polynomials
Date: May 8, 2019
Speaker: Pritish Kamath, Massachusetts Institute of Technology
Host: Nathan Srebro (nati@ttic.edu)
Abstract: The ability to harness and work with randomness has been at the heart of information theory and theoretical computer science. Of particular interest to this talk is the use of randomness by a set of parties that are spread out geographically. For example, randomness can enable such parties to generate and share secrets.
The “Non-interactive Agreement Distillation” problem, specified by a joint distribution P(x,y) and a target alphabet size k, is defined as follows: Two players observe sequences (X_1, … , X_n) and (Y_1, … , Y_n) respectively where {(X_i, Y_i)} are drawn i.i.d. from P(x,y). Both players look at their share of randomness, and output an element of [k]. Their goal is to maximize the probability that their outputs are the same, while ensuring that their outputs are marginally uniform.
Given P(x,y) and k, what is the largest correlation (agreement probability) that the players can achieve? It turns out that this value is not well understood, even in some well motivated special cases. A priori, this value is not even “computable”, as there is no immediate upper bound on the number of samples the players need to draw in order to achieve the best possible correlation!
This talk will describe a recent line of work that obtains an explicit bound on the number of samples needed to get epsilon-close to the maximum achievable correlation. At the heart of our result is a new technique that we call “Dimension Reduction for Polynomials”. It can be viewed as a generalization of the well-known Johnson-Lindenstrauss lemma, which in this context, corresponds to the special case of degree-1 polynomials. We believe this technique could be of broader interest.
This talk will discuss the motivational aspects of the problem, its moral and technical connections to other problems in theoretical computer science and will mainly focus on the technique of “dimension reduction for polynomials”.
Based on joint works with Badih Ghazi, Prasad Raghavendra and Madhu Sudan. [arXiv:1607.04322, arXiv:1708.03808]
Title: TBD
Date: May 13, 2019
Speaker: Richard Zhang, Adobe
Host: Gregory Shakhnarovich (gregory@ttic.edu)
Abstract: In recent years, deep convolutional networks have proven to be extremely adept at discriminative labeling tasks. Not only do networks solve the direct task, they also learn an effective, general representation of the visual world. We explore the use of deep networks for image generation, or synthesis. Generation is challenging, as it is difficult to characterize the perceptual quality of an image, and often times there is more than one “correct” answer. However, we show that networks can indeed perform the graphics task of image generation, and in doing so, learn a representation of the visual world, even without the need for hand-curated labels.
We propose BicycleGAN, a general system for image-to-image translation problems, with the specific aim of capturing the multimodal nature of the output space. We further study image colorization and develop automatic and user-guided approaches. Moreover, colorization, as well as general cross-channel prediction, is a simple but powerful pretext task for self-supervised representation learning. We demonstrate strong transfer to high-level semantic tasks, such as image classification, and to low-level human perceptual similarity judgments. For the latter, we collect a large-scale dataset of human judgments and find that our method outperforms traditional metrics such as PSNR and SSIM. We also discover that many unsupervised and self-supervised representations transfer strongly, even comparable to fully-supervised methods. Despite their strong transfer performance, deep convolutional representations surprisingly lack a basic low-level property – shift-invariance. We propose to incorporate a classic but overlooked signal processing technique, low-pass filtering, into modern deep network architectures.
Title: Insights on Deep Representations with Applications to Healthcare
Date: October 17, 2018
Speaker: Maithra Raghu, Cornell
Host: Nathan Srebro (nati@ttic.edu)
Abstract: The past few years have seen enormous successes in applying deep neural networks in a variety of domains, from speech recognition to healthcare. However, as these models become more complex and are simultaneously used for more sensitive applications, it becomes increasingly important to better understand and intuit the properties of deep representations. In this talk I overview our development of Canonical Correlation Analysis (SVCCA) as a tool to study and compare latent representations learned by deep neural networks. The results provide insights on learning dynamics, help identify where different concepts are learned, and enable measuring the similarity of generalizing and memorizing networks. I demonstrate the importance of representational considerations in model design via a task on predicting doctor disagreements in a healthcare setting.
Title: Learning Sight from Sound
Date: March 28, 2018
Speaker: Andrew Owens
Host: Gregory Shakhnarovich (gregory@ttic.edu)
Abstract: From the clink of a mug placed onto a saucer to the bustle of a busy cafe, our days are filled with visual experiences that are accompanied by distinctive sounds. In this talk, we show that these sounds can provide a rich training signal for learning visual models. First, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We demonstrate this idea by training an algorithm to produce plausible soundtracks for videos in which people hit and scratch objects with a drumstick. Second, we show that ambient audio – e.g., crashing waves, people speaking in a crowd – can also be used to learn visual models. We train a convolutional neural network to predict a statistical summary of the sounds that occur within a scene, and we demonstrate that the learned visual representation conveys information about objects and scenes. Finally, we present an unsupervised learning method for training multi-modal networks that fuse audio and visual data, and apply the learned representation to a number of audio-visual learning tasks.
Title: Generalization and Self-Supervision in Deep Robotic Learning
Date: February 28, 2018
Speaker: Chelsea Finn
Host: Matthew Walter (mwalter@ttic.edu)
Abstract: Machine learning algorithms excel primarily in settings where an engineer can first reduce the problem to a particular function (e.g. an image classifier), and then collect a substantial amount of labeled input-output pairs for that function. In drastic contrast, humans are capable of learning from streams of raw sensory data with minimal external instruction. In this talk, I will argue that, in order to build intelligent systems that are as capable as humans, machine learning models should not be trained in the context of one particular application. Instead, we should be designing systems that can be versatile, can learn in unstructured settings without detailed human-provided labels, and can accomplish many tasks, all while processing high-dimensional sensory inputs. To do so, these systems must be able to actively explore and experiment, collecting data themselves rather than relying on detailed human labels.
My talk will focus on two key aspects of this goal: versatility and self-supervision. I will first show how we can move away from hand-designed, task-specific representations of a robot’s environment by enabling the robot to learn high-capacity models, such as deep networks, for representing complex skills from raw pixels. I will also present an algorithm that learns deep models that can be rapidly adapted to different objects, new visual concepts, or varying environments, leading to versatile behaviors. Beyond versatility, a hallmark of human intelligence is self-supervised learning. I will discuss how we can allow a robot to learn by playing with objects in the environment without any human supervision. From this experience, the robot can acquire a visual predictive model of the physical world that can be used for maneuvering many different objects to varying goals. In all settings, our experiments on simulated and real robot platforms demonstrate the ability to scale to complex, vision-based skills with novel objects.
Title: Look, Listen, and Speak: Vision Systems that Communicate with Natural Language
Date: February 14, 2018
Speaker: Lisa Anne Hendricks
Abstract: Powered by deep convolutional networks and large scale visual datasets, modern visual systems are capable of accurately recognizing thousands of visual categories. However, images and videos contain so much more than categorical labels: they contain information about where objects are located (in a forest or in a kitchen?), what attributes an object has (red or blue?), and how objects interact with other objects in a scene (is the child sitting on a sofa, or running in a field?). Natural language provides an efficient and intuitive way for vision systems to convey important information about a visual scene. In this talk, I will begin by discussing vision systems that “look and speak”. In particular, I will consider how to create more scalable image captioning systems by integrating external data sources. I will then discuss vision systems that “listen and look”. I will introduce the task of moment localization in videos with natural language and detail my effort to collect a large scale dataset for this task. Finally, natural language provides a way for vision systems to not only discuss what is in a scene, but how objects and attributes in an image support a decision, such as a classification decision. I will discuss vision systems that go beyond naming objects in a scene and can generate visual explanations which justify neural network decisions.
Title: Learning Single-view 3D Reconstruction of Objects and Scenes
Date: January 24, 2018
Speaker: Shubham Tulsiani
Host: Gregory Shakhnarovich (gregory@ttic.edu)
Abstract: In this talk, I will discuss the task of inferring 3D structure underlying an image, in particular focusing on two questions - a) how we can plausibly obtain supervisory signal for this task, and b) what forms of representation should we pursue. I will first show that we can leverage image-based supervision to learn single-view 3D prediction, by using geometry as a bridge between the learning systems and the available indirect supervision. We will see that this approach enables learning 3D structure across diverse setups e.g. predicting deformable models or volumetric 3D for objects, or inferring layered-depth images for scenes. I will then advocate the case for inferring interpretable and compositional 3D representations. I will present a method that discovers the coherent compositional structure across objects in a unsupervised manner by attempting to assemble shapes using volumetric primitives, and then demonstrate the advantages of predicting similar factored 3D representations for complex scenes.
Title: A Dynamical View of Optimization Algorithms
Date: October 25, 2017
Speaker: Ashia Wilson
Host: Nathan Srebro (nati@ttic.edu)
Abstract: Optimization is a core primitive in statistics, machine learning, and data analytics. Within these fields, the rapid growth in the scale and complexity of modern datasets has led to a focus on two classes of algorithms: gradient methods and momentum methods (also referred to as accelerated methods). Momentum methods, first proposed by Nesterov in 1983, achieve faster convergence rates than gradient methods. However, unlike gradient methods, they are not descent methods and providing robust performance guarantees remains a challenge. In the Euclidean setting, momentum methods can be understood as modeling the dynamics of a damped harmonic oscillator; making this intuition precise however, and generalizing it to other geometries has been difficult. Furthermore, derivations of momentum methods do not ow from a single underlying principle, but tend to rely on case-specific algebra using a technique - considered by many to be esoteric - called estimate sequences
The first part of our work introduces a variational, continuous-time framework for understanding momentum methods. We show that there is a family of Lagrangian functionals, that we call Bregman Lagrangians, which generate dynamics corresponding to momentum methods in continuous time. In particular, momentum methods can be understood as arising from various discretization techniques applied to these continuous time dynamics. The second part of our work strengthens this connection. We demonstrate how to derive families of Lyapunov functions which can certify rates of convergence for the continuous time momentum dynamics. We further demonstrate how the proofs of convergence of momentum methods can be understood as bounding discretization errors of the Lyapunov function when moving to discrete time. Along the way, we prove an equivalence between these family of Lyapunov functions and the technique of estimate sequences. The following is joint work with Andre Wibisono, Stephen Tu, Shivaram Venkataraman, Alex Gittens,Benjamin Recht, and Michael I. Jordan.
Title: Towards Hierarchical Multiscale Recurrent Neural Networks and their Applications
Date: June 7, 2017
Speaker: Junyoung Chung
Abstract: The recent resurgence of recurrent neural networks has led to remarkable advances in various applications including machine translation, speech recognition, speech synthesis and caption generation. However, learning both hierarchical and temporal representation has been among the longstanding challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered as a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of models can actually capture the temporal dependencies by discovering the latent hierarchical structure of the sequence.
In this talk, I will talk about my previous works on multiscale recurrent neural networks. I will show how a deep recurrent neural network with extra gating units can update its layers with different timescales and discover the underlying hierarchical structure of the sequences. The hierarchical multiscale recurrent neural networks have a potential to remove inherent problems of standard recurrent neural networks. In addition, the learned hierarchical structures can be useful information to many other downstream tasks such as extracting story segments for video understanding, adaptively compressing sequences for speech recognition and extracting sub-task structures in hierarchical reinforcement learning.
Title: Sample-Optimal Inference, the Method of Moments, and Community Detection
Date: May 17, 2017
Speaker: Samuel Hopkins
Host: Madhur Tulsiani (madhurt@ttic.edu)
Abstract: We propose a simple and efficient meta-algorithm for Bayesian estimation problems (i.e. hidden variable, latent variable, or planted problems). Our algorithm uses low-degree polynomials together with new and highly robust tensor decomposition methods. We focus on the question: for a given estimation problem, precisely how many samples (up to low-order additive terms) do polynomial-time algorithms require to obtain good estimates of hidden variables? Our meta-algorithm is broadly applicable, and achieves statistical or conjectured computational sample-complexity thresholds for many well-studied problems, including many for which previous algorithms were highly problem-specific.
As a running example we employ the stochastic block model – a widely studied family of random graph models which contain latent community structure. We recover and unify the proofs of the best-known sample complexity bounds for the partial recovery problem in this model. We also give the first provable guarantees for partial recovery of community structure in constant-degree graphs where nodes may participate in many communities simultaneously. This model is known to exhibit a sharp sample complexity threshold – with fewer than a very specific number of samples, recovering community structure becomes impossible. While previous explanations for this phenomenon appeal to sophisticated ideas from statistical mechanics, we give a new and simple explanation based on properties of low-degree polynomials.
Title: Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks
Date: November 30, 2016
Speaker: Emily Denton
Host: Greg Shakhnarovich (gregory@ttic.edu)
Abstract: The main focus of this talk will be on a new simple semi-supervised learning approach for images based on in-painting using an adversarial loss. Images with random patches removed are presented to a generator whose task is to fill in the hole, based on the surrounding pixels. The in-painted images are then presented to a discriminator network that judges if they are real (unaltered training images) or not. This task acts as a regularizer for standard supervised training of the discriminator. Using our approach we are able to directly train large VGG-style networks in a semi-supervised fashion. We evaluate on STL-10 and PASCAL datasets, where our approach obtains performance comparable or superior to existing methods.
Title: A Task-Oriented Neural Dialogue System
Date: November 9, 2016
Speaker: Tsung-Hsien (Shawn) Wen
Host: Karen Livescu (klivescu@ttic.edu)
Abstract: Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components and typically this involves either a large amount of handcrafting, or acquiring costly labelled datasets to solve a statistical learning problem for each component. In this work we introduce a neural network-based text-in, text-out end-to-end trainable goal-oriented dialogue system along with a new way of collecting dialogue data based on a novel pipe-lined Wizard-of-Oz framework. This approach allows us to develop dialogue systems easily and without making too many assumptions about the task at hand. The results show that the model can converse with human subjects naturally whilst helping them to accomplish tasks in a restaurant search domain.
Title: Conditional Quadratic Hardness for Data Analysis Problems
Date: October 26, 2016
Speaker: Arturs Backurs
Host: Yury Makarychev (yury@ttic.edu)
Abstract: The theory of NP-hardness has been remarkably successful in identifying problems that are unlikely to be solvable in polynomial time. However, many other important problems do have polynomial-time algorithms, but large exponents in their runtime bounds can make them inefficient in practice. For example, quadratic-time algorithms, although practical on moderately sized inputs, can become inefficient on big data problems that involve gigabytes or more of data. Although for many data analysis problems no sub-quadratic time algorithms are known, any evidence of quadratic-time hardness has remained elusive.
In this talk I will give an overview of recent research that aims to remedy this situation. In particular, I will outline conditional hardness results for two problems: computing the edit distance between two strings, and solving the optimization problems defined by Support Vector Machines (with Gaussian kernel) up to high accuracy. Specifically, we show that, if either of these two problems can be solved in time O(n^{2-delta}) for some constant delta>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} 2^{(1-epsilon)N} for a constant epsilon>0. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.
Title: Learning Spatial Priors for Text to 3D Scene Generation
Date: October 19, 2016
Speaker: Angel Chang
Host: Kevin Gimpel (kgimpel@ttic.edu)
Abstract: The ability to form a visual interpretation of the world from natural language is pivotal to human communication. Being able to map descriptions of scenes to 3D geometric representations can be useful in many applications such as robotics and conversational assistants. In this talk, I will present the task of text to 3D scene generation, where a scene description in natural language is automatically converted into a plausible 3D scene interpretation. For example, the sentence “a living room with a red couch and TV” should generate a realistic living room arrangement with the TV in front of the couch and supported by a TV stand. This task lies at the intersection of NLP and computer graphics, and requires techniques from both.
A key challenge in this task is that the space of geometric interpretations is large while natural language text is typically under-specified, omitting shared, common-sense facts about the world. I will describe how we can learn a set of spatial priors from virtual environments, and use them to infer plausible arrangements of objects given a natural language description. I will show that a parallel corpus of virtual 3D scenes and natural language descriptions can be leveraged to extract likely couplings between references and concrete 3D objects (e.g., an “L-shaped red couch”, and the virtual geometric representation of that object). Finally, I will discuss a few exciting directions for future work at the intersection of NLP, graphics, and more broadly AI.
Title: Memory and Communication in Neural Networks
Date: October 5, 2016 at 10:00 am
Speaker: Sainbayar Sukhbaatar
Host: David McAllester (mcallester@ttic.edu)
Abstract: In this talk, I will talk about my two recent works on equipping neural networks with an external memory and communication. In the first part, I will explain about how an external memory can be attached to neural networks. The memory can store variable number of items, which then can be can accessed by a soft-attention mechanism. This whole model can be trained end-to-end with simple back-propagation. Since the model can choose which part of the input to read, it makes it suitable for tasks with out-of-order structure, such as question answering after reading a short story. In the second part, I will talk about how multiple neural networks can learn to communicate with each other to solve a common task. Instead of using discrete symbols for communication, we allow networks to exchange continuous vectors so it can be trained with back-propagation. I will demonstrate the model on multi-agent reinforcement learning tasks.
Title: Understanding scenes over time with unlabeled video
Date: September 21, 2016
Speaker: Carl Vondrick
Host: Gregory Shakhnarovich (gregory@ttic.edu)
Abstract: Learning models with a rich understanding of time is a key problem in machine perception. However, the need for large, labeled video datasets is a major obstacle because video is often expensive and ambiguous to annotate. In this talk, we capitalize on years of unlabeled, in-the-wild videos to train models for two temporal tasks, visual anticipation and sound recognition. First, we present a deep convolutional network for multi-modal regression, allowing us to model uncertainty and more accurately anticipate human actions. To support pixel-level predictions, we introduce a layered generative video model, facilitating dense extrapolation of videos. Second, we utilize videos’ natural synchronization between vision and sound to learn deep sound representations, which our experiments suggest learn some high-level semantics. We believe unlabeled video is a valuable resource for perception, and can impact many applications in robotics, recognition, and forecasting.
Title: Modular neural architectures for grounded language learning
Date: September 28, 2016
Speaker: Jacob Andreas
Host: Kevin Gimpel (kgimpel@ttic.edu)
Abstract: Language understanding depends on two abilities: an ability to translate between natural language utterances and abstract representations of meaning, and an ability to relate such meaning representations to the perceived world. In the natural language processing literature, these tasks are respectively known as “semantic parsing” and “grounding”, and have been treated as essentially independent problems. In this talk, I will present two modular neural architectures for jointly learning to ground language in the world and reason about it compositionally.
I will first describe a technique that uses syntactic information to dynamically construct neural networks from composable primitives. The resulting structures, called “neural module networks”, can be used to achieve state-of-the-art results on a variety of grounded question answering tasks. Next, I will present a model for contextual referring expression generation, in which contrastive behavior results from a combination of learned semantics and inference-driven pragmatics. This model is again backed by modular neural components—in this case elementary listener and speaker representations. It is able to successfully complete a challenging referring expression generation task, exhibiting pragmatic behavior without ever observing such behavior at training time. I will conclude by outlining possible applications of this framework to control and planning problems.
Title: Scene Understanding from RGB-D Images
Date: May 11, 2016
Speaker: Saurabh Gupta, Berkeley
Host: Greg Shakhnarovich (gregory@ttic.edu)
Abstract: In this talk, I will talk about detailed scene understanding from RGB-D images. We approach this problem by studying central computer vision problems like bottom-up grouping, object detection, instance segmentation, pose estimation in context of RGB-D images, and finally aligning CAD models to objects in the scene. This results in a detailed output which goes beyond what most current computer vision algorithms produce, and is useful for real world applications like perceptual robotics, and augmented reality. A central question in this work is how to learn good features for depth images in view of the fact that labeled RGB-D datasets are much smaller than labeled RGB datasets (such as ImageNet) typically used for feature learning. To this end I will describe our technique called “cross-modal distillation” which allows us to leverage easily available annotations on RGB images to learn representations on depth images. In addition, I will also briefly talk about some work on vision and language that I did on an internship at Microsoft Research.
Title: Neural Dialogue Generation
Date: April 20, 2016
Speaker: Jiwei Li (Stanford University)
Host: Kevin Gimpel (kgimpel@ttic.edu)
Abstract: Recent neural generation models present both new opportunities and new challenges for developing conversational agents. In this talk, I will describe how we have advanced this line of research by addressing four different issues in neural dialogue generation: (1) overcoming the overwhelming prevalence of dull responses (e.g., “I don’t know”) generated from neural models; (2) enforcing speaker consistency; (3) leveraging information from conversational history; and (4) applying reinforcement learning to foster sustained dialogue interactions.
Title: Initialization and Dual Expressivity of Neural Networks
Date: March 9, 2016
Speaker: Roy Frostig (Stanford)
Host: Nati Srebro (nati@ttic.edu)
Abstract: Neural network learning is seeing wide empirical success as an applied machine learning tool, yet we have only a nascent theoretical understanding of its recent advances, and of the design choices made in achieving them. In turn, the tools in use are often without guarantees and without useful formalisms to guide their development.
In this talk, I will present recent work that establishes a duality between neural networks and a certain notion of compositional kernels. The connection clarifies the effective modeling capacity of networks due to their architecture. We show that the data representation induced by networks under a common random initialization scheme is rich enough to express all functions in their dual kernel space. Indeed, in this space, easily learnable functions (/img.e. those of low norm) are expressive according to a succinct graph structure underlying the network architecture. An immediate upshot is that, although the network training objective is hard to optimize in the worst case, the initial weights form a good starting point from a modeling perspective (i.e. in inducing features).
The talk is based on a paper (arXiv:1602.05897) from work joint with Amit Daniely and Yoram Singer.
Title: Using Motion to Understand Objects in the Real World
Date: March 2, 2016
Speaker: David Held (Stanford)
Host: David McAllester (mcallester@ttic.edu)
Abstract: Many robots today are confined to operate in relatively simple, controlled environments. One reason for this is that current methods for processing visual data tend to break down when faced with occlusions, viewpoint changes, poor lighting, and other challenging but common situations that occur when robots are placed in the real world. I will show that we can train robots to handle these variations by modeling the causes behind visual appearance changes. If we model how the world changes over time, we can be robust to the types of changes that objects often undergo. I demonstrate this idea in the context of autonomous driving, and I show how we can use this idea to improve performance on three different tasks: velocity estimation, segmentation, and tracking with neural networks. By modeling the causes of appearance changes over time, we can make our methods more robust to a variety of challenging situations that commonly occur in the real-world, thus enabling robots to come out of the factory and into our lives.
Title: Machine Learning for Observational Studies
Date: February 17, 2016
Speaker: Uri Shalit (NYU)
Host: Nati Srebro (nati@ttic.edu)
Abstract: The proliferation of data collection in the healthcare, educational, and economic spheres, brings with it opportunities for extracting new knowledge with concrete policy implications. Examples include identifying best medical practices from electronic healthcare records, or understanding the implications of different teaching techniques from board of education surveys. Such policy decisions often hinge on understanding causal links - does medication A cause better outcomes than medication B? However, unlike randomized studies where a treatment is assigned to a random subpopulation, learning causal links from these so-called “observational studies” is difficult, because confounding factors might obscure the true causal links underlying the observed data.
In this talk I will discuss observational studies from a machine learning perspective. I will then show how we use machine learning techniques to try and learn causal relationships from these studies. Specifically we will discuss two methods: one based on using integral probability metrics to optimally re-weigh the data, and the other based on learning a balanced representation, using ideas from domain adaptation.
Title: Locality-Sensitive Hashing and Beyond
Date: February 10, 2016
Speaker: Ilya Razenshteyn (MIT CSAIL)
Host: Yury Makarychev (yury@ttic.edu)
Abstract: Locality-Sensitive Hashing (LSH) is a powerful technique for the approximate nearest neighbor search (ANN) in high dimensions.
In this talk I will present two recent results.
1) I will show a data structure for ANN for the Euclidean distance that provably outperforms the best possible LSH-based data structure. We proceed via designing a good data-dependent hash family.
2) I will show a practical and optimal LSH family for the cosine similarity (a.k.a. Euclidean distance on a sphere). It substantially outperforms the celebrated Hyperplane LSH family. Along the way, I will try to debunk two popular myths about LSH:
* LSH-based data structures consume too much memory and are thus impractical;
* Optimal LSH constructions are too complicated to be made practical.
The talk is based on two papers: arXiv:1501.01062 (joint with Alexandr Andoni, STOC 2015) and arXiv:1509.02897 (joint with Alexandr Andoni, Piotr Indyk, Thijs Laarhoven and Ludwig Schmidt, NIPS 2015).
Title: End-to-End Speech Recognition using Deep LSTMs, CTC Training and WFST Decoding
Date: February 3, 2016
Speaker: Yajie Miao (CMU)
Host: Karen Livescu (klivescu@ttic.edu)
Abstract: Deep learning has tremendously improved the performance of automatic speech recognition (ASR). Despite this progress, developing ASR systems remains a challenging task, requiring various resources, multiple training stages and significant expertise. This talk will present Eesen, an end-to-end ASR framework which drastically simplifies the existing speech recognition paradigm. Acoustic modeling in Eesen involves learning a single deep Long Short-Term Memory (LSTM) network predicting context-independent phonemes or characters. We adopt the connectionist temporal classification (CTC) objective function to learn the alignments between speech and label sequences. A nice property of Eesen is a generalized decoding method based on weighted finite-state transducers (WFSTs), which enables the efficient incorporation of lexicons and language models into CTC decoding. With experiments on various datasets and languages, we will see that our end-to-end systems achieve comparable recognition accuracy to the state-of-the-art hybrid approach. In addition, we will present empirical analysis to shed light on how CTC training behaves under different conditions.