The Machine Learning Seminar Series features weekly talks on recent research in machine learning.

The seminar is typically held Fridays 11-12am at TTIC

To receive announcements regarding the talks please subscribe to the mailing list..

Date: December 14, 2018 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Bayesian optimization (BO) aims at efficiently optimizing expensive black-box functions, such as hyperparameter tuning problems in machine learning. Scaling up BO to many variables relies on structural assumptions about the underlying black-box, to alleviate the curse of dimensionality. In this talk, we review several options to this end, with emphasis on the low effective dimensionality hypothesis. Starting with a sampled random embedding, we discuss several practical issues related to selecting a suitable search space. We also present alternative sampling methods for the embedding, as well as techniques to identify the low effective subspace. The performance and robustness gains of the proposed enhancements are illustrated on numerical examples.

Date: November 30, 2018 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Blake Woodworth, TTI Chicago

Abstract: We suggest a general oracle-based framework that captures different parallel stochastic optimization settings described by a dependency graph, and derive generic lower bounds in terms of this graph. We then use the framework and derive lower bounds for several specific parallel optimization settings, including delayed updates and parallel processing with intermittent communication. We highlight gaps between lower and upper bounds on the oracle complexity, and cases where the “natural” algorithms are not known to be optimal.

Date: October 26, 2018 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Neural networks perform extremely well at object recognition tasks, sometimes even meeting or exceeding human performance. However, neural networks have recently been shown to be susceptible to “adversarial attacks,” in which small (or even imperceptible) changes to an image completely alter it’s class label according to a neural net. This talk explores several new angles on the concept of adversarial attacks. First, I explore the idea of “poisoning” attacks, which manipulate a network at train time rather than test time. Then, I address a fundamental question about adversarial attacks: Are adversarial attacks inevitable? I’ll try to answer this question from a both a theoretical and experimental perspective.

Date: October 19, 2018 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: We bring the tools from Blackwell’s seminal result on comparing two stochastic experiments from 1953, to shine a new light on a modern application of great interest: Generative Adversarial Networks (GAN). Binary hypothesis testing is at the center of training GANs, where a trained neural network (called a critic) determines whether a given sample is from the real data or the generated (fake) data. By jointly training the generator and the critic, the hope is that eventually the trained generator will generate realistic samples. One of the major challenges in GAN is known as “mode collapse”; the lack of diversity in the samples generated by thus trained generators. We propose a new training framework, where the critic is fed with multiple samples jointly (which we call packing), as opposed to each sample separately as done in standard GAN training. With this simple but fundamental departure from existing GANs, experimental results show that the diversity of the generated samples improve significantly. We analyze this practical gain by first providing a formal mathematical definition of mode collapse and making a fundamental connection between the idea of packing and the intensity of mode collapse. Precisely, we show that the packed critic naturally penalizes mode collapse, thus encouraging generators with less mode collapse. The analyses critically rely on operational interpretation of hypothesis testing and corresponding data processing inequalities, which lead to sharp analyses with simple proofs. For this talk, I will assume no prior background on GANs

Date: May 18, 2018 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Jonathan Weed, MIT

Abstract: Computing optimal transport distances between distributions is a fundamental problem that is becoming increasingly prominent in statistics, image processing, and machine learning. Key to the use of optimal transport in practice is the existence of fast, stable algorithms for computing these distances. In this work, we exhibit a simple approximation algorithm for this problem that runs in near-linear time. Our work is based on new analysis of the celebrated Sinkhorn algorithm for matrix scaling, as well as new results about the behavior of linear programs under entropic penalization. Joint work with Jason Altschuler and Philippe Rigollet.

Date: April 27, 2018 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Blake Woodworth, TTI Chicago

Abstract: The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional samples. Crucially, we guarantee statistical validity without any assumptions on how the queries are generated. We also ensure with high probability that the cost for M non-adaptive queries is O(\log M), while the cost to a potentially adaptive user who makes M queries that do not depend on any others is O(\sqrt{M}).

Date: November 17, 2017 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Jialei Wang, TTI Chicago

Abstract: The fast growing size of modern data sets and the nature of data acquisition motivate new algorithms and systems that are able to learn from distributed data efficiently. In this talk, I will present two recent work that explore the theoretical foundations of distributed machine learning. First, I will present novel algorithms to learn sparse linear predictors from distributed data. To match centralized optimal predictor, the number of rounds required to communicate only scales logarithmically with the number of machines. Second, I will discuss how to parallelize the stochastic gradient descent (SGD) with a more sophisticated minibatch prox scheme, which improves over minibatch SGD by allowing larger minibatch size without slowing down the convergence.

This is joint work with Mladen Kolar, Nathan Srebro, Weiran Wang and Tong Zhang.

Date: October 13, 2017 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Song Liu, University of Bristol

Abstract: Density ratio estimation has become a versatile tool in machine learning community recently. However, due to its unbounded nature, density ratio estimation is vulnerable to corrupted data points, which often pushes the estimated ratio toward infinity. In this paper, we present a robust estimator which automatically identifies and trims outliers. The proposed estimator has a convex formulation, and the global optimum can be obtained via subgradient descent. We analyze the parameter estimation error of this estimator under high-dimensional settings. Experiments are conducted to verify the effectiveness of the estimator.

Date: October 5, 2017 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: We consider the fundamental problem of learning from expert advice, a.k.a online no-regret learning, where we have access to an offline optimization oracle that can be used to compute, in constant time, the best performing expert at any point in time. We consider the design of no-regret algorithms that are computationally efficient using such an oracle. We present structural properties under which we show oracle-efficient no-regret algorithms exist, even when the set of experts is exponentially large in a natural representation of the problem. Our algorithm is a generalization of the Follow-The-Perturbed-Leader algorithm of Kalai and Vempala that at every step follows the best-performing expert subject to some perturbations. Our design uses a shared source of randomness across all experts that can be efficiently implemented by using an oracle on a random modification of the history of the play at every time step.

Our second main contribution is showing that the structural properties required for our oracle-efficient online algorithm are present in a large class problems. As examples, we discuss applications of our oracle-efficient learning results to the adaptive optimization of large classes of auctions, including (1) VCG auctions with bidder-specific reserves in single-parameter settings, (2) envy-free item pricing in multi-item auctions, and (3) Myerson-like auctions for single-item settings.

This talk is based on joint work with Miro Dudik, Haipeng Luo, Rob Schapire, Vasilis Syrgkanis, and Jenn Wortman Vaughan.

Date: September 29, 2017 11am-noon

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: In this talk we will consider linear structural equation models (SEMs) with non-Gaussian errors. In particular, these models assume that each observed variable is a linear functions of the other variables, plus some error term. The first half of the talk will consider SEMs in which the error terms may be dependent; these models correspond to mixed graphs. We propose empirical likelihood procedures for inference and estimation, and suggest several modifications–including a profile likelihood–in order to improve tractability and performance. We show that when the error distributions are non-Gaussian, the use of EL may increase statistical efficiency and improve assessment of significance. In the second half of the talk, we will consider SEMs which correspond to directed acyclic graphs (DAGs). It has been previously shown for DAGs, when the error terms in a SEM are non-Gaussian, the exact causal structure–not simply a larger equivalence class–can be identified. We show that for suitably sparse graphs, when the error terms follow a log-concave distribution (but are non-Gaussian), the graph can also be identified in the high dimensional setting where the number of variables may exceed the number of observations.

Date: June 2, 2017 11am-noon

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Modern applications require methods that are computationally feasible on large datasets but also preserve statistical efficiency. Frequently, these two concerns are seen as contradictory: approximation methods that enable computation are assumed to degrade statistical performance relative to exact methods. In applied mathematics, where much of the current theoretical work on approximation resides, the inputs are considered to be observed exactly. The prevailing philosophy is that while the exact problem is, regrettably, unsolvable, any approximation should be as small as possible. However, from a statistical perspective, an approximate or regularized solution may be preferable to the exact one. Regularization formalizes a trade-off between fidelity to the data and adherence to prior knowledge about the data-generating process such as smoothness or sparsity. The resulting estimator tends to be more useful, interpretable, and suitable as an input to other methods.

In this work, we propose new methodology for estimation and prediction under a linear model borrowing insights from the approximation literature. We explore these procedures from a statistical perspective and find that in many cases they improve both computational and statistical performance.

Date: May 12, 2017 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: In the probabilistic topic models, the quantity of interest—a low- rank matrix consisting of topic vectors—is hidden in the text corpus matrix, masked by noise, and Singular Value Decomposition (SVD) is a potentially useful tool for learning such a low-rank matrix. However, the connection between this low-rank matrix and the singular vectors of the text corpus matrix are usually complicated and hard to spell out, so how to use SVD for learning topic models faces challenges.

We overcome the challenge by revealing a surprising insight: there is a low-dimensional simplex structure which can be viewed as a bridge between the low-rank matrix of interest and the SVD of the text corpus matrix, and which allows us to conveniently reconstruct the former using the latter. Such an insight motivates a new SVD- based approach to learning topic models.

For asymptotic analysis, we show that under the popular probabilistic topic model (Hofmann, 1999), the convergence rate of the l1-error of our method matches that of the minimax lower bound, up to a multi-logarithmic term. In showing these results, we have derived new element-wise bounds on the singular vectors and several large- deviation bounds for weakly dependent multinomial data. Our results on the convergence rate and asymptotical minimaxity are new.

We have applied our method to two data sets, Associated Process (AP) and Statistics Literature Abstract (SLA), with encouraging results. In particular, there is a clear simplex structure associated with the SVD of the data matrices, which largely validates our discovery.

Date: March 31, 2017 10-11am

Venue: Room 526, TTIC. 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in high-dimensional dynamical systems involve time-delays, nonlinearity, and strong autocorrelations, which present major challenges for causal discovery techniques such as Granger causality leading to low detection power. Through large-scale numerical experiments and theoretical analyses we further identify strong detection biases of common causal discovery methods: The detection power for individual links may depend not only on their causal strength, but also on autocorrelation and other dependencies. Here we introduce a reliable method for large-scale linear and nonlinear causal discovery that provides more detection power than current methods and largely overcomes detection biases, allowing to more accurately rank associations in large-scale analyses by their causal strength. The method is demonstrated on a global surface pressure dataset representing atmospheric dynamics.

Title: Combinatorial Inference

Date: March 17, 2017 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Junwei Lin, Princeton University

Abstract: We propose the combinatorial inference to explore the topological structures of graphical models. The combinatorial inference can conduct the hypothesis tests on many graph properties including connectivity, hub detection, perfect matching, etc. In particular, our methods can be applied to any graph property which is invariant under the addition of edges. On the other side, we also propose a shortest self-returning path lemma to prove the general optimality of our testing procedures for various graph properties. The combinatorial inference is also generalized to the time-varying graphical models and we can infer the dynamic topological structures for graphs. Our methods are applied to the neuroscience by discovering hub voxels contributing to visual memories.

Date: March 10, 2017 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Stochastic gradient descent procedures have gained popularity for iterative parameter estimation from large datasets because they are simpler and faster than classical optimization procedures. However, their statistical properties are not well understood in theory. And in practice, avoiding numerical instability requires careful tuning of key parameters. In this talk, we will focus on implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates, and the amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed. Thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides a full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency, which enables principled statistical inference on large datasets. Part of ongoing work focuses on a crucial aspect of such inference with stochastic approximation procedures, which is to know when the procedure has reached the asymptotic regime.

Date: November 18, 2016 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Weiran Wang, TTIC

Abstract: We study the sample and time complexity of canonical correlation analysis (CCA) in the population setting. With mild assumptions on the data distribution, we show that in order to achieve $\epsilon$-suboptimality in a properly defined measure of alignment between the estimated canonical directions and the population solution, we can solve the empirical objective exactly with $O(\frac{1}{\epsilon^? \Delta^2 \gamma^2}$ samples, where $\Delta$ is the singular value gap of the whitened cross-covariance matrix and $\frac{1}{\gamma}$ is an upper bound of the condition numbers of the auto-covariance matrices. Moreover, we can achieve the same learning accuracy by drawing the same level of samples and solving the empirical objective approximately with a stochastic optimization algorithm; this algorithm is based on the shift-and-invert power iterations and only needs to process the dataset for $O(log(1/\epsilon)$ passes. Finally, we show that, given an estimate of the canonical correlation, the streaming version of shift-and-invert power iterations achieves the same learning accuracy with the same level of sample complexity.

This is joint work with Jialei Wang, Dan Garber, and Nathan Srebro.

Date: November 11, 2016 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Mark Hallen, TTIC

Abstract: Computational protein design algorithms search for modifications in protein systems that will optimize a desired function. Usually, this function involves chemical binding of a protein to another protein or to a drug. Binding is exquisitely sensitive to the 3-D geometry adopted by the system, with low-energy geometries preferred, so to optimize binding we must be able to optimize energy with respect to molecular geometry. Thus, our optimization—both over chemical modifications and over 3-D geometries—must take as input an “energy function,” which maps atomic coordinates to energy. In this work, we present methods to represent the energy function that are amenable to more efficient and more accurate protein design calculations than were previously possible. We pose the calculation of this representation as a machine learning problem. Two relatively simple models are found to be helpful: a continuous polynomial representation and a discrete expansion in local terms. These models open the possibility of modeling binding much more realistically in protein design calculations, while performing optimization efficiently.

Date: November 4, 2016 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Eva Dyer Northwestern University

Abstract: As the size and complexity of neural datasets continue to grow, there is an increasing need for scalable and black-box approaches to extract knowledge from these data. In the first half of this talk, I will discuss my recent work in developing methods to uncover the structure and function of neural circuits in large-scale neural recordings. My aim is to provide an introduction to modern neural datasets for a ML audience, and discuss some of the challenges that we now face in developing learning algorithms for these data. In the second half of the talk, I will discuss my recent work in global black-box optimization (at UAI 2016). The idea behind our approach is to learn a convex relaxation of a (possibly) non-convex function by estimating its convex envelope (i.e., tight convex lower bound). We show that our approach can lead to provable convergence to the global optimum for classes of smooth functions.

This is work with Mohammad Gheshlaghi Azar (Google DeepMind), Konrad Kording (Northwestern), and Bobby Kasthuri (University of Chicago, Argonne National Laboratory).

Date: October 28, 2016 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Extracting knowledge and providing insights into the complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Networks are an example of simple, yet powerful tools for capturing relationships among entities over time. For example, in social media, networks represent connections between different individuals and the type of interaction that two individuals have. In systems biology, networks can represent the complex regulatory circuitry that controls cell behavior. Unfortunately the relationships between entities are not always observable and need to be inferred from nodal measurements.

I will present a line of work that deals with the estimation and inference in high-dimensional semi-parametric elliptical copula models. I will explain why these models are useful in exploring complex systems, how to efficiently estimate parameters of the model, and also provide theoretical guarantees that justify usage of the models in scenarios where more rigid Gaussian graphical models are commonly used.

Joint work with Rina Foygel Barber, Junwei Lu, and Han Liu.

Date: October 21, 2016 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: In this work, we show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random initialization.

(Joint work with Behnam Neyshabur, Nathan Srebro)

Date: October 14, 2016 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Blake Woodworth, TTIC

Abstract: We provide tight upper and lower bounds on the complexity of minimizing the average of m convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smooth- ing AGD that improve over methods using just gradient accesses.

Date: October 7, 2016 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Nathan Srebro, TTIC

Abstract: As machine learning increasingly replaces human judgment in decisions protected by anti discrimination law, the problem of algorithmicly measuring and ensuring fairness in machine learning is pressing. What does it mean for a predictor to not discriminate with respect to protected group (e.g. according to race, gender, etc)?

We propose a notion of non-discrimination that can be measured statistically, used algorithmicly, and avoids many of the pitfalls of previous definitions. We further study what type of discrimination and non-discrimination can be identified with oblivious tests, which treat the predictor as an opaque black-box, and what different oblivious tests tell us about possible discrimination.

(Joint work with Mortiz Hardt and Eric Price)

Date: September 30, 2016 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Hao Tang, TTIC

Abstract: Recent work on discriminative segmental models has shown that they can achieve competitive speech recognition performance, using features based on deep neural frame classifiers. However, segmental models can be more challenging to train than standard frame-based approaches. While some segmental models have been successfully trained end-to-end, there is a lack of understanding of their training under different settings and with different losses.

We investigate a model class based on recent successful approaches, consisting of a linear model that combines segmental features based on an LSTM frame classifier. Similarly to hybrid HMM-neural network models, segmental models of this class can be trained in two stages (frame classifier training followed by linear segmental model weight training), end-toend (joint training of both frame classifier and linear weights), or with end-to-end fine-tuning after two-stage training.

In this work we present several findings. First, models in this class can be trained with various kinds of losses. We present segmental models trained end-to-end with hinge loss, log loss, latent hinge loss, and marginal log loss. We compare several losses for the case where training alignments are available as well as where they are not. We find that in general, marginal log loss provides the most consistent strong performance without requiring ground-truth alignments. We also find that training with dropout is very important in obtaining good performance with end-to-end training. Finally, the best results are often obtained by a combination of two-stage training and fine-tuning.