How to Deal with Big Data? Understanding Large-scale Distributed Regression

Monday, April 1, 2019 - 1:25pm - 2:25pm
Lind 305
Edgar Dobriban (Wharton School of the University of Pennsylvania)
Modern massive datasets pose an enormous computational burden to practitioners. Distributed computation has emerged as a universal approach to ease the burden: datasets are partitioned over machines, which compute locally and communicate short messages. Distributed data also arises for privacy reasons, as with medical databases. It is therefore important to study how to do statistical inference and machine learning in a distributed setting. In this talk, we present results about one-step parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine and take a weighted average of the parameters. How much do we lose compared to doing linear regression on the full data? We study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We discover several key phenomena. First, averaging is not optimal, and we find the exact performance loss. Second, different problems are affected differently by the distributed framework: estimation error and confidence interval length increase substantially, while prediction error increases much less. These results match numerical simulations and a data analysis example. To derive these results, we rely on recent results from random matrix theory, where we also develop a new calculus of deterministic equivalents as a tool of broader interest.
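The scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the talk's method: the sample sizes are arbitrary, the splits are equal-sized so the weighted average reduces to a plain mean, and the talk's optimal (non-uniform) weights differ in general. The high-dimensional regime (p comparable to n) is chosen so the accuracy loss from averaging is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear model with illustrative sizes (not from the talk):
# n samples, p parameters comparable to the local sample size, k machines.
n, p, k = 2000, 400, 4
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

# Baseline: OLS on the full data.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step averaging: partition rows across k "machines", solve OLS
# locally, then average the local estimates. With equal splits this is
# a uniform weighted average; optimal weights would differ.
Xs, ys = np.array_split(X, k), np.array_split(y, k)
local = [np.linalg.lstsq(Xk, yk, rcond=None)[0] for Xk, yk in zip(Xs, ys)]
beta_avg = np.mean(local, axis=0)

err_full = np.linalg.norm(beta_full - beta)
err_avg = np.linalg.norm(beta_avg - beta)
print(f"full-data OLS error: {err_full:.3f}")
print(f"averaged OLS error:  {err_avg:.3f}")
```

In this regime the averaged estimator's error is noticeably larger than the full-data estimator's, consistent with the estimation-error loss the abstract describes.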

Edgar Dobriban is an assistant professor of statistics at the Wharton School of the University of Pennsylvania. He obtained his PhD in Statistics in 2017 from Stanford University, advised by David Donoho, and his undergraduate degree in mathematics from Princeton University in 2012. His research interests are in developing statistical methods and theory for large-scale data analysis.