Data-dependent Coreset for Large-scale, Robust, and Dynamic Machine Learning
Time: Tuesday, 15:00-16:00, 2021-06-22
With the rapid development of big data, we are often confronted with large-scale and noisy datasets in many machine learning tasks. Coresets are a popular data compression technique that has been studied extensively. However, most existing coreset methods are problem-dependent and cannot be used as a general tool for a broader range of applications. A key obstacle is that they often rely on bounds on the pseudo-dimension and the total sensitivity, which can be very high or hard to obtain. Moreover, existing coreset methods are sensitive to outliers and cannot be constructed efficiently in a dynamic environment with data insertions and deletions. In this talk, we introduce a new data-dependent framework for coreset construction that is useful for many popular optimization objectives, such as k-means/median clustering, Lasso, Ridge, logistic regression, and Gaussian mixture models. In particular, our framework can effectively deal with outliers and dynamic updates. To the best of our knowledge, this is the first robust and fully dynamic coreset construction method for these problems. Parts of this work recently appeared at ICML’20 and ICML’21.