Upcoming talks:
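Organizer: 吴宇楠
Time: April 25, 13:30-14:30
Venue: Shuangqing Complex Building, Room C548
Speaker: Prof. 胡懿娟, Beijing International Center for Mathematical Research, Peking University
Title: Analysis of Differential Abundance in Compositional Data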
Abstract: In the era of big data, many sequencing-based molecular datasets are compositional, meaning they are expressed as percentages. While microbiome data is the best-known example, single-cell subtype abundance data is also compositional in nature. These datasets are often sparse, containing numerous zero values due to the large number of features and limited sequencing depth. Compositional analysis typically assumes that only a small proportion of taxa are differentially abundant, while the ratios of relative abundances among the remaining taxa remain stable. Most existing methods rely on log-transformed data; however, log-transformation becomes problematic when zero counts are pervasive, often resulting in poor control of the false discovery rate (FDR). To address these challenges, we propose Logistic Compositional Analysis (LOCOM), a robust logistic regression-based approach for compositional data analysis that eliminates the need for pseudocounts. LOCOM leverages permutation-based inference to account for overdispersion and small sample sizes. Additionally, it employs an asymptotic approach to enhance computational efficiency for large-sample datasets. To mitigate batch effects, which commonly arise from systematic differences in sequencing depth in large-sample studies, LOCOM appropriately weights samples. Our simulations demonstrate that LOCOM consistently maintains FDR control while achieving significantly improved sensitivity compared to existing methods.
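The zero-count problem mentioned in the abstract can be illustrated with a minimal sketch. This is not LOCOM itself; it only shows that a log-ratio of relative abundances is undefined when a count is zero, and that the value obtained after adding a pseudocount depends strongly on the pseudocount chosen. The sample counts, taxon names, and the `log_ratio` helper are hypothetical and invented for illustration.

```python
import math


def log_ratio(counts, taxon, reference, pseudocount=0.0):
    """Log-ratio of two taxa's relative abundances after adding a pseudocount."""
    shifted = {name: c + pseudocount for name, c in counts.items()}
    total = sum(shifted.values())
    rel = {name: c / total for name, c in shifted.items()}
    # math.log raises ValueError for a zero argument, so a zero count breaks here
    return math.log(rel[taxon] / rel[reference])


# Hypothetical read counts for one sample; taxon_a was not observed.
sample = {"taxon_a": 0, "taxon_b": 40, "taxon_c": 60}

try:
    log_ratio(sample, "taxon_a", "taxon_b")
    zero_is_defined = True
except ValueError:
    zero_is_defined = False

# The same sample gives very different log-ratios under different pseudocounts.
ratios = {pc: log_ratio(sample, "taxon_a", "taxon_b", pc) for pc in (0.01, 0.5, 1.0)}

print("log-ratio defined without pseudocount:", zero_is_defined)
for pc, value in ratios.items():
    print(f"pseudocount={pc}: log-ratio={value:.3f}")
```

The spread across pseudocounts is the kind of arbitrariness that a pseudocount-free method aims to avoid.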
Past talks:
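Organizer: 吴宇楠
Time: April 18, 13:00-14:00
Venue: Shuangqing Complex Building, Room C654
Speaker: Assoc. Prof. 罗涛, School of Mathematical Sciences, Shanghai Jiao Tong University
Title: Parameter Condensation in Neural Networks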
Abstract: In this talk, we will first introduce the phenomenon of parameter condensation in neural networks, which refers to the tendency of certain parameters to converge towards the same values during training. Then, for certain types of networks, we prove that condensation occurs in the early stages of training. We further analyze which hyperparameters and training strategies influence parameter condensation. In some cases, we even provide a phase diagram that delineates whether parameter condensation occurs. We will also briefly discuss the relationship between parameter condensation and generalization ability. Finally, for the late stage of training, we study the set of global minima and present a detailed analysis of its geometric structure and convergence properties.
————————————————————————————————————————————————————————————
Organizer: 吴宇楠
Time: April 18, 16:00-17:00
Venue: Shuangqing Complex Building, Room C654
Speaker: Asst. Prof. 蔡标, College of Business, City University of Hong Kong
Title: Latent Network Structure Learning from High Dimensional Multivariate Point Processes
Abstract: Learning the latent network structure from large-scale multivariate point process data is an important task in a wide range of scientific and business applications. For instance, we might wish to estimate the neuronal functional connectivity network based on spiking (or firing) times recorded from a collection of neurons. To characterize the complex processes underlying the observed point patterns, we propose a new and flexible class of non-stationary Hawkes processes that allow both excitatory and inhibitory effects. We estimate the latent network structure using a scalable sparse least squares estimation approach. Using a novel thinning representation, we establish concentration inequalities for the first- and second-order statistics of the proposed Hawkes process. Such theoretical results enable us to establish the nonasymptotic error bound and the selection consistency of the estimated parameters. Furthermore, we describe a penalized least squares-based statistic for testing whether the background intensity is constant in time. We apply our proposed method to a neurophysiological data set that studies working memory.
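Organizer: 吴宇楠
Time: March 28, 16:00-17:00
Venue: Shuangqing Complex Building, Room C654
Speaker: Assoc. Prof. 苗旺, Department of Probability and Statistics, Peking University
Title: Causal inference for dyadic data in randomized experiments with interference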
Abstract: Estimating the treatment effect in a network is of particular interest in online experimentation conducted every day at social media companies. We investigate a novel setting where the outcome of interest comprises a series of dyadic outcomes, such as forwarding a message or sharing a link between friends, or international trade relations between countries. Dyadic outcomes are pervasive in many social network sources and of particular interest in online experimentation (A/B testing). We propose a causal inference framework for dyadic outcomes in randomized experiments in the presence of network interference, and develop consistent estimators of the global average causal effect. We derive the convergence rate and variance bound of the proposed estimators, and provide a conservative variance estimator for quantifying the estimation uncertainty. We illustrate the approach with a variety of numerical experiments and apply it to an online experiment in WeChat Channels.
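Organizer: 吴宇楠
Time: March 21, 16:00-17:00
Venue: Shuangqing Complex Building, Room C654
Speaker: Asst. Prof. 王禹皓, Institute for Interdisciplinary Information Sciences, Tsinghua University
Title: Residual permutation test for regression coefficient testing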
Abstract: We consider the problem of testing whether a single coefficient is equal to zero in linear models when the dimension of covariates p can be up to a constant fraction of sample size n. In this regime, an important topic is to propose tests with finite-population valid size control without requiring the noise to follow strong distributional assumptions. In this paper, we propose a new method, called residual permutation test (RPT), which is constructed by projecting the regression residuals onto the space orthogonal to the union of the column spaces of the original and permuted design matrices. RPT can be proved to achieve finite-population size validity under fixed design with just exchangeable noise, whenever p < n/2. Moreover, RPT is shown to be asymptotically powerful for heavy-tailed noise with bounded (1+t)-th order moment when the true coefficient is at least of order n^{-t/(1+t)} for t \in [0,1]. We further prove that this signal size requirement is essentially rate-optimal in the minimax sense. Numerical studies confirm that RPT performs well in a wide range of simulation settings with normal and heavy-tailed noise distributions.
Organizer: 吴宇楠
Time: March 14, 16:00-17:00
Venue: Shuangqing Complex Building, Room C654
Speaker: Prof. Fan Li, Department of Statistical Science, Duke University
Title: Interacted two-stage least squares with treatment effect heterogeneity
Abstract: Treatment effect heterogeneity with respect to covariates is common in instrumental variable (IV) analyses. An intuitive approach, which we term the interacted two-stage least squares (2SLS), is to postulate a linear working model of the outcome on the treatment, covariates, and treatment-covariate interactions, and instrument it by the IV, covariates, and IV-covariate interactions. We clarify the causal interpretation of the interacted 2SLS under the local average treatment effect (LATE) framework when the IV is valid conditional on covariates. Our contributions are threefold. First, we show that the interacted 2SLS with centered covariates is consistent for estimating the LATE if either of the following conditions holds: (i) the treatment-covariate interactions are linear in the covariates; (ii) the linear outcome model underlying the interacted 2SLS is correct. Second, we show that the coefficients of the treatment-covariate interactions from the interacted 2SLS are consistent for estimating treatment effect heterogeneity with regard to covariates among compliers if either condition (i) or condition (ii) holds. Moreover, we connect the 2SLS estimator with the reweighting perspective in Abadie (2003) and establish the necessity of condition (i) in the absence of additional assumptions on potential outcomes. Third, leveraging the consistency guarantees of the interacted 2SLS for categorical covariates, we propose a stratification strategy based on the IV propensity score to approximate the LATE and treatment effect heterogeneity with regard to the IV propensity score when neither condition (i) nor condition (ii) holds.
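For readers unfamiliar with the setup, the working model described in the abstract can be written out as follows. This is only a sketch: the symbols (outcome Y_i, treatment D_i, instrument Z_i, covariate vector X_i, and the coefficient names) are assumed notation, not taken from the talk.

```latex
% Assumed notation: Y_i outcome, D_i treatment, Z_i instrument, X_i covariates.
% Linear working model with treatment-covariate interactions:
\[
  Y_i = \beta_0 + \beta_D D_i + \beta_X^{\top} X_i
        + \beta_{DX}^{\top} (D_i X_i) + \varepsilon_i .
\]
% The endogenous regressors D_i and D_i X_i are instrumented by
% the IV, the covariates, and the IV-covariate interactions:
\[
  \text{instruments: } \bigl(1,\; Z_i,\; X_i,\; Z_i X_i\bigr).
\]
```

In this notation, the abstract's consistency results concern the 2SLS estimate of \beta_D when the covariates are centered, and the estimate of \beta_{DX} for heterogeneity among compliers.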
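Organizer: 吴宇楠
Time: March 14, 14:00-15:00
Venue: Shuangqing Complex Building, Room C548
Speaker: Prof. 喻达磊, School of Mathematics and Statistics, Xi'an Jiaotong University
Title: Unified optimal model averaging with a general loss function based on cross-validation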
Abstract: Studying unified model averaging estimation for situations with complicated data structures, we propose a novel model averaging method based on cross-validation (MACV). MACV unifies a large class of new and existing model averaging estimators and covers a very general class of loss functions. Furthermore, to reduce the computational burden caused by the conventional leave-one/subject-out cross-validation, we propose a SEcond-order-Approximated Leave-one/subject-out (SEAL) cross-validation, which greatly improves computational efficiency. As a useful tool, we extend the Bernstein-type inequality for strongly mixing random variables that are not necessarily identically distributed. In the context of non-independent and non-identically distributed random variables, we establish the unified theory for analyzing the asymptotic behaviors of the proposed MACV and SEAL methods, where the number of candidate models is allowed to diverge with sample size. To demonstrate the breadth of the proposed methodology, we exemplify four optimal model averaging estimators under four important situations, i.e., longitudinal data with discrete responses, within-cluster correlation structure modeling, conditional prediction in spatial data, and quantile regression with a potential correlation structure. We conduct extensive simulation studies and analyze real-data examples to illustrate the advantages of the proposed methods.
2024-12-16/17
Zoom Meeting ID: 271 534 5558 Passcode: YMSC
https://us06web.zoom.us/j/2715345558?pwd=eXRTTExpOVg4ODFYellsNXZVVlZvQT09
Time: December 16, 11:00-12:00
Venue: Shuangqing Complex Building, Room C548
Speaker: 胡祺睿, Center for Statistical Science, Tsinghua University
Title: Simultaneous Inference for Eigensystems and FPC Scores of Functional Data
Abstract:
Functional data analysis has become a pivotal field in statistics, emphasizing data represented by functions rather than scalar values. Although significant progress has been made in estimating fundamental elements such as mean and covariance functions, simultaneous inference for eigensystems and functional principal component (FPC) scores remains challenging.
In this talk, we introduce novel methodologies for the simultaneous inference of eigensystems and the distribution of FPC scores in densely observed functional data, along with their asymptotic properties, which hold in C[0,1] and for a diverging number of estimators. We validate our approaches through simulations and apply them to electroencephalogram (EEG) data, demonstrating their practical utility in testing hypotheses related to FPCs and the distribution of FPC scores. Finally, we discuss extensions to two-dimensional functional data, functional time series, and a unified theory bridging sparse and dense functional data.
Time: December 17, 16:00-17:00
Venue: Shuangqing Complex Building, Room C548
Speaker: Chen Cheng, Department of Statistics, Stanford University
Title: Towards modern datasets: laying mathematical foundations to streamline machine learning
Abstract:
Datasets are central to the development of statistical learning theory and the evolution of models. The burgeoning success of modern machine learning in sophisticated tasks relies crucially on the rapid growth of massive datasets (cf. Donoho), such as ImageNet, SuperGLUE and LAION-5B. However, this evolution breaks standard statistical learning assumptions and tools.
In this talk, I will present two stories that tackle challenges posed by modern datasets and leverage statistical theory to shed light on how we should streamline modern machine learning.
In the first part, we study multilabeling, a curious aspect of modern human-labeled datasets that is often missing in the statistical machine learning literature. We develop a stylized theoretical model to capture uncertainties in the labeling process, allowing us to understand the contrasts, limitations and possible improvements of using aggregated or non-aggregated data in a statistical learning pipeline. In the second part, I will present novel theoretical tools that are not readily available in the classical literature, such as random matrix theory under the proportional regime. Theoretical tools for the proportional regime are crucial for understanding “benign-overfitting” and “memorization”. However, this is not always the most natural setting in statistics, where columns correspond to covariates and rows to samples. With the objective of moving beyond the proportional asymptotics, we revisit ridge regression (ℓ2-penalized least squares) on i.i.d. data X ∈ R^{n×d}, y ∈ R^n. We allow the feature vector to be infinite-dimensional (d = ∞), in which case it belongs to a separable Hilbert space.
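Time: November 29, 16:00-17:00
Venue: Shuangqing Complex Building, Room C654
Speaker: 李伟, Associate Professor, School of Statistics, Renmin University of China
Title: Discovery and inference of possibly bi-directional causal relationships with invalid instrumental variables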
Abstract: Learning causal relationships between pairs of complex traits from observational studies is of great interest across various scientific domains. However, most existing methods assume the absence of unmeasured confounding and restrict causal relationships between two traits to be uni-directional, assumptions that may be violated in real-world systems. In this paper, we address the challenge of causal discovery and effect inference for two traits while accounting for unmeasured confounding and potential feedback loops. By leveraging possibly invalid instrumental variables, we provide identification conditions for causal parameters in a model that allows for bi-directional relationships, and we also establish identifiability of the causal direction under the introduced conditions. Then we propose a data-driven procedure to detect the causal direction and provide inference results about causal effects along the identified direction. We show that our method consistently recovers the true direction and produces valid confidence intervals for the causal effect. We conduct extensive simulation studies to show that our proposal outperforms existing methods. Finally, we apply our method to analyze real data sets from the UK Biobank.
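Time: November 22, 16:00-17:00
Venue: Shuangqing Complex Building, Room C654
Speaker: 孙玉莹, Associate Researcher, Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Title: Model averaging for time-varying vector autoregressions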
Abstract: This paper proposes a novel time-varying model averaging (TVMA) approach to enhancing forecast accuracy for multivariate time series subject to structural changes. The TVMA method averages predictions from a set of time-varying vector autoregressive models using optimal time-varying combination weights selected by minimizing a penalized local criterion. This allows the relative importance of different models to adaptively evolve over time in response to structural shifts. We establish asymptotic optimality of the proposed TVMA approach in achieving the lowest possible quadratic forecast errors. The convergence rate of the selected time-varying weights to the optimal weights minimizing expected quadratic errors is derived. Moreover, we show that when one or more correctly specified models exist, our method consistently assigns full weight to them, and asymptotic normality of the TVMA estimators can be established under some regularity conditions. Furthermore, the proposed approach encompasses special cases including time-varying VAR models with exogenous predictors, as well as time-varying FAVAR models. Simulations and an empirical application illustrate that the proposed TVMA method outperforms some commonly used model averaging and selection methods in the presence of structural changes.