Statistical Seminar - Mathematical Sciences Center, Tsinghua University

Statistical Seminar

Speakers: 彭洋, 叶成龙

Organizer: 吴宇楠

Time: 9:00~10:00 and 14:00~15:00, 2025-12-26

Venue: Shuangqing Complex Building

Upcoming talks:

Organizer: 吴宇楠

Time: December 26, 9:00~10:00

Venue: Shuangqing Complex Building, B725

Speaker: 彭洋, School of Mathematical Sciences, Peking University

Title: Convergence and Inference of Distributional Reinforcement Learning

Abstract: Distributional reinforcement learning (RL) has achieved remarkable success in various domains by modeling the full distribution of returns rather than just the expectation. Despite the rapid development of algorithms in recent years, the fundamental statistical properties underlying these methods remain largely underexplored.

In this talk, I will present a rigorous statistical framework for distributional RL. First, I will establish that the sample complexity of the distributional temporal difference (TD) learning algorithm is minimax optimal (up to logarithmic factors) under the 1-Wasserstein distance. A surprising implication of this result is that estimating the infinite-dimensional return distribution does not require more samples than estimating the expected return in classic RL. Second, I will introduce a model-based variant of the algorithm and demonstrate the asymptotic normality of the resulting estimators, thereby facilitating valid statistical inference for the return distribution.

To derive these results, we develop novel theoretical tools, including Freedman’s inequality in Hilbert spaces and sharp matrix concentration inequalities for Markovian data. These mathematical tools are of independent interest and have broad applications in other statistical problems. This talk is based on joint works published in the Annals of Statistics and NeurIPS, as well as several recent working papers.
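For readers unfamiliar with the metric used in the first result: when two empirical return distributions have the same number of equally weighted atoms, their 1-Wasserstein distance reduces to the mean absolute difference of the sorted samples. The sketch below only illustrates that metric; the function name `wasserstein1_empirical` and the toy returns are assumptions and are not taken from the talk.

```python
import numpy as np


def wasserstein1_empirical(x, y):
    """1-Wasserstein distance between two equally weighted samples of equal size."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes samples of equal size")
    # For equal-size, equally weighted samples the optimal coupling
    # matches the sorted atoms one to one.
    return float(np.mean(np.abs(x - y)))


# Hypothetical toy returns, for illustration only.
returns_a = [0.0, 1.0, 2.0, 3.0]
returns_b = [3.5, 0.5, 2.5, 1.5]
print(wasserstein1_empirical(returns_a, returns_b))
```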

Organizer: 吴宇楠

Time: December 26, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Assistant Professor 叶成龙, University of Kentucky

Title: Deep Clustering Evaluation: How to Validate Internal Clustering Validation Measures

Abstract: Deep clustering partitions complex high-dimensional data using deep neural networks. It involves projecting data into lower-dimensional embeddings before partitioning, which poses unique evaluation challenges. Traditional clustering validation measures, designed for low-dimensional spaces, are problematic for deep clustering for two reasons: 1) the curse of dimensionality when applied to the high-dimensional input data, and 2) unreliable comparison of clustering results when applied to embedded data from different embedding spaces, owing to variations in training procedures and model parameter settings. This paper addresses these unresolved and often overlooked challenges in evaluating clustering within deep learning. We propose a systematic evaluation framework for internal clustering validation measures that: (1) theoretically establishes why traditional measures are ineffective when applied to input data or across disparate embedding spaces paired with partitioning outcomes; (2) identifies embedding spaces that support reliable evaluations by detecting groups with high agreement in ranking partitioning outcomes; and (3) develops a stable and robust scoring scheme by weighting index values computed across these identified embedding spaces. Experiments show that this new framework aligns better with external measures, effectively reducing the misguidance from the improper use of internal validation measures in deep clustering evaluation.

Past talks:

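Organizer: 吴宇楠

Time: December 19, 2025, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Professor 孔令臣, School of Mathematics and Statistics, Beijing Jiaotong University

Title: Quantile Regression Analysis for High-Dimensional Multivariate Heterogeneous Data

Abstract: Multi-source data analysis is one of the active topics in modern statistics. Its goal is to mine multi-source datasets in depth in order to obtain accurate parameter estimates. The field covers many learning methods, including federated learning, transfer learning, and fusion learning. We consider a high-dimensional quantile regression model with heterogeneous structure and propose a personalized federated learning method based on the alternating direction method of multipliers (ADMM). On the theoretical side, we establish the basic convergence of the algorithm and the statistical properties of the model solution. Numerical simulations and real-data analysis confirm the effectiveness of the proposed method.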

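Organizer: 吴宇楠

Time: December 12, 2025, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Associate Professor 胡冠宇, Department of Statistics and Probability, Michigan State University

Title: From Hardwood to Heatmap: A Bayesian Dive into the Spatial Dynamics of Basketball Shot Selection

Abstract: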

Basketball shot charts offer rich spatial and contextual information that reveal nuanced patterns in player behavior and game strategy. In this talk, we present a novel Bayesian framework for modeling and interpreting the spatial dynamics of shot selection. We begin with a log Gaussian Cox process (LGCP) model to jointly analyze the locations and outcomes (made/missed) of shots across multiple games, capturing spatially varying covariate effects through hierarchical Gaussian processes. To facilitate efficient inference, we design a custom Markov chain Monte Carlo (MCMC) algorithm using a kernel convolution approach.

Building on this foundation, we introduce a complementary modeling strategy using Functional Bayesian Additive Regression Trees (FBART), which provides flexible, nonparametric regression capabilities and uncertainty quantification. To improve scalability and accuracy, we further propose the Adaptive Functional BART (AFBART) model, which employs adaptive basis functions to better capture the nonlinear and nonstationary nature of shot selection behavior.

Through extensive simulation studies and a real-world case study examining the shot charts of NBA legends Stephen Curry, LeBron James, and Michael Jordan, we showcase the power of our approach in extracting actionable insights. Our framework reveals how spatial context, playing conditions, and opponent characteristics influence shooting efficiency—offering practical tools for analysts, coaches, and players seeking a competitive edge.

Organizer: 吴宇楠

Time: December 5, 2025, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Assistant Professor 刘默雷, Department of Biostatistics, Peking University Health Science Center, and Beijing International Center for Mathematical Research

Title: Multi-source stable variable importance measure via adversarial machine learning

Abstract: As part of enhancing the interpretability of machine learning, it is of renewed interest to quantify and infer the predictive importance of certain exposure covariates. Modern scientific studies often collect data from multiple sources with distributional heterogeneity. Thus, measuring and inferring stable associations across multiple environments is crucial for reliable and generalizable decision-making. We propose MIMAL, a novel statistical framework for Multi-source stable Importance Measure via Adversarial Learning. MIMAL measures the importance of some exposure variables by maximizing the worst-case predictive reward over the source mixture. Our framework allows various machine learning methods for confounding adjustment and exposure effect characterization. For inferential analysis, the asymptotic normality of our introduced statistic is established under a general machine learning framework that requires no stronger learning accuracy conditions than those for single-source variable importance. Numerical studies with various data generation setups and machine learning implementations are conducted to assess the finite-sample performance of MIMAL.

Organizer: 吴宇楠

Time: November 28, 2025, 9:30~10:30 a.m.

Venue: Shuangqing Complex Building, C654

Speaker: Professor 王涛, School of Life Sciences and Biotechnology and School of Mathematical Sciences, Shanghai Jiao Tong University

Title: Factor Models for High-dimensional Count Data

Abstract: This talk presents recent advances in factor modeling for multivariate count data. We propose a maximum variational likelihood approach for estimation and inference under a multinomial factor-augmented inverse regression model, with asymptotic properties established in high-dimensional settings. For Poisson factor models, we introduce a data-driven criterion for determining the number of factors and prove its consistency as both sample size and dimensionality diverge. Extensions to zero-inflated models are also briefly discussed.

Organizer: 吴宇楠

Time: November 17, 9:00~10:00 a.m.

Venue: Shuangqing Complex Building, C548

Speaker: 边蓉, Academy of Mathematics and Systems Science, Chinese Academy of Sciences

Title: Data-Driven Statistical Model Selection and AI for Knowledge Discovery

Abstract: In the era of big data and artificial intelligence, extracting meaningful information from complex data has become increasingly crucial across scientific domains. This talk presents my research on data-driven statistical methods and AI techniques, focusing on model selection and knowledge discovery. The main work addresses model selection for unnormalized probability densities with possibly dependent data. Existing methods face computational challenges due to intractable normalizing constants. I propose MIC, a fast and consistent selection criterion for nested models. MIC achieves consistency under mild regularity conditions while maintaining computational efficiency. Theoretical consistency is rigorously proven, and extensive simulations and real-world applications to financial, automotive, and wind direction data demonstrate its superior performance. Beyond model selection, my research extends to AI-driven knowledge discovery. I develop AutoMathKG, an automated mathematical knowledge graph integrating multiple sources using large language models, with a specialized Math LLM for knowledge completion and reasoning. I also propose WAGRank, an unsupervised keyphrase extraction model based on word attention graphs that exploits semantic information beyond word frequency. These works showcase data-driven statistical and AI methodologies for complex data, bridging theoretical foundations with practical applications.

Organizer: 吴宇楠

Time: November 14, 2025, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Associate Professor 曾涛, School of Economics, Zhejiang University

Title: Variational Model Selection for Latent Variable Models with Massive Data

Abstract: Modern statistical modeling and inference often face large-scale datasets in various scientific domains. To build a reasonable statistical model for such data, latent variable models are favored because the added latent information enriches their interpretation. Selecting a reliable candidate statistical model in this scenario, termed the model selection or model comparison problem, becomes both computationally challenging and theoretically delicate. A variety of detection algorithms have been proposed in the literature, offering different trade-offs between complexity and detection performance. In the modern Bayesian community, Variational Bayes (VB) has emerged as a widely used method for addressing complicated statistical inference under massive data. Frequentist properties of VB have been studied in recent years, providing solid theoretical foundations on which to build a rigorous model selection procedure. In this study, we focus on a class of misspecified generic models, which includes latent variables, and examine the risk functions associated with predictive distributions derived from variational posterior distributions. These risk functions, defined as the expectation of the Kullback-Leibler (KL) divergence between the true data-generating density and the variational predictive distributions, provide a framework for assessing predictive performance. With latent variable models, we review the predictive distributions of VB posteriors and propose a novel information criterion accommodating this class of generic models based on the related risk function. Under certain regularity conditions, we demonstrate that the proposed information criterion is an asymptotically unbiased estimator of its risk function. Several computational methods are introduced to facilitate the calculation of the variational information criterion with latent variables.
Through comprehensive numerical simulations and empirical applications in economics and finance, we demonstrate the effectiveness of these variational information criteria in comparing misspecified latent variable models in the context of massive data.

Organizer: 吴宇楠

Time: November 10, 2025, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Professor 李子林, School of Mathematics and Statistics, Northeast Normal University

Title: All-in-One Toolkit for Biobank-Scale Whole-Genome Sequencing Data Management and Analysis

Abstract: Biobank-scale Whole-Genome Sequencing (WGS) studies are increasingly pivotal in unraveling the genetic bases of diverse health outcomes. However, managing and analyzing the sheer volume and complexity of these datasets presents significant challenges. We propose vcf2agds, an all-in-one toolkit that efficiently converts WGS data from Variant Call Format (VCF) to the annotated Genomic Data Structure (aGDS) format, significantly reducing data size while supporting seamless genomic and functional data integration for comprehensive genetic analyses. Additionally, STAARpipeline equipped with the aGDS files enables scalable, comprehensive and functionally informed WGS analysis, facilitating the detection of common and rare coding and noncoding phenotype-genotype associations. We applied the STAARpipeline to analyze Alzheimer's disease (AD) in 459,216 samples from the UK Biobank. All analyses scale well in computation time and memory. We discover several potentially new significant associations with AD. As WGS datasets continue to expand in size and complexity, our proposed tools will be increasingly useful for unlocking the full potential of genomic research.

Organizer: 吴宇楠

Time: October 31, 2025, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Associate Professor 杨松山, Institute of Statistics and Big Data, Renmin University of China

Title: Cost-aware Portfolios in a Large Universe of Assets

Abstract: This paper proposes a finite-horizon mean-variance portfolio estimator, where the rebalancing decisions are made based on current information on asset returns and transaction costs. The study's novelty is that the transaction costs are integrated within the decision process in a high-dimensional portfolio setting where the number of assets is larger than the sample size. We propose portfolio construction and rebalancing models with a nonconvex penalty considering two types of transaction cost, the proportional transaction cost and the quadratic transaction cost. We establish the desired theoretical properties of our estimator under mild regularity conditions. Monte Carlo simulations and empirical studies using S&P 500 and Russell 2000 stocks show the satisfactory performance of the proposed portfolio and highlight the importance of accounting for transaction costs when rebalancing a portfolio.
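The two cost types named in the abstract can be illustrated with their textbook forms: a proportional cost that is linear in the absolute change of each portfolio weight, and a quadratic cost in the squared change. The sketch below assumes these standard definitions; the function names, the cost rate, and the toy weights are hypothetical and are not taken from the paper.

```python
import numpy as np


def proportional_cost(w_new, w_old, rate):
    """Cost proportional to the absolute turnover in each asset (assumed form)."""
    w_new = np.asarray(w_new, dtype=float)
    w_old = np.asarray(w_old, dtype=float)
    return float(np.sum(rate * np.abs(w_new - w_old)))


def quadratic_cost(w_new, w_old, rate):
    """Cost quadratic in the change of each weight (assumed form)."""
    w_new = np.asarray(w_new, dtype=float)
    w_old = np.asarray(w_old, dtype=float)
    return float(np.sum(rate * (w_new - w_old) ** 2))


# Hypothetical rebalancing from w_old to w_new with a 1% cost rate.
w_old = [0.5, 0.3, 0.2]
w_new = [0.4, 0.4, 0.2]
print(proportional_cost(w_new, w_old, rate=0.01))
print(quadratic_cost(w_new, w_old, rate=0.01))
```

The proportional form penalizes any trade at a constant marginal rate, while the quadratic form penalizes large trades more heavily than small ones.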

Organizer: 吴宇楠

Time: October 13, 2025, 9:30~10:30 a.m.

Venue: Shuangqing Complex Building, C548

Speaker: Professor 谢天, College of Business, Shanghai University of Finance and Economics

Title: Geometry Meets Portfolio: New Optimization Frontiers Beyond Model Averaging

Abstract: In modern financial environments, portfolio optimization increasingly involves navigating non-convex, constrained landscapes where traditional model averaging techniques face significant limitations. This presentation introduces recent advances in geometry-aware and adaptively perturbed optimization methods that extend the frontier beyond classical averaging frameworks. By integrating stochastic exploration with Riemannian geometry, we develop a unified empirical approach that complements and enhances model averaging, particularly in high-dimensional, non-convex, or constraint-laden settings. Applications to real financial datasets highlight the benefits of these methods in escaping local traps, achieving near-global solutions, and improving out-of-sample portfolio efficiency. This talk bridges the gap between model averaging and modern optimization, offering fresh tools for researchers tackling complex decision-making under uncertainty.

Organizer: 吴宇楠

Time: September 26, 2025, 14:00~15:00

Venue: Qiuzhen Hall, Ningzhai

Speaker: Professor 方方, School of Statistics, East China Normal University

Title: Effective Model Averaging

Abstract: As an important alternative to model selection for handling model uncertainty, frequentist model averaging is attractive for its competitive prediction power. However, it has long been criticized for its lack of statistical inference, especially for a general framework and non-nested models. In this talk, we present two general model averaging frameworks with "effective model size". In the first framework, we consider general likelihood estimation and use a weighted geometric mean of the conditional probability density functions estimated from different candidate models, allowing both parameter uncertainty and model misspecification. In the second framework, we consider estimation with a working loss function and propose a general model averaging framework with inference, which is applicable to both nested and non-nested candidate models. The weight selection criteria are based on direct estimation of the prediction risk, and "effective model size" appears in both frameworks. Theoretical and empirical results are presented.
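As a rough illustration of the first framework, a weighted geometric mean of $M$ candidate conditional densities can be written as below. The notation ($\hat f_m$ for the $m$-th estimated density, $w_m$ for its weight, the simplex constraint, and the normalization implied by $\propto$) is assumed here for exposition and may differ from the speaker's formulation.

```latex
\hat f_w(y \mid x) \;\propto\; \prod_{m=1}^{M} \hat f_m(y \mid x)^{\,w_m},
\qquad w_m \ge 0, \quad \sum_{m=1}^{M} w_m = 1 .
```

Taking logarithms shows that, up to the normalizing constant, the averaged log-density is a convex combination of the candidate log-densities.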

Organizer: 吴宇楠

Time: September 22, 2025, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Associate Professor 骆威, Center for Data Science, Zhejiang University

Title: Facilitating model-based clustering by dimension reduction

Abstract: The Gaussian Mixture Model (GMM) has been widely used for clustering analysis. It is commonly fitted by the maximum likelihood approach, which is computationally challenging due to the non-convex optimization involved, especially as the dimensionality grows. To address this issue, we propose a two-step approach by recovering the intrinsic low-dimensional structure of GMM under additional constraints on its heterogeneity; that is, there exists a low-dimensional linear transformation of the data, given which the rest of the data are normally distributed and thus redundant for clustering. Our approach first recovers the desired low-dimensional data based on Stein's Lemma and then uses the reduced data only to fit GMM. Its computational efficiency comes from both the lower dimensionality and denoising of the data. Under a sparsity assumption on the clustering pattern, our approach can be generalized to high-dimensional settings. With the aid of a newly constructed pseudo response, it can also be embedded into a general framework of sufficient dimension reduction, which encompasses a wider class of methods beyond Stein's Lemma to recover the low-dimensional structure of GMM. These findings are illustrated in the numerical studies at the end.

Organizer: 吴宇楠

Time: May 30, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Zoom Meeting ID: 271 534 5558 Passcode: YMSC

https://us06web.zoom.us/j/2715345558?pwd=eXRTTExpOVg4ODFYellsNXZVVlZvQT09

Speaker: Dr. Jiaji Su, Department of Statistics and Data Science, National University of Singapore

Title: Principal Decomposition with Nested Submanifolds

Abstract: Over the past decades, the increasing dimensionality of data has increased the need for effective data decomposition methods. Existing approaches, however, often rely on linear models or lack sufficient interpretability or flexibility. To address this issue, we introduce a novel nonlinear decomposition technique called the principal nested submanifolds, which builds on the foundational concepts of principal component analysis. This method exploits the local geometric information of data sets by projecting samples onto a series of nested principal submanifolds with progressively decreasing dimensions. It effectively isolates complex information within the data in a backward stepwise manner by targeting variations associated with smaller eigenvalues in local covariance matrices. Unlike previous methods, the resulting subspaces are smooth manifolds, not merely linear spaces or special shape spaces. Validated through extensive simulation studies and applied to real-world RNA sequencing data, our approach surpasses existing models in delineating intricate nonlinear structures. It provides more flexible subspace constraints that improve the extraction of significant data components and facilitate noise reduction. This innovative approach not only advances the non-Euclidean statistical analysis of data with low-dimensional intrinsic structure within Euclidean spaces, but also offers new perspectives for dealing with high-dimensional noisy data sets in fields such as bioinformatics and machine learning.

Organizer: 吴宇楠

Time: May 23, 16:00~17:00

Venue: Shuangqing Complex Building, C654

Speaker: Assistant Professor Chao Zheng, School of Mathematical Sciences and Southampton Statistical Sciences Research Institute, University of Southampton, United Kingdom

Title: Optimal Spatial Anomaly Detection—Theory and Applications

Abstract: There has been growing interest in multiple changepoint and anomaly detection problems recently, but the focus has mostly been on changes taking place along the time index. In this work, we investigate the anomaly-in-mean model on a multidimensional spatial lattice, that is, we aim to detect the number and locations of anomalous spatial regions that deviate from the baseline. In addition to the usual minimisation of a cost function with a penalisation related to the number of anomalies, we also introduce a new penalty on the area of the minimum convex hull that covers the anomaly regions. We show that our estimators of the number and locations of anomalies are consistent, and prove that the method achieves optimal localisation error under the minimax framework. We also propose a dynamic programming algorithm to solve the penalised cost minimisation approximately and carry out large-scale Monte Carlo simulations to examine its performance. The method has a wide range of applications in climate problems. As an example, we apply it to detect marine heatwaves using sea surface temperature data from the European Space Agency.

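Organizer: 吴宇楠

Time: May 16, 16:00~17:00

Venue: Shuangqing Complex Building, C654

Speaker: Professor 张敏, Vanke School of Public Health, Tsinghua University

Title: Reflections on and Practice of Biostatistical Research: Methods, Applications, and Data

Abstract: As a biostatistics researcher, I will draw on my own research experience to discuss how to carry out biostatistical research. Within an ideal framework that integrates theory and application, I will share my practical explorations in combining statistical methods with health and medical applications, including the challenges encountered and success stories. I will focus on the development of change-point and threshold estimation methods, innovations in survival analysis models, and an application based on an electronic registry database of cardiac surgery. Data are also a crucial part of biostatistical research, yet they often do not receive enough attention from (bio)statisticians. I will also introduce my recent work on data and our natural population cohort.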

Organizer: 吴宇楠

Time: May 9, 16:00~17:00

Venue: Shuangqing Complex Building, C654

Speaker: Associate Professor 李小鸥, Department of Statistics, University of Minnesota, USA

Title: Globally-Optimal Greedy Active Sequential Estimation

Abstract: Modern applications such as computerized adaptive testing, sequential rank aggregation, and heterogeneous data source selection increasingly rely on active sequential estimation to enhance parameter inference. This talk explores the design of adaptive experiment selection rules that maximize estimation accuracy while maintaining computational efficiency. Greedy information-based selection strategies, which optimize information gain one step ahead, are widely used due to their flexibility and broad applicability. However, their optimality in the multidimensional setting remains an open question. This talk addresses this gap by establishing rigorous guarantees for multidimensional active sequential estimation within a unified decision-theoretic framework. We prove that maximum likelihood estimators paired with a class of greedy selection rules achieve consistency, asymptotic normality, and optimal risk performance. Additionally, we extend these results to incorporate early stopping mechanisms. Extensive numerical studies on both synthetic and real-world datasets illustrate the advantages of the proposed methods.

Organizer: 吴宇楠

Time: April 25, 13:30~14:30

Venue: Shuangqing Complex Building, C548

Speaker: Professor 胡懿娟, Beijing International Center for Mathematical Research, Peking University

Title: Analysis of Differential Abundance in Compositional Data

Abstract: In the era of big data, many sequencing-based molecular datasets are compositional, meaning they are expressed as percentages. While microbiome data is the most well-known example, single-cell subtype abundance data is also compositional in nature. These datasets are often sparse, containing numerous zero values due to the large number of features and limited sequencing depth. Compositional analysis typically assumes that only a small proportion of taxa are differentially abundant, while the ratios of relative abundances among the remaining taxa remain stable. Most existing methods rely on log-transformed data; however, log-transformation becomes problematic when zero counts are pervasive, often resulting in poor control of the false discovery rate (FDR). To address these challenges, we propose Logistic Compositional Analysis (LOCOM), a robust logistic regression-based approach for compositional data analysis that eliminates the need for pseudocounts. LOCOM leverages permutation-based inference to account for overdispersion and small sample sizes. Additionally, it employs an asymptotic approach to enhance computational efficiency for large-sample datasets. To mitigate batch effects, which commonly arise from systematic differences in sequencing depth in large-sample studies, LOCOM appropriately weights samples. Our simulations demonstrate that LOCOM consistently maintains FDR control while achieving significantly improved sensitivity compared to existing methods.
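The difficulty with log-transforming sparse counts can be seen in a toy calculation: a zero count has no finite logarithm, so a pseudocount must be added, and the resulting log-ratio then depends on that arbitrary choice. The sketch below uses made-up counts and a hypothetical helper `log_ratio`; it illustrates the problem only and is not the LOCOM method.

```python
import math


def log_ratio(counts, i, j, pseudocount):
    """Log-ratio of taxon i to taxon j after adding a pseudocount to both."""
    return math.log(counts[i] + pseudocount) - math.log(counts[j] + pseudocount)


# Hypothetical read counts for three taxa; the first taxon is unobserved.
counts = [0, 10, 90]

# The same data give noticeably different log-ratios under two common pseudocounts.
for pc in (0.5, 1.0):
    print(pc, log_ratio(counts, 0, 2, pc))
```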

Organizer: 吴宇楠

Time: April 18, 13:00~14:00

Venue: Shuangqing Complex Building, C654

Speaker: Associate Professor 罗涛, School of Mathematical Sciences, Shanghai Jiao Tong University

Title: Parameter Condensation in Neural Networks

Abstract: In this talk, we will first introduce the phenomenon of parameter condensation in neural networks, which refers to the tendency of certain parameters to converge towards the same values during training. Then, for certain types of networks, we prove that condensation occurs in the early stages of training. We further analyze which hyperparameters and training strategies influence parameter condensation. In some cases, we even provide a phase diagram that delineates whether parameter condensation occurs. We will also briefly discuss the relationship between parameter condensation and generalization ability. Finally, towards the end of training, we study the set of global minima and present a detailed analysis of its geometric structure and convergence properties.

Organizer: 吴宇楠

Time: April 18, 16:00~17:00

Venue: Shuangqing Complex Building, C654

Speaker: Assistant Professor 蔡标, College of Business, City University of Hong Kong

Title: Latent Network Structure Learning from High Dimensional Multivariate Point Processes

Abstract: Learning the latent network structure from large-scale multivariate point process data is an important task in a wide range of scientific and business applications. For instance, we might wish to estimate the neuronal functional connectivity network based on spiking (or firing) times recorded from a collection of neurons. To characterize the complex processes underlying the observed point patterns, we propose a new and flexible class of non-stationary Hawkes processes that allow both excitatory and inhibitory effects. We estimate the latent network structure using a scalable sparse least squares estimation approach. Using a novel thinning representation, we establish concentration inequalities for the first and second order statistics of the proposed Hawkes process. Such theoretical results enable us to establish the nonasymptotic error bound and the selection consistency of the estimated parameters. Furthermore, we describe a penalized least squares based statistic for testing if the background intensity is constant in time. We apply our proposed method to a neurophysiological data set that studies working memory.

Organizer: 吴宇楠

Time: March 28, 16:00~17:00

Venue: Shuangqing Complex Building, C654

Speaker: Associate Professor 苗旺, Department of Probability and Statistics, Peking University

Title: Causal inference for dyadic data in randomized experiments with interference

Abstract: Estimating the treatment effect in a network is of particular interest in online experimentation conducted every day at social media companies. We investigate a novel setting where the outcome of interest comprises a series of dyadic outcomes, such as forwarding a message or sharing a link between friends, or international trade relations between countries. Dyadic outcomes are pervasive in many social network sources and of particular interest in online experimentation (A/B testing). We propose a causal inference framework for dyadic outcomes in randomized experiments in the presence of network interference, and develop consistent estimators of the global average causal effect. We derive the convergence rate and variance bound of the proposed estimators, and provide a variance estimator that is conservative for quantifying the estimation uncertainty. We illustrate with a variety of numerical experiments and apply our approach to an online experiment in WeChat Channels.

Organizer: 吴宇楠

Time: March 21, 16:00~17:00

Venue: Shuangqing Complex Building, C654

Speaker: Assistant Professor 王禹皓, Institute for Interdisciplinary Information Sciences, Tsinghua University

Title: Residual permutation test for regression coefficient testing

Abstract: We consider the problem of testing whether a single coefficient is equal to zero in linear models when the dimension of covariates $p$ can be up to a constant fraction of the sample size $n$. In this regime, an important topic is to propose tests with finite-population valid size control without requiring the noise to follow strong distributional assumptions. In this paper, we propose a new method, called residual permutation test (RPT), which is constructed by projecting the regression residuals onto the space orthogonal to the union of the column spaces of the original and permuted design matrices. RPT can be proved to achieve finite-population size validity under fixed design with just exchangeable noise, whenever $p < n/2$. Moreover, RPT is shown to be asymptotically powerful for heavy-tailed noise with bounded $(1+t)$-th order moment when the true coefficient is at least of order $n^{-t/(1+t)}$ for $t \in [0,1]$. We further prove that this signal size requirement is essentially rate-optimal in the minimax sense. Numerical studies confirm that RPT performs well in a wide range of simulation settings with normal and heavy-tailed noise distributions.
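The projection step described in the abstract can be sketched numerically. Stacking the design matrix and a row-permuted copy of it gives a matrix with n rows and 2p columns, so its orthogonal complement is nontrivial only when 2p < n, which matches the condition p < n/2. The code below builds that projector for a random design; it illustrates the projection only and does not implement the RPT test statistic. The function name `complement_projector` and the simulated data are assumptions.

```python
import numpy as np


def complement_projector(X, perm):
    """Projector onto the space orthogonal to the columns of X and of X[perm]."""
    n = X.shape[0]
    stacked = np.hstack([X, X[perm]])          # n x 2p: original and row-permuted design
    U, s, _ = np.linalg.svd(stacked, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s.max()))    # numerical rank of the stacked matrix
    U = U[:, :rank]                            # orthonormal basis of the joint column space
    return np.eye(n) - U @ U.T


# Simulated example (hypothetical sizes): n = 20 observations, p = 3 covariates.
rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.standard_normal((n, p))
perm = rng.permutation(n)
Q = complement_projector(X, perm)

# Q annihilates both the original and the permuted design.
print(np.abs(Q @ X).max(), np.abs(Q @ X[perm]).max())
# For a generic design the remaining dimension is n - 2p.
print(np.trace(Q))
```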

Organizer: 吴宇楠

Time: March 14, 16:00~17:00

Venue: Shuangqing Complex Building, C654

Speaker: Prof. Fan Li, Department of Statistical Science, Duke University

Title: Interacted two-stage least squares with treatment effect heterogeneity

Abstract: Treatment effect heterogeneity with respect to covariates is common in instrumental variable (IV) analyses. An intuitive approach, which we term the interacted two-stage least squares (2SLS), is to postulate a linear working model of the outcome on the treatment, covariates, and treatment-covariate interactions, and instrument it by the IV, covariates, and IV-covariate interactions. We clarify the causal interpretation of the interacted 2SLS under the local average treatment effect (LATE) framework when the IV is valid conditional on covariates. Our contributions are threefold. First, we show that the interacted 2SLS with centered covariates is consistent for estimating the LATE if either of the following conditions holds: (i) the treatment-covariate interactions are linear in the covariates; (ii) the linear outcome model underlying the interacted 2SLS is correct. Second, we show that the coefficients of the treatment-covariate interactions from the interacted 2SLS are consistent for estimating treatment effect heterogeneity with regard to covariates among compliers if either condition (i) or condition (ii) holds. Moreover, we connect the 2SLS estimator with the reweighting perspective in Abadie (2003) and establish the necessity of condition (i) in the absence of additional assumptions on potential outcomes. Third, leveraging the consistency guarantees of the interacted 2SLS for categorical covariates, we propose a stratification strategy based on the IV propensity score to approximate the LATE and treatment effect heterogeneity with regard to the IV propensity score when neither condition (i) nor condition (ii) holds.
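The estimator described in words above can be sketched in a few lines of NumPy: regress the outcome on the treatment, centered covariates, and their interactions, using the IV, centered covariates, and IV-covariate interactions as instruments. This is a minimal sketch under assumed names (`interacted_2sls`, and the arguments `y`, `d`, `z`, `x`); it returns point estimates only and is not the authors' implementation.

```python
import numpy as np


def interacted_2sls(y, d, z, x):
    """Interacted 2SLS sketch; returns coefficients on [1, d, x_c, d * x_c]."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    xc = x - x.mean(axis=0)                                   # center the covariates
    ones = np.ones((len(y), 1))
    R = np.hstack([ones, d[:, None], xc, d[:, None] * xc])    # working-model regressors
    W = np.hstack([ones, z[:, None], xc, z[:, None] * xc])    # instruments
    # First stage: project the regressors onto the instruments.
    R_hat = W @ np.linalg.lstsq(W, R, rcond=None)[0]
    # Second stage: regress the outcome on the fitted regressors.
    return np.linalg.lstsq(R_hat, y, rcond=None)[0]
```

With `k` covariates the returned vector has length `2k + 2`; the entry at index 1 is the coefficient on the treatment, which is the quantity the abstract relates to the LATE, and the last `k` entries are the treatment-covariate interaction coefficients.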

Organizer: 吴宇楠

Time: March 14, 14:00~15:00

Venue: Shuangqing Complex Building, C548

Speaker: Professor 喻达磊, School of Mathematics and Statistics, Xi'an Jiaotong University

Title: Unified optimal model averaging with a general loss function based on cross-validation

Abstract: Studying unified model averaging estimation for situations with complicated data structures, we propose a novel model averaging method based on cross-validation (MACV). MACV unifies a large class of new and existing model averaging estimators and covers a very general class of loss functions. Furthermore, to reduce the computational burden caused by the conventional leave-subject/one-out cross validation, we propose a SEcond-order-Approximated Leave-one/subject-out (SEAL) cross validation, which largely improves the computational efficiency. As a useful tool, we extend the Bernstein-type inequality for strongly mixing random variables that are not necessarily identically distributed. In the context of non-independent and non-identically distributed random variables, we establish the unified theory for analyzing the asymptotic behaviors of the proposed MACV and SEAL methods, where the number of candidate models is allowed to diverge with sample size. To demonstrate the breadth of the proposed methodology, we exemplify four optimal model averaging estimators under four important situations, i.e., longitudinal data with discrete responses, within-cluster correlation structure modeling, conditional prediction in spatial data, and quantile regression with a potential correlation structure. We conduct extensive simulation studies and analyze real-data examples to illustrate the advantages of the proposed methods.

2024-12-16/17

Zoom Meeting ID: 271 534 5558 Passcode: YMSC
https://us06web.zoom.us/j/2715345558?pwd=eXRTTExpOVg4ODFYellsNXZVVlZvQT09

Time：December 16, 11:00~12:00

Venue：双清综合楼C548

Speaker：胡祺睿, Center for Statistical Science, Tsinghua University

Title：Simultaneous Inference for Eigensystems and FPC Scores of Functional Data

Abstract：

Functional data analysis has become a pivotal field in statistics, emphasizing data represented by functions rather than scalar values. Although significant progress has been made in estimating fundamental elements such as mean and covariance functions, simultaneous inference for eigensystems and functional principal component (FPC) scores remains challenging.
In this talk, we introduce novel methodologies for the simultaneous inference of eigensystems and the distribution of FPC scores in densely observed functional data, along with their asymptotic properties, which hold in C[0,1] and for a diverging number of estimators. We validate our approaches through simulations and apply them to electroencephalogram (EEG) data, demonstrating their practical utility in testing hypotheses related to FPCs and the distribution of FPC scores. Finally, we discuss extensions to two-dimensional functional data, functional time series, and a unified theory bridging sparse and dense functional data.

Time：December 17, 16:00~17:00

Venue：双清综合楼C548

Speaker：Chen Cheng, Department of Statistics, Stanford University

Title：Towards modern datasets: laying mathematical foundations to streamline machine learning

Abstract：

Datasets are central to the development of statistical learning theory and the evolution of models. The burgeoning success of modern machine learning in sophisticated tasks crucially relies on the vast growth of massive datasets (cf. Donoho), such as ImageNet, SuperGLUE and LAION-5B. However, this evolution breaks standard statistical learning assumptions and tools.
In this talk, I will present two stories tackling the challenges that modern datasets present, and leverage statistical theory to shed light on how we should streamline modern machine learning.
In the first part, we study multilabeling, a curious aspect of modern human-labeled datasets that is often missing from the statistical machine learning literature. We develop a stylized theoretical model to capture uncertainties in the labeling process, allowing us to understand the contrasts, limitations and possible improvements of using aggregated or non-aggregated data in a statistical learning pipeline. In the second part, I will present novel theoretical tools that are not readily available in the classical literature, such as random matrix theory under the proportional regime. Theoretical tools for the proportional regime are crucial for understanding “benign overfitting” and “memorization”. This is not always the most natural setting in statistics, where columns correspond to covariates and rows to samples. With the objective of moving beyond proportional asymptotics, we revisit ridge regression (ℓ2-penalized least squares) on i.i.d. data X ∈ R^{n×d}, y ∈ R^n. We allow the feature vector to be infinite-dimensional (d = ∞), in which case it belongs to a separable Hilbert space.
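
For readers less familiar with ridge regression, the sketch below computes the ℓ2-penalized least squares estimator in two equivalent ways. The dual form depends on the data only through the n×n Gram matrix, which is one standard reason the estimator remains well defined when d is very large. The penalty scaling (a plain λ rather than nλ), the function names, and the simulated data are assumptions for illustration and are not taken from the talk.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X^T X + lam * I_d) beta = X^T y, a d x d linear system."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_dual(X, y, lam):
    """Equivalent form beta = X^T (X X^T + lam * I_n)^{-1} y, an n x n system."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha

if __name__ == "__main__":
    # Simulated data (an assumption) with more features than samples.
    rng = np.random.default_rng(0)
    n, d = 30, 200
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    b_primal = ridge_primal(X, y, lam=0.5)
    b_dual = ridge_dual(X, y, lam=0.5)
    print("max abs difference:", np.max(np.abs(b_primal - b_dual)))
```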

Time：November 29, 16:00-17:00

Venue：双清综合楼C654

Speaker：李伟, Associate Professor, School of Statistics, Renmin University of China

Title：Discovery and inference of possibly bi-directional causal relationships with invalid instrumental variables

Abstract：Learning causal relationships between pairs of complex traits from observational studies is of great interest across various scientific domains. However, most existing methods assume the absence of unmeasured confounding and restrict causal relationships between two traits to be uni-directional, assumptions that may be violated in real-world systems. In this paper, we address the challenge of causal discovery and effect inference for two traits while accounting for unmeasured confounding and potential feedback loops. By leveraging possibly invalid instrumental variables, we provide identification conditions for causal parameters in a model that allows for bi-directional relationships, and we also establish identifiability of the causal direction under the introduced conditions. We then propose a data-driven procedure to detect the causal direction and provide inference results about causal effects along the identified direction. We show that our method consistently recovers the true direction and produces valid confidence intervals for the causal effect. We conduct extensive simulation studies to show that our proposal outperforms existing methods. Finally, we apply our method to analyze real data sets from UK Biobank.

Time：November 22, 16:00-17:00

Venue：双清综合楼C654

Speaker：孙玉莹, Associate Research Fellow, Academy of Mathematics and Systems Science, Chinese Academy of Sciences

Title：Model averaging for time-varying vector autoregressions

Abstract：This paper proposes a novel time-varying model averaging (TVMA) approach to enhancing forecast accuracy for multivariate time series subject to structural changes. The TVMA method averages predictions from a set of time-varying vector autoregressive models using optimal time-varying combination weights selected by minimizing a penalized local criterion. This allows the relative importance of different models to evolve adaptively over time in response to structural shifts. We establish the asymptotic optimality of the proposed TVMA approach in achieving the lowest possible quadratic forecast errors. We also derive the rate at which the selected time-varying weights converge to the optimal weights minimizing expected quadratic errors. Moreover, we show that when one or more correctly specified models exist, our method consistently assigns full weight to them, and that asymptotic normality of the TVMA estimators can be established under some regularity conditions. Furthermore, the proposed approach encompasses special cases including time-varying VAR models with exogenous predictors, as well as time-varying FAVAR models. Simulations and an empirical application illustrate that the proposed TVMA method outperforms some commonly used model averaging and selection methods in the presence of structural changes.
