Abstract:
When aiming to identify differential genomic outcomes such as gene expression or protein abundance, thousands of simultaneous hypothesis tests are routinely performed. These tests can be biased by the presence of unmeasured confounders and missing data. Recent advances in scRNA-Seq and CRISPR technologies have allowed for the study of case vs. control and the characterization of experimental perturbations at single-cell resolution, further exacerbating these challenges. We develop a large-scale hypothesis testing solution for multivariate generalized linear models in the presence of confounding effects. Next, realizing that a number of advantages can be accrued by taking a causal inference approach, we expand this solution by exploring doubly robust and proximal inference options as well.
As genomic studies progress from studying transcriptomic to proteomic readouts, new challenges have arisen, most notably large numbers of missing values. A common strategy to address this issue is to rely on an imputed dataset, which often introduces systematic bias into downstream analyses. By contrast, we develop a statistical framework inspired by doubly robust estimators that offers valid and efficient inference for proteomic data. Our framework relies on powerful machine learning tools, such as variational autoencoders, to augment the imputation quality with high-dimensional peptide data.
Bio:
Kathryn Roeder is the UPMC Professor of Statistics and Life Sciences in the Departments of Statistics & Data Science and Computational Biology. She earned her Ph.D. in statistics at Pennsylvania State University, after which she was on the faculty at Yale University for the six years before coming to Carnegie Mellon University in 1994. In 1997 she received the COPSS Presidents' Award for the outstanding statistician under age 40. In 2020 she was awarded the COPSS Distinguished Achievement Award and Lectureship. In 2019 she was inducted into the National Academy of Sciences. Her research group develops statistical tools applied to genetic and genomic data to understand the workings of the human brain, and the interplay with genetic variation. These methods rely on various statistical and machine learning methods, causal inference, latent space embedding, sparse PCA and high dimensional nonparametric techniques.
Video:http://archive.ymsc.tsinghua.edu.cn/pacm_lecture?html=Tackling_genomic_testing_in_the_presence_of_unmeasured_confounding_and_missing_data.html