The Columbia University Data Science Institute’s Foundations of Data Science Center is hosting a workshop designed to foster collaboration and knowledge sharing. Through talks and posters, Columbia researchers will showcase their work across data science methods and applications.


Event Details

Wednesday, May 6 (9:30 AM – 1:00 PM ET)

Location: Columbia Engineering Innovation Hub
Address: 2276 12th Ave, New York, NY 10027 – Manhattanville

9:30 AM – 10:00 AM: Breakfast & Check-In (30 min)

10:00 AM – 10:40 AM: Presentation: Genevera Allen, Professor of Statistics (40 min)

10:40 AM – 10:50 AM: Break (10 min)

10:50 AM – 11:30 AM: Presentation: Kaizheng Wang, Assistant Professor, IEOR (40 min)

11:30 AM – 11:40 AM: Break (10 min)

11:40 AM – 12:00 PM: Poster Spotlights (20 min)

12:00 PM – 1:00 PM: Lunch and Poster Session (60 min)


Speakers

Genevera Allen
Professor of Statistics, Columbia University

Talk Title: Inference for Clustering: Conformal Sets for Cluster Labels

Abstract: While clustering is ubiquitously used across science and industry, uncertainty in cluster assignments is rarely quantified with rigorous guarantees. We propose a novel conformal inference framework for clustering that returns confidence sets for cluster labels. The key challenge is that labels are unobserved and estimated from data, so naively using deterministic cluster labels can violate exchangeability and induce severe under-coverage. To address this, we propose split conformal clustering with stochastic labels, which samples labels from soft cluster labels, fits a soft classifier to predict these stochastic labels, and calibrates conformal scores to construct confidence sets for cluster labels at any query point. We establish a finite-sample lower bound on marginal coverage that reveals how under-coverage is controlled by two properties of the clustering algorithm: consistency of estimated soft labels and replace-one stability. Under mild conditions, we prove asymptotic coverage and verify these conditions for correctly specified parametric mixture models. Simulations for mixture models show that our method attains target coverage with informative set sizes, validating our theoretical results. Applications to clustering cell types in single-cell RNA-seq data demonstrate the practical utility and interpretability of our approach to quantifying cluster label uncertainty.

Kaizheng Wang
Assistant Professor of Industrial Engineering and Operations Research, Columbia Engineering

Talk Title: AI Personas for Human-Centered Inference and Decision-Making

Abstract: AI personas provide a flexible framework for modeling human behavior, particularly in cold-start settings with limited data on new users or scenarios. In this talk, I will present recent and ongoing work on AI-powered human digital twins from this perspective. I will begin with two decision-oriented applications: adaptive querying with persona priors, which enables efficient information acquisition under limited question budgets, and an LLM-based demand simulator for pricing, which combines rich product information with persona-level purchase probabilities for counterfactual policy evaluation. I will then discuss a calibration framework for improving simulator fidelity to human data, and conclude with a statistical framework for uncertainty quantification in LLM-based survey simulation, including valid inference under unknown distribution shift between synthetic and real data. Together, these works show how AI personas can support principled human-centered inference and decision-making.


List of Exhibitors & Poster Numbers

P01: A Pipeline for Enabling Path-Specific Causal Fairness in Observational Health Data

  • Faculty Advisor: Shalmali Joshi, Assistant Professor, Columbia University Department of Biomedical Informatics
  • Aparajita Kashyap, PhD Student, Vagelos Institute for Basic Biomedical Science
  • Sara Matijevic, PhD Student, University of Oxford Nuffield Department of Women’s and Reproductive Health
  • Noémie Elhadad, Associate Professor, Columbia University Department of Biomedical Informatics
  • Steven Kushner, Professor, Columbia University Department of Psychiatry

P02: Imperfect Influence, Preserved Rankings: A Theory of TRAK for Data Attribution

  • Faculty Advisor: Arian Maleki, Associate Professor of Statistics, Department of Statistics, GSAS
  • Han Tong, PhD Student, Department of Statistics, GSAS
  • Shubhangi Ghosh, PhD Student, Department of Statistics, GSAS
  • Haolin Zou, PhD, Department of Statistics, GSAS

P03: Model-Free Assessment of Simulator Fidelity via Quantile Curves

  • Faculty Advisor: Garud Iyengar, Professor, IEOR; Avanessians Director of the Data Science Institute
  • Faculty Advisor: Kaizheng Wang, Assistant Professor, IEOR
  • Yu-Shiou Willy Lin, PhD Student, IEOR

P04: TabImpute: Universal Zero-Shot Imputation for Tabular Data

  • Faculty Advisor: Anish Agarwal, Assistant Professor, IEOR
  • Jacob Feitelberg, PhD Student, IEOR
  • Dwaipayan Saha, PhD Student, IEOR
  • Kyuseong Choi, PhD Student, Statistics, Cornell University
  • Raaz Dwivedi, Assistant Professor, ORIE, Cornell University

P05: On the Gradient Domination of LQG Problem

  • Faculty Advisor: James Anderson, Associate Professor, Department of Electrical Engineering
  • Kasra Fallah, PhD Student, Electrical Engineering
  • Leonardo Toso, PhD Student, Electrical Engineering

P06: Pseudo-Labeling for Unsupervised Domain Adaptation with Kernel GLMs

  • Faculty Advisor: Kaizheng Wang, Assistant Professor, IEOR
  • Nathan Weill, PhD Student, IEOR

P07: SpeechScreen: Generalizability of Cognitive Impairment Detection Across Languages

  • Faculty Advisor: Maryam Zolnoori, Assistant Professor of Health Sciences Research (in Nursing), School of Nursing
  • Chih-Yuan Chang, Research Assistant, Data Science Institute (MSDS student)
  • Fatemeh Taherinezhad, Research Assistant, School of Nursing

P08: Explainable Asset Allocation and Portfolio Construction

  • Faculty Advisor: Ali Hirsa, Professor of Professional Practice, IEOR
  • Miao Wang, Postdoctoral Research Scientist, Mortimer B. Zuckerman Mind Brain Behavior Institute
  • Federico Klinkert, MSFE

P09: Buckle Up! Seatbelt: An Open-Source Python Library for Responsible AI Auditing of Generative AI Models

  • Michelle Lee, MPH, MS Candidate, Department of Data Science

P10: Unsupervised Generative Framework for Event-Driven Financial Time Series

  • Mihir Agarwal, MS, Data Science
  • Aaditya Barve, MS, Computer Science

P11: Generalizing Risk Parity via Optimal Risk Budgeting with Target Returns: Exact and Tight Approximation Algorithms

  • Faculty Advisor: Ali Hirsa, Professor of Professional Practice, IEOR
  • Viraat Singh, PhD Student, IEOR

P12: Evaluating LLM-Persona Generated Distributions for Decision-Making

  • Faculty Advisor: Will Ma, Roderick H. Cushman Associate Professor of Business, Decision, Risk, and Operations Division, Columbia Business School
  • Yunhan Chen, Undergraduate Student, Computer Science
  • Jackie Baek, New York University, Stern School of Business
  • Ziyu Chi, New York University, Stern School of Business

P13: ICYM²I: The Illusion of Multimodal Informativeness Under Missingness

  • Faculty Advisor: Shalmali Joshi, Assistant Professor, Columbia University Department of Biomedical Informatics
  • Young Sang Choi, PhD Student, Biomedical Informatics, GSAS
  • Vincent Jeanselme, Former Postdoctoral Researcher, Biomedical Informatics
  • Pierre Elias, Assistant Professor, Biomedical Informatics and Cardiology

P14: SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation

  • Faculty Advisor: Assaf Zeevi, Kravis Professor of Business, Decision, Risk, and Operations Division, Columbia Business School
  • Faculty Advisor: Kaizheng Wang, Assistant Professor, IEOR
  • Yuhang Wu, PhD Student, Decision, Risk, and Operations, Columbia Business School
  • Grace Jiarui Fan, PhD Student, Finance Division, Columbia Business School
  • Chengpiao Huang, PhD Student, Department of IEOR, Columbia University
  • Tianyi Peng, Assistant Professor, Decision, Risk, and Operations Division, Columbia Business School

P15: GOPO: Policy Optimization Using Ranked Rewards

  • Faculty Advisor: Anish Agarwal, Assistant Professor, IEOR
  • Dwaipayan Saha, PhD Student, IEOR
  • Kyuseong Choi, PhD Student, Statistics, Cornell University
  • Woojeong Kim, PhD Student, Computer Science, Cornell University
  • Raaz Dwivedi, Assistant Professor, ORIE, Cornell University

P16: Learning Entity-Conditioned Topic Representations in Embedding Space

  • Kriste Krstovski, Associate Research Scientist, Data Science Institute
  • Aadhi Aravind, MS Student, Computer Science Department