Foundations of Data Science Workshop (Spring 2026)

Start: 2026-05-06T09:30:00-04:00
End: 2026-05-06T13:00:00-04:00

Wednesday, May 6, 2026
9:30 am - 1:00 pm

The Columbia University Data Science Institute’s Foundations of Data Science Center is hosting a workshop designed to foster collaboration and knowledge sharing. Through talks and posters, Columbia researchers will showcase their work across data science methods and applications.

Event Details

Wednesday, May 6 (9:30 AM – 1:00 PM ET)

Location: Columbia Engineering Innovation Hub
Address: 2276 12th Ave, New York, NY 10027 – Manhattanville

· · ─ · ─ · ·

9:30 AM – 10:00 AM: Breakfast & Check-In (30 min)

10:00 AM – 10:40 AM: Presentation: Genevera Allen, Professor of Statistics (40 min)

10:40 AM – 10:50 AM: Break (10 min)

10:50 AM – 11:30 AM: Presentation: Kaizheng Wang, Assistant Professor, IEOR (40 min)

11:30 AM – 11:40 AM: Break (10 min)

11:40 AM – 12:00 PM: Poster Spotlights (20 min)

12:00 PM – 1:00 PM: Lunch and Poster Session (60 min)

Speakers

Genevera Allen
Professor of Statistics, Columbia University

Talk Title: Inference for Clustering: Conformal Sets for Cluster Labels

Abstract: While clustering is ubiquitously used across science and industry, uncertainty in cluster assignments is rarely quantified with rigorous guarantees. We propose a novel conformal inference framework for clustering that returns confidence sets for cluster labels. The key challenge is that labels are unobserved and estimated from data, so naively using deterministic cluster labels can violate exchangeability and induce severe under-coverage. To address this, we propose split conformal clustering with stochastic labels, which samples labels from soft cluster labels, fits a soft classifier to predict these stochastic labels, and calibrates conformal scores to construct confidence sets for cluster labels at any query point. We establish a finite-sample lower bound on marginal coverage that reveals how under-coverage is controlled by two properties of the clustering algorithm: consistency of estimated soft labels and replace-one stability. Under mild conditions, we prove asymptotic coverage and verify these conditions for correctly specified parametric mixture models. Simulations for mixture models show that our method attains target coverage with informative set sizes, validating our theoretical results. Applications to clustering cell types in single-cell RNA-seq data demonstrate the practical utility and interpretability of our approach to quantifying cluster label uncertainty.

Kaizheng Wang
Assistant Professor of Industrial Engineering and Operations Research, Columbia Engineering

Talk Title: AI Personas for Human-Centered Inference and Decision-Making

Abstract: AI personas provide a flexible framework for modeling human behavior, particularly in cold-start settings with limited data on new users or scenarios. In this talk, I will present recent and ongoing work on AI-powered human digital twins from this perspective. I will begin with two decision-oriented applications: adaptive querying with persona priors, which enables efficient information acquisition under limited question budgets, and an LLM-based demand simulator for pricing, which combines rich product information with persona-level purchase probabilities for counterfactual policy evaluation. I will then discuss a calibration framework for improving simulator fidelity to human data, and conclude with a statistical framework for uncertainty quantification in LLM-based survey simulation, including valid inference under unknown distribution shift between synthetic and real data. Together, these works show how AI personas can support principled human-centered inference and decision-making.

Poster Session

Submission Deadline: Tuesday, April 28 (11:59 PM ET)
Eligibility: The Columbia University community is welcome to submit research for consideration. This includes faculty members and affiliated researchers, as well as currently enrolled undergraduate and graduate students, PhD candidates, and postdoctoral researchers.
DSI will cover the cost of printing the poster and will provide foam board and easels. Please size the poster to either 20×30 or 30×40 inches.



Samory Kpotufe

Alexis Avedisian

Erin Elliott
