Machine Learning and AI Seminar Series
About
This seminar series invites experts from across the country to Columbia to present cutting-edge research in Machine Learning and Artificial Intelligence. Running the gamut from theory to empirics, the seminar provides a single, unified space that brings together the ML/AI community at Columbia. Topics of interest include, but are not limited to, Language Models, Optimization for Deep Learning, Reinforcement and Imitation Learning, Learning Theory, Interpretability and AI Alignment, AI for Science, Probabilistic ML, and Bayesian Methods.
Hosts & Co-Sponsors: DSI Foundations of Data Science Center; Department of Statistics, Arts and Sciences; Columbia Engineering
Registration
Registration is preferred for all CUID holders. If you do not have an active CUID, registration is required by 12:00 PM on the day before the seminar. Unfortunately, we cannot guarantee entrance to Columbia’s Morningside campus if you register after that deadline. Thank you for understanding!
Please contact Erin Elliott, DSI Events and Marketing Coordinator, at ee2548@columbia.edu with any questions.
Register
Next Seminar
Date: Friday, December 12, 2025 (11:00 AM – 12:00 PM)
Location: Columbia School of Social Work, Room 311/312

Speaker: Jason Weston, Research Scientist at Facebook, New York, and Visiting Research Professor at NYU
Title: Self-Improvement of LLMs
Abstract: Classically, learning algorithms were designed to improve their performance by updating their parameters (weights), while keeping other components, such as the training data, loss function, and algorithm, fixed. We argue that fully intelligent systems will be able to self-improve across all aspects of their makeup. We describe recent methods that enable large language models (LLMs) to self-improve in various ways, increasing their performance on tasks relevant to human users. In particular, we describe methods whereby models are able to create their own training data (self-challenging), train on this data using themselves as their own reward model (self-rewarding), and train themselves to better provide their own rewards (meta-rewarding). We then discuss the future of self-improvement for AI and key challenges that remain unresolved.
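As a concrete caricature of the self-challenging / self-rewarding loop the abstract describes, here is a minimal sketch of one iteration; generate, judge, and dpo_update are hypothetical interfaces standing in for generation, LLM-as-a-judge scoring, and preference optimization, not the speaker’s implementation.

    # Minimal sketch of one self-rewarding iteration. All helper
    # methods (generate, judge, dpo_update) are hypothetical.
    def self_rewarding_step(model, prompts, n_candidates=4):
        preference_pairs = []
        for prompt in prompts:
            # Self-challenging: the model produces candidate answers
            # to its own training prompts.
            candidates = [model.generate(prompt) for _ in range(n_candidates)]
            # Self-rewarding: the same model scores each candidate,
            # acting as its own reward model (LLM-as-a-judge).
            scores = [model.judge(prompt, c) for c in candidates]
            best = candidates[max(range(n_candidates), key=scores.__getitem__)]
            worst = candidates[min(range(n_candidates), key=scores.__getitem__)]
            preference_pairs.append((prompt, best, worst))
        # Train on the self-generated preference data (e.g., with DPO),
        # so the generator and the reward model improve together.
        model.dpo_update(preference_pairs)
        return model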
Upcoming Seminar Schedule (Spring 2026)
Please save the following dates and times to attend the seminar series.
Friday, February 6 (11:00 AM – 12:00 PM)
- Location: School of Social Work, Room C03
- Speaker: Lerrel Pinto, Assistant Professor of Computer Science at NYU Courant
Friday, February 20 (11:00 AM – 12:00 PM)
Friday, March 13 (11:00 AM – 12:00 PM)
- Location: School of Social Work, Room C03
- Speaker: Danqi Chen, Associate Professor of Computer Science, Co-Leader of Princeton NLP Group, Associate Director of Princeton Language and Intelligence, Princeton University
Friday, March 27 (11:00 AM – 12:00 PM)
Friday, April 10 (11:00 AM – 12:00 PM)
- Location: School of Social Work, Room C03
- Speaker: He He, Associate Professor of Computer Science and Data Science, NYU
Archive: Speaker Abstracts
-
Title: Architectural Choices in Scientific ML: A View Through the Lens of Theory
Abstract: In deep learning, small architectural changes, such as residual connections or normalization layers, have often had outsized impact. This talk examines how similar effects arise in recent applications of deep learning to the sciences. The central theme is that the architectural changes we identify are not suggested by current benchmarks, which remain much less mature than those in the image and language domains. Instead, they become visible through the right theoretical lenses. We will showcase several vignettes spanning graph neural networks (GNNs), time-dependent partial differential equations (PDEs), and steady-state PDEs.
The first setting concerns graphs with bottlenecks or hubs: augmenting GNNs with edge-level state yields (provable) gains under constraints on depth and memory. We establish this using techniques from time–space tradeoffs in theoretical computer science, and show that neither “symmetry-only” theoretical accounts nor standard GNN benchmarks would detect this separation. The next setting concerns time-dependent PDEs, where adding an explicit memory layer via state-space models (e.g. S4) has negligible effect under full observability, but substantial impact under partial observation. This kind of phenomenon is predicted by Mori–Zwanzig theory—which also inspired the architectural change. Finally, in steady-state PDEs and operator learning, we show that Deep Equilibrium Model (DEQ)-based architectural changes have efficiency and robustness benefits. Here, the design is motivated by representation-theoretic constructions that simulate “unrolled” gradient descent in function space.
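As a toy illustration of the DEQ-style design in the last vignette: instead of unrolling a fixed number of layers, one solves for a fixed point of a learned update operator. The sketch below is a generic fixed-point solver under that reading, not the architecture from the paper.

    import torch

    def deq_solve(f, z0, max_iter=50, tol=1e-4):
        """Find z* with z* = f(z*) by naive fixed-point iteration.
        A DEQ-based neural operator treats the steady-state solution
        as such a fixed point rather than the output of a fixed-depth
        network, mimicking an "unrolled" solver in function space."""
        z = z0
        for _ in range(max_iter):
            z_next = f(z)
            if torch.norm(z_next - z) < tol * (1 + torch.norm(z)):
                return z_next
            z = z_next
        return z

    # Hypothetical usage, where f_theta(u, a) maps a solution estimate u
    # and PDE coefficients a to a refined estimate:
    # u_star = deq_solve(lambda u: f_theta(u, a), torch.zeros_like(u0))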
Based on several works including:
Rohatgi, D., Marwah, T., Lipton, Z. C., Lu, J., Moitra, A., & Risteski, A. (2024). Towards characterizing the value of edge embeddings in Graph Neural Networks. arXiv preprint arXiv:2410.09867
Ruiz, R. B., Marwah, T., Gu, A., & Risteski, A. (2024). On the benefits of memory for modeling time-dependent PDEs. arXiv preprint arXiv:2409.02313
Marwah, T., Pokle, A., Kolter, J. Z., Lipton, Z., Lu, J., & Risteski, A. (2023). Deep equilibrium based neural operators for steady-state PDEs. Advances in Neural Information Processing Systems, 36, 15716-15737
Marwah, T., Lipton, Z. C., Lu, J., & Risteski, A. (2023, July). Neural network approximations of PDEs beyond linearity: A representational perspective. In International Conference on Machine Learning (pp. 24139-24172). PMLR
Marwah, T., Lipton, Z., & Risteski, A. (2021). Parametric complexity bounds for approximating PDEs with neural networks. Advances in Neural Information Processing Systems, 34, 15044-15055
Talk Date: Friday, November 21, 2025
-
Title: Learning normalized probability models with dual score matching
Abstract: Learning probability models from data is at the heart of many learning tasks. We introduce a new framework for learning normalized energy (log probability) models, inspired by diffusion generative models. The energy model is fitted to data by two “score matching” objectives: the first constrains the gradient of the energy (the “score”, as in diffusion models), while the second constrains its *time derivative* along the diffusion. We validate the approach on both synthetic and natural image data: in particular, we show that the estimated log probabilities do not depend on the specific images used during training. Finally, we demonstrate that both image probability and local dimensionality vary significantly with image content, challenging simple interpretations of the manifold hypothesis.
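Schematically, and with notation introduced here rather than taken from the paper: for a variance-exploding diffusion \(x_t = x + \sigma(t)\,\varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, I)\) and an energy \(E_\theta(x, t) \approx -\log p_t(x)\), the first objective is the usual denoising score-matching loss,

\[
\mathcal{L}_{\mathrm{score}}(\theta) = \mathbb{E}\Big[\, \big\| \nabla_x E_\theta(x_t, t) - \varepsilon / \sigma(t) \big\|^2 \,\Big],
\]

while the second enforces the Fokker–Planck (heat-equation) identity \(\partial_t E = \tfrac{1}{2}(\sigma^2)'(t)\,\big(\Delta_x E - \|\nabla_x E\|^2\big)\) satisfied by the true normalized energy:

\[
\mathcal{L}_{\mathrm{time}}(\theta) = \mathbb{E}\Big[\, \Big( \partial_t E_\theta(x_t, t) - \tfrac{1}{2}(\sigma^2)'(t)\,\big(\Delta_x E_\theta(x_t, t) - \|\nabla_x E_\theta(x_t, t)\|^2\big) \Big)^2 \,\Big].
\]

The second term is what buys normalization: score matching alone determines \(E_\theta\) only up to an additive function of \(t\), whereas the time-derivative constraint ties the normalizers together across noise levels, where \(p_t\) approaches a known Gaussian at large noise.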
Talk Date: Friday, November 7, 2025
-
Title: Rethinking Test-Time Thinking: From Token-Level Rewards to Robust Generative Agents
Abstract: We present a unified perspective on test-time thinking as a lens for improving generative AI agents through finer-grained reward modeling, data-centric reasoning, and robust alignment. Beginning with GenARM, we introduce an inductive bias for denser, token-level reward modeling that guides generation during decoding, enabling token-level alignment without retraining. While GenARM targets reward design, ThinkLite-VL focuses on the data side of reasoning. It proposes a self-improvement framework that selects the most informative samples via MCTS-guided search, yielding stronger visual reasoning with fewer labels. Taking this a step further, MORSE-500 moves beyond selection to creation: it programmatically generates targeted, controllable multimodal data to systematically probe and stress-test models’ reasoning abilities. We then interrogate a central assumption in inference-time alignment: Does Thinking More Always Help? Our findings reveal that increased reasoning steps can degrade performance, not because the reasoning itself is better or worse, but because variance in outputs rises, challenging the naive scaling paradigm. Finally, AegisLLM applies test-time thinking in the service of security, using an agentic, multi-perspective framework to defend against jailbreaks, prompt injections, and unlearning attacks, all at inference time. Together, these works chart a path toward generative agents that are not only more capable, but also more data-efficient, introspective, and robust in real-world deployment.
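The GenARM idea can be caricatured as reweighting the base model’s next-token distribution by an autoregressive token-level reward at decode time. Below is a minimal sketch under that reading; the function name and single-call interface are illustrative, not the paper’s API.

    import torch

    def guided_next_token(base_logits, reward_logits, beta=1.0):
        """Sample the next token from p(y) proportional to
        p_base(y) * exp(r(y) / beta): adding token-level reward
        logits to the frozen base-LM logits steers decoding toward
        aligned continuations without any retraining."""
        combined = base_logits + reward_logits / beta
        probs = torch.softmax(combined, dim=-1)
        return torch.multinomial(probs, num_samples=1)

A smaller beta trusts the reward more heavily; as beta grows, the scheme recovers the base model’s unguided decoding.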
Talk Date: Friday, October 24, 2025
-
Title: Discrete Diffusion Language Models
Abstract: While diffusion generative models excel at high-quality image generation, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods on discrete data such as text or biological sequences. Our work takes steps towards closing this gap via a simple and effective framework for discrete diffusion. This framework is simple to understand—it optimizes a mixture of denoising (e.g., masking) losses—and can be seen as endowing BERT-like models with principled samplers and variational estimators of log-likelihood. Crucially, our algorithms are not constrained to generate data sequentially, and therefore have the potential to improve long-term planning, controllable generation, and sampling speed.
In the context of language modeling, our framework enables deriving masked diffusion language models (MDLMs), which achieve a new state-of-the-art among diffusion models, and approach AR quality. Combined with novel extensions of classifier-free and classifier-based guidance mechanisms, these algorithms are also significantly more controllable than AR models. Discrete diffusion extends beyond language to science, where it forms the basis of a new generation of DNA foundation models. Our largest models focus on plants and set a new state of the art in genome annotation, while also enabling effective generation. Discrete diffusion models hold the promise to advance progress in generative modeling and its applications in language understanding and scientific discovery.
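A minimal sketch of the mixture-of-masking-losses objective described above, assuming a BERT-style model that maps a corrupted sequence to per-position logits; the noise schedule and loss weighting are simplified relative to the actual MDLM derivation.

    import torch
    import torch.nn.functional as F

    def masked_diffusion_loss(model, tokens, mask_id, t):
        """One noise level of a masked-diffusion objective: mask each
        token independently with probability t, then score the model's
        reconstruction of the masked positions. Averaging a 1/t-weighted
        version of this loss over noise levels t in (0, 1] yields a
        variational bound on log-likelihood."""
        mask = torch.rand(tokens.shape) < t
        corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
        logits = model(corrupted)                       # (batch, seq, vocab)
        ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
        # Only masked positions contribute to the loss.
        return (ce * mask.float()).sum() / mask.sum().clamp(min=1) / t

Because the model predicts all masked positions in parallel, sampling need not proceed left to right, which is what opens the door to the planning, controllability, and speed benefits mentioned above.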
Talk Date: October 20, 2025
-
Title: Gradient Descent Dominates Ridge: A Statistical View on Implicit Regularization
Abstract: A key puzzle in deep learning is how simple gradient methods find generalizable solutions without explicit regularization. This talk discusses the implicit regularization of gradient descent (GD) through the lens of statistical dominance. Using least squares as a clean proxy, we present two surprising findings.
First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems — those with fast and continuously decaying covariance spectra — which includes all problems satisfying the standard capacity condition.
This is joint work with Peter Bartlett, Sham Kakade, Jason Lee, and Bin Yu.
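To make “comparable regularization” concrete: in least squares, running gradient descent on the unpenalized loss for a while acts, heuristically, like ridge with penalty inversely proportional to the training time. A small numerical sketch of that correspondence (illustrative only; the talk’s dominance results concern excess risk bounds, not coincidence of the two estimators):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 20
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

    # Ridge with explicit penalty lam.
    lam = 1.0
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # GD on the unregularized loss (1/2n)||Xw - y||^2, stopped early:
    # t steps at step size eta regularize roughly like lam = n / (eta * t).
    eta = 1e-3
    t = int(n / (eta * lam))
    w_gd = np.zeros(d)
    for _ in range(t):
        w_gd -= eta * X.T @ (X @ w_gd - y) / n

    print(np.linalg.norm(w_gd - w_ridge))  # small relative to ||w_ridge||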
Talk Date: October 6, 2025