About

This seminar series invites experts from across the country to come to Columbia and present the latest cutting-edge research in Machine Learning and Artificial Intelligence. Running the gamut from theory to empirics, the seminar provides a single, unified space to bring together the ML/AI community at Columbia. Topics of interest include, but are not limited to, Language Models, Optimization for Deep Learning, Reinforcement and Imitation Learning, Learning Theory, Interpretability and AI Alignment, AI for Science, Probabilistic ML, and Bayesian methods.

Hosts & Co-Sponsors: DSI Foundations of Data Science Center; Department of Statistics; Arts and Sciences; Columbia Engineering

Registration

Registration is preferred for all CUID holders. If you do not have an active CUID, registration is required and must be completed by 12:00 PM the day before the seminar. Unfortunately, we cannot guarantee entrance to Columbia’s Morningside campus for registrations received after that deadline. Thank you for understanding!

Please contact Erin Elliott, DSI Events and Marketing Coordinator, at ee2548@columbia.edu with any questions.

Next Seminar

Date: Friday, March 10, 2026 (11:00 AM – 12:00 PM)

Location: Hamilton Hall, Room 702

Greg Durrett

Greg Durrett, Associate Professor, Computer Science Department and Center for Data Science, NYU Courant

Title: LLM Reasoning Beyond Scaling

Abstract: Agentic large language models can write and debug complex code, solve competition-level math problems, and conduct in-depth literature reviews. These reasoning capabilities are enabled by scaling of data: pre-training data to learn vast knowledge, fine-tuning data to learn natural language reasoning, and RL environments to refine that reasoning. In this talk, I will investigate the current LLM reasoning paradigm, its boundaries, and the future of LLM reasoning beyond scaling. First, I will describe the state of reasoning models and where I think scaling will lead to additional successes. I will then shift to issues that are not resolved by pure scaling. I will begin with our work on calibrating models’ decisions through better understanding of their environments. We find that explicitly telling an LLM its likelihood of succeeding or failing at tasks allows it to reason about cost-benefit tradeoffs in its action space. Then, I will describe our new benchmark CREATE, which tests LLMs’ capabilities for associative creativity. I will highlight limitations of LLMs applied to creative tasks like scientific ideation and where I see future work making progress in these areas.

Register

Upcoming Seminar Schedule (Spring 2026)

Please save the dates and times below to attend the seminar series.

Friday, April 17 (11:00 AM – 12:00 PM)

  • Location: Hamilton Hall, Room 702
  • Speaker: Danqi Chen, Associate Professor of Computer Science, Co-Leader of Princeton NLP Group, Associate Director of Princeton Language and Intelligence, Princeton University
  • Register