Hosted as part of the Machine Learning and AI Seminar Series in partnership with the DSI Foundations of Data Science Center; the Department of Statistics, Arts and Sciences; and Columbia Engineering


Speaker

Eric Wong, Assistant Professor, Computer and Information Science, University of Pennsylvania


Event Details

Friday, March 27, 2026 (11:00 AM – 12:00 PM ET)

Location: Hamilton Hall, Room 702

REGISTRATION DEADLINE: The Columbia Morningside campus is open to the Columbia community. If you do not have an active CUID, you must register by 12:00 PM (ET) on the day before the event.

Register


Talk Information

A Mechanistic Theory of Safety: How Jailbreaking One-Layer Transformers Taught Us How to Steer LLMs

Abstract: Why are LLM guardrails so fundamentally easy to break, and how can we enforce them? This talk formalizes a mechanistic theory for studying safety problems. We begin with one-layer transformers, identifying rule-breaking as an inherent architectural vulnerability in the model’s attention mechanism. This framework (LogicBreaks) taught us a critical lesson: if attention is the key to breaking rules, it may also be the key to enforcing them.

Building on this insight, we extend the mechanistic theory to analyze attention-based interventions, arriving at InstaBoost: a remarkably simple yet highly effective steering method that boosts the model’s attention on user-provided instructions during generation. This technique, developed from analysis of one-layer transformers, provides state-of-the-art control over large-scale LLMs with just five lines of code.
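The abstract does not spell out the intervention, but the core idea of boosting attention on instruction tokens can be sketched in a few lines. The snippet below is an illustrative assumption, not the published InstaBoost implementation: it upweights the post-softmax attention mass on instruction-token positions by a factor `alpha` and renormalizes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def boosted_attention(scores, instruction_mask, alpha=5.0):
    """Illustrative attention-boosting intervention (not the published method).

    scores: (num_queries, num_keys) raw attention logits.
    instruction_mask: (num_keys,) boolean, True at instruction-token positions.
    alpha: multiplicative boost applied to attention on instruction tokens.
    """
    weights = softmax(scores, axis=-1)
    # Upweight attention on instruction tokens, then renormalize each row.
    boosted = weights * np.where(instruction_mask, alpha, 1.0)
    return boosted / boosted.sum(axis=-1, keepdims=True)

# With uniform logits, half the keys marked as instructions, and alpha=5,
# the instruction tokens' share of attention rises from 0.5 to ~0.83.
attn = boosted_attention(np.zeros((1, 4)), np.array([True, True, False, False]))
```

In a real transformer this reweighting would be applied inside each attention layer during generation; the toy version above only shows the arithmetic of the boost on a single attention map.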