Friday, March 27, 2026, 11:00 am - 12:00 pm
Eric Wong, Assistant Professor, Computer and Information Science, University of Pennsylvania
Location: Hamilton Hall, Room 702
REGISTRATION DEADLINE: The Columbia Morningside campus is open to the Columbia community. If you do not have an active CUID, the deadline to register is 12:00 PM on the day before the event.
Register
Abstract: Why are LLM guardrails so easily broken, and how can we enforce them? This talk formalizes a mechanistic theory for studying safety problems. We begin with one-layer transformers, identifying rule-breaking as an inherent architectural vulnerability in the model’s attention mechanism. This mechanistic framework (LogicBreaks) yields a critical lesson: if attention is the key to breaking rules, it may also be the key to enforcing them.
Building on this insight, we extend the mechanistic theory to analyze attention-based interventions, arriving at InstaBoost: a remarkably simple yet highly effective steering method that boosts the model’s attention on user-provided instructions during generation. Although developed from the analysis of one-layer transformers, this technique provides state-of-the-art control over large-scale LLMs with just a few lines of code.
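The sketch below illustrates the general idea of boosting attention on instruction tokens; it is not the speaker's released InstaBoost implementation, and the function name, boost factor, and toy dimensions are illustrative assumptions.

```python
# Illustrative sketch only: upweight attention toward instruction tokens,
# then renormalize. Names and the boost value are assumptions, not the
# authors' code.
import torch

def boost_instruction_attention(attn_scores: torch.Tensor,
                                instruction_positions: torch.Tensor,
                                boost: float = 5.0) -> torch.Tensor:
    """Add log(boost) to pre-softmax attention scores at the instruction
    token positions and renormalize. This is equivalent to multiplying the
    post-softmax weights at those positions by `boost` and rescaling."""
    scores = attn_scores.clone()
    scores[..., instruction_positions] += torch.log(torch.tensor(boost))
    return torch.softmax(scores, dim=-1)

# Toy example: one query attending over 6 key positions, where the first
# 3 positions hold the user-provided instruction.
scores = torch.randn(1, 6)                       # pre-softmax scores
instruction_positions = torch.tensor([0, 1, 2])  # instruction token indices
weights = boost_instruction_attention(scores, instruction_positions)
print(weights)  # attention mass shifted toward the instruction tokens
```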