Name: Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs
Start: 2026-04-02T22:00:00+00:00
End: 2026-04-03T01:00:00+00:00
Location: 30 Adelaide St E, Toronto, ON M5C 3G8, Canada

Jackson Kaunismaa presents his new paper “Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs”. He will discuss why output-level safeguards on frontier models don’t actually make the ecosystem safe, and how anyone with an open-source model can fine-tune it on adjacent-domain outputs from safeguarded models to recover a large fraction of the capability gap between open-source and frontier models on harmful tasks.

Event Schedule
6:00 to 6:30 - Food and introductions
6:30 to 7:30 - Presentation and Q&A
7:30 to 9:00 - Open Discussions

If you can't attend in person, join our live stream starting at 6:30 pm via this link.

This is part of our weekly AI Safety Thursdays series. Join us in examining questions like: How do we ensure AI systems are aligned with human interests? How do we measure and mitigate potential risks from advanced AI systems? What does safer AI development look like?
