Free

vLLM Inference Meetup · Boston

by vLLM Meetups and Events

About this event

Deep technical sessions. Live demos. Real conversations. If you're deploying or scaling LLM inference, this is the room to be in.

Join Red Hat AI, IBM, NVIDIA, The Open Accelerator, MIT, and the vLLM community in Boston for an evening of technical depth:

- Hear directly from vLLM maintainers and committers
- See live demos of real inference workflows
- Learn how to put your learnings into practice with the Open Accelerator
- Connect with the engineers and platform teams pushing the state of the art

Program

Optional Pre-Event Workshop

3:30 PM — Doors Open for Workshop Attendees
4:00–5:00 PM — Distributed Inference with llm-d: Your Production-Ready Path to Scalable LLM Inference

About the workshop: llm-d is a distributed inference orchestration layer that reduces tail latency (P95/P99) through intelligent cache-aware routing. In this hands-on workshop, participants deploy Llama 3.1 8B with vLLM, benchmark single-GPU performance, scale to multiple GPUs with naive load balancing, and then use llm-d to demonstrate how cache-aware routing significantly reduces tail latency.

Meetup Agenda

5:00–5:30 PM — Doors Open, Check-In
5:30–5:40 PM — Welcome and Opening Remarks · Saša Zelenović, Sr. Technical Marketing Manager, Red Hat AI
5:40–6:00 PM — Intro to vLLM and Project Update · Michael Goin, vLLM Maintainer and Principal Engineer, Red Hat AI
6:00–6:20 PM — Getting Started with Model Compression for Fast and Efficient Inference · Charles Hernandez, ML Engineer, Red Hat AI
6:20–6:40 PM — Accelerating LLM Inference with Speculative Decoding · Helen Zhao, ML Engineer, Red Hat AI; Fynn Schmitt-Ulms, ML Engineer, Red Hat AI
6:40–7:00 PM — Dynamic Agent & Inference Optimization with NeMo Agent Toolkit · Dhruv Nandakumar, Agent and Inference Engineering, NVIDIA
7:00–7:30 PM — Tackling Distributed Inference at Scale with llm-d and Kubernetes · Carlos Costa, Distinguished Engineer, IBM
7:30–7:40 PM — From Meetup to Hackathon: Building Together in the Open AI Accelerator · Stefanie Chiras, SVP, The Open Accelerator, Red Hat
7:40–8:00 PM — Community Discussion and Q&A
8:00–9:00 PM — Networking, Food and Drinks

Who Should Come

- vLLM users and contributors
- ML and infra engineers working on inference and serving
- Platform teams running GenAI in production
- Anyone curious about efficient inference across local, cloud, and Kubernetes

Before You Arrive

- Registration closes 24 hours before the event
- Unregistered attendees cannot be admitted
- Bring a photo ID for check-in

See you in Boston! The inference conversation starts here.
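The cache-aware routing idea behind the llm-d workshop can be sketched in miniature. This is a hypothetical toy, not llm-d's actual router: it hashes a fixed-length prompt prefix so requests sharing that prefix (e.g. a common system prompt) land on the same replica, where the KV cache for that prefix is already warm. The function name `route_request` and the parameters are illustrative assumptions.

```python
import hashlib

def route_request(prompt: str, n_replicas: int, prefix_chars: int = 32) -> int:
    """Pick a replica index by hashing the first `prefix_chars` characters
    of the prompt. Requests that share a prefix deterministically land on
    the same replica, so its KV cache for that prefix can be reused."""
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_replicas

# Two requests sharing a system-prompt prefix route to the same replica.
system = "You are a helpful assistant. Answer concisely. "
r1 = route_request(system + "Summarize this document.", n_replicas=4)
r2 = route_request(system + "Translate this to French.", n_replicas=4)
assert r1 == r2
```

A real router like llm-d also weighs load, queue depth, and actual cache state rather than hashing alone, which is how it trims P95/P99 latency without hot-spotting a single replica.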

Topics & Tags

AI
Date & time
Mar 31 – Apr 1, 2026
America/New_York
Location
314 Main St, Cambridge, MA 02142, USA
Attendance
89 going · 111 spots remaining
Type
independent
Source: Luma
Updated: Mar 24, 2026
