Tech.Rocks Summit 2023

Managing a massive scale incident

Dec 7, 2023, 4:05 PM - 4:40 PM

Main Stage

On March 8, 2023, Datadog experienced a massive global outage. In this talk, we will share what triggered the incident and why recovering from it took a massive effort. We’ll cover the lessons we learned from this event and how we ran the incident response itself, successfully coordinating more than 500 engineers over 2+ days of continual response. We’ll also explain how we built an engineering organization capable of that feat (with minimal heroism).

