On March 8, 2023, Datadog experienced a massive global outage. In this talk, we will share what triggered the incident and why recovering from it took so much effort. We’ll cover the lessons we learned from this event, how we ran the incident response itself, successfully coordinating more than 500 engineers over 2+ days of continual response, and how we built an engineering organization capable of that feat (with minimal heroism).