Empowering Production Support Teams: Training, Reflection, and Cultivating Culture for Success
Gerry Palaganas
Nov 15, 2023 7:15:00 AM
In the world of Managed Services and Site Reliability Engineering (MS/SRE), the significance of comprehensive training in handling production incidents cannot be overstated. Even the best-engineered solutions will experience outages: as systems grow in scale and complexity, failures will happen. These situations can be critical challenges for a company, potentially leading to significant revenue loss or reputational damage. The root cause is often something completely unanticipated, never encountered before, non-standard, or a complex convergence of issues that takes root cause analysis to untangle. The decisions SRE teams make in these high-pressure, high-stakes moments, under urgent timelines, are vital. So, the big questions are:
- How do you train an MS/SRE team to make good decisions when in the pilot’s seat during a large incident?
- Experience goes a long way. How do you ramp up new team members?
- How do you maintain a good MS/SRE culture and mindset to address these situations?
Here are three key “habits” that are often overlooked in training production support teams: