At Netflix, we’ve experienced an unprecedented global increase in membership over the last several years. Not only are we seeing more members globally, but those members are also consuming more Netflix. This means that production outages today have a far greater impact in much less time than they did in years past.
To continue providing great experiences for our members, we have to make sure the sophistication of our systems outpaces the growth and engagement of our membership. Concretely, our MTTD and MTTR need to decrease faster than Netflix membership and consumption increase. Our approach is to have access to highly granular, real-time operational insights into our streaming and studio systems.
However, while this level of visibility into our production systems is great, it can quickly become cost-prohibitive. It’s equally important that the systems providing these insights don’t end up costing more than our actual streaming and studio systems. To this end, we’ve built and open-sourced Mantis, a platform that makes it easy for developers to build real-time, cost-effective, operations-focused applications.
Mantis has been live in production for several years and has given us tremendous value in operating Tier-1 critical systems. It processes trillions of events and petabytes of data every day, enabling us to derive meaningful operational insights from our streaming and studio systems and ultimately reduce production impact on our members.
With Mantis, we’re able to economically ask and answer new questions about our systems in real time without having to add new instrumentation. We can answer questions like “Which members are seeing playback issues for Stranger Things, season 3, episode 1 on iPhone in Canada?” without adding heavily to our infrastructure bill.
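To make the shape of such a question concrete, here is a minimal Python sketch of an ad hoc filter over a stream of events. It is not Mantis's API or query language: the `PlaybackEvent` fields, the `ad_hoc_query` helper, and the sample data are all hypothetical, chosen only to illustrate asking a new question of events that are already being emitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class PlaybackEvent:
    # Hypothetical event shape; these field names are assumptions,
    # not the schema Mantis or Netflix actually uses.
    member_id: str
    title: str
    season: int
    episode: int
    device: str
    country: str
    error: bool


def ad_hoc_query(
    events: Iterable[PlaybackEvent],
    predicate: Callable[[PlaybackEvent], bool],
) -> Iterator[PlaybackEvent]:
    """Lazily yield only the events that match the predicate."""
    return (event for event in events if predicate(event))


def is_target_issue(event: PlaybackEvent) -> bool:
    # The example question from the text, expressed as a predicate.
    return (
        event.error
        and event.title == "Stranger Things"
        and event.season == 3
        and event.episode == 1
        and event.device == "iPhone"
        and event.country == "CA"
    )


# Made-up sample events standing in for a live stream.
sample = [
    PlaybackEvent("m1", "Stranger Things", 3, 1, "iPhone", "CA", True),
    PlaybackEvent("m2", "Stranger Things", 3, 1, "iPhone", "US", True),
    PlaybackEvent("m3", "Stranger Things", 3, 1, "iPhone", "CA", False),
    PlaybackEvent("m4", "Stranger Things", 3, 1, "iPhone", "CA", True),
]

# Collect the distinct members whose events match the question.
affected = sorted({e.member_id for e in ad_hoc_query(sample, is_target_issue)})
print(affected)
```

The point of the sketch is that the question lives entirely in the predicate, so asking a different question means swapping the predicate rather than changing what the sources emit.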
In this talk, we’ll cover more technical details about Mantis and go through some examples of how we use Mantis to operate our production systems more effectively.