Category: Simian

Netflix Approach to the Cloud: Simian Army

10/20/2011

Ariel Tseitlin and Yury Izrailevsky from Netflix share their approach to cloud adoption using the "Simian Army" suite of tools.
Below are the definitions of the various tools Netflix engineers created:

Chaos Monkey is a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.
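To make the idea concrete, here is a minimal Python sketch of the selection step only. It is not Netflix's implementation: the group names, instance IDs, the `pick_victims` helper, and the `probability` parameter are all hypothetical, and nothing is actually terminated.

```python
import random


def pick_victims(groups, probability, rng):
    """Pick at most one random instance per group.

    Each group is chosen with the given probability. `groups` maps a
    (hypothetical) group name to a list of instance IDs.
    """
    victims = []
    for name in sorted(groups):
        instances = groups[name]
        if instances and rng.random() < probability:
            victims.append(rng.choice(instances))
    return victims


# Hypothetical fleet, for illustration only.
groups = {
    "api": ["i-api-1", "i-api-2", "i-api-3"],
    "web": ["i-web-1", "i-web-2"],
}

rng = random.Random(42)  # seeded so a run can be replayed
victims = pick_victims(groups, probability=1.0, rng=rng)
print(victims)  # one instance from each group; a real tool would disable these
```

Passing the random generator in explicitly keeps the selection reproducible, which helps when a failure needs to be replayed.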

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures whether upstream services respond appropriately. In addition, by introducing very large delays, we can simulate the downtime of a node or even an entire service (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
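A rough sketch of the same idea in Python, under stated assumptions: `with_latency` wraps a dependency call with an artificial delay, and `call_with_fallback` stands in for an upstream caller that should degrade gracefully. Both helpers and the `get_recommendations` function are hypothetical names invented for this example; the real tool works at the REST communication layer rather than on local function calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def with_latency(func, delay_seconds):
    """Return a version of func that sleeps before running (the injected delay)."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return delayed


def call_with_fallback(func, timeout_seconds, fallback):
    """Call func in a worker thread; return fallback if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow call


def get_recommendations():
    """Hypothetical dependency that normally answers right away."""
    return ["title-1", "title-2"]


# No injected delay: the caller gets the real answer.
fast = call_with_fallback(with_latency(get_recommendations, 0.0), 1.0, [])

# A delay far beyond the timeout simulates the dependency being down.
slow = call_with_fallback(with_latency(get_recommendations, 0.5), 0.1, [])

print(fast, slow)
```

The second call shows the point made above: a large enough delay looks like an outage to the caller, while the dependency itself keeps running.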

Conformity Monkey finds instances that don’t adhere to best practices and shuts them down. For example, we know that if we find instances that don’t belong to an auto-scaling group, that’s trouble waiting to happen. We shut them down to give the service owner the opportunity to re-launch them properly.
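The auto-scaling rule could be sketched as a simple check over an instance inventory. This is an illustration only: the `Instance` record, its field names, and the sample data are assumptions, and the sketch only reports offenders instead of shutting them down.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical inventory record, not a real cloud API type."""
    instance_id: str
    auto_scaling_group: Optional[str]


def find_nonconforming(instances: List[Instance]) -> List[str]:
    """Return the IDs of instances that belong to no auto-scaling group."""
    return [i.instance_id for i in instances if not i.auto_scaling_group]


inventory = [
    Instance("i-001", "api-asg"),
    Instance("i-002", None),       # launched by hand, no group
    Instance("i-003", "web-asg"),
]

print(find_nonconforming(inventory))  # ['i-002']
```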

Doctor Monkey taps into health checks that run on each instance and also monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. Once unhealthy instances are detected, they are removed from service and, after the service owners have had time to root-cause the problem, are eventually terminated.
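The three outcomes described here (stay in service, pull from service, terminate after a grace period) can be sketched as a small decision function. The thresholds, field names, and the `triage` helper are hypothetical; the post does not say what limits or grace period Netflix uses.

```python
from dataclasses import dataclass


@dataclass
class Health:
    """Hypothetical health snapshot for one instance."""
    instance_id: str
    health_check_ok: bool
    cpu_load: float         # 0.0 to 1.0
    hours_unhealthy: float  # time since first flagged unhealthy


def triage(h: Health, cpu_limit: float = 0.9, grace_hours: float = 24.0) -> str:
    """Decide what to do with an instance.

    cpu_limit and grace_hours are assumed values for illustration.
    """
    unhealthy = (not h.health_check_ok) or h.cpu_load > cpu_limit
    if not unhealthy:
        return "in_service"
    if h.hours_unhealthy >= grace_hours:
        return "terminate"       # owners have had time to investigate
    return "out_of_service"      # keep it around for root-cause analysis


print(triage(Health("i-001", True, 0.35, 0.0)))
print(triage(Health("i-002", False, 0.20, 2.0)))
print(triage(Health("i-003", True, 0.97, 30.0)))
```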

Janitor Monkey ensures that our cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.

Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and are not coming up for renewal.

10-18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects configuration and runtime problems in instances serving customers in multiple geographic regions, using different languages and character sets.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
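As a toy model of the re-balancing being verified, the sketch below removes one zone and spreads its instance count evenly across the remaining zones. The `simulate_zone_outage` helper and the capacity numbers are hypothetical, and the zone names are used only as labels; real re-balancing is done by the cloud's auto-scaling machinery, not by arithmetic like this.

```python
def simulate_zone_outage(capacity_by_zone, failed_zone):
    """Return a new zone -> instance-count map after one zone is lost.

    The failed zone's instances are spread as evenly as possible over
    the surviving zones (any remainder goes to the first zones by name).
    """
    survivors = sorted(z for z in capacity_by_zone if z != failed_zone)
    if not survivors:
        raise ValueError("no surviving zones to absorb the load")
    displaced = capacity_by_zone.get(failed_zone, 0)
    share, extra = divmod(displaced, len(survivors))
    rebalanced = {}
    for index, zone in enumerate(survivors):
        bonus = 1 if index < extra else 0
        rebalanced[zone] = capacity_by_zone[zone] + share + bonus
    return rebalanced


# Illustrative capacity numbers only.
before = {"us-east-1a": 4, "us-east-1b": 4, "us-east-1c": 4}
after = simulate_zone_outage(before, "us-east-1a")

print(after)  # {'us-east-1b': 6, 'us-east-1c': 6}
assert sum(after.values()) == sum(before.values())  # total capacity preserved
```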

I like the approach of the Simian Army to simulate failures and keep systems healthy, responsive, and available. Two follow-on thoughts:

Is the Simian Army a suite of COTS tools, homegrown scripts, or customized COTS tools?
What are the results of testing and simulation using these tools?

It would be great to see this in a case study format or a detailed journal paper.
Entire post (Netflix) - http://techblog.netflix.com/2011/07/netflix-simian-army.html?m=1
