by Florian Dambrine

This is the first post in a series of three on GumSmash, an in-house, event-driven auto-remediation engine based on Ansible that we developed at GumGum. In this post, I am going to explain what an auto-remediation system is and why we chose to implement our own solution. In the second part, I will introduce you to the GumSmash architecture.

The Platform team at GumGum is in charge of operating AWS datacenters in multiple regions and provides other operational services to different engineering teams. A couple of months ago, our team decided to set up a system that would allow us to run auto-remediations.

Let’s first define what an auto-remediation is:

“Auto remediation is an approach to automation that responds to events with automations able to fix, or remediate, underlying conditions. […]”

— Evan Powell, StackStorm

With this definition in mind, we can identify the different components that an auto-remediation system should provide:

We decided to implement our own auto-remediation tool, named GumSmash, instead of adopting a popular open-source project like StackStorm, which we found too sophisticated for our use cases. Just as Netflix went with a hybrid solution for Winston (a wrapper on top of StackStorm), we decided to use the open-source Ansible project as our underlying engine to execute playbooks and connect to hosts.

The main reason for choosing Ansible is that we have been using it for the past four years, not only as a configuration management tool but also for solving complex orchestration problems. Moreover, Ansible is written in Python and is easily extensible thanks to its plugin architecture, which makes it easy for us to turn it into an auto-remediation engine. Finally, it is an easy-to-learn tool for developers who would like to contribute to the core features in the future.

Before digging into how we built this project, let’s make sure we have all the components required to build an auto-remediation system:

In addition to these prerequisites, we had to think about a way to get data out of such a system. How many auto-remediations of each kind run every day? How many of them succeeded or failed? At what time did this automation run on this specific host? When such a system goes live, you are probably not going to run thousands of auto-remediations every day (at least at our current scale)! That is why it is enough to simply report this data into one of the popular log analysis tools such as Kibana, Splunk, or Sumo Logic.

We had to think about safeguards as well. How do you stop triggering auto-remediations in case of an outage? How do you deal with auto-remediation concurrency? How do you prevent multiple auto-remediation processes from picking up the exact same problem? In the case of a bad version being deployed, how do you prevent automations from running and making things worse?

Moreover, as a DevOps/Platform team, our main goal is to provide the best experience to our engineering teams. An auto-remediation system is great, but it is even more powerful when developers can write their own remediations! We wanted to give our developers the ability to write their own automations without having to learn Ansible. You are a Java developer and you want to write something in Groovy to remediate your problem? Fine, just write your code in a template file and GumSmash will take care of the heavy lifting: deploy the file, execute it, gather the result of the execution, and run additional actions from a pool of pre-built actions (such as restarting a service, waiting for the service to come up, putting the instance back in service, sending notifications, …).

We will have a chance to dig into those aspects in upcoming blog posts. For now, let me show you the overall architecture of GumSmash.

As mentioned above, clients interact with the auto-remediation engine by sending messages to SQS. Messages sent to GumSmash are JSON formatted and contain the following fields:

  • instance_id: The AWS instance ID of the host
  • private_ip: The IP address of the host
  • inventory: The AWS region the host is running in (for example us-east-1)
  • source: A namespace defining the pattern of the JSON message. This will provide useful information to the engine on how the message should be decoded and processed.
  • playbook: The name of the playbook that should be triggered by the auto-remediation engine to remediate the issue on the host.

Here is an example of a JSON message sent from host i-123456789 running in us-east-1 that is having trouble starting Tomcat:
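
(The private IP, source namespace, and playbook name below are illustrative placeholders rather than actual GumSmash values.)

    {
        "instance_id": "i-123456789",
        "private_ip": "10.0.0.12",
        "inventory": "us-east-1",
        "source": "icinga",
        "playbook": "tomcat-startup"
    }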

Thanks to SQS and JSON messages, it becomes really easy to interact with GumSmash. Any programming language can leverage the AWS SDK to interface with the auto-remediation engine, and monitoring servers like Icinga can post messages using a simple bash script wrapping an AWS CLI command.
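
For illustration, here is a minimal producer-side sketch using the Python SDK (boto3); the queue name and field values are assumptions made for this example, not details taken from GumSmash:

    import json
    import boto3

    # Hypothetical queue name; the actual GumSmash queue is not named in this post.
    QUEUE_NAME = "gumsmash-remediations"

    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.get_queue_url(QueueName=QUEUE_NAME)["QueueUrl"]

    # Same illustrative payload as the example above.
    payload = {
        "instance_id": "i-123456789",
        "private_ip": "10.0.0.12",
        "inventory": "us-east-1",
        "source": "icinga",
        "playbook": "tomcat-startup",
    }

    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))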

On the other side, the GumSmash engine continuously polls the queue for new messages. When a message is picked up from the queue, GumSmash parses the payload as JSON and looks at the source field. Based on the value of this field, it knows which other fields to expect and reads them. The source field makes GumSmash pluggable! If you want to implement another type of message parsing, fine, just do it!
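
A rough consumer-side sketch of what such a polling and dispatching loop could look like; the queue name and the parser registry are assumptions for the example, not the actual GumSmash implementation:

    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    # Hypothetical queue name, matching the producer-side sketch above.
    queue_url = sqs.get_queue_url(QueueName="gumsmash-remediations")["QueueUrl"]

    # Hypothetical registry mapping a "source" namespace to a decoding function;
    # each parser knows which fields to expect in the payload and how to read them.
    PARSERS = {
        "icinga": lambda body: body,
    }

    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            body = json.loads(message["Body"])
            remediation = PARSERS[body["source"]](body)  # decode based on the source field
            # ... build and run the ansible-playbook command (see the sketch below) ...
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])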

GumSmash then generates the ansible-playbook command to run based on several fields extracted from the message:
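
As a hedged sketch (the exact flags GumSmash passes are an assumption), the inventory, private_ip, and playbook fields might translate into a command along these lines:

    import subprocess

    def build_command(msg):
        # Illustrative mapping only: inventory selects the per-region inventory,
        # private_ip limits the run to the affected host, and playbook names the
        # remediation playbook to execute.
        return [
            "ansible-playbook",
            "-i", msg["inventory"],
            "--limit", msg["private_ip"],
            msg["playbook"] + ".yml",
        ]

    msg = {"inventory": "us-east-1", "private_ip": "10.0.0.12", "playbook": "tomcat-startup"}
    subprocess.run(build_command(msg), check=True)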

In the same way Ansible implements the concept of plugins, GumSmash implements role plugins. A role plugin is like a classic Ansible role that fulfills a specific goal. For example, such an auto-remediation system should be able to send notifications. Some developers would prefer sending simple emails, while others may want to trigger PagerDuty alerts or notify a HipChat room or a Slack channel.

Those different types of notifications are implemented by notify-* Ansible role plugins in GumSmash (notify-email, notify-pagerduty, notify-hipchat, notify-slack). So far, we have built three types of role plugins:

  • action-* : A common set of actions taken on a remote host (register/deregister with/from an ELB, wait for an event to happen on a port or in a file, do a monitored instance reboot)
  • collect-* : Ansible roles that collect information on a remote host (thread dumps, log files, system metrics)
  • notify-* : Ansible roles that send a notification of any kind
The diagram below summarizes an auto-remediation workflow:

To conclude, this seemingly simple architecture has done wonders at GumGum, especially for our Ad Server engineering team! It takes care of fixing bad server startups due to timeouts or download failures. During a deployment on 500+ servers, it is very common to have such failures happen occasionally. GumSmash has been able to eliminate time-consuming manual intervention in such cases.

I would like to end this post with a word of caution: auto-remediation can definitely save a lot of time and manual work, but if you are auto-remediating your systems too often, it may hide an underlying issue that requires a permanent fix. That is why it is important to keep an eye on such a system.
