Friday, June 14, 2013

evolution of appops

goal of monitoring: no news is good news and...

This post started out in 2011 as a summary of the operations systems that helped run admob successfully as a high-availability service. I was part of the team that created and ran these systems, so I had a good view of them, and a part to play in setting the direction of their evolution.

how did we get there?

We started simple: a health-checking system that could be trusted to throw an alert at the first sign of trouble. This meant every check would go off during a widespread issue, so at the outset we decoupled the alerting system from the checking system. This allowed us to silence alerts en masse without suppressing the underlying checks, which helped build confidence in the monitoring system.
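To make the separation concrete, here is a minimal sketch of the idea in python (hypothetical, not the actual admob code): checks always run and record their results, and a separate alerting pass consults a silence list before paging.

    # Hypothetical sketch: checks always run; silencing only affects paging.
    SILENCED = {"web01"}  # hosts silenced en masse during a known issue

    def check_health(host):
        # stand-in for the real ping/port/log checks
        return False if host == "web01" else True

    def page_on_failures(results):
        for host, ok in results:
            if not ok and host not in SILENCED:
                print(f"PAGE: health check failing on {host}")

    results = [(h, check_health(h)) for h in ["web01", "db01"]]
    page_on_failures(results)  # web01 is silenced: recorded but not paged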
Starting from a trustworthy no-news-is-good-news baseline, we added a time-series database (rrd), moved to self-service monitoring, and then to event correlation. The job's not done, but the foundation is now there to empower developers in the production ecosystem and to give us all a common language built on metrics such as latency, utilization, wait times and timeouts.
What follows is a sketch of each of the discrete steps that formed this foundation.
Simple lessons that endured along the way: don't optimize prematurely; build for the near future, not a super-distant imagined one.

V1: Central Monitoring and Alerting systems

A central server checks the health of all servers and services required to run the network.
key elements
  • ping check - reachable node
  • process existence check
  • logs check
  • port check
  • black box test
pros
  1. simple to implement
  2. a central place to control access to checks and checked nodes
  3. easy to snapshot and recover
  4. checks the health of the network, albeit from a single point

cons
  • limited by polling frequency - e.g. at 300 services checked one per second, a full poll cycle takes five minutes
  • single point of failure, though this can be mitigated by watching from a third-party location
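To make V1 concrete, here is a minimal sketch of such a central poller in python, covering the ping and port checks from the list above (host names are hypothetical; the real system had more check types):

    import socket
    import subprocess

    HOSTS = [("web01.example.com", 80), ("db01.example.com", 3306)]

    def ping_check(host):
        # reachable node: one ICMP echo with a 1-second timeout (linux ping flags)
        return subprocess.call(["ping", "-c", "1", "-W", "1", host],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    def port_check(host, port, timeout=1.0):
        # service liveness: can we open a TCP connection?
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for host, port in HOSTS:
        if not ping_check(host):
            print(f"ALERT: ping failed for {host}")
        elif not port_check(host, port):
            print(f"ALERT: port {port} closed on {host}")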

V2: Historical data

As the product became more successful, the number of servers grew substantially (from 20 to 200) and, by design, so did the network distance between them. With that came a need to optimize resource utilization.
The questions to answer: What was the behavior last week? Has there been an improvement with this change?

We worked to build a time-series database of system and application events.
At this time, quantile latency metrics entered admob common parlance.
This meant that for every latency and utilization metric, the median, 95th and 99th percentiles, and the variance of the time-series distribution were being extracted.
Knowing these numbers made for focused resource planning: the key benefit was that we could question performance expectations in situations where one resource was getting starved well ahead of the others.
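For instance, the roll-up for a single metric reduces to something like the following (a sketch using numpy; the post's actual pipeline was rrd-based):

    import numpy as np

    latencies_ms = [12.0, 15.2, 11.8, 340.0, 14.9, 13.1, 18.4, 95.5]

    summary = {
        "median": float(np.percentile(latencies_ms, 50)),
        "p95": float(np.percentile(latencies_ms, 95)),
        "p99": float(np.percentile(latencies_ms, 99)),
        "variance": float(np.var(latencies_ms)),
    }
    print(summary)  # the long tail (340ms) pulls p99 far away from the median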
This helped drive a numbers-driven canary process for all new releases: application servers and OS packages could be subjected to the same measuring yardstick, along the lines of the sketch below.
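A yardstick check of that kind boils down to a comparison like this (the 10% threshold is a hypothetical example; the real process looked at many more metrics):

    import numpy as np

    def p95(samples):
        return float(np.percentile(samples, 95))

    baseline = [12.0, 13.5, 14.1, 15.0, 12.8]  # current fleet latencies (ms)
    canary = [12.2, 13.9, 14.3, 19.5, 13.0]    # new release on canary hosts

    # hold the release if the canary's p95 regresses more than 10%
    if p95(canary) > 1.10 * p95(baseline):
        print("canary regression: hold the release")
    else:
        print("canary within bounds: proceed")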
These measurements also let us focus our time on seeking out better ethernet drivers to support higher packet rates, and meant we could drop systems we didn't need, such as software RAID on webservers.

V3: Decentralize checks

Passive checks: in this version we moved the checks pertaining to processes (existence, health and logs) to the local machine. The central checks were kept in place for aggregate services such as load balancers, cross-correlation rsyslog, database replication and such.

This resulted in 2 wins:
  1. it reduced the traffic between the central server and the leaf nodes, which removed a scaling bottleneck in the monitoring service
  2. it allowed us to use the same template for cross-datacenter monitoring
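A local passive check might look roughly like this (a sketch; the process name and log path are hypothetical, and the upstream transport is elided):

    import subprocess

    def process_running(name):
        # process existence check via pgrep (linux)
        return subprocess.call(["pgrep", "-x", name],
                               stdout=subprocess.DEVNULL) == 0

    def recent_log_errors(path, needle=b"ERROR"):
        # crude log check: any ERROR lines in the last 4 KiB of the log?
        try:
            with open(path, "rb") as f:
                f.seek(0, 2)                     # jump to end of file
                f.seek(max(0, f.tell() - 4096))  # back up 4 KiB
                return needle in f.read()
        except OSError:
            return True  # an unreadable log is itself a failure

    status = {
        "appserver_up": process_running("appserver"),
        "log_clean": not recent_log_errors("/var/log/appserver.log"),
    }
    print(status)  # one summary pushed upstream instead of N remote polls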

V4: High-frequency events

The time-series database was becoming a bottleneck for how much data we could collect. At this point we started looking for a solution that could handle higher sampling rates; until then the sampling interval had been five minutes. The new pipeline had two parts:

  1. rabbitmq for rapid event collection
  2. a separate set of consumers to process and store events
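A sketch of the two halves using the pika client (the post names rabbitmq but not the client library or queue names, so treat those details as assumptions):

    import json
    import time

    import pika

    # producer side: fire-and-forget event publish
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="events", durable=True)
    event = {"host": "web01", "metric": "latency_ms", "value": 14.2,
             "ts": time.time()}
    channel.basic_publish(exchange="", routing_key="events",
                          body=json.dumps(event))

    # consumer side (would run as a separate process): drain the queue
    def handle(ch, method, properties, body):
        print("stored", json.loads(body))  # stand-in for the storage write
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="events", on_message_callback=handle)
    channel.start_consuming()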

Graphite emerged as the winner for visualization, with the whisper database as an upgrade from rrd...
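Feeding this stack is pleasantly simple: carbon, graphite's ingest daemon, accepts a plaintext "metric value timestamp" line on port 2003 by default (host name below is hypothetical).

    import socket
    import time

    # push one datapoint to carbon's plaintext listener
    sock = socket.create_connection(("localhost", 2003))
    sock.sendall(f"servers.web01.latency_ms 14.2 {int(time.time())}\n".encode())
    sock.close()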

in conclusion

The work is ongoing, and along the way two of the underlying systems were open-sourced, thanks to support from google.
  • rocksteady - the event correlation framework
  • astmob - asset database, a source of truth
