Thursday, July 8, 2021

Maxims - for a service in production

tldr; maxims that served me well as an SRE: single version in production, rollout/rollback testing, symptom alerting, SLIs, hitless upgrades, and production changes that are auditable.

These are some of the patterns I have repeatedly found valuable over the last 8 years, and for each I have called out the operational metric(s) it affects as a #metric tag.

Rollback testing. It's not sufficient to outline this in a production readiness checklist; it should be exercised regularly for correctness, limitations and timing. Other dimensions include, but are not limited to, configurations, the binary and the qualification process. #mttr
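
A rollback drill can be as small as a scheduled script that rolls a canary back to the previous known-good version and records how long it took. A minimal sketch - the deploy_tool CLI, the canary.internal hostname and the /healthz endpoint are all hypothetical placeholders, not any particular stack:

    #!/bin/bash
    # Time a rollback of the canary to a known-good version, end to end.
    set -euo pipefail
    prev_version="$1"                                   # known-good version to roll back to
    start=$(date +%s)
    deploy_tool rollout --target canary --version "$prev_version"   # hypothetical deploy CLI
    until curl -sf --max-time 2 http://canary.internal/healthz >/dev/null; do
      sleep 5                                           # wait for the canary to report healthy
    done
    echo "rollback to $prev_version took $(( $(date +%s) - start ))s"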

Single version in production - when running a production service whose core characteristic is feature velocity, I have seen this principle keep the chaos in check. Other considerations such as engineering team discipline, maturity of the team and modularity of the code base must be factored into the successful adoption of this policy. #cognitive-load #mttr

Symptom based alerting - no news is good news for a service whose SLOs are well-tested and genuinely represent the end-to-end customer experience. In my experience, happy SLOs are a destination on that journey, nowhere near the beginning. #diagnostic #triage #tools #observability
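
As a toy illustration of alerting on symptoms rather than causes, a check like the following looks only at what the customer experiences (the error ratio at the edge) instead of internal signals like CPU. The log path, the field position of the status code (9th field in a combined-format access log) and the 1% threshold are all assumptions for the sketch:

    # Alert if the 5xx ratio seen at the edge crosses 1% - a symptom the customer feels.
    ratio=$(awk '$9 ~ /^5/ { err++ } END { if (NR) printf "%.4f", err/NR; else print 0 }' /var/log/nginx/access.log)
    awk -v r="$ratio" 'BEGIN { exit !(r > 0.01) }' && echo "ALERT: 5xx ratio $ratio is above 1%"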

Service Level Objectives (SLOs) - an all-important tool in establishing an agreement with the consumers of the service. Indicators are an essential pre-step to setting objectives, so iterating over SLIs is essential to arriving at a good objective. And a good objective is not subjective - there is a value at which the customer is happy. This provides an error margin for the service to use for maintenance and upgrades. #observability #objective-measure
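
To make the error-margin point concrete: a 99.9% availability objective over a 30-day window leaves about 43 minutes of allowed unavailability, which is the budget that maintenance and upgrades have to fit inside. The arithmetic on the shell:

    # error budget = window x (1 - objective); 30 days at 99.9% leaves 43.2 minutes
    awk 'BEGIN { window_min = 30*24*60; objective = 0.999; printf "%.1f minutes\n", window_min * (1 - objective) }'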

Hitless upgrades - running a service in production means it will need to be updated periodically for feature changes and bug fixes, plus maintenance and dependency management - all the operations needing an in-place restart, ranging from machine repair to operating system upgrades. #maintenance #toil
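
The usual way to keep an upgrade hitless is to take instances out of rotation one at a time, update, verify health and put them back. A rough sketch of that loop - lb_drain/lb_undrain, the upgrade_hosts.txt inventory, the myservice unit and the /healthz endpoint are hypothetical stand-ins for whatever your load balancer and fleet provide:

    #!/bin/bash
    # Rolling, one-host-at-a-time restart so the service as a whole stays up.
    set -euo pipefail
    for host in $(cat upgrade_hosts.txt); do            # placeholder inventory file
      lb_drain "$host"                                  # hypothetical: remove from rotation
      ssh "$host" 'sudo systemctl restart myservice'    # placeholder service unit
      until ssh "$host" 'curl -sf --max-time 2 http://localhost/healthz >/dev/null'; do
        sleep 5                                         # wait until the host is healthy again
      done
      lb_undrain "$host"                                # hypothetical: back into rotation
    done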

I hope to add more detailed notes on each in subsequent posts.

Friday, June 14, 2013

evolution of appops

goal of monitoring: no news is good news and...

This post started out in 2011 to summarize the operations system that helped run admob successfully as a high availability service. I was part of the team that created and ran these systems, so I had a good view of them and a part to play in setting the direction of their evolution.

how did we get there?

Started simple - we built a health checking system that could be trusted to throw an alert at the first sign of trouble. This meant every check would go off during a widespread issue, so at the outset we decoupled the alerting from the checking systems. This allowed us to silence alerts en masse without suppressing the underlying checks.
This helped build confidence in the monitoring system.
Starting from a simple, trustworthy no-news-is-good-news setup, we added a time series db (rrd), moved to self-service monitoring and then to event correlation. The job's not done, but the foundation is now there to empower developers in the production ecosystem and to help us all speak a common language of metrics such as latency, utilization, wait times and timeouts.
What follows is a sketch of each of the discrete steps that formed that foundation.
Simple lessons that endured along the way - don't optimize prematurely; build for the near future not the super-distant imagined future.

V1: Central Monitoring and Alerting systems

A central server checking the health of all servers and services required to run the network.
key elements
  • ping check - reachable node
  • process existence check
  • logs check
  • port check
  • black box test
pros
  1. simple to implement
  2. a central place to control access - checks and checked nodes
  3. easy to snapshot and recover
  4. checks the health of the network, albeit from a single point

cons
  1. limited by polling frequency - at 1 check per second, a full sweep of 300 services takes 5 minutes
  2. single point of failure - can be mitigated by watching from a 3rd party location
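
For illustration, the battery of checks above boils down to a handful of standard commands run from the central box against each node - the hostname, port and process name here are placeholders:

    #!/bin/bash
    # The same small battery of checks, run from the central server against one node.
    host="web01.example.internal"                       # placeholder node name
    ping -c1 -W2 "$host" >/dev/null            && echo "ping ok"      || echo "ping FAIL"
    nc -z -w2 "$host" 80                       && echo "port 80 ok"   || echo "port 80 FAIL"
    ssh "$host" 'pgrep -x apache2 >/dev/null'  && echo "process ok"   || echo "process FAIL"
    curl -sf --max-time 5 "http://$host/" >/dev/null && echo "black box ok" || echo "black box FAIL"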

V2: Historical data

The number of servers and services grew as the product got more successful - a substantial increase (20 to 200) in servers and, by design, an increase in network distance. With that came a need to optimize resource utilization.
The questions to answer: What was the behavior last week? Has there been an improvement with this change?

We worked to get a timeseries database of system + application events.
At this time quantile latency metrics entered admob common parlance.
This meant that for all latency and utilization metrics, the median, 95th and 99th percentiles and the variance of the time series distribution were being picked out.
Knowing these metrics made for focused resource planning. The key benefit was that we were able to question performance expectations in situations where a certain resource was getting starved way ahead of the others.
This helped drive a very numbers-driven canary process for all new releases - application servers and OS packages could be subjected to the same measuring yardstick.
It allowed us to focus time on seeking out better ethernet drivers to support higher packet rates, and it also meant removing systems we didn't need, like software RAID on webservers.
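
As a small illustration of the quantile habit, the percentile cuts can be pulled out of a plain latency log (one duration per line; the file name is a placeholder) with nothing more than sort and awk:

    # Nearest-rank p50/p95/p99 from a sorted list of latencies, one value per line.
    sort -n latency_ms.log | awk '
      { v[NR] = $1 }
      END { printf "p50=%s p95=%s p99=%s\n", v[int(NR*0.50)+1], v[int(NR*0.95)+1], v[int(NR*0.99)+1] }'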

V3: Decentralize checks

Passive checks: in this version we moved checks pertaining to processes (existence, health and logs) to the local machine. The central checks were kept in place for aggregate servers such as load balancers, cross-correlation of rsyslog, database replication and the like.

This resulted in 2 wins
  1. reduced the traffic between the central server and the leaf nodes - which got us past a scaling bottleneck in the monitoring service
  2. Allowed us to use the same template for cross datacenter monitoring
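
A passive check of this kind is little more than a local cron job that evaluates the process on the box and pushes a pass/fail result upstream instead of waiting to be polled. A sketch, with the process name and the submission URL as placeholders for whatever the central receiver actually exposes:

    #!/bin/bash
    # Runs from cron on the leaf node; the central server only receives the result.
    service="myservice"                                 # placeholder process name
    if pgrep -x "$service" >/dev/null; then status=0; else status=2; fi
    curl -sf --max-time 5 \
         -d "host=$(hostname)&check=$service&status=$status" \
         http://monitor.internal/submit                 # placeholder receiver URL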

V4: High Frequency Events

The timeseries database was becoming a bottleneck for how much data we could collect. At this point we started looking for a solution that could handle higher sampling rates - until then the sampling interval was 5 minutes.

  1. rabbitmq for rapid event collection
  2. separate set of consumers to process and store events

Graphite emerged as a winner for visualization and whisper database as an upgrade from rrd...
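
As a concrete note on why Graphite was attractive: getting a sample into it is a one-liner against its plaintext listener on port 2003, so any producer on any host can emit metrics - the metric path and the graphite hostname below are placeholders:

    # "metric.path value unix_timestamp" to Graphite's plaintext port
    echo "admob.web01.latency_p99 42 $(date +%s)" | nc -w1 graphite.internal 2003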

in conclusion

The work is ongoing and in this time two of the underlying systems were also open sourced thanks to support from google.
  • rocksteady - the event correlation framework
  • astmob - asset database, a source of truth

Tuesday, November 18, 2008

logging, but just enough to be useful to troubleshoot

... to making operations engineering a real task

Take for instance a situation where there are reported timeouts on connections to servers - there could be several culprits here.
  1. server taking too long to process requests, resulting in non-acceptance of connections
  2. name lookup failure [if client is using names, not ip]
  3. network layer drops
Each of these 3 can be a rat's hole to debug if you don't know where to start. In every debug operation in a production environment I have started with logs; logging comes at a price and needs centralization for ease of consumption. The key point of this post is to emphasize the need to have awareness inculcated into all forms of development to focus on leaving a usable "rice trail" so these troubleshooting exercises can actually be done after the event occurrence.
Basic principles of logging:
  • log key entry and exit functions in your application
  • seed in time information where it'd be useful - for instance, if a connection is critical to the performance of the next step, have a try-catch block that captures return codes.
  • always have a timeout on a remote call - i.e. know what you need and how fast [do not rely on the server's sensitivity always] - and it's a great piece of feedback to the server (see the sketch after this list)
  • network flows are important - put that into place as well ** this area is new to me, more on this later.
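
A concrete take on the timeout principle above: bound the remote call and log the outcome, since curl exits non-zero when the bound is exceeded (exit code 28 on a timeout). The endpoint, the 2-second bound and the syslog tag are placeholders:

    curl -sS --max-time 2 -o /dev/null http://backend.internal/api/check
    rc=$?
    if [ "$rc" -ne 0 ]; then
      logger -t myapp "backend check failed or exceeded the 2s bound, curl rc=$rc"   # rc=28 means the timeout hit
    fi
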
Don't :
  • log DEBUG/INFO messages on production systems - it eats into precious IO resources
An operations-environment-aware development engineer is one that has an instinct for these; for the others I am still looking for the quickstart book!

Wednesday, January 23, 2008

Uptime, QoS, SLA

Found this very interesting post and I liked the emphasis on learnings and an iterative process of improvement.
The 99.99 or greater availability promise is a good one for the business, but unless I have a rock solid plan for recovery and have meticulously documented past outages, history is bound to repeat itself.

Wednesday, November 28, 2007

Add swap space on a running system

I found a few systems that were set up with low swap [equal to the original RAM] on the system, and now all my alerts for swap usage are going off.

On debian the steps to add a regular file for swapping were as follows:


  1. Make the swap file - create a contiguous file and label it as such

    dd if=/dev/zero of=fourGfile count=4000000 bs=1024
    chmod 0600 fourGfile
    mkswap fourGfile

  2. Mount it as a swap partition

    swapon -v fourGfile

  3. Add it to /etc/fstab

    /var/cache/swapfile/fourGfile none swap sw 0 0

This was it, and now we have enough swap on the systems:


bash-3.1# free -l
             total       used       free     shared    buffers     cached
Mem:       8180284    7866476     313808          0      84660    7323408
Low:       8180284    7866476     313808
High:            0          0          0
-/+ buffers/cache:      458408    7721876
Swap:      7951784      95304    7856480

NOTE: Swapping on a system is not a good thing for performance but it adds to total virtual memory that a process can access on the system.

Thursday, September 13, 2007

NFS exports

Installation and configuration was a snap; in 10 minutes we had the filesystem exported and mounted on the target system.

It involved the following:

  1. Package - nfs-user-server - the description from the package had these notes; the caveats haven't been an issue for us so far.

    This package contains all necessary programs to make your Linux machine act as an NFS server, being an NFS daemon (rpc.nfsd) and a mount daemon (rpc.mountd). Unlike other NFS daemons, this NFS server runs entirely in user space. This makes it a tad slower than other NFS implementations, and also introduces some awkwardnesses in the semantics (for instance, moving a file to a different directory will render its file handle invalid). There is currently no support for file locking.

  2. Portmap had to be reconfigured to take out the loopback.

        • File changed:/etc/default/portmap

        • #OPTIONS="-i 127.0.0.1"

  3. Edited /etc/exports to add directory to export

    • /mnt/data3 target.mount.host(rw,sync,no_root_squash)

    • no_root_squash is to allow for root on the remote system to
      be root on the exported fs; we needed this to be able to write from
      tape extraction.

  4. Restarted nfs-user-server and portmap.

  5. showmount -e to confirm exports were successful
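
For completeness, the client side of this was a single mount - the server name nfs-server.internal and the client mount point here are placeholders for our actual hosts, and an /etc/fstab entry would make it persistent across reboots:

    mount -t nfs nfs-server.internal:/mnt/data3 /mnt/data3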

Thursday, August 30, 2007

Bios upgrade on a Debian system

This one specifically refers to Dell and Debian - which is what we run.
Since Dell supports Windows and RHEL as its standard platforms, it seems like we are on our own.

DRAC firmware upgrade
Bhushan found that the DRAC firmware upgrades can be done using the DRAC interface itself.
  1. Login to DRAC web
  2. Goto "Media:Virtual Flash"
  3. Upload Image file
Prerequisites:
Download the firmware image from dell.com for the system.

CLI - go racadm
the racadm command line - ssh to the DRAC IP and run racadm help fwupdate. This one I haven't run and will update some day in the future.
A very detailed reference to the set of steps from Marius is available here.