Tuesday, November 18, 2008

logging, but just enough to be useful to troubleshoot

... to making operations engineering a real task

Take, for instance, a situation where timeouts are reported on connections to servers - there could be several culprits here.
  1. the server takes too long to process requests, so it stops accepting connections
  2. name lookup fails [if the client is using names, not IPs]
  3. drops at the network layer
Each of these three can be a rat hole to debug if you don't know where to start. In every debug operation in a production environment I have started with logs; logging comes at a price and needs centralization for ease of consumption. The key point of this post is to emphasize that every form of development needs a built-in awareness of leaving a usable "rice trail", so that these troubleshooting exercises can actually be done after the event has occurred.
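As a rough illustration of telling the culprits apart, here is a minimal Python sketch that does the name lookup and the connect as two separate, timed steps. The function name `probe` and the stage labels are my own, not from any library. A lookup failure points at culprit 2; a connect timeout cannot by itself distinguish a busy server from network drops, but it at least narrows down where to look.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Return (stage, seconds): the step that failed, or "ok", plus elapsed time."""
    start = time.monotonic()
    try:
        # Step 1: name lookup, done on its own so a DNS failure
        # is not mistaken for a network or server problem.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "name_lookup_failed", time.monotonic() - start

    family, socktype, proto, _, addr = infos[0]
    try:
        # Step 2: TCP connect with an explicit timeout.
        with socket.socket(family, socktype, proto) as s:
            s.settimeout(timeout)
            s.connect(addr)
    except socket.timeout:
        # No answer at all: busy server or drops on the path.
        return "connect_timed_out", time.monotonic() - start
    except OSError:
        # Actively refused, unreachable, and similar errors.
        return "connect_failed", time.monotonic() - start
    return "ok", time.monotonic() - start
```

Logging the returned stage and elapsed time on the client side gives a starting point before anyone touches the server.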
Basic principles of logging:
  • log entry to and exit from the key functions in your application
  • seed in timing information where it'd be useful - for instance, if a connection is critical to the performance of the next step, wrap it in a try-catch block that captures return codes.
  • always have a timeout on a remote call - i.e. know what you need and how fast [do not always rely on the server's responsiveness] - and it's a great piece of feedback to the server
  • network flows are important, so put logging for them into place ** this area is new to me, more on this later.
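The first three principles can be sketched in Python roughly as follows. This is only one way to do it: the decorator name `traced`, the `slow_after` threshold, the logger name "app" and the choice of log levels are all assumptions of mine. The idea is that entry and exit are logged, elapsed time is recorded, failures are captured with a traceback, and the remote call carries an explicit timeout.

```python
import functools
import logging
import time
import urllib.request

log = logging.getLogger("app")  # hypothetical logger name


def traced(slow_after=1.0):
    """Log entry, exit and elapsed time; slow calls are raised to WARNING."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            log.info("enter %s", func.__name__)
            try:
                result = func(*args, **kwargs)
            except Exception:
                # log.exception records at ERROR level with the traceback.
                log.exception("fail %s after %.3fs",
                              func.__name__, time.monotonic() - start)
                raise
            elapsed = time.monotonic() - start
            level = logging.WARNING if elapsed > slow_after else logging.INFO
            log.log(level, "exit %s after %.3fs", func.__name__, elapsed)
            return result
        return wrapper
    return decorate


@traced(slow_after=0.5)
def fetch(url, timeout=2.0):
    # Always pass an explicit timeout on a remote call.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

With this shape, a slow or failed call still leaves a line behind even when routine INFO output is switched off.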
Don't:
  • log DEBUG/INFO messages on production systems - it eats into precious I/O resources
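One way to keep DEBUG/INFO out of production without touching the call sites is to set the threshold once at startup. A small Python sketch, assuming a hypothetical `APP_ENV` environment variable and a helper name `configure_logging` that I made up for illustration:

```python
import logging
import os


def configure_logging(env=None):
    """Set the root log level from the environment; returns the level chosen."""
    # APP_ENV is a hypothetical variable name; default to the safe setting.
    env = env or os.environ.get("APP_ENV", "production")
    level = logging.WARNING if env == "production" else logging.DEBUG
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # replace any earlier configuration (Python 3.8+)
    )
    return level
```

Messages below the threshold are rejected before they are formatted or written, so calls like `log.debug("state=%s", state)` cost very little on a production system as long as the arguments themselves are cheap to compute.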
An operations-aware development engineer is one who has an instinct for these; for the others, I am still looking for the quickstart book!
