Monday, February 28, 2011

Monitoring

Most enterprise IDM systems are complex beasts, with redundancy not only in the presentation layer but also in the application layer and the data persistence layer. This can make it hard to answer the simple question "Is the system working properly or not?" In most cases you also want to be able to spot issues early so that you can fix them before they become a problem that may take down the entire system.

The best way to ensure that all components are healthy and all services are up is to implement a comprehensive monitoring program. So what are the things that you tend to want to monitor?

The most basic monitoring policy is the "wait until the end user yells" approach. In this approach you simply wait for the end user to start screaming, and as long as no one is screaming, things must be fine. This approach has some significant limitations: you only find out about a problem after it has already hit the users. It is not the way I would recommend.

Once you start talking about monitoring you usually discover that the corporation has some kind of standardized monitoring tool that you should (or must) use. These tools can usually provide the following functionality:

  • Host monitoring (is the server OS up)
  • CPU/Memory/disk monitoring
  • Process monitoring

Monitoring can be done either with a monitoring agent that is installed on each server or through an agentless approach. Having basic infrastructure monitoring can be very useful as it will alert you about creeping issues such as small memory leaks or logs that slowly but steadily eat up all available disk space. The trick is to make sure you can actually fine-tune the alarm thresholds and response levels as you go along. In most cases you do want to be told if the CPU suddenly spikes from a max of 25% to 95%. Being woken up at 2 am every second Wednesday because the CPU load spikes for a few seconds during a batch load, on the other hand, may not be ideal for you (or your marriage), so you do want the ability to put exceptions into the logic.
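As a rough illustration of a threshold with an exception, here is a minimal Python sketch. It is not taken from any particular monitoring tool: the names `QuietWindow`, `should_alert` and `BATCH_LOAD`, the 90% threshold, and the Wednesday 02:00 to 02:30 window are all assumptions made up for the example, and the window is simplified to every Wednesday rather than every second one.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class QuietWindow:
    """A recurring weekly period in which a high reading is expected (hypothetical)."""
    weekday: int  # 0 = Monday ... 6 = Sunday
    start: time
    end: time

    def covers(self, moment: datetime) -> bool:
        # True when the moment falls on the right weekday and inside the time range.
        return (moment.weekday() == self.weekday
                and self.start <= moment.time() < self.end)


def should_alert(cpu_percent, moment, threshold=90.0, quiet_windows=()):
    """Alert only when the reading reaches the threshold outside every quiet window."""
    if cpu_percent < threshold:
        return False
    return not any(window.covers(moment) for window in quiet_windows)


# Assumed example: a batch load that runs on Wednesdays between 02:00 and 02:30.
BATCH_LOAD = QuietWindow(weekday=2, start=time(2, 0), end=time(2, 30))
```

With this in place, `should_alert(95.0, datetime(2011, 3, 2, 2, 10), quiet_windows=[BATCH_LOAD])` stays quiet because that moment is a Wednesday inside the batch window, while the same reading a day later raises an alert. A real tool would normally express this as configuration rather than code, but the logic is the same.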

In most cases you will have some kind of network or port monitoring as part of your load balancer setup. Given that port-oriented network monitoring configuration tends to get very complex, I will write about this in a separate post.

The next step is to look at application-aware monitoring. This is usually accomplished by scanning the application logs and escalating entries that match a certain pattern (e.g. entries whose log level is ERROR). You can also look for specific error messages that you know are thrown when a specific error condition occurs.
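A minimal Python sketch of this kind of log scanning could look like the following. The log layout ("date time LEVEL message"), the `entries_to_escalate` function, and the "connection pool exhausted" message are all assumptions chosen for illustration; a real IDM product will have its own log format and its own known error messages.

```python
import re

# Assumed log layout: "<date> <time> <LEVEL> <message>"; adjust to the real format.
LINE_PATTERN = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<message>.*)$")

# Hypothetical message known to accompany a specific error condition.
KNOWN_ERRORS = [re.compile(r"connection pool exhausted", re.IGNORECASE)]


def entries_to_escalate(lines, levels=("ERROR", "FATAL")):
    """Return log lines whose level or message indicates a problem."""
    escalated = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # Skip lines that do not follow the assumed layout.
        by_level = match["level"] in levels
        by_message = any(p.search(match["message"]) for p in KNOWN_ERRORS)
        if by_level or by_message:
            escalated.append(line)
    return escalated
```

Passing an open log file (or any list of lines) to `entries_to_escalate` returns the entries worth forwarding to the monitoring tool. Matching on known messages as well as on the log level catches conditions that the application only logs as a warning.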

Once you have monitoring in place you should be able to sleep better at night. At least as long as your monitoring doesn't wake you up reporting nuisance errors.
