The best way to ensure that all components are healthy and all services are up is to implement a comprehensive monitoring program so what are the things that you tend to want monitor?
The most basic monitoring policy is the "wait until the end user yells" approach. In this approach you simply wait for the end user to start screaming and as long no one is screaming then things must be fine. This approach have some significant limitations so it is not the way I would recommend.
Once you start talking monitoring you usually discover that the corporation has some kind of standardized monitoring tool that you should/must use. These tools usually can provide the following functionality:
- Host monitoring (is the server OS up)
- CPU/Memory/disk monitoring
- Process monitoring
In most cases you will have some kind of network or port monitoring as part of your load balancer setup. Given that port oriented network monitoring configuration tends to get very complex I will write about this in a separate post.
Next step is to look at the application aware monitoring. This is usually accomplished by looking at the application logs and escalating entries that are following a certain pattern (i.e. whose log level is ERROR). You can also look for specific error messages that you know are thrown when a specific error condition occurs.
Once you have monitoring in place you should be able to sleep better at night. At least as long as your monitoring doesn't wake you up reporting nuisance errors.