It used to be that good preventive maintenance (PM) meant walking around, looking, listening, and generally being a good observer. Attention to detail was important. It was the sounds, and sometimes the smells, that told you something was wrong. Writing down meter readings was a normal event. And while some of that remains, most of today’s preventive maintenance can be done at your desk. With a little scripting, the results can be sent to your email, allowing PM to be done easily from almost anywhere.
Logs, SNMP traps, and a variety of other items need to be considered when doing good PM today. Not all SNMP traps end up in the logs. Notice I said logs, plural. Most systems have several logs, each for a different subsystem. One log may contain entries that are specific to the hardware, while another may be for the operating system (OS). Some logs relate directly to the application. With computer-based systems you really do need to dig as deep as possible, and on a regular basis. For instance, a quick check of Windows 7 shows log categories that include Application, Security, Setup, System and Services. All of these can show problems that relate to basic system health. It is likely that most of these categories are not looked at until it is too late. Groups that do check logs often look only at the specific logs that relate directly to their area of expertise. Yes, problems with the system will eventually find their way into the application logs, but many times these problems can be caught much earlier when all logs are regularly scanned for issues.
A good example of watching a problem develop can be seen on most digital receivers. Because RF paths are unreliable, forward error correction (FEC) is used to correct errors that occur along the transmission path. At any given time there will be some number of errors, and (hopefully) all, or most, will be corrected using the FEC. Tracking those numbers over time can show degradation in the antenna and receiver. The same is true of a computer NIC, hard drive system or memory. There will be some errors that are corrected, and over time that number may grow. However, when the number of errors increases sharply, or rises above a predetermined threshold, it is time to find the source of the errors and correct it. The only way to find those gradual increases is by tracking the numbers over time, just as we did with meter readings.
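The idea of tracking corrected-error counts, much like meter readings, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `check_error_trend`, the limit of 100, the five-reading baseline and the "twice the recent average" rule are all hypothetical choices, and the readings are made-up sample numbers rather than data from any real receiver, NIC or drive.

```python
from statistics import mean


def check_error_trend(readings, absolute_limit, jump_factor=2.0, baseline_window=5):
    """Return (index, value, reason) for each reading that deserves a closer look.

    A reading is flagged when it exceeds absolute_limit, or when it is more
    than jump_factor times the average of the previous baseline_window readings.
    """
    alerts = []
    for i, value in enumerate(readings):
        if value > absolute_limit:
            alerts.append((i, value, "over threshold"))
            continue
        history = readings[max(0, i - baseline_window):i]
        # Only judge a jump once a full baseline of earlier readings exists.
        if len(history) == baseline_window:
            baseline = mean(history)
            if baseline > 0 and value > jump_factor * baseline:
                alerts.append((i, value, "sharp increase"))
    return alerts


# Hypothetical daily counts of corrected errors (sample data only).
daily_corrected_errors = [10, 12, 11, 13, 12, 14, 30, 20, 120]

for index, value, reason in check_error_trend(daily_corrected_errors, absolute_limit=100):
    print(f"reading {index}: {value} corrected errors ({reason})")
```

With the sample numbers above, the jump to 30 is flagged as a sharp increase and the reading of 120 as over the threshold. The right limits for a real system depend on the equipment and are best taken from its own history.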
Digital systems can easily lull you into complacency. Things work great…until they don’t. Then what? The truth is the system probably left plenty of clues, many of which were likely missed because no one was looking. There are tricks that can be used: management systems can gather and report log errors, and scripts can search log files and their contents. Even something as simple as the size of the log file can be an indicator of developing problems. If the files are normally 2 KB, a 4 KB file might be a warning sign that needs closer examination.
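Both tricks, watching log file size and searching log contents, are easy to script. The following Python sketch is one possible approach, not a finished tool: the function names `oversized_logs` and `count_matches`, the doubling rule, the keyword list and the file names are all assumptions made for illustration, and the demonstration writes its own temporary files instead of touching real logs.

```python
import os
import tempfile


def oversized_logs(paths, typical_sizes, growth_factor=2.0):
    """Return (path, size, typical) for logs at least growth_factor times their usual size."""
    flagged = []
    for path in paths:
        typical = typical_sizes.get(path)
        if typical is None or not os.path.exists(path):
            continue  # no baseline for this log, or the log is missing
        size = os.path.getsize(path)
        if size >= growth_factor * typical:
            flagged.append((path, size, typical))
    return flagged


def count_matches(path, keywords):
    """Count lines in a log that contain any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    hits = 0
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            text = line.lower()
            if any(keyword in text for keyword in lowered):
                hits += 1
    return hits


# Demonstration with throwaway files; the names and contents are made up.
with tempfile.TemporaryDirectory() as folder:
    quiet = os.path.join(folder, "quiet.log")
    noisy = os.path.join(folder, "noisy.log")
    with open(quiet, "wb") as handle:
        handle.write(b"service started\n" * 100)
    with open(noisy, "wb") as handle:
        handle.write(b"retry failed, link error\n" * 200)

    usual = {quiet: 2048, noisy: 2048}  # assumed "normal" sizes in bytes
    for path, size, typical in oversized_logs([quiet, noisy], usual):
        hits = count_matches(path, ["error", "fail"])
        print(f"{os.path.basename(path)}: {size} bytes (usually {typical}), {hits} suspect lines")
```

Run from a scheduler, a script along these lines could mail its output each morning, which turns the size check into a routine reading instead of something noticed only after a failure.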
Another thing to be aware of is the various thresholds employed by the manufacturers. Manufacturers might categorize events as warnings, minor errors and major errors. A lack of major errors does not mean all is well. Look at minor errors and warnings. Ask the manufacturer if logged events other than errors or warnings should be tracked. If you don’t understand the events in the logs, ask about them.
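To make the point about minor errors and warnings concrete, here is a small Python sketch that tallies events by severity so that the lower categories are reported alongside the major ones. The level names `WARNING`, `MINOR` and `MAJOR`, the line layout and the sample entries are hypothetical; real equipment uses whatever labels and format the manufacturer chose, so the pattern would need to be adapted.

```python
import re
from collections import Counter

# Assumed severity labels; substitute the ones your equipment actually writes.
LEVEL_PATTERN = re.compile(r"\b(WARNING|MINOR|MAJOR)\b")


def tally_levels(lines):
    """Count log lines by the first severity label found on each line."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample entries for illustration only.
sample_lines = [
    "2024-05-01 02:10:11 WARNING fan speed below nominal",
    "2024-05-01 02:15:42 MINOR corrected memory error",
    "2024-05-01 03:01:07 WARNING fan speed below nominal",
    "2024-05-01 03:30:00 INFO scheduled backup complete",
]

counts = tally_levels(sample_lines)
for level in ("MAJOR", "MINOR", "WARNING"):
    print(f"{level}: {counts[level]}")
```

In this sample there are no major errors at all, yet the repeated fan warning is exactly the kind of entry worth asking about.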
Manufacturers are usually open to sharing that information with their customers. Throughout the software development cycle, all sorts of items are written to the logs as part of the debugging process. Many of the logged events detail how the system is really functioning, much like the margin numbers on digital receivers.
Today’s systems have a remarkable and somewhat uncanny ability to adapt. While that often makes our jobs easier, it can also mask an incredible number of problems. Some of those problems will come back and bite. Some won’t. As with all PM programs, the question is when you want to find out. On your terms? Or after the system is down and it’s too late? Personally, I would rather find out during the day than receive a frantic phone call at 4 a.m.