I was listening to episode one of the Vicious Circle podcast at work today (with headphones, of course), and Alan hit on one of my serious pet peeves. He also used an example that I am all too familiar with. The peeve in question is when IT departments spend beaucoup bucks on a solution when the same job can be done better for less. The specific example Alan used was Nagios vs. Netcool, two monitoring platforms I’ve used extensively. There are other examples I’ve encountered at work, but this one is too perfect.
Two years ago, before I was a network engineer, I worked as a NOC engineer. The job of a NOC engineer is basically to monitor the network and call someone when something breaks. To facilitate that task, a network operations center will be outfitted with a network monitoring system of some sort. When I started the job we used Nagios as that monitoring system. Nagios is free and it runs on Linux. Its system requirements are not burdensome. In our case we were running it on Fedora Core 4 on redundant POS Dell servers. It can run a variety of checks against network equipment and servers, from ping checks to SNMP polls and service monitoring (HTTP, SMTP, etc.). It’s open source and there are hundreds of add-ons for it, from new web front-ends to extra service checks and auto-discovery scripts. With a little scripting know-how, it was easy to configure Nagios to do basically whatever we wanted.
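To give an idea of how little scripting know-how that takes: a Nagios check is just a program that prints one line of status and exits with a code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). What follows is a rough sketch of one in Python, not anything we actually ran. The function names, the thresholds, and the choice of a plain TCP connect test are all made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical Nagios-style check: time a TCP connect and report a status."""
import socket
import sys
import time

# Exit codes from the standard Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(seconds, warn=0.5, crit=2.0):
    """Map a connect time to a status code; the thresholds are made-up defaults."""
    if seconds is None or seconds >= crit:
        return CRITICAL
    if seconds >= warn:
        return WARNING
    return OK


def connect_time(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def status_line(host, port, seconds):
    """Build the exit code and the single line of output Nagios displays."""
    code = classify(seconds)
    detail = "no response" if seconds is None else f"connected in {seconds:.3f}s"
    return code, f"TCP {LABELS[code]} - {host}:{port} {detail}"


# Only runs as a check when called with exactly a host and a port.
if __name__ == "__main__" and len(sys.argv) == 3:
    host, port = sys.argv[1], int(sys.argv[2])
    code, line = status_line(host, port, connect_time(host, port))
    print(line)
    sys.exit(code)
```

Drop something like that in the plugins directory, point a command definition at it, and Nagios will happily run it on a schedule. That is about all there is to it.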
Nagios has one downside: it’s management-intensive. Autodiscovery is not a reliable way to find new network equipment, and frankly you should know whenever something new is added to your network. Thus, every new piece of equipment has to be added to the monitoring system manually. Fortunately, doing so is easy. A trained monkey could manage Nagios. With nine NOC engineers, three per shift, each trained in Nagios administration, the management overhead was very light. Each of us was also capable of troubleshooting the monitoring system individually, which is handy when it breaks at 3 AM.
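For a sense of what "adding it manually" means: a host in Nagios is a few lines of text in an object config file. Here is a small sketch in Python that spits out such a definition. The helper name, the example hostname and address, and the `generic-host` template (borrowed from the sample configs that ship with Nagios) are assumptions for illustration, not our actual setup.

```python
def host_definition(name, address, alias=None, template="generic-host"):
    """Render a minimal Nagios host object as text.

    Hypothetical helper: the template name assumes a 'generic-host'
    template is already defined elsewhere in the config.
    """
    rows = [
        ("use", template),          # inherit defaults from a template
        ("host_name", name),        # short name used in the config
        ("alias", alias or name),   # longer display name
        ("address", address),       # IP address or resolvable hostname
    ]
    body = "\n".join(f"    {key:<12}{value}" for key, value in rows)
    return "define host {\n" + body + "\n}\n"


# Example with a made-up switch on a documentation-range address.
print(host_definition("core-sw-01", "192.0.2.10", alias="Core switch 1"))
```

Paste the output into a config file, reload Nagios, and the new box is monitored. Whether you type it by hand or script it like this, it is minutes of work.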
Unfortunately, someone higher up decided that real NOCs don’t use Nagios. Nagios is free and everyone knows that free is unreliable and bad. Not to mention unrespectable. Never mind that it worked, required little training to use and only slightly more training to manage, and the program was capable of doing everything we needed it to do. No, we needed something pretty. Something new. Something. . . expensive!
Enter Netcool, a monitoring solution from IBM of respectable breeding. I’m not going to sugar-coat it: I fucking hate Netcool. Nagios ran on any hardware, and it monitored a few thousand nodes of our network without any problem. In order for Netcool to do the same thing, the consultants we hired to install and configure the system and then train us on its use insisted that we needed hyper-expensive Sun servers (but I repeat myself), and of course we needed to run Netcool itself on Solaris X. It took them a year to figure out how to make it run and to import our network maps, and they were the experts. Never mind that during the entire time we had a working system that did the same thing. The cost of the endeavour? Let me just say that we could have hired 50 people at my salary for a year.
Nagios has an HTTP front-end that runs in any browser, anywhere. Netcool’s front-end uses the Java Runtime Environment, another piece of garbage software that I loathe. A side-rant on Java: I have six different platforms that I need to use Java to manage, and I have to run two different versions of the JRE to do that, because half the programs only run on an older version of Java. Every time a new version of the JRE comes out it breaks something new. Fuck Java in the ear. Give me a CLI any day. If I can’t manage it via SSH, you have failed as a vendor.
Anyway, we were promised that Netcool would perform automatic network discovery, cutting down on management time and effort. We were also told that Netcool would use our network diagrams to determine the systems that any given fault affected, and then isolate the root cause without setting off hundreds of alarms to sift through. Both of those claims turned out to be complete bullshit. Cutting down on management overhead? Is that why we hired someone to manage the monitoring software full-time? Is that why instead of having a dozen people who knew our monitoring platform inside and out we now have one, who spent weeks at IBM training for the job? Is that why, millions of dollars and a year and a half after full deployment, we still have a solution that’s only marginally better than Nagios was, and not at all better than it would have been if we had put in the same effort to upgrade that platform?
I’m so glad I don’t have to deal with that horse-shit Netcool software any more. You couldn’t pay me enough to do it again.
- They were likely full of shit, but management bought what they were selling.