Here's the scenario. You've been hired by a company to create a fledgling web site. The goal is to provide company information, an on-line store, and, most importantly, a web-based subscription service featuring the company's premier product. You are given free rein to make all of the hardware and software decisions. You have set up the hardware, designed the databases and written the application software. Everything is installed in a climate-controlled hosting facility near the office. Everyone's excited about the end result and just a little nervous. You are more than a little nervous. Some important team members raised concerns about your choices in building this site, and you really want to hit this one out of the park.
A number of years ago that was me, and my fears were realized. Fairly early on, the server started rebooting spontaneously during peak loads. All eyes were on me and I needed to get to the bottom of this before confidence in my abilities faded. I talked to the hardware support people and they believed that we were having a hardware problem. Again, not good, as the choice of vendor had been one of the initial sore points. Some replacement parts were sent along with some software updates. I installed the new hardware and software; then I sat back and crossed my fingers.
You already guessed that this didn't work, right? What would be the purpose of writing this if it did? I sat in meetings with plenty of people telling me the obvious (this needs to be fixed ASAP) and asking the question a technical person always hates to hear: “How long will this take to fix?”
I got on the phone with the Internet provider. They assured me that there couldn't be a problem on their end because other customers would be having similar troubles. That made sense to me, but one of my colleagues felt otherwise. He came up with the idea of putting an electric alarm clock in the server cabinet. The thinking was that if the server went down and the alarm clock was not flashing, we had a hardware problem. If the alarm clock was flashing, meaning it had lost power and reset, the problem was with the hosting company.
That weekend the server failed again. I was prepared for this and expected to check the clock and find it flashing. When I made the trip I found it was not flashing! The words that came out of my mouth cannot be repeated here, but suffice it to say I was not a happy camper. I had one more trick up my sleeve. Instead of plugging the clock into the cabinet's outlet, I plugged it into the Uninterruptible Power Supply (UPS). A UPS is kind of like a big battery that allows computers to keep running even if the power goes down.
The next morning I found that the server had gone down again overnight. This time when I went to the hosting facility I saw what I wanted to see: a flashing red display. Now I knew the problem was in the UPS. I figured out that a last-minute addition of some new hardware had left the UPS underpowered for the load it was carrying. A quick trip to the big-box electronics store and I was in business. Problem solved.
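For anyone who wants to run the same sanity check before adding hardware, here is a minimal sketch of the arithmetic in Python. Everything in it is an assumption for illustration: the function name `ups_headroom`, the wattage figures, and the 80% ceiling (a common rule of thumb, not a number from my own setup). Use the rated output of your UPS and the measured draw of your own equipment.

```python
def ups_headroom(capacity_watts, device_watts, max_load_fraction=0.8):
    """Compare the total device draw against the usable capacity of one UPS.

    capacity_watts: rated output of the UPS in watts
    device_watts: mapping of device name -> draw in watts
    max_load_fraction: assumed safety ceiling (rule of thumb, not a spec)
    Returns (total_load, usable_capacity, fits).
    """
    total_load = sum(device_watts.values())
    usable_capacity = capacity_watts * max_load_fraction
    return total_load, usable_capacity, total_load <= usable_capacity


# Hypothetical figures, for illustration only.
devices = {"server": 450, "switch": 60}
print(ups_headroom(900, devices))

# The same UPS after a last-minute hardware addition.
devices["late_addition"] = 250
print(ups_headroom(900, devices))
```

With these made-up numbers the original equipment fits under the ceiling, and the late addition pushes the total over it, which is the kind of change that is easy to miss when a device is added at the last minute.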
So is this merely a cautionary tale with a clever solution? I don't think so. Just as a chain is only as strong as its weakest link, our electronics are only as reliable as their weakest component. I wonder how many of us have some brand-new prized electronic toy (computer, plasma TV, you get the idea) plugged into some dubious power strip dug out of the basement.
Finally, it boils down to questioning your assumptions. We made the mistake of plugging the clock into the cabinet first because we suspected that was the source of our problem, not considering that a working UPS would not have allowed the server to lose power in the first place. Try to break problems down to the smallest level of detail. In our situation we completely overlooked a terribly important piece of equipment that was truly the root of the problem. Now, get going and replace those old power strips.