Jerry Ferrell

Job Title: Software Alchemist

Time for a Solution

Here's the scenario. You've been hired by a company to create a fledgling web site. The goal is to provide company information, an on-line store, and most importantly a web-based subscription service featuring the company's premier product. You are given free rein to make all of the hardware/software decisions. You have set up the hardware, designed databases and written the application software. Everything is installed in a climate-controlled hosting facility near the office. Everyone's excited with the end result and just a little nervous. You are more than a little nervous. Important team members had concerns about some of your choices in the creation of this site and you really want to hit this one out of the park.


A number of years ago that was me and my fears were realized. Fairly early on we started getting spontaneous rebooting of the server during peak loads. All eyes were on me and I needed to get to the bottom of this before confidence in my abilities faded. I talked to the hardware support people and they believed that we were having a hardware problem. Again, not good as the vendor was one of the initial sore points. Some replacement parts were sent along with some software updates. I installed the new hardware and software; then, I sat back and crossed my fingers.


You already guessed that this didn't work, right? What would be the purpose of writing this if it did? I sat in meetings with plenty of people telling me the obvious (this needs to be fixed ASAP) and asking the question a technical person always hates to hear: “How long will this take to fix?”


I got on the phone with the Internet provider. They assured me that there couldn't be a problem on their end because other customers would be having similar troubles. That made sense to me, but one of my colleagues felt otherwise. He came up with the idea of putting an electric alarm clock in the server cabinet. An alarm clock that loses power, even briefly, resets and flashes when the power comes back. The thinking was that if the server went down and the alarm clock was not flashing, we had a hardware problem. If the alarm clock was flashing, the cabinet had lost power and the problem was with the hosting company.


That weekend the server failed again. I was prepared for this and expected to check the clock and find it flashing. When I made the trip I found it was not flashing! The words that came out of my mouth cannot be repeated here, but suffice it to say I was not a happy camper. I had one more trick up my sleeve. Instead of plugging the clock into the cabinet, I plugged it into the Uninterruptible Power Supply (UPS). A UPS is kind of like a big battery that allows computers to keep running even if the power goes down.


The next morning I found that the server had gone down again overnight. This time when I went to the hosting facility I saw what I wanted to see: a flashing red display. Now I knew the problem was in the UPS. I figured out that a last-minute addition of some new hardware had left the UPS underpowered for the load it was carrying. A quick trip to the Big Box electronics store and I was in business. Problem solved.


So is this merely a cautionary tale with a clever solution? I don't think so. Just as a chain is only as strong as its weakest link, so are our electronics. I wonder how many of us have some brand new prized electronic toy (computer, plasma TV, you get the idea) plugged into some dubious power strip dug out of the basement.


Finally, it boils down to questioning your assumptions. We made the mistake of plugging the clock into the cabinet first because we suspected that was the source of our problem, not considering that a working UPS would not have allowed this to happen. Try to break problems down to the smallest level of detail. In our situation we completely overlooked the fact that there was a terribly important piece of equipment that was truly the root of the problem. Now, get going and replace those old power strips.

Entropy and Project Management

In addition to being a computer geek, I am also a science geek. I like to look to the world of physical and biological sciences for inspiration in the realm of computer science. One example is applying the concepts of DNA-repairing enzymes to preemptive file system repair in operating systems, but that is for another post. Today I'm going to touch on the notion of software entropy.


Entropy is the subject of the 2nd Law of Thermodynamics, which, loosely put, states that energy in a system tends to dissipate unless additional energy is provided or an obstruction is encountered. If we take a pitcher of hot water and pour half into an open container, common sense and experience will tell us that it will cool to room temperature. If we were to pour the remainder into a thermos, it would hold its heat for a considerably longer period of time. This is due to barriers that slow the loss of energy.


In software systems this entropy takes the form of slower performance, less reliability, and increased complexity over time. Anyone who has experienced the slowdown of a computer over time knows of which I speak. From the standpoint of project management it is important to remember this: “Your software is trying to fly apart.” Serious safeguards need to be put in place in the form of source code control and formal design requirements. Good software is that way because sufficient effort is being expended at every level to make it so.


In my experience this process plays out over time. For instance, a project starts out with lots of enthusiasm, buy-in from the users, and high expectations. There is a big push for the first release and once the initial kinks are worked out, everyone is happy. Users begin to accumulate suggestions for enhancements and important people in the organization make “requests” for changes. There will be significant pressure to sneak these updates into the mix, not unlike the way Microsoft was continually providing secret updates to Internet Explorer.


These changes will complicate future releases because they are introduced in a short-sighted manner that is calculated to cause the least amount of disruption to the production system. Consideration of design or maintenance issues falls by the wayside due to the urgency of the requests.


Some principles and tools that I think will keep a project on track over its life cycle are:

  1. Appropriate tools (Source code management, data dictionary, bug tracking software)

  2. An official DBA

  3. An editor of documentation and help files

  4. A formal change request policy with support from upper management


Depending on the size of your organization some of these suggestions might seem obvious. Much of my career has been spent in organizations with smaller development staffs, where team members wear multiple hats. In these situations, the above considerations are not necessarily a given. My feeling is that these sorts of organizations are prime candidates for software entropy due to the closeness of the technical and business team members and the desire to get things done.


Just remember this: the road to downed systems is paved with good intentions.

Getting Out of Debt

All of the attention to the “credit crisis” in the news lately has reminded me of something I once saw in my travels around the net. It immediately piqued my interest when I read it. The term I'm thinking of is “technical debt”. It is used to describe the obligation an organization incurs when a technical design or decision is made to facilitate a short-term need that ultimately proves more costly over time.

There are two types of technical debt: short term and long term. Short term debt would be a badly chosen design. People do not usually see this one coming. Certainly nothing anyone here would know about. Blame it on the contractors and interns and move on. The other kind of technical debt is long term. Ironically, this is often made as a strategic decision. It will often take the form of pressure to get something, anything, out the door. You can always fix it in version 2, right?

I like this concept because it provides a way of speaking strategically about technology in terms that are more meaningful to non-technical stakeholders. We have a very sophisticated way to discuss financial risk analysis. It would be nice to be able to tap into this mindset when discussing your next killer app with management.

Here are some conversations you can have with non-technical business colleagues to move this discussion forward:

  1. Use your current maintenance budget as an initial estimate of your current technical debt in monetary terms. This gets the ball rolling by focusing the discussion on monetary terms that can actually be measured. Here we are talking about the costs to keep the system running smoothly, not enhancing it.

  2. Refrain from talking about technical debt in terms of new software features. Concentrate on the cost of supporting legacy software or the current value of technical debt as a portion of total technology spending. You will get a better hearing when money is part of the discussion.

  3. Think strategically but keep your options open. Not all debts are created equal. Some stem from good business decisions; others are caused by sloppy technical practices or badly defined requirements.

Ideally, these concepts would become part of the normal dialog between the technical and business parts of your organization. I have found that well chosen metaphors can flesh out technical concepts that may be hard to grasp at face value.

Epoch Failure

Recently I was thinking back about the Y2K bug. Remember how companies spent tons of money and effort trying to avoid the impending crisis? Some people were actually stockpiling provisions because they thought society was going to come unhinged. Mercifully, the whole event passed without very much fuss.

 

I started fast-forwarding to see if I could find other crises lying in wait to destroy civilization as we know it. I offer up the following; feel free to add your own.

 

The Unix/Linux epoch bug, also known as the Year 2038 problem. System time on Unix-based machines is measured as the number of seconds since January 1, 1970. Since this has traditionally been stored as a signed 32-bit value, we know it will overflow on Tuesday, January 19, 2038. After that moment the counter wraps around to a negative number, so affected systems will not see the next second of 2038 but a date back in December 1901. Now that's a long time away and probably won't pose a problem, but Linux is being used in many embedded applications. Who knows what 32-bit Linux devices will still be in place in 30 years? That might be a good day to pass on the elevator and take the stairs.
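
For anyone who wants to check the dates, here is a minimal Python sketch of the arithmetic. It assumes the clock is a signed 32-bit counter of seconds since the epoch; the helper names `wrap_int32` and `to_utc` are made up for illustration and do not come from any real system.

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch: midnight UTC on January 1, 1970.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Largest value a signed 32-bit counter can hold.
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Simulate signed 32-bit two's-complement wraparound (illustrative helper)."""
    return (value + 2**31) % 2**32 - 2**31


def to_utc(seconds: int) -> datetime:
    """Convert a seconds-since-epoch count to a UTC datetime (illustrative helper)."""
    return EPOCH + timedelta(seconds=seconds)


# The last second a signed 32-bit counter can represent.
last_good = to_utc(INT32_MAX)

# One tick later the counter wraps to the most negative 32-bit value.
after_overflow = to_utc(wrap_int32(INT32_MAX + 1))

print("last representable second:", last_good.isoformat())
print("one second later (wrapped):", after_overflow.isoformat())
```

Running it shows the counter topping out at 03:14:07 UTC on January 19, 2038 and landing on December 13, 1901 one tick later. Systems that store the time in a 64-bit value do not hit this limit.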

 

Running out of Social Security Numbers. With a total of one billion numbers available at the start you would think this wouldn't be an issue any time soon. We have already used about half of the numbers and at current rates these will run out in less than 76 years. That's a lot sooner than you might expect. Currently numbers are not re-used and a whole class of combinations is not available. There are about 127 million numbers that will never be assigned. None with all the same digit (e.g. 111-11-1111), none containing all 13's. And no one will get a number containing any combination of 666.

 

Running out of telephone numbers. At the current rates it is expected that the US and its territories will run out of 10-digit telephone numbers and need to move to a 12-digit scheme in 2025. High-growth areas will feel the pinch much sooner because of the way area codes are used.

 

Finally, Internet addresses are in a bit of a crisis. Current estimates predict that we will exhaust the 32-bit addresses that are currently in use as part of IPv4 before January of 2011. Ouch! That's looming large and breathing down our necks. Fortunately IPv6 will give us plenty of breathing room: its 128-bit addresses allow for roughly 3.4 x 10^38 addresses, compared with about 4.3 billion for IPv4.
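
The sizes of the two address spaces follow directly from the address widths (32 bits for IPv4, 128 bits for IPv6). As a quick sanity check, this short sketch uses Python's standard `ipaddress` module to count every address in each space; the variable names are just for illustration.

```python
import ipaddress

# "0.0.0.0/0" and "::/0" are the networks that cover every IPv4 and IPv6 address.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:.3e}")
print(f"IPv6 is 2**{(ipv6_total // ipv4_total).bit_length() - 1} times larger")
```

These are raw totals. Both protocols reserve blocks for special purposes, so the number of addresses that can actually be handed out is somewhat smaller.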

 

Is there a lesson to be learned from all of this? I think that there is. When designing any new system, think long and hard about key items. Keys literally, in the form of database indexes, primary keys, etc., but also key values in the general sense. Think about how these values will change over time and as the volume of data increases. Take your best guess about what the requirements will be and multiply by 1,000 or more. Those who do not learn from history are doomed to repeat it.
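
To make the "multiply by 1,000" rule a little more concrete, here is a rough Python sketch that works out how wide a signed integer key needs to be for an estimated row count. The function names, the candidate widths (16, 32, 64 bits) and the five-million-row figure are all assumptions chosen for illustration, not figures from any real system.

```python
def bits_needed(distinct_values: int) -> int:
    """Smallest number of bits that can tell `distinct_values` keys apart."""
    if distinct_values < 1:
        raise ValueError("need at least one value")
    # Keys 0 .. n-1 need as many bits as the largest key, n-1.
    return max(1, (distinct_values - 1).bit_length())


def smallest_int_width(expected_rows: int, safety_factor: int = 1000) -> int:
    """Pick the narrowest signed integer width that fits the padded estimate."""
    required = bits_needed(expected_rows * safety_factor)
    for width in (16, 32, 64):
        # A signed integer spends one bit on the sign.
        if required <= width - 1:
            return width
    raise OverflowError("estimate does not fit in a 64-bit signed key")


# Hypothetical estimate: five million rows.
naive = smallest_int_width(5_000_000, safety_factor=1)
padded = smallest_int_width(5_000_000)

print("width for the raw estimate:", naive)
print("width with a 1,000x margin:", padded)
```

With these made-up numbers the raw estimate fits comfortably in a 32-bit key, while the padded estimate needs a 64-bit one, which is exactly the kind of headroom the epoch counter and IPv4 never had.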
