Code Quality (part 3)

This is our third post in the series about software Code Quality and a practical method to measure it. See the first post for an overview. Read the second post to learn how to measure Defectiveness and Maintainability.

3) Supportability

Supportability is often correlated with maintainability, but sometimes they diverge. Supportability is determined by your ability to find and fix defects quickly, view system availability and performance, and predict system problems. When the system goes down at 1am, you don't want to find out that the exception logging is not working, or that somebody decided to implement some great new, relatively untested framework/design because they just read about it in some article/book/one of my blogs… oops, forget that last one. But seriously, sometimes supportability and maintainability can be opposites.

Let us take a more concrete example to illustrate:

A software developer may code in such a way that many parts are easily pluggable and abstracted through interfaces or other such means. This is powerful in terms of maintainability, as it becomes very easy to update the system. However, it may also hurt supportability, because it may not be clear which concrete implementation is actually being used, and why. In some cases there may be multiple places where the concrete implementation is determined. At 1am, with the system down, you may be pulling your hair out trying to figure out why WhizzyValidation is being used instead of FuzzyValidation. Then, finally, you realize that the developer overrode the default lookup function that uses the configuration file and instead hard-coded WhizzyValidation as the implementation of IValidateThisSucker, but forgot to limit the scope to OopsMyNewCrappyControl.
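
To make that 1am scenario concrete, here is a minimal sketch in C#. The interface and class names come from the example above; ValidatorFactory, the ValidatorType configuration key, and the validation logic itself are made up for illustration.

```csharp
using System.Configuration; // needs a reference to System.Configuration.dll

// The pluggable abstraction from the example above.
public interface IValidateThisSucker
{
    bool Validate(string input);
}

public class FuzzyValidation : IValidateThisSucker
{
    public bool Validate(string input) => !string.IsNullOrWhiteSpace(input);
}

public class WhizzyValidation : IValidateThisSucker
{
    public bool Validate(string input) => input != null && input.Length < 256;
}

// Hypothetical factory: the "default lookup function that uses the configuration file".
public static class ValidatorFactory
{
    public static IValidateThisSucker Create()
    {
        // e.g. <appSettings><add key="ValidatorType" value="FuzzyValidation" /></appSettings>
        var typeName = ConfigurationManager.AppSettings["ValidatorType"] ?? "FuzzyValidation";
        return typeName == "WhizzyValidation"
            ? (IValidateThisSucker)new WhizzyValidation()
            : new FuzzyValidation();
    }
}

// The 1am surprise: this control skips the factory and hard-codes its own choice, so the
// configuration file says FuzzyValidation but WhizzyValidation runs anyway.
public class OopsMyNewCrappyControl
{
    private readonly IValidateThisSucker validator = new WhizzyValidation(); // overrides the default lookup

    public bool Check(string input) => validator.Validate(input);
}
```

The configuration file says one thing, the control quietly does another, and at 1am a search for every place that constructs a validator may be the only way to find out why.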

So, how do you measure the supportability of code? You take the mean square of the SLOCs times the … ahhhh, never mind, I have a better idea. As with maintainability, I propose you measure supportability directly: introduce an issue into the system and measure how easy it is to resolve. Perhaps you 'accidentally' run a bad SQL script that bloats a table past reasonable system limitations, or disable the queuing service during the load test.

You do have load testing, don't you? If not, the HealthCare.gov team is looking for new members.

In other words, simulate something that is likely to happen in production, and since production generally follows Murphy's law, that is just about anything bad.
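
As one possible shape for such a drill, here is a hedged sketch in C# that stops a Windows queuing service during a load test and times how long it takes for the problem to be noticed and resolved. The service name "MyAppQueueService" is a placeholder for whatever your queuing service is actually called, and the detection step is left to whoever (or whatever) is on call.

```csharp
using System;
using System.Diagnostics;
using System.ServiceProcess; // needs a reference to System.ServiceProcess.dll

// A rough supportability drill: knock out the queuing service mid-load-test and time how
// long it takes for the problem to be detected and resolved.
public static class SupportabilityDrill
{
    public static void Run()
    {
        var queue = new ServiceController("MyAppQueueService"); // placeholder service name
        var clock = Stopwatch.StartNew();

        Console.WriteLine("Stopping the queuing service during the load test...");
        queue.Stop();
        queue.WaitForStatus(ServiceControllerStatus.Stopped, TimeSpan.FromMinutes(1));

        // Now watch: did the dashboard light up? Did an alert fire? How long before someone
        // (or something) restarts the service? That elapsed time is your supportability measure.
        Console.WriteLine("Press Enter once the issue has been detected and resolved.");
        Console.ReadLine();

        clock.Stop();
        Console.WriteLine("Time to detect and resolve: " + clock.Elapsed);

        // Safety net in case nobody noticed -- which is itself a telling result.
        queue.Refresh();
        if (queue.Status != ServiceControllerStatus.Running)
        {
            queue.Start();
            queue.WaitForStatus(ServiceControllerStatus.Running, TimeSpan.FromMinutes(1));
        }
    }
}
```

The number you record, from fault injected to system healthy again, is the supportability score that matters, far more than anything you can compute from the source code alone.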

I remember on one consulting engagement a client had intermittent COM+ errors; it took Microsoft Support a month or so to figure out that the primary DNS server would occasionally hang, causing the problem. To make it worse, the system was not falling back to the secondary DNS server after a reasonable timeout. Someone had to wade through MBs of dump files to find the issue because the code was not built with supportability in mind. This is one area where you might not want to follow Microsoft's lead; after all, STOP 0x0000000A and the IIS 500 error are not the best examples of clear exception handling.

It is also important to have visibility into system availability and performance. Usually an IT dashboard is created to see how many users are on the system, what the overall load on the system is, etc. Tools like New Relic, SolarWinds, or the myriad of systems management servers are used. The key is that the application must provide the data to be monitored; if it does not, supportability will suffer. Without those tools it will also be much harder to predict and diagnose system problems. Many times you need a history to determine what is different 'this time' vs. last time; maybe there were too many users from one client, or a server started misbehaving. Again, it is time for a little history:

For a large application we struggled with support as there was no central dashboard. We had performance counters, exception logging, and a systems management server, but they were not tied together and did not integrate with the application. To find out how many users were currently on the system, whether there were any stuck operations, or which server(s) were misbehaving, IT had to issue queries and review each server's performance information individually. It took a lot of manpower, and although our IT staff was pretty darn good, they did not develop the system, so it usually required some development help. We quickly built an IT dashboard that consolidated the queries, integrated with the application to gather metrics, and provided easy system maintenance items directly on the dashboard (mostly just restarting the misbehaving services). The number of times development was pulled in to address support issues on production was cut by 10x, and clients rarely noticed any problems, since IT was alerted to them and could act before the client saw them. In addition, IT had easy control and visibility, also reducing the time they had to spend supporting the system. Luckily, for the most part the application had been built with supportability in mind, with performance counters, exception logging, etc. It was mostly a matter of connecting the pieces, but there were definitely areas that needed improvement. Before the updated dashboard I would have rated the application about 2/5; after the update it was about a 4/5 for supportability.
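
As a rough illustration of "the application must provide the data to be monitored," here is a sketch of publishing a couple of Windows performance counters that a dashboard, New Relic, SolarWinds, or a systems management server could read. The category and counter names are made up for the example.

```csharp
using System.Diagnostics;

// A minimal sketch: the application exposes its own health numbers so that monitoring
// tools and an IT dashboard have something to watch. "MyApp" and the counter names
// are placeholders.
public static class AppMetrics
{
    private const string Category = "MyApp";
    private static PerformanceCounter activeUsers;
    private static PerformanceCounter stuckOperations;

    public static void Initialize()
    {
        if (!PerformanceCounterCategory.Exists(Category))
        {
            var counters = new CounterCreationDataCollection
            {
                new CounterCreationData("Active Users", "Users currently on the system",
                    PerformanceCounterType.NumberOfItems32),
                new CounterCreationData("Stuck Operations", "Operations exceeding their time budget",
                    PerformanceCounterType.NumberOfItems32)
            };
            PerformanceCounterCategory.Create(Category, "MyApp health metrics",
                PerformanceCounterCategoryType.SingleInstance, counters);
        }

        activeUsers = new PerformanceCounter(Category, "Active Users", false);      // false = writable
        stuckOperations = new PerformanceCounter(Category, "Stuck Operations", false);
    }

    // Call these from the application code so the monitoring tools have real data.
    public static void UserLoggedIn()   => activeUsers.Increment();
    public static void UserLoggedOut()  => activeUsers.Decrement();
    public static void OperationStuck() => stuckOperations.Increment();
}
```

Whatever the mechanism (performance counters, a metrics endpoint, structured logs), the point is the same: the application has to emit the numbers before any dashboard can show them.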

4) Performance

Who needs performant code? I mean, a web page loading in 25s is fine, right? We put on one of those cool hourglass thingies or those exciting new animated loading icons and the user will wait, and wait, and wait… Luckily, performance is actually one of the easiest things to measure; unfortunately, that also means shoddy performance is a clear sign of non-quality code. The trick with performance is that there are many ways to achieve it, and there are always trade-offs. From a code quality perspective the main thing that matters is the outcome; no rule/algorithm can determine whether one architectural choice was better than another. To illustrate, let us take an example.

In a previous life I was looking to implement a cleaner way to code the data tier in an application, and there was this relatively new technology called LINQ. So I did a quick experiment, writing up a small prototype (you do prototype before just plugging something into an application, don't you?), and wow, what a load of non-performant crapola. Sure, I could save a couple of man-months in development, but I would have to parallelize most operations across 10 servers to get back to normal performance levels. For smaller applications and MVPs [Note: minimum viable product, i.e., a prototype pushed to production] it might have been fine, but when a single operation could calculate 100M+ balances, just 1ms of overhead per calculation would add hours to the calculation time. That savings in development would actually cost much more in support or in building a parallel processing system (multi-threading was already used), additional servers, additional admins to administer those servers, etc. Someone looking at the code from outside might say that a custom data tier was overkill, but what really mattered in this case was that the code performed, not that it used the latest and greatest (i.e. half-baked) thing from Redmond or from some team hacking together the latest untested 'framework.'
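
This is not the original prototype, just the general shape such an experiment might take: time the same balance calculation with LINQ-to-Objects and with a hand-rolled loop, then project the per-item difference onto production volumes. The data, the "calculation," and the 100M figure used in the projection are stand-ins.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// A quick prototype-style comparison: per-item overhead times a big enough item count
// turns into hours. For reference, 1ms of overhead per calculation at 100M calculations
// is 100,000,000 ms, which is roughly 28 hours.
public static class DataTierPrototype
{
    public static void Run()
    {
        const int sampleSize = 1000000;
        var amounts = Enumerable.Range(0, sampleSize).Select(i => (decimal)(i % 1000)).ToArray();

        var sw = Stopwatch.StartNew();
        decimal linqTotal = amounts.Where(a => a > 0).Sum();
        sw.Stop();
        TimeSpan linqTime = sw.Elapsed;

        sw.Restart();
        decimal loopTotal = 0;
        for (int i = 0; i < amounts.Length; i++)
        {
            if (amounts[i] > 0) loopTotal += amounts[i];
        }
        sw.Stop();
        TimeSpan loopTime = sw.Elapsed;

        // Project the measured per-item overhead onto the real workload size.
        double overheadMsPerItem = (linqTime - loopTime).TotalMilliseconds / sampleSize;
        double projectedHours = overheadMsPerItem * 100000000 / 3600000;
        Console.WriteLine("LINQ: " + linqTime + ", loop: " + loopTime +
                          ", totals match: " + (linqTotal == loopTotal));
        Console.WriteLine("Projected overhead at 100M calculations: " +
                          projectedHours.ToString("F1") + " hours");
    }
}
```

The point is not that LINQ is bad; it is that a cheap prototype plus a little arithmetic tells you whether a convenience is affordable at your real volumes before it is wired into the application.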

What determines acceptable performance? Your users. Yes, users can have unreasonable demands; by all means try to negotiate, but ignore user requirements at your own peril. You should have an understanding of what the performance requirements are for each piece of the project/sprint. If the code does not meet those requirements, it is not high-quality code. In most projects you will have to make a tradeoff between performance and cost to build, and that is fine. Performance does not necessarily mean that everything has to happen immediately; appropriate queuing can be utilized to great effect. But even with queuing you have to be careful that something that should take 2 minutes does not take hours.
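
One way to keep those negotiated requirements honest is to turn them into an executable check that fails loudly when the budget is blown. The sketch below assumes a hypothetical 2-second budget and a placeholder GenerateMonthlyStatement() operation.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// A sketch of turning a negotiated performance requirement into an executable check.
public static class PerformanceBudgetCheck
{
    private static readonly TimeSpan Budget = TimeSpan.FromSeconds(2); // agreed with the users

    public static void Run()
    {
        var sw = Stopwatch.StartNew();
        GenerateMonthlyStatement();
        sw.Stop();

        if (sw.Elapsed > Budget)
        {
            // Treat a blown budget like a failing test: the code does not meet the requirement.
            throw new Exception("Budget exceeded: " + sw.Elapsed + " > " + Budget);
        }
        Console.WriteLine("Within budget: " + sw.Elapsed + " <= " + Budget);
    }

    private static void GenerateMonthlyStatement()
    {
        // Stand-in for the real operation under test.
        Thread.Sleep(500);
    }
}
```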

Note that some books/blogs advocate leaving performance tuning to the end; others say to do it from the get-go. I would have to agree with those who advocate continuous performance testing/measurement. A critical part is to test with realistic data and with maximum expected data volumes. There are many stories of teams developing great code, testing performance with some small subset of data, and then, when rolling to production, finding out that the code died a quick death under realistic loads.
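
To close, a small sketch of the "realistic data and max data" point: run the same operation at a typical production volume and at the worst case you ever expect, on every build rather than only at the end. The row counts and ProcessBatch() are placeholders.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Run the operation under test at a realistic volume and at the maximum expected volume,
// so a small-data success cannot hide a production-scale failure.
public static class DataVolumeCheck
{
    public static void Run()
    {
        int[] volumes = { 100000 /* typical day */, 5000000 /* worst case on record */ };
        foreach (int rows in volumes)
        {
            var data = Enumerable.Range(0, rows).ToArray();
            var sw = Stopwatch.StartNew();
            ProcessBatch(data);
            sw.Stop();
            Console.WriteLine(rows + " rows: " + sw.Elapsed);
        }
    }

    private static void ProcessBatch(int[] data)
    {
        // Stand-in for the real operation; replace with the code under test.
        long total = 0;
        foreach (var x in data) total += x;
    }
}
```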