Back To Square Zero: troubleshooting

Showing posts with label troubleshooting. Show all posts

Wednesday, February 4, 2015

Logging and Error-Handling Guidelines

I'm pretty sure I sound kind of logging-obsessed to my fellow developers sometimes. I'm sure I sound like a broken record sometimes when I ask to see the log when troubleshooting an issue, but when logging is done right it provides a clear path toward getting to the root cause of a problem. To my mind, this is about 1000 percent better than the alternative, which mostly consists of guessing and randomly trying things, followed by hiding from management when they come around demanding a status update.

So, logging is extremely valuable- at least good logging is. But what makes for good logging? I'm tempted to say "I know it when I see it", but in fact over the years I have managed to codify some guidelines about logging and error handling, which I'll lay out here.

We use log4net as our standard logging package for .NET projects. It's been around forever, and I know there are sexier options out there, but the guidelines below kind of assume log4net. If you use a different logging framework, I'm pretty sure most of the recommendations will still apply.

In a normal production environment, the logging level in configuration should be set to INFO. In log4net XML configuration, it would look something like this:

<log4net>
<appender name="TheAppender" etc.. />
<root>
<level value="INFO" />
<appender-ref ref="TheAppender" />
</root>
</log4net>

When you are testing or troubleshooting, the level can be changed to DEBUG to gather more information about the system. That means that as you write log statements, you'll need to think about whether you'll want to see them generated as part of normal behavior, or only in debugging/troubleshooting situations.

Each class should have its own Logger instance. We declare it like this:

private static readonly log4net.ILog Log = log4net.LogManager.GetLogger(System.Reflection.MethodBase.GetCurrentMethod().DeclaringType.Name);

This allows the logger to automatically pick up the name of the class, and including the fully-qualified class name on everything allows it to be copied "as is" into a new class in one action without needing to add any "using" directives.

When coding, try to anticipate what types of log messages that would be helpful to a future troubleshooter. This is a bit of a balance, because you want to avoid "spamming" the log with low-value information.

Log every request coming in at the highest code level with a level of INFO. This means controller methods, Main(), etc. This gives you a permanent record of all the requests that you handle.

When logging unexpected error conditions or exceptions you can't recover from, do so with a logging level of ERROR. This may sound obvious, but if you're consistent about it, you can quickly search through any log for the problem that Support is probably asking you about.

Exceptions that are logged should be done so with the full details of the exception, including message, stack trace, inner exception (if any), etc. Log4net makes this easy. Having the full stack trace in the log is enormously valuable for troubleshooting.

Log every condition that can be handled but may be a sign of trouble with a logging level of WARN. These often provide the "smoking gun" for more inscrutable errors.

When you get bug reports around unexpected edge cases, log the condition when you're fixing the issue. These are often WARN statements.

Log every database read and write (in Entity Framework, etc.) with a logging level of DEBUG. These are almost always too frequent to log as INFO (and can leak sensitive information) but are enormously helpful for troubleshooting.

Log other interactions with the outside world (file interaction, communication with external web services, etc.) with a logging level of INFO, unless they're so common they really spam the log, in which case go with DEBUG. The same reasoning as for database access - out-of-process calls of any type are a frequent source of failure.

Log any other helpful data, as needed, with a logging level of DEBUG. These types of log statements tend to be up to the discretion of the developer, and often they may be removed once a section of code is stable.

Errors that are not caught anywhere in the code should be caught and logged by the framework if possible. ELMAH is one tool that does this for ASP.NET web applications. If not possible, the entry point method (controller method, Main()) should have a general catch exception that logs the error.

If exceptions are caught, they should be re-thrown or completely handled. Under no circumstances should they leave the data in a bad state.

Try/catch blocks should be relatively rare and exist to handle specific error conditions. I've seen a lot of code where developers put a try/catch block in almost every method. This just adds cruft and overhead without adding any value. I think in most cases developers who do this are falling into a "cargo cult" mentality where they think they must add try/catch blocks everywhere in order to "do error handling right." They may be half-right - they should think about error handling in every method, but in my experience most of the time the best thing to do is simply let the exception be thrown and handled up the call stack.

Do not return status codes to indicate whether there was an error in a method. If there was an error, throw an exception from the method. This simplifies the calling code significantly as it does not have to check the return values or status codes of method calls. Older programmers who started out their careers using C or other languages without exception support are probably the most guilty of this.

Sometimes it makes sense to catch exceptions at a lower level in order to log specific details about the exception at that level. If this is done, be sure to re-throw the exception.
When re-throwing an exception, it is usually better to use “throw” by itself, rather than throwing the caught exception. Here is a quick example:

string base64 = …
byte[] bytes;
try
{
bytes = Convert.FromBase64String(base64);
}
catch (FormatException ex)
{
// Log the bad data for help in troubleshooting the problem
Log.ErrorFormat("Bad base64 string: '{0}'", base64);
throw; // this preserves the full stack trace
// do not do this as it resets the stack trace:
// throw ex;
}

In the code above, the only reason there is a try/catch block is so we can log the bad base64 string in the case that it cannot be converted to a byte array. By calling "throw" rather than "throw ex", the stack trace will appear upstream as if it were not caught at all at this level.

Avoid NullReferenceExceptions as they typically do not contain adequate information about what went wrong. Validating method arguments and throwing ArumentNullException, ArgumentOutOfRangeException, etc. is the best practice to avoid this.

Along the same lines, be sure to validate the result of data queries, including Entity Framework or LINQ queries, and throw an appropriate exception. An exception of this type ("Contact #123 was not found in database") is much more helpful than a NullReferenceException occurring many method calls later.

Client applications need to be able to handle various failed requests to the server (server not responding, 404 errors, 500 errors, etc.). If this is not done, it often leaves the client application "hanging" in an unresponsive state.

Taking a step back, you'll want to think about the process around all this. When a bug or problem is discovered, part of getting it fixed or resolved should be a discussion of what additional logging or instrumentation would have helped to diagnose the issue. Since you generally can't (or shouldn't) do source-level debugging on a production system, you have to rely on production logs, which should help you reproduce the issue in a development environment.

Wednesday, August 14, 2013

Support and Troubleshooting Techniques for Software Developers

Developers at the company where I work, including myself, do a regular rotation as 2nd or 3rd level product support. Basically, whenever our regular customer support folks have support questions they can't resolve on their own, they reach out to the on-call developer. As you might expect, this is not our favorite part of the job, but sometimes it's rewarding. To it a little more bearable, sometimes I think of it as detective work. Think of a police procedural TV program like "Law and Order", where the detectives going around interviewing eyewitnesses to the original crime, which produces more leads, etc.

I'm assuming at this point that a support person has come to you with a customer problem that they can't solve. At this point, your job is to go where support people generally can't - into the code.

Root Cause Analysis

Here's my main point: Work Backwards. Yes, I said this in the last post as well, and I'll repeat the rest of it: Start at the point where the error first shows up, figure out what condition(s) cause that to occur, then figure out what causes that, etc. Methodically work your way up the chain of causality. Don't attempt a fix without any analysis, unless you have a really good reason to believe it will work in this situation - in which case, you're not troubleshooting, you're just applying a remedy to a known issue. If support didn't know about the issue, be sure they add it to their knowledge base.

Start with some Basic Information

I probably sound like a broken record to our support team, because I'm always asking for two things right off the bat:

1. What version of the software is the customer running? If you have a single, web-based application that everyone in the world hits, this won't be a concern.

2. Can you get me the logs?

I need to know what version they're on so I know which version of the code to pull to start the causality investigation. I want the logs because the logs usually tell me a lot more about the error than the user-facing error message does (and sometimes there's no user-facing error message at all, just a "something didn't save correctly" or "so-and-so didn't respond." If it's an exception thrown, the logs will give me the full exception stack trace, which is invaluable. Again, this is the way we tie the error to the code.

If your software does not log errors when they occur, stop right now and make it a high priority. Do not release your next version without it. If you encounter resistance on this, start looking for another job, because I can't think of many things more frustrating than having to support software that doesn't have a way to tell you when errors occur, and what the nature of those errors is.

Some other things to ask the customer (or have Support ask the customer):

If something had been working and suddenly started having problems, what has recently changed in the environment? New hardware, new software, version upgrades, etc.?
Is the problem isolated to one user? If so, start investigating what is different about that user. Insufficient permissions?
Is the problem isolated to a single workstation? If so, same thing - what's different about that workstation?
Is the problem isolated to a single piece of data? Perhaps some data got corrupted.

Be Methodical

Don't be in a too much of a hurry - if you follow the steps, you'll almost always eventually come to the right answer. Apply logical reasoning: as a software developer, that's a big part of what you get paid for.

For example, if the user is seeing a particular error message, I need to know exactly what the error message says - preferably a screenshot or a text copy of the message. As a developer, your job is to tie the error back to the code. Sometimes you'll know exactly where in the code to look, but often, especially if your codebase is large, you'll need to search for the error text in the code.

Static Analysis

Once you've found the line of code that produces the error message, start doing some static analysis. Use your IDE's "find usages" or "find references to" functionality (or in the worst case, the straight "find" functionality) and start going up the call stack. The possible call stacks may start fanning out, but if you cross-reference against log messages, plus what you know about the configuration of the system, you'll usually have a good idea of the code path. You may need to go back to the support team or the client to get more information.

In modern AJAX web applications, the error may first be thrown on the server, but the call stack will eventually lead to a web service call, and from there into client-side JavaScript. The same rules apply, although the functional, untyped world of JavaScript can make static analysis a bit more difficult.

If you can't find the error or exception message in your code, it's probably coming from a third-party library you're depending on. Next stop: Google (or your favorite search engine) or maybe stackoverflow. Once you find the library function call involved, look for where it's called in your code, then follow the call stack up as before.

OK, That Didn't Work...

If nothing has panned out so far, here are some other things to try. These are roughly in order of easiest-to-hardest.

Is this a known issue? Check your bug database, release notes, knowledge base, etc. Yes, support should have done this, but you probably have access to more resources (the bug database, for example). Also, with your perspective as a developer, you may search the knowledge base in a different way and come up with different (sometimes better) results.
If the error is repeatable but you might be able to bump up the logging level (turn on Debug logging, for example), then reproduce the error and get the new, enhanced log. The debug messages may tell you enough about what's going on to determine the problem.
Sometimes you'll need to look at the change history of code, to find out when something was broken (or was fixed). If your changesets are tied to tickets, ether enhancements or bugs, it will further shed light on why a particular change was made.
See if your testing team can repeat the error. If so, they'll probably enter it as a defect.
See if you can repeat the error in your development environment. If so, you can probably capture the problem in the debugger, which will give you much richer information that the static analysis approach detailed above.

What To Do If You Get Stuck

Brainstorm the problem with your colleagues. They may remember a crucial piece of information you don't have, or come up with an approach you haven't thought of.
Relax your assumptions. If all assumptions were valid, the application would be working perfectly, right?
Some problems are really hard to diagnose because there are multiple problems interacting. Sometimes this is because there's a bug in the error-handling code
Remember that everyting is obeying physical laws, and there is a rational cause and effect going on. Do not treat the system as though it is magical.
That said, occasionally problems happen that you just can't explain. Hopefully this is very rare.

Workarounds

If it's a problem in production, you may actually have two separate tasks: getting it working ASAP. This may involve manually fixing some data in a database, or it might require a workaround or temporary expedient of some kind. If the problem is due to a software bug, making sure a bug report is entered so it gets triaged, prioritized, and (hopefully) fixed at some point.

Make It Easier to Troubleshoot Next Time

If the logging or other visibility into the code is inadequate, making the issue difficult to troubleshoot, the bug report may be "the so-and-so module has insufficient logging for troubleshooting and support," or better yet, something more specific, e.g. "The Flooper module doesn't log the full path of the file when it gets a 'cannot open file' error."

One thing I've learned about logging over the years is you definitely want to be able to log every significant interaction with the external environment: file opens, database interactions, web service calls, etc. You can't control these external things, but you can control exactly how you call them. In the worst case, these log messages could serve as a helpful bug report to the developers of the external module, but in most cases, you'll find the problem is with your own code.

Troubleshooting Techniques for Software Support

In this post, I'd like to talk about some basic troubleshooting skills for software support people. If you're such a support person, I'm assuming you can apply some basic logic to a problem, but I'm not assuming you're any sort of programmer. In a future post, I'll talk about the same subject, only from the point of software developers.

Be Specific

Oftentimes problem reports come in rather vague - "there was a problem with the application." Uhhh, can you be a bit more specific? A lot of the job involves information-gathering, and often it's iterative - you get a problem report, do a little research and analysis, ask for more information, do more analysis, etc.

Be Methodical - Work Backwards From the Error

Here's my main point: Work Backwards. Start at the point where the error first shows up, figure out what condition(s) cause that to occur, then figure out what causes that, etc. Methodically work your way up the chain of causality. Don't attempt a fix without any analysis, unless you have a really good reason to believe it will work in this situation - in which case, you're not troubleshooting, you're just applying a remedy to a known issue. Good for you for having an effective knowledge base!

Don't Make the Problem Worse

Maybe it's because I work with medical software that enhances patient safety, but to me this brings to mind the Hippocratic Oath: "First, do no harm." Don't make the problem worse! I'll say it again: Work backwards. As Gene Kranz in the movie Apollo 13 said, "Work the problem, people." You probably don't have lives at stake, but the same principles apply. Do not jump to conclusions, and don't just "try stuff" hoping it will work! A lot of times, your first thought may be "Hmmm, I wonder if changing the twizzle setting to false might do something." Yes, it might, but it also stands a very good chance of making things worse, or causing a second, unrelated problem.

Educate Yourself

The more you know about how the application works, and how the pieces work together, the more effective you'll be. For example, if you know where local data is stored, you can look there to see if things generally look OK (file names, file sizes, etc.). This may lead you to investigate file permissions problems, etc. You don't need to know the nuts and bolts of everything down to the code level, but if you generally know how data flows through the system, what format it's in (XML, JSON, text, binary, etc.), and generally what the data is used for, you'll be much more effective. If you don't know any of this, and it's not written down anywhere, ask a developer to spend a few minutes mapping it all out with you.

Likewise, you'll make yourself a much more valuable support person if you familiarize yourself with tools that can help in troubleshooting. For example, if you're supporting any kind of web-based application, a tool that allows you to see the HTTP traffic back and forth between the browser and server will be invaluable. (Fiddler is probably the most popular such tool). Learn about it and play with it during your downtime. If you frequently deal with file contention problems in a Windows environment, get familiar with Process Monitor or other free troubleshooting tools from Microsoft.

Communicate Effectively

If you need to get help from developers or other people, be sure to summarize the problem with the relevant context. If you just forward an email thread without any additional explanation or context, it will probably not make much sense to the recipient, and they'll have to come back and ask you for more details, which will just waste everyone's time.