A personal bugbear of mine is developers not being able to write clear, effective logging. This seemingly trivial task appears to cause a great number of developers no end of problems, and yet it shouldn’t necessarily be that hard. Why is it then, that when I go to interrogate a log file I have to trawl though kilobytes (or worse) of meaningless rubbish to determine that:
24/04/2013 09:26:19 [ERROR] [Batch941-Thread5]: Unhandled exception has occurred null
That’s a real error message from a system I work on, and that’s all it had to say about the matter. I despair.
There’s a few basic things you can consider that will make you logging a lot more effective.
Use the appropriate log level
Fatal should be reserved for when an application, or a thread, is about to be terminated in a fashion that really isn’t expected. Errors should indicate something recoverable [at an application level] that’s gone wrong that wasn’t expected. The vast majority of fatal and error log messages are really warnings, that is messages indicating that an error has occurred but we’ve been able to carry on. Any occurrences of a fatal or error level message in your logs should have attendant bug reports with either code or configuration fixes to remove those errors. Informational messages should relate to things that people will care about day-to-day, or as additional log output for an initial higher level log output. Everything else is a debug message and will generally be turned off in production systems.
Provide the right level of information
Logs are text, are often large (I’m looking at 2 production log files that are in excess of 12Mb) and are often going to be parsed with simple tools. If I use
grep ERROR *.log I should be able to see each error, and enough information about that error to give me a high level overview of what is happening. More diagnostic information can follow the initial error at lower logging levels. There should be enough information following the error that someone reading the log file with default production settings can diagnose the issue, but not so much that they’re drowning in log lines.
Format the messages correctly
Be mindful that when you concatenate strings you may need spaces and other delimiters between output. When you split your output over multiple lines those lines may not be seen on an initial parse of the file. Also, be mindful of how values are going to be displayed. With Java the default
toString method on an object isn’t the most useful of outputs in a log file. In contrast, some objects are verbose in the extreme and may break the formatting of your error message by spewing a multiline string onto your single line error message.
Some real world examples
I regularly check our production log files for a number of reasons and find myself facing such doozies as:
30/04/2013 08:45:41 [ERROR] [[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)']: Error
The kicker here is that this could be one of a number of errors. If I see this twice in a log file I have no way of knowing if it’s the same error twice, or two different errors. The error message is badly formatted with the information [hundreds of lines of it] on the next line. Sadly I see this more than a couple of times a day and, as it’s a third party bit of code that’s responsible, theres’ not much I can do about it.
30/04/2013 15:29:28 [INFO ] [[ACTIVE] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)']: Email transport connection established
30/04/2013 15:29:29 [INFO ] [[ACTIVE] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)']: Email transport connection closed
126 occurrences in todays log files. This needs to be a debug message and an error output if it fails to establish or close the email transport connection. Ironically enough, as I was digging into the code to fix this I discovered that when it does go wrong it reports the error 3 times in 3 different places, resulting in 3 error lines in the logs. Worse still 2 of these lines only state “Error sending email” with no other information other than a stack trace from the exception. That’s three slightly different stack traces, two useless error lines and one useful error line for 1 error which could easily add 15 minutes to the diagnosis time while the developer tries to work out what’s going on.
01:17:45 [ERROR] [[ACTIVE] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)']: Failed to create URI Object
Well over 1000 occurrences today alone! My grepfu is reasonable so it I altered it to show the line after that. Turns that someone is trying something nasty with some of our search URLs, but that wasn’t immediately obvious from the error, instead I was resorting to the stack trace. Not only could the error be improved, but we can also downgrade this to a warning. Looking at the code this is only going to happen in two cases: either someone has put a bug in the code, in which case we’ll see thousands of identical warnings and do something about it; someone is trying something nasty when we’d see hundreds of variations on the warning which we can investigate and then improve or ignore depending on the outcome of the investigation. Bug raised to fix both the handling of dodgy URLs and the logging level.
I’m sure I could dig out loads more and thats just with a cursory glance of the files. Logging is an important diagnostic tool, but it’s only as good as you make it.