OSchmi's thoughts on IT and the rest of the world: June 2012

Saturday, June 30, 2012

What is wrong with log levels, Takeaway from #devopsdays

Looking at logfiles, it should be easy to determine required actions depending on some attribute of each log message. The most ovious attribute from my perspective should be the log level. But an open space discussion group at #devopsdays Mountain View 2012 came to the conclusion that this does not currently work.

Developers and operations people have a different understanding about what actions are required to happen on log messages depending on their log level. One of the participants (sorry could not remember her name) suggested that we should be able to agree on the convention that log entries with either error or warning level should be acted upon, if seen on production. Suprisingly for me many disagreed. The loudest one was Jordan Sissel. He said this is a language problem, as devs using a different interpretation than ops, and we will not be able to change the world and make all third party apps and tools to stick to this convention. And thinking about our logfiles, and how many errors show up there nobody would like (and be required) to be woken up for at 3 o clock in the morning, I think he is right.

But why? What is wrong with out current log levels? Shouldn't it be obvious how log levels translate into (required) actions?

Having a Java History, and being used to use log4j, I revisited the available log levels and how I used them.

FATAL -> Very rarely used, Only on startup failure, App will shut down after this message
ERROR -> There is something wrong the application can not suitably compensate for
WARN -> There is something wrong that should not be, but the application can compensate
INFO -> Some context what is currently going on in the app, what relevant business activities are happening (major workflow steps)
DEBUG -> show a lot of details abaout what is going on in the app.
TRACE -> never used this level

I would like to see my production systems run on INFO level. Only in special situation I would switch log level to DEBUG for some interesting classes on production, to get more context what is happening. As Log4j allows to set the log level by category, which is usually bound to the class name of the class that is calling the logger, I can usually do that safely for some time (without restarting the app) as I will do it on a single class or package only.

Unfortunately log4j is not the only logging framework in the world, not even in java. There are lots of them, and many of these try to differentiate themselfes on the log levels. This is was the java builtin Logger uses :

SEVERE (highest value)
WARNING
INFO
CONFIG
FINE
FINER
FINEST (lowest value)

The lower level values of this logger indicate something that I did not get on the log4j levels. There are different concerns addressed by the levels. Severity of an issue at the higher level, and detail or volume control on the lower levels.

From the "what should I (or an automatic log analyser) do" perspective, and to remove the language problem, more explicit levels would be helpfull, like:

PAGE_SOMEONE_INSTANTLY
PAGE_SOMEONE_IF_REPEATED
CONTEXT

For detailed tracing a LOGFILE_KILLER level could be added. This should clarify the language problem in terms of actions required (assuming that there is a basic understanding of the english language), as it explicitly says what should happen.

Maybe it should even be simpler, only PAGE_SOMEONE, as the instantly vs. if repeated decision could happen outside of the system.

And of course we will not be able to change the world, but we could start to change the part of the world, we can control.

log4j is a trademark of The Apache Software Foundation, Oracle and Java are registered trademarks of Oracle and/or its affiliates.

Bug Levels, or how simplicity made our life easier

Had a conversation with Jordan Sissel at devops days regarding log levels and the different understanding between dev and ops people. Jordan called this a language problem. I will write another post about that topic, but the discussion reminded me of another situation, where some confusion existed, due to unclear definition of levels. Here is the story.

Some years ago, we had more than 300 Bugs in our bugtracking system, about 50% of them older than 3 month and some even older than a year. It took us some effort to manage these bugs, and we were quite frustrated with this situation. Especially as the IT seemed to be the only department that cared about the bugs. Product management did not care, they more focussed on new features.

Here are some of the things we did and what we achieved:

Many bugs were left unassigned for quite a long time, and sometimes bugs would ping pong back and forth between teams. To overcome that, we introduced a dedicated role to manage these bug, called bug dispatcher. This guy would try to find out which team was most probably responsible for fixing the bug and push them to fix it. He would among other things be involved in discussions with product owners if a given bug should be fixed now, or not. This helped a lot to bring new bugs to the right teams fast, but it did not really reduce the amount of open bugs a lot.

So next we decided to close all bugs older then 3 month. If they were not fixed within this time, they could not be important. But this also required some effort, as we were not consequent enough to just automatically close them, but we ran around, and talked with the PO to close them.

As this was not such a great succes, we had a look at our bug levels and the duration a bug would stay open depending on the level assigned. We had quite a lot of different bug levels in our tracking system at this time :

Blocker
Critical
Major
Normal
Minor
Enhancement

As far as I remember, we basically took the levels the tool was shipped with. Nobody could explain, if and how the handling of a minor bug in relation to a major bug should differ. What time would be acceptable.

It showed up, that bugs having the first three levels assigned, would be fixed within a reasonable amount of time, all other levels were interpreted as: "we will never fix this".

Having understood that, we drastically changed the number of bug levels and the rules applied for handling these.

We now have only two bug levels left:

Critical
Normal

Critical means, we are or will be loosing money or reputation due to this bug, which means we fix it immediately if found in production, or we will not push this release to production if found on staging.

Normal means, we fix it with one of the next releases.

If the bug is in the tracking system, it will be fixed, unless the product owner excepts the bug behaviour as acceptable for the product, and closes the bug.

This reduced the number of open bugs drastically. And it also took away a lot of discussions. Some simple and easy to understand rules made solving this problem much easier. Getting rid of all the other levels enforced a decision.

Simplicity makes our live much easier now.