Saturday, June 30, 2012

What is wrong with log levels, Takeaway from #devopsdays

Looking at logfiles, it should be easy to determine required actions depending on some attribute of each log message. The most ovious attribute from my perspective should be the log level. But an open space discussion group at #devopsdays Mountain View 2012 came to the conclusion that this does not currently work.

Developers and operations people have a different understanding about what actions are required to happen on log messages depending on their log level. One of the participants (sorry could not remember her name) suggested that we should be able to agree on the convention that log entries with either error or warning level should be acted upon, if seen on production. Suprisingly for me many disagreed. The loudest one was Jordan Sissel. He said this is a language problem, as devs using a different interpretation than ops, and we will not be able to change the world and make all third party apps and tools to stick to this convention. And thinking about our logfiles, and how many errors show up there nobody would like (and be required) to be woken up for at 3 o clock in the morning, I think he is right.

But why? What is wrong with out current log levels? Shouldn't it be obvious how log levels translate into (required) actions?

Having a Java History, and being used to use log4j, I revisited the available log levels and how I used them.
  • FATAL -> Very rarely used, Only on startup failure, App will shut down after this message
  • ERROR -> There is something wrong the application can not suitably compensate for
  • WARN -> There is something wrong that should not be, but the application can compensate
  • INFO -> Some context what is currently going on in the app, what relevant business activities are happening (major workflow steps)
  • DEBUG -> show a lot of details abaout what is going on in the app.
  • TRACE -> never used this level
I would like to see my production systems run on INFO level. Only in special situation I would switch log level to DEBUG for some interesting classes on production, to get more context what is happening. As Log4j allows to set the log level by category, which is usually bound to the class name of the class that is calling the logger, I can usually do that safely for some time (without restarting the app) as I will do it on a single class or package only.

Unfortunately log4j is not the only logging framework in the world, not even in java. There are lots of them, and many of these try to differentiate themselfes on the log levels.  This is was the java builtin Logger uses :
  • SEVERE (highest value)
  • WARNING
  • INFO
  • CONFIG
  • FINE
  • FINER
  • FINEST (lowest value)
The lower level values of this logger indicate something that I did not get on the log4j levels. There are different concerns addressed by the levels. Severity of an issue at the higher level, and detail or volume control on the lower levels. 

From the "what should I (or an automatic log analyser) do" perspective, and to remove the language problem, more explicit levels would be helpfull, like:  
  • PAGE_SOMEONE_INSTANTLY
  • PAGE_SOMEONE_IF_REPEATED
  • CONTEXT
For detailed tracing a LOGFILE_KILLER level could be added. This should clarify the language problem in terms of actions required (assuming that there is a basic understanding of the english language), as it explicitly says what should happen.

Maybe it should even be simpler, only PAGE_SOMEONE, as the instantly vs. if repeated decision could happen outside of the system.

And of course we will not be able to change the world, but we could start to change the part of the world, we can control.

log4j is a trademark of The Apache Software Foundation, Oracle and Java are registered trademarks of Oracle and/or its affiliates.

1 comment:

  1. I don't think that we can solve the "language problem" by calling things different. On top of that we won't manage to change all apps to conform to the new log levels so that in the end we will have some apps with the new log levels and lots of old apps with the old levels.

    Why not simply go back to the roots and start turning every ERROR into an action?

    Of course I would recommend a soft migration, first try this in development and then slowly pull it into production. Maybe have a whitelist file of "known" errors that are ignored. And then start by taking 10 (or 100) entries out of this file every week.

    That way we would achieve the same without turning all the world upside down. Just use what is there properly.

    ReplyDelete