Developers and operations people have a different understanding about what actions are required to happen on log messages depending on their log level. One of the participants (sorry could not remember her name) suggested that we should be able to agree on the convention that log entries with either error or warning level should be acted upon, if seen on production. Suprisingly for me many disagreed. The loudest one was Jordan Sissel. He said this is a language problem, as devs using a different interpretation than ops, and we will not be able to change the world and make all third party apps and tools to stick to this convention. And thinking about our logfiles, and how many errors show up there nobody would like (and be required) to be woken up for at 3 o clock in the morning, I think he is right.
But why? What is wrong with out current log levels? Shouldn't it be obvious how log levels translate into (required) actions?
Having a Java History, and being used to use log4j, I revisited the available log levels and how I used them.
- FATAL -> Very rarely used, Only on startup failure, App will shut down after this message
- ERROR -> There is something wrong the application can not suitably compensate for
- WARN -> There is something wrong that should not be, but the application can compensate
- INFO -> Some context what is currently going on in the app, what relevant business activities are happening (major workflow steps)
- DEBUG -> show a lot of details abaout what is going on in the app.
- TRACE -> never used this level
Unfortunately log4j is not the only logging framework in the world, not even in java. There are lots of them, and many of these try to differentiate themselfes on the log levels. This is was the java builtin Logger uses :
- SEVERE (highest value)
- WARNING
- INFO
- CONFIG
- FINE
- FINER
- FINEST (lowest value)
From the "what should I (or an automatic log analyser) do" perspective, and to remove the language problem, more explicit levels would be helpfull, like:
- PAGE_SOMEONE_INSTANTLY
- PAGE_SOMEONE_IF_REPEATED
- CONTEXT
Maybe it should even be simpler, only PAGE_SOMEONE, as the instantly vs. if repeated decision could happen outside of the system.
And of course we will not be able to change the world, but we could start to change the part of the world, we can control.
log4j is a trademark of The Apache Software Foundation, Oracle and Java are registered trademarks of Oracle and/or its affiliates.
I don't think that we can solve the "language problem" by calling things different. On top of that we won't manage to change all apps to conform to the new log levels so that in the end we will have some apps with the new log levels and lots of old apps with the old levels.
ReplyDeleteWhy not simply go back to the roots and start turning every ERROR into an action?
Of course I would recommend a soft migration, first try this in development and then slowly pull it into production. Maybe have a whitelist file of "known" errors that are ignored. And then start by taking 10 (or 100) entries out of this file every week.
That way we would achieve the same without turning all the world upside down. Just use what is there properly.