Wednesday, September 26, 2012

Single Minute Exchange of Die or how to counter "this will not work"

Just learned some details about the software development process of a big German enterprise. They have a 4-month iteration cycle, and any project has to run through several (3-4) of these iterations from idea to live. Thus an idea takes something around 12 to 16 months to go live. And only one iteration actually involves working on the code. The others are analysis, specification, testing and deployment. Then (hopefully) every 4 months the whole system, consisting of a lot of components, is deployed.

Thinking about this situation, I wondered what it would be like to try to convince the responsible people to reduce this cycle time and/or decouple the components in terms of their release cycles. In this discussion I expected a lot of "This will not work" statements.

Changing an enterprise's mindset to replace "this will not work" with "we currently do not know how this can possibly work, but we will give it a try" sounds like a good first challenge to take on in such an organization. Especially as it felt like the middle management was not much interested in a change to this process.
This change in language could bring discussions to focus more on the "how" than on the "if". Unbelievably, they recently changed the iteration length from three to four months, which just sounds like a change in the wrong direction, not following one of my favourite agile quotes: "If it hurts, do it more often".

So how do you convince a group of middle management folks to try this first step and get a change going?

That is where I remembered the Single Minute Exchange of Die part of "just in time production", which I heard about in the context of the Toyota Production System. I would assume that Shigeo Shingo experienced a lot of "this will not work" when he proposed his idea of changing a process that usually took hours or days to only take minutes. He definitely did not know the perfect solution when he started; they changed the process iteration by iteration to make it more efficient. But they started and made it a great success.

The previously mentioned process change from a three- to a four-month iteration interval additionally indicates a batch size or complexity problem. And still it seems to be counterintuitive to many people that smaller batch sizes are more efficient. I remember reading a blog post discussing batch size in a software development context, but unfortunately could not remember where I read it. When searching for batch size, however, I found this article very helpful. And the type of constraints section from this page on the same site fits even better. Actually, the whole site seems worth reading.

Maybe I should give Goldratt's book What Is This Thing Called Theory of Constraints a try.

Saturday, June 30, 2012

What is wrong with log levels? A takeaway from #devopsdays

Looking at logfiles, it should be easy to determine required actions depending on some attribute of each log message. The most obvious attribute from my perspective would be the log level. But an open space discussion group at #devopsdays Mountain View 2012 came to the conclusion that this does not currently work.

Developers and operations people have a different understanding of what actions are required to happen on log messages depending on their log level. One of the participants (sorry, I could not remember her name) suggested that we should be able to agree on the convention that log entries with either error or warning level should be acted upon if seen on production. Surprisingly for me, many disagreed. The loudest one was Jordan Sissel. He said this is a language problem, as devs use a different interpretation than ops, and we will not be able to change the world and make all third-party apps and tools stick to this convention. And thinking about our logfiles, and how many errors show up there that nobody would like (and be required) to be woken up for at 3 o'clock in the morning, I think he is right.

But why? What is wrong with our current log levels? Shouldn't it be obvious how log levels translate into (required) actions?

Having a Java history, and being used to log4j, I revisited the available log levels and how I used them.
  • FATAL -> Very rarely used; only on startup failure. The app will shut down after this message.
  • ERROR -> Something is wrong that the application cannot suitably compensate for.
  • WARN -> Something is wrong that should not be, but the application can compensate.
  • INFO -> Some context on what is currently going on in the app, what relevant business activities are happening (major workflow steps).
  • DEBUG -> Show a lot of details about what is going on in the app.
  • TRACE -> Never used this level.
I would like to see my production systems run on INFO level. Only in special situations would I switch the log level to DEBUG for some interesting classes on production, to get more context on what is happening. As log4j allows setting the log level by category, which is usually bound to the class name of the class that is calling the logger, I can usually do that safely for some time (without restarting the app), as I will do it on a single class or package only.
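The per-category switch described above can be sketched with the JDK's built-in logging, which offers the same per-logger level control as log4j; the category name com.example.Checkout is a made-up placeholder:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class CategoryLevelDemo {

    // Raise verbosity for one category only; all other loggers keep
    // inheriting the root level (INFO by default), so the rest of the
    // production logs stay quiet.
    public static void enableDebugFor(String category) {
        Logger.getLogger(category).setLevel(Level.FINE); // FINE ~ log4j DEBUG
    }

    public static boolean isDebugEnabled(String category) {
        return Logger.getLogger(category).isLoggable(Level.FINE);
    }
}
```

With log4j the equivalent is Logger.getLogger("com.example.Checkout").setLevel(Level.DEBUG), which can likewise be done at runtime without an application restart.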

Unfortunately, log4j is not the only logging framework in the world, not even in Java. There are lots of them, and many of these try to differentiate themselves on the log levels. This is what the Java built-in Logger uses:
  • SEVERE (highest value)
  • WARNING
  • INFO
  • CONFIG
  • FINE
  • FINER
  • FINEST (lowest value)
The lower level values of this logger indicate something that I did not get from the log4j levels: there are different concerns addressed by the levels. Severity of an issue at the higher levels, and detail or volume control at the lower levels.

From the "what should I (or an automatic log analyser) do" perspective, and to remove the language problem, more explicit levels would be helpful, like:
  • PAGE_SOMEONE_INSTANTLY
  • PAGE_SOMEONE_IF_REPEATED
  • CONTEXT
For detailed tracing, a LOGFILE_KILLER level could be added. This should resolve the language problem in terms of required actions (assuming a basic understanding of the English language), as the level explicitly says what should happen.

Maybe it should be even simpler, only PAGE_SOMEONE, as the instantly vs. if-repeated decision could happen outside of the system.
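The JDK's logging API actually allows defining such levels, since java.util.logging.Level can be subclassed. The following is a sketch of what an action-oriented PAGE_SOMEONE level could look like; the class name and the numeric values are my own choices, not an established convention:

```java
import java.util.logging.Level;

public class ActionLevels {

    // Above SEVERE (1000), so any standard threshold filter lets it through.
    public static final Level PAGE_SOMEONE = new ActionLevel("PAGE_SOMEONE", 1100);

    // Same weight as INFO (800): background information, no action required.
    public static final Level CONTEXT = new ActionLevel("CONTEXT", 800);

    // Level's constructor is protected, so a tiny subclass is needed.
    private static class ActionLevel extends Level {
        ActionLevel(String name, int value) {
            super(name, value);
        }
    }
}
```

Used as logger.log(ActionLevels.PAGE_SOMEONE, "payment processing down"), a log shipper could then route purely on the level name.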

And of course we will not be able to change the world, but we could start to change the part of the world we can control.

log4j is a trademark of The Apache Software Foundation, Oracle and Java are registered trademarks of Oracle and/or its affiliates.

Bug Levels, or how simplicity made our life easier

Had a conversation with Jordan Sissel at devopsdays regarding log levels and the different understanding between dev and ops people. Jordan called this a language problem. I will write another post about that topic, but the discussion reminded me of another situation where some confusion existed due to an unclear definition of levels. Here is the story.
 
Some years ago, we had more than 300 bugs in our bug tracking system, about 50% of them older than 3 months and some even older than a year. It took us some effort to manage these bugs, and we were quite frustrated with this situation. Especially as IT seemed to be the only department that cared about the bugs. Product management did not care; they were more focussed on new features.

Here are some of the things we did and what we achieved:

Many bugs were left unassigned for quite a long time, and sometimes bugs would ping-pong back and forth between teams. To overcome that, we introduced a dedicated role to manage these bugs, called the bug dispatcher. This guy would try to find out which team was most probably responsible for fixing a bug and push them to fix it. He would, among other things, be involved in discussions with product owners about whether a given bug should be fixed now or not. This helped a lot to bring new bugs to the right teams fast, but it did not really reduce the amount of open bugs a lot.

So next we decided to close all bugs older than 3 months. If they were not fixed within this time, they could not be important. But this also required some effort, as we were not consequent enough to just automatically close them; instead we ran around and talked with the POs to close them.

As this was not such a great success, we had a look at our bug levels and the duration a bug would stay open depending on the level assigned. We had quite a lot of different bug levels in our tracking system at this time:
  • Blocker
  • Critical
  • Major
  • Normal
  • Minor
  • Enhancement
As far as I remember, we basically took the levels the tool shipped with. Nobody could explain if and how the handling of a minor bug should differ from that of a major bug, or what resolution time would be acceptable.

It turned out that bugs assigned one of the first three levels would be fixed within a reasonable amount of time; all other levels were interpreted as "we will never fix this".
    
Having understood that, we drastically changed the number of bug levels and the rules applied for handling these.

We now have only two bug levels left:
  • Critical
  • Normal
Critical means we are or will be losing money or reputation due to this bug, which means we fix it immediately if found in production, or we will not push the release to production if found on staging.

Normal means, we fix it with one of the next releases.

If a bug is in the tracking system, it will be fixed, unless the product owner accepts the bug behaviour as acceptable for the product and closes the bug.

This reduced the number of open bugs drastically. And it also did away with a lot of discussions. A few simple, easy-to-understand rules made solving this problem much easier. Getting rid of all the other levels forced a decision.

Simplicity makes our life much easier now.

Monday, April 30, 2012

I want office desks with wheels

The one problem I remember from all IT retrospective meetings, many from back in times when DevOps was not yet practiced, is "communication", or more precisely "missing communication".

One of the principles behind the Agile Manifesto that I think fits here is:

The most efficient and effective method of conveying information to and within a development
team is face-to-face conversation. 

Although I would replace "development team" with "organization".
I think it is well known that geographical distance affects communication. Somewhere (can't remember where) I read a hypothesis that you basically have four classes of communication depending on distance: same room, same building, same country, other country. And between each class, communication quality/rate/probability will drop significantly. Maybe by an order of magnitude?

Communication needs people to overcome a wittingly or unwittingly present barrier. The more complicated communication is, the higher this barrier will be. Will you pick up the phone, will you go and check if the other one is at their desk, or do you fall back to writing an email, which makes communication asynchronous and is a one-way communication prone to misunderstanding?

If you are in the same room, you can communicate with everybody just by speaking up. Hopefully only people involved with the same product are in this room, and there are no (cubicle) walls.

If you are in the same building, you can stand up and walk to the person you need, but you may not find her and waste time.

In the same country, you will most probably use electronic communication options.


Other countries may add timezone and language issues.  

In short, people should talk, face to face if possible. 

But what if you are not in the same room, but in the same building, and your issue with somebody else is more than a short talk? Something you have to work on together for, let's say, hours or even days.

Well, I think you should sit together for this time. People/teams should relocate to optimize communication, and this should be as easy as possible. Great if you have laptops and wifi, and there is some free desk space around. If not, why not use desks with wheels, so it is easy to move them around, plug in power and network, and off we go.

I have wanted that for years, and have not gotten it yet. But now I have read the Valve employee handbook. Wow, these guys, among other very interesting ideas, have desks with wheels, and an automatic tool to locate where people are sitting, depending on where their workstations are plugged in.

I like that idea. Especially as I think it is no coincidence that the word agile is a term related to motion/movement.

Let's move!  

Wednesday, March 21, 2012

Reading "The Design of Everyday Things" will change your life

Some months ago, I read a tweet from @oschoen about the book "The Design of Everyday Things" by Donald A. Norman, stating that reading this book (quote:) "will ruin your ability to walk through doors without complaining". I wondered why, and started reading.

Now, some months later, after reading the book, I know what @oschoen meant.
I just visited the office of a law firm. Very stylish office design, but even people who have worked there for some time do not know how to operate the doors of the wardrobe. And that is only one example. Doors that give no clues about how to use them, or even worse, give misleading cues. Taps that win design prizes, but until you know how they work, you always have to fiddle around with them for a while.

To make it worse, there are catastrophes that happen because of the bad design of switches or other things needed to operate power plants, trains, airplanes or whatever. And people feel bad about themselves for not understanding how things work, instead of blaming the designers who built them.

Everybody designing products, mobile or PC applications and web interfaces should give this book a try.


Tuesday, March 20, 2012

Berlin city center, stupid speed limit of 10 km/h

For some time now, there has been a speed limit of 10 km/h on the minor streets around Berlin's Hackescher Markt, a famous place for tourists, but also famous for art galleries and craftsmen.

Because my wife has her workshop around this area, I often have to drive through some of these streets. This is not much fun.

Why?
Well, this speed limit is completely useless. Nobody takes it seriously. If you drive 20 km/h, you are in for a lot of fun. Especially if you have a cab or a parcel service driver behind you. If they cannot go at least 50 km/h, they will start pushing you to go faster.

I could understand limiting the speed to 30 km/h to reduce the amount of traffic and noise for people living there or visiting the area. But I do not see the local people sticking to this 10 km/h speed limit. Even bicycles would need to go slowly, and of course nobody does that. But if you put a limit in place, then please also enforce it. Otherwise this teaches everybody to ignore traffic rules.

I never saw the police measuring speed around there. Many people would lose their driving licenses every day if they did. Or the limit would have been changed already, as many people would have complained.
So please, Berlin City Administration, do something! And I would prefer to see a suitable speed limit.

PS: I just learned that the police do measure speed at the Gipsstrasse. So be careful :-)


Saturday, March 10, 2012

What I learned from John Allspaw and Eric Ries about root cause analysis

In his talk Advanced Postmortem Fu and Human Error 101 at the 2011 Velocity conference, John Allspaw talked among other things about root cause analysis. One of his points was that there is no such thing as a root cause for any given incident in complex systems. It is more like a coincidence of several things that makes a failure happen. I liked his visualization using several slices of Swiss cheese, where accidentally the holes of several slices are aligned in a way that a straight line can run through them, as a symbol for something bad happening.

In hindsight, there often seems to be a single action that would have prevented the incident from happening. This one thing is what managers search for when they do a root cause analysis. They hope to be able to prevent this incident from ever happening again. But if the incident was possible due to the coincidence of many events, it makes no sense to search for a singular root cause. This would lead to a false sense of security, as in a complex system there are too many ways an incident can possibly happen.

Now I just recently read "The Lean Startup" by Eric Ries. In one of the last chapters he suggests using the five whys method on incidents. So if something unexpected happens, ask why it happened. The next why is then applied to the answer of the previous question. At first I thought he was "only" looking for the root cause, and this would not make too much sense, as explained above. But the point is that asking these questions will not only find an underlying cause, but rather bring to light a chain of events leading to the incident. And Eric Ries recommends finding a countermeasure on every level of the chain of events. This makes sure that we are not just fixing the symptoms, but will improve the immune defense of our system.

I like that idea. It imposes much more work than only preventing the "root cause", but it gives a much better understanding of the system and is good training for everybody.

Saturday, March 3, 2012

dbdeploy with checksum validation

We just published a modified version of Graham Tackley's dbdeploy tool on github. Dbdeploy is enhanced with checksum validation of already applied scripts. It will throw an exception if an already applied database script has been changed afterwards.
Thanks to Michael Gruber for implementing this change while being part of our Delivery Chain Taskforce.

Friday, March 2, 2012

database schema evolution followup

Having thought about pvblivs' comment on my previous post, a substantial difference between version control of software and of database changes became obvious to me. It is not a big deal; I just never thought about it before.

For software, the version control system (VCS) represents a hopefully consistent state of your software at any revision. This state will be delivered to your dev/staging/production system. Maybe the software is built into an executable format, but basically it is taken as a whole. Maybe for optimization you only push the actual differences to your systems, but still the final state of the target system will be represented by the revision in your VCS. If you want to know what has changed between two revisions, you ask your VCS what the differences are between these revisions, and it will show you.

With database changes, you actually write scripts that describe the transition between the previous state and the required state. So the db developer manually does the job that the VCS does for software. The VCS is basically used only for storage and historization of these transitions.

In contrast to software, databases typically have two streams of evolution. Structural changes and application data are pushed, like software, bottom-up from development to production. The other stream is a top-down stream of data added to the database by users. I think this is the reason why we define changes and not the required target state, as we have to find a way to merge the two streams.

A question I have now is: should there be a tool that automatically determines the required changes between two database states?

I remember listening to a presentation about a db migration tool, I think it was Liquibase, which you could point at two different databases and instruct to migrate one to the state of the other. And I remember the bad feeling I had regarding that idea. Mainly because I did not like the idea of moving changes from staging to production this way, because you would have to make damned sure not to accidentally delete the production data. You would need to define very well which application data to move and which user data to leave as it is. But maybe I should rethink that.

What do you think?

Tuesday, February 28, 2012

database schema evolution versus schema versioning

Why we have chosen dbdeploy for our database evolution over Flyway and some other candidates.
 
Flyway, like most other tools, assumes that you have a schema version that is strictly monotonically increasing. This means you can only evolve from a lower version to a higher version. As we are not yet doing continuous delivery, we still have a staging environment where we test a release candidate. And from time to time it happens that this RC needs a database change. And that is where the trouble starts.
Let us assume that the latest staging version uses db version 22.13, and trunk is currently on 23.5. If you now need a db change in staging, it will increase the schema version to 22.14. But your dev databases are already at 23.5. So Flyway will not allow you to add the missing script to advance from 22.13 to 22.14 on your dev databases, as these are already on 23.5. The only way to add the required change would be to recreate the database from scratch, which gets a little bit complicated and time-consuming if you are working on a more than 10-year-old application, as we are.

The main reason I can come up with for this behaviour is that it guarantees a consistent order of changes, and thus resolves dependencies that may occur between database changes. For example, db version 23.5 may change a table that was introduced with 22.12. Thus 23.5 will fail if 22.12 has not yet been applied.

However, with 100 people working on one big web application -which needs changing, but that is another story- most changes will not depend on each other, as they affect different functional parts of the application. And often changes are developed in parallel, which makes strictly monotonically increasing numbering in svn difficult.

To allow efficient development, the dbdeploy way of doing things looked more appropriate for us. Dbdeploy also uses a unique numbering scheme for naming the changes, but it does not enforce the order as strictly as Flyway. If you have already applied 100, 110 and 120, you can still create a 115 and get it deployed. Dbdeploy basically removes the already applied scripts from the set of available scripts and applies the remaining scripts in the order given by their numbers. Dependencies between scripts are at your own risk.
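The selection rule just described — take the available scripts, drop the ones already recorded as applied, and run the rest in numeric order — can be sketched in a few lines (the class and method names are mine, not dbdeploy's API):

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PendingScripts {

    // dbdeploy-style selection: everything not yet applied, in numeric
    // order; gaps like a late-arriving 115 are fine. Dependencies between
    // scripts remain the script author's responsibility.
    public static List<Integer> pending(Set<Integer> available, Set<Integer> applied) {
        TreeSet<Integer> todo = new TreeSet<>(available);
        todo.removeAll(applied);
        return List.copyOf(todo);
    }
}
```

With 100, 110 and 120 applied and 115 added later, pending(Set.of(100, 110, 115, 120), Set.of(100, 110, 120)) yields [115] — exactly the script a strictly monotonic scheme would refuse.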

So the basic difference is versioning a schema versus evolving a schema by applying (hopefully small and independent) changes. The only thing we missed in dbdeploy was the capability to warn about changes to scripts already applied to the database. Thus we added a sha256 checksum to the dbdeploy changelog table, and added a checksum comparison for already applied scripts. If an already applied script was changed, we set the database back to a previous production version by importing an anonymized dump and applying the missing changes. As this is currently our normal way of database deployment, we know how to do that. But my strong hope is that we will only have to do that in one out of a hundred cases, as it takes 15 minutes. Applying the missing changes takes less than a minute, as far as we have experienced up to now.
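The checksum comparison itself boils down to hashing the script text and comparing it with the value stored in the changelog table. A minimal sketch of the idea (the class name and hex encoding are my own choices, not the published tool's code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ScriptChecksum {

    // Hex-encoded SHA-256 of a script's text. Stored once when the script
    // is applied; on later runs, a differing value means the script was
    // edited after it had already been applied.
    public static String sha256(String scriptContent) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(scriptContent.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Any edit to an applied script, even a single character, changes the checksum and can thus trigger the exception mentioned above.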

Monday, February 27, 2012

Oracle database shutdown and startup via jdbc

While working on the automation of our database setup procedures, we learned that you can shut down and start up an Oracle db instance remotely via jdbc. It took some trials, and finally reading this blog post, to understand how to achieve that.

From the Oracle Java API documentation it is not obvious that you have to call several methods to achieve the same effect as a shutdown transactional from within sqlplus. You actually should issue four commands, as shown in the following code example:


   import java.sql.Statement;

   import oracle.jdbc.OracleConnection;

   import static oracle.jdbc.OracleConnection.DatabaseShutdownMode.TRANSACTIONAL;
   ...

   OracleConnection connection = connectionHolder.getConnection();
   try {
     connection.shutdown(TRANSACTIONAL);

     Statement stmt = connection.createStatement();
     stmt.execute("alter database close normal");
     stmt.execute("alter database dismount");
     stmt.close();
     connection.shutdown(OracleConnection.DatabaseShutdownMode.FINAL);
   } finally {
     connection.close();
   }

For startup it is:

    import static oracle.jdbc.OracleConnection.DatabaseStartupMode.NO_RESTRICTION;
    ...

    connectionHolder.getPrelimAuthConnection().startup(NO_RESTRICTION);

    Statement statement = connectionHolder.getConnection().createStatement();
    statement.execute("alter database mount");
    statement.close();
 
 
Both calls require a connection with the sysdba or sysoper role. Startup requires a preliminary auth connection. To get one, use:

  import java.sql.DriverManager;
  import java.sql.SQLException;
  import java.util.Properties;

  // USER_KEY, PASSWORD_KEY, ROLE_KEY and PRELIM_AUTH_KEY are String constants
  // holding the names of the corresponding Oracle connection properties.
  private OracleConnection getPrelimAuthConnection()
      throws SQLException {
    Properties props = new Properties();
    props.put(USER_KEY, username);
    props.put(PASSWORD_KEY, password);
    props.put(ROLE_KEY, "sysdba");
    props.put(PRELIM_AUTH_KEY, "true");

    return (OracleConnection) DriverManager.getConnection(connectionString, props);
  }

How to get things done aka agile change

In an IT organisation with 100 or more employees and a 10-year-old code base of a million lines, things tend to become slow. Some reasons for that are complexity, fear of breaking something, and lack of knowledge, especially about the legacy stuff. The older the problems are, the more hesitation is felt.

After a couple of infrastructure improvement projects I participated in, there is one most important lesson I learned: the thing that helps most is to stop hesitating and "just" start doing something.

  • Find the right people, 
  • understand the problems, 
  • pick the most important problem, 
  • collect solution ideas to reduce this problem a bit 
  • and decide on the idea to start with. 
An important criterion for selecting the first/next idea to work on should be the time required to evaluate the idea. You should get feedback after one or two days of working on it (aka time boxing). This makes simple solution ideas more likely to be chosen, which is good. Surprisingly often, the first idea I can come up with is much more complex than the idea finally followed after discussing the ideas with a small team. The implementation of the idea is not necessarily finished after one or two days. But if you do not see sufficient progress, you should rethink the problem and solution with what was learned during these days. Often new ideas are born in this time. Just do not ride that horse until it is dead.

As an example, we were looking for a solution to make database deployment in our dev environment faster. Someone came up with the idea to use an empty database instead of an anonymized production dump. This idea had been around for at least five years, but nobody had really tried it, and many problems were expected to happen. Having decided to try that idea, we managed to have a minimal dump available and our application starting and running with it within less than a day. So we knew it would work. We still spent two weeks automating the creation of this dump and building some quality assurance around the automation of minimal dump creation and database deployment, but we knew that it would work and save us around 10 minutes of deployment time.