Tuesday, March 20, 2012

Berlin city center, stupid speed limit of 10 km/h

For some time now, there has been a speed limit of 10 km/h on the minor streets around Berlin's Hackescher Markt, a place famous with tourists, but also famous for its art galleries and craftsmen.

Because my wife has her workshop around this area, I often have to drive through some of these streets. This is not much fun.

Why?
Well, this speed limit is completely useless. Nobody takes it seriously. If you drive at 20 km/h, you are in for a lot of fun, especially if you have a cab or a parcel service driver behind you. If they cannot go at least 50 km/h, they will start pushing you to drive faster.

I could understand limiting the speed to 30 km/h to reduce the amount of traffic and noise for people living there or visiting the area. But I do not see the locals sticking to this 10 km/h speed limit. Even bicycles would need to go slowly, and of course nobody does that. But if you put a limit in place, then please also enforce it. Otherwise it teaches everybody to ignore traffic rules.

I have never seen the police measuring speed around there. Many people would lose their driving licenses every day if they did. Or the limit would already have been changed, as many people would have complained.
So please, Berlin City Administration, do something! And I would prefer to see a suitable speed limit.

PS: I just learned that the police do measure speed at the Gipsstrasse. So be careful :-)


Saturday, March 10, 2012

What I learned from John Allspaw and Eric Ries about root cause analysis

In his talk Advanced Postmortem Fu and Human Error 101 at the 2011 Velocity conference, John Allspaw talked, among other things, about root cause analysis. One of his points was that there is no such thing as a root cause for any given incident in complex systems. It is more like a coincidence of several things that make a failure happen. I liked his visualization using several slices of Swiss cheese, where by accident the holes of the slices are aligned in a way that a straight line can run through them, as a symbol for something bad happening.

In hindsight, there often seems to be a single action that would have prevented the incident from happening. This one thing is what managers search for when they do a root cause analysis. They hope to be able to prevent the incident from ever happening again. But if the incident was only possible due to the coincidence of many events, it makes no sense to search for a singular root cause. This would lead to a false sense of security, as in a complex system there are too many ways an incident can possibly happen.

Now I just recently read "The Lean Startup" by Eric Ries. In one of the last chapters he suggests using the Five Whys method on incidents. So if something unexpected happens, ask why it happened. The next why is then applied to the answer of the previous question. At first I thought he was "only" looking for the root cause, which would not make too much sense, as explained above. But the point is that asking these questions will not only find an underlying cause, but, even better, will bring to light a chain of events leading to the incident. And Eric Ries recommends finding a countermeasure at every level of this chain of events. This makes sure that we are not just fixing the symptoms, but improving the immune defense of our system.

I like that idea. It imposes much more work than only preventing the "root cause", but it gives a much better understanding of the system and is good training for everybody.
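To make the recommendation a bit more concrete, a Five Whys chain with one countermeasure per level could be modeled roughly like this. The incident, answers, and countermeasures below are invented for illustration; this is not an example from the book:

```java
import java.util.List;

// Illustrative sketch only: every level of the why-chain gets its own
// countermeasure, instead of fixing just the symptom or just the deepest cause.
public class FiveWhys {

    record WhyLevel(String answer, String countermeasure) {}

    static List<WhyLevel> analyze() {
        return List.of(
            new WhyLevel("The website was down",
                         "add an availability alert"),
            new WhyLevel("A deployment failed halfway",
                         "make deployments roll back automatically"),
            new WhyLevel("The deployment script was changed without testing",
                         "add a smoke test to the deployment pipeline"),
            new WhyLevel("The engineer did not know the test procedure",
                         "document the procedure"),
            new WhyLevel("There is no onboarding for operations tasks",
                         "introduce operations onboarding"));
    }

    public static void main(String[] args) {
        int level = 1;
        // Print the whole chain: one countermeasure per level, not just one fix.
        for (WhyLevel why : analyze()) {
            System.out.printf("Why %d: %s -> countermeasure: %s%n",
                    level++, why.answer(), why.countermeasure());
        }
    }
}
```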

Saturday, March 3, 2012

dbdeploy with checksum validation

We just published a modified version of Graham Tackley's dbdeploy tool on GitHub. It enhances dbdeploy with checksum validation of already applied scripts: it will throw an exception if an already applied database script has been changed afterwards.
Thanks to Michael Gruber for implementing this change while being part of our Delivery Chain Taskforce.
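The idea behind the validation can be sketched roughly as follows. This is an illustration only, not the actual code from our fork; the class and method names are made up:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch: store a SHA-256 checksum when a change script is applied, and
// fail hard on a later run if the stored checksum no longer matches the
// current content of that script.
public class ScriptChecksum {

    static String sha256Hex(String scriptContent) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(scriptContent.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    static void validate(String scriptName, String currentContent, String storedChecksum) {
        if (!sha256Hex(currentContent).equals(storedChecksum)) {
            // An already applied script was modified afterwards.
            throw new IllegalStateException(
                "Checksum mismatch for already applied script: " + scriptName);
        }
    }
}
```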

Friday, March 2, 2012

database schema evolution followup

Having thought about the comment by pvblivs on my previous post, a substantial difference between version control of software and version control of database changes became obvious to me. It is not a big deal, I just never thought about it before.

For software, the version control system (VCS) represents a hopefully consistent state of your software for any revision. This state is delivered to your dev/staging/production systems. Maybe the software is built into an executable format, but basically it is taken as a whole. Maybe, as an optimization, you only push the actual differences to your systems, but still the final state of the target system is represented by the revision in your VCS. If you want to know what has changed between two revisions, you ask your VCS for the differences between them, and it will show you.

With database changes, you actually write scripts that describe the transition between the previous state and the required state. So the db developer manually does the job that the VCS does for software. The VCS is basically used only for storage and historization of these transitions.

In contrast to software, databases typically have two streams of evolution. Structural changes and application data are pushed, like software, bottom-up from development to production. The other stream is a top-down stream of data added to the database by its users. I think this is the reason why we define changes and not the required target state: we have to find a way to merge the two streams.

A question I have now is: should there be a tool that automatically determines the required changes between two database states?

I remember listening to a presentation about a db migration tool, I think it was Liquibase, which you could point at two different databases and instruct to migrate one to the state of the other. And I remember the bad feeling I had about that idea, mainly because I did not like the idea of moving changes from staging to production this way: you would have to make damned sure not to accidentally delete the production data. You would need to define very precisely which application data to move and which user data to leave as it is. But maybe I should rethink that.

What do you think?

Tuesday, February 28, 2012

database schema evolution versus schema versioning

Why we have chosen dbdeploy for our database evolution over Flyway and some other candidates.
 
Flyway, like most other tools, assumes that you have a schema version that is strictly monotonically increasing. This means you can only evolve from a lower version to a higher version. As we are not yet doing continuous delivery, we still have a staging environment where we test a release candidate. And from time to time it happens that this RC needs a database change. That is where the trouble starts.
Let us assume that the latest staging version uses db version 22.13 and the trunk is currently on 23.5. If you now need a db change in staging, it will increase the schema version to 22.14. But your dev databases are already at 23.5. So Flyway will not allow you to add the missing script to advance from 22.13 to 22.14 on your dev databases, as these are already on 23.5. The only way to add the required change would be to recreate the database from scratch, which gets a little bit complicated and time consuming if you are working on a more than ten years old application, as we are.
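The rule that bites us here can be illustrated with a small sketch of a strictly monotonic versioning check. This is a simplification for illustration, not Flyway's actual implementation:

```java
// Simplified sketch of a strictly monotonic versioning rule: a late-arriving
// 22.14 is rejected once the schema is already at 23.5. Not Flyway's real code.
public class MonotonicVersioning {

    // Compare dotted versions like "22.13" and "23.5" numerically, part by part.
    static int compare(String a, String b) {
        String[] pa = a.split("\\."), pb = b.split("\\.");
        for (int i = 0; i < Math.max(pa.length, pb.length); i++) {
            int va = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int vb = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (va != vb) return Integer.compare(va, vb);
        }
        return 0;
    }

    // A new script is only accepted if its version is above the current schema version.
    static boolean accepts(String currentSchemaVersion, String newScriptVersion) {
        return compare(newScriptVersion, currentSchemaVersion) > 0;
    }
}
```

With this rule, `accepts("22.13", "22.14")` is true on staging, but `accepts("23.5", "22.14")` is false on the dev databases, which is exactly the problem described above.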

The main reason I can come up with for this behaviour is that it guarantees a consistent order of changes, and thus takes care of any dependencies that may exist between database changes. For example, db version 23.5 may change a table that was introduced with 22.12. Thus 23.5 will fail if 22.12 has not yet been applied.

However, with 100 people working on one big web application (which needs changing, but that is another story), most changes will not depend on each other, as they affect different functional parts of the application. And changes are often developed in parallel, which makes strictly monotonically increasing numbering in svn difficult.

To allow efficient development, the dbdeploy way of doing things looked more appropriate for us. Dbdeploy also uses a unique numbering scheme for naming the changes, but it does not enforce the order as strictly as Flyway. If you have already applied 100, 110 and 120, you can still create a 115 and get it deployed. Dbdeploy basically removes the already applied scripts from the set of available scripts and applies the remaining scripts in the order given by their numbers. Dependencies between scripts are at your own risk.
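The selection logic described above can be sketched like this (a simplified illustration, not the actual dbdeploy code):

```java
import java.util.List;
import java.util.Set;

// Dbdeploy-style selection: ignore scripts already recorded as applied and
// apply the remaining ones in the order given by their numbers, so a
// late-arriving 115 still runs after 100, 110 and 120 have been applied.
public class PendingScripts {

    static List<Integer> pending(Set<Integer> availableScripts, Set<Integer> appliedScripts) {
        // Set difference (available minus applied), sorted numerically.
        return availableScripts.stream()
                .filter(number -> !appliedScripts.contains(number))
                .sorted()
                .toList();
    }
}
```

For example, with available scripts {100, 110, 115, 120} and applied scripts {100, 110, 120}, only 115 remains pending and gets applied.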

So the basic difference is versioning a schema versus evolving a schema by applying (hopefully small and independent) changes. The only thing we missed in dbdeploy was the capability to warn about changes to scripts already applied to the database. Thus we added a sha256 checksum to the dbdeploy changelog table, along with a checksum comparison for scripts already applied. If an already applied script was changed, we set the database back to a previous production version by importing an anonymized dump and then apply the missing changes. As this is currently our normal way of database deployment, we know how to do that. But my strong hope is that we will only have to do this in one out of a hundred cases, as it takes 15 minutes. Applying the missing changes takes less than a minute, as far as we have experienced up to now.

Monday, February 27, 2012

Oracle database shutdown and startup via jdbc

While working on the automation of our database setup procedures, we learned that you can shut down and start up an Oracle DB instance remotely via JDBC. It took some trials, and finally reading this blog post, to understand how to achieve that.

From the Oracle Java API documentation it is not obvious that you have to call several methods to achieve the same effect as a shutdown transactional from within SQL*Plus. You actually have to issue four commands, as shown in the following code example:


   import static oracle.jdbc.OracleConnection.DatabaseShutdownMode.TRANSACTIONAL;

   import java.sql.Statement;
   import oracle.jdbc.OracleConnection;
   ...

   OracleConnection connection = connectionHolder.getConnection();
   try {
     // Stop accepting new transactions and wait for active ones to finish.
     connection.shutdown(TRANSACTIONAL);

     Statement stmt = connection.createStatement();
     stmt.execute("alter database close normal");
     stmt.execute("alter database dismount");
     stmt.close();

     // Finally shut down the instance.
     connection.shutdown(OracleConnection.DatabaseShutdownMode.FINAL);
   } finally {
     connection.close();
   }

For startup it is:

    import static oracle.jdbc.OracleConnection.DatabaseStartupMode.NO_RESTRICTION;
    ...

    // startup() requires the preliminary auth connection described below.
    connectionHolder.getPrelimAuthConnection().startup(NO_RESTRICTION);

    Statement statement = connectionHolder.getConnection().createStatement();
    statement.execute("alter database mount");
    statement.close();
 
 
Both calls require a connection with the sysdba or sysoper role. Startup requires a preliminary auth connection. To get one, use:

  // USER_KEY, PASSWORD_KEY, ROLE_KEY and PRELIM_AUTH_KEY are constants
  // holding the names of the corresponding Oracle connection properties.
  private OracleConnection getPrelimAuthConnection() throws SQLException {
    Properties props = new Properties();
    props.put(USER_KEY, username);
    props.put(PASSWORD_KEY, password);
    props.put(ROLE_KEY, "sysdba");
    props.put(PRELIM_AUTH_KEY, "true");

    return (OracleConnection) DriverManager.getConnection(connectionString, props);
  }

How to get things done aka agile change

In an IT organisation with 100 or more employees and a ten-year-old code base of a million lines, things tend to become slow. Some reasons for that are complexity, fear of breaking something, and lack of knowledge, especially about the legacy stuff. The older the problems are, the more hesitation is felt.

After a couple of infrastructure improvement projects I participated in, there is one lesson that stands out. The thing that helps most is to stop hesitating and "just" start doing something:

  • Find the right people,
  • understand the problems,
  • pick the most important problem,
  • collect solution ideas to reduce this problem a bit,
  • and decide on the idea to start with.
An important criterion for selecting the first or next idea to work on should be the time required to evaluate it. You should get feedback after one or two days of working on it (aka time boxing). This makes simple solution ideas more likely to be chosen, which is good. Surprisingly often, the first idea I can come up with is much more complex than the idea finally followed after discussing the ideas with a small team. The implementation of the idea is not necessarily finished after one or two days. But if you do not see sufficient progress, you should rethink the problem and solution with what was learned during these days. Often new ideas are born in this time. Just do not ride that horse until it is dead.

As an example, we were looking for a way to make the database deployment in our dev environment faster. Someone came up with the idea to use an empty database instead of an anonymized production dump. This idea had been around for at least five years, but nobody had really tried it, and many problems were expected. Having decided to try the idea, we managed to have a minimal dump available and our application starting and running with it within less than a day. So we knew it would work. We still spent two weeks automating the creation of this dump and building some quality assurance around the automation of minimal dump creation and database deployment, but we knew that it would work and save us around 10 minutes of deployment time.