Database versioning best practices



The topic described in this article is part of my Database Delivery Best Practices Pluralsight course.

Keeping track of your application’s database is not an easy task. Database schemas tend to get out of sync across environments, and one of the databases may be missing a crucial piece of data. Such situations are irritating, especially when caught in production.

The situation gets worse when you develop redistributable software. In this case, each of your clients has their own database instance whose structure may differ from the others’. In such a project, keeping track of your clients’ databases can become a nightmare.

Let’s look at the database versioning best practices that help us deal with this problem.

Database versioning: the problem

When you are a single programmer working on a project that has not yet shipped to production, there is no such problem as database versioning. You just change your DB schema the way you want, and it always works.

Problems arise when your software starts operating in production or a new team member joins you to work on the database-related parts of your project. As soon as you have more than one database instance, they start getting out of sync. Even with a single instance, it takes a significant amount of time to synchronize the changes when more than one developer works with it.

Does this look familiar? I bet you have been in such situations, probably more than once. I certainly have.

Database versioning best practices

Fortunately, we are not alone. There is plenty of material written on this topic, as well as software aimed at solving this problem. I recommend this book if you want to dive deeper into the subject. It is the definitive guide to evolving your database along with the code that uses it.

Alright, so what are these database versioning best practices?

Best practice #1: we need to treat the application database and the reference data in it as regular code. That means we should store both the database schema and the reference data in a source control system.

Note that this rule covers not only the schema of the database but also the reference data in it. Reference data is data that is mandatory for the application to run. For example, if you have a dictionary of all possible customer types on whose existence your application relies, you should store it in the source control system as well.
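
As an illustration, here is a minimal sketch of what a versioned reference data script could look like. The table, columns, and values are all hypothetical, and the syntax is SQL Server flavored:

    -- Reference data the application relies on: the dictionary of customer types.
    -- This script lives in source control next to the schema scripts.
    CREATE TABLE dbo.CustomerType (
        CustomerTypeID int NOT NULL PRIMARY KEY,
        Name nvarchar(50) NOT NULL
    );

    INSERT INTO dbo.CustomerType (CustomerTypeID, Name)
    VALUES (1, N'Regular'),
           (2, N'Premium'),
           (3, N'Wholesale');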

Best practice #2: we have to store every change to the database schema and the reference data explicitly. This means that for every modification we make, we should create a separate SQL script with the changes. If a modification affects both the schema and the reference data, both changes should be reflected in a single script.
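
As a sketch, a single script covering both kinds of changes might look like this, reusing the hypothetical CustomerType dictionary from above (the file naming convention is also just an illustration):

    -- 014_add_customer_type_discount.sql
    -- The schema change and the related reference data change
    -- travel together in one script.
    ALTER TABLE dbo.CustomerType
        ADD DiscountPercent int NOT NULL
        CONSTRAINT DF_CustomerType_DiscountPercent DEFAULT 0;
    GO

    UPDATE dbo.CustomerType
    SET DiscountPercent = 10
    WHERE Name = N'Premium';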

Adhering to this rule is a vital part of building a successful database versioning system. Many projects keep their database schema in source control, but often it is just a snapshot of the latest version, whatever that happens to be. All the changes to it are tracked by the source control system itself; they are not stored explicitly. Moreover, changes to the reference data are often not tracked at all.

Tools such as the Visual Studio database project promote that approach and urge programmers to use auto-generated upgrade scripts for schema updates. While this may work well in small projects, in larger projects tracking database changes with auto-generated scripts becomes a burden. We will talk about the Visual Studio database project and other available tools in the next post.

Best practice #3: every SQL script file must be immutable after it is deployed to a production or staging environment.

The whole point of storing the changes in separate files is being able to track each of them. When we modify existing SQL scripts, we lose all the benefits these database versioning best practices provide. Keep the script files unchanged after their deployment. If you need to roll back changes that have already shipped, create a separate script for that.
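
For example, if the hypothetical migration 014 shown earlier turned out to be a mistake after deployment, we would not edit it. Instead, we would ship a new compensating script:

    -- 015_remove_customer_type_discount.sql
    -- Reverts migration 014 without touching the already-deployed script.
    ALTER TABLE dbo.CustomerType
        DROP CONSTRAINT DF_CustomerType_DiscountPercent;

    ALTER TABLE dbo.CustomerType
        DROP COLUMN DiscountPercent;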

Best practice #4: all changes to the database’s schema and reference data have to be applied through the scripts. None of them may be applied manually.

If we modify the database bypassing our scripts, the whole idea of database versioning becomes worthless, so we need to make sure the changes are made only via the SQL scripts we create.

Best practice #5: every developer in the team should have their own database instance.

Often, teams start with a single database in the developer environment. That works well in the beginning, but as the database grows, simultaneous modifications become harder and harder until, at some point, they stop working at all.

Changes programmers make are often incompatible, so it’s a good idea for each programmer to have a separate DB instance to avoid such collisions. If developers do modify related pieces of the DB schema simultaneously, those conflicts can be resolved with the source control system, just like conflicts in C#/Java/etc. code.

Moreover, if you have several branches of your code base, you might also want to create a separate DB instance for each of them, depending on how different the databases in these branches are.

Best practice #6: the database version should be stored in the database itself. I usually create a separate table named Settings and keep the version there. Don’t use complex notations like “x.y.z” for the version number; just use a single integer.
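
A minimal sketch of such a table (the exact layout is just one possible option):

    -- The database stores its own version as a single integer.
    CREATE TABLE dbo.Settings (
        [Version] int NOT NULL
    );

    INSERT INTO dbo.Settings ([Version]) VALUES (1);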

What are the benefits of such an approach?

So, what benefits do these database versioning best practices give us?

The first and most important advantage is that with this approach we no longer have problems with database schema mismatches. Automatic upgrades to the latest version solve them completely, provided, of course, that we fully adhere to the rules described above.
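
One way to wire up such automatic upgrades is to have every script check the version in the Settings table and bump it on success, so a script can only ever be applied to the exact version it expects. Here is a sketch, assuming the Settings table from best practice #6 and a hypothetical Customer table:

    -- 002_add_customer_email.sql
    -- Runs only when the database is exactly at the version this script expects.
    IF (SELECT [Version] FROM dbo.Settings) = 1
    BEGIN
        ALTER TABLE dbo.Customer
            ADD Email nvarchar(256) NULL;

        UPDATE dbo.Settings SET [Version] = 2;
    END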

This is especially useful when you don’t have a single production database, but every client has their own DB instance. Managing DB versions in such circumstances might become hell if you don’t employ proper versioning techniques.

Another gain these best practices provide is the high cohesion of the database changes. Every notable modification of the schema and the reference data is reflected in a single place, not spread across the application.

The SQL upgrade scripts also grant high cohesion in the sense that they contain every DB change required for a feature, so it’s easy to understand what modifications were made to the database in order to unlock a particular piece of functionality. Keeping related schema and data changes together in a single file also helps a lot.

The approach described in this post is applicable even if you didn’t follow it from the very beginning. To put it into practice, you just need to create an initial script with the database schema you currently have in production and start changing it incrementally from that moment on. The current schema becomes version #1, from which you can move forward using the techniques we discussed above.
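
A sketch of such an initial (baseline) script, with placeholder tables standing in for whatever your production schema actually contains:

    -- 001_baseline.sql
    -- Snapshot of the schema as it exists in production today.
    -- From this point on, every change is a new incremental script.
    CREATE TABLE dbo.Customer (
        CustomerID int NOT NULL PRIMARY KEY,
        Name nvarchar(100) NOT NULL
    );

    CREATE TABLE dbo.Settings ([Version] int NOT NULL);
    INSERT INTO dbo.Settings ([Version]) VALUES (1); -- the current schema becomes version #1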

Summary

In this article, we looked at database versioning best practices. In the next posts, we’ll see what software is at our disposal. I’ll also show a lightweight tool I use for applying SQL upgrade scripts.

  • Stephen

    You can use the upgrade-framework library from VMware to help with this: https://github.com/vmware/upgrade-framework

  • Julien

    Hi,
    You only refer to backward compatibility issue. Do you have any advice for forward compatibility, especially for redistributable software? Thank you.

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      I actually haven’t encountered situations where forward compatibility works well. In my opinion, attempts to process data without knowing its semantics often lead to disasters; data loss and data corruption are among them.

      I think the best approach here is to gracefully inform the administrator that the database is not compatible with the installed version of the software and refuse to continue working after that. This shouldn’t be a big deal, though, as installers should automatically upgrade the existing database to the version compatible with the software version being installed.

  • Bob

    “Best practice #6: database version should be stored in the database itself. I usually tend to create a separate table named Settings and keep the version there. Don’t use complex notations like “x.y.z” for the version number, just use a single integer.”

    If your database version is a single integer, then you’re assuming your version control will never have any branches, right? That assumption is rarely even true for a single developer.

    I can tell how old/mature a database migration system is by whether they’ve discovered (and solved) this problem yet. Rails fixed it years ago. Alembic isn’t at 1.0 yet but they solved it from the start.

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      If your database version is a single integer, then you’re assuming your version control will never have any branches, right?

      No, I use this approach for environments with branches as well.

      I did use the “x.y.z” notation before, but I don’t see how it helps resolve conflicts between branches any better than a single integer does.

      • http://www.drugbank.ca Craig Knox

        Rails uses the timestamp, which I believe makes the most sense.

        • Bob

          It’s pretty good. Most of the time it’s good enough. One of the key benefits is that Rails stores the database state as a set of migrations which have been applied, not just a single integer, which makes it easier to figure out where you are, when you need to fix things by hand.

          I think Alembic is a little bit nicer, in some ways, since it uses a unique ID, with explicit “parent migration” links. You tell it where in the migration DAG you want to be (which is usually “head”, but not always), and it gets you there.

          I suppose that’s the difference between the two communities, in a nutshell. In Rails, they try to make everything automatic, and it’s up to you to fix it, if it turns out the automatic way doesn’t work. In Python, they try to make everything explicit, and let you know if there’s any ambiguity.

      • Bob

        Can you explain how you store branch state in a single integer? I feel like I’m missing something.

        Master has migration M1, branch B1 has migration M2, branch B2 has migration M3. What integer does a B1 database have? What integer does a B2 database have? When you merge B1 and B2 into master, what integer does the database have then?

        • http://enterprisecraftsmanship.com/ Vladimir Khorikov

          Perhaps it is I who’s missing something. I believe the situation here is the same as with the “x.y.z” notation, except that a single integer is a bit simpler to manage.

          In your example it would look like this:
          1) Master’s DB is of version 1. B1 with M2 has version 2; B2 with the M3 migration also has version 2, as B1 and B2 don’t know about each other.
          2) B1 merges to master. Master’s DB is now of version 2.
          3) B2 gets the latest version before merging and sees there’s a new migration M2 it didn’t know of. B2 then adjusts its code: changes M3 to have version 3 instead of 2 and modifies the script itself if there are any changes that conflict with M2.

          • Bob

            In the scenario you’re describing, the database doesn’t (and can’t) know its own state. When it’s “version 2”, that could be any branch downstream of version 1. That’s a problem.

            Suppose the programmer who wrote B1 merged and deployed it, and then the programmer who wrote B2 decides to merge it. He has to know that B1 was already deployed, so he picks the correct migration to call “2” and “3”. Either way he picks, he’s still screwing everybody who was working on the “wrong” side, because half the programmers will have a working database where “2” means M1 and the other half will have a working database where “2” means M2.

            Also, when you merge B1 and B2, you potentially have to renumber any number of downstream migrations. It’s easy when they only have one migration each, but it’s common for there to be more. If B1 had M2, M3, and M4, and B2 had M5, M6, and M7, you’d have to manually figure out an acceptable ordering for them.

            I don’t think I ever proposed a dotted “x.y.z” notation, so I don’t know where that came from. I do think, though, that you need unique IDs for migrations, and/or an explicit ordering, to avoid merge hell.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            The database doesn’t know its state; it’s the branch that does. So, in this case, the master branch, B1, and B2 will each contain the information about the latest version (and all previous versions) of the database they work with.

            Either way he picks, he’s still screwing everybody who was working on the “wrong” side

            I think this is where the misunderstanding comes from. How does he screw them? He needs to rearrange the migration numbers, that’s true, but it doesn’t affect anything except the numbers of the migrations – they should be adjusted to avoid overlapping.

            In my experience, even when I had two branches with a lot of DB work done in each of them, the hardest part was not how to name the migrations; it was the actual code in those migrations that was potentially intersecting. It doesn’t matter how we number our migrations, whether by timestamp or by integer, the hardest part still has to be done manually: we need to check out all the migrations made since the last merge and go over all the migrations in our branch to make sure the code we worked on still works.

          • http://dvae.net/blog/ dvae

            Flyway solves this by storing the hash digest (md5) of the file in the version table, along with the version number. This way it can check that the file has not changed since it was applied, or it can detect the difference between branch 1 and branch 2 of a given migration.

          • https://about.me/ferventcoder ferventcoder

            Flyway sounds exactly like RoundhousE, by comparison both started around the same time (Flyway’s first commit looks to be in April 2010 https://github.com/flyway/flyway/commit/169014f7eae42dd984a4c8bb31ed60033a1de3c1, where RH was June 2009 https://code.google.com/p/roundhouse/source/list?num=25&start=4, now at https://github.com/chucknorris/roundhouse).

          • https://about.me/ferventcoder ferventcoder

            I think as long as the integer is globally unique across instances, it is probably fine. In most DB migration tools it is not a globally unique integer, and that might be where Bob is getting hung up. RoundhousE lets you pick the way you version, and also talks about those different methods – https://github.com/chucknorris/roundhouse/wiki/Versioning

  • Axel Fontaine

    Flyway does exactly this, is mature and enjoys widespread adoption: http://flywaydb.org/

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      Spoilers! 🙂
      Was going to discuss the tools in the next post.

      • http://www.jtwebman.com/ JT Turner

        You should have just done that post.

    • https://about.me/ferventcoder ferventcoder

      Weird, I was about to state the same about RoundhousE https://github.com/chucknorris/roundhouse/wiki. @axel_fontaine:disqus it seems you and I both approached this same issue in nearly exactly the same way back in 2009/2010. Must be a smart way if we both took the same approach. 🙂

  • JD

    I’d push back on #5, but mainly due to the size of the databases we use. By forcing all changes to be scripted and checked in, and having a CI tool like Jenkins deploy them, we’ve managed multiple tracks making DB changes.

    I’d actually be worried that you are pushing off integration issues as individual developers work in a vacuum without being aware of potential conflicts as early as possible.

    Beyond that, we implement the other rules and find them highly effective in continuous integration and identifying who is responsible for broken builds.

    We rely on our CI tool to identify what version of code has been deployed to each environment.

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      You are right, the size of the database might become a problem but I still think that #5 is useful in such situations, mainly for the sake of isolation while developing. For example, if you continuously run integration tests that work with the database while you are coding, it might be simply impossible for two developers to use the same database because of the data those integration tests generate and validate. Often, in order for the integration tests to work properly, they must be run in isolation from each other.

      Regarding being aware of potential conflicts – I agree they should be resolved as soon as possible and developers should communicate heavily in order to do that. At the same time, I do think programmers are better off working on their own DB instances.

      • Amo Chumber

        In the approach where each developer has their own database, you would probably supplement that with an integration stage in your pipeline. This environment would *probably* be where your testers do their testing, and it pulls in the changes that make up the next release. Yes, you may hit conflicts, but I think that is unavoidable in any build/deployment pipeline; the key is to never let them get to production.

        I always picture this as the basis:

        Local Dev (own instances) > integration (shared) > staging (do not touch) > production (DO NOT TOUCH!)

        🙂

        • http://enterprisecraftsmanship.com/ Vladimir Khorikov

          Absolutely! There definitely should be a place (QA, STG, etc.) where all the changes go prior to getting into production and where QA people can test the application.

  • https://about.me/ferventcoder ferventcoder

    Since someone mentioned best practice #6, I’ll just take the next number –

    Best Practice #7: every change should be auditable. Quite a few companies are subject to some sort of auditing (Sarbanes-Oxley in the US, others in other countries), which includes auditing of IT controls. Knowing exactly who changed the database in production, when, and what was changed is very important for these companies, and possibly for your company. It’s typically an afterthought for developers and DBAs, but it is vital when a company is audited.

    • https://about.me/ferventcoder ferventcoder

      These are the features of RoundhousE, which includes auditing – https://github.com/chucknorris/roundhouse/wiki/Features – I suspect @axel_fontaine:disqus’s Flyway is quite similar, except it looks like Flyway supports more database types and running on more OS platforms.

  • Philippe

    For .NET guys, you could use https://github.com/schambers/fluentmigrator or http://dbup.github.io/ to easily apply these (very good!) practices…

    • Alexander Erygin

      +1 for FluentMigrator.

  • Alex Yates

    Thanks for your post. I agree with a lot of your points about the importance of DB source control, reference data, repeatable deployments, etc. This stuff is vitally important.

    However, I’m afraid I disagree with being so one-sided.

    I work for a company that provides tooling to support database development in both a migration script driven approach (as you describe) and a declarative/state driven approach so I have a fair amount of experience both ways. It’s my day job.

    I believe there are pros and cons to each approach and that some projects are better suited to one or the other. This is a pretty enormous topic and too much for one comment. I discuss the relative pros and cons here. I’d be really interested to hear your thoughts:

    http://workingwithdevs.com/delivering-databases-migrations-vs-state/

    A few of the things I struggled with in your post:

    “When we modify the existing SQL scripts we lose all the benefits the database versioning best practices provide us.”

    Isn’t the whole point of source control to track changes to a particular set of files, rather than just adding new ones? You can use a network share for that.

    “Don’t use complex notations like “x.y.z” for the version number, just use a single integer.”

    What’s wrong with semantic versioning?

    “When the database grows large enough, simultaneous modifications of it become harder and harder until at some point stop working at all.”

    Isn’t it the opposite? Migrations driven solutions are great at defining exactly how an upgrade works, but become a pain when projects get large with many contributors/teams editing the project. In these scenarios managing all the upgrade scripts becomes a nightmare, whereas working declaratively allows for much easier merging of conflicts.

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      Alex,

      Thank you for your comment, I’m pleased to hear feedback from someone directly involved in the development of the tools I’m so familiar with.

      I agree there are two ways of implementing database continuous delivery. I’ve been employing both of them for almost 10 years now (on one hand, the VS database project tooling more recently and Redgate products before that; on the other, fluent migrations and Flyway-like software). During this period, I did come up with a strong opinion, which, for the most part, I described in this post.

      I will probably share my personal experience with both the migration-driven and state-driven approaches in the next post and explain why I think the former is the more convenient way to do database delivery.

      I discuss the relative pros and cons here

      Thanks for sharing, it was an interesting talk! I agree with most of the points you make. There are some of them I disagree with to some extent:

      1) While it’s true that simultaneous work on a single database using the migrations approach leads to merge conflicts, in most cases they can be handled painlessly. For the most part, developers don’t work in a vacuum; they communicate heavily with each other about the changes they are about to make, so conflicts can be reduced to a minimum. Secondly, it’s a good practice to create integration tests that work with the database. They help tackle and quickly fix the “last migration wins” problems if they occur.

      2) You mentioned that the state-driven approach is good for merging code (such as functions and stored procedures) in the database. I agree with that. Nevertheless, I personally think that relying on stored procedures too much is a bad practice because it is one of the ways of breaking the Separation of Concerns principle. I strongly believe the application logic should live in the application code (C#/Java/etc.) itself, not in the database. In cases where you can’t avoid using SQL code (for example, for some sophisticated select query), that SQL should also be stored as part of the application code – along with the C# code that uses it, not in the database. That being said, I think a project where the SoC principle is properly applied doesn’t contain many (if any) SPs and functions, only tables and the data in them. And that is what makes the migration-driven approach appealing: it allows us to handle data migrations well (keeping in mind the potential merge problems and mitigating them as I described in #1).

      Isn’t the whole point of source control to track changes to a particular set of files, rather than just adding new ones? You can use a network share for that.

      The problem with changing the migrations that are already deployed is that it becomes impossible to track the changes after that. Source control doesn’t help much with that.

      What’s wrong with semantic versioning?

      If by semantic versioning you mean version numbers like 1.2.5, then my answer is the KISS principle. Multi-part version numbers don’t add any value compared to a single integer, so we are better off choosing the simpler notation.

      Isn’t it the opposite?

      In regards to the merge conflicts – please see above. Also, I’ll write about my personal thoughts on that topic (state-driven vs migration-driven) in the next post, I would be glad to hear your thoughts on that.

      • Alex Yates

        Thank you for your considered response. You do make some very good points.

        I agree that modifying buggy upgrade scripts in a migrations based model is bad. It means that, depending on when customers upgraded, they might be in different states – which is bad. This does pose a different problem though, that in order to upgrade a DB you need to transition through a bad state before you get to the good state. In general deployments may take longer as you need to transition through every VCS commit, rather than upgrading directly from point X to point Y.

        The fact that modifying migration scripts is bad does not detract, however, from the fact that if this is what you are doing you lose the benefit of being able to easily diff one version against another. In order to understand the differences between vX and vY in a migrations model you need to read every line of code in between. One of the key benefits of source control is that it allows you to easily diff the state of particular sets of files to see a simple summary of the changes.

        I guess you are probably right about semantic versioning in source control. This should probably be used externally and in releases etc as it provides context/information to users and sets certain expectations, however it probably isn’t needed in source control. My bad.

        I don’t disagree that it is probably better to keep logic in the application and minimise the use of procedures and functions etc. I guess my problem is that this is only practical advice for greenfield projects. Most of the teams I work with are supporting at least one legacy database that is burdened with technical debt, often including many stored procedures. For these people it would be far easier to adopt a process that is well suited to help them manage the problems that they have, rather than an approach that is well suited to small, dumb data stores.

        I also don’t agree that we can assume most teams are small enough and united enough to be in regular contact so that they are aware of each other’s changes. While I agree this is a fantastic goal and many teams are lucky enough to be there already, what I often see in practice is different silos responsible for dev and administration, both independently making changes and failing to communicate them. (Often DevOps has bypassed database devs and DBAs!) Often I also see large numbers of dev teams working on different applications that all share the same DB. The developers are good at communicating with their team members but poor at communicating with other teams who touch the same DB. Finally, there are often vendors who want to make changes to your databases, and they are rarely going to work to the processes of each of their customers. I agree that these aren’t ideal situations, but they aren’t easy things to fix in the short term. Using a state-based version control process can help.

        And with state based processes, you can use diff tools to create the upgrade script, but most vendors (including us!) would certainly recommend testing those scripts first. You are entirely right to suggest that diff tools have their flaws. The various providers of these tools tend to provide their own fixes to these problems, such as the SSDT refactor log, pre/post scripts and Redgate’s migration scripts. I also really like the ReadyRoll Deploy-Changes idea where it handles the tables in a migrations way and the procedures in a state way – although this brings its own problems!

        Once again, thank you for your post and the time you have taken to respond articulately to my comments. I believe that for many projects you are right. I look forward to reading your next post!

        • http://enterprisecraftsmanship.com/ Vladimir Khorikov

          You raise very interesting points!

          You are right, my position is more applicable to “ideal” situations, which very often is not the case. Indeed, in a legacy project with a lot of logic in the database, it might be too hard to switch to the “all logic in the application code” approach. Also, I totally agree that communication between teams may not be as good as we want it to be. For such projects, the state-driven approach may indeed be the best choice.

          ReadyRoll Deploy-Changes

          I didn’t know vendors provide mixed solutions, I think I need to take a look at it.

          Thank you for your comments and again thank you for the talk you linked! It was extremely interesting to see what the current state of the question is, you narrated it in a lucid and balanced way.

  • https://hugo.barrera.io/ Hugo Osvaldo Barrera

    I’d solve #6 quite differently. Add a “migrations” table, and a row for each script (with the date applied, and the script’s name or maybe a UUID for each script).

    This way concurrent branches won’t have issues with number collisions, and devs can find out if a DB is lacking a certain script (remember that the order of scripts from concurrent branches might vary between dev environments).

    This also lets you state which previous scripts each one depends on.

  • Mikhail Victorovich

    I have a question about best practice #2:
    I work on a team of three developers, all of whom make changes to the DB schema. We are still about three months away from our first deployment to a client.
    We already have a script in our version control that creates our DB schema. Should we start creating separate SQL scripts for the changes in the database now, or should we wait until we have our first deployment of the database to a client? Some of my colleagues believe that we should continue modifying our only script until we have our first deployment and then start creating additional scripts. Any opinions on this? Thanks!

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      That’s a great question!

      Indeed, if you don’t have any production data yet, you can postpone adhering to the best practices described here until the point in the future when you do your first deployment to production. Until that moment, you can just recreate the schema and the reference data in your database from scratch every time, without bothering about the data in it.

      I would say keeping the DB in a single script until the first deployment is a good tactic that can eliminate unnecessary maintenance overhead. One thing I would do differently, though: I would probably start the transition to the migration-based approach a bit earlier than the actual production deployment (maybe by a week or so), just to get used to it beforehand. That way, when the DB is shipped to the client, you already have all the required processes in place.

      • Mikhail Victorovich

        Thanks for the quick reply. We decided to start the transition to the migration-based approach even a little earlier than a week before the first deployment. Maybe 3-4 weeks. Thanks again!

  • Leo Cavalcante

    Thanks for your post, it inspired me to create this:
    https://github.com/leocavalcante/mym