CQS with Database-Generated Ids

By Vladimir Khorikov

Mark Seemann brings up a very interesting subject in his post: how to fit Command Query Separation principle in case you have to save a brand-new object in a database and also need the created id back? Sure, you can have GUIDs for identifiers (which have some drawbacks as I’ll show later on), but what if you really need integers?

I’ve been asked the same question for several times, so in this post I’ll share the solution I use for this problem.

Drawbacks of classic solutions

It is a common approach to have database-generated identifiers. It significantly simplifies the code as you don’t have to bother about creating unique id for every entity you save. To persist an entity, many programmers use code like this:

public interface IRepository<T>

{

    int Create(T item);

}

As Mark fairly shows, this design violates CQS principle because the Create method is obviously meant to be a command, but returns a value. He also introduces a solution:

public interface IRepository<T>

{

    void Create(Guid id, T item);

    int GetHumanReadableId(Guid id);

}

While this solution clearly solves the problem of violation CQS, it does it at a high cost. First of all it brings additional complexity introducing two methods instead of one. Secondly, this solution might suffer a performance hit.

Although the performance concerns may not be an issue in most cases, it is important to understand them so you could avoid such issues when they are vital.

So, what are they? Besides the fact that this design results in two calls instead of one, GUIDs hurt performance of inserts. Relational databases use B+ tree structure to store data. If data coming in is not ordered, then a lot of IO work is required to rearrange the leafs. GUIDs are random, so they may lead to heavy performance drawbacks, especially with large tables.

We could use sequential GUIDs instead and it would solve the problem with leaf rearranging, but we will still have two more problems. The minor one is that GUIDs use 16 bytes instead of 4-8 bytes that are used by INTs and BIGINTs. And those additional 8 bytes would be in every index of the table, not only in clustered one. The major problem is that if we need integer Ids, we still have two calls instead of one.

Database generated integers may hurt performance of inserts as well but for another reason. When a database generates a new auto-incremented value for a primary key, it takes a lock to avoid race conditions. It causes a performance issue that shows up in scenarios with multithread or bulk inserts.

The solution

A better solution here is to preload a batch of Ids and use them for assigning to new objects on the application side. Here is how it can be done.

We can use a separate table to track the batches of Ids we already issued for every entity type:

CREATE TABLE dbo.Ids

(

       EntityId int PRIMARY KEY,

       BatchNumber int NOT NULL

)

The column names are pretty self explaining. Batch number is basically an index of a pile of Ids that client can allocate as it wants. For example, if you have a batch with number 14 and the batch size is 100, then you can distribute Ids 1400, 1401, …, 1499 among new objects. Another client at the same time might reserve batch number 15 and assign keys from 1500 to 1599. As every batch is reserved for a single client, the Ids they generate never collide. Note that you can use any size of the batch, but it’s easier to work with a round number like 100.

There’s a very well suited analogy for this approach. You can think of id batches as IP addresses. ICANN gives you a bunch of IPs, and you can distribute as many of them as you want within the limit of the CIDR range. Batch number acts just like a CIRD range here.

Another benefit of such solution is that you don’t need to touch the database to get your Id back. It’s a crucial concern. First of all, it allows you to increase performance of bulk inserts and related inserts. For example, you might need to insert a parent and a child. In this case the parent’s Id is required for child object creation. Secondly, it fits the Unit of Work concept: you have to contact the database only once, when you decide to commit all the work done in the unit. With database-generated Ids, it might require you to do an intermediate insert if, say, you need to know the object’s id right after you create it. Also, if you choose to rollback the creation, you’ll have to delete the inserted object instead of just not inserting it.

Getting a new batch: code with errors

But how exactly you should issue a new batch number? Simply by incrementing the number in the Ids table using SQL query. First, let’s look at how you should not do it. Here’s the code from a project I worked on:

CREATE PROCEDURE dbo.sp_GetNextBatchNumber

       @EntityId int

AS

BEGIN

       IF EXISTS (SELECT * FROM dbo.Ids WHERE EntityId = @EntityId)

       BEGIN

             BEGIN TRAN

 

             SELECT BatchNumber

             FROM dbo.Ids

             WHERE EntityId = @EntityId

 

             UPDATE dbo.Ids

             SET BatchNumber = BatchNumber + 1

             WHERE EntityId = @EntityId

 

             COMMIT TRAN

       END

       ELSE

       BEGIN

             SELECT 0

 

             INSERT dbo.Ids (EntityId, BatchNumber)

             VALUES (@EntityId, 1)

       END

END

There’re two problems with this code. First, it will have deadlocks in heavily loaded system. Look at the first part of the “if” statement. When a transaction selects the current batch number, it obtains a shared lock. After that it tries to upgrade the lock to exclusive to update the record. If another transaction selects the batch number with the same entity Id after the first transaction ends selecting, but before it starts the update, the deadlock will occur. Two transactions will acquire the shared lock, and both of them will also try to acquire an exclusive lock, waiting for each other to release the shared lock.

The second problem is that there will be primary index violation exceptions in heavily loaded systems. Look at the second part of the “if” statement. If two transactions try to insert a record with the same entity id simultaneously, they will collide. The only way to isolate two inserts is to use serializable isolation level. But it leads to a heavy performance degradation because you have to wrap the whole stored procedure in a transaction with serializable isolation level, causing all calls to this procedure to perform sequentially.

Getting a new batch: corrected version

Here is a corrected version of the stored procedure:

CREATE PROCEDURE dbo.sp_GetNextBatchNumber

    @EntityId nvarchar(100)

AS

BEGIN

    BEGIN TRAN

            

       UPDATE dbo.Ids

       SET BatchNumber = BatchNumber + 1

       WHERE EntityId = @EntityId

 

       SELECT BatchNumber 1

       FROM dbo.Ids

       WHERE EntityId = @EntityId

 

       COMMIT TRAN

END

Note, that update is performed to acquire an exclusive lock right away. The main point here is to acquire all the required locks at once so that no other transaction could wedge itself in. Also note, that insert statement was deleted. It is much simpler to insert new rows manually for the new entity types. It allows not to isolate the code with serializable level and thus to perform much better.

With all that said, we can use the design that is still simple and also fits CQS principle:

public interface IRepository<T>

{

    void Create(T item);

}

What about drawbacks? Every design must have them, right? Yes, there is one. As you can see the Ids in batches are operated in memory, so if the application is rebooted all unused Ids will be lost, leading to holes in id sequences. The id sequences are still increasing, so there won’t be any performance drawbacks, but these holes might seem annoying, especially if your batch size is large.

That is why you should carefully consider the batch size. If it’s too low (1-10), then you might not gain a lot of performance benefits as you’ll need to ask the database for a new batch every 1-10 new objects. If it’s too large (1000 and more), then the holes will be particularly annoying. Also, this might lead to sequence exhausting if you use 4 byte integers for Ids. You should consider the typical usage scenarios for your application and decide what size will fit you. I prefer to start with the size of 100 and then tune it depending on use cases. Of course, you may assign different batch sizes for different entity types.

The last thing I’d like to write about is that all this functionality is already implemented in NHibernate. It’s called Hi/Lo id generation algorithm. Here is an example of fluent mapping you can use:

public class YourEntityMap : ClassMap<YourEntity>

{

    public YourEntityMap()

    {

        Table(“[dbo].[YourEntity]”);

        Id(x => x.Id).GeneratedBy.HiLo(

            “[dbo].[Ids]”,

            “BatchNumber”,

            “100”,

            “EntityId = 1”);

    }

}

Summary

Do I always use this approach? Yes, if I use NHibernate. If I use another ORM, or don’t use it at all, then it depends on the project’s size and its complexity. For a smaller one, it’s okay to break CQS and use database generated Ids. For larger projects, it might be better to switch to Hi/Lo.

Benefits of Hi/Lo:

  • It fits CQS
  • Inserts are faster
  • No need to use GUIDs
  • It fits Unit Of Work concept

Drawbacks of Hi/Lo:

  • Holes in id sequences
  • The solution is more complicated if you don’t use NHibernate
LinkedInRedditTumblrBufferPocketShare




  • http://blog.ploeh.dk Mark Seemann

    The criticism you raise against my proposed solution is based on the false assumption that the GUID would have to be the clustered index, and primary ID, of the database.

    In SQL Server terms (I know SQL Server best), the integer could also be the primary ID, and the clustered index of the table. Then the GUID would only need to have a UNIQUE constraint, and a non-clustered index.

    The second false assumption is that clients always have to to query the GetHumanReadableId method straight after invoking the Create method. In general, you only need the integer if a human is involved; otherwise, machines can deal with the original GUID just fine.

    My solution isn’t without drawbacks, but the same can be said about the solution proposed here. In the end, when the discussion is about performance, arguments don’t count, but measurements do.

    • Vladimir Khorikov

      My goal wasn’t actually to raise a criticism to the solution you proposed, my point was to bring another option here, with its own pros and cons, sorry if this article looked offensive to you. Frankly, all the solutions you and I mentioned have some drawbacks. It’s always vital to apply them with taking into account all the circumstances of the particular project developed.

      So, returning to the topic, the second assumption you are talking about: I wasn’t actually writing about performance concerns GetHumanReadableId method brings on the table, it’s merely about the design complexity. In most cases it’s better to have one single method instead of two.

      The first assumption. Well, even if you have a table structure like this:

      MyTable

      ID int (PRIMARY CLUSTERED INDEX)
      GuidID unqieidtifier (UNIQUE INDEX)

      you will have to generate integer ID anyway. In case of database generated IDs there will be performance drawbacks as well as in case of non-sequential GUIDs.

      Again, every solution has drawbacks. I tried to fully describe the drawbacks of Hi/Lo as well as its merits.

      • http://blog.ploeh.dk Mark Seemann

        Please rest assured that I wasn’t offended at all :)

        I only wanted to point out that part of the disadvantages you described can be addressed, but I also yield that some of them can’t.

      • Archangel

        In SQL server an identity column causes no performance hit – the server keeps the next identity for all tables in memory and uses hardware level atomic incrementing so there would be no performance hit – if you’re using the guid as the object identifiers and the identity solely as the clustering key (as such you really don’t care what it is during insert and you’re never storing one objects assigned cluster identity as a reference) then you have essentially the same performance as using hi/lo or preallocated sets of integers with much easier implementation

        • http://enterprisecraftsmanship.com/ Vladimir Khorikov

          Hi Archangel.

          Seems like you are right. I checked the performance with the 2012 SQL Server, and I didn’t notice any performance degradation for auto-generated integer Ids comparing with Hi/Lo.

          I’m pretty sure there was such issue in earlier versions of SQL Server, through. Seems like my knowledge of it is out of date.

          Thank you for the comment, I’ll fix the article to reflect that.

  • Van Ly

    I’m quite new to CQS and I’d like to ask a humble question: The Create method is a command, then if the object passed to this method have some changes in their property, does it violate any rule? I mean that we can always get the new id from the object itself, so we don’t need to return another integer. Is it good or bad practice?

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      If I understood your question correctly, you are asking if it’s ok to pass an Id yourself? I don’t think it violates any rules, so I would go ahead with such implementation.

      • Van Ly

        No, I mean that the Repository fills the Id after completing the command. As the principle, the command is to modify the system state. Then the question is: is it ok for the command to modify the object passed in? if ok so we can get the new id from the object after calling Create().

        • http://enterprisecraftsmanship.com/ Vladimir Khorikov

          It’s okay for a command to modify an input parameter. However, the CQS principle discourages returning anything from commands. On the other hand, CQS itself is not a hard rule, but rather a “follow if possible” one. So just look at your model and if it makes sense to make a command to return something – do it.

          • Van Ly

            Thanks for your reply.

    • http://blog.ploeh.dk Mark Seemann

      FWIW, I’ve attempted to answer your question here: http://blog.ploeh.dk/2016/05/06/cqs-and-server-generated-entity-ids

      • Van Ly

        Wow, very nice article! Thanks.