CQS with Database-Generated Ids

Mark Seemann brings up a very interesting subject in his post: how do you adhere to the Command Query Separation principle when you have to save a brand-new object to a database and also need the generated id back? Sure, you can use GUIDs for identifiers (which have some drawbacks, as I’ll show later on), but what if you really need integers?

I’ve been asked the same question several times, so in this post I’ll share the solution I use for this problem.

Drawbacks of classic solutions

It is a common approach to use database-generated identifiers. It significantly simplifies the code, as you don’t have to worry about creating a unique id for every entity you save. To persist an entity, many programmers use code like this:

public interface IRepository<T>
{
    int Create(T item);
}

As Mark rightly points out, this design violates the CQS principle because the Create method is obviously meant to be a command, yet it returns a value. He also proposes a solution:

public interface IRepository<T>
{
    void Create(Guid id, T item);
    int GetHumanReadableId(Guid id);
}

While this solution clearly fixes the CQS violation, it does so at a high cost. First, it adds complexity by introducing two methods instead of one. Second, it may incur a performance hit.

Although the performance concerns may not be an issue in most cases, it is important to understand them so that you can avoid them when they do matter.

So, what are they? Besides the fact that this design results in two database calls instead of one, GUIDs hurt insert performance. Relational databases use a B+ tree structure to store data. If the incoming data is not ordered, a lot of IO work is required to rearrange the leaf pages. GUIDs are random, so they can cause significant performance degradation, especially on large tables.

We could use sequential GUIDs instead, and that would solve the problem with leaf rearranging, but two problems would remain. The minor one is that GUIDs take 16 bytes instead of the 4-8 bytes used by INTs and BIGINTs, and those extra bytes appear in every index on the table, not only in the clustered one. The major one is that if we need integer Ids, we still end up with two calls instead of one.

Database-generated integers can hurt insert performance as well, but for another reason. When a database generates a new auto-incremented value for a primary key, it takes a lock to avoid race conditions. This becomes a performance issue in scenarios with multithreaded or bulk inserts.

The solution

A better solution is to preload a batch of Ids and assign them to new objects on the application side. Here is how it can be done.

We can use a separate table to track the batches of Ids we already issued for every entity type:

CREATE TABLE dbo.Ids
(
    EntityId int PRIMARY KEY,
    BatchNumber int NOT NULL
)

The column names are pretty self-explanatory. A batch number is essentially an index of a range of Ids that a client can allocate as it sees fit. For example, if you have a batch with number 14 and the batch size is 100, you can distribute Ids 1400, 1401, …, 1499 among new objects. Another client might at the same time reserve batch number 15 and assign keys from 1500 to 1599. As every batch is reserved for a single client, the Ids the clients generate never collide. Note that you can use any batch size, but it’s easier to work with a round number like 100.
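
To make the arithmetic concrete, here is a minimal sketch in C# (the BatchSize constant and the variable names are illustrative, not part of the design above):

// Illustrative only: computing the id range owned by a batch.
const int BatchSize = 100;

int batchNumber = 14;                  // reserved in the Ids table
int firstId = batchNumber * BatchSize; // 1400
int lastId = firstId + BatchSize - 1;  // 1499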

There’s a well-suited analogy for this approach: you can think of id batches as IP addresses. ICANN gives you a bunch of IPs, and you can distribute as many of them as you want within the limits of the CIDR range. The batch number acts just like a CIDR range here.

Another benefit of this solution is that you don’t need to touch the database to get your Id back. This is a crucial point. First of all, it speeds up bulk inserts and related inserts. For example, you might need to insert a parent and a child, where the parent’s Id is required to create the child object. Secondly, it fits the Unit of Work concept: you contact the database only once, when you decide to commit all the work done in the unit. With database-generated Ids, you might be forced into an intermediate insert if, say, you need to know an object’s id right after you create it. And if you then choose to roll back the creation, you have to delete the inserted object instead of simply not inserting it.
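
To illustrate the mechanics, here is a minimal sketch of such an id generator. This is an assumption-laden illustration, not code from the original design: the class and member names are mine, and the database call is abstracted behind a delegate.

using System;

// A minimal, illustrative Hi/Lo generator. All names are assumptions.
public class HiLoIdGenerator
{
    private readonly Func<int> _getNextBatchNumber; // reserves a batch in the database
    private readonly int _batchSize;
    private readonly object _lock = new object();

    private int _nextId;
    private int _lastIdInBatch = -1; // forces a batch fetch on first use

    public HiLoIdGenerator(Func<int> getNextBatchNumber, int batchSize = 100)
    {
        _getNextBatchNumber = getNextBatchNumber;
        _batchSize = batchSize;
    }

    public int NextId()
    {
        lock (_lock)
        {
            if (_nextId > _lastIdInBatch)
            {
                // The current batch is exhausted: reserve a new one.
                int batchNumber = _getNextBatchNumber();
                _nextId = batchNumber * _batchSize;
                _lastIdInBatch = _nextId + _batchSize - 1;
            }

            return _nextId++;
        }
    }
}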

Getting a new batch: code with errors

But how exactly should you issue a new batch number? Simply by incrementing the number in the Ids table with a SQL query. First, let’s look at how you should not do it. Here’s the code from a project I worked on:

CREATE PROCEDURE dbo.sp_GetNextBatchNumber
    @EntityId int
AS
BEGIN
    IF EXISTS (SELECT * FROM dbo.Ids WHERE EntityId = @EntityId)
    BEGIN
        BEGIN TRAN

        SELECT BatchNumber
        FROM dbo.Ids
        WHERE EntityId = @EntityId

        UPDATE dbo.Ids
        SET BatchNumber = BatchNumber + 1
        WHERE EntityId = @EntityId

        COMMIT TRAN
    END
    ELSE
    BEGIN
        SELECT 0

        INSERT dbo.Ids (EntityId, BatchNumber)
        VALUES (@EntityId, 1)
    END
END

There are two problems with this code. First, it will cause deadlocks in a heavily loaded system. Look at the first part of the if statement. When a transaction selects the current batch number, it obtains a shared lock. After that, it tries to upgrade the lock to an exclusive one in order to update the record. If another transaction selects the batch number for the same entity Id after the first transaction finishes its select but before it starts its update, a deadlock occurs: both transactions hold a shared lock, and both try to acquire an exclusive one, each waiting for the other to release its shared lock.

The second problem is that heavily loaded systems will see primary key violation exceptions. Look at the second part of the if statement. If two transactions try to insert a record with the same entity id simultaneously, they will collide. The only way to isolate the two inserts is the serializable isolation level. But that leads to heavy performance degradation: you have to wrap the whole stored procedure in a serializable transaction, causing all calls to this procedure to execute sequentially.

Getting a new batch: corrected version

Here is a corrected version of the stored procedure:

CREATE PROCEDURE dbo.sp_GetNextBatchNumber
    @EntityId int
AS
BEGIN
    BEGIN TRAN

    UPDATE dbo.Ids
    SET BatchNumber = BatchNumber + 1
    WHERE EntityId = @EntityId

    SELECT BatchNumber - 1
    FROM dbo.Ids
    WHERE EntityId = @EntityId

    COMMIT TRAN
END
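
On the application side, reserving a batch is then a single round trip. Here is a hedged ADO.NET sketch of such a call (the method name and the connection handling are my own illustration; error handling is omitted):

using System.Data;
using System.Data.SqlClient;

// Illustrative: reserves the next batch number for the given entity type.
// Assumes a row for the entity type has already been seeded manually.
public static int GetNextBatchNumber(string connectionString, int entityId)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("dbo.sp_GetNextBatchNumber", connection))
    {
        command.CommandType = CommandType.StoredProcedure;
        command.Parameters.AddWithValue("@EntityId", entityId);

        connection.Open();

        // The procedure returns the reserved batch number as a single-value result set.
        return (int)command.ExecuteScalar();
    }
}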

Returning to the stored procedure itself: note that the update is performed first in order to acquire an exclusive lock right away. The main point here is to acquire all the required locks at once so that no other transaction can wedge itself in between. Also note that the insert statement is gone: it is much simpler to insert rows for new entity types manually. This lets us avoid the serializable isolation level and thus perform much better. With all that said, we can use a design that is still simple and also fits the CQS principle:

public interface IRepository<T>
{
    void Create(T item);
}
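
For completeness, here is a sketch of how such a repository could be wired to the generator shown earlier. The Entity base class and the IUnitOfWork abstraction are assumptions made for the example, not part of the original design:

// Assumed minimal abstractions for the sketch.
public abstract class Entity
{
    public int Id { get; set; }
}

public interface IUnitOfWork
{
    void RegisterNew(object item);
}

// Illustrative implementation: the id is assigned in memory,
// so Create remains a pure command.
public class Repository<T> : IRepository<T> where T : Entity
{
    private readonly HiLoIdGenerator _idGenerator;
    private readonly IUnitOfWork _unitOfWork;

    public Repository(HiLoIdGenerator idGenerator, IUnitOfWork unitOfWork)
    {
        _idGenerator = idGenerator;
        _unitOfWork = unitOfWork;
    }

    public void Create(T item)
    {
        item.Id = _idGenerator.NextId(); // no database round trip
        _unitOfWork.RegisterNew(item);   // persisted later, when the unit commits
    }
}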

What about drawbacks? Every design must have them, right? Yes, there is one. As you can see, the Ids in a batch are handed out in memory, so if the application is restarted, all unused Ids are lost, leaving holes in the id sequences. The sequences are still increasing, so there won’t be any performance drawbacks, but the holes might seem annoying, especially if your batch size is large.

That is why you should choose the batch size carefully. If it’s too small (1-10), you won’t gain much performance benefit, as you’ll need to ask the database for a new batch every 1-10 new objects. If it’s too large (1000 and more), the holes become particularly annoying; this can also exhaust the sequence if you use 4-byte integers for Ids. Consider the typical usage scenarios for your application and decide which size fits best. I prefer to start with a size of 100 and then tune it depending on the use cases. Of course, you may assign different batch sizes to different entity types.

The last thing I’d like to mention is that all this functionality is already implemented in NHibernate, where it’s called the Hi/Lo id generation algorithm. Here is an example of a fluent mapping you can use:

public class YourEntityMap : ClassMap<YourEntity>
{
    public YourEntityMap()
    {
        Table("[dbo].[YourEntity]");
        Id(x => x.Id).GeneratedBy.HiLo(
            "[dbo].[Ids]",      // table that stores the batch numbers
            "BatchNumber",      // column holding the current batch number
            "100",              // max_lo: the batch size
            "EntityId = 1");    // where clause selecting this entity's row
    }
}

Summary

Do I always use this approach? Yes, if I use NHibernate. If I use another ORM, or no ORM at all, then it depends on the project’s size and complexity. For a smaller project, it’s okay to break CQS and use database-generated Ids. For larger projects, it might be better to switch to Hi/Lo.

Benefits of Hi/Lo:

  • It fits CQS
  • Inserts are faster
  • No need to use GUIDs
  • It fits Unit Of Work concept

Drawbacks of Hi/Lo:

  • Holes in id sequences
  • The solution is more complicated if you don’t use NHibernate
