KISS vs database normalization

By Vladimir Khorikov

In this article, I’d like to share an example of how the KISS principle can go against the database normalization principles and how to make a choice in such situation.

The example

Some time ago, I had a task which required me to implement linkage between the users of an application and their social accounts. At the time, we planned to add only two social providers – Facebook and Google – but others could be potentially integrated as well in the future.

What’s interesting in this feature is the discussion I had with my colleges regarding its implementation. The first version looked like this:

public class User

{

    public string Name { get; private set; }

    public IReadOnlyList<SocialAccount> Accounts { get; private set; }

}

 

public class SocialAccount

{

    public SocialAccountType Type { get; private set; }

    public string AccountId { get; private set; } // Assigned to the user by external provider

}

   

public enum SocialAccountType

{

    Facebook,

    Google

}

Every user was about to have a list of social accounts. Each social account would contain a type property stating what provider it is from and an identifier which is assigned by that provider to each unique user in the social network.

The database was going to have two tables: one for the users and another one for their social accounts. The relation between the User and SocialAccount tables is 1-to-many:

KISS vs database normalization: Version one

Version one

Seems reasonable, doesn’t it? Indeed, this design meets the requirements. Even more, it leaves a reserve for possible future enhancements: we will be able to easily add new providers just by extending the SocialAccountType enum and without updating our DB structure.

There’s nothing wrong with this implementation except one thing: we can make it simpler. Here’s what we can replace it with:

public class User

{

    public string Name { get; private set; }

    public string FacebookAccountId { get; private set; } // Assigned to the user by external provider

    public string GoogleAccountId { get; private set; } // Assigned to the user by external provider

}

KISS vs database normalization: Version two

Version two

In this version, we’ve got rid of the account collection and introduced two separate fields instead. Also, we’ve inlined the SocialAccount table into the User table so that it now contains two columns with the currently integrated providers.

KISS vs database normalization

So, which version is better and why? Let me first bring the objections regarding the latter implementation.

First off, it’s not easily extensible. You will need to change the User class whenever a new social provider is added and that might end up being pretty burdensome. Secondly, the database here isn’t normalized: the table will contain lots of nulls in the two columns because, clearly, not all users would link their social profiles to the accounts in our system. This, in turn, would result in a waste of the disk space.

On the other hand, this design is simpler and fits the exact set of requirements we have right now.

You might have guessed where I’m going with this. The first version violates both of the two most important software development principles: YAGNI and KISS. It tries to predict the future needs by creating an extensible set of social providers and complicates the implementation of the existing requirements because of that.

Alright, but what about the drawbacks of the second version? Don’t they overweight the violation of these two principles? No, they don’t. In fact, they aren’t actually drawbacks.

New requirements tend to come in an unexpected form. Most likely, there would be details which we haven’t foreseen. That, in turn, would result in us making more effort to adapt the design comparing to the straightforward, right-to-the-point implementation. Keeping the code minimalistic prevents unnecessary complexity from creeping in and reduces the maintenance cost for the code base later on.

You might argue that with this design, we will have to modify the table schema in case we need to add a third provider. Sure, we will. But what’s the actual probability of this happening? Too often I saw extension points someone put into code stayed unused. Forever. And even if we are sure it’s not gonna happen, we are still better off with the simpler design. Remember, code is not an asset, it’s a liability. The latter version allows us to avoid bringing additional liability in up to the point where we really can’t get away without it, so it is still a much better deal.

The second objection – lots of nulls in the two Ids columns – isn’t an issue either. The days when the disk space was expensive have passed, fortunately. Besides, we are getting some performance improvements here due to the lack of SQL joins.

But I’m going to state the third objection against the single-entity design which I didn’t hear from my colleges. It just feels weird to inline columns from one table to another. Shouldn’t we always try to achieve the 3rd normal form as we’ve been taught in college? I definitely can relate to this feeling. Nevertheless, the answer is no: the KISS principle beats database normalization.

You should resist the temptation to normalize the DB structure in case it hinders design simplicity. A normalized database doesn’t automatically mean a good design. In fact, in some cases (like the one above), it makes it worse due to unnecessary complication.

Conclusion

Database normalization, like any other practice, shouldn’t be employed bluntly. Always try to make sure it serves the bigger goal of keeping the overall solution simple. If it doesn’t, don’t hesitate to deviate from this practice and denormalize tables in your DB.

Related articles:

LinkedInRedditTumblrBufferPocketShare




  • http://alexzaytsev.me hazzik

    The first sample is only in second normal form.

    • http://alexzaytsev.me hazzik

      On the other hand, the second example IS in the third normal form.

      • http://alexzaytsev.me hazzik

        So, given that, the article itself does not have or make any sense. You are trying to compare warm with soft.

        • http://enterprisecraftsmanship.com/ Vladimir Khorikov

          The first sample is only in second normal form

          My memory about normal forms might fail me but according to wiki the first sample is in the 3rd normal form, not the 2nd one as it doesn’t have transitive dependencies.

          The second sample, according to the same article, is indeed in the 3rd NF. So, I probably should have chosen another wording for the showcase as both samples reside in the same normal form.

          However, I believe the main point of the article still stands, regardless of what normal form the samples are in. That is, we should prefer simplicity and refrain from introducing additional tables in the database unless we really need to.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            Nah, the second code sample is in the 2nd NF, not the 3rd one, because we do have a transitive dependency here: two account fields depend on the user id. To move it to the 3rd form, we could refactor it the way I proposed in the first code example or introduce another table, like this: SocialProfile (ID, UserID, FacebookAccountID, GoogleAccountID). So I believe all points from the article are valid.

            Thank you for your comments, Alexander, they made me re-evaluate my 15 years old knowledge regarding database normal forms :)

          • mz

            But the user id is the primary key, hence the 2nd solution is in 3NF too.

            I think It’s generaliztion/specialization aspect that affects the design, not normazliation.

            Also null values are allowed in any NF.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            I believe 3NF doesn’t depend on the presence of a primary key, it’s all about transitive dependencies. In other words, in the latter code sample we combine together two entities: User and SocialProfile which otherwise would be in the 1-to-1 relationship.

            Anyway, it’s all rather theoretical stuff. Even more, the NF an entity is in depends on the semantics of that entity.

            You are right about the generaliztion/specialization aspect, that’s I believe the most important factor that affects the design. On the other hand, normalization/de-normalization is a good example of that aspect.

          • http://alexzaytsev.me hazzik

            The second example is in 3rd normal form, because these table does not have any potential keys, on which FbID and GoogleID fields do transitive depend.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            Well, the FbID and GgID columns themselves represent that composite key. They depict the unique set of attributes that characterize social accounts linked to the user. Because of that, they can be extracted into a separate table: SocialProfile (UserID, FbID, GgID)

          • http://alexzaytsev.me hazzik

            Minimal set of attributes, which identifies user is {UserID}, all other attributes identify only this column and does not affect, identify or somehow relate to others. So, this table indeed in 3NF.

          • http://alexzaytsev.me hazzik

            The first example not in 3rd normal form, because it has a potential key “Type” and it has transitive dependency on it.

            However I do support statements that everything should be kept simple as possible, but not oversimplified. But the examples here are oversimplified and do not demonstrate the problem.

            To be in a 3rd form, with this domain, fist sample should have an extra table “SocialProviderType”, with 2 fields: Id and Name. And the SocialAccount table should have a FK to SocialProviderType table.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            The first example is not in 3rd normal form because it has a potential key “Type” and AccountID has transitive dependency on it.

            Ah, I see. Seems that you got a wrong impression about the domain. Type and AccountID are not transitively dependent. AccountID is a unique ID that the social network assigns to each user. They can very well overlap across different providers (in theory). You can think of it as of a person having a PassportNumber and CountryID columns: 2 countries can, in theory, issue the same passport numbers. Having that said, the first example is in the 3rd NF.

            I agree with what you are saying regarding normalization versus YAGNI.

          • http://alexzaytsev.me hazzik

            The relation in your example about countries does mean transitive dependency. Even if both countries can have the same numbers, changing a country without a number does not make sense, as this number may not exist in that country. Doing so will create a data anomaly. The passport number does belong to a country. And so they must be changed together.

            A simplest law-like description of 3NF from Wikipedia: “[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key.”

            Type column does provide a fact about another non key attribute AccountID, and so it breaks the above description of 3NF.

            Another option to make first example satisfy 3NF is to remove the surrogate key (ID) from SocialAccount table, and make {UserID, Type} tuple a composite primary key.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            Good points, I tend to agree. This indeed renders the wording of the article’s title incorrect as it’s not about normalization per se.

            It’s interesting that this also renders the 3NF impractical in many cases as composite primary keys are generally a mess. That is something I didn’t consider before.

  • ADIMO

    Sure “the KISS principle beats database normalization”. For many developers, design a good solution means design a good database schema, building solutions too complex. Is this the reason because Nosql database are always more used?

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      I’m not sure if NoSQL databases are used more than SQL DBs but the necessity to perform essentially the same work twice (first defining classes in the domain model and second designing DB structure for them) might indeed be one of the reasons why people prefer NoSQL over SQL.

      For many developers, design a good solution means design a good database schema

      That equalization is something that bothers me too. There’s definitely more in here than just good DB schema.

  • Charles L’Carpetron Dukemarrio

    I agree with your premise, but disagree with your proposed solution in the example. If you are using a social network ID for auth purposes then you’re only going to need one, in which case the most flexible solution would be to add a single field for SocialNetworkID and another for Type. This accomplishes the same thing and still leaves it open for more networks later. If you need multiple networks for a single user for any other reason you should go the normalized route.

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      In the example from the article, we did need to have links to both accounts, so that users could login using any of their social profiles: Google, Facebook, or both.

      • Charles L’Carpetron Dukemarrio

        Why would you need both for one user? A user can only auth from one account at a time.

        • http://enterprisecraftsmanship.com/ Vladimir Khorikov

          Yes, but they should also be able to logout and login using the second social provider, so we need to store the connections to both providers.

          • Charles L’Carpetron Dukemarrio

            In that case the whole thing falls apart IMO. What if they wanted to use Instagram or Twitter or Live? By starting out with google and fb specific fields you’ve started down a path to maintenance heartache.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            The point you are expressing is exactly the point I argue against in the article :)

            The answer here is YAGNI and KISS. Integration with Twitter or Live most likely won’t be needed. At the same time, unnecessary generalization complicates the solution by introducing additional maintenance overhead.

            Here I write about this in more detail: http://enterprisecraftsmanship.com/2015/06/11/yagni-revisited/

          • Charles L’Carpetron Dukemarrio

            If you have a requirement that the user should be able to login with Facebook then log out and the SAME user log back in with Google then you have no way of knowing what is going to be needed in the future because once you have one nonsense requirement there are likely to be more.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            You are correct. Nevertheless, I believe the best strategy in this case is not to try to anticipate the future needs but rather create as specific (for the current requirements) solution as possible. That’s what essentially the Yagni principle is about.

          • Charles L’Carpetron Dukemarrio

            There is YAGNI and there is coding yourself into a corner. With this approach you have to check two fields every time you access the info. Soon when they add Live and Twitter you have to check four fields every time. Then when you add AppleID you have to check 5 FIELDS every time. I am a big proponent of YAGNI and KISS but it has to be applied to things that are more permanent and pervasive than Social Networks. Like universal data layers. That is a bad idea. An enterprise isn’t going to need an internal app that works with Oracle and SQL Server and MySQL just in case at some future date you might want to change your standard.

          • Charles L’Carpetron Dukemarrio

            Here’s a more relevant example, if your app has a “States” table and your address table links to it, maybe using a number field called stateID as a foreign key, you might be too normalized.

          • http://enterprisecraftsmanship.com/ Vladimir Khorikov

            Why 5 checks? There are 2 separate flows for logging users in with their social accounts, and there’s only one check on each login, regardless of how many social networks are integrated.

            Regardless, my main point is that the assumption we are inevitably going to have more social networks integrated most likely will stay unfulfilled.

          • John Turner

            But by avowing these “2 separate flows”, this violates DRY – “checks” or “flows” or whatever it may be referred to as, it’s redundant. Hardcoding these columns instantiates a pernicious anti-pattern. And theoretically, this approach does not even meet 1NF – the tuples each contain a value _set_ of the same essential attribute, i.e., they are non-atomic. Furthermore, the default ORM serial ID column as PK effectively denudes the database of any connection with the domain being modeled. For all the benefits one expects to reap by decoupling from the domain with immaterial serial ID’s, increasing the opacity in the persistence layer is certainly not one of them. In both examples, the tables are not much more than uniquely indexable heaps. When implementing tables in this manner, deliberating over an appropriate attribute structure is a bit like closing the barn door after the horse has already left.

          • BBI

            It’s good to see those linked examples for comparison:

            Leaving the code to add a comment to a picture, or to canonize an album name, in one specific place does not obstruct anything. When needed elsewhere, it can be simply copied to a generic location in it’s own class, from where it can be used in different places. An ICommentable interface can be created afterwards. The only risk is that somebody who later does the same somewhere else, does not find the existing code and creates his own version.

            Faulty database design may affect a large, monolithic (partial) system; there will be many places where the Facebook and GoogleAccountIds are used and have to be changed, either adding new account types, or making a generic solution. Then there’s data migration, requiring SQL scripts.

            In the code examples, the efforts for a generic solution are simply deferred to when they are really needed (and omitted if the need never occurs). In the database example, one can be lucky and never have to add new social accounts, saving just little efforts. But if more social accounts, or social account dependent data are introduced, the efforts are MUCH greater than the initially saved work. Or the efforts are again delayed, and trash dump structures, like multi-value columns (CSV, XML, JSON) or polymorphic references (no foreign key) are introduced, with possibly even larger damage by data corruption and performance problems in the future.

            I have more than once experienced database design instructions, saying: we will NEVER need more than one related item for this, put it in a column value! And, a few weeks later, it was revised: nooo – we now need multiple values!

  • http://buclenet.com Albert Capdevila

    Thanks for the post, it made me think about some database designs I’ve done.

    If the initial requirements have been link with five different social providers, which one of the two designs you had chosen? What would be the limit of social providers from which the second design would be better?

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      That’s an interesting question. In this situation I think I would still keep the second version of the DB design (5 separate AccountID columns in the User table) but would gather them under a single Value Object in the User class, something like this:

      public class User
      {
      public string Name { get; private set; }
      public SocialAccounts SocialAccounts { get; private set; }
      }

      public class SocialAccounts
      {
      public string FacebookAccountId { get; private set; }
      public string GoogleAccountId { get; private set; }
      /* etc... */
      }

      Regarding the limit – I don’t think the number of social providers really matter here, as we can gather all of them under a single Value Object and still keep the User class clean.

      • http://buclenet.com Albert Capdevila

        Thanks for the answer, I agree with you.

        As I see it, the key is to understand the essence of the data. Links to social network providers are fairly static data. If we were talking instead about other types of data with a more dinamic behaviour, then it would be better to keep it in two direfentes tables.

        What do you think?

        • http://enterprisecraftsmanship.com/ Vladimir Khorikov

          Absolutely. The usage pattern is essential here, it would indeed make sense to extract a separate table in this case.

  • David Raab

    Your first example did not invalidate KISS or YAGNI for me. We already discussed about it, so i just tell you my view on why it doesn’t do it shortly. At first one note. YAGNI is not about not building flexibility in your program! I think that is an important statement to think about, that is also why i view it different than you.

    YAGNI:
    The first example didn’t invalidate YAGNI because you had the task to provide a Google and Facebook integration. You came up with two different solution to save the data in the database, you didn’t even really write code, you just have two ways on how to store the needed data.

    And from the both examples the first schema definition is more flexible, the second is not flexible at all. If it is in 2nd or 2th normal form is not really important to me so much. There can be reason why you pick the first or second version. I don’t have anything against the second version. But i don’t see any validation of YAGNI at all.

    An example of a validation of YAGNI in that example would be if you would start coding and also already write a Twitter, LinkedIn, Stackeroverflow, … integration in your code even if nobody ever asked for it.

    And you can do that with the first and the second example. While doing it with the second example of yours needs more work todo because it is less flexible. So i would pick the first example. Because the first example actually don’t have any negative impact at all.

    KISS:
    I only see a validation of KISS in your second example. Simplicity means to abstract things until you have a simple (single) thing that can be used in a lot of different scenarios. Your first data schema could potential work for any service, the second don’t provide any abstraction at all. Instead you include everything what you need for all different services into one big “thing”.

    That means the second example invalidates KISS.

    Summary:
    Both examples don’t break YAGNI. You really have just two schemas where one is more flexible than the other. I don’t see where the first or second example really makes a big difference where it makes something really harder. So there is 1 point for your first example.

    The second example breaks KISS, your first version embraces KISS. So it is 2:0 for me for your first example.

  • Peter Row

    DB Schema != Software Development. Therefore use the correct tool for the job which is DB normalization. The first (correct) way has no downsides as it is flexible for the future without you having to do anything extra. Having a separate table with a list in code is no extra work over having 2 add 2 fields in DB and 2 properties in code. The second (incorrect) way has the downside that it is bad DB design, if you add any fields dependent on the social type then you table gets messy quickly, if you allow FB and Google it is likely that sooner or later someone will ask for Twitter. The former means that is simple the latter harder/more effort.
    Both ways are simple initially. But the former (correct) way is just as quick to implement as the latter but is much more maintainable and extensible.

  • Beau Harder

    I see this happening all the time at my workplace and I’ve started making informal bets with co-workers about 6 months into the future to see if the additional DB schema complexity would have been justified.

    I also like the second approach because when the business is ready to add another feature, it’s almost certainly not as simple as adding another row in a configuration table, so this allows the change to be done through a more controlled code/test cycle.

  • http://auct.eu/ Dima Taras

    Personally I use first option as I have app with 5 social accounts and ability to attach multiple accounts of same type to 1 user.

    But anyway I feel tend to use first option for being more extensible and customized easy.

    I just kinda think you have chosen wrong example, bcauze for this task in 95% Ill chose first option

  • BBI

    KISS and YAGNI are right, if new stuff can be added easily, and the simple design doesn’t obstruct future developments. But spoiling relational design for minimum short-term savings is not just bad design from an “academic” viewpoint, it OBSTRUCTS new functionality (the enum also does, on a lower and easier-to-fix level).

    I can guarantee, there will be propably hundreds of social accounts, and/or there will be additional, common data dependent on the social accounts, requiring a primary key as reference.

    What will be done when there are more than 10, 20 or 50 social accounts? Add 50 columns, or a text column with comma-separated or XML account ID values? Will social account dependent data be in a loosely connected table without foreign keys? That’s trash dump database “design”, which is tempting when rushing, but makes a system unreliable and unmaintainable in the near future. Or is there an easy migration option, to change back the design to the first, when needed (including all application usages)?

    Relational database design has a rather scientific ruleset, and DDD, or viewing relational tables like Excel spreadsheets, simply has to stay out! No entry! Forbidden! It’s perfectly legitimate to have different looking business entities designed after DDD, but relational design has to obey relational rules.

    The other way to go is to sacrifice data integrity, by using NoSQL. This will allow more sloppy design, without running into a mess like with a broken relational design (stuff like orphaned child records or outside-of-transaction inconsistency).
    Btw. a proper relational design process, like normalizing, can also help understand and evaluate a DDD model!