SQL vs NoSQL: you do want to have a relational storage by default

By Vladimir Khorikov

The concept of NoSQL databases has been around for a while, but there still are quite a few misunderstandings regarding the topic of relational SQL vs NoSQL databases. In this post, I’d like to clarify the most common misconceptions and discuss the primary use cases for each of them.

Are NoSQL databases really schemaless?

A quick note before we start: the notion of NoSQL refers to 4 different types of databases: Document-Oriented, Column-Oriented, Key-Value stores, and Graph DBs. In this post, I talk about Document-Oriented databases primarily. I refer to them as NoSQL for the sake of brevity but keep in mind the term itself has a more broad scope.

Alright, so are NoSQL databases really schemaless? Let’s take an example. Let’s say we have a document collection named Customers with the following data inside:

{ Id: 1, Name: “John Doe” },

{ Id: 2, Name: “Bob Smith” }

Because of the collection’s schemaless nature, nothing prevents us from adding another document, like this:

{ Id: 1, Name: “John Doe” },

{ Id: 2, Name: “Bob Smith” },

{ Id: 3, FirstName: “Alice”, LastName: “Christopher” }

Now, let’s say we have a method that looks for customers given a particular name:

public class CustomerRepository

{

    public IReadOnlyList<Customer> Find(string name)

    {

        return _collection.Find(x => x.Name == name).ToList();

    }

}

What would happen if we add Alice Christopher to the collection the way we did previously? Would the Find method find her? Surely, not. The repository class implicitly relies on the collection schema and require all customers to have the Name property in order to be detected. In this scenario, we have to adjust the Find method so that it would start looking at both Name and FirstName/LastName properties.

While NoSQL databases are technically schemaless meaning that they allow us to store documents in any shape we want, the notion of schema itself doesn’t vanish from our domain model. Schemaless databases just shift the responsibility to maintain the schema to us, developers.

The use of a NoSQL storage means a move from explicitly defined data structures to implicit ones. And that is a huge step back from what we have in relational databases, because, as we know, we should always try to make implicit assumptions in our code explicit.

And of course, the notion of schemaless data doesn’t belong to NoSQL databases exclusively. We can do the same in traditional relational databases. For example, one could serialize the whole customer object to an XML or JSON document and put it to an MSSQL/Oracle/MySQL table. This would make the customer storage schemaless as well.

As you might guess, such design decisions are frowned upon in the world of relational databases because of the lack of explicitness. Explicit schema SQL storages introduce is a great assistant when it comes to defining data structures. It helps developers make sure the shape of the data they work with is consistent across the whole database.

Are schema migrations easier with NoSQL databases?

Another broadly accepted statement is that schema migrations are easier with NoSQL databases. Are they?

In the example above, how would we handle the two editions of the customer data? We would need to introduce schema versioning:

{ Id: 1, Name: “John Doe”, Version: 1 },

{ Id: 2, Name: “Bob Smith”, Version: 1 },

{ Id: 3, FirstName: “Alice”, LastName: “Christopher”, Version: 2 }

It means that in any given time there exist at least 2 versions of the Customer class and we have to handle both of them in our domain model manually:

public class Customer

{

    public int Id { get; private set; }

    public string Name { get; private set; }

    public string FirstName { get; private set; }

    public string LastName { get; private set; }

    public int Version { get; private set; }

 

    /* Other members */

}

 

public class CustomerProcessor

{

    public void Process(Customer customer)

    {

        string lastName;

        if (customer.Version == 1)

        {

            lastName = /* Get the last name out of customer.Name somehow */;

        }

        else if (customer.Version == 2)

        {

            lastName = customer.LastName;

        }

        else

        {

            throw new InvalidOperationException();

        }

        /* Work with lastName here */

    }

}

The situation here is essentially the same. NoSQL databases don’t help us with the schema migrations, they transfer the obligation to perform it to us, developers. With a non-relational storage, we have to do the following:

  • Change our code so that it handles both versions of the schema (see the CustomerProcessor.Process method).
  • Create a background job which searches for documents of the old versions and transforms them into the new version, one by one.

Now, compare this to the explicit “fire-and-forget” migrations we can employ in an SQL database. All we have to do there is write a single script that would handle all data migrations at once. The migration logic in relational storages tends not to infiltrate to the domain model which helps keep the latter clean and simple.

Schema migrations in NoSQL databases are not easier. Contrarily, they are more difficult to implement and maintain comparing to SQL storages.

Other benefits of traditional SQL databases

There are two more benefits traditional SQL databases provide us out of the box. There’re no analogues for them in most Document-Oriented DBs.

The first one is data (referential) integrity. This feature helps us verify we don’t reference non-existing rows and thus keep the data consistent in the first place. For example, we can create a foreign key constraint in our customers table and thereby make sure all our customers belong to one of the predefined countries:

SQL vs NoSQL: data integrity in action

Data integrity in action

A relational database acts here as the last stand guardian, so to speak. It means that we still need to employ proper validation in the application code ourselves but if we miss something, that wouldn’t lead to data corruption. The database would inform us about the inconsistency by rejecting the incorrect transaction.

In NoSQL storages, we have to – once again – handle such situations manually. It’s a common practice to create a background job to reveal inconsistent data and try to settle the conflicts after they took place.

Atomic transactions across different tables and/or rows is the other benefit. It’s not as important as those we discussed previously because a well-designed NoSQL storage rarely requires the change of multiple documents at once. But still, it’s a nice-to-have feature in many cases. NoSQL databases usually support atomic operations within a single document only.

So why ever bother choosing a NoSQL data storage?

Despite all the merits relational databases have, they lack two important ones: scalability and performance. And this is the only reasons why one should ever consider choosing NoSQL. Not because NoSQL database are schemaless (they are not, essentially), nor because they make schema migrations easier (they don’t). Scalability. And performance.

Relational databases are Jacks of all trades. They provide rich functionality out of the box. They also do pretty well with any kind of software. The only problem with SQL storages is that when your data outgrows a single database instance, you no longer can rely on the relational model.

If that is the case, you have to revisit your application and try to accommodate a NoSQL storage – one that fits the needs of your software the most:

SQL vs NoSQL: Choosing a specialized data storage based on your application's needs

Choosing a specialized data storage based on your application’s needs

But even in this case, it doesn’t mean you should forgo the SQL database completely. You can employ a NoSQL storage for the part of your system that doesn’t fit the relational model and keep the rest in place. This practice is called Polyglot Persistence.

All said above doesn’t mean you can’t scale a relational DB. You can (to some extent). But this means you will need to give up on the benefits it provides, such as data integrity. Also, SQL databases don’t provide such functionality out of the box, so it often becomes a pain to do that. In most cases, you are better off just choosing a NoSQL DB for scalability purposes.

SQL vs NoSQL: Conclusion

In most cases, NoSQL is a forced choice. It’s a tool the use of which you want to postpone as much as possible as it shifts too much of burden to you as a programmer. Relational databases are more friendly, they provide rich capabilities out of the box:

  • Explicit schema
  • The ease of migrations
  • Data integrity
  • Atomic transactions across several tables/rows

If your application is not expecting outstanding scalability requirements in any near future, you are almost always better off choosing a relational storage. The reason is that non-relational databases require constant (and quite big) maintenance overhead comparing to relational ones.

Only if you are sure your system will contain more than tens or hundreds of millions rows in a single table, should you consider using NoSQL. And even in this case, try to extract to the non-relational storage only the parts that don’t fit the relational DB. That would help reduce the overall maintenance overhead.

Related articles





  • Alex

    Use the right tool for the job, Document Oriented databases don’t claim to be a silver bullet or a drop in replacement, they serve a specific purpose just as a relational db does. It’s down to the developer(s) to work out which if any of them is correct for the TASK AT HAND!

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      Well said, completely agree.

  • Akmal Chaudhri

    Vladimir, I have also collected together some other considerations. Whilst there is some business focus (e.g. skills, vendor funding), I also look at some important issues, such as security. Many references in the back of the slide deck: https://speakerdeck.com/abchaudhri/considerations-for-using-nosql-technology-on-your-next-it-project-1

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      It was an interesting read, thanks for sharing!

  • Val Waeselynck

    “There is no such thing as a schema-less application. There are applications where you write the schema down, and applications where you don’t.” – Rich Hickey

    I would recommend anyone interested in this topic to play around with Datomic. It really shows you the essential benefits of relational databases without their traditional rough edges, while allowing for more schema flexibility than many document-oriented databases (like MongoDB).

  • http://twitter.com/ahachete Álvaro Hernández Tortosa

    At ToroDB we coined a term for this. It’s not schema-less. It’s rather schema-attached. In document dbs the schema is attached to the data. That’s why it can vary from document to document. So it’s clear you are shifting the schema enforcement and understanding to the data producer/consumer (app developer).

    There are more problems because of this. For instance, you need discovery tools to understand the data within your database. And non-indexed queries are painfully slow, because you need to scan the whole db, even if you don’t care about 99% of your data, simply because you don’t know where your valuable data is, it’s not classified.

    Fortunately, this problem has been solved (hopefully) by ToroDB (I’m one of the authors). It’s a MongoDB-compatible NoSQL database that performs automatic schema inference and stores data relationally in PostgreSQL. It’s open source, hope you find it interesting: https://github.com/torodb/torodb

    • David Raab

      But accessing non-indexed data is also slow in a relational database. In the past i used MySQL and like 99% of all our performance issues we had was always a matter that we had no indexed on tables or no appropriate indexes for the given query and it did a full table scan. I think only once i ever encountered a performance issue caused by code.

      As you mention MongoDB, you also divide your data in collections so not a full database scan should be needed. But sure you can end up with a full collection scan. But at least MongoDB also supports indexes, and you have options that it throws errors when a query don’t use indexes.

      I wished in the past MySQL had such a feature, so nobody forgets to add indexes on tables.

      • http://twitter.com/ahachete Álvaro Hernández Tortosa

        Couple of interesting points here:

        – In MongoDB you can divide data in collections, but that does not prevent you to having different data (shapes of documents) within the same collection (after all, isn’t this the promise of “schema-less”?). Now if you want to query a subset of that collection based on existence/absence of some keys, or just some values or predicates based on them, different collections won’t help you.

        – Not all loads are OLTP-like, where not having an index may be a performance hit in most use cases. For reporting/analytic types of queries, you almost never use indexes.

        – I can give you some numbers comparing ToroDB running on Greenplum vs. MongoDB on aggregate queries with orders of magnitude better performance (see http://www.slideshare.net/8kdata/torodb-scaling-postgresql-like-mongodb slide 42). So it can be quite significant!

        • David Raab

          Yes it doesn’t prevent having different data in the same collection, but schemaless is not the same as chaos. It has collections so that we can put the same things into one collections. For me it is more the idea that i don’t have to specifiy the schema twice. The schema is controlled by the application.

          And for subquering of some keys that is exactly why MongoDB also supports indexes for documents in a collection. As you say ToroDB is drop-in replacement i would also assume also ToroDB supports indexes on collections?

          But yes for reporting/analytic types you will probably not use indexes. I also consider an option, that throws errors, only useful for development. That is not an option i would probably activate in production.

          The number are quite interesting but it opens up the question if indexes was used. Without indexes it will sure be slow.

          • http://twitter.com/ahachete Álvaro Hernández Tortosa

            Schemaless is pretty much the same as chaos. If you are given a database, what data is in it? You don’t know. If a document “schema” was changed 30 times, what fields does it contain, what fields it doesn’t? What assumptions can you make about your data? Invariants? How can you assure it is not, indeed, a chaos? Oh, and btw, you can’t join between collections, right? So you may end up being forced to store different data, belonging to different collections, in the same collection.

            But even if you manage to have everything organized in one collection, data will have different fields. If you want to query that collection on particular predicated based on those fields, you need to scan the whole collection. That’s where what we do at ToroDB comes in for help. You don’t need to scan it in full. Documents with different keys will reside in different relational tables, and the predicates of the queries will be resolved to only those tables that need to be queried, becoming much more efficient. Think of having data partitioned by “type” or “shape” of the document. When you only want to retrieve data of a particular “type” (or combinations of) you only need to check one or some “partitions”.

            You can’t always have indexed queries. Indeed, and specially in MongoDB, this is not the case.

            Answering your question yes, ToroDB supports indexes on collections. Only unique constraints are not supported yet.

            And regarding the numbers shown in those queries, those aggregated queries don’t use indexes (in neither system) because they can’t benefit from them (consulting the index for each record and then the record itself is slower than consulting each record). This is true for most aggregated queries and most user-generated queries (think of a UI for querying) that won’t usually have an index behind. Plus having too many indexes puts a big burden on writes and the I/O. Oh, and MongoDB usually cannot use more than one index per query, and an index is required if you need ordering…. So there are many use cases where indexes are not the answer.

          • David Raab

            Schemaless is pretty much the same as chaos. If you are given a database, what data is in it? You don’t know.

            You have to examine the database. And you do that regardless of the fact that you have a schemaless database or not.

            If a document “schema” was changed 30 times, what fields does it contain, what fields it doesn’t?

            In a schemaless database a document probably don’t contain a field at all. In a relational database every row has a field, but probably that field contains garbage like NULL. What is the difference between a non-existent field or a database that forces you to have fields containing garbage instead?

            What assumptions can you make about your data? Invariants? How can you assure it is not, indeed, a chaos?

            You can ask the same question for a relational database. How can you assure your data is not in absolute chaos after modifing a table 30 times and fields where added or deleted? Which fields must be handled by code because of garbage fields like NULL? Chaos is much more likely to happen in a relational store. If chaos happen or not is not solved by not using a schemaless database. If forcing a schema would atomatically create a great well designed databases without chaos then i do not knew how to describe all the bad databases i have seen in my past.

            Oh, and btw, you can’t join between collections, right? So you may end up being forced to store different data, belonging to different collections, in the same collection.

            That statements doesn’t even make sense. As you cannot join it doesn’t matter at all in which collection you save anything. As you also cannot do a self-join you don’t get any benefit in saving it in the same collection. But one point of a document-base database is that you don’t need to join anyway. If data is part of the document then put it in the document itself. It is not a relational store where you have to create hundreds of tables to put your data into.

            If it is not releated to the document, and it has a different schema, put it in another collection. No need at all to put in the same collection. If you put all your data in one collection then the reason why you have chaos is because of you. Not because the technology forced you to do it.

            But even if you manage to have everything organized in one collection,

            Why should i do to want this? MongoDB supports multiple collections in a database so use them. Do you also try to use a single table in your relational store?

            data will have different fields. If you want to query that collection on particular predicated based on those fields, you need to scan the whole collection

            That’s the reason why you have indexes, so you don’t have to scan the whole collection.

            That’s where what we do at ToroDB comes in for help.

            So ToroDB helps when people don’t have any clue on how to use a document-based database?

            Think of having data partitioned by “type” or “shape” of the document.

            Yeah, that’s exactly what a collection is for in MongoDB

            And regarding the numbers shown in those queries, those aggregated queries don’t use indexes

            Yes, my fault. An index sure doesn’t help when you run an aggregation across all documents. But MongoDB already supports putting different shapes into different places. That is what collections are for.

      • Miguel

        Correction to this statement, FTS is the faster way of reading data.
        Most of the more recent DB technologies are exactly based on that concept.

  • Harry McIntyre

    For me, for me it’s very important that I don’t have to make concessions to persistence when modelling domains, and I find that NOSQL databases are better for this than relational via an ORM.

    Another thing to consider before resorting to SQL or NOSQL is a MemoryImage approach, and after that, distributed actors with event-sourced state.

    http://martinfowler.com/bliki/MemoryImage.html

  • Thomas Eyde

    You can write a onetime migration for nosql, as well. But you would do it in code instead of an sql script. But a onetime migration like this is probably not what you want for large volumes, script or code, so not a good argument for either case.

    • David Raab

      You can *additional* provide a migration script. So once all your data is migrated you can delete the versioning from your current code. That also works nice with big data as you always can start/stop the migration in background. If your servers are idle you can start running the migration, if there have much to do, you can stop it. The migration can basically run in background. Once everything is converted, even if it can takes weeks to update everything you can get rid of the migration from your current code. If it is some customer specific application you probably anyway never need the migration script anyway.

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      I agree. When I was writing this, I kept in mind a sharded scenario where it’s impossible to update all documents at once. But you are right, if we consider an equal case – a single instance DB – the difference vanishes. Although, migrations written in SQL are probably still a bit easier due to the more concise syntax.

  • David Raab

    I think your article is a little bit one-sided. At first every technology usually have is advantages and disadvantage. Because of that we cannot really ever say that one technik should be the “default”.

    I also think in general that NoSQL is a bad term. The problem is overall that it contains too much different stuff. From simple Key-Value stores, Graphs databases or Document databases everything is part of it. Even stuff like Memcached is basically part of the “NoSQL” term so the term means so much that i think it is pretty useless anyway.

    To better discuss the difference it is better to directly name a specific technology. You already said you primarily want to discuss about document-based database. So in that case let me pick MongoDB as a specific implementation. Not that i’m a MongoDB expert, i also don’t want to discuss MongoDB specifics in details, but i use at the base for my discussion.

    So back to the beginning. Why do i think it is a little bit one-sided? Because you primiarly try to find disadvantage of NoSQL. And you don’t really try to find advantages of it or even try to find disadvantages of Relational Databases. If we try to look deeper into advantage/disadvantages of both things it becomes clear that things are not really so much advantages/disadvantages of one solution. There are just different solution to the same problem.

    So let’s talk about Schema. Basically nothing is schemaless. You also have a Schema in MongoDB. The only difference to a relational solution is that in a relational database you have to specifiy your schema twice. twice because you code and probably design your classes around your schema, you probably even use a ORM to map your class schema to your database, and you once again define it in the database itself once again.

    By using let’s say MongoDB you still have a schema. Your code is your schema, but you don’t have to write a schema twice. On top of it, you don’t need that “ORM-mapping” at all. And i count that as a big disadvantage. It’s not that i’m against a ORM, whenever i would work with a relational database i would pick an ORM, but i also don’t think it is something good. The existence of things like ORM just is a prove that working with a relational database is nearly utterly crap otherwise. While SQL is definitely a nice DSL, it is important to note that SQL is for humans. And it pure insane to generate SQL from a programming language. I mean we also don’t generate JavaScript on the fly in our C# code and send it to client, or? Just because working with SQL is basically a pain-in-the-ass we have a ORM so we don’t have to mess with the ugly bits of generating SQL. But is it good when we need a technology like an ORM so we are just able to even work with a relation database?

    On top of that. Our mapping to a relational database is in my opinion often destructive. When i have a class with an array property i cannot save multiple elements into one field. No i have to create a new table and somehow map them together. And i basically have to deserialize a class into the relational schema, and when i read something from the database i have to serialize it back from the relational schema back into classes. Why is all of that stuff automatically good?

    On the other hand in MongoDB i just can save an array property in a class just as an array property in the database. It fells a lot more natural. We just save all data exactly how they are represented in our code. We don’t have two schemas, some kind of mapping to convert it from one schema to the another and a library that is needed to generate SQL. Those are all disadvantages that increases the complexity that you didn’t mention, as it seems you basically accepted all those disadvantages as given.

    But not forcing a schema in the database doesn’t change so much. You still have a schema also in your database. Your application that writes and reads to it enforces the schema. but you don’t have basically two different schemas. If however you have a different schema in the same collection even that changes not much. Sure your code would probably not work. But it is the same with a relational database. If the schema in your code doesn’t match up with the schema of the database your code using a relational database would also not work in the slightest.

    You mention we as the developer have to maintain the schema in a schemaless database. But when do we not do that? Also in a relational database you have to maintain the schema.

    You also mention versioning, but you already do that in a relational database. What do you have to do if you want to add a field in some classes? Right you add it to your class, you have to add it to your schema in the database, and you have to provide a migration from the old schema to the new schema. And updating a schema is basically a three way step.

    1) Shutting down your application

    2) Upgrade/Migrate database schema

    3) Update code and start application

    With a schemaless database it just reduces to one step. Update your code. Zero downtime. Now starting/stopping and making sure a migration script was runned. Your application can basically migrate all data on access to a new schema. You can read an old schema convert it to a new schema and save the new schema. Sure with a small database a feature like that is probably not needed, but that doesn’t make it a disadvantage. It is still easier.

    And on top, you still could do it the exact same as in a relational database. Also in a schemaless database you can create a migration script that runs your migration, and your code only contains your latest version. And it opens up the same disadvantage that you have to shut down your application while it is migrating. Every practice about schema versioning that you do in a relational database you also can do with a schemaless database. There is not really much difference here.

    But working with a schemaless database open up new workflows. For example you could handle the update of a schema in your code. And do a migration from an old schema to a new whenever you read something from the database. Whenever you save something you save it to your new schema. So you have no downtime. And you still can provide a migration script on top of it that converts all your old schema to your new schema. The nice thing is. You don’t have a downtime of your service. The migration can basically work in parallel to your normal running application. Once all data are migrated you can delete the migration from your code.

    What happens if you have a application with a lot of installation where you don’t knew which schema is currently in use by a customer? The same rules to a relational database applies. You have to provide a migration script that identifies the current schema and that does all updates to the latest schema. You have to do that regardless of the fact that your database is schemaless or not. In a schemaless database you *can* keep all migration in your code, but nobody says you have to do that, and you still can do the same things like you do in a relational database.

    How much it differs is really up to you. But from my experience. Most of the schema changes i ever did in a relational database is most of the time just adding fields to some tables. And in that case a schemaless is way much easier. If a field is not there you assume a default value. When you write you save the field. And it doesn’t even differ so much from a relational database. If you add a field to your schema in relational database probably all your fields also have a default like “NULL” or some other default, it even can be that already in a relational database you already calculate some default for new fields added. Just with the difference that you always have to provide a migration path, stop/start the service, do the migration offline and so on.

    Overall i don’t see much of a difference at all. But once again, the assumption is for MongoDB. The whole thing i explained probably doesn’t work with any kind of database or any other document-based database.

  • Barry McConnell

    I am amused at this great example of those who don’t remember history are doomed to repeat it. Every so-called NoSQL model is a reflection of models used in the 1960s up until the 1980s. Relational took prominence over other models not because it was The One True Way but rather that it has a sound theoretical basis, allows someone other than coders to navigate the data, separates the data concerns from the functional/application concerns and has followed a pattern of maintaining integrity within the DBMS rather than relying on applications developers to all build the same protections in the same ways. If the NoSQL database vendors focus on building into their products the same types of controls that relational systems have, I welcome a mature DBMS that allows for non-relational models to be used where they are the best fit to solve the problem. What I don’t want to return to are the days of data chaos and questionable integrity/utility/value that characterized the application-centric approach to data. We’ve come a long way, let’s not return to the Dark Ages simply because we have coders that don’t know history.

    • Kenneth Ambrose

      what he ^ said

    • Kenneth Ambrose

      what he ^ said

  • P Kemner

    I imagine the schemaless approach might be useful for things like the tags on tumblr. You don’t really care about referential integrity- a user might tag a post for dog, dogs, canines, Hunde, perros, fourleggedpetsnotcats or koriat. It’s all thrown on the user, not the system.

  • http://www.tehfamine.com/ Glen “Famine” Swan

    You have to love the subtle irony of this article! (not to troll, but just food for thought.)

    You highlight that some of the main reasons for using the vague term of “NoSQL” is for scalability and performance. This translates to a distributed fault tolerant system that is cheaper per terabyte. Yet, the answer you give to some of the common concerns with “NoSQL” is not to distribute the responsibility between a developer a DBA, but all to the developer… What a bottleneck to have.

    Personally, I foresee NoSQL as an important part of the ecosystem with SQL, just not the entire ecosystem. Schemaless or not, every tool has it’s pro’s and con’s. NoSQL is a great option for distributed fault tolerant systems that are easy on the pocket that can sit above your RDBMS augmenting ETL and much more for your RDBMS.

  • Kenneth Ambrose

    i think you missed another very compelling benefit of relational db- type safety!
    Columns in relational db are typed, and prevent bugs in applications, or poorly designed app interfaces, from allowing users to enter data like “too early” (yep, seen it happen) for dates.

    I work with an enterprise app that crashes all the time and is very difficult to report from because the devs decided to serialize all the data into XML column.

    Yes, it crashes all the time, yes it is impossible to report on, yes it is impossible to enforce type correctness in the db, to maintain referential and transaction correctness But it was easy for the devs to display data on the forms…

    • http://enterprisecraftsmanship.com/ Vladimir Khorikov

      That’s I believe what Explicit Schema is about.

    • David Raab

      I work with an enterprise app that crashes all the time and is very difficult to report from because the devs decided to serialize all the data into XML column.

      Did they do that in a relational database? If yes, it means you didn’t have type safety that you try to highlight.

      • Kenneth Ambrose

        yes they did do it in a relational database. it is a 3rd party app, we have no control over those decisions, and it crashes regularly due to problems with type enforcement. The .net code expects a date but pulls junk from the XML 🙂

        • David Raab

          Yeah i know exactly what you mean. In the past i also worked with a 3rth party tool. And what this tool did was it put all of his data into a single table. No normalization at all. It was horrible to work with those data as we had to do a lot of data manipulation in the language and transform it into a usable structure.

          But that is the essence of what i wanted to say or i even had the discussion about schemaless with someone other. Just because a relational database forces you to some schema doesn’t mean the schema is well-designed or even good. It can still be utterly crap.

          For having a good schema you need people and the ones who create the database schema has to know how to design it. That is true whether you use a schemaless database or not. A database doesn’t have to be good designed because it enforces you to some explicit schema. In the end you are anyway forced to a schema if the database enforces it or not doesn’t matter. The code enforces a schema.

          That you see XML serialization in a database is also not uncommon. I also did something like that myself. And actually did not regret that. In my case it was just a good option. The idea that every piece of data only exists once and you do a normalization is just simply not needed for every field/table. I also have seen stuff like tables that just emulate a key/value store instead to save arbitrary keys/values for a record. In such a case you also can just serialize a Dictionary to XML and save everything into a single field instead, and so on.

          At least some database like PostgreSQL also have Array columns. And today they even support JSON columns as far as i know. So you also could use that instead.

          But another reason is that it just gets hard to model some stuff in a relational database. For example any kind of hierarchical data is just a mess to handle in a relational database. Stuff like XML serialization is typically a solution so you don’t end up with too much complicated stuff. It can be the right thing, but also can be horrible used at the wrong place.

          In the end a relational database is also just a tool and doesn’t give you any guarantee that the data, the schema or an application uses it correctly or in a sane way. And it is also by no means “default”. For some data it is just a mess to work with.

          Just the enforcement of some schema actually doesn’t say much or give an guarantee at all of a sane designed normalized database.

          • Kenneth Ambrose

            but the difference is that with a schemaless db, the db cannot enforce type , RI, and other data constraints even if the developers and customer wanted it!

          • David Raab

            Also a relational db cannot really enforce types. If it could, how is it possible to add XML to a field then? Stuff like VARCHAR,TEXT,FLOAT and so on are not proper type validation. It is the same for programing code. If you use those types you end up with Primitive Obsession. More important validation of data happens in the application anyway.

            And that is also true for any other kind of data constraints. The point is that you leverage all of that task to your application.

            And you have todo that anyway, as you usually don’t want any kind of complex validation inside of SQL. I mean Vladimir has written a whole blog on that matter:

            http://enterprisecraftsmanship.com/2015/11/11/is-sql-a-good-place-for-business-logic/

            And for that matter i don’t need a relational database that enforces a VARCHAR. My programing language is already good enough to enforce that something is a string. And it can even enforce more important business logic beyond that!

          • Kenneth Ambrose

            David you really are not understanding at all of relational DB concepts and implementations.
            If I set a column to type DATE, ONLY VALID DATES can ever be entered in that column! period. no ifs, ands, or buts.
            Now I stupidly make every column a varchar, when some are semantically really dates, and some numbers, then the DB will allow any text. But that is because I told the DB the type was varchar and NOT int or date.
            These are BASIC relational DB concepts that there can be no argument about.

          • David Raab

            Yeah, my birthday is btw. 1.1.3070. The database saved it, so it must be valid! What you are giving as an answer is a really extremely bad answer on what i have written. But right if you think that anything is valid, then also your example with XML serialization is valid.

            Some developers chooses VARCHAR and saved XML in it. Valid data. period. no if, and, or buts.

            Everything what you answer to me is basically that i have no clue and i’m just dump instead of talking about the problems i highlight. In the same sense i also can answer.

            If your application crashes then it must be your fault, and you probably didn’t understand relational databases at all. Some developers choose to save XML and the database accept it, so it must be valid.

            Nice to have such nice discussion with you with so good arguments like yours. So i’m out of this discussion with such an impolite person like you.

          • Kenneth Ambrose

            1.1.3070 is a valid date. you are confusing data type with data semantics. That’s like saying the DB or application code should know that my name is not Kennet because the data type is varchar!
            Anyway, I won’t respond further. You are a client side coder and this stuff is simply not in your knowledgebase…

          • David Raab

            In an application i want stuff like Username, Email, Password, URL, Title. You have additional constraints. Like a Username should only consists of some characters. It isn’t allowed to contain spaces. A email adress should at least be valid, A password should be hashed before it is saved to the database. A URI should be a valid URI, not just some garbage. A title (for example for a website) is the special part to represent in a url and also shouldn’t contain extra stuff, spaces and so on.

            All of those are data constraint that you probably already have in an application. And what does your database use to save all of those things? Very likely VARCHAR for all fields. A VARCHAR doesn’t really add anything more useful to the validation, as your application already does.

            And it opens up the same question. What for example happens with your application if you bypass your application, directly read/write to the database and you write garbage like “dalkdjnsafglkhbsfkjb” into a field that is supposed to hold a “Email”? Nothing in a relational database prohibits that. Does your application break because of invalid fields? Throws it exceptions? Does it ignore the data?

            And actually for a schemaless database you have to think of the same stuff. Sure theoretical it could be happen that for an Email field someone injected a floating number, or a Date, but is that anyway different as reading a VARCHAR field that also can contain any garbage?

            Not really. A date would be the same way invalid as “fsfkdjnsadkfljbn” would be for an email. You have to think about those case anyway.

            Or you don’t have to think about those stuff. You always ever need to think about those stuff if you have multiple applications that uses the same database. If you restrict your database to a single application every data to the database goes through your applications and through your validation.

            In my opinion a relational database with a schema just give you more of a illusion that you have more data safety. Sure you have a nice VARCHAR, you cannot save a DATETIME in those field. But does it really mean that every VARCHAR is automatically valid, and your application can work with whatever VARCHAR you put in there? Hack, even a DATETIME is useless. If you have a birthday field you want probably ensure that every date lies in the past. If you write a system that handles events (music concerts and so on) you probably want at least that every new entry that you create is set in the future.

            If you bypass your application then a relational database gives you not really more safety at all. A VARCHAR is nothing more as an airbag in a car. It doesn’t prevent a crash. It can handle only a tiny bit of validation, but for a application you are already writing that is possible not good enough.

            On top of that comes another problem. At least relational database and OO doesn’t really work good together at all. It is the general problem what is described as “object relational impedance”. An ORM tries to fix that, but sure you cannot really fix that. There are two ways to solve the object relational impedance. Don’t use a OO language or drop a relational database.

            If you drop a relational database, there are also some kinds that just getting easier. For example in a document database you can directly save arrays or dictionaries. or any other kind of hierarchical data. For example in a blogging system. Is there really a need to save the comments in a separate table away from the blog? Not really. You also could save the comments just in the blog itself. The same goes for tags that you add to a blog. An array “tags” and you can do it. Saving data is way more easier to save and retrieve. No Joins over hundreds of tables, and so on.

            And real validation of data anyway happens in your application. You have make sure that either your application is the only application ever that saves data to your application and validates everythings that goes in. Or you application have to address problems that some fields might be invalid. A relational database with its primitive types is anyway not powerful enough to give you any data validation/safeness anyway.

            And those are even simple data constraints. If you have constraints like “A user only can have 3 open questions simultaneously” then you are anyway lost. This is stuff you really want to do in your application. If you write directly even to a relational database nothing prevents you of INSERTing hundreds of open question for a single user. So you anyway have constraint and validation that you will handle in an application that by far exceeds what a relational or a database with a schema can offer you.

          • Kenneth Ambrose

            Sorry, very incorrect!
            DB constraints _are in the db_.
            PRECISELY so that no application can bypass them!
            I think you don’t even know what DB constraints are?

      • Kenneth Ambrose

        I should add that I also own a car. However, if I drive the car into a lake and try to use it for water skiing, it may not work. Does that mean the car has flaws and is not appropriate to use on streets, or does it mean the driver is a fool?

        • David Raab

          No, but it also doesn’t mean you can drive carefree just because your car has airbags for protections, or that this protection turns someone into a good driver.

    • David Raab

      I work with an enterprise app that crashes all the time and is very difficult to report from because the devs decided to serialize all the data into XML column.

      If you can save XML in a field then it must be also valid data. Otherwise it would not be possible to save it.

      Yes, it crashes all the time, yes it is impossible to report on, yes it is impossible to enforce type correctness in the db, to maintain referential and transaction correctness But it was easy for the devs to display data on the forms…

      No problem of the database. All data are valid. Problem of the software if it cannot handle the data correctly.

  • Harold A. Scott

    Thanks for this article, I understood some parts but not everything, does anybody know of a good executive overview? I am just starting to get into programming so the code parts confused me a bit 🙂