Document Validation – Adding Just the Right Amount of Control Over Your MongoDB Documents

This post looks at Document Validation, a new feature in MongoDB 3.2. It introduces the feature together with its benefits and then steps through a tutorial on how to introduce validation to an existing, live MongoDB deployment. This material was originally published on the MongoDB blog.

Disclaimer

MongoDB’s future product plans are for informational purposes only. MongoDB’s plans may change and you should not rely on them for delivery of a specific feature at a specific time.

Introduction

One of MongoDB’s primary attractions for developers is that it gives them the ability to start application development without first needing to define a formal schema. Operations teams appreciate the fact that they don’t need to perform a time-consuming schema upgrade operation every time the developers need to store a different attribute (as an example, The Weather Channel is now able to launch new features in hours whereas it used to take weeks). For business leaders, the application gets launched much faster, and new features can be rolled out more frequently. MongoDB powers agility.

Many projects reach a point where it’s necessary to enforce rules on what’s being stored in the database – for example, that for any document in a particular collection, you can be assured that certain attributes are present. Reasons for this include:

Different development teams working with the same data; each one needing to know what they can expect to find in a particular collection
Development teams working on different applications, spread over multiple sites, for whom a clear understanding of shared data is important
Development teams from different companies where misunderstandings about what data should be present can lead to issues

As an example, an e-commerce website may centralize a product catalog feed from each of its vendors into a single collection. If one of the vendors alters the format of its product catalog, the global catalog search could fail.

This has resulted in developers building their own validation logic – either within the application code (possibly multiple times for different applications) or by adding middleware such as Mongoose.

If the database doesn’t enforce rules about the data, development teams need to implement this logic in their applications. However, use of multiple development languages makes it hard to add a validation layer across multiple applications.

To address the challenges discussed above, while at the same time maintaining the benefits of a dynamic schema, MongoDB 3.2 introduces document validation.

Validating Documents in MongoDB 3.2

Note that at the time of writing, MongoDB 3.2 is not yet generally available; this functionality can be tried out in the pre-release version of MongoDB 3.2, which is for testing only, not production.

Document Validation provides significant flexibility to customize which parts of the documents are and are not validated for any collection. For any key it might be appropriate to check:

That a key exists
That, if a key does exist, it is of the correct type
That the value is in a particular format (e.g., regular expressions can be used to check if the contents of the string matches a particular pattern)
That the value falls within a given range

Further, it may be necessary to combine these checks – for example that the document contains the user’s name and either their email address or phone number, and if the email address does exist, then it must be correctly formed.
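
As a rough sketch of how such a combined rule might be written, the validator below requires a name, at least one of an email address or phone number, and a plausibly formed email address when one is supplied. The field names, the people collection, and the deliberately loose email pattern are all assumptions made for illustration.

```javascript
// Illustrative only: the field names and the email pattern are assumptions.
var emailPattern = /^[^@\s]+@[^@\s]+\.[^@\s]+$/;

var contactValidator = {
  $and: [
    // the user's name must be present
    {name: {$exists: true}},
    // at least one way of contacting the user must be present
    {$or: [{email: {$exists: true}}, {phone: {$exists: true}}]},
    // if an email address is present, it must match the pattern
    {$or: [{email: {$exists: false}}, {email: {$regex: emailPattern}}]}
  ]
};

// Only attempt to apply the rule when run inside the mongo shell;
// "people" is a hypothetical collection name.
if (typeof db !== "undefined") {
  db.runCommand({collMod: "people", validator: contactValidator});
}
```

The third clause uses an "absent, or matches" pair, which is one common way of expressing a conditional check with ordinary query operators.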

Adding the validation checks to a collection is very intuitive to any developer or DBA familiar with MongoDB as it uses the same expression syntax as a find query to search the database. As an example, the following snippet adds a validator to the contacts collection that checks:

The year of birth is no later than 1994
The document contains a phone number and/or an email address
At least one of the phone number and email address is stored as a string

db.runCommand({
   collMod: "contacts",
   validator: { 
      $and: [
        {year_of_birth: {$lte: 1994}},
        {$or: [ 
                  {phone: { $type: "string"}}, 
                  {email: { $type: "string"}}
              ]}]
    }})
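
For a collection that does not exist yet, the same kind of rule can be supplied when the collection is created rather than through collMod. The following is a minimal sketch of that variation; the collection name contacts_v2 is hypothetical.

```javascript
// Same rule as above, kept in a variable so it can be reused.
var contactsValidator = {
  $and: [
    {year_of_birth: {$lte: 1994}},
    {$or: [{phone: {$type: "string"}}, {email: {$type: "string"}}]}
  ]
};

// Hypothetical collection name, chosen so the sketch does not touch "contacts".
var newCollection = "contacts_v2";
var createOptions = {validator: contactsValidator};

// Only attempt to create the collection when run inside the mongo shell.
if (typeof db !== "undefined") {
  db.createCollection(newCollection, createOptions);
}
```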

When and How to Add Document Validation

Proponents of waterfall development processes would assert that all of the validations should be added right at the start of the project – certainly before going into production. This is possible, but in more agile approaches, the first version may deploy with no validations and future releases will add new data and checks. Fortunately, MongoDB 3.2 provides a great deal of flexibility in this area.

For existing data, we want to allow the application to continue to operate as we introduce validation into our collections. Therefore, we want to allow updates and simply log failed validations so we can take corrective measures separately if necessary, or take no action.

For new data, we want to ensure the data is valid and therefore return an error if the validation fails.

For any collection, developers or the DBA can specify validation rules and indicate whether failed validations result in a hard error or just a warning – Table 1 shows the available permutations.

Table 1: Configuration Options for Document Validation
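
The two settings behind these permutations are validationLevel ("off", "moderate", or "strict") and validationAction ("error" or "warn"). With "strict", the rules apply to all inserts and updates; with "moderate", they apply to inserts and to updates of documents that already satisfy the rules; "warn" logs a failure but lets the write proceed. The helper below is a hypothetical convenience for building the corresponding collMod command, not something MongoDB provides.

```javascript
// The two settings that control how a collection's validator is enforced.
var validationLevels = ["off", "moderate", "strict"];  // "strict" is the default
var validationActions = ["error", "warn"];             // "error" is the default

// Hypothetical helper: build a collMod command for one combination,
// rejecting values that are not recognised settings.
function buildValidationCommand(collection, level, action) {
  if (validationLevels.indexOf(level) === -1) {
    throw new Error("unknown validationLevel: " + level);
  }
  if (validationActions.indexOf(action) === -1) {
    throw new Error("unknown validationAction: " + action);
  }
  return {collMod: collection, validationLevel: level, validationAction: action};
}

// Example: check inserts and already-valid documents, but only log failures.
var warnOnly = buildValidationCommand("contacts", "moderate", "warn");

// Only send the command when run inside the mongo shell.
if (typeof db !== "undefined") {
  db.runCommand(warnOnly);
}
```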

Figure 1 illustrates one possible timeline for how the application is developed.

Figure 1: Aligning document validation with application lifecycle

Of course, as applications evolve they require additional pieces of data and it will often make sense to add to the document validation rules to check that this data is always included. Figure 2 illustrates an example timeline of how this could be managed.

Figure 2: Introducing New Data Together with Validations

Coping with Multiple Schema Versions

A tricky problem to solve with RDBMSs is the versioning of data models; with MongoDB it’s very straightforward to set up validations that can cope with different versions of documents, with each version having a different set of checks applied. In the example validation checks below, the following logic is applied:

If the document is unversioned (possibly dating to the time before validations were added), then no checks are applied
For version 1, the document is checked to make sure that the Name key exists
For version 2 documents, the type of the Name key is also validated to ensure that it is a string

db.runCommand({
   collMod: "contacts",
   validator:
     {$or: [{version: {"$exists": false}},
            {version: 1,
             $and: [{Name: {"$exists": true}}]
            },
            {version: 2,
             $and: [{Name: {"$exists": true, "$type": 2}}]
            }
          ]
      } 
})

In this way, multiple versions of documents can exist within the same collection, and the application can lazily up-version them over time. Note that the version attribute is user-defined.
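
A lazy up-versioning step in the application could look something like the sketch below. The upgradeContact helper and the "unknown" placeholder are hypothetical; a real application would decide for itself how to fill in a missing Name.

```javascript
// Hypothetical helper: bring a contact document up to version 2.
// Returns a new object and leaves the input untouched.
function upgradeContact(doc) {
  var upgraded = {};
  for (var key in doc) {
    if (Object.prototype.hasOwnProperty.call(doc, key)) {
      upgraded[key] = doc[key];
    }
  }
  if (upgraded.version === 2) {
    return upgraded;  // already at the latest version
  }
  // Version 2 requires Name to be a string; the placeholder is an assumption.
  if (upgraded.Name === undefined || upgraded.Name === null) {
    upgraded.Name = "unknown";
  } else if (typeof upgraded.Name !== "string") {
    upgraded.Name = String(upgraded.Name);
  }
  upgraded.version = 2;
  return upgraded;
}

// In the mongo shell: upgrade one older document and write it back.
if (typeof db !== "undefined") {
  var legacy = db.contacts.findOne({version: {$ne: 2}});
  if (legacy) {
    db.contacts.update({_id: legacy._id}, upgradeContact(legacy));
  }
}
```

Running the upgrade whenever the application happens to read an older document spreads the migration cost over time instead of requiring a single bulk rewrite.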

Document Validation Limitations in MongoDB 3.2

This is the first release of Document Validation and so it’s inevitable that there are still some things that would be great to add:

The current error message is very generic and doesn’t pick out which part of your document failed validation (note that the validation rule for a collection may check several things across many attributes). A Jira ticket tracks this.
The validation checks cannot compare one key’s value against another (whether in the same or different documents). For example {salary: {$gte: startingSalary}} is not possible. A Jira ticket tracks this.
It is the application or DBA’s responsibility to bring legacy data into compliance with new rules (there are no audits or tools) – the tutorial in this post attempts to show how this can be done.
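
One way to locate legacy documents that would fail a rule is to wrap the validator expression in $nor and use it as an ordinary query. The sketch below assumes a simple rule on the contacts collection; the nonCompliantQuery helper is hypothetical.

```javascript
// Hypothetical helper: wrap a validator in $nor to select the documents
// that do NOT satisfy it.
function nonCompliantQuery(validator) {
  return {$nor: [validator]};
}

// Example rule (an assumption): Name must be a string.
var nameRule = {Name: {$type: "string"}};
var offenders = nonCompliantQuery(nameRule);

// In the mongo shell: report how many documents would fail the rule.
if (typeof db !== "undefined") {
  print(db.contacts.count(offenders));
}
```

Because validators use the same syntax as find queries, the rule itself never has to be rewritten to perform this check.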

Where MongoDB Document Validation Excels (vs. RDBMSs)

In MongoDB, Document Validation is simple to set up. There is no need for stored procedures – which for many types of validation would be required in an RDBMS – and because the familiar MongoDB query language is used, there is no new syntax to learn.

The functionality is very flexible and it can enforce constraints on as little or as much of the schema as required. You get the best of both worlds – a dynamic schema for rapidly changing, polymorphic data, with the option to enforce strict validation checks against specific attributes from the outset of your project, or much later on. If you initially have no validations defined, they can still be added later – even once in production, across thousands of servers.

It is always a concern whether adding extra checks will impact the performance of the system; in our tests, document validation adds a negligible overhead.

So, is all Data Validation Now Done in the Database?

The answer is ‘probably not’ – either because there’s a limit to what can be done in the database or because there will always be a more appropriate place for some checks. Here are some areas to consider:

For a good user experience, checks should be made as high up the stack as is sensible. For example, the format of an entered email address should be first checked in the browser rather than waiting for the request to be processed and an attempt made to write it to the database.
Any validations which need to compare values between keys, other documents, or external information cannot currently be implemented within the database.
Many checks are best made within the application’s business logic – for example “is this user allowed to use these services in their home country”; the checks in the database are primarily there to protect against coding errors.
If you need information on why the document failed validation then the application will need to check against each of the sub-rules within the collection’s validation rule, as the error message will not currently give this level of detail.
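
To illustrate the last point, an application could keep its own named mirror of each sub-rule and report which ones a rejected document fails. The sketch below mirrors the contacts rule shown earlier; the rule names and the failedRules helper are hypothetical, and the checks approximate rather than reproduce the server's matching semantics.

```javascript
// Hypothetical application-side mirror of the contacts rule,
// split into named sub-rules so that a failure can be explained.
var subRules = [
  {
    name: "year_of_birth is no later than 1994",
    test: function (doc) {
      return typeof doc.year_of_birth === "number" && doc.year_of_birth <= 1994;
    }
  },
  {
    name: "phone or email is a string",
    test: function (doc) {
      return typeof doc.phone === "string" || typeof doc.email === "string";
    }
  }
];

// Return the names of the sub-rules that a document does not satisfy.
function failedRules(doc) {
  return subRules
    .filter(function (rule) { return !rule.test(doc); })
    .map(function (rule) { return rule.name; });
}

// Example: the year of birth is too late, but a phone number is supplied.
var problems = failedRules({year_of_birth: 2001, phone: "555-0100"});
```

Keeping such a mirror in step with the collection's validator is the application's responsibility, so this is best limited to rules where a detailed message really matters.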

Tutorial

The intent of this section is to step you through exactly how document validation can be introduced into an existing production deployment in such a way that there is no impact to your users. It covers:

Setting up some test data (not needed for a real deployment)
Using MongoDB Compass and the mongo shell to reverse engineer the de facto data model and identify anomalies in the existing documents
Defining the appropriate document validation rules
Preventing new documents being added which don’t follow the new rules
Bringing existing documents “up to spec” against the new rules

This section looks at taking an existing, deployed database which currently has no document validations defined. It steps through understanding what the current document structure looks like; deciding on what rules to add and then rolling out those new rules.

As a pre-step add some data to the database (obviously, this isn’t needed if working with your real deployment).

use clusterdb;
db.dropDatabase();
use clusterdb;
db.inventory.insert({ "_id" : 1, "sku" : "abc", 
    "description" : "product 1", "instock" : 120 });
db.inventory.insert({ "_id" : 2, "sku" : "def", 
    "description" : "product 2", "instock" : 80 });
db.inventory.insert({ "_id" : 3, "sku" : "ijk", 
    "description" : "product 3", "instock" : 60 });
db.inventory.insert({ "_id" : 4, "sku" : "jkl", 
    "description" : "product 4", "instock" : 70 });
db.inventory.insert({ "_id" : 5, "sku" : null, 
    "description" : "Incomplete" });
db.inventory.insert({ "_id" : 6 });

for (i=1000; i<2000; i++) {
  db.orders.insert({
    _id: i,
    item: "abc", 
    price: i % 50,
    quantity: i % 5
  });
};

for (i=2000; i<3000; i++) {
  db.orders.insert({
    _id: i,
    item: "jkl", 
    price: i % 30,
    quantity: Math.floor(10 * Math.random()) + 1
  });
};

for (i=3000; i<3200; i++) {
  db.orders.insert({
    _id: i,
    price: i % 30,
    quantity: Math.floor(10 * Math.random()) + 1
  });
};

for (i=3200; i<3500; i++) {
  db.orders.insert({
    _id: i,
    item: null,
    price: i % 30,
    quantity: Math.floor(10 * Math.random()) + 1
  });
};

for (i=3500; i<4000; i++) {
  db.orders.insert({
    _id: i,
    item: "abc",
    price: "free",
    quantity: Math.floor(10 * Math.random()) + 1
  });
};

for (i=4000; i<4250; i++) {
  db.orders.insert({
    _id: i,
    item: "abc",
    price: "if you have to ask....",
    quantity: Math.floor(10 * Math.random()) + 1
  });
};

The easiest way to start understanding the de facto schema for your database is to use MongoDB Compass. Simply connect Compass to your mongod (or mongos if you’re using sharding) and select the database/collection you’d like to look into. To see MongoDB Compass in action – view this demo video.

As shown in Figure 3, there are typically four keys in each document from the clusterdb.orders collection:

_id is always present and is a number
item is normally present and is a string (either “abc” or “jkl”) but is occasionally null or missing altogether (undefined)
price is always present and is in most cases a number (the histogram shows how the values are distributed between 0 and 49) but in some cases it’s a string
quantity is always present and is a number

Figure 3: Viewing the Document Schema using MongoDB Compass

For this tutorial, we’ll focus on the price. By clicking on the string label, Compass will show us more information about the string content for price – this is shown in Figure 4.

Figure 4: Drilling Down into string Values

Compass shows us that:

For those instances of price which are strings, the common values are “free” and “if you have to ask….”.
If you click on one of those values, a query expression is formed and clicking “Apply” runs that query and now Compass will show you information only for that subset of documents. For example, where price == "if you have to ask...." (see Figure 5).
By selecting multiple attributes, you can build up fairly complex queries.
The query you build visually is printed at the top so you can easily copy/paste into other contexts like the shell.

Figure 5: Formulating Search Expressions with MongoDB Compass

If applications are to work with the price from these documents then it would be simpler if it were always set to a numerical value, and so this is something that should be fixed.

Before cleaning up the existing documents, the application should be updated to ensure numerical values are stored in the price field. We can do this by adding a new validation rule to the collection. We want this rule to:

Allow changes to existing invalid documents
Prevent inserts of new documents which violate validation rules
Check that price exists and contains a double – see the enumeration of MongoDB BSON types

These steps should be run from the mongo shell:

db.orders.runCommand("collMod", 
                   {validationLevel: "moderate", 
                    validationAction: "error"});

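// A $type check only matches when the key is present, so it also covers
// the "price exists" requirement. (Repeating the price key in one object
// literal would silently keep only the last entry.)
db.runCommand({collMod: "orders", 
               validator: {
                  price: {$type: 1}   // 1 is the BSON type number for double
                }
              });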

The validation rules for this collection can now be checked:

db.getCollectionInfos({name:"orders"})
[
  {
    "name": "orders",
    "options": {
      "validator": {
        "price": {
          "$type": 1
        }
      },
      "validationLevel": "moderate",
      "validationAction": "error"
    }
  }
]

Now that this has been set up, it’s possible to check that we can’t add a new document that breaks the rule:

db.orders.insert({
    "_id": 6666, 
    "item": "jkl", 
    "price": "rogue",
    "quantity": 1 });

Document failed validation
WriteResult({
  "nInserted": 0,
  "writeError": {
    "code": 121,
    "errmsg": "Document failed validation"
  }
})
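
Application code may want to tell a validation failure apart from other write errors. A minimal sketch follows, assuming the result object has the same shape as the WriteResult printed above; the failedValidation helper is hypothetical.

```javascript
// Error code reported for a write rejected by document validation,
// as seen in the output above.
var DOCUMENT_VALIDATION_FAILURE = 121;

// Hypothetical helper: assumes a result shaped like the printed WriteResult.
function failedValidation(result) {
  return Boolean(
    result &&
    result.writeError &&
    result.writeError.code === DOCUMENT_VALIDATION_FAILURE
  );
}

// Example using a plain object with the same fields as the output above.
var rejected = {
  nInserted: 0,
  writeError: {code: 121, errmsg: "Document failed validation"}
};
var wasRejected = failedValidation(rejected);
```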

But it’s OK to modify an existing document that does break the rule:

db.orders.findOne({price: {$type: 2}});

{
  "_id": 3500,
  "item": "abc",
  "price": "free",
  "quantity": 5
}

> db.orders.update(
    {_id: 3500},
    {$set: {quantity: 12}});

Updated 1 existing record(s) in 5ms
WriteResult({
  "nMatched": 1,
  "nUpserted": 0,
  "nModified": 1
})

Now that the application is no longer able to store new documents that break the new rule, it’s time to clean up the “legacy” documents. Note that Compass works on a random sample of the documents in a collection (this is what allows it to be so quick). To make sure that we’re fixing all of the documents, we check from the mongo shell. As the following commands could consume significant resources, it may make sense to run them on a secondary:

secondary> db.orders.aggregate([
    {$match: {
      price: {$type: 2}}},
    {$group: {
      _id: "$price", 
      count: {$sum:1}}}
  ])

{ "_id" : "if you have to ask....", "count" : 250 }
{ "_id" : "free", "count" : 500 }

The number of exceptions isn’t too high and so it is safe to go ahead and fix up the data without consuming too many resources:

db.orders.update(
    {price:"free"},
    {$set: {price: 0}},
    {multi: true});

db.orders.update(
    {price:"if you have to ask...."},
    {$set: {price: 1000000}},
    {multi: true});
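
If there were more than a couple of legacy values, the same clean-up could be driven from a mapping table, followed by a count of any documents that still break the rule. This is only a sketch; the buildPriceFixes helper is hypothetical and the replacement values are the ones used above.

```javascript
// Legacy string prices and the numeric values that replace them.
var replacements = {"free": 0, "if you have to ask....": 1000000};

// Hypothetical helper: turn the mapping into one multi-update per legacy value.
function buildPriceFixes(mapping) {
  return Object.keys(mapping).map(function (legacyValue) {
    return {
      query: {price: legacyValue},
      update: {$set: {price: mapping[legacyValue]}},
      options: {multi: true}
    };
  });
}

var fixes = buildPriceFixes(replacements);

// In the mongo shell: apply the fixes, then count what still fails the rule.
if (typeof db !== "undefined") {
  fixes.forEach(function (fix) {
    db.orders.update(fix.query, fix.update, fix.options);
  });
  // Documents whose price is missing or is not a double.
  print(db.orders.count({price: {$not: {$type: 1}}}));
}
```

A remaining count of zero suggests that every document in the collection now satisfies the price rule.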

At this point it’s safe to switch to strict mode, where any insert or update will cause an error if the document being stored doesn’t follow the rules:

db.orders.runCommand("collMod", 
                   {validationLevel: "strict", 
                    validationAction: "error"});

Next Steps

Hopefully this has given you a sense for what the Document Validation functionality offers and started you thinking about how it could be applied to your application and database. I’d encourage you to read up more on the topic and these are some great resources:

Webinar: Document Validation in MongoDB 3.2
MongoDB 3.2 documentation for Document Validation
The best way to really get a feel for the functionality is to try it out for yourself: Download MongoDB 3.2
Feedback is welcomed and we’d encourage you to join the MongoDB 3.2 bug hunt
Document Validation and What Dynamic Schema Means – Eliot Horowitz. This blog post adds context to why this functionality is being introduced now.
Bulletproof Data Management – Buzz Moschetti. Great presentation on how to look after your data – including in earlier versions of MongoDB.

Andrew Morgan on Databases