Tag Archive for join

Joins and Other Aggregation Enhancements in MongoDB 3.2

This post looks at the aggregation enhancements being introduced in MongoDB 3.2 – most notably $lookup which implements left-outer equi-joins in the MongoDB Aggregation Framework. The material was originally published in a MongoDB blog series.

It starts with an introduction to analyzing data with MongoDB. We then explain why joins are sometimes useful for MongoDB – in spite of the strengths of the document model – and how developers have been working without them. It then works through examples of building aggregation pipelines – including using the operators added in MongoDB 3.2. After that, we look at how geolocation data can be included as well as what to do when you reach the limit of what can be done using a single pipeline – including adding wrapper code. Finally, there’s a summary of some of the limitations of the Aggregation Framework and reasons why you might supplement it with a full visualization solution such as Tableau together with MongoDB’s Connector for BI (Business Intelligence) – also new in MongoDB 3.2.

Disclaimer

MongoDB’s product plans are for informational purposes only. MongoDB’s plans may change and you should not rely on them for delivery of a specific feature at a specific time.

Real-Time Analytics and Search

With the emergence of new data sources such as social media, mobile applications and sensor-equipped “Internet of Things” networks, organizations can extend analytics to deliver real-time insight and discovery into such areas as operational performance, customer satisfaction, and competitor behavior.

Time to value is everything. For example, having access to real-time customer sentiment or fleet tracking is of little benefit unless the data can be analyzed and reported in real-time.

MongoDB 3.2 aims to extend the options for performing analytics on the live, operational database – ensuring that answers are delivered quickly, and reflect current data. Work that would previously have needed to be done on the client side can now be performed by the database – freeing the developer to focus on new features.

The Case for Joins

MongoDB’s document data model is flexible and provides developers many options in terms of modeling their data. Most of the time all the data for a record tends to be located in a single document. For the operational application, accessing data is simple, high performance, and easy to scale with this approach.

When it comes to analytics and reporting, however, it is possible that the data you need to access spans multiple collections. This is illustrated in Figure 1, where the _id field of multiple documents from the products collection is included in a document from the orders collection. For a query to analyze orders and details about their associated products, it must fetch the order document from the orders collection and then use the embedded references to read multiple documents from the products collection. Prior to MongoDB 3.2, this work is implemented in application code. However, this adds complexity to the application and requires multiple round trips to the database, which can impact performance.

Application-Layer simulation of joins between documents
Figure 1: Application-Layer simulation of joins between documents

MongoDB 3.2 introduces the $lookup operator that can now be included as a stage in an aggregation pipeline. With this approach, the work of combining data from the orders and products collections is implemented within the database, and as part of a broader aggregation pipeline that performs other processing in a single query. As a result, there is less work to code in the application, and fewer round trips to the database. You can think about $lookup as equivalent to a left outer equi-join.

Aside – What is a Left Outer Equi-Join?

A left outer equi-join produces a result set that contains data for all documents from the left table (collection) together with data from the right table (collection) for documents where there is a match with documents from the left table (collection). This is illustrated in Figure 2.

Left-Outer join between tables/collections
Figure 2: Left-Outer join between collections

MongoDB’s Aggregation Framework

The Aggregation Framework is a pipeline for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results. The pipeline consists of stages; each stage transforms the documents as they pass through.

In general, each successive stage reduces the volume of data; removing information that isn’t needed and combining other data to produce summarized results.

Figure 3 shows a conceptual model for the Aggregation Framework pipeline. This is what’s happening at each stage:

  • On the left-hand side/start of the pipeline is the original collection contents – each record (document) containing a number of shapes (keys), each with a particular color (value)
  • The $match stage filters out any documents that don’t contain a red diamond
  • The $project stage adds a new “square” attribute with a value computed from the value (color) of the snowflake and triangle attributes
  • The $lookup stage (new in 3.2 – more details later) performs a left-outer join with another collection, with the star being the comparison key. This creates new documents which contain everything from the previous stage but augmented with data from any document from the second collection containing a matching colored star (i.e., the blue and yellow stars had matching “lookup” values, whereas the red star had none).
  • Finally, the $group stage groups the data by the color of the square and produces statistics (sum, average and standard deviation) for each group.

MongoDB Aggregation Framework Pipeline
Figure 3: MongoDB Aggregation Framework pipeline

This is the full set of aggregation stages:

  • $match – Filter documents
  • $geoNear – Sort documents based on geographic proximity
  • $project – Reshape documents (remove or rename keys or add new data based on calculations on the existing data)
  • $lookupComing in 3.2 – Left-outer joins
  • $unwind – Expand documents (for example create multiple documents where each contains one element from an array from the original document)
  • $group – Summarize documents
  • $sample – Randomly selects a subset of documents
  • $sort – Order documents
  • $skip – Jump over a number of documents
  • $limit – Limit number of documents
  • $redact – Restrict sensitive content from documents
  • $outComing in 3.2* – store the results in a new collection

The details can be found in the documentation.

New Aggregation Operators in MongoDB 3.2

There are operators used within each stage and this set is being extended in MongoDB 3.2 to include:

  • Array operations
    • $slice, $arrayElemAt, $concatArrays, $isArray, $filter, $min, $max, $avg and $sum (some of these were previously available in a $group stage but not in $project
  • Standard Deviations
    • $stdDevSamp (based on a sample) and $stdDevPop (based on the complete population)
  • Square Root
    • $sqrt
  • Absolute (make +ve) value
    • $abs
  • Rounding numbers
    • $trunc, $ceil, $floor
  • Logarithms
    • $log, $log10, $ln
  • Raise to power
    • $pow
  • Natural Exponent
    • $exp

Further details on these new operators can be found in the MongoDB 3.2 Release Notes.

$lookup – Left Outer Equi-Joins

Figure 4 illustrates the syntax for performing the join:

  • leftCollection is the collection that the aggregation is being performed on and is the left collection in the join
  • from identifies the collection that it will be joined with – the right collection (rightCollection in this case)
  • localField specifies the key from the original/left collection – leftVal
  • foreignField specifies the key from the right collection – rightVal
  • as indicates that the data from the right collection should be embedded within the resulting documents as an array called embeddedData

$lookup – Left-Outer Joins for MongoDB
Figure 4: $lookup – Left-Outer Joins for MongoDB

In the follow-on blogs in this series, you’ll see how the data from a home sales collection (containing details of each home sale, including the property’s postal code) is joined with data from a postal code collection (containing postal codes and their geographical location). This produces documents that contain the original home sale information augmented with the coordinates of the property. In this case, the “homesales” collection is the left-collection and “postcodes” the right-collection; the “postcode” field from each collection is the localField which is matched with the foreignField.

Worked Examples

The Data Set

The examples use two data sets, the first being a comprehensive set of home sale data for a town and the second being a mapping from postal codes to geospatial locations for that same town.

For those interested, the imported data sets needed some cleaning up to make this walkthrough more useful – the steps are described here.

The two data sets can be checked using the mongo shell:

db.homeSales.findOne()
{
  "_id": ObjectId("56005dd980c3678b19792b7f"),
  "amount": 9000,
  "date": ISODate("1996-09-19T00:00:00Z"),
  "address": {
    "nameOrNumber": 25,
    "street": "NORFOLK PARK COTTAGES",
    "town": "MAIDENHEAD",
    "county": "WINDSOR AND MAIDENHEAD",
    "postcode": "SL6 7DR"
  }
}

db.postcodes.findOne()
{
  "_id": ObjectId("5600521e50fa77da54dfc0d2"),
  "postcode": "SL6 0AA",
  "location": {
    "type": "Point",
    "coordinates": [
      51.525605,
      -0.700974
    ]
  }
}

An even better option to understand the contents of these collections is to use MongoDB Compass (to be released with MongoDB 3.2). Figure 1 shows an overview of the homeSales collection and Figure 2 delves into its address sub-document.

These datasets (the homeSales and postcodes collections) can be downloaded here.

MongoDB Compass View of homeSales Collection
Figure 1: MongoDB Compass View of the homeSales Collection

Viewing Sub-Documents With MongoDB Compass
Figure 2: Viewing Sub-Documents With MongoDB Compass

Building the First Pipeline

As the error messages from complex pipelines aren’t always very specific, it makes sense to start with a simple pipeline and then check the results before moving onto the next.

As a collection’s indexes are only used for the beginning stages in the pipeline (before any transformations are performed), it is often sensible to reduce the result set as much as possible with a $match stage to filter out any unnecessary documents. Ideally, the match would be against the sharding key so that fewer shards need to be included. For the first pipeline stage the cheaper property sales are going to excluded and so it would help to have a secondary index on the amount key:

db.homeSales.createIndex({amount: 1})

The first stage in the pipeline then filters out any sales of less than £3,000,000:

db.homeSales.aggregate([
  {$match: {
    amount: {$gte:3000000}}
  }
  ])

The results can then be inspected to understand what will flow into the next stage in the pipeline:

{
    {
      "_id": ObjectId("56005dda80c3678b19799e52"),
      "amount": 3000000,
      "date": ISODate("2012-04-19T00:00:00Z"),
      "address": {
        "nameOrNumber": "TEMPLE FERRY PLACE",
        "street": "MILL LANE",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      }
    },

...

    {
      "_id": ObjectId("56005dda80c3678b19799e5c"),
      "amount": 5425000,
      "date": ISODate("1999-03-15T00:00:00Z"),
      "address": {
        "nameOrNumber": "2 - 3",
        "street": "THE SWITCHBACK",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 7RJ"
      }
    }

In the next stage, a left-outer join is performed – using $lookup – to find documents from the postcodes collection with a matching postcode so that the geographic location can be determined:

db.homeSales.aggregate([
  {$match: {
    amount: {$gte:3000000}}
  }, 
  {$lookup: {
    from: "postcodes", 
    localField: "address.postcode",
    foreignField: "postcode",
    as: "postcode_docs"}
  }
])

Which yields these results:

{
    {
      "_id": ObjectId("56005dda80c3678b19799e52"),
      "amount": 3000000,
      "date": ISODate("2012-04-19T00:00:00Z"),
      "address": {
        "nameOrNumber": "TEMPLE FERRY PLACE",
        "street": "MILL LANE",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      },
      "postcode_docs": [
        {
          "_id": ObjectId("560053e280c3678b1978b293"),
          "postcode": "SL6 5ND",
          "location": {
            "type": "Point",
            "coordinates": [
              51.549516,
              -0.80702
            ]
          }
        }
      ]
    },

...

      "postcode_docs": [
        {
          "_id": ObjectId("560053e280c3678b1978b524"),
          "postcode": "SL6 7RJ",
          "location": {
            "type": "Point",
            "coordinates": [
              51.536848,
              -0.735835
            ]
          }
        }
      ]
    }

The pipeline can then be extended with a $project stage to refactor the documents, removing any information that isn’t needed. The documents are then sorted in reverse-price order:

db.homeSales.aggregate([
  {$match: {
    amount: {$gte:3000000}}
  }, 
  {$lookup: {
    from: "postcodes", 
    localField: "address.postcode",
    foreignField: "postcode",
    as: "postcode_docs"}
  },
  {$project: {
    _id: 0,
    saleDate: "$date",
    price: "$amount",
    address: 1,
    location: "$postcode_docs.location"}},
  {$sort:
    {
      price: -1
    }}
])

The address and physical location of every home sale, starting with the most expensive is then shown:

    {
      "address": {
        "nameOrNumber": "2 - 3",
        "street": "THE SWITCHBACK",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 7RJ"
      },
      "saleDate": ISODate("1999-03-15T00:00:00Z"),
      "price": 5425000,
      "location": [
        {
          "type": "Point",
          "coordinates": [
            51.536848,
            -0.735835
          ]
        }
      ]
    },

...

    {
      "address": {
        "nameOrNumber": "TEMPLE FERRY PLACE",
        "street": "MILL LANE",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      },
      "saleDate": ISODate("2012-04-19T00:00:00Z"),
      "price": 3000000,
      "location": [
        {
          "type": "Point",
          "coordinates": [
            51.549516,
            -0.80702
          ]
        }
      ]
    }

Building Further Pipelines

If all of the sales need to be analyzed (rather than just the most expensive few) then there would be too many results from the previous pipeline to be easily understood. For this reason, the pipeline is modified so that extra analysis and aggregation is performed within the database. This can be done using a $group stage – in this case to produce sales statistics by year:

db.homeSales.aggregate([
  {
    $group: 
    {
      _id: {$year: "$date"},
      higestPrice: {$max: "$amount"},
      lowestPrice: {$min: "$amount"},
      averagePrice: {$avg: "$amount"},
      priceStdDev: {$stdDevPop: "$amount"}
    }
  },
  {
    $sort: {_id: 1}
  }
])
    {
      "_id": 1995,
      "higestPrice": 1000000,
      "lowestPrice": 12000,
      "averagePrice": 114059.35206869633,
      "priceStdDev": 81540.50490801703
    },

...

    {
      "_id": 2015,
      "higestPrice": 1688000,
      "lowestPrice": 125000,
      "averagePrice": 451413.23917137476,
      "priceStdDev": 228293.6005201898
    }

Note that this is using the $stdDevPop (standard deviation) aggregation operator being introduced in MongoDB 3.2.

There’s more precision than needed for some of the keys and so a common pattern can be employed – use a final $project stage to tidy up the data:

db.homeSales.aggregate([
  {
    $group: 
    {
      _id: {$year: "$date"},
      higestPrice: {$max: "$amount"},
      lowestPrice: {$min: "$amount"},
      averagePrice: {$avg: "$amount"},
      priceStdDev: {$stdDevPop: "$amount"}
    }
  },
  {
    $sort: {_id: 1}
  },
  {
    $project:
    {
      _id: 1,
      higestPrice: 1,
      lowestPrice: 1,
      averagePrice: {$trunc: "$averagePrice"},
      priceStdDev: {$trunc: "$priceStdDev"}
    }
  }
])
    {
      "_id": 1995,
      "higestPrice": 1000000,
      "lowestPrice": 12000,
      "averagePrice": 114059,
      "priceStdDev": 81540
    },

...

    {
      "_id": 2015,
      "higestPrice": 1688000,
      "lowestPrice": 125000,
      "averagePrice": 451413,
      "priceStdDev": 228293
    }

It often makes sense to store the results in a new collection – either because the results will be reused many times or that subsequent aggregation pipelines will reference them. This is simple to achieve using a $out stage; note that:

  • If the target collection already exists then its contents will be overwritten
  • If used then the $out must be the final stage in the pipeline

The following example writes the results to a collection called annualHomePrices:

db.homeSales.aggregate([
  {
    $group: 
    {
      _id: {$year: "$date"},
      highestPrice: {$max: "$amount"},
      lowestPrice: {$min: "$amount"},
      averagePrice: {$avg: "$amount"},
      priceStdDev: {$stdDevPop: "$amount"}
    }
  },
  {
    $sort: {_id: 1}
  },
  {
    $project:
    {
      _id: 0,
      year: "$_id",
      highestPrice: 1,
      lowestPrice: 1,
      averagePrice: {$trunc: "$averagePrice"},
      priceStdDev: {$trunc: "$priceStdDev"}
    }
  },
  {
    $out: "annualHomePrices"
  }
])
> db.annualHomePrices.findOne()
{
  "_id": ObjectId("560957ac29a5574d557d426d"),
  "highestPrice": 1000000,
  "lowestPrice": 12000,
  "averagePrice": 114059,
  "priceStdDev": 81540,
  "year": 1995
}

These “interim” results could then be used to build further result sets – without the need to run all of that processing again. As an example, using a simple $project stage, the gap between the highest and lowest house sale can be calculated for each year:

db.annualHomePrices.aggregate([
  {$project: 
    {
      Year: "$year",
      hightToLowPriceGap: {
        $subtract: ["$highestPrice", "$lowestPrice"]
      },
      _id: 0
    }
  }
])
{
   "Year": 2012,
   "hightToLowPriceGap": 2923000
},
{
   "Year": 2013,
   "hightToLowPriceGap": 5092250
},
{
   "Year": 2014,
   "hightToLowPriceGap": 3883050
},
{
  "Year": 2015,
  "hightToLowPriceGap": 1563000
}

As a final stage in this post, a pipeline is built to find the postal code and geographic location of the most expensive house sale for each of the recorded years:
Perform a $sort on the full homeSales data set so that the documents are ordered from most expensive sale first
$group the results based on the year of the home sale, deriving the priciestPostCode from the $first document in that group (year). Because the documents were sorted on price before entering the group stage, the first document is also the one with the highest price
$lookup the postal code in the postcodes collection to get the geolocation data
$sort the results by year
$project just the data that is of interest

db.homeSales.aggregate([
    {
      $sort: {amount: -1}
    },
    {
      $group:
      {
        _id: {$year: "$date"},
        priciestPostCode: {$first: "$address.postcode"}
      }
    },
    {
      $lookup:
      {
        from: "postcodes",
        localField: "priciestPostCode",
        foreignField: "postcode",
        as: "locationData"
      }
    },
    {
      $sort: {_id: -1}
    },
    {
      $project:
      {
        _id: 0,
        Year: "$_id",
        PostCode: "$priciestPostCode",
        Location: "$locationData.location"
      }
    }
  ])
{
  "Year": 2015,
  "PostCode": "SL6 9UD",
  "Location": [
    {
      "type": "Point",
      "coordinates": [
        51.558455,
        -0.756023
      ]
    }
  ]
},
{
  "Year": 2014,
  "PostCode": "SL6 1UP",
  "Location": [
    {
      "type": "Point",
      "coordinates": [
        51.51407,
        -0.704414
      ]
    }
  ]
},
...

Adding Some Coding Glue and Geolocation

Obviously, there are limits as to how much can be achieved with a single aggregation pipeline but with the addition of just a little code outside of the database (in this example, JavaScript in the mongo shell), additional results can be produced.

We start by repeating an aggregation from the previous section but store the data in a collection so that we can build upon those results:

db.homeSales.aggregate([
    {
      $sort: {amount: -1}
    },
    {
      $group:
      {
        _id: {$year: "$date"},
        priciestPostCode: {$first: "$address.postcode"}
      }
    },
    {
      $lookup:
      {
        from: "postcodes",
        localField: "priciestPostCode",
        foreignField: "postcode",
        as: "locationData"
      }
    },
    {
      $sort: {_id: -1}
    },
    {
      $project:
      {
        _id: 0,
        Year: "$_id",
        PostCode: "$priciestPostCode",
        Location: "$locationData.location"
      }
    },
    {
      $out: "hottestLocations"
    }
  ])
db.hottestLocations.findOne()
{
  "_id": ObjectId("5629108c96be45aba9cb0c98"),
  "Year": 2015,
  "PostCode": "SL6 9UD",
  "Location": [
    {
      "type": "Point",
      "coordinates": [
        51.558455,
        -0.756023
      ]
    }
  ]
}

In this example, geospatial operations are performed on the location from the postcodes collection and so a geospatial index should be added:

db.postcodes.createIndex({location: "2dsphere"})

A function is created that, given a location, will find the five nearest postcodes – taking advantage of a $geoNear stage – note that this must be the first stage in the pipeline:

var findNeighbours = function (spot, yearTag) {
  var result = db.postcodes.aggregate([
      {
        $geoNear:
        {
          near: spot,
          distanceField: "distance",
          num: 5,
          spherical: true
        }
      },
      {
        $group: {
          _id: yearTag,
          "neighbours": {
            $addToSet: "$postcode"
          }
        }
      }
    ]);
  return result.result;
}

Then, for each of the hottestLocations collection, that function is called to display and the contents of the returned cursor are displayed:

db.hottestLocations.find().forEach(function(myDoc) {
  var myCursor = findNeighbours(myDoc.Location[0], myDoc.Year);
  myCursor.forEach(printjson);
})

The result is a list of the 5 closest postcodes to the top selling home each year:

{
  "_id": 1995,
  "neighbours": [
    "SL6 2NL",
    "SL6 2JL",
    "SL6 2NB",
    "SL6 2JN",
    "SL6 2NA"
  ]
}

...

{
  "_id": 2015,
  "neighbours": [
    "SL6 9XB",
    "SL6 9XL",
    "SL6 9UE",
    "SL6 9UB",
    "SL6 9UD"
  ]
}

Bonus Query – For Those Choosing a School

There’s often a correlation between the house prices near a school and that school’s performance. So, if you were considering a specific school then it might make sense to check house prices in the area surrounding that school.

The following pipeline will find house price statistics, by year, for all postcodes within a 3 km radius of the school – which is located at coordinates (51.5156725, -0.727387):

db.postcodes.aggregate([
      {
        $geoNear:
        {
          near: 
          {
            "type": "Point",
            "coordinates": [
              51.5156725,
              -0.727387
            ]},
          distanceField: "distance",
          num: 10000,
          maxDistance: 3000,
          spherical: true
        }
      },
      {
        $lookup: {
          from: "homeSales",
          localField: "postcode",
          foreignField: "address.postcode",
          as: "priceData"
        }
      },
      {
        $unwind: "$priceData"
      },

      {
        $group: 
        {
          _id: {$year: "$priceData.date"},
          highestPrice: {$max: "$priceData.amount"},
          lowestPrice: {$min: "$priceData.amount"},
          averagePrice: {$avg: "$priceData.amount"},
          priceStdDev: {$stdDevPop: "$priceData.amount"}
        }
      },
      {
        $project:
        {
          _id: 0,
          Year: "_id",
          highestPrice: 1,
          lowestPrice: 1,
          averagePrice: {$trunc: "$averagePrice"},
          priceStdDev: {$trunc: "$priceStdDev"}
        }
      },
      {
        $sort: 
        {
          "Year": -1
        }
      }
    ]);
{
  "highestPrice": 1350000,
  "lowestPrice": 125000,
  "averagePrice": 410593,
  "priceStdDev": 182358,
  "Year": 2015
},
...
{
  "highestPrice": 930000,
  "lowestPrice": 12000,
  "averagePrice": 103455,
  "priceStdDev": 68615,
  "Year": 1995
}

Limitations

As seen, it’s possible to build up sophisticated analytical queries using the enhanced aggregation features – especially when pipelines are combined with a little application or scripting glue.

Some limitations that you may meet include:

  • $geoNear can only be used as the first stage in the pipeline
  • $lookup only supports equality for the match and the equality has to be between a single key from each collection
  • The right-collection for $lookup cannot be sharded
  • The pipeline is linear; there are no forks and once data has been aggregated, the lost details can’t be reused later in the pipeline (this is why writing results to a new collection using $out can be helpful)
  • One can remove information at each stage but it’s impossible to add new raw data (other than through $lookup)
  • Indexes are only used for the beginning stages of the pipeline (and right tables in any subsequent $lookup)
  • $out can only be used in the final stage of the pipeline

When to Use Full Data Visualization Solutions

A lot can be achieved directly in the database – especially when augmented with a small amount of code. So when would it make sense to use a BI visualization tool such as Tableau. The capabilities will vary from product to product but some general considerations are given here:

  • Visualization – displaying information in graphs and on maps (rather than in JSON result sets) makes it much simpler for the human mind to see patterns and draw conclusions from the data (see Figure 1 which is based on the same data set used earlier and graphically shows the highest home sale price by location and year)
  • Multiple Data Sources – combining data from multiple sources (data blending); for example from a MongoDB database and an Excel spreadsheet can greatly broaden the context of reports
  • Interactivity – visualization tools make it simple to create interactive queries/dashboards where business user can graphically tweak parameters to get precise results and test theories
  • Skills – performing all of the analytics directly in MongoDB requires knowledge of the MongoDB Query Language and possibly some basic coding skills. Using the visualization tools is analogous to using intermediate features in Microsoft Excel such as pivot tables and graphs and so it opens up the ability to analyze the data to a broader set of users in the organization
  • Extra functions – for example, the ability to add trend lines to a chart

MongoDB Data Visualized in a Tableau Map
Figure 1: MongoDB Data Visualized in a Tableau Map

MongoDB 3.2 introduces the MongoDB Connector for BI which allows Business Intelligence tools such as Tableau to access data from MongoDB using SQL – opening up a range of new options for performing analytics on live data.

Next Steps

To learn more about what’s coming up in MongoDB 3.2, register for the What’s new in MongoDB 3.2 webinar and review the MongoDB 3.2 release notes.

There’s a webinar recording explaining more about $lookup and the other aggregation features.

To get the best understanding of the new features then you should experiment with the software which is available in the MongoDB 3.2 (not for production) download – to use the new $lookup aggregation theMongoDB Enterprise Advanced download should be used.

The reason MongoDB releases development releases is to give the community a chance to try out the new software – and we hope that you’ll give us feedback, whether it be by joining the MongoDB 3.2 bug hunt or commenting on this post.





70x Faster Joins with AQL now GA with MySQL Cluster 7.2

70x faster joins with AQL

The new GA MySQL Cluster 7.2 Release (7.2.4) just announced by Oracle includes 2 new features which when combined can improve the performance of joins by a factor of 70x (or even higher). The first enhancement is that MySQL Cluster now provides the MySQL Server with better information on the available indexes which allows the MySQL optimizer to automatically produce better query execution plans. Previously it was up to the user to manually provide hints to the optimizer. The second new feature is Adaptive Query Localization which allows the work of the join to be distributed across the data nodes (local to the data it’s working with) rather than up in the MySQL Server; this allows more computing power to be applied to calculating the join as well as dramatically reducing the number of messages being passed around the system. The combined result is that your joins can now run MUCH faster and this post describes a test that results in a 70x speed-up for a real-world query.

The Query

11-Way Join used in Test

The join used in this test is based on a real-world example used for an on-line store/Content Management System. The original query identified all of the media in the system which was appropriate to a particular device and for which a user is entitled to access. As this query is part of a customer’s application I’ve replaced all of the table and column names.

The join runs across 11 tables (which contain 33.5K rows in total) and produces a result set of 2,060 rows, each with 19 columns. The figure to the right illustrates the join and the full join is included below.

SELECT
        tab1.uniquekey,
        tab8.name,
        tab8.tab8id,
        tab11.name,
        tab11.tab11id,
        tab11.value,
        tab10.tab10id,
        tab10.name,
        tab2.name,
        tab2.tab2id,
        tab4.value + tab5.value + tab6.value,
        tab3.colx,
        tab3.tab3id,
        tab4.tab4id,
        tab4.name,
        tab5.tab5id,
        tab5.name,
        tab6.tab6id,
        tab6.name
FROM
        tab1,tab2,tab3,tab4,tab5,tab6,tab7,tab8,tab9,tab10,tab11
WHERE
        tab7.tab2id = tab2.tab2id	AND
        tab7.tab8id = tab8.tab8id	AND
        tab9.tab2id = tab2.tab2id	AND
	tab9.tab10id = tab10.tab10id	AND
	tab10.tab11id = tab11.tab11id	AND
        tab3.tab2id = tab2.tab2id	AND
	tab3.tab4id = tab4.tab4id	AND
	tab4.tab5id = tab5.tab5id	AND
	tab4.colz =  'Y'		AND
	tab5.tab6id = tab6.tab6id	AND
	tab6.tab6id IN (6)		AND
	(tab3.tab4id IN (66, 77, 88))	AND
	tab1.tab2id = tab2.tab2id	AND
	tab1.colx = 6;

Enabling AQL

First of all, make sure that you’re using the GA version of MySQL Cluster (7.2.4 or later); the Open Source version is available from http://dev.mysql.com/downloads/cluster/#downloads

and the commercial version from the Oracle Software Delivery Cloud. You can double check that AQL is enabled:

mysql> show variables like 'ndb_join_pushdown';

| ndb_join_pushdown                   | ON |

Running the Query & Results

Test configuration

To get the full benefit from AQL, you should run “ANALYZE TABLE;” once for each of the tables (no need to repeat for every query and it only needs running on one MySQL Server in the Cluster). This is very important and you should start doing this as a matter of course when you create or modify a table.

For this test, 3 machines were used:

  1. Intel Core 2 Quad Core @2.83 GHz; 8 Gbytes RAM; single, multi-threaded data node (ndbmtd)
  2. Intel Core 2 Quad Core @2.83 GHz; 8 Gbytes RAM; single, multi-threaded data node (ndbmtd)
  3. 4 Core Fedora VM running on VirtualBox on Windows 7, single MySQL Server

The query was then run and compared to MySQL CLuster 7.1.15a:

MySQL Cluster 7.1.15a 1 minute 27.23 secs
MySQL Cluster 7.2.1 (without having run ANALYZE TABLE) 1 minute 5.3 secs 1.33x Cluster 7.1
MySQL Cluster 7.2.1 (having run ANALYZE TABLE) 1.26 secs 69.23x Cluster 7.1

How it Works

Classic Nested-Loop-Join

Traditionally, joins have been implemented in the MySQL Server where the query was executed. This is implemented as a nested-loop join; for every row from the first part of the join, a request has to be sent to the data nodes in order to fetch the data for the next level of the join and for every row in that level…. This method can result in a lot of network messages which slows down the query (as well as wasting resources). When turned on, Adaptive Query Localization results in the hard work being pushed down to the data nodes where the data is locally accessible. As a bonus, the work is divided amongst the pool of data nodes and so you get parallel execution.

NDB API

I’ll leave the real deep and dirty details to others but cover the basic concepts here. All API nodes access the data nodes using the native C++ NDB API, the MySQL Server is one example of an API node (the new Memcached Cluster API is another). This API has been expanded to allow parameterised or linked queries where the input from one query is dependent on the previous one. To borrow an example from an excellent post by Frazer Clement on the topic, the classic way to implement a join would be…

SQL > select t1.b, t2.c from t1,t2 where t1.pk=22 and t1.b=t2.pk;
  ndbapi > read column b from t1 where pk = 22;
              [round trip]
           (b = 15)
  ndbapi > read column c from t2 where pk = 15;
              [round trip]
           (c = 30)
           [ return b = 15, c = 30 ]

Using the new functionality this can be performed with a single network round trip where the second read operation is dependent on the results of the first…

  ndbapi > read column @b:=b from t1 where pk = 22;
           read column c from t2 where pk=@b;
              [round trip]
           (b = 15, c = 30)
           [ return b = 15, c = 30 ]

You can check whether your query is fitting these rules using EXPLAIN, for example:

mysql> set ndb_join_pushdown=on;
mysql> EXPLAIN SELECT COUNT(*) FROM residents,postcodes WHERE residents.postcode=postcodes.postcode AND postcodes.town="MAIDENHEAD";
+----+-------------+-----------+--------+---------------+---------+---------+------------------------------+--------+--------------------------------------------------------------------------+
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                          | rows   | Extra                                                                    |
+----+-------------+-----------+--------+---------------+---------+---------+------------------------------+--------+--------------------------------------------------------------------------+ 
| 1  | SIMPLE      | residents | ALL    | NULL          | NULL    | NULL    | NULL                         | 100000 | Parent of 2 pushed join@1                                                |
| 1  | SIMPLE      | postcodes | eq_ref | PRIMARY       | PRIMARY | 22      | clusterdb.residents.postcode | 1      | Child of 'residents' in pushed join@1; Using where with pushed condition | 
+----+-------------+-----------+--------+---------------+---------+---------+------------------------------+--------+--------------------------------------------------------------------------+
mysql> EXPLAIN EXTENDED SELECT COUNT(*) FROM residents,postcodes,towns 
  WHERE residents.postcode=postcodes.postcode AND 
  postcodes.town=towns.town AND towns.county="Berkshire"; 
+----+-------------+-----------+--------+---------------+---------+---------+------------------------------+--------+----------+------------------------------------------------------------------------------------------------------------------------+ 
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                          | rows   | filtered | Extra                                                                                                                  | 
+----+-------------+-----------+--------+---------------+---------+---------+------------------------------+--------+----------+------------------------------------------------------------------------------------------------------------------------+ 
| 1  | SIMPLE      | residents | ALL    | NULL          | NULL    | NULL    | NULL                         | 100000 | 100.00   | Parent of 3 pushed join@1                                                                                              | 
| 1  | SIMPLE      | postcodes | eq_ref | PRIMARY       | PRIMARY | 22      | clusterdb.residents.postcode | 1      | 100.00   | Child of 'residents' in pushed join@1                                                                                  |
| 1  | SIMPLE      | towns     | eq_ref | PRIMARY       | PRIMARY | 22      | clusterdb.postcodes.town     | 1      | 100.00   | Child of 'postcodes' in pushed join@1; Using where with pushed condition: (`clusterdb`.`towns`.`county` = 'Berkshire') | 
+----+-------------+-----------+--------+---------------+---------+---------+------------------------------+--------+----------+------------------------------------------------------------------------------------------------------------------------+

Note that if you want to check for more details why your join isn’t currently being pushed down to the data node then you can use “EXPLAIN EXTENDED” and then “SHOW WARNINGS” to get more hints. Hopefully that will allow you to tweak your queries to get the best improvements.

PLEASE let us know your experiences and give us examples of queries that worked well and (just as importantly) those that didn’t so that we can improve the feature – just leave a comment on this Blog with your table schemas, your query and your before/after timings.