Archive for April 29, 2010

Trying out MySQL Push-Down-Join (SPJ) preview

At the 2010 MySQL User Conference, Jonas Oreland presented the work he’s been doing to improve the performance of joins when using MySQL Cluster – the slides are available for download. While not ready for production systems, a preview version is available for you to try out. The purpose of this post is to step through testing an example query and to present the results (SPOILER: in one configuration, I got an almost 50x speedup!).

SPJ is by no means complete and there are a number of constraints as to which queries benefit (and I’ll give an example of one that didn’t). For details of the current (April 2010) software and limitations, check out Jonas’s slides and then keep up to date by following his blog.

We’re keen to get feedback. Please feel free to post results as comments on this blog, but also make sure that you send them to spj-feedback@sun.com, describing your schema, the query or queries you tested, the output from EXPLAIN and your before and after timings.

Joins in MySQL Cluster are implemented as nested-loop joins within the MySQL Server; this can be inefficient as it results in many trips to the data nodes to fetch the required data. SPJ works by pushing the join (actually a spec of the needed data) down into the data nodes where the data can be collected and sent back up to the MySQL Server much more efficiently.
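To make the round-trip argument concrete, here is a toy Python model. It is only a sketch: the dictionaries stand in for the three tables used later in this post, the row contents are made up, and the counter assumes one trip per scan and one per primary-key lookup, which ignores the batching and parallelism of the real NDB API. The function names are hypothetical and are not part of MySQL Cluster.

```python
# Toy model, not NDB code: dicts stand in for tables and `trips` counts
# simulated MySQL Server <-> data node round trips (an assumption: one trip
# per scan and one per primary-key lookup, with no batching).


def make_tables(n):
    """Build made-up rows; every third subscriber has country 44."""
    subs = {i: {"dept": i, "country": 44 if i % 3 == 0 else 1} for i in range(n)}
    department = {i: {"name": i} for i in range(n)}
    roles = {i: {"role": "role%d" % i} for i in range(n)}
    return subs, department, roles


def nested_loop_join(subs, department, roles):
    """Join driven from the MySQL Server: one lookup per row per table."""
    trips = 1  # scan subs, filtering on country at the data nodes
    count = 0
    for sub in subs.values():
        if sub["country"] != 44:
            continue
        trips += 1  # fetch the matching department row by primary key
        dept = department.get(sub["dept"])
        if dept is None:
            continue
        trips += 1  # fetch the matching roles row by primary key
        if dept["name"] in roles:
            count += 1
    return count, trips


def pushed_join(subs, department, roles):
    """Join described once and evaluated where the data lives."""
    trips = 1  # a single request carries the description of the whole join
    count = 0
    for sub in subs.values():
        dept = department.get(sub["dept"])
        if sub["country"] == 44 and dept is not None and dept["name"] in roles:
            count += 1
    return count, trips


if __name__ == "__main__":
    tables = make_tables(9)
    print(nested_loop_join(*tables))  # (3, 7)
    print(pushed_join(*tables))       # (3, 1)
```

With 9 rows the model needs 7 simulated trips for the nested loop against 1 for the pushed join, and the gap grows linearly with the number of matching rows.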

For my tests, I used 2 different configurations. In both cases there are 2 data nodes running on 2 physical hosts. In the first configuration the MySQL Server resides on one of those 2 hosts. In the second configuration, the MySQL Server is moved to a virtual machine running on a 3rd host.

Setting up the Cluster

On each of the 3 hosts, I downloaded the software from ftp://ftp.mysql.com/pub/mysql/download/cluster_telco/mysql-5.1.44-ndb-7.1.3-spj-preview/ and then compiled and installed it. If you’re not comfortable with that, you can find instructions in this earlier post or, if you’re used to the tools from severalnines, check out the SPJ instructions on Johan’s blog.

Create the schema

The 3 tables I used can be created with these commands from the mysql client:

mysql> create database clusterdb; use clusterdb;
mysql> create table subs (sub_id int not null primary key,
dept int,country int) engine=ndb;
mysql> create table department (id int not null primary key,
name int) engine=ndb;
mysql> create table roles (dept int not null primary key,
role varchar (30)) engine=ndb;

Each of these tables is then populated with 100,000 rows (the files can be downloaded from here).

Once extracted, the data should be loaded into the database:

mysql> use clusterdb;
mysql> load data local infile "/home/billy/Dropbox/LINUX/projects/SPJ/subs.csv"
replace into table subs fields terminated by ',';
mysql> load data local infile "/home/billy/Dropbox/LINUX/projects/SPJ/dept.csv"
replace into table department fields terminated by ',';
mysql> load data local infile "/home/billy/Dropbox/LINUX/projects/SPJ/roles.csv"
replace into table roles fields terminated by ',';
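If the original CSV files are not to hand, stand-in data with the same shape can be generated. The Python sketch below is an assumption about what such data might look like, not a reproduction of the files used for the timings in this post: the `write_stand_in_data` name, the key values and the choice of country 44 on every third row are all made up for illustration, so counts and timings will likely differ.

```python
import csv
import os


def write_stand_in_data(directory, rows=100000):
    """Write subs.csv, dept.csv and roles.csv, each with `rows` rows.

    The contents are invented for illustration; they are not the original files.
    """
    os.makedirs(directory, exist_ok=True)
    ids = range(1, rows + 1)
    tables = {
        # subs(sub_id, dept, country)
        "subs.csv": ((i, i, 44 if i % 3 == 0 else 1) for i in ids),
        # department(id, name): name is an int that joins to roles.dept
        "dept.csv": ((i, i) for i in ids),
        # roles(dept, role)
        "roles.csv": ((i, "role_%d" % i) for i in ids),
    }
    for filename, data in tables.items():
        path = os.path.join(directory, filename)
        with open(path, "w", newline="") as handle:
            # '\n' line endings match the default expected by LOAD DATA
            csv.writer(handle, lineterminator="\n").writerows(data)
```

Calling `write_stand_in_data("spj_data")` writes the three files into a directory named `spj_data` (a hypothetical name); the paths in the `load data` commands would then need to point there.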

Running the tests (Config 1 – local mysqld)

To get a baseline, ensure that SPJ is turned off:

mysql> set ndb_join_pushdown=off;

and then get the output from EXPLAIN:

mysql> EXPLAIN SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND department.id=subs.dept AND roles.dept=department.name;
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+-----------------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref                       | rows   | Extra                             |
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+-----------------------------------+
|  1 | SIMPLE      | subs       | ALL    | NULL          | NULL    | NULL    | NULL                      | 100000 | Using where with pushed condition |
|  1 | SIMPLE      | department | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.subs.dept       |      1 |                                   |
|  1 | SIMPLE      | roles      | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.department.name |      1 |                                   |
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+-----------------------------------+

and then execute the query:

mysql> SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND department.id=subs.dept AND roles.dept=department.name;
+----------+
| count(*) |
+----------+
|    33334 |
+----------+
1 row in set (9.08 sec)

Now to see the benefits of SPJ, turn it on:

mysql> set ndb_join_pushdown=on;

Check the output from EXPLAIN again:

mysql> EXPLAIN SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND department.id=subs.dept AND roles.dept=department.name;
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+--------------------------------------------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref                       | rows   | Extra                                                        |
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+--------------------------------------------------------------+
|  1 | SIMPLE      | subs       | ALL    | NULL          | NULL    | NULL    | NULL                      | 100000 | Parent of 3 pushed join@1; Using where with pushed condition |
|  1 | SIMPLE      | department | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.subs.dept       |      1 | Child of pushed join@1                                       |
|  1 | SIMPLE      | roles      | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.department.name |      1 | Child of pushed join@1                                       |
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+--------------------------------------------------------------+

and then re-run the query:

mysql> SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND department.id=subs.dept AND roles.dept=department.name;
+----------+
| count(*) |
+----------+
|    33334 |
+----------+
1 row in set (0.77 sec)

In this test, the query ran almost 12x faster!

Running the tests (Config 2 – separate mysqld)

The test was then repeated with the MySQL Server running within a VM on a 3rd host – the purpose of this is to represent the more typical configuration, where the MySQL Servers must communicate with the data nodes over the network. As the purpose of SPJ is to reduce the messaging between the MySQL Server and the data nodes, it’s reasonable to expect the benefits from SPJ to be more pronounced with this configuration.

Again, to get a baseline, ensure that SPJ is turned off:

mysql> set ndb_join_pushdown=off;

and then get the output from EXPLAIN:

mysql> EXPLAIN SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND department.id=subs.dept AND roles.dept=department.name;
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+-----------------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref                       | rows   | Extra                             |
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+-----------------------------------+
|  1 | SIMPLE      | subs       | ALL    | NULL          | NULL    | NULL    | NULL                      | 100000 | Using where with pushed condition |
|  1 | SIMPLE      | department | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.subs.dept       |      1 |                                   |
|  1 | SIMPLE      | roles      | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.department.name |      1 |                                   |
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+-----------------------------------+

and then execute the query:

mysql> SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND department.id=subs.dept AND roles.dept=department.name;
+----------+
| count(*) |
+----------+
|    33334 |
+----------+
1 row in set (1 min 2.12 sec)

Now to see the benefits of SPJ, turn it back on:

mysql> set ndb_join_pushdown=on;

Check the output from EXPLAIN again:

mysql> EXPLAIN SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND department.id=subs.dept AND roles.dept=department.name;
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+--------------------------------------------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref                       | rows   | Extra                                                        |
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+--------------------------------------------------------------+
|  1 | SIMPLE      | subs       | ALL    | NULL          | NULL    | NULL    | NULL                      | 100000 | Parent of 3 pushed join@1; Using where with pushed condition |
|  1 | SIMPLE      | department | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.subs.dept       |      1 | Child of pushed join@1                                       |
|  1 | SIMPLE      | roles      | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.department.name |      1 | Child of pushed join@1                                       |
+----+-------------+------------+--------+---------------+---------+---------+---------------------------+--------+--------------------------------------------------------------+

and then re-run the query:

mysql> SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND department.id=subs.dept AND roles.dept=department.name;
+----------+
| count(*) |
+----------+
|    33334 |
+----------+
1 row in set (1.26 sec)

In this test, the query ran almost 50x faster!
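The speedup figures quoted for the two configurations are simply the baseline time divided by the pushed-down time. Here is a small Python sketch of that arithmetic, using the timings reported in this post; `to_seconds` and `speedup` are hypothetical helper names, and the parser only handles the `N min S sec` and `S sec` forms that appear here.

```python
import re


def to_seconds(timing):
    """Convert a mysql client timing such as '1 min 2.12 sec' to seconds."""
    match = re.fullmatch(r"(?:(\d+) min )?(\d+(?:\.\d+)?) sec", timing)
    if match is None:
        raise ValueError("unrecognised timing: %r" % timing)
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def speedup(before, after):
    """Ratio of the time without SPJ to the time with SPJ."""
    return to_seconds(before) / to_seconds(after)


if __name__ == "__main__":
    print("local mysqld:    %.1fx" % speedup("9.08 sec", "0.77 sec"))
    print("separate mysqld: %.1fx" % speedup("1 min 2.12 sec", "1.26 sec"))
```

This gives roughly 11.8x for the local mysqld configuration and 49.3x for the separate mysqld configuration.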

Do all queries benefit from SPJ?

No, and that’s why it’s especially important to get feedback from real users with representative schemas, so that SPJ can be extended to cover as many of the significant use cases as possible.

As an example, I saw no speedup at all with the following query (using the local mysqld configuration):

mysql> set ndb_join_pushdown=off;

mysql> EXPLAIN SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND subs.dept=department.name AND department.id=roles.dept;
+----+-------------+------------+--------+---------------+---------+---------+-------------------------+--------+-----------------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref                     | rows   | Extra                             |
+----+-------------+------------+--------+---------------+---------+---------+-------------------------+--------+-----------------------------------+
|  1 | SIMPLE      | subs       | ALL    | NULL          | NULL    | NULL    | NULL                    | 100000 | Using where with pushed condition |
|  1 | SIMPLE      | department | ALL    | PRIMARY       | NULL    | NULL    | NULL                    | 100000 | Using where; Using join buffer    |
|  1 | SIMPLE      | roles      | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.department.id |      1 |                                   |
+----+-------------+------------+--------+---------------+---------+---------+-------------------------+--------+-----------------------------------+

mysql> SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND subs.dept=department.name AND department.id=roles.dept;
+----------+
| count(*) |
+----------+
|    33334 |
+----------+
1 row in set (3 min 56.26 sec)

mysql> set ndb_join_pushdown=on;

mysql> EXPLAIN SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND subs.dept=department.name AND department.id=roles.dept;
+----+-------------+------------+--------+---------------+---------+---------+-------------------------+--------+-----------------------------------------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref                     | rows   | Extra                                                     |
+----+-------------+------------+--------+---------------+---------+---------+-------------------------+--------+-----------------------------------------------------------+
|  1 | SIMPLE      | subs       | ALL    | NULL          | NULL    | NULL    | NULL                    | 100000 | Using where with pushed condition                         |
|  1 | SIMPLE      | department | ALL    | PRIMARY       | NULL    | NULL    | NULL                    | 100000 | Parent of 2 pushed join@1; Using where; Using join buffer |
|  1 | SIMPLE      | roles      | eq_ref | PRIMARY       | PRIMARY | 4       | clusterdb.department.id |      1 | Child of pushed join@1                                    |
+----+-------------+------------+--------+---------------+---------+---------+-------------------------+--------+-----------------------------------------------------------+

mysql> SELECT count(*) FROM subs, department, roles WHERE subs.country=44 AND subs.dept=department.name AND department.id=roles.dept;
+----------+
| count(*) |
+----------+
|    33334 |
+----------+
1 row in set (3 min 57.76 sec)
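The Extra column of EXPLAIN is the quickest way to see which tables took part in a pushed join. As a sketch, the Python below parses mysql client table output and lists those tables; `parse_explain` and `pushed_tables` are hypothetical helpers, and the sample is the last EXPLAIN above trimmed to four columns. For this query only department and roles are reported as part of the pushed join, while department is still read with a full scan and a join buffer. That is one plausible reading of why the timing did not improve, not a confirmed diagnosis.

```python
SAMPLE = """
+----+------------+--------+-----------------------------------------------------------+
| id | table      | type   | Extra                                                     |
+----+------------+--------+-----------------------------------------------------------+
|  1 | subs       | ALL    | Using where with pushed condition                         |
|  1 | department | ALL    | Parent of 2 pushed join@1; Using where; Using join buffer |
|  1 | roles      | eq_ref | Child of pushed join@1                                    |
+----+------------+--------+-----------------------------------------------------------+
"""


def parse_explain(text):
    """Turn mysql client table output into a list of dicts keyed by column."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # Keep header and data rows; skip the +----+ border lines.
    rows = [line for line in lines if line.startswith("|")]
    cells = [[cell.strip() for cell in row.strip("|").split("|")] for row in rows]
    header, body = cells[0], cells[1:]
    return [dict(zip(header, values)) for values in body]


def pushed_tables(text):
    """Names of tables whose Extra column mentions a pushed join."""
    # "pushed condition" alone does not count; only "pushed join" does.
    return [row["table"] for row in parse_explain(text)
            if "pushed join" in row["Extra"]]


if __name__ == "__main__":
    print(pushed_tables(SAMPLE))  # ['department', 'roles']
```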




Free webinar – learn about MySQL Cluster 7.1

MySQL Cluster 7.1 was declared GA earlier this month and today (29 April) you have the chance to learn all about it by registering for this free webinar.

At blazing speed we will cover the most important features of MySQL Cluster 7.1: NDB$INFO, MySQL Cluster Connector/Java and other features that push the limits of MySQL Cluster into new workloads and communities.

NDB$INFO presents real-time usage statistics from the MySQL Cluster data nodes as a series of SQL tables, enabling developers and administrators to monitor database performance and optimize their applications.

Designed for Java developers, the MySQL Cluster Connector for Java implements an easy-to-use and high performance native Java interface and OpenJPA plug-in that maps Java classes to tables stored in the MySQL Cluster database.

It’s worth registering even if you can’t attend, as you should then receive a link to the replay and the charts.

It starts at 9:00 am Pacific / 5:00 pm UK / 6:00 pm CET.





MySQL Cluster 7.1 is GA

MySQL Cluster 7.1 has been declared GA – including MySQL Cluster Connector for Java and MySQL Cluster Manager – see http://www.mysql.com/products/database/cluster/ for details.





Charts from LDAP Con on LDAP access to MySQL Cluster

At last year’s LDAP-Con event, Ludo from OpenDS and Howard from OpenLDAP presented on the work that they’d done on using MySQL Cluster as the scalable, real-time data store for LDAP directories (going directly to the NDB API rather than using SQL). Symas now provide their implementation (back-ndb) for OpenLDAP.

You can view the charts at http://www.mysql.com/customers/view/?id=1041