Archive for November 25, 2009

MySQL Cluster Community Survey

If you use MySQL Cluster or think that you might want to use it in the future (especially if you’ve thought “if only they’d add that feature then it would be ideal for my application”) then now’s your chance to influence the product roadmap.

All you have to do is follow this link to the MySQL Cluster Community Survey and give your input.

Please also mention this to anyone you know of who might have an interest in MySQL Cluster.

Thanks in advance for you help.





MySQL Cluster 7.0.9 GA binaries released

The GA binaries for MySQL Cluster 7.0.9 have been released – download them from http://dev.mysql.com/downloads/select.php?id=14

A summary of the changes can be found in the MySQL Cluster 7.0.9 Change Log





Software preview MySQL Scriptable Replication

Fig. 1 MySQL per-row replication filtering

Fig. 1 MySQL per-row replication filtering

A MySQL Software preview is available which allows you to write Lua scripts to control replication on a statement-by-statement basis. Note that this is prototype functionality and is not supported but feedback on its usefulness would be gratefully received.The final version would allow much greater functionality but this preview allows you to implement filters on either the master or slave to examine the statements being replicated and decide whether to continue processing each one or not.

After reading this article, you may be interested in trying this out for yourself and want to create your own script(s). You can get more information on the functionality and download the special version of MySQL from http://forge.mysql.com/wiki/ReplicationFeatures/ScriptableReplication

To understand how this feature works, you first need to understand the very basics about how MySQL replication works. Changes that are made to the ‘Master’ MySQL Server are written to a binary log. Any slave MySQL Servers that subscribe to this master are sent the data from the master’s binary log; the slave(s) then copy this data to their own relay log(s). The slave(s) will then work through all of the updates in their relay logs and apply them to their local database(s). The implementation is a little more complex when using MySQL Cluster as the master’s updates may come through multiple MySQL Servers or directly from an application through the NDB API but all of the changes will still make it into the binary log.

MySQL Replication supports both statement and row based replication (as well as mixed) but this software preview is restricted to statement based replication. As MySQL Cluster must use row based replication this preview cannot be used with Cluster but the final implementation should work with all storage engines.

As show in Fig. 1 there are 4 points where you can choose to filter statements being replicated:

  1. Before the update is written to the binary log
  2. After the update has been read from the binary log
  3. Before the update is written to the relay log
  4. After the update has been read from the relay log

The final 2 interest me most as it allows us to have multiple slaves which apply different filters – this article includes a worked example of how that could be exploited.

Fig. 2 Details for each filtering point

Fig. 2 Details for each filtering point

The filters are written as Lua scripts. The names of the script file, module name and function names vary depending on which of these filtering points is to be used. Fig. 2 shows these differences. In all cases, the scripts are stored in the following folder: “<mysql-base-directory>/ext/replication”.

This article creates 2 different scripts – one for each of 2 slave servers. In both cases the filter script is executed after an update is read from the relay log. One slave will discard any statement of the form “REPLACE INTO <table-name> SET sub_id = 401, …” by searching for the sub string “sub_id = X” where X is even while the second slave will discard any where X is odd. Any statement that doesn’t include this pattern will be allowed through.

Fig. 3 Implementation of odd/even sharded replication

Fig. 3 Implementation of odd/even sharded replication

If a script returns TRUE then the statement is discarded, if it returns FALSE then the replication process continues. Fig. 3 shows the architecture and pseudo code for the odd/even replication sharding.

 

 

 

 

 

 

The actual code for the two slaves is included here:

slave-odd: <mysql-base-directory>/ext/replication/relay_log.lua
function after_read(event)
  local m = event.query
  if m then
    id = string.match(m, "sub_id = (%d+)")
    if id then
      if id %2 == 0 then
        return true
      else
        return false
      end
    else
      id = string.match(m, "sub_id=(%d+)")
        if id then
          if id %2 == 0 then
            return true
          else
            return false
          end
       else
         return false
       end
    end
  else
    return false
  end
end
slave-even: <mysql-base-directory>/ext/replication/relay_log.lua
function after_read(event)
  local m = event.query
  if m then
    id = string.match(m, "sub_id = (%d+)")
    if id then
      if id %2 == 1 then
        return true
      else
        return false
      end
    else
      id = string.match(m, "sub_id=(%d+)")
        if id then
          if id %2 == 1 then
            return true
          else
            return false
          end
       else
         return false
       end
    end
  else
    return false
  end
end

Replication can then be set-up as normal as described in Setting up MySQL Asynchronous Replication for High Availability with the exception that we use 2 slaves rather than 1.

Once replication has been started on both of the slaves, the database and tables should be created; note that for some reason, the creation of the tables isn’t replicated to the slaves when using this preview load and so the tables actually need to be created 3 times:

mysql-master> CREATE DATABASE clusterdb; mysql-master> USE clusterdb; mysql-master> CREATE TABLE sys1 (code INT NOT NULL PRIMARY KEY, country VARCHAR (30)) engine=innodb; mysql-master> CREATE TABLE subs1 (sub_id INT NOT NULL PRIMARY KEY, code INT) engine=innodb;
mysql-slave-odd> USE clusterdb; mysql-slave-odd> CREATE TABLE sys1 (code INT NOT NULL PRIMARY KEY, country VARCHAR (30)) engine=innodb; mysql-slave-odd> create table subs1 (sub_id INT NOT NULL PRIMARY KEY, code INT) engine=innodb;
mysql-slave-even> USE clusterdb; mysql-slave-even> CREATE TABLE sys1 (code INT NOT NULL PRIMARY KEY, country VARCHAR (30)) engine=innodb; mysql-slave-even> CREATE TABLE subs1 (sub_id INT NOT NULL PRIMARY KEY, code INT) engine=innodb;

The data can then be added to the master and then the 2 slaves can be checked to validate that it behaved as expected:

mysql-master> REPLACE INTO sys1 SET area_code=33, country="France";
mysql-master> REPLACE INTO sys1 SET area_code=44, country="UK";
mysql-master> REPLACE INTO subs1 SET sub_id=401, code=44;
mysql-master> REPLACE INTO subs1 SET sub_id=402, code=33;
mysql-master> REPLACE INTO subs1 SET sub_id=976, code=33;
mysql-master> REPLACE INTO subs1 SET sub_id=981, code=44;
mysql-slave-odd> SELECT * FROM sys1;
+------+---------+
| code | country |
+------+---------+
|  33  | France  |
|  44  | UK      |
+------+---------+

mysql-slave-odd> SELECT * FROM subs1;
+--------+------+
| sub_id | code |
+--------+------+
|   401  | 44   |
|   981  | 44   |
+--------+------+
Fig. 4 Results of partitioned replication

Fig. 4 Results of partitioned replication

mysql-slave-even> SELECT * FROM sys1;
+------+---------+
| code | country |
+------+---------+
|  33  | France  |
|  44  | UK      |
+------+---------+
mysql-slave-even> SELECT * FROM subs1;
+--------+------+
| sub_id | code |
+--------+------+
|   402  | 33   |
|   976  | 33   |
+--------+------+

Fig. 4 illustrates this splitting of data between the 2 slaves – all rows from the system table are stored in both databases (as well as in the master) while the data in the subscriber table (and it would work for multiple subscriber tables too) are partitioned between the 2 databases – odd values in one, even in the other. Obviously, this could be extended to more slaves by changing the checks in the scripts.

As an illustration of how this example could be useful, all administrative data could be provisioned into and maintained by the master – both system and subscriber data. Each slave could then serve a subset of the subscribers, providing read-access to the administrative data andread/write access for the more volatile subscriber data (which is mastered on the ‘slave’). In this way, there can be a central point to manage the administrative data while being able to scale out to multiple, databases to provide maximum capacity and performance to the applications. For example, in a telco environment, you may filter rows by comparing a subscriber’s phone number to a set of area codes so that the local subscribers are accessed from the local database – minimising latency.

From a data integrity perspective, this approach is safe if (and only if) the partitioning rules ensures that all related rows are on the same slave (in our example, all rows from all tables for a particular subscriber will be on the same slave – so as long as we don’t need transactional consistency between different subscribers then this should be safe).

Fig. 5 Partioned replication for MySQL Cluster

Fig. 5 Partitioned replication for MySQL Cluster

As mentioned previously this software preview doesn’t work with MySQL Cluster but looking forward to when it does, the example could be extended by having each of the slave servers be part of the same Cluster. In this case, the partitioned data will be consolidated back into a single database (for this scenario, you would likely configure just one server to act as the slave for the system data). On the face of it, this would be a futile exercise but in cases where the performance bottlenecks on the throughput of a single slave server, this might be a way to horizontally scale the replication performance for applications which make massive numbers of database writes.





MySQL Cluster: Geographic Replication Deep-Dive webinar

I will be presenting a free Webinar on Geographic Replication for MySQL Cluster at 9:00 am (UK time) on Tuesday 24 November.
Multi-Master Replication for HA with MySQL Cluster

Multi-Master Replication for HA with MySQL Cluster

MySQL Cluster has been deployed into some of the most demanding web, telecoms and enterprise /
government workloads, supporting 99.999% availability with real time performance and linear write scalability.

You can register on-line here.

Tune into this webinar where you can hear from the MySQL Cluster product management team provide a detailed “deep dive” into one of MySQL Cluster’s key capabilities – Geographic Replication.

In this session, you will learn how using Geographic Replication enables your applications to:

  • Achieve higher levels of availability within a data center or across a WAN
  • Locate data closer to users, providing lower latency access
  • Replicate to other MySQL storage engines for complex data analysis and reporting of real time data
  • Gow to get started with Geographic Replication

Tuesday, November 24, 2009: 10:00 Central European time

  • Tue, Nov 24:  09:00 Western European time
  • Tue, Nov 24:  11:00 Eastern European time

The presentation will be approximately 1 hour long, including on-line Q&A.





MySQL Cluster fault tolerance – impact of deployment decisions

Typical management configuration

Fig 1. Typical management configuration

MySQL Cluster is designed to be a High Availability, Fault Tolerant database where no single failure results in any loss of service.

This is however dependent on how the user chooses to architect the configuration – in terms of which nodes are placed on which physical hosts, and which physical resources each physical host is dependent on (for example if the two blades containing the data nodes making up a particular node group are cooled by the same fan then the failure of that fan could result in the loss of the whole database).

Of course, there’s always the possibility of an entire data center being lost due to earthquake, sabotage etc. and so for a fully available system, you should consider using asynchronous replication to a geographically remote Cluster.

Fig 1. illustrates a typical small configuration with one or more data nodes from different node groups being stored on two different physical hosts and a management node on an independent machines (probably co-located with other applications as its resource requirements are minimal. If any single node (process) or physical host is lost then service can continue.

The basics of MySQL Cluster fault tolerance

Data held within MySQL Cluster is partitioned, with each node group being responsible for 2 or more fragments. All of the data nodes responsible for the same fragments form a Node Group (NG). If configured correctly, any single data node can be lost and the other data nodes within its node group will continue to provide service.

The management node (ndb_mgmd process) is required when adding nodes to the cluster – either when it was initially configured or when a node has been lost and restarted.

A heart-beat protocol is used between the data nodes in order to identify when a node has been lost. In many cases, the community of surviving data nodes can reconfigure themselves but in some cases they need help from the management node – much of this article focuses on how to identify these cases so that you can decide what level of redundancy is required for the management node.

Goals of the algorithm

The algorithm used by MySQL Cluster has 2 priorities (in order):

  1. Prevent database inconsistencies through “split brain” syndrome
  2. Keep the database up and running, serving the application

Split brain would occur if 2 data nodes within a node group lost contact with each other and independently decided that they should be master for the fragments controlled by their node group. This could lead to them independently applying conflicting changes to the data – making it very hard to recover the database (which would include undoing the changes that the application believes to have been safely committed). Note that a particular node doesn’t know whether its peer(s) has crashed or if it has just lost its connection to it. If the algorithm is not confident of avoiding a split brain situation then all of the data nodes are shut down – obviously that isn’t an ideal result and so it’s important to understand how to configure your cluster so that doesn’t happen.

The algorithm

If all of the data nodes making up a node group are lost then the cluster shuts down.

When data nodes lose contact with each other (could be failure of a network connection, process or host) then all of the data nodes that can still contact each other form a new community. Each community must decide whether its data nodes should stay up or shut down:

  • If the community doesn’t contain at least 1 data node from each node group then it is not viable and its data nodes should shut down.
  • If this community is viable and it can determine that it contains enough of the data nodes such that there can be no other viable community out there (one with at least 1 data node from each node group) then it will decide (by itself) to keep up all of its data nodes.
  • If the community is viable but there is the possibility of another viable community then it contacts the arbitrator which decides which amongst all viable communities should be allowed to stay up. If the community can not connect to the arbitrator then its data nodes shut down.

In this way, at most one community of data nodes will survive and there is no chance of split brain.

The arbitrator will typically run on a management node. As you’ll from the algorithm and the following examples, the cluster can sometimes survive a failure without needing a running management node but sometimes it can’t. In most of the examples, a single management node is used but you may well decide to have a second for redundacy so that more multiple-point-of-failures can be handled. At any point in time, just one of the management nodes would act as the active arbitrator, if the active one is lost then a majority community of data nodes can hand control over to the other management node.

Note that the management node consumes very little resource and so can be co-located with other functions/applications but as you’ll see from the examples, you would normally avoid running it on the same host as a data node.

Example 1: Simplest cluster – loss of management node followed by a data node

Fig 2. Loss of management node followed by data node in a simple Cluster

Fig 2. Loss of management node followed by data node in a simple Cluster

Fig 2. shows a very simple cluster with 3 hosts, the management node running on 192.168.0.19 and then a single node group (NG1) made up of 2 data nodes split between 192.168.0.3 and 192.168.0.4. In this scenario, the management node is ‘lost’ (could be process, host or network failure) followed by one of the data nodes.

The surviving data node forms a community of 1. As it can’t know whether the other data node from NG1 is still viable and it can’t contact the arbitrator (the management node) it must shut itself down and so service is lost.

Note that the 2 nodes could be lost instantaneously or the management node might be lost first followed some time later by a data node.

To provide a truly HA solution there are 2 steps:

  1. Ensure that there is no single point of failure that could result in 192.168.0.19 and either of the other 2 hosts being lost.
  2. Run a second management node on a 4th host that can take over as arbitrator if 192.168.0.19 is lost

Example 2: Half of data nodes isolated but management node available

Fig 3. Half of data nodes lost

Fig 3. Half of data nodes lost

In Fig 3. host 192.168.0.3 and its 2 data nodes remains up and running but becomes isolated from the management node and the other data nodes. 2 communities of connected data nodes are formed. As each of these communities are viable but recognize that there could be another viable surviving communitycontain a data node from each node group, they must defer to the management node. As192.168.0.3 has lost it’s connection to the management node, the community of data nodes hosted there shut themselves down. The community hosted on 192.168.0.4 can contact the management node which as it’s the only community it can see, allows its data nodes to stay up and so service is maintained.

.

.

.

Example 3: Half of data nodes isolated and management node lost

Fig 4. Data nodes isolated following loss of management node

Fig 4. Data nodes isolated following loss of management node

The scenario shown in Fig 4. builds upon Example 2 but in the case, the management node is lost before one of the data node hosts loses its connection to the other.

In this case, both communities defer to the management node but as that has been lost they both shut themselves down and service is lost.

Refer to Example 1 to see what steps could be taken to increase the tolerance to multiple failures.

.

.

.

.

.

Example 4: Management node co-located with data nodes

Fig 4. Management node co-located with data nodes

Fig 5. Management node co-located with data nodes

Fig. 4 shows a common, apparent short-cut that people may take, with just 2 hosts available hosting the management node of the same machine as some of the data nodes. In this example, the connection between the 2 hosts is lost. As each community is viable they each attempt to contact the arbitrator – the data nodes on 192.168.0.3 are allowed to stay up while those on 192.168.0.4 shut down as they can’t contact the management node.

However this configuration is inherently unsafe, if 192.168.0.3 failed then there would be a complete loss of service as the data nodes on 192.168.0.4 would form a viable community but be unable to confirm that they represent the only viable community.

It would be tempting to make this more robust by running a second management node on 192.168.0.4 – in that case, when each host becomes isolated from the other, the data nodes local to the management node that’s the current arbitrator will stay up however if the entire host on which the active arbitrator failed then you would again lose service. The management node must be run on a 3rd host for a fault-tolerant solution.

Example 5: Odd number of data node hosts

Fig 6. Isolation of data node host - majority survive

Fig 6. Isolation of data node host - majority survive

Fig 6. shows a configuration running the management node on the same host as some of the data nodes does provide a robust solution.

In this example 192.168.0.3 becomes isolated its data nodes form a community of 2 which doesn’t include a data node from NG2 and so they shut themselves down. 192.168.0.4 and 192.168.0.19 are still connected and so they form a commiunity of 4 data nodes; they recognize that the community holds all data nodes from NG2 and so there can be no other viable community and so they are kept up without having to defer to the arbitrator.

Note that as there was no need to consult the management node, service would be maintained even if it was the machine hosting the management node that became isolated.

.

.

Example 6: Loss of all data nodes in a node group

Fig 7. All data nodes in a node-group lost

Fig 7. All data nodes in a node-group lost

Fig 7. illustrates the case where there are multiple data node failures such that all of the data nodes making up a node group are lost. In this scenario, the cluster has to shut down as that node group is no longer viable and the cluster would no longer be able to provide access to all data.

.

.

.

.

.

.

Example 7: Loss of multiple data nodes from different node groups

Fig 8. Data nodes from different node-groups lost

Fig 8. Data nodes from different node-groups lost

Fig 8. is similar to Example 6 but in this case, there is still a surviving data node from each node group. The surviving data node from NG1 forms a community with the one from NG2. As there could be another viable community (containing a data node from NG1 and NG2), they defer to the management node and as they form the only viable community they’re allowed to stay up.

.

.

.

.

.

.

Example 8: Classic, fully robust configuration

Fig 8. Robust configuration

Fig 9. Robust configuration

Fig 9. shows the classic, robust configuration. 2 independent machines both host management nodes. These are in turn connected by 2 independent networks to each of the data nodes (which are in turn all connected to each other via duplicated network connections).





MySQL Cluster: Geographic Replication Deep-Dive – webinar replay

The recording and slides from this week’s MySQL Cluster Geographic Replication webinar is now available – download replay & slides. As always the material is free.

MySQL Cluster Replication

MySQL Cluster Replication

For those that missed the invitation, here is a description of the content…

MySQL Cluster has been deployed into some of the most demanding web, telecoms and enterprise / government workloads, supporting 99.999% availability with real time performance and linear write scalability.

Tune into this webinar where you can hear from the MySQL Cluster product management team provide a detailed “deep dive” into one of MySQL Cluster’s key capabilities – Geographic Replication.

In this session, you will learn how using Geographic Replication enables your applications to:

  • achieve higher levels of availability within a data center or across a WAN
  • locate data closer to users, providing lower latency access
  • replicate to other MySQL storage engines for complex data analysis and reporting of real time data
  • how to get started with Geographic Replication

Presented by Andrew Morgan (Senior Product Manager, MySQL Cluster) and Matthew Keep (MySQL Cluster Product Management)





MySQL Cluster Restarts Get Faster

MySQL Cluster 6.3.28b and 7.0.9b contain optmizations which can greatly reduce the time taken for data nodes to restart – this includes restarting a single node, performing a rolling restart or a full system restart.

The benefits you see will depend on many factors including including the size of your database and the frequency, size and complexity of your transactions. As an experiment, I re-ran some data node restart timings from an earlier post (http://www.clusterdb.com/mysql-cluster/mysql-cluster-data-node-restart-times/)

The headline figure from my results is that for a 6 Gbyte database, with modest traffic I saw a 2.2x improvement. This is using very simple transactions and so you may get a much better improvement – 70x is being seen by some!!! The best news is that the slower your current restarts, the higher the benefit you can expect to see from the optimizations.

What follows is a reworking of that original post.

MySQL Cluster Data Node restart times

Restarts are required for certain, infrequent maintenance activities. Note that there is no loss of service while a single node restarts.

When a data node restarts, it first attempts to load the data into memory from the local log files and then it will catch up with any subsequent changes by retrieveing them from the surviving node(s) in its node group.

Based on this, you would expect the time taken to restart a data node to be influenced by:

  1. The amount of data that was stored on the data node before the restart
  2. Rate of updates being made to the data during the restart
  3. Network performance (assuming the data is being updated during recovery)

The times will also be influenced bycertain configuration parameters, performance of the host machine and whether the multi-threaded data node (ndbmtd) is being used.

To provide some insight into how these factors impact restart times, tests have been performed where the following factors are varied:

  • Database size (Each Gbyte is made up of 1,000,000 tuples in each of 5 tables)
  • Whether traffic is running or not (a single thread using the NDB API to send in up to 2K tps (10K updates/second))
  • Whether the 2 data nodes in the node group are on the same host or separated by a Gbit Ethernet network

The following factors are kept constant:

  • Physical hosts: Intel Core 2 Quad Q8200@2.33 GHz; 7.7 GBytes RAM
  • NoOfFragmentLogFiles: 300
  • MaxNoOfExecutionThreads=4

The optimizations introduced in MySQL Cluster 6.3.28a and 7.0.9a have reduced these times – especially when write transactions are running before, during and after the node restart is triggered.

Here are the observed results:

Improved Data Node Restart Times

Improved Data Node Restart Times

For comparrison purposes, these are the results before the optimizations were introduced:

Data Node restart times

Old Data Node restart times

There are a couple of things to note from these results:

  • The optmizations greatly improve the results when update traffic is running
  • Using the multi-threaded data node (ndbmtd) greatly improves the restart time (in this case, 4 threads were available, improvements could be even greater on an 8 core/thread system)
  • Results become less predictable when heavy update traffic is being processed (in this case, up to 10,000 updated rows/second on a single node group). In the tests, no attempt was made to regulate this traffic and the test application was run on the same host as the one of the data nodes. Changes to the rate of updates will vary how long it takes for the restarting node to catch-up as it’s a moving target.

There is another recovery/restart scenario. The measurements shown above assumed that the file system on the data node’s host was intact and could be used to recover the in-memory copy – if that were not the case (or the data nodes were restarted with the “initial” option) then all of the data would have to be recovered from the surviving data node(s) in the same node group. As a comparison restarting a 6 Gbyte data node with the “initial” option took 20 minutes compared to 8.5 minutes without it (ndbmtd, over Gbit n/w).





MySQL Cluster 6.3.27a binaries now available

The binary version for MySQL Cluster 6.3.27a has now been made available at http://dev.mysql.com/downloads/select.php?id=14&display=current&previous_ga=1#downloads

A description of all of the changes (fixes) that have gone into MySQL Cluster 6.3.27a (compared to 6.3.26) can be found in the MySQL Cluster 6.3.27a Change Log.





MySQL Cluster 7.0.9a source release now available – replaces 7.0.9

This release has now been replaced with 7.0.9b

The source version for MySQL Cluster 7.0.9a has now been made available at ftp://ftp.mysql.com/pub/mysql/download/cluster_telco/mysql-5.1.39-ndb-7.0.9a/

This replaces MySQL Cluster 7.0.9.

You can either wait for the binaries to be released or if you’re in a rush then you can find instructions on building the binaries for yourself in the earlier article: “MySQL Cluster 7.0.7 source released“.

A description of all of the changes (fixes) that have gone into MySQL Cluster 7.0.9a (compared to 7.0.8a) can be found in the MySQL Cluster MySQL Cluster 7.0.9a Change Log.





MySQL Cluster 6.3.28a source release now available – replaces 6.3.28

This release has now been replaced with 6.3.28b.

The source version for MySQL Cluster 6.3.28a has now been made available at ftp://ftp.mysql.com/pub/mysql/download/cluster_telco/mysql-5.1.39-ndb-6.3.28a/

This replaces MySQL Cluster 7.3.28 which has been withdrawn.

You can either wait for the binaries to be released or if you’re in a rush then you can find instructions on building the binaries for yourself in the earlier article: “MySQL Cluster 7.0.7 source released“.

A description of all of the changes (fixes) that have gone into MySQL Cluster 6.3.28a (compared to 6.3.27) can be found in the MySQL Cluster MySQL Cluster 6.3.28a Change Log.