Tuesday, January 29, 2013

MongoDB EC2 Deployment

MongoDB it's part of the NoSQL ecosystem and is presented as scalable, high-performance, auto sharding database. A full list of features that MongoDB offers can be found at mongodb.org.

In this post I'll explain how MongoDB can be deployed into the cloud - EC2 specifically in order to support a three layers architecture for a web application. Before I go into technical details and show how it can be done we'll have to understand why things are done that way, since these days there are many ways to achieve the same thing. Will wear different hats and start with being the architect. First lets' establish who is involved into the whole process to build a web application. Obviously there is somebody that will pay for the application - will call this entity - the business. The next entity will be developers who will actually write the code and the last(but not the least) will be the infrastructure people the admins (the architect belongs to this group). All these three entities can be combined into a single physical person or spread apart different i departments into an enterprise but the roles remain the same.

Now that we know who is involved let's see what does the business want. Well that's quite simple and usually spans from a single line as I want an application that is resilient to failure and always responsive to a full business case that has all the fine grained details. The more details the better but is not necessarily responding to the question I have in mind. My real question is how popular is going to be the app, this will give an estimate on what sort of traffic you will get, based on the traffic volume and hopefully pattern, with this you can determine quite a lot - from sizing the environment to how much it will cost to operate(remember in EC2 there is only OPEX cost). In my experience the business will not have any more input into this equation and that is just fine. So the next step will be to start looking at how developers think about it, would be a 'real time', how many reads and writes will be done from the web servers to the database, how all pieces will fall into place, etc.

Will start building on the following premises about the application

  • has to be resilient to failures (business's requirement)
  • has to be responsive all the time (business's requirement)
  • will need to support an initial high volume of users with the option to grow (business's requirement)
  • balanced as 70% reads and 30% writes for the database traffic(developers predict)
  • to be cost effective (this is always relative to the business's budget)
  • because of the above (cost) constraint the business agree that if a catastrophic failures happens into a region will be ok to have downtime but not ok to not be able to bring the site up somewhere else in a manner of a few hours.

At this point we have enough details to start putting all the pieces together. Will not mention the load balancer (piece no.1) and the application servers (piece no.2) into the three layers architecture web app. The focus is going to be the database (piece no.3) - MongoDB.

Starting at the bottom the smallest part is an individual server (instance), will explain how this can be made resilient to failure. Into ec2 the instances are into a flat network (not talking about VPC) and the storage is divided into two types.

  • ephemeral or instance storage - this is the disk space you have on the local hypervisor that hosts your instance and will be destroy just after you terminated the instance
  • network block storage - EBS volumes - which will be resilient to terminations of your instance

Obviously you will not store the database's data on the ephemeral storage so the only real option will be the EBS (network storage). The EBS come in two flavors these days - with provision IOPS and normal. The difference between the two, the first will have a guarantee performance and the second is best effort with a minimum (which is quite low) of performance. A consequence of this is that the provision IOPS is quite expensive compared with a normal EBS. Since we do have a constraint on cost will have to look into using the normal EBS but there is hope - you can group a number of volumes and use some of the Linux tools as LVM or Raid to stripe or mirror. So will attach 10 EBS volumes to each server that acts as a MongoDB database (you can use less than that of course but I'll do a Raid 10).

        # you will need to have your ebs volumes attached to the server
        mdadm --create --verbose /dev/md0 --level=10 --raid-devices=8 /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq
        # now create a file system
        mkfs.xfs /dev/md0
        #mount the drive
        mount /dev/md0 /mnt/mongo/data
        # I said 10 and there are 8 ...
        # The rest of the 2 volumes are used for Journaling - threfore if Journaling is enabled will not affect the data

With this done will look onto how MongoDB will offer resilience to failure. The solutions that are offered are Replica Sets and Sharding more to say partitioning at the database layer.

MongoDB Replica Sets - what is it and how it works. The idea is to have a set of database group together part of a replica set which will replicate between them the data asynchronous . Within the replica set there is a PRIMARY and a number of SECONDARY databases. Who is the PRIMARY is established by a process called voting. How a database server will become PRIMARY is based on different criteria, the important thing about it is that is the only database server that accepts write into the replica set. All other servers part of the replica set will just accept reads. An other very important factor is that a replica set is considered healthy only if it has 51% of the capacity available. That is from 3 servers you can loose only 1 - hence you just lost 33.3% and still have 66.6% up. If you have 4 servers and you loose 2, guess what ... well you just lost 50% of your capacity and the replica set is not healthy - remember you need to have 51% available. Knowing this is obvious that using odd numbers into a replica set is the preferred choice. There are workarounds for this where you can have servers not participating in voting, have arbiters (special servers that only participate in voting) and you can as well give more weight to a server into the voting process.

Why all this happens you may wonder - well there is the CAP theorem and MongoDB will choose CA from it - this is how 10Gen - makers of MongoDB decided to design the product and we have to live with it. So from the CAP MongoDB will choose CA - you can read why and how at Consistency and Availability at MongDB.

Now let's pause for a minute and see how the overall database cluster will look with a Replica Set:


However having all instances into one single region doesn't look too good in case of a total failure on that region is always better to have an instance outside it - in this case I choose US-WEST - California.


This is how we are doing so far in respect to initial requirements:

  • has to be resilient to failures - Replica Set will provide that
  • if a catastrophic failures happens into a region will be ok to have downtime but not ok to not be able to bring the site up somewhere else in a manner of a few hours. Having a member of the Replica Set into US-WEST will fulfill it.

Well how about the rest of the requirements ?!

From the infrastructure point of view there is only one more requirement that you can satisfy - the place to grow. This can be solved initially by vertically scaling the instance sizes but you can upscale up to the biggest instance ... and then what else you can do is to add what MongoDB calls Shards, basically you will have to partition the database. This may sound very scary but MongoDB makes it quite easy. It will automatically split the data based on a key(s). You can provide the key yourself or let MongoDB use the _id - this is a special object that will be provided and is called Object identifier for every document stored.

Alright ! - we are doing quite ok so far - from infrastructure point of view we solved all the problems that the business asked, but is not over yet - what about backups ? You can't run a site without it ! First let's pause and think about what the application will be doing in the first place with the actual setup. Based on the capabilities that MongoDB offers will write to the PRIMARY and will read from the SECONDARY ... well that means that if the application servers are hosted into region US-EAST than will have to go over the wire to read from the servers hosted in US-WEST ?!. Not a very good idea ... but there is hope, MongoDB has a an option when you create a Replica Set which says that a specific member of the replica will be hidden, meaning will replicate data but will not make itself available for any reads or writes.

With this in place we can actually have servers into a remote region, US-WEST that will just replicate the data but never been actually used by the application servers. Well this is the best candidate for backups from all other members!

The final infrastructure diagram including two Replica Sets, two Shards and the backups looks like this:


Now let's switch roles and have the developer hat on. You've been told about all this setup and you may think all you have to do is drop the code on the application servers and you are done, but wait a minute what will happen if a full replica set will be unavailable ? Well this is the part called design for failure.

Let's assume the application has a few a entities, one of them will be the User. In a typical SQL world you would have a table that has something like user_id, first_name and so on. Then all other tables will be linked to this table with Foreign Primary Key on user_id. If we would try to replicate this scenario the schema will look like this:


    /* users collection */
    > db.users.find().pretty()
            "_id" : ObjectId("5107fb6736141503d37b6a31"),
            "username" : "johnd",
            "password" : "bc7a0154948baa69ecbe1d7843b25113fc5f3f20",
            "first_name" : "fname"
    /* objects collection - linking back to users via user_id */
    > db.objects.find().pretty()
            "_id" : ObjectId("5107fc7536141503d37b6a32"),
            "user_id" : ObjectId("5107fb6736141503d37b6a31"),
            "data" : "all the goodies you need"


What can go wrong with this schema ?! Well, let's say we will shard on the users._id key, than MongoDB will split data based on the chunk size as need it and distribute it accordingly. On the objects collection we will have different options to shard, we can use the objects._id or objects.user_id etc. However if user X is located on shard01 and most of his entries (if not all) are located on shard02 than if shard02 is down the user will be without entries! If shard01 is down the user can't even use the system. So what will be a better approach? Locality of the data, have the user collection contain the objects collection. So this will look as:

    /* users collection embeds the objects collection */
    > db.users.find( {"username": "bobc"}).pretty()
            "_id" : ObjectId("5107fe2936141503d37b6a33"),
            "username" : "bobc",
            "password" : "ad7a0154948baa69ecbe1d7843b25113fc5f3f20",
            "first_name" : "fname",
            "data" : {
                    "key" : "value"

With this schema if any of the shards are down than the user can NOT use the system but in case it can use the system his data will be consistent.

The process of choosing the 'right' sharding key is very tricky, has a few constraints from the MongoDB part as well it depends on your data structure and requirements. For more info see Shard keys for MongoDB.

Final review of the total requirements:

  • has to be resilient to failures
  • MongoDB ability to function into a Replica Set.
  • has to be responsive all the time
  • MongoDB Replica Set, write to Primary and read from Secondary. In the case you need more capacity there are two options, upscale of instance size and sharding.
  • will need to support an initial high volume of users with the option to grow
  • Again Sharding and Replica Sets will fulfill.
  • balanced as 70% reads and 30% writes for the database traffic
  • For the writes you have the option to add more shards, for reads you can add more servers to the existing Replica Sets, I choose three servers but nobody stops you from adding five for example.
  • to be cost effective (this is always relative to the business's budget)
  • Considering all other requirements having three members per Replica Set will be the minimum to have.
  • because of the above (cost) constraint the business agree that if a catastrophic failures happens into a region will be ok to have downtime but not ok to not be able to bring the site up somewhere else in a manner of a few hours.
  • In the case of total failure of Primary Site from region US-EAST you can still have the data in US-WEST. If your infrastructure is automated it will take no time to re-create all three architecture layers.
  • Backups of the data
  • The nodes from US-WEST are the perfect candidate for this. Use EBS snapshots and ship data to S3. With these snapshots you can recover even if all nodes (including US-WEST) are down.