MongoDB is part of the NoSQL ecosystem and is presented as a scalable,
high-performance, auto-sharding database. A full list of the features that
MongoDB offers can be found at mongodb.org.
In this post I'll explain how MongoDB can be deployed into the cloud,
EC2 specifically, to support a three-layer architecture for a web application.
Before I go into technical details and show how it can be done, we have to
understand why things are done that way, since these days there are many ways to achieve the same
thing. I will wear different hats and start with being the architect.
First let's establish who is involved in the whole process of building a web application.
Obviously there is somebody who will pay for the application; I will call this entity the business.
The next entity is the developers, who will actually write the code, and the last (but not least)
is the infrastructure people, the admins (the architect belongs to this group).
All three entities can be combined into a single physical person or spread across different
departments in an enterprise, but the roles remain the same.
Now that we know who is involved, let's see what the business wants.
Well, that's quite simple and usually ranges from a single line such as
"I want an application that is resilient to failure and always responsive" to
a full business case that has all the fine-grained details. The more details
the better, but they do not necessarily answer the question I have in mind.
My real question is how popular the app is going to be.
The answer gives an estimate of what sort of traffic you will get, and from the traffic
volume and, hopefully, its pattern you can determine quite a lot,
from sizing the environment to how much it will cost to operate (remember, in EC2 there is only OPEX cost).
In my experience the business will not have any more input into this equation
and that is just fine. So the next step is to start looking at how the developers
think about it: will it be 'real time', how many reads and writes will be done from the web
servers to the database, how all the pieces will fall into place, etc.
I will start building on the following premises about the application:
- has to be resilient to failures (business's requirement)
- has to be responsive all the time (business's requirement)
- will need to support an initial high volume of users with the option to grow (business's requirement)
- database traffic balanced as 70% reads and 30% writes (developers' prediction)
- has to be cost effective (this is always relative to the business's budget)
- because of the above (cost) constraint the business agrees that if a catastrophic failure happens
in a region it is ok to have downtime, but not ok to be unable to bring the site up somewhere else within a matter of a few hours.
At this point we have enough details to start putting all the pieces together.
I will not cover the load balancer (piece no. 1) and the application servers (piece no. 2)
of the three-layer web app architecture. The focus is going to be the database (piece no. 3): MongoDB.
Starting at the bottom, the smallest part is an individual server (instance), so I will explain how
this can be made resilient to failure. In EC2 the instances sit in a flat network
(not talking about VPC) and the storage is divided into two types:
- ephemeral or instance storage - this is the disk space you have on the local hypervisor
that hosts your instance, and it is destroyed as soon as you terminate the instance
- network block storage - EBS volumes - which persist independently of the life of your instance
Obviously you will not store the database's data on the ephemeral storage, so the only real option is
EBS (network storage). EBS comes in two flavors these days: Provisioned IOPS and
standard. The difference between the two is that the first has guaranteed performance and the second
is best effort with a (quite low) minimum level of performance. A consequence of this is that
Provisioned IOPS is quite expensive compared with standard EBS. Since we do have a constraint
on cost we will have to look into using standard EBS, but there is hope: you can group a number of
volumes and use Linux tools such as LVM or mdadm (software RAID) to stripe or mirror them. So I will attach 10 EBS volumes
to each server that acts as a MongoDB database (you can use fewer than that of course, but I'll do a RAID 10).
# you will need to have your EBS volumes attached to the server
mdadm --create --verbose /dev/md0 --level=10 --raid-devices=8 /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq
# now create a file system
mkfs.xfs /dev/md0
# create the mount point (if it does not exist yet) and mount the array
mkdir -p /mnt/mongo/data
mount /dev/md0 /mnt/mongo/data
# I said 10 and there are 8 ...
# The remaining 2 volumes are used for journaling, so with journaling enabled the journal writes do not affect the data volumes
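To see what the 8-volume RAID 10 from the mdadm command gives you in terms of space, here is a minimal sketch of the arithmetic. It assumes the classic RAID 10 layout where every block is stored twice (mirrored pairs that are then striped); the function name and the 100 GiB volume size are made up for illustration and are not taken from this setup.

```javascript
// Sketch: capacity of a RAID 10 array built from equally sized volumes.
// Assumption: two copies of every block (mirrored pairs, striped together).
function raid10Capacity(volumeCount, volumeSizeGiB) {
  if (!Number.isInteger(volumeCount) || volumeCount < 2 || volumeCount % 2 !== 0) {
    throw new RangeError("this sketch expects an even number of volumes (at least 2)");
  }
  const mirroredPairs = volumeCount / 2;
  return {
    mirroredPairs,                            // data is striped across this many pairs
    rawGiB: volumeCount * volumeSizeGiB,      // what you pay for
    usableGiB: mirroredPairs * volumeSizeGiB, // what the file system sees
  };
}

// 8 data volumes as in the mdadm command; 100 GiB each is a hypothetical size.
console.log(raid10Capacity(8, 100));
```

Half of the raw space goes to the mirror copies, which is the price for surviving the loss of a volume without losing data.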
With this done, let's look at how MongoDB offers resilience to failure. The solutions on offer are
Replica Sets and Sharding, that is, partitioning at the database layer.
MongoDB Replica Sets: what they are and how they work. The idea is to have a set of database servers grouped
together as a replica set, which replicate the data between them asynchronously.
Within the replica set there is one PRIMARY and a number of SECONDARY databases. Which one is the
PRIMARY is established by a process called voting (an election).
How a database server becomes PRIMARY is based on different criteria; the important thing
is that it is the only database server in the replica set that accepts writes.
All other servers in the replica set only accept reads. Another very important factor is that
a replica set can keep a PRIMARY only while a majority (more than half) of its voting members is available. That is, from 3 servers
you can lose only 1: you just lost 33.3% and still have 66.6% up. If you have 4 servers and you lose 2,
guess what ... well, you just lost 50% of your members, the remaining 2 are not a majority, and the replica set is left without a PRIMARY.
Remember, you need more than half available. Knowing this, it is obvious that using an odd number of members in a replica set
is the preferred choice. There are workarounds for this: you can have servers that do not participate in voting,
you can have arbiters (special servers that only participate in voting) and you can also give more weight to a server
in the voting process.
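The majority rule is easy to check with a few lines of code. This is a minimal sketch of the arithmetic only; the function names are made up for illustration and it assumes every member has exactly one vote.

```javascript
// Sketch: how many voting members must be up to elect a PRIMARY,
// and how many members the set can lose before it is left without one.
// Assumption: every member carries exactly one vote.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

function tolerableFailures(votingMembers) {
  return votingMembers - majority(votingMembers);
}

for (const n of [3, 4, 5]) {
  console.log(`${n} members: majority = ${majority(n)}, can lose ${tolerableFailures(n)}`);
}
```

Going from 3 to 4 members does not let you lose one more server, while going from 3 to 5 does, which is why odd sizes are the preferred choice.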
Why does all this happen, you may wonder? Well, there is the
CAP theorem. The majority rule means that during a network split only the side that still sees a majority can have a PRIMARY,
so MongoDB keeps writes consistent at the price of some availability.
This is how 10gen, the makers of MongoDB, decided to design the product and we have to live with it. You can read why and how at
Consistency and Availability at MongoDB.
Now let's pause for a minute and see how the overall database cluster will look with a Replica Set:
[Diagram: database cluster with a single MongoDB Replica Set in one region]
However, having all instances in one single region doesn't look too good in case of a total failure of that region.
It is always better to have an instance outside it; in this case I chose US-WEST (California).
[Diagram: the Replica Set with an additional member in US-WEST]
This is how we are doing so far with respect to the initial requirements:
- has to be resilient to failures - the Replica Set provides that
- if a catastrophic failure happens in a region it is ok to have downtime, but not ok
to be unable to bring the site up somewhere else within a matter of a few hours.
Having a member of the Replica Set in US-WEST fulfills it.
Well, how about the rest of the requirements?!
From the infrastructure point of view there is only one more requirement that you can satisfy: room to grow.
This can be solved initially by vertically scaling the instance sizes, but you can only scale up to the biggest
instance ... and then what else you can do is add what MongoDB calls Shards; basically you
partition the database. This may sound very scary but MongoDB makes it quite easy. It will automatically split
the data based on one or more keys. You can choose the key yourself or shard on _id, a special
field that MongoDB provides for every document stored and that by default holds
an ObjectId (object identifier).
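To make the idea of splitting data by key more concrete, here is a minimal sketch of range-based partitioning. It is not MongoDB's implementation: the chunk table, the shard names and the choice of username as the key are all hypothetical, picked only to show how a key value maps to the shard that owns its range.

```javascript
// Hypothetical chunk table: each chunk owns the key range [min, max).
// null stands for "no lower bound" or "no upper bound".
const chunks = [
  { min: null, max: "h", shard: "shard01" },
  { min: "h", max: "p", shard: "shard02" },
  { min: "p", max: null, shard: "shard01" },
];

// Find the shard whose chunk contains the given key.
function shardFor(key, chunkTable) {
  const chunk = chunkTable.find(
    (c) => (c.min === null || key >= c.min) && (c.max === null || key < c.max)
  );
  if (!chunk) throw new Error(`no chunk covers key ${key}`);
  return chunk.shard;
}

console.log(shardFor("bobc", chunks));  // falls in the first chunk
console.log(shardFor("johnd", chunks)); // falls in the second chunk
```

When a chunk grows too large it is split into two ranges, and whole chunks can be moved between shards to keep them balanced.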
Alright! We are doing quite ok so far: from the infrastructure point of view we solved all the problems that the
business asked about, but it is not over yet. What about backups? You can't run a site without them!
First let's pause and think about what the application will be doing with the
actual setup. Based on the capabilities that MongoDB offers, it will write to the PRIMARY
and read from the SECONDARY members ... well, that means that if the application servers are hosted in
region US-EAST then they will have to go over the wire to read from the servers hosted in US-WEST?! Not a very good idea ...
but there is hope: MongoDB has an option in the Replica Set configuration which says that a specific member of the
replica set is hidden, meaning it replicates the data but does not make itself available to the application for any reads or writes.
With this in place we can actually have servers in a remote region, US-WEST, that just replicate the data but
are never actually used by the application servers. Well, this makes them the best candidates for backups among all the members!
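As a rough illustration, a replica set configuration with a hidden member could be assembled like this. The helper function, the set name rs0 and the host names are hypothetical; the field names (_id, members, host, priority, hidden) are the ones used by the replica set configuration document, and a hidden member must also have priority 0.

```javascript
// Sketch: build a replica set configuration document with hidden members.
// buildReplicaSetConfig and all host names are hypothetical placeholders.
function buildReplicaSetConfig(setName, visibleHosts, hiddenHosts) {
  const visible = visibleHosts.map((host) => ({ host }));
  // A hidden member must have priority 0, so it can never become PRIMARY.
  const hidden = hiddenHosts.map((host) => ({ host, priority: 0, hidden: true }));
  const members = [...visible, ...hidden].map((m, i) => ({ _id: i, ...m }));
  return { _id: setName, members };
}

const config = buildReplicaSetConfig(
  "rs0",
  ["mongo1.us-east.example.com:27017", "mongo2.us-east.example.com:27017"],
  ["mongo3.us-west.example.com:27017"]
);
console.log(JSON.stringify(config, null, 2));
// In the mongo shell, a document of this shape is what you would pass to rs.initiate().
```

The hidden member still replicates the data and still votes in elections, so it counts towards the majority while staying invisible to the application.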
The final infrastructure diagram including two Replica Sets, two Shards and the backups looks like this:
[Diagram: final infrastructure with two Replica Sets, two Shards and the backups]
Now let's switch roles and put the developer hat on. You've been told about all this setup and you may think all you
have to do is drop the code on the application servers and you are done, but wait a minute, what will happen if a full
replica set becomes unavailable? Well, this is the part called design for failure.
Let's assume the application has a few entities, one of them being the User. In a typical SQL world you
would have a table with columns like user_id, first_name and so on. Then all other tables would be linked to
this table with a foreign key on user_id. If we tried to replicate this scenario, the schema would
look like this:
/* users collection */
> db.users.find().pretty()
{
"_id" : ObjectId("5107fb6736141503d37b6a31"),
"username" : "johnd",
"password" : "bc7a0154948baa69ecbe1d7843b25113fc5f3f20",
"first_name" : "fname"
}
>
/* objects collection - linking back to users via user_id */
> db.objects.find().pretty()
{
"_id" : ObjectId("5107fc7536141503d37b6a32"),
"user_id" : ObjectId("5107fb6736141503d37b6a31"),
"data" : "all the goodies you need"
}
>
What can go wrong with this schema?! Well, let's say we shard on the users._id key; then MongoDB will
split the data into chunks, based on the chunk size, as needed and distribute them accordingly. On the objects collection we
have different options to shard: we can use objects._id or objects.user_id etc. However, if user X is located on
shard01 and most of his entries (if not all) are located on shard02, then if shard02 is down the user will be
without entries! If shard01 is down the user can't even use the system. So what would be a better approach?
Locality of the data: have the users collection contain the objects. So this will look like:
/* users collection embeds the objects collection */
> db.users.find( {"username": "bobc"}).pretty()
{
"_id" : ObjectId("5107fe2936141503d37b6a33"),
"username" : "bobc",
"password" : "ad7a0154948baa69ecbe1d7843b25113fc5f3f20",
"first_name" : "fname",
"data" : {
"key" : "value"
}
}
With this schema, if the shard that holds a user is down then that user can NOT use the system, but whenever
he can use the system his data will be complete and consistent.
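The difference between the two schemas can be sketched as a tiny simulation of what user X experiences when one shard is down. The placements and the function name are hypothetical and only mirror the shard01/shard02 example from the text.

```javascript
// Sketch: what a user experiences when one shard is down.
// profileShard holds the user document, objectShards hold that user's objects.
function userStatus(placement, downShard) {
  if (placement.profileShard === downShard) return "unavailable";
  if (placement.objectShards.includes(downShard)) return "missing data";
  return "ok";
}

// Hypothetical placements for user X under the two schemas.
const referenced = { profileShard: "shard01", objectShards: ["shard02"] };
// Embedded objects live inside the user document, so they share its shard.
const embedded = { profileShard: "shard01", objectShards: ["shard01"] };

for (const down of ["shard01", "shard02"]) {
  console.log(
    `${down} down: referenced = ${userStatus(referenced, down)}, embedded = ${userStatus(embedded, down)}`
  );
}
```

With the embedded schema the "missing data" state cannot occur for a single user: either everything about the user is reachable or nothing is.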
The process of choosing the 'right' shard key is very tricky: it has a few constraints on the MongoDB side
and it also depends on your data structure and requirements. For more info see
Shard keys for MongoDB.
Final review of all the requirements:
- has to be resilient to failures
MongoDB's ability to run as a Replica Set.
- has to be responsive all the time
MongoDB Replica Set: write to the PRIMARY and read from the SECONDARY members. In case you need more capacity
there are two options: scaling up the instance size and sharding.
- will need to support an initial high volume of users with the option to grow
Again, Sharding and Replica Sets fulfill this.
- balanced as 70% reads and 30% writes for the database traffic
For the writes you have the option to add more shards; for reads you can add more servers to
the existing Replica Sets. I chose three servers but nobody stops you from adding five, for example.
- to be cost effective (this is always relative to the business's budget)
Considering all the other requirements, three members per Replica Set is the minimum to have.
- because of the above (cost) constraint the business agrees that if a catastrophic failure happens in
a region it is ok to have downtime, but not ok to be unable to bring the site up somewhere else within a
matter of a few hours.
In the case of a total failure of the primary site in region US-EAST you still have the data in US-WEST.
If your infrastructure is automated it will take very little time to re-create all three architecture layers.
- Backups of the data
The nodes in US-WEST are the perfect candidates for this. Use EBS snapshots and ship the data to S3.
With these snapshots you can recover even if all nodes (including US-WEST) are down.