Suppose you created a database for a web application a few years ago. It started with a handful of users but steadily grew, and now its growth is far outpacing the server’s relatively limited resources. You could upgrade the server, but that would only stem the effects of the growth for a year or two. Also, now that you have thousands of users, you are worried not only about scalability but also about reliability.
If that one database server fails, a few sleepless nights and many hours of downtime might be required to get a brand new server configured to host the database again. No one-time solution is going to scale infinitely or be perfectly reliable, but there are a number of ways to distribute a database across multiple servers that will increase the scalability and reliability of the entire system.
Put simply, we want to have multiple servers hosting the same database. This will prevent a single failure from taking the entire database down, and it will spread the database across a large resource pool.
By definition, a distributed database is one that is run by a central management system but which has storage nodes distributed among many processors. These “slave” nodes could be in the same physical location as the “master”, or they could be connected via a LAN, WAN, or the Internet. Many times, database nodes in configurations like this have significant failover properties, like RAID storage and/or off-site backup, to improve chances of successful recovery after a database failure.
Distributed databases can exist in many configurations, each of which may be used alone or combined to achieve different goals.
A distributed database is divided into sections called nodes. Each node typically runs on a different computer, or at least a different processor, but this is not true in all cases.
One of the usual reasons for distributing a database across multiple nodes is to more optimally manage the size of the database.
For example, if a database contains information about customers in the United States, it might be distributed across three servers, one each for the eastern United States, the mid-western United States, and the western United States.
Each server might be responsible for customers with certain ZIP codes. Since ZIP codes are generally arranged from lowest to highest as they progress westward across the country, the actual limits on the ZIP codes might be 00000 through 33332, 33333 through 66665, and 66666 through 99999, respectively.
In this case, each node would be responsible for approximately one third of the data for which a single, non-distributed node would be. If each of these three nodes approached its own storage limit, another node or two nodes might be added to the database, and the ZIP codes for which they are responsible could be altered appropriately. More “intelligent” configurations could be imagined, as well, wherein, for example, population density is considered, and larger metropolitan areas like New York City would be grouped with fewer other cities and towns.
In a distribution like this, either the database application or the database management system, itself, could be responsible for deciding which database node would process requests for certain ZIP codes. Regardless of which approach is taken, the distribution of the database must remain transparent to the user of the application. That is, the user should not realize that separate databases might handle different transactions.
Reducing node storage size in this manner is an example of using horizontal fragments to distribute a database. This means that each node contains a subset of the larger database’s rows.