CST 363 - Intro to Database Systems
This course provides balanced coverage of database use and design, focusing on relational databases. Students will learn to design relational schemas, write SQL queries, access a DB programmatically, and perform database administration. Students will gain a working knowledge of the algorithms and data structures used in query evaluation and transaction processing. Students will also learn to apply newer database technologies such as XML, NoSQL, and Hadoop.
Map-Shuffle-Reduce

Description
A simple implementation of a Map-Shuffle-Reduce database in Python. The algorithm addresses the problem of querying a database whose data is spread across multiple shards. For this SimpleDB example, there are two tables spread across multiple shards:
- Employees (sharded by the employee id)
- Departments (sharded by the department id)
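Routing a row to its shard can be sketched as a simple modulo on the shard key. The shard count and helper name below are assumptions for illustration, not part of the original project:

```python
# Sketch of shard routing by key; NUM_SHARDS and shard_for are
# hypothetical names, assuming a fixed number of nodes.
NUM_SHARDS = 3

def shard_for(key: int) -> int:
    """Return the index of the node responsible for a given id."""
    return key % NUM_SHARDS

# Employees route by employee id and Departments by department id,
# so an employee and their department may land on different nodes.
print(shard_for(10))  # employee 10 -> shard 1
print(shard_for(4))   # department 4 -> shard 1
```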
For instance, if the user runs a JOIN query over the Employees and Departments tables, they will quickly find that a particular node may not hold all the employees for a department, or that a department is not stored on the same node as any of its employees. To solve this, we execute the algorithm as a three-step process, followed by a cleanup:
- Map Step: Take the smaller table, in this case Employees, and run a SELECT on every shard that retrieves only the minimum columns required to perform the join.
- Shuffle Step: A temporary table is created on each node, and the mapped rows are shuffled to the proper node by their new shard key. In this case, the employee rows are re-sharded by department id and inserted into the temporary table on the node that owns that department.
- Reduce Step: The JOIN query can then be executed locally over the temporary employees table and the Departments table on each node, and the partial results are returned to the main process.
- Cleanup: At the end of the query, any temporary tables are dropped to keep the database tidy.
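The steps above can be sketched end to end using in-memory SQLite databases as stand-in shards. All table, column, and variable names here are assumptions for illustration, not the original project's schema:

```python
import sqlite3

# Two in-memory SQLite databases stand in for the shard nodes.
NUM_SHARDS = 2
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]

# Populate: employees sharded by emp_id, departments by dept_id.
for db in shards:
    db.execute("CREATE TABLE employees (emp_id INT, name TEXT, dept_id INT)")
    db.execute("CREATE TABLE departments (dept_id INT, dept_name TEXT)")

employees = [(1, "Ada", 10), (2, "Grace", 11), (3, "Alan", 10)]
departments = [(10, "Research"), (11, "Engineering")]
for emp in employees:
    shards[emp[0] % NUM_SHARDS].execute("INSERT INTO employees VALUES (?,?,?)", emp)
for dept in departments:
    shards[dept[0] % NUM_SHARDS].execute("INSERT INTO departments VALUES (?,?)", dept)

# Map: on every shard, select only the columns needed for the join.
mapped = []
for db in shards:
    mapped.extend(db.execute("SELECT emp_id, name, dept_id FROM employees"))

# Shuffle: re-shard the employee rows by dept_id into a temporary
# table on the node that owns that department.
for db in shards:
    db.execute("CREATE TEMP TABLE emp_by_dept (emp_id INT, name TEXT, dept_id INT)")
for row in mapped:
    shards[row[2] % NUM_SHARDS].execute("INSERT INTO emp_by_dept VALUES (?,?,?)", row)

# Reduce: each node can now perform the join locally; the main
# process collects the partial results.
result = []
for db in shards:
    result.extend(db.execute(
        "SELECT e.name, d.dept_name FROM emp_by_dept e "
        "JOIN departments d ON e.dept_id = d.dept_id"))

# Cleanup: drop the temporary tables.
for db in shards:
    db.execute("DROP TABLE emp_by_dept")

print(sorted(result))
# [('Ada', 'Research'), ('Alan', 'Research'), ('Grace', 'Engineering')]
```

Real systems avoid funneling the mapped rows through a single coordinator as this sketch does; each node would stream its rows directly to the owning node. The structure of the four phases is the same either way.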
The end result is the ability to query data that spans multiple shards. However, the process can be taxing on both the database and the system, since large amounts of data must be passed around, written, and stored. This is, of course, a naive approach meant to illustrate the mechanisms behind databases such as Hadoop that solve this type of problem and optimize around it. While the implementations differ, the principles remain the same: let the user store data across multiple databases, scaling horizontally, while keeping the data accessible to any and all nodes for aggregations and other multi-table queries.