Unit 08 | Building the Analytics Store |
| 5 + 1 hrs |
Upon completion of this module, you will be able to:
- list the different genres of databases
- define key-value and columnar databases
- define document and graph databases
- contrast the databases with the common relational model
- explore Riak, Redis, and HBase as examples of key-value and columnar databases
- explore MongoDB and CouchDB as examples of document databases
- explore Neo4J as an example of a graph database
- insert and retrieve data from MongoDB through R
The Role of the Analytics Store |
45 min
|
While relational database management systems (RDBMSs) are the workhorses of the data storage world, they are based on a set-based model that is not appropriate for certain kinds of data storage needs. Recently, non-relational databases that use query languages other than SQL have become more common. Two genres of non-relational (or NoSQL) database management systems are key-value and columnar databases, such as Riak, Redis, and HBase. In this module, we'll take a look at the different genres and how they approach data storage. We'll then focus on those three databases to explore when these two genres offer appropriate storage mechanisms.
|
Required Work
Additional Resources
|
Database Genres |
30 min
|
Databases can be broadly classified into several "styles." The most common style is the relational database, which organizes data into tables and emphasizes relationships between the tables. The SQL query language is used to combine tables and select data. Postgres, SQLite, MySQL, Oracle, and SQL Server are examples of common relational database management systems. Alternative models include Key-Value, Columnar, Document, Graph, and Polyglot databases. Most of these use query mechanisms other than SQL (hence the name NoSQL databases), although some of them do offer SQL for expressing queries because of SQL's versatility, ubiquity, and platform independence.
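To make the relational style concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented for illustration; the point is only to show data organized into tables and combined with SQL through a shared key.

```python
import sqlite3

# In-memory SQLite database; table names, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "total REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)])

# SQL combines the two tables through the shared key and
# aggregates the order totals per customer.
rows = conn.execute(
    "SELECT c.name, SUM(o.total) "
    "FROM customers c JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(rows)
conn.close()
```

The non-relational genres described in the following sections trade this join-centric model for simpler or more specialized access patterns.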
|
Required Work
Additional Resources
Slide Deck |
Key-Value Databases |
60 min
|
Key-Value (KV) databases are among the simplest types of data store, if not the simplest. A KV database stores keys along with the value matching each key, much like a lookup table. The program provides a key (a record ID or some other unique identifier) and the database returns the value corresponding to that key. Some KV databases allow the values to be lists, but most hold simple values. As with relational databases, many open source implementations are available, including Voldemort, memcached, memcachedb, membase, Redis, and Riak. For example, in Riak the values matching the keys can be simple numbers or strings, but also XML documents or images. Riak's design is based on the principles described in Amazon's Dynamo paper, and it offers an HTTP REST interface for data queries. Redis, on the other hand, supports complex data types such as sorted sets and offers basic messaging patterns for data distribution, such as publish-subscribe and blocking queues. Redis offers very high performance because it keeps its data in memory, but at the risk of occasional data loss, since by default it writes to disk only periodically.
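The lookup-table idea can be sketched in a few lines of Python. This is a toy, in-memory illustration only; the `KeyValueStore` class and the key names are hypothetical and do not reflect how Riak or Redis are implemented or accessed.

```python
class KeyValueStore:
    """Toy in-memory key-value store (illustrative only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store or overwrite the value for this key.
        self._data[key] = value

    def get(self, key, default=None):
        # Return the value for the key, or the default if the key is absent.
        return self._data.get(key, default)

    def delete(self, key):
        # Remove the key; report whether it was present.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("user:42:name", "Ada")                  # simple string value
store.put("user:42:tags", ["admin", "analyst"])   # list value

print(store.get("user:42:name"))
print(store.get("user:42:tags"))
print(store.get("user:99:name", default="<missing>"))
```

Note that the store never inspects the value: all access goes through the key, which is what keeps this genre simple and fast.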
|
Required Work
Additional Resources
|
Columnar Databases |
60 min
|
Columnar databases are, as the name implies, column-oriented, in contrast to the relational model's row orientation. It is relatively easy and computationally inexpensive to add new columns of data. Each record is identified by a key, and each record contains zero or more columns (fields). A column does not have to hold the same data type in every row and, in fact, rows may have different numbers of columns, which makes sparse data efficient to store. Columnar databases sit about halfway between relational and key-value databases. HBase, Cassandra, and Hypertable are among the most popular columnar database implementations.
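The sparse, keyed layout described above can be sketched with nested dictionaries. This is only an illustration of the data model, not of any particular engine's storage format; the row keys, the `family:qualifier` style column names, and the `column_values` helper are assumptions made for the example.

```python
from collections import defaultdict

# row key -> {column name: value}; absent columns simply take no space.
table = defaultdict(dict)

table["row1"]["info:name"] = "Ada"
table["row1"]["info:age"] = 36
table["row2"]["info:name"] = "Grace"

# Adding a brand-new column touches only the rows that use it.
table["row2"]["contact:email"] = "grace@example.org"


def column_values(table, column):
    """Collect one column across all rows, skipping rows that lack it."""
    return {key: cols[column] for key, cols in table.items() if column in cols}


print(column_values(table, "info:name"))
print(column_values(table, "contact:email"))
```

Here `row1` and `row2` carry different sets of columns, and no placeholder is stored for the missing ones.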
|
Optional Work
Additional Work
|
Document and Graph Databases |
60 min
|
Document databases store their information in "document" structures. Documents are nested structures, generally stored as JSON or XML objects because those formats provide inherent nesting capability. A document database imposes few formal structures as long as the information can be expressed as a "document". Most document databases are queried using JavaScript rather than SQL. Many of them, such as MongoDB, support fast concurrent query execution using MapReduce. CouchDB's major strength is its resilience to hardware and network failure, making it an ideal choice for mobile environments.
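As a rough illustration of the document model, the sketch below keeps a small "collection" of nested, JSON-like documents in memory and filters them by a dotted field path. The helper names `get_path` and `find` and the sample documents are hypothetical; real document databases provide their own query interfaces.

```python
import json


def get_path(doc, dotted_path):
    """Follow a dotted path such as 'address.city' into a nested document."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the path does not exist in this document
        current = current[part]
    return current


def find(collection, dotted_path, expected):
    """Return the documents whose value at dotted_path equals expected."""
    return [d for d in collection if get_path(d, dotted_path) == expected]


# Documents in one collection need not share the same fields.
people = [
    {"name": "Ada", "address": {"city": "London", "zip": "N1"}},
    {"name": "Grace", "address": {"city": "New York"}, "languages": ["COBOL"]},
    {"name": "Linus"},
]

matches = find(people, "address.city", "London")
print(json.dumps(matches, indent=2))
```

The third document has no `address` at all, and the query simply skips it rather than failing.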
Graph databases are less commonly used compared to key-value, columnar, and document databases. However, they are an attractive choice for storing "network" information such as that occurring in "social networks". Graph databases store nodes of information with relationships between the nodes. One of the most popular graph databases is Neo4J. Neo4J allows extremely fast traversal of networks to find relevant information.
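The node-and-relationship model lends itself to traversal queries. The sketch below is a plain-Python breadth-first search over a made-up "follows" network; it illustrates the kind of question a graph database answers (how are two people connected?) and is not Neo4J code. The `follows` data and the `shortest_path` helper are assumptions for the example.

```python
from collections import deque

# Hypothetical social network: person -> people they follow.
follows = {
    "ana": ["ben", "cy"],
    "ben": ["dee"],
    "cy": ["dee"],
    "dee": ["eli"],
    "eli": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search returning one shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal is not reachable from start


print(shortest_path(follows, "ana", "eli"))
print(shortest_path(follows, "eli", "ana"))
```

Expressing the same question in a relational database would typically require a chain of self-joins, one per hop.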
MongoDB is one of several non-relational database alternatives that started to appear in the mid-2000s under the NoSQL banner. Instead of using tables and rows as in relational databases, MongoDB is built on an architecture of collections and documents. Documents comprise sets of key-value pairs and are the basic unit of data in MongoDB. Collections contain sets of documents and function as the equivalent of relational database tables. Like other NoSQL databases, MongoDB supports dynamic schema design, allowing the documents in a collection to have different fields and structures. The database uses a document storage and data interchange format called BSON, which provides a binary representation of JSON-like documents. Automatic sharding (horizontal partitioning) enables the data in a collection to be distributed across multiple systems for horizontal scalability as data volumes increase, and MapReduce can be used to process queries in parallel.
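The idea of distributing a collection's documents across multiple systems can be sketched as hash-based partitioning on a chosen key. This is a conceptual illustration only: the function names, the use of MD5, and the sample documents are assumptions, and MongoDB's actual shard-key hashing and chunk balancing work differently.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index with a stable hash (illustrative choice)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(documents, shard_key, num_shards):
    """Place each document on the shard selected by its shard-key value."""
    shards = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[shard_for(doc[shard_key], num_shards)].append(doc)
    return shards


# Hypothetical collection of documents keyed by user_id.
docs = [{"user_id": i, "visits": i * 3} for i in range(20)]
shards = distribute(docs, "user_id", num_shards=4)

for index, shard in enumerate(shards):
    print(index, len(shard))
```

Because the hash is stable, a lookup by `user_id` can go directly to the one shard that holds the document, while a query over all users can be run on every shard in parallel and the partial results combined.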
|
Required Work
Additional Resources
Slide Deck & Data Sets |