Unit 08 | Analytics Store - DA5020 | Collect-Store-Analyze Data

Unit 08 | Building the Analytics Store

5 + 1 hrs

Upon completion of this module, you will be able to:

list the different genres for databases
define key-value and columnar databases
define document and graph databases
contrast the databases with the common relational model
explore Riak, Redis, and HBase as examples of key-value and columnar databases
explore MongoDB and CouchDB as examples of document databases
explore Neo4J as an example of a graph database
insert and retrieve data from MongoDB through R

The Role of the Analytics Store

45 min

While relational database management systems (RDBMSs) are the workhorses of the data storage world, they are based on a set-oriented model that is not appropriate for certain kinds of data storage needs. Recently, non-relational databases that use query languages other than SQL have become more common. Two genres of non-relational (or NoSQL) database management systems are key-value and columnar databases, such as Riak, Redis, and HBase. In this module, we'll take a look at the different genres and how they approach data storage. We'll then focus on these three databases to explore when the two genres present appropriate storage mechanisms.

Required Work

Watch the introductory lessons to understand the need for databases to store tidy data and the need for different types of databases for different kinds of data and different analytics, predictive modeling, and data mining techniques.
Read the overview of NoSQL databases by Tim Perdue.

Additional Resources

None

Database Genres

30 min

Databases can be broadly classified into several "styles." The most common style is the relational database, which organizes data into tables and emphasizes relationships between the tables. The SQL query language is used to combine tables and select data. Postgres, SQLite, MySQL, Oracle, and SQL Server are examples of common relational database management systems. Alternative models include Key-Value, Columnar, Document, Graph, and Polyglot databases. Most of these use query mechanisms other than SQL, hence the name NoSQL databases. Some of them nevertheless offer SQL to express queries because of SQL's versatility, ubiquity, and platform independence.
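
To make the relational style concrete, the following is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customers, orders) and the sample rows are hypothetical and chosen only for illustration; the point is that SQL combines tables through their relationship and selects data from the result.

```python
import sqlite3

# In-memory SQLite database; tables, columns, and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# SQL combines the two tables through their relationship (a join)
# and selects one aggregate value per customer.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)
```

The NoSQL genres discussed in the rest of this unit trade this join-based flexibility for simpler or more specialized access patterns.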

Required Work

View the lecture to get an overview of key-value and columnar databases.
After viewing the lecture, read A Comparison of NoSQL Database Management Systems & Models and then read Sadalage, P. (2014). NoSQL Databases: An Overview. ThoughtWorks, Oct 3, 2014.

Additional Resources

Peng, Roger. Interacting with data using the filehash package for R.

Slide Deck

Slide deck

Key-Value Databases

60 min

Key-Value (KV) databases are one of the simplest, if not the simplest, types of data store. A KV database stores keys along with the value matching each key, much like a lookup table. The program provides a key (a record ID or some other unique identifier) and the database returns the value corresponding to that key. Some KV databases allow the values to be lists, but most hold simple values. As with relational databases, many open source implementations are available, including Voldemort, memcached, memcachedb, membase, Redis, and Riak. For example, in Riak the values matching the keys can be simple numbers or strings, but also XML documents or images. Riak's design is based on the principles of Amazon's Dynamo system, and it offers an HTTP REST interface for data queries. Redis, on the other hand, supports complex data types such as sorted sets and offers basic messaging patterns for data distribution such as publish-subscribe and blocking queues. Redis offers very high performance because it keeps data in memory, but at the risk of occasional data loss, since writes are persisted to disk asynchronously.
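
The lookup-table idea can be sketched in a few lines of Python. The class below is a toy in-memory store written for illustration only; it is not the client API of Riak, Redis, or any other product, and the class name, method names, and keys are assumptions.

```python
class KeyValueStore:
    """Toy in-memory key-value store (illustration only, not a real database client)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Storing under an existing key overwrites the previous value.
        self._data[key] = value

    def get(self, key, default=None):
        # Return the value for the key, or the default if the key is unknown.
        return self._data.get(key, default)

    def delete(self, key):
        # Return True if the key existed and was removed, False otherwise.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("user:42", "Ada Lovelace")           # simple value
store.put("user:42:langs", ["R", "Python"])    # list value, as some KV stores allow

print(store.get("user:42"))
print(store.get("user:99", "not found"))
```

Note that the only way to reach a value is through its key; there is no query over the values themselves, which is what keeps this genre simple and fast.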

Required Work

View the lecture and then read Sadalage, P. (2014). NoSQL Databases: An Overview. ThoughtWorks, Oct 2, 2014.

Additional Resources

None

Columnar Databases

60 min

Columnar databases are, as the name implies, column-oriented, in contrast to the relational model's row orientation. It is relatively easy and computationally inexpensive to add new columns of data. Each record is identified through a key, and each record contains zero or more columns (fields). The columns do not have to be of the same data type across rows and, in fact, rows may have different numbers of columns, which makes sparse data efficient to store. Columnar databases are about half-way between relational and key-value databases. HBase, Cassandra, and Hypertable are among the most popular columnar database implementations.
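
The keyed, sparse record layout described above can be sketched as a dictionary of dictionaries. This is a simplified illustration under stated assumptions, not the storage format or API of HBase, Cassandra, or Hypertable; the row keys, column names, and helper functions are hypothetical.

```python
from collections import defaultdict

# row key -> {column name -> value}; only columns that actually exist are stored,
# so sparse rows cost nothing for the columns they lack.
table = defaultdict(dict)


def put(row_key, column, value):
    """Set one column of one row; new columns need no schema change."""
    table[row_key][column] = value


def get(row_key, column, default=None):
    """Read one column of one row, or the default if the row or column is absent."""
    return table.get(row_key, {}).get(column, default)


def scan_column(column):
    """Return {row_key: value} for every row that has the given column."""
    return {key: cols[column] for key, cols in table.items() if column in cols}


put("row1", "name", "Ada")
put("row1", "email", "ada@example.org")
put("row2", "name", "Grace")
put("row2", "rank", 7)  # a column only row2 has, with a different data type

print(scan_column("name"))
print(get("row1", "rank"))
```

Real columnar systems add features this sketch omits, such as grouping columns into families and keeping multiple versions of a cell.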

Optional Work

View the guest lecture by Martin Fowler. Martin Fowler is a sought-after expert in modern multi-tier software architectures and a key contributor to UML and Agile. He serves as Chief Scientist at ThoughtWorks.

Additional Work

None

Document and Graph Databases

60 min

Document databases store their information in "document" structures. Documents are nested structures, generally stored as JSON or XML objects because those formats provide inherent nesting capability. A document database imposes few formal structures as long as the information can be expressed as a "document". Most document databases are queried using JavaScript rather than SQL. Many of them, such as MongoDB, support parallel query execution using MapReduce. CouchDB's major strength is its resilience to hardware and network failure, making it an ideal choice for mobile environments.

Graph databases are less commonly used than key-value, columnar, and document databases. However, they are an attractive choice for storing "network" information such as that occurring in "social networks". Graph databases store nodes of information with relationships between the nodes. One of the most popular graph databases is Neo4J. Neo4J allows extremely fast traversal of networks to find relevant information.
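
The kind of network traversal a graph database is built for can be sketched with a breadth-first search over an adjacency structure. This is plain Python for illustration, not Neo4J's API or query language; the node names, the "follows" relationship, and the function name are hypothetical.

```python
from collections import deque

# Hypothetical social network: node -> set of nodes it has a "follows" relationship to.
follows = {
    "ann": {"bob", "cat"},
    "bob": {"dan"},
    "cat": {"dan", "eve"},
    "dan": {"fay"},
    "eve": set(),
    "fay": set(),
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: map each node reachable within max_hops to its distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    del dist[start]  # report only the other nodes
    return dist


print(within_hops(follows, "ann", 2))
```

In a relational database the same question ("who is within two hops of ann?") would require one self-join per hop, which is why relationship-heavy data is a natural fit for the graph genre.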

MongoDB is one of several non-relational database alternatives that started to appear in the mid-2000s under the NoSQL banner. Instead of using tables and rows as in relational databases, MongoDB is built on an architecture of collections and documents. Documents comprise sets of key-value pairs and are the basic unit of data in MongoDB. Collections contain sets of documents and function as the equivalent of relational database tables. Like other NoSQL databases, MongoDB supports dynamic schema design, allowing the documents in a collection to have different fields and structures. The database uses a document storage and data interchange format called BSON, which provides a binary representation of JSON-like documents. Automatic sharding (horizontal partitioning) enables data in a collection to be distributed across multiple systems for horizontal scalability as data volumes increase, and MapReduce can be used to process data in parallel.
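
The collection-and-document structure can be sketched with plain Python dictionaries. This is an illustration only and does not use a MongoDB driver: the collection, its field names, and the find helper are hypothetical, and the helper handles only equality on top-level fields, a small subset of what a real document database query supports.

```python
import json

# A "collection" is modeled as a list of documents (dicts of key-value pairs).
# Documents in the same collection may have different fields and nested structures.
students = [
    {"_id": 1, "name": "Ada", "major": "Math"},
    {"_id": 2, "name": "Grace", "major": "CS", "languages": ["COBOL", "R"]},
    {"_id": 3, "name": "Edgar", "address": {"city": "Boston", "state": "MA"}},
]


def find(collection, query):
    """Return documents whose top-level fields equal every key-value pair in query."""
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value for key, value in query.items())
    ]


cs_students = find(students, {"major": "CS"})

# Documents serialize naturally to JSON, the text counterpart of the binary BSON format.
print(json.dumps(cs_students, indent=2))
```

The third document has no major field at all, yet it lives in the same collection without any schema change, which is the dynamic schema design described above.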

Required Work

Read Robinson, I., Webber, J., Eifrem, E. (2015). Graph Databases: New Opportunities for Connecting Data. 2nd Edition. O'Reilly Media, Inc. Published by Neo Technology, Inc.

Additional Resources

Example Code: Accessing MongoDB in R

Slide Deck & Data Sets

Slide Deck