Practicum 2
This assignment provides you with an opportunity to use MongoDB, a commonly used non-relational, document-based data store. This practicum requires several software installations, command-line servers, and API keys to be set up, so have patience. It also requires a 64-bit operating system such as a modern version of macOS or 64-bit Windows -- on Windows, be sure that you have installed and are running the 64-bit version of R. If your version of Windows is an (old) 32-bit version, you need to use a university computer. The reason is that you won't have enough virtual memory: 32-bit operating systems can address at most 4GB of memory (regardless of how much memory your computer has), of which Windows makes only 2GB available to programs, and you need more than 2GB, so you need 64-bit Windows. In addition, if your computer has only 4GB (or less) of RAM, your operating system will use your disk drive as additional (virtual) memory and your program will run *very* slowly. An adequate computer for data analytics should have a fast processor (2GHz+ i7 or similar) with a minimum of 32GB of RAM, running a modern 64-bit OS. This is one of the reasons that heavy data analytics is done on cloud-hosted virtual machines, where you can get that computing power inexpensively and on demand, e.g., Amazon AWS, Microsoft Azure, BlueHost, etc.
This is *not* a group practicum and must be completed individually.
Problem 1 (80 Points)
Follow this tutorial on MongoDB and build an R Notebook that implements all of the steps. Show that you are getting the same results. Note that the comments contain some notes on items that do not work in the original tutorial.
There are some new steps for using a Google API. In addition, you should avoid calling the API repeatedly, so save the imported data as an R Object.
Notes
- You will need to install the package data.table and then load it using library(data.table). Be sure to load it before lubridate, as it will otherwise mask key functions in lubridate, i.e., both packages export functions with the same names. Alternatively, prefix function names with the package name, e.g., lubridate:: -- this is referred to as scope resolution or using fully scoped names.
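A minimal sketch of the load order and of scope resolution (the date string is only an illustration of the format used in the crimes data):

```r
# Load data.table first so that lubridate, loaded second, wins
# for the overlapping functions (e.g., hour(), wday(), month())
library(data.table)
library(lubridate)

# Alternatively, disambiguate explicitly with fully scoped names,
# which work regardless of load order:
d <- lubridate::mdy_hms("03/18/2015 07:44:00 PM")
lubridate::hour(d)   # explicitly lubridate's hour()
```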
- Be sure you have the full path and correct file name for the CSV file in data.table::fread()
- The data file is over 1.5GB (and over 6.6 million observations), so it can take a while to load; in practice it is often worthwhile to sample the data into a smaller file and use that for development and initial analysis. That is one of the reasons why you'll want to put data into a database rather than a CSV or XML file: loading from a database is much faster, and you can selectively load data based on what you'll actually process and analyze.
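One way to cut the full file down for development might look like the following sketch (the file names and the 1% sample size are placeholders; adjust them to your setup):

```r
library(data.table)

# Read the full CSV once -- slow, since the file is over 1.5GB
crimes <- fread("Crimes_-_2001_to_present.csv")

# Keep a reproducible random 1% sample for development work
set.seed(1)
crimes.dev <- crimes[sample(.N, round(0.01 * .N))]

# Write the smaller file and develop against it instead
fwrite(crimes.dev, "crimes-sample.csv")
```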
- Make sure that the mongod background server daemon is running before you connect to the database; don't exit the terminal/shell from which you started it; try connecting from another terminal/shell using the mongo command, e.g., ./mongo --host 127.0.0.1:27017 (this assumes that you are in the bin directory of the MongoDB installation where all the MongoDB executables are located)
- Loading the data into the database takes significant time due to the size; again, that's something you'll only want to do once
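With mongod running, the connection and the one-time load might be sketched as follows (the database and collection names, and the crimes data frame, are illustrative, not prescribed by the assignment):

```r
library(mongolite)

# Connect to the local mongod instance; the database/collection
# are created automatically on the first insert
m <- mongo(collection = "crimes", db = "chicago",
           url = "mongodb://127.0.0.1:27017")

# One-time load of the data frame read with fread(); this can
# take significant time for 6.6M rows, so guard against redoing it
if (m$count() == 0) {
  m$insert(crimes)
}
m$count()   # verify how many documents were loaded
```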
- The data set often changes so your counts may be different than what's in the tutorial
- The latest version of the mongolite package appears to convert "TRUE" and "FALSE" into booleans, so you'll need to adjust the queries to use the booleans true and false rather than strings, e.g., "Domestic": true rather than "Domestic": "true"; if you still get errors, change '{"Domestic":"true"}' to '{"Domestic": true}'
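For example, with current mongolite versions a query for domestic incidents would use an unquoted boolean (this sketch assumes a connection object m as in the tutorial):

```r
# Old string form -- now returns no matches or errors:
# domestic <- m$find('{"Domestic": "true"}')

# Boolean form that works with current mongolite versions:
domestic <- m$find('{"Domestic": true}')
nrow(domestic)
```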
- If you have trouble connecting to mongod using "localhost" then it might be because it is not defined in your hosts file; either add it or use url="mongodb://MacBook-Pro.local" when calling the mongo function in R; replace the name with your computer's name
- There is a step that asks for an API key from Google. When trying to plot the Chicago map, this error will appear:
Error: Google now requires an API key. See ?register_google for details.
Google offers a free trial of API keys for Google Maps when you sign up for the Google Maps Platform. This tutorial provides instructions on how to obtain a key. Note that it appears that a student email will not work when trying to sign up for the program; you must use a separate Gmail account when registering. It will ask for credit card information, but it will not charge your card. Once you obtain a key, go into RStudio and enter this command into the console: register_google(key = 'Your API Key here') Once this has been submitted, run the code provided again and it should work.
- We suggest you save your map as an R object so you don't need to call the API again and again. The code for that is as follows:
saveRDS(YourMap, file = "NameYourFile.RData")
YourMap <- readRDS("NameYourFile.RData")
Keep in mind that you still need to run the function that fetches the map once beforehand, and then run those lines.
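Putting the pieces together, a cached-map pattern might look like this sketch (the key, file name, and get_map() arguments are placeholders; see ?ggmap::get_map and the tutorial for the exact call):

```r
library(ggmap)

register_google(key = "Your API Key here")

# Only call the Google API when we don't already have a saved copy
if (file.exists("chicago-map.rds")) {
  chicago.map <- readRDS("chicago-map.rds")
} else {
  chicago.map <- get_map(location = "chicago", zoom = 11)
  saveRDS(chicago.map, file = "chicago-map.rds")
}
```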
Problem 2 (20 Points)
Build one additional query (of your choice) to retrieve data from MongoDB into a data frame.
Submission Details
- Your submission must contain two files: the .Rmd notebook and a knitted PDF or HTML (from the notebook). Name your files using the pattern DA5020.P2.LastName.{Rmd,pdf,html}.
- The .Rmd file must contain fully commented and properly "chunked" R code and detailed explanations. Make sure that it is easy to recognize which question you are answering and that your code runs from beginning to end (because that is how we will test it). Code that doesn't execute, stops, or throws errors will -- naturally -- receive no points. If the graders have to "debug" your code or spend any effort getting it to run, substantial points will be deducted.
- Not submitting a knitted PDF or HTML will result in a reduction of 30 points.
- Not submitting the .Rmd file (or both) will result in a score of 0.
Data Files
- Crimes Data (from data.gov) -- you may need to use a proxy server if accessing it from outside the US; the data file is also updated often, so it may differ from the one used in the tutorial, and your results may be a bit different