Unit 01 | Introduction & Essentials | 4 + 3 hrs
Upon completion of this module, you will be able to:
- appreciate the data analysis workflow.
- list the main characteristics of working with "big data".
- install the R programming environment and the RStudio IDE.
- navigate RStudio.
- load data into R and explore it visually with a chart built in ggplot2.
Data, "Big Data", and Data Analytics | 1.25 hrs
Data drives decision-making in most organizations, which makes it an important asset and a source of competitive advantage. Data comes from many sources and in many formats, but rarely in a format that is conducive to analysis. The data scientist must therefore "munge" and "wrangle" the data to suit the desired analytical and visualization goals. In practice that means converting between file formats, filling in missing values, converting fields to appropriate data types, and storing the result in a data store suited to its intended purpose. Databases use different architectures to handle different types and amounts of data, so the data scientist must choose the database and storage architecture based on the data and how it will be used.
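The wrangling steps described above can be sketched in a few lines of base R. This is only an illustration: the data frame `raw`, its column names, and the choice to fill gaps with the column mean are all assumptions made for the example, not a prescribed method.

```r
# Hypothetical raw data: every field arrives as text, and some values are missing.
raw <- data.frame(
  id     = c("1", "2", "3", "4"),
  signup = c("2015-01-15", "2015-02-03", NA, "2015-03-22"),
  spend  = c("10.5", "", "7.25", "12"),
  stringsAsFactors = FALSE
)

clean <- raw

# Convert fields to appropriate types.
clean$id     <- as.integer(clean$id)
clean$signup <- as.Date(clean$signup, format = "%Y-%m-%d")  # NA stays NA

# Treat empty strings as missing, then convert the text to numbers.
clean$spend[clean$spend == ""] <- NA
clean$spend <- as.numeric(clean$spend)

# Fill in missing values; the column mean is one simple strategy among many.
clean$spend[is.na(clean$spend)] <- mean(clean$spend, na.rm = TRUE)

# Inspect the resulting structure and column types.
str(clean)
```

Whether a missing value should be filled, dropped, or left as `NA` depends on the analysis, so treat the mean imputation here as a placeholder for that decision.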
Required Work
Additional Resources
Key Resources for Data Science
Data science requires tools, skills, and background knowledge, so the practicing data scientist needs a toolbox to rely on. The blog post below is a list of lists spanning the key foundations: programming, R, statistics, visualization, and predictive analytics. Look them over to get an idea of what you need to learn (over time, not all at once, and not all in this course).
Smith, Jerry A (2013). Six Lists of Lists for Data Scientists. December 28, 2015.
Characteristics of "Big Data" | 30 min
Big Data is a relatively new term that describes data sets that are so large and complex that traditional methods of storing and processing them are not sufficient. The exact size above which a data set is "big" is not clearly defined and depends on the domain, industry, and analytical goals. While there is not a single definition, the general consensus is that Big Data is the integration of large amounts of multiple types of structured and unstructured data into a single data set that can be analyzed to gain insight and new understanding. Big Data can be understood through the six V's of volume, variety, velocity, veracity, validity, and volatility.
Required Work
Additional Resources
Data Analysis Pipeline and CRISP-DM | 30 min
Required Work
Exploring Data with R | 90 min
R is a large and rather old programming language, and it often provides several functions that do the same job. In other words, there are many programming paths to the same result, which can confuse new R programmers. To simplify the vast universe of R, we limit this course to the tidyverse (the "tidy universe"). This collection of packages offers a more consistent interface than the alternatives in R and provides all of the functionality we need for manipulating data.
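As a small illustration of "many paths to the same result", the sketch below computes the same summary twice, once with base R and once with the tidyverse, and then draws a simple chart. It assumes the dplyr and ggplot2 packages are installed and uses R's built-in `mtcars` data set; the choice of columns is arbitrary and only for demonstration.

```r
library(dplyr)
library(ggplot2)

# Base R: mean miles per gallon for each cylinder count.
base_result <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# tidyverse (dplyr): the same summary written as a pipeline.
tidy_result <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))

# Check that the two approaches agree.
all.equal(base_result$mpg, tidy_result$mpg)

# A first visual exploration with ggplot2: car weight against fuel economy.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
print(p)
```

Both versions are valid R. The pipeline form reads top to bottom as a sequence of steps, which is the style this course follows.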
Required Work
Additional Resources
Other Resources for Learning R
We will learn a lot of R in this course, and you will have numerous opportunities to hone your data processing and analysis skills in R. However, some of you might want additional resources, particularly if you haven't programmed much before. Fortunately, Northeastern University subscribes to Lynda's online training courses. If you need additional background in R programming, you are urged to take the course "Up and Running with R" by Barton Poulson on Lynda. Sign in to Lynda Online Training by going to lynda.northeastern.edu or through Blackboard, and log on with your myNEU username and password. Note that Google Chrome may not be fully compatible with Lynda, so you may need to switch to an alternate browser (Safari, Internet Explorer, or Firefox). You can also install the Lynda mobile app from the Lynda website or from the appropriate app store for your iOS or Android device. You will need to log in to the app with your myNEU credentials.
These tutorials by the kind folks at DataCamp are also wonderful and might help some of you:
- For basics in R: https://www.datacamp.com/tracks/r-programming
- For visualization in R using ggplot2: https://www.datacamp.com/courses/data-visualization-with-ggplot2-1