Assignment 6
This assignment provides you with an opportunity to practice SQL queries for retrieving data from relational databases. This is used in practice for two primary reasons. One, to collect data from databases sources - relational being the most common database architecture in use today. Second, eventually we will store cleaned data in an analytics store and there, again, relational databases are a common choice.
The assignment uses the SQLite database system, which is not truly a database server as it does not manage data on disk and does not provide multi-user access to the data. However, it is a fully functioning database that allows us to practice SQL queries. Nevertheless, SQLite is often used in practice for in-memory storage of relational data, particularly for iOS and Android apps. Database management systems are "systems software" so they generally require that you work with them, configure them, and install them through consoles or terminal shells -- this can be a bit technical and tricky so be sure to pay attention to the commands and have patience, particularly if you have never worked at the terminal level before.
The assignment uses the SQLite database system, which is not truly a database server as it does not manage data on disk and does not provide multi-user access to the data. However, it is a fully functioning database that allows us to practice SQL queries. Nevertheless, SQLite is often used in practice for in-memory storage of relational data, particularly for iOS and Android apps. Database management systems are "systems software" so they generally require that you work with them, configure them, and install them through consoles or terminal shells -- this can be a bit technical and tricky so be sure to pay attention to the commands and have patience, particularly if you have never worked at the terminal level before.
Problem 1 (80 Points)
Download this SQLite database and save it a local folder on your computer, then connect to that database in R. Once connected, formulate SQL queries for each of these questions. Show each result set to demonstrate that the correct data was retrieved.
- List the names and budgets of all projects.
- How many projects are there?
- List the names of all project managers.
- List all project names and the name of the project manager only, sorted in alphabetical order by project name.
- What is the total budget of all project managed by "John Wu"?
- How many projects have a budget of less than $50,000 but more than $10,000?
- List the name of each project manager, the number of projects they manage, and the total budget of their projects.
- List all projects that have a project name that starts with "W".
Problem 2 (20 Points)
This problem is completely done in the SQLite Console, not in R. Start by downloading this Customer Transaction CSV, then perform the following tasks:
- (4 pts) Create a table named transactions in SQLite that matches the CSV (column name and order must match). The columns are: Visits, Transactions, OS, Gender, Revenue.
- (4 pts) Import (load) the entire data from the CSV into that table using the SQLite .import command.
- (4 pts) Create a SQL query that finds the total revenue by OS.
- (4 pts) Create a SQL query that finds the largest revenue and for which OS and gender it occurred.
- (4 pts) Create a SQL query that finds the average number of visits by gender, excluding NA cases.
Copy the queries into a comment chunk in your R Notebook.
Submission Details
- Your submission must contain two files: the .Rmd notebook and a knitted PDF or HTML (from the notebook). Name your .Rmd R Notebook, DA5020.A6.LastName.Rmd and your PDF/HTML DA5020.A6.LastName.{pdf,html}, where LastName is *your* last name.
- The .Rmd file must be fully commented and properly "chunked" R code and detailed explanations. Make sure that it is easy to recognize which question you answer and that your code runs from beginning to end (because that is how we will test it.) Code that doesn't execute, stops, throws errors will receive -- naturally -- receive no points. If the graders have to "debug" your code or spend any effort getting it to run, substantial points will be deducted.
- Not submitting the knitted PDF or HTML will result in reduction of 30 points.
- Not submitting the .Rmd file (or both) will result in a score of 0.