Assignment 7 (Alternate)
This assignment requires that you collect data embedded in web pages and import it into R. You will scrape data from HTML, first with a toolkit and then directly in R, and identify search parameters in a URL. The objective of this assignment is to understand the structure of HTML pages and URL parameters.
Problem 1 (40 Points)
Use the import.io toolkit to extract data from Wikipedia.
- Go to the Wikipedia page that lists countries and their nominal GDP.
- Extract rank, country name, and GDP per capita (in US$) from the International Monetary Fund numbers.
- Export the result into a CSV.
Problem 2 (20 Points)
- Import the CSV you extracted into a data frame in R. Each row represents one country and its GDP.
- Calculate the average GDP.
- Calculate the standard deviation of the GDP.
- Calculate the interquartile range of the GDP.
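The steps in Problem 2 can be sketched as follows. This is a minimal illustration, not a reference solution: the file is a temporary stand-in, the column names Country and GDP are assumptions (your export will also contain the rank column and may use different headers), and the two rows reuse the UN example figures quoted in Problem 3 rather than the IMF values you will extract.

```r
# Hypothetical stand-in for the exported CSV. The real file name and column
# names depend on how the import.io extraction was set up.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,GDP",
  "United States,\"19,485,394\"",
  "Brazil,\"2,055,512\""
), csv_path)

# GDP values with thousands separators are read as text, not numbers.
gdp <- read.csv(csv_path, stringsAsFactors = FALSE)

# Remove everything except digits and the decimal point, then convert.
gdp$GDP <- as.numeric(gsub("[^0-9.]", "", gdp$GDP))

# Summary statistics; na.rm = TRUE skips rows that failed to convert.
gdp_mean <- mean(gdp$GDP, na.rm = TRUE)
gdp_sd   <- sd(gdp$GDP, na.rm = TRUE)
gdp_iqr  <- IQR(gdp$GDP, na.rm = TRUE)
```

Check the column types with str(gdp) after importing; if GDP is already numeric in your file, the gsub() step is harmless but unnecessary.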
Problem 3 (40 Points)
- Redo Problem 1, but use rvest or direct parsing of the HTML in R rather than an external toolkit, and extract the values reported by the UN rather than the IMF. You may use any parsing method in R to scrape the data.
- Calculate the same metrics as in Problem 2 and compare them to the IMF numbers from Problem 2.
- Add a new column labeled PctUSGDP to the scraped data frame that contains each country's GDP as a percentage of the US GDP. For example, as of this writing (the data on the website can change), the UN reports a US GDP of US$19,485,394 MM, where MM means millions. The GDP of Brazil is US$2,055,512 MM, which is 0.105 of the US figure, or 10.5%, so the new column should contain 10.5 for Brazil.
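One possible shape for Problem 3 is sketched below, assuming the rvest package is installed. To keep the sketch runnable offline, it parses a small inline HTML table instead of the live page; the table markup, the wikitable class selector, and the column names Country and GDP are assumptions, and the two rows are the example figures quoted above. For the assignment you would pass the Wikipedia URL to read_html() and inspect the page to find the table that holds the UN numbers.

```r
library(rvest)

# Inline stand-in for the downloaded page (assumed structure, not the real
# Wikipedia markup).
page <- read_html('
<table class="wikitable">
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>United States</td><td>19,485,394</td></tr>
  <tr><td>Brazil</td><td>2,055,512</td></tr>
</table>')

# Select the table node with a CSS selector and convert it to a data frame.
un <- as.data.frame(html_table(html_element(page, "table.wikitable")))

# Strip thousands separators and any other non-numeric characters.
un$GDP <- as.numeric(gsub("[^0-9.]", "", un$GDP))

# Each country's GDP as a percentage of the US GDP, rounded to one decimal.
us_gdp <- un$GDP[un$Country == "United States"]
un$PctUSGDP <- round(100 * un$GDP / us_gdp, 1)
```

With these two rows, PctUSGDP is 100 for the United States and 10.5 for Brazil, matching the worked example. On the real page, cells may carry footnote markers or other annotations, so inspect the raw strings before converting them.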
Submission Details
- For Problem 1, submit a PDF that contains screenshots of each step and shows both the extraction and the result. Name the PDF file DA5020.LastName.A7-1.pdf.
- For Problems 2 and 3, submit the .Rmd file with fully commented R code and detailed explanations, along with a knitted PDF or HTML. Name the files DA5020.LastName.A7.Rmd and DA5020.LastName.A7.pdf (or DA5020.LastName.A7.html). Make sure that it is easy to recognize which question you are answering.
- Not submitting a knitted PDF or HTML will result in a reduction of 30 points.
- Not submitting the .Rmd file (or submitting neither file) will result in a score of 0.