Assignment 7
This assignment requires that you collect embedded data from various websites and import them into R. You will scrape data from HTML through a toolkit and identify search parameters through a URL. The objective of this assignment is to understand the structure of HTML pages and URL parameters.
As an alternative, you may do this Assignment 7 instead.
Problem 1 (40 Points)
Use the import.io toolkit to extract data from the Yelp website. In particular, create a search in Yelp to find good burger restaurants in the Boston area.
- Start at the website https://www.yelp.com/boston and create a search for Burgers.
- Use the search filters to limit Boston neighborhoods to Allston, Brighton, Back Bay, Beacon Hill, Downtown Area, Fenway, South End, and West End.
- Note the URL format in your browser’s location bar. Copy and save the URL somewhere safe (for example, paste it into a sticky note or an editor); you will need it later. You want to extract the first three pages of the search results. For each page, note how the URL changes and save the updated URL as well.
- Extract information about restaurants appearing in the search results, including their name, address, service categories, and review count. Do not scrape “Ad” items.
- Export the result into a CSV.
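The URL changes noted in the steps above can also be inspected programmatically. The following is a minimal sketch in base R that splits a search URL into its query parameters. The example URL and the parameter names (`find_desc`, `find_loc`, `start`) are assumptions for illustration only; substitute the URLs you actually saved from your browser, which may use different parameters.

```r
# Split a URL's query string into a named character vector of parameters.
parse_query <- function(url) {
  # Keep only the part after the first "?"
  query <- sub("^[^?]*\\?", "", url)
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  # Separate each "key=value" pair at the first "="
  keys <- sub("=.*$", "", pairs)
  values <- sub("^[^=]*=", "", pairs)
  # "+" encodes a space in query strings; convert it before percent-decoding
  values <- gsub("+", " ", values, fixed = TRUE)
  values <- vapply(values, URLdecode, character(1), USE.NAMES = FALSE)
  setNames(values, keys)
}

# Hypothetical example URL; the real one may use different parameters.
example_url <- "https://www.yelp.com/search?find_desc=Burgers&find_loc=Boston%2C+MA&start=10"
params <- parse_query(example_url)
print(params)
```

Comparing the parsed parameters of the three saved URLs side by side should make it clear which parameter controls paging and which ones encode the search term and the neighborhood filters.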
Problem 2 (20 Points)
- Import the CSV you extracted into a data frame in R. Each row represents a burger restaurant in Boston.
- Calculate the average number of reviews for all restaurants.
- Which restaurant has the fewest reviews?
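One possible way to approach these steps is sketched below in base R. The column names (`name`, `address`, `categories`, `review_count`) and the three rows are made up so that the example runs on its own; your exported CSV will have its own column names and real values, and the review count may need cleaning before it can be treated as a number.

```r
# Made-up rows standing in for the exported CSV (not real Yelp data).
csv_text <- "name,address,categories,review_count
Burger A,1 Example St,Burgers,120
Burger B,2 Example Ave,\"Burgers, Bars\",45
Burger C,3 Example Rd,Burgers,300"

# Write the sample to a temporary file so read.csv() is used as it would be
# on the real export.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_path)

restaurants <- read.csv(csv_path, stringsAsFactors = FALSE)

# Average review count; na.rm = TRUE skips rows where the count is missing.
avg_reviews <- mean(restaurants$review_count, na.rm = TRUE)

# Rows with the minimum review count (keeps ties, unlike which.min()).
min_reviews <- min(restaurants$review_count, na.rm = TRUE)
fewest <- restaurants[which(restaurants$review_count == min_reviews), ]

print(avg_reviews)
print(fewest$name)
```

Comparing against the minimum rather than calling `which.min()` is a deliberate choice here: if several restaurants share the lowest review count, all of them are returned.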
Problem 3 (40 Points)
Redo Problem 1, but use rvest or direct parsing of the HTML in R rather than an external toolkit. As in Problem 1, save your result as a CSV. Make sure it matches the CSV from Problem 1. You may use any parsing method in R to solve this problem and scrape the data.
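As a rough illustration of the rvest approach, the sketch below parses a small inline HTML snippet instead of a live page, so it runs without network access (the rvest package must be installed). The markup, the class names (`result`, `ad`, `name`, `address`, `reviews`), and the output file name are all hypothetical; the real search-result pages use different tags and classes, which you will need to find by inspecting the page source.

```r
library(rvest)

# Hypothetical markup: real search-result pages use different tags and classes.
page <- minimal_html('
  <div class="result">
    <a class="name">Burger A</a>
    <span class="address">1 Example St</span>
    <span class="reviews">120 reviews</span>
  </div>
  <div class="result ad">
    <a class="name">Sponsored Burger</a>
    <span class="address">9 Example Blvd</span>
    <span class="reviews">15 reviews</span>
  </div>
  <div class="result">
    <a class="name">Burger B</a>
    <span class="address">2 Example Ave</span>
    <span class="reviews">45 reviews</span>
  </div>
')

# Select one node per non-ad result. Calling html_element() on that node set
# returns exactly one match (or NA) per result, which keeps columns aligned.
results <- html_elements(page, "div.result:not(.ad)")

review_text <- html_text2(html_element(results, ".reviews"))

restaurants <- data.frame(
  name = html_text2(html_element(results, ".name")),
  address = html_text2(html_element(results, ".address")),
  # Strip everything except digits, e.g. "120 reviews" becomes 120
  review_count = as.integer(gsub("[^0-9]", "", review_text)),
  stringsAsFactors = FALSE
)

# Written to a temporary directory here; use your own path for the submission.
out_path <- file.path(tempdir(), "burgers_rvest.csv")
write.csv(restaurants, out_path, row.names = FALSE)
```

For a live page, `minimal_html(...)` would be replaced by `read_html(url)` with one of the URLs saved in Problem 1, and the same extraction would be repeated for each of the three result pages before combining the rows.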
Submission Details
- For Problem 1, submit a PDF that contains screenshots of each step and that shows the extraction plus the result. Name the PDF file DA5020.LastName.A7-1.pdf.
- For Problems 2 and 3, submit the .Rmd file with fully commented R code and detailed explanations, together with a knitted PDF or HTML. Name the files with the pattern DA5020.LastName.A7.{Rmd,[pdf,html]}. Make sure that it is easy to recognize which question you are answering.
- Not submitting a knitted PDF or HTML will result in a reduction of 30 points.
- Not submitting the .Rmd file (or submitting neither file) will result in a score of 0.