Unit 07 | Collecting Data from Web Scraping & APIs |
| 5 + 3 hrs |
Upon completion of this module, you will be able to:
- use scraping toolkits to collect embedded data from websites
- export scraped data as CSV and XML
- implement custom scrapers in R using rvest
- connect to Web APIs to collect data
- perform simple text analytics
Web Scraping Concepts |
90 min
|
Data is not always neatly available as a downloadable CSV (or similar) file. Much third-party and local government data is available only by viewing a web page. A data scientist might first check whether the site offers a web API, but many such sites do not. However, any content that can be viewed can be "scraped" from the page. This lesson reviews some of the automated scraping platforms available (e.g., Kimono, import.io, and a Google Chrome add-on available through the Chrome Store).
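To make the idea of "scraping" concrete, here is a minimal sketch of what a scraper does: it parses the HTML of a page, pulls out the embedded table cells, and exports them as CSV. It is written in Python using only the standard library so that it runs without any extra packages; the lessons in this unit use R and rvest instead. The sample HTML, the `TableScraper` class, and the helper function names are illustrative assumptions, not part of any course material or scraping platform.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would first download this HTML.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Permits</th></tr>
  <tr><td>Springfield</td><td>120</td></tr>
  <tr><td>Shelbyville</td><td>85</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table in `html` as a list of rows (lists of strings)."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize scraped rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)))
```

Automated scraping platforms perform the same two steps (locate the repeated elements, then export them) behind a point-and-click interface.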
|
Required Work
Additional Resources
Slide Deck & Data Sets |
Scraping Web Pages in R with rvest |
60 min
|
|
Required Work
Additional Resources |
Getting data from Web APIs |
30 min
|
Traditionally, Web Services built on SOAP and XML were the standard way of creating connected web applications. SOAP is a standard XML-based protocol that typically communicates over HTTP. We can think of SOAP as a message format for sending messages between applications using XML. It is independent of technology and platform, and it is extensible. SOAP offered an effective way of transferring data between applications, but its problem was that a lot of metadata also has to be transferred with each request and response. This extra information describes the capabilities of the service and the data coming from the server, and it makes the payload heavy even for small amounts of data.
REST is intended to address this weakness of traditional SOAP-based web services. REST stands for Representational State Transfer. It is an architectural style, rather than a protocol, for exchanging data in a distributed environment. The main idea behind REST is that we should treat our distributed services as resources and use standard HTTP methods to perform operations on those resources. When we talk about a database as a resource, we usually talk in terms of CRUD operations, i.e., Create, Retrieve, Update, and Delete. The philosophy of REST is that all of these operations should be possible for a remote resource, and that they should be possible using standard HTTP methods.
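The correspondence between CRUD operations and HTTP can be sketched as follows. By common convention, Create maps to POST, Retrieve to GET, Update to PUT (or PATCH for partial updates), and Delete to DELETE. The sketch below is in Python using only the standard library, and it only builds the requests without sending them, so it runs offline. The endpoint `https://api.example.com/books`, the `build_request` helper, and the JSON payloads are hypothetical and chosen purely for illustration.

```python
import json
import urllib.request

# Conventional mapping of CRUD operations to HTTP methods.
CRUD_TO_HTTP = {
    "create": "POST",
    "retrieve": "GET",
    "update": "PUT",
    "delete": "DELETE",
}

# Hypothetical REST endpoint for a collection of "book" resources.
BASE_URL = "https://api.example.com/books"


def build_request(operation, resource_id=None, payload=None):
    """Build (but do not send) the HTTP request for a CRUD operation."""
    method = CRUD_TO_HTTP[operation]
    # A single resource is addressed by appending its id to the collection URL.
    url = BASE_URL if resource_id is None else f"{BASE_URL}/{resource_id}"
    headers = {"Accept": "application/json"}
    data = None
    if payload is not None:
        # The request body carries the resource representation as JSON.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


if __name__ == "__main__":
    examples = [
        ("create", None, {"title": "R for Data Science"}),
        ("retrieve", 42, None),
        ("update", 42, {"title": "R for Data Science, 2nd ed."}),
        ("delete", 42, None),
    ]
    for operation, resource_id, payload in examples:
        request = build_request(operation, resource_id, payload)
        print(operation, "->", request.get_method(), request.full_url)
```

Passing one of these request objects to `urllib.request.urlopen` would send it. Note that the payload is just the resource itself, without the envelope and service metadata that a SOAP message carries.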
|
Required Work
Additional Resources
Sample Code |
Case Study: Web Scraping in R for Text Analytics |
90 min
|
|
Required Work
Advanced Readings
Data Sets
|