For practice purposes only.
Mini Project 1
Practicums provide you with an opportunity to dive deeper into a data analytics problem. In this practicum you will practice data loading, parsing, shaping, and exploration. This is more than an assignment and requires creative problem solving and patience, skills that are essential to data analytics. You will work with a sample of MEDLINE/PubMed publication data from the National Institutes of Health (NIH) that is provided in XML. The actual database is vast, so we will work with an excerpt, but your solution should scale to the entire database.
It is anticipated that the average student will spend 8-12 hours on this practicum. You will likely struggle: have patience and solve small (partial) problems before trying to solve the entire problem. Work slowly and deliberately. Use this mini-project as an opportunity to practice "coding in the wild" without the guardrails of an assignment. This is very close to an actual problem you might encounter in a data science or data engineering position.
This is a group practicum, which means that you may (but do not have to) work in groups of up to three students. You may fully collaborate and submit the same work. However, you must list all group members' names on all submitted work. If a group member is not adequately contributing, the remaining team members may "vote to eject" the student from the team by emailing me the reason. In such an event, the team member who was "fired" must still complete the project individually by the due date.
Practicum Tasks
- (0 pts) Download the PubMed excerpt data set (XML). Load the XML file into a browser or text editing tool and inspect it. Explore the data set as you see fit to get a sense of the data and become comfortable with it.
- (25 pts) Load the data into R and create two linked tibbles: one for publications and one for journals. Use ISSN as the key to link them. Only load the following information into the publication tibble: PMID (primary key for publication), ISSN and publication year (foreign key for journal), date completed (as one date field), date revised (as one date field), number of authors (a derived/calculated field from the authors), publication type, and title of article. Load this information into the journal tibble: ISSN (primary key), medium (from the CitedMedium attribute), publication year (primary key), publication season, language, and journal title. In cases where there are multiple languages for a publication, pick the first language. Same for publication type: pick the first one. The primary key for journal is (ISSN, publication year). Also, exclude any journals that do not have an ISSN, as the primary key cannot be empty.
- (5 pts) Create a line graph of the number of publications per year from 2000 to 2015.
- (20 pts) Find the articles that had fewer than three authors and list the article, journal, and publication date.
- (10 pts) Find the average number of authors for articles. Display a single number.
- (10 pts) What is the average time period (in days) between date completed and date revised? Display the time elapsed in days. Only consider cases where the difference is a positive number.
- (10 pts) Which articles published in PubMed were not written in English? Only consider the first language of publication.
- (20 pts) Using the XML data (not the tibbles created above), find the articles containing any of the words "drug resistance" or "virus" in any capitalization in the title. Note that drug resistance could be spelled as "drug resistance" or "drug-resistance" or "drug resistant" or "drug resistent" -- use regular expressions to deal with the variations.
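Once the two tibbles exist, the analysis tasks above reduce to standard dplyr/ggplot2 operations. The following is a minimal sketch, not a complete solution; the column names (`pub_year`, `num_authors`, `date_completed`, `date_revised`, `issn`) are illustrative assumptions and must match whatever names you chose when building your tibbles.

```r
# Sketch of the analysis tasks; column names are assumptions.
library(dplyr)
library(ggplot2)

# Line graph of publications per year, 2000-2015
publication %>%
  filter(pub_year >= 2000, pub_year <= 2015) %>%
  count(pub_year) %>%
  ggplot(aes(x = pub_year, y = n)) +
  geom_line()

# Articles with fewer than three authors, joined to their journal
publication %>%
  filter(num_authors < 3) %>%
  inner_join(journal, by = c("issn", "pub_year"))

# Average number of authors across all articles (single number)
mean(publication$num_authors, na.rm = TRUE)

# Average days between date completed and date revised (positive gaps only)
publication %>%
  mutate(gap = as.numeric(date_revised - date_completed)) %>%
  filter(gap > 0) %>%
  summarize(avg_days = mean(gap))

# Regex covering the title-search variants (case-insensitive):
# "drug resistance", "drug-resistance", "drug resistant", "drug resistent"
pattern <- "drug[ -]resist(ance|ant|ent)|virus"
grepl(pattern, titles, ignore.case = TRUE)
```

Note that the date difference only works if `date_completed` and `date_revised` were stored as `Date` objects (e.g. via `as.Date()`) rather than strings.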
Submission Details
- Your submission must contain two files: the .Rmd notebook and a knitted PDF or HTML (from the notebook). Name your .Rmd R Notebook DA5020.P1.LastName.Rmd, your PDF DA5020.P1.LastName.pdf, and your HTML DA5020.P1.LastName.html. If you are producing an HTML instead of a PDF, be sure to ZIP the HTML file, as Blackboard does not allow uploading of HTML (or it will "mangle" it and it won't be viewable).
- The .Rmd file must contain fully commented and properly "chunked" R code with detailed explanations -- each chunk should be a step in your analysis.
- Make sure that it is easy to recognize which question each part of your notebook answers.
- Ensure that your code runs from beginning to end (because that is how we will test it). Code that doesn't execute, stops, or throws errors will -- naturally -- receive no points. If the graders have to "debug" your code or spend any effort getting it to run, substantial points will be deducted.
- Not submitting a knitted PDF or HTML will result in reduction of 30 points.
- Not submitting the .Rmd file (or both) will result in a score of 0.
Useful Resources
Hints
- If you get errors such as "Input is not proper UTF-8" or other errors loading your XML file, then you likely did not specify the correct path or URL; the error is misleading, as it's actually an issue with the file not being found rather than the file containing incorrect content. Another common issue is that your URL might use SSL by starting with https:// -- change it to http://. Alternatively, download the file and load it locally before switching back to an http load.
- xmlToDataFrame() will not work for this problem. You need to parse the individual fields yourself, using XPath with the XML package's parsing functions.
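As a starting point for the XPath hint above, here is a hedged sketch using the XML package. The file name is a placeholder, and the XPath expressions assume standard MEDLINE element names (PubmedArticle, MedlineCitation, PMID, Journal/ISSN, AuthorList/Author); verify them against the actual excerpt before relying on them.

```r
# Sketch only: file name and element names are assumptions to verify.
library(XML)

doc <- xmlParse("pubmed-sample.xml")  # placeholder path

# One value per article via XPath
pmids <- xpathSApply(doc, "//PubmedArticle/MedlineCitation/PMID", xmlValue)
issns <- xpathSApply(doc, "//PubmedArticle//Journal/ISSN", xmlValue)

# Derived field: author count. Iterate per article node so that
# articles with missing or empty AuthorList are counted as zero
# instead of being silently dropped.
articles <- getNodeSet(doc, "//PubmedArticle")
num_authors <- sapply(articles, function(a) {
  length(getNodeSet(a, ".//AuthorList/Author"))
})
```

The per-article iteration matters: a single global XPath query for authors returns one flat vector, which loses the correspondence between articles and their author counts whenever a field is absent.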