Practicum 3
This practicum has two goals: first, to implement the kNN algorithm yourself and practice R; second, to see how kNN can be applied to predict a continuous variable. We will use the same data set as in a previous assignment and then compare the two algorithms.
This is a group practicum, which means that you may (but do not have to) work in groups of up to three students. You may fully collaborate and submit the same work. However, you must put the names of all group members on all submitted work. If a group member is not adequately contributing, the remaining team members may "vote to eject" the student from the team by emailing me the reason. In such an event, the team member who was "fired" must still complete the project individually by the due date.
Problem (95 Points)
- (0 Pts) Load the data set on franchise sales. The variables in the data set are: NetSales = net sales in $1000s for a franchise; StoreSize = size of store in 1000s of square feet; InvValue = inventory value in $1000s; AdvBudget = advertising budget in $1000s; DistrictSize = number of households in the sales district in 1000s; NumComp = number of competing stores in the sales district. Do you detect any multicollinearity that would affect the construction of a multiple regression model?
- (20 Pts) Normalize all columns, except NetSales, using z-score standardization.
- (50 Pts) Implement the k-NN algorithm in R (do not use an implementation of k-NN from a package); write a function called kNN_predict(data, y, x, k) that takes a data set of predictor variables, a set of NetSales values, a new set of values for the variables, and a k, and returns a prediction. To predict a continuous variable, you need to calculate the distances of x to all observations in data, then take the k closest cases and average the NetSales values for those cases. That average is your prediction.
- (10 Pts) Use your algorithm with k = 3 to predict the net sales of a store with the following values for the variables, in order: (4.2, 601, 7.8, 14.2, 6). Compare that prediction to the one you obtained in Assignment 10.
- (15 Pts) Calculate the mean squared error (MSE) for kNN by predicting NetSales for each case in the data set and comparing the prediction to the actual observation. Compare the MSE to the MSE you calculated in Assignment 10 and comment on the difference. Which model is better?
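The standardization, prediction, and MSE steps above can be sketched roughly as follows. This is a minimal illustration rather than a reference solution: the `toy` data frame and its values are invented (the real franchise data has five predictors), Euclidean distance is an assumption because the text does not name a distance measure, and the function is spelled `kNN_predict` because a hyphen is not valid in an R identifier.

```r
# Toy data, invented purely for illustration; NOT the franchise data set.
toy <- data.frame(
  NetSales  = c(230, 150, 300, 120, 260, 180),
  StoreSize = c(3.0, 2.2, 4.3, 1.5, 3.6, 2.6),
  AdvBudget = c(9.0, 6.5, 11.0, 4.0, 10.0, 7.0)
)

predictors <- toy[, c("StoreSize", "AdvBudget")]
y <- toy$NetSales

# z-score standardization of every column except NetSales.
# The means and SDs are kept so a new case can be put on the same scale.
mu    <- colMeans(predictors)
sigma <- sapply(predictors, sd)
z     <- as.data.frame(scale(predictors, center = mu, scale = sigma))

# k-NN regression: average the y values of the k closest cases.
# Euclidean distance is assumed here.
kNN_predict <- function(data, y, x, k) {
  m <- as.matrix(data)
  x <- as.numeric(unlist(x))                 # accepts a vector or a one-row data frame
  d <- sqrt(rowSums(sweep(m, 2, x)^2))       # distance from x to every row of data
  nearest <- order(d)[seq_len(k)]            # indices of the k smallest distances
  mean(y[nearest])
}

# A hypothetical new store, standardized with the training means and SDs.
new_raw <- c(StoreSize = 3.1, AdvBudget = 9.2)
new_z   <- (new_raw - mu) / sigma
pred    <- kNN_predict(z, y, new_z, k = 3)

# MSE over the data set: predict every case, then compare with the observed value.
fitted <- sapply(seq_len(nrow(z)), function(i) kNN_predict(z, y, z[i, ], k = 3))
mse    <- mean((y - fitted)^2)

print(pred)
print(mse)
```

Note that the new case is standardized with the means and standard deviations of the original data, not with statistics of its own; otherwise its distances to the standardized cases would be on a different scale.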
Problem (15 Points)
- (10 Pts) Determine an optimal k by trying all values from 1 through 7 for your own k-NN algorithm implementation against the cases in the entire data set (if we had a larger data set, we would split it into training and validation data to avoid overfitting). What is the optimal k, i.e., the k that results in the best accuracy as measured by the smallest MSE?
- (5 Pts) Create a plot of k (x-axis) versus MSE using ggplot.
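One possible shape for the search over k and the plot is sketched below, again on invented toy data and with Euclidean distance as an assumption. The helper `mse_for_k` and its `loo` argument are hypothetical names introduced for this sketch. One thing worth noticing: when each case stays in its own neighbor pool, its nearest neighbor is itself, so k = 1 reproduces every observed value and gives an MSE of 0 (barring duplicate predictor rows). Setting `loo = TRUE` removes the case before predicting it, which is a leave-one-out variant and not something the assignment text asks for.

```r
# Toy data, invented purely for illustration; NOT the franchise data set.
toy <- data.frame(
  NetSales  = c(230, 150, 300, 120, 260, 180, 210, 330, 140, 275),
  StoreSize = c(3.0, 2.2, 4.3, 1.5, 3.6, 2.6, 2.9, 4.8, 1.9, 3.8),
  AdvBudget = c(9.0, 6.5, 11.0, 4.0, 10.0, 7.0, 8.0, 12.5, 5.5, 10.5)
)
z <- scale(toy[, c("StoreSize", "AdvBudget")])   # z-score standardization
y <- toy$NetSales

# Same k-NN regression sketch as before (Euclidean distance assumed).
kNN_predict <- function(data, y, x, k) {
  x <- as.numeric(unlist(x))
  d <- sqrt(rowSums(sweep(as.matrix(data), 2, x)^2))
  mean(y[order(d)[seq_len(k)]])
}

# MSE for one value of k (hypothetical helper).
# loo = FALSE: each case is scored against the entire data set, itself included.
# loo = TRUE:  each case is removed before it is predicted (leave-one-out).
mse_for_k <- function(k, loo = FALSE) {
  preds <- sapply(seq_len(nrow(z)), function(i) {
    if (loo) {
      kNN_predict(z[-i, , drop = FALSE], y[-i], z[i, ], k)
    } else {
      kNN_predict(z, y, z[i, ], k)
    }
  })
  mean((y - preds)^2)
}

results     <- data.frame(k = 1:7)
results$mse <- sapply(results$k, mse_for_k)
best_k      <- results$k[which.min(results$mse)]
print(results)
print(best_k)

# Plot k (x-axis) versus MSE; skipped if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(results, aes(x = k, y = mse)) +
    geom_line() +
    geom_point() +
    scale_x_continuous(breaks = 1:7) +
    labs(x = "k", y = "MSE")
  print(p)
}
```

Calling `sapply(1:7, mse_for_k, loo = TRUE)` gives the leave-one-out curve for comparison.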
Submission Details
- Graded out of 100, so the maximum score is 110/100, which means a 10% bonus is added to the practicum average.
- Your submission must contain two files: the .Rmd notebook and a knitted PDF or HTML (from the notebook). Name your files with the pattern DA5020.P3.LastNames.{Rmd,[pdf,html]} where LastNames are the last names of your group members. Put the names of the group members into the submission comments and into the R notebook.
- The .Rmd file must contain fully commented and properly "chunked" R code and detailed explanations. Make sure that it is easy to recognize which question you are answering and that your code runs from beginning to end (because that is how we will test it). Code that doesn't execute, stops, or throws errors will, naturally, receive no points. If the graders have to "debug" your code or spend any effort getting it to run, substantial points will be deducted.
- Not submitting a knitted PDF or HTML will result in a reduction of 30 points.
- Not submitting the .Rmd file (or both) will result in a score of 0.
- All members of a group should submit the same files.
- Put the names of all group members into the submission comments. If anyone was dismissed from the group or submitted a different file, be sure to explain in the submission comments.
- No late submissions will be accepted. Submit early and often in case of computer, network, or Blackboard glitches.
Useful Resources