Class 9: Candy Mini-Project

Author

Brian Wong (PID: A18639001)

Background

In this mini-project, you will explore FiveThirtyEight’s Halloween Candy dataset.

We will use lots of ggplot, some basic stats, correlation analysis, and PCA to make sense of the landscape of US candy, something hopefully more relatable than the proteomics and transcriptomics work that we will use these methods on throughout the rest of the course.

Importing Candy Data

Our dataset is a CSV file, so we read it with read.csv(). Setting row.names=1 uses the first column (the candy names) as row names.

candy_file <- "candy-data.csv"

candy = read.csv(candy_file, row.names=1)
head(candy)

             chocolate fruity caramel peanutyalmondy nougat crispedricewafer
100 Grand            1      0       1              0      0                1
3 Musketeers         1      0       0              0      1                0
One dime             0      0       0              0      0                0
One quarter          0      0       0              0      0                0
Air Heads            0      1       0              0      0                0
Almond Joy           1      0       0              1      0                0
             hard bar pluribus sugarpercent pricepercent winpercent
100 Grand       0   1        0        0.732        0.860   66.97173
3 Musketeers    0   1        0        0.604        0.511   67.60294
One dime        0   0        0        0.011        0.116   32.26109
One quarter     0   0        0        0.011        0.511   46.11650
Air Heads       0   0        0        0.906        0.511   52.34146
Almond Joy      0   1        0        0.465        0.767   50.34755

What is in the dataset?

Q1. How many different candy types are in this dataset?

nrow(candy)

[1] 85

Q2. How many fruity candy types are in the dataset?

sum(candy$fruity)

[1] 38
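The same sum() idea extends to every 0/1 column at once. The following is a minimal sketch, assuming a small hand-typed subset of the rows printed by head(candy) above; the `toy` data frame is illustrative only and is not the full dataset.

```r
# Hypothetical mini data frame: six rows copied from the head(candy) output
toy <- data.frame(
  chocolate = c(1, 1, 0, 0, 0, 1),
  fruity    = c(0, 0, 0, 0, 1, 0),
  caramel   = c(1, 0, 0, 0, 0, 0),
  bar       = c(1, 1, 0, 0, 0, 1),
  row.names = c("100 Grand", "3 Musketeers", "One dime",
                "One quarter", "Air Heads", "Almond Joy")
)

# Each flag is 0 or 1, so a column sum counts the candies with that trait
flag_counts <- colSums(toy)
flag_counts

# The column mean is the corresponding proportion of candies
flag_props <- colMeans(toy)
flag_props
```

On the real data, colSums() would only be meaningful for the nine 0/1 columns, not for sugarpercent, pricepercent, or winpercent.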

What is your favorite candy?

candy["Twix", ]$winpercent

[1] 81.64291

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

candy |> 
  filter(row.names(candy)=="Twix") |> 
  select(winpercent)

     winpercent
Twix   81.64291

Q3. What is your favorite candy (other than Twix) in the dataset and what is its winpercent value?

Hershey’s Kisses’ winpercent value is 55.37545

candy["Hershey's Kisses", ]$winpercent

[1] 55.37545

Q4. What is the winpercent value for “Kit Kat”?

Kit Kat’s winpercent value is 76.7686

candy["Kit Kat", ]$winpercent

[1] 76.7686

Q5. What is the winpercent value for “Tootsie Roll Snack Bars”?

Tootsie Roll Snack Bars’ winpercent value is 49.6535

candy["Tootsie Roll Snack Bars", ]$winpercent

[1] 49.6535
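The three lookups above can also be done in one step by indexing with a vector of row names. This is a minimal sketch, assuming a hypothetical `toy` data frame that holds only the winpercent values already printed above.

```r
# Hypothetical subset: winpercent values copied from the outputs above
toy <- data.frame(
  winpercent = c(81.64291, 55.37545, 76.7686, 49.6535),
  row.names  = c("Twix", "Hershey's Kisses", "Kit Kat",
                 "Tootsie Roll Snack Bars")
)

# Index by a character vector of row names; drop = FALSE keeps a data frame
# (and therefore the candy names) instead of collapsing to a bare vector
picks <- c("Kit Kat", "Tootsie Roll Snack Bars")
res <- toy[picks, "winpercent", drop = FALSE]
res
```

Keeping the result as a data frame means each value stays paired with its candy name, which is easier to read than three separate [1] outputs.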

library("skimr")
skim(candy)

Data summary
Name	candy
Number of rows	85
Number of columns	12
_______________________
Column type frequency:
numeric	12
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
chocolate	1	0.44	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
fruity	1	0.45	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
caramel	1	0.16	0.37	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
peanutyalmondy	1	0.16	0.37	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
nougat	1	0.08	0.28	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
crispedricewafer	1	0.08	0.28	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
hard	1	0.18	0.38	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
bar	1	0.25	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
pluribus	1	0.52	0.50	0.00	0.00	1.00	1.00	1.00	▇▁▁▁▇
sugarpercent	1	0.48	0.28	0.01	0.22	0.47	0.73	0.99	▇▇▇▇▆
pricepercent	1	0.47	0.29	0.01	0.26	0.47	0.65	0.98	▇▇▇▇▆
winpercent	1	50.32	14.71	22.45	39.14	47.83	59.86	84.18	▃▇▆▅▂

Q6. Is there any variable/column that looks to be on a different scale to the majority of the other columns in the dataset?

“winpercent” is on a different scale from the other columns. The others all lie between 0 and 1, while winpercent is a percentage that ranges from about 22 to 84 in this dataset.
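One quick way to confirm the scale difference without reading the whole skim() table is to compute the range of each column. A minimal sketch, assuming a hypothetical `toy` data frame built from the six rows shown by head(candy):

```r
# Hypothetical subset: the three continuous columns from head(candy)
toy <- data.frame(
  sugarpercent = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  pricepercent = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767),
  winpercent   = c(66.97173, 67.60294, 32.26109,
                   46.11650, 52.34146, 50.34755)
)

# sapply() applies range() to every column; the result is a 2 x 3 matrix
ranges <- sapply(toy, range)
rownames(ranges) <- c("min", "max")
ranges

# Columns whose maximum exceeds 1 are not on the 0-1 scale
off_scale <- names(which(ranges["max", ] > 1))
off_scale
```

This difference in scale is the reason prcomp() is called with scale = TRUE in the PCA section.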

Q7. What do you think a zero and one represent for the candy$chocolate column?

In the chocolate column, a zero means the candy is not classified as chocolate, while a one means the candy is classified as a chocolate type of candy.

Exploratory Analysis

Q8. Plot a histogram of winpercent values

hist(candy$winpercent)

library(ggplot2)

ggplot(candy, aes(winpercent)) + geom_histogram(binwidth = 5)

Q9. Is the distribution of winpercent values symmetrical?

No, the distribution of winpercent values is not symmetrical. It is skewed to the right, with a longer tail toward high winpercent values.

Q10. Is the center of the distribution above or below 50%?

summary(candy$winpercent)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.45   39.14   47.83   50.32   59.86   84.18

The center of the distribution depends on what you look at. If you look at the mean, it is 50.32%, which is above the threshold. If you look at the median, it is 47.83%, which is below 50%.
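The gap between mean and median is itself informative: when the mean sits above the median, a few large values are pulling it upward. A minimal sketch with made-up numbers (the vector `x` is hypothetical and is not taken from the candy data):

```r
# Hypothetical win percentages, chosen to have a long upper tail
x <- c(22, 35, 40, 45, 48, 52, 60, 84)

# Two common measures of centre
centre <- c(mean = mean(x), median = median(x))
centre

# A mean above the median is a quick, informal hint of right skew
right_skewed <- centre[["mean"]] > centre[["median"]]
right_skewed
```

For skewed data like this, the median is usually the safer description of a "typical" value.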

Q11. On average is chocolate candy higher or lower ranked than fruit candy?

mean(candy$winpercent[as.logical(candy$chocolate)]) > 
  mean(candy$winpercent[as.logical(candy$fruity)])

[1] TRUE

On average, chocolate candy is higher ranked than fruity candy.

Q12. Is this difference statistically significant?

chocolate <- candy$winpercent[as.logical(candy$chocolate)]
fruity <- candy$winpercent[as.logical(candy$fruity)]

t.test(chocolate, fruity)


    Welch Two Sample t-test

data:  chocolate and fruity
t = 6.2582, df = 68.882, p-value = 2.871e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 11.44563 22.15795
sample estimates:
mean of x mean of y 
 60.92153  44.11974

Yes. The Welch t-test gives p = 2.871e-08, far below 0.05, so the mean winpercent of chocolate candy (60.92) is significantly higher than that of fruity candy (44.12).
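The values printed by t.test() can also be pulled out of the returned object, which is handy when quoting them in an answer. A minimal sketch, assuming two short hypothetical vectors built from winpercent values that appear elsewhere in this report; they are a hand-picked subset, so the numbers differ from the full-data result above.

```r
# Hypothetical subsets of winpercent values copied from tables in this report
chocolate <- c(66.97173, 67.60294, 50.34755, 84.18029,
               81.86626, 81.64291, 76.76860, 76.67378)
fruity    <- c(52.34146, 22.44534, 24.52499, 27.30386, 28.12744)

# t.test() returns a list-like "htest" object
res <- t.test(chocolate, fruity)

res$p.value    # the p-value as a plain number
res$conf.int   # 95% confidence interval for the difference in means
res$estimate   # the two group means

# Difference in means (chocolate minus fruity)
diff_means <- unname(res$estimate[1] - res$estimate[2])
diff_means
```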

Overall Candy Rankings

Q13. What are the five least liked candy types in this set?

head(candy[order(candy$winpercent), ], n = 5)

                   chocolate fruity caramel peanutyalmondy nougat
Nik L Nip                  0      1       0              0      0
Boston Baked Beans         0      0       0              1      0
Chiclets                   0      1       0              0      0
Super Bubble               0      1       0              0      0
Jawbusters                 0      1       0              0      0
                   crispedricewafer hard bar pluribus sugarpercent pricepercent
Nik L Nip                         0    0   0        1        0.197        0.976
Boston Baked Beans                0    0   0        1        0.313        0.511
Chiclets                          0    0   0        1        0.046        0.325
Super Bubble                      0    0   0        0        0.162        0.116
Jawbusters                        0    1   0        1        0.093        0.511
                   winpercent
Nik L Nip            22.44534
Boston Baked Beans   23.41782
Chiclets             24.52499
Super Bubble         27.30386
Jawbusters           28.12744

candy |> arrange(winpercent) |> head(5)

                   chocolate fruity caramel peanutyalmondy nougat
Nik L Nip                  0      1       0              0      0
Boston Baked Beans         0      0       0              1      0
Chiclets                   0      1       0              0      0
Super Bubble               0      1       0              0      0
Jawbusters                 0      1       0              0      0
                   crispedricewafer hard bar pluribus sugarpercent pricepercent
Nik L Nip                         0    0   0        1        0.197        0.976
Boston Baked Beans                0    0   0        1        0.313        0.511
Chiclets                          0    0   0        1        0.046        0.325
Super Bubble                      0    0   0        0        0.162        0.116
Jawbusters                        0    1   0        1        0.093        0.511
                   winpercent
Nik L Nip            22.44534
Boston Baked Beans   23.41782
Chiclets             24.52499
Super Bubble         27.30386
Jawbusters           28.12744

Q14. What are the top 5 all time favorite candy types out of this set?

candy |> arrange(desc(winpercent)) |> head(5)

                          chocolate fruity caramel peanutyalmondy nougat
Reese's Peanut Butter cup         1      0       0              1      0
Reese's Miniatures                1      0       0              1      0
Twix                              1      0       1              0      0
Kit Kat                           1      0       0              0      0
Snickers                          1      0       1              1      1
                          crispedricewafer hard bar pluribus sugarpercent
Reese's Peanut Butter cup                0    0   0        0        0.720
Reese's Miniatures                       0    0   0        0        0.034
Twix                                     1    0   1        0        0.546
Kit Kat                                  1    0   1        0        0.313
Snickers                                 0    0   1        0        0.546
                          pricepercent winpercent
Reese's Peanut Butter cup        0.651   84.18029
Reese's Miniatures               0.279   81.86626
Twix                             0.906   81.64291
Kit Kat                          0.511   76.76860
Snickers                         0.651   76.67378

Q15. Make a first barplot of candy ranking based on winpercent values.

ggplot(candy) + aes(winpercent, rownames(candy)) + geom_col()

Q16. This is quite ugly, use the reorder() function to get the bars sorted by winpercent?

ggplot(candy) + aes(winpercent, reorder(rownames(candy),winpercent)) + geom_col()

Time to add some useful color

my_cols=rep("black", nrow(candy))
my_cols[as.logical(candy$chocolate)] = "chocolate"
my_cols[as.logical(candy$bar)] = "brown"
my_cols[as.logical(candy$fruity)] = "pink"

ggplot(candy) + 
  aes(winpercent, reorder(rownames(candy),winpercent)) +
  geom_col(fill=my_cols)
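Note that the order of the my_cols assignments matters: each later line overwrites earlier ones, so a candy that is both chocolate and a bar ends up "brown". A minimal sketch of this precedence, assuming a hypothetical four-row `toy` data frame whose flags are copied from tables in this report:

```r
# Hypothetical subset: flags copied from the tables in this report
toy <- data.frame(
  chocolate = c(1, 1, 0, 0),
  bar       = c(1, 0, 0, 0),
  fruity    = c(0, 0, 1, 0),
  row.names = c("Twix", "Reese's Miniatures", "Air Heads", "One dime")
)

# Same recipe as my_cols: start with black, then overwrite in order
cols <- rep("black", nrow(toy))
cols[as.logical(toy$chocolate)] <- "chocolate"
cols[as.logical(toy$bar)]       <- "brown"   # overrides "chocolate" for bars
cols[as.logical(toy$fruity)]    <- "pink"
names(cols) <- rownames(toy)
cols
```

Here Twix is both chocolate and a bar, so it is colored "brown", while Reese's Miniatures keeps the "chocolate" color.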

Q17. What is the worst ranked chocolate candy?

The worst ranked chocolate candy is “Sixlets”

Q18. What is the best ranked fruity candy?

The best ranked fruity candy is “Starburst”

Taking a look at pricepercent

library(ggrepel)

# How about a plot of win vs price
ggplot(candy) +
  aes(winpercent, pricepercent, label=rownames(candy)) +
  geom_point(col=my_cols) + 
  geom_text_repel(col=my_cols, size=3.3, max.overlaps = 5)

Warning: ggrepel: 50 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Q19. Which candy type is the highest ranked in terms of winpercent for the least money - i.e. offers the most bang for your buck?

Reese’s Miniatures offers the most bang for your buck: it has the second-highest winpercent (81.87) at a pricepercent of only 0.279.

ord <- order(candy$winpercent, decreasing = TRUE)
head( candy[ord,c(11,12)], n=5 )

                          pricepercent winpercent
Reese's Peanut Butter cup        0.651   84.18029
Reese's Miniatures               0.279   81.86626
Twix                             0.906   81.64291
Kit Kat                          0.511   76.76860
Snickers                         0.651   76.67378
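One possible way to put a number on "bang for your buck" is the ratio of winpercent to pricepercent. This ratio is an assumption made here for illustration, not part of the original analysis. The sketch below applies it to the five rows printed above; on the full dataset a very cheap candy could dominate such a ratio, so it is best read alongside the scatter plot.

```r
# The five rows printed above, typed in by hand
top5 <- data.frame(
  pricepercent = c(0.651, 0.279, 0.906, 0.511, 0.651),
  winpercent   = c(84.18029, 81.86626, 81.64291, 76.76860, 76.67378),
  row.names    = c("Reese's Peanut Butter cup", "Reese's Miniatures",
                   "Twix", "Kit Kat", "Snickers")
)

# Hypothetical value score: win percentage per unit of price percentile
top5$value <- top5$winpercent / top5$pricepercent

# Candy with the highest score among these five
best <- rownames(top5)[which.max(top5$value)]
best

# Full table sorted from best to worst value
top5[order(top5$value, decreasing = TRUE), ]
```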

Q20. What are the top 5 most expensive candy types in the dataset and of these which is the least popular?

Nik L Nip is the least popular amongst the 5 most expensive candy types.

ord <- order(candy$pricepercent, decreasing = TRUE)
head( candy[ord,c(11,12)], n=5 )

                         pricepercent winpercent
Nik L Nip                       0.976   22.44534
Nestle Smarties                 0.976   37.88719
Ring pop                        0.965   35.29076
Hershey's Krackel               0.918   62.28448
Hershey's Milk Chocolate        0.918   56.49050

Optional

Q21. Make a barplot again with geom_col() this time using pricepercent and then improve this step by step, first ordering the x-axis by value and finally making a so called “dot chart” or “lollipop” chart by swapping geom_col() for geom_point() + geom_segment().

ggplot(candy) + aes(pricepercent, reorder(rownames(candy),pricepercent)) + geom_col()

# Make a lollipop chart of pricepercent
ggplot(candy) +
  aes(pricepercent, reorder(rownames(candy), pricepercent)) +
  geom_segment(aes(yend = reorder(rownames(candy), pricepercent), 
                   xend = 0), col="gray40") +
    geom_point()

Exploring the Correlation Structure

library(corrplot)

corrplot 0.95 loaded

cij <- cor(candy)
corrplot(cij)

Q22. Examining this plot what two variables are anti-correlated (i.e. have minus values)?

Several pairs of variables are anti-correlated, but chocolate and fruity show the strongest negative correlation of any pair.

Q23. Similarly, what two variables are most positively correlated?

Chocolate and winpercent are the two variables that are the most positively correlated.
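Reading the strongest pairs off a corrplot by eye can be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `extreme_pairs()` and a small made-up data frame; with the real data one would pass in cij instead.

```r
# Hypothetical helper: find the most negative and most positive
# off-diagonal entries of a correlation matrix
extreme_pairs <- function(cij) {
  m <- cij
  # Blank out the diagonal (always 1) and the duplicated lower triangle
  m[lower.tri(m, diag = TRUE)] <- NA
  lo <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  hi <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(
    most_negative = c(rownames(m)[lo[["row"]]], colnames(m)[lo[["col"]]]),
    most_positive = c(rownames(m)[hi[["row"]]], colnames(m)[hi[["col"]]])
  )
}

# Made-up data: b rises with a, c falls as a rises
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 11),
  c = c(5, 4, 3, 2, 1)
)

res <- extreme_pairs(cor(toy))
res
```

If several pairs tie for the extreme value, this sketch reports only the first one it finds.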

Principal Component Analysis

pca <- prcomp(candy, scale = TRUE)
summary(pca)

Importance of components:
                          PC1    PC2    PC3     PC4    PC5     PC6     PC7
Standard deviation     2.0788 1.1378 1.1092 1.07533 0.9518 0.81923 0.81530
Proportion of Variance 0.3601 0.1079 0.1025 0.09636 0.0755 0.05593 0.05539
Cumulative Proportion  0.3601 0.4680 0.5705 0.66688 0.7424 0.79830 0.85369
                           PC8     PC9    PC10    PC11    PC12
Standard deviation     0.74530 0.67824 0.62349 0.43974 0.39760
Proportion of Variance 0.04629 0.03833 0.03239 0.01611 0.01317
Cumulative Proportion  0.89998 0.93832 0.97071 0.98683 1.00000
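The "Proportion of Variance" row is derived from the standard deviations: each PC's variance is its standard deviation squared, divided by the total. A minimal sketch that recomputes it from the values printed above (typed in by hand, so the results match only up to rounding); with the real object the same vector is available as pca$sdev.

```r
# Standard deviations copied from the summary(pca) output above
sdev <- c(2.0788, 1.1378, 1.1092, 1.07533, 0.9518, 0.81923,
          0.81530, 0.74530, 0.67824, 0.62349, 0.43974, 0.39760)

# Variance of each PC as a share of the total variance
var_explained <- sdev^2 / sum(sdev^2)
round(var_explained[1:3], 4)

# Running total, matching the "Cumulative Proportion" row
cum_var <- cumsum(var_explained)

# Number of PCs needed to capture at least 80% of the variance
n_pcs_80 <- which(cum_var >= 0.80)[1]
n_pcs_80
```

PC1 and PC2 together capture just under half of the variance, which is worth keeping in mind when interpreting the two-dimensional PCA plots.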

plot(pca$x[,1:2])

plot(pca$x[,1:2], col=my_cols, pch=16)

# Make a new data-frame with our PCA results and candy data
my_data <- cbind(candy, pca$x[,1:3])

p <- ggplot(my_data) + 
        aes(x=PC1, y=PC2, 
            size=winpercent/100,  
            text=rownames(my_data),
            label=rownames(my_data)) +
        geom_point(col=my_cols)

p

library(ggrepel)

p + geom_text_repel(size=3.3, col=my_cols, max.overlaps = 7)  + 
  theme(legend.position = "none") +
  labs(title="Halloween Candy PCA Space",
       subtitle="Colored by type: chocolate bar (dark brown), chocolate other (light brown), fruity (pink), other (black)",
       caption="Data from 538")

Warning: ggrepel: 39 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

# library(plotly)
# ggplotly(p)

Q24. Complete the code to generate the loadings plot above. What original variables are picked up strongly by PC1 in the positive direction? Do these make sense to you? Where did you see this relationship highlighted previously?

Fruity, pluribus, and hard are the variables picked up most strongly by PC1 in the positive direction. This makes sense: fruity candies tend to be hard and to come in packs of multiple pieces, and they are rarely chocolate. The correlation plot above highlighted the same relationship, where fruity and chocolate were the most anti-correlated pair, and the variables that load in the negative direction on PC1 (such as chocolate) are negatively correlated with fruity.

# pca$rotation is a matrix; convert it so ggplot() always receives a data frame
ggplot(as.data.frame(pca$rotation)) +
  aes(PC1, reorder(rownames(pca$rotation), PC1)) + 
  geom_col() + 
  theme(axis.title.y = element_blank())

Summary

Q25. Based on your exploratory analysis, correlation findings, and PCA results, what combination of characteristics appears to make a “winning” candy? How do these different analyses (visualization, correlation, PCA) support or complement each other in reaching this conclusion?

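Based on the exploratory analysis, correlation findings, and PCA results, a “winning” candy is generally a chocolate candy, often a bar, with features such as caramel, peanuts or almonds (peanutyalmondy), and nougat. Price matters less: Reese’s Miniatures shows that a winning candy can also be inexpensive. The analyses support each other. The ranked bar plot shows chocolate candies at the top of the winpercent ranking, the correlation plot shows chocolate as the variable most positively correlated with winpercent and most anti-correlated with fruity, and PC1 separates the fruity, hard, pluribus candies from the chocolate ones. Chocolate candies with these features are therefore the most likely to be “winning” candies.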

Optional Extension Questions

Q26. Are popular candies more expensive? In other words: is price significantly different between “winners” and “losers”? List both average values and a P-value along with your answer.

The mean pricepercent of the “losers” (winpercent below 50) is 0.3744, while the mean pricepercent of the “winners” (winpercent of 50 or above) is 0.5804. The p-value is 0.0006068, which is statistically significant. This suggests that popular candies are more expensive than unpopular ones.

losers = candy[which(candy$winpercent < 50),]
winners = candy[which(candy$winpercent >= 50),]

t.test(losers$pricepercent, winners$pricepercent)


    Welch Two Sample t-test

data:  losers$pricepercent and winners$pricepercent
t = -3.5653, df = 82.798, p-value = 0.0006068
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.32090727 -0.09107157
sample estimates:
mean of x mean of y 
0.3743696 0.5803590

Q27. Are candies with more sugar more likely to be popular? What is your interpretation of the means and P-value in this case?

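The mean winpercent of candies with more sugar (sugarpercent of 0.5 or above) is 53.949, while the mean winpercent of candies with less sugar is 47.238. The p-value is 0.0373, which is below the usual 0.05 threshold, so the difference is statistically significant, though only modestly. This suggests that candies with more sugar tend to be somewhat more popular than those with less sugar.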

more = candy[which(candy$sugarpercent >= 0.5),]
less = candy[which(candy$sugarpercent < 0.5),]

t.test(more$winpercent, less$winpercent)


    Welch Two Sample t-test

data:  more$winpercent and less$winpercent
t = 2.1192, df = 76.967, p-value = 0.0373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  0.4050308 13.0167630
sample estimates:
mean of x mean of y 
 53.94854  47.23765
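Q26 and Q27 follow the same recipe: split the candies at a cutoff on one column, then compare the means of another column with a t-test. A minimal sketch of that recipe as a reusable function, assuming a hypothetical helper `compare_groups()` and a ten-row `toy` data frame typed in from tables earlier in this report (a subset, so its numbers differ from the full-data results above):

```r
# Hypothetical helper: split df at `cutoff` on split_col, then compare
# the means of value_col between the two groups with a Welch t-test
compare_groups <- function(df, split_col, cutoff, value_col) {
  high <- df[[value_col]][df[[split_col]] >= cutoff]
  low  <- df[[value_col]][df[[split_col]] <  cutoff]
  res  <- t.test(high, low)
  c(mean_high = mean(high), mean_low = mean(low), p_value = res$p.value)
}

# Hypothetical subset: ten rows copied from tables in this report
toy <- data.frame(
  sugarpercent = c(0.732, 0.604, 0.011, 0.011, 0.906,
                   0.465, 0.720, 0.034, 0.546, 0.313),
  winpercent   = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146,
                   50.34755, 84.18029, 81.86626, 81.64291, 76.76860)
)

out <- compare_groups(toy, "sugarpercent", 0.5, "winpercent")
out
```

The cutoff is a judgment call. A correlation test such as cor.test() on the two columns would assess the same relationship without dividing the candies into two groups.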