# Worksheet 10 - Clustering

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
* Explain the K-means clustering algorithm.
* Interpret the output of a K-means cluster analysis.
* Perform K-means clustering in R.
* Visualize the output of K-means clustering in R using a coloured scatter plot.
* Identify when it is necessary to scale variables before clustering and do this using R.
* Use the elbow method to choose the number of clusters for K-means.
* Describe advantages, limitations and assumptions of the K-means clustering algorithm.

This worksheet covers parts of [the Clustering chapter](https://datasciencebook.ca/clustering.html) of the online textbook. You should read this chapter before attempting the worksheet.

In [None]:
### Run this cell before continuing.
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(forcats)
library(repr)
options(repr.matrix.max.rows = 6)
source('tests.R')
source("cleanup.R")

**Question 0.0** Multiple Choice:
<br> {points: 1}

In which of the following scenarios would clustering methods likely be appropriate?

A. Identifying sub-groups of houses according to their house type, value, and geographical location

B. Predicting whether a given user will click on an ad on a website

C. Segmenting customers based on their preferences to target advertising

D. Both A. and B.

E. Both A. and C. 

*Assign your answer to an object called `answer0.0`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_0.0()

**Question 0.1** Multiple Choice:
<br> {points: 1}

Which step in the description of the K-means algorithm below is *incorrect*?

0. Choose the number of clusters

1. Randomly assign each of the points to one of the clusters

2. Calculate the position for the cluster centre (centroid) for each of the clusters (this is the middle of the points in the cluster, as measured by straight-line distance)

3. Re-assign each of the points to the cluster whose centroid is furthest from that point

4. Repeat steps 2 - 3 until the cluster centroids don't change at all

*Assign your answer to an object called `answer0.1`. Your answer should be a single numerical character surrounded by quotes.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_0.1()

## Hoppy Craft Beer

Craft beer is a strong market in Canada and the US, and is expanding to other countries as well. If you wanted to get into the craft beer brewing market, you might want to better understand the product landscape. One popular craft beer product is hopped craft beer. Breweries create/label many different kinds of hopped craft beer, but how many different kinds of hopped craft beer are there really when you look at the chemical properties instead of the human labels? 

We will start exploring this question using a [craft beer data set from Kaggle](https://www.kaggle.com/nickhould/craft-cans#beers.csv). In this data set, we will use the alcohol content by volume (`abv` column) and the International Bittering Units (`ibu` column) as variables to try to cluster the beers. The `abv` variable ranges from 0 (no alcohol) to 1 (pure alcohol), and the `ibu` variable quantifies the bitterness of the beer (higher values indicate higher bitterness).

**Question 1.0** 
<br> {points: 1}

Read in the `beers.csv` data using `read_csv()` and assign it to an object called `beer`. The data is located within the `worksheet_10/data/` folder. 

*Assign your dataframe answer to an object called `beer`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
beer

In [None]:
test_1.0()

**Question 1.1**
<br> {points: 1}

Let's start by visualizing the variables we are going to use in our cluster analysis as a scatter plot. Put `ibu` on the horizontal axis, and `abv` on the vertical axis. Name the plot object `beer_plot`. 

*Assign your plot to an object named `beer_plot`, and remember to follow the best visualization practices, including adding human-readable labels to your plot.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
beer_plot

In [None]:
test_1.1()

**Question 1.2**
<br> {points: 1}

We need to clean this data a bit. Specifically, we need to remove the rows where `ibu` is `NA`, and select only the columns we are interested in clustering, which are `ibu` and `abv`. 

*Assign your answer to an object named `clean_beer`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
clean_beer

In [None]:
test_1.2()

**Question 1.3** Multiple Choice:
<br>{points: 1}

Why do we need to scale the variables when using K-means clustering?

A. K-means uses the Euclidean distance to compute how similar data points are to each cluster center

B. K-means is an iterative algorithm

C. Some variables might be more important for prediction than others

D. To make sure their mean is 0

*Assign your answer to an object named `answer1.3`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

**Question 1.4**
<br> {points: 1}

We will now build a `tidymodels` workflow to cluster the data. The first step is to create a `recipe` that specifies that we want to center and scale all of the variables in the `clean_beer` data frame. 

*Recall that we used a `recipe` for scaling when doing classification and regression. Even though `recipe`s were originally designed for predictive modeling tasks (like classification and regression), the `tidyclust` library lets us use our familiar `tidymodels` functions for clustering too!*
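
To make concrete what centering and scaling do, here is a minimal base R sketch using `scale()` on a small made-up data frame. The column names and values are hypothetical and are not taken from `beers.csv`. It illustrates the idea only and is not the `recipe` you are asked to write.

```r
# Toy data (hypothetical values, not taken from beers.csv):
# one variable on a 0-1 scale and one on a scale of tens.
toy <- data.frame(
  small_scale = c(0.04, 0.05, 0.06, 0.07, 0.09),
  large_scale = c(10, 25, 40, 60, 90)
)

# scale() subtracts each column's mean and divides by its standard deviation
toy_scaled <- as.data.frame(scale(toy))

# After standardization each column has mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)

# Straight-line distance between the first two rows, before and after scaling
dist(toy[1:2, ])
dist(toy_scaled[1:2, ])
```

Before scaling, the distance between the two rows comes almost entirely from `large_scale`. After scaling, both variables contribute on a comparable footing.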

*Assign your answer to an object named `kmeans_recipe`. Use the scaffolding provided.*

In [None]:
# ... <- ...( ~ . , ...) |> 
#        ...(...) |>
#        ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_recipe

In [None]:
test_1.4()

**Question 1.5**
<br>{points: 1}

The next step in our `tidymodels` workflow is a model specification that specifies that we want to cluster the data. From our exploratory data visualization, 2 seems like a reasonable number of clusters. Use the `k_means` function with `num_clusters = 2` to perform clustering with this choice of $K$. Make sure to use the "stats" engine.

*Assign your answer to an object named `kmeans_spec`. Use the scaffolding provided.*

In [None]:
# ... <- ...(... = ...) |>
#        ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_spec

In [None]:
test_1.5()

**Question 1.6**
<br> {points: 1}

Combine the recipe and model specification into a `workflow`, and fit the `workflow` on the `clean_beer` data.

*Assign your model to an object named `kmeans_fit`. Note that since k-means uses a random initialization, we need to set the seed; don't change the value!*
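
As a small illustration of why the seed matters, the sketch below uses base R's `kmeans()` (not the `tidymodels` workflow asked for here) on made-up random points. The data and seed values are assumptions chosen only for the demonstration.

```r
# Toy data: 30 random points in two dimensions (not the beer data)
set.seed(2024)
toy <- matrix(rnorm(60), ncol = 2)

# Setting the same seed before each call means the same random
# starting centres are drawn, so the two fits are identical.
set.seed(1234)
fit_a <- kmeans(toy, centers = 2)
set.seed(1234)
fit_b <- kmeans(toy, centers = 2)

identical(fit_a$cluster, fit_b$cluster)
```

Without resetting the seed, a second call may start from different centres and can return different cluster labels.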

In [None]:
# DON'T CHANGE THE SEED VALUE!
set.seed(1234)

# ... <- workflow() |>
#        ...(...) |>
#        ...(...) |>
#        fit(data = ...)
# your code here
fail() # No Answer - remove if you provide an answer
kmeans_fit

In [None]:
test_1.6()

**Question 1.7**
<br> {points: 1}

Use the `augment` function to add the cluster assignment for each point to the `clean_beer` data frame. 

*Assign your answer to an object named `labelled_beer`.* 

In [None]:
# ... <- augment(..., ...)
# your code here
fail() # No Answer - remove if you provide an answer
labelled_beer

In [None]:
test_1.7()

**Question 1.8**
<br> {points: 1}

Create a scatter plot of `abv` on the y-axis versus `ibu` on the x-axis (using the data in `labelled_beer`) where the points are coloured by their cluster assignment. Name the plot object `cluster_plot`.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
cluster_plot

In [None]:
test_1.8()

**Question 1.9.1** Multiple Choice:
<br> {points: 1}

We do not know, however, that two clusters ($K$ = 2) is the best choice for this data set. What can we do to choose the best $K$?

A. Perform *cross-validation* for a variety of possible $K$s. Choose the one where the within-cluster sum of squares distance starts to *decrease less*.

B. Perform *cross-validation* for a variety of possible $K$s. Choose the one where the within-cluster sum of squares distance starts to *decrease more*. 

C. Perform *clustering* for a variety of possible $K$s. Choose the one where the within-cluster sum of squares distance starts to *decrease less*.

D. Perform *clustering* for a variety of possible $K$s. Choose the one where the within-cluster sum of squares distance starts to *decrease more*. 

*Assign your answer to an object called `answer1.9.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.9.1()

**Question 1.9.2**
<br> {points: 1}

Use the `glance` function to get the model-level statistics for the clustering we just performed, including total within-cluster sum of squares. 
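
For intuition about what this statistic measures, here is a hedged base R sketch on made-up data. It computes the total within-cluster sum of squares by hand, as the sum of squared straight-line distances from each point to the centre of its own cluster, and compares the result with the `tot.withinss` value stored by base R's `kmeans()`. The toy data and object names are assumptions for illustration.

```r
# Toy data: two groups of 20 points each (not the beer data)
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Look up, for every point, the centre of the cluster it was assigned to
own_centres <- fit$centers[fit$cluster, , drop = FALSE]

# Sum of squared distances from each point to its own cluster centre
total_wssd <- sum((toy - own_centres)^2)

total_wssd
fit$tot.withinss
all.equal(total_wssd, fit$tot.withinss)
```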

*Assign your answer to an object named `clustering_stats`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
clustering_stats

In [None]:
test_1.9.2()

**Question 1.9.3**
<br>{points: 1}

What is the total within-cluster sum of squares distance for this clustering?

*Assign your answer to an object named `totalWSSD`. Round your answer to 2 decimal places.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.9.3()

**Question 2.0**
<br> {points: 1}

Let's now choose the best $K$ for this clustering problem. To do this we need to create a tibble with a column having the same name as the parameter we want to tune (`num_clusters`), taking values 1 to 10. 

*Assign your answer to an object named `beer_ks`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
beer_ks

In [None]:
test_2.0()

**Question 2.1**
<br> {points: 1}

We also need to create a new model specification that lets `tidymodels` tune the number of clusters. Rather than setting `num_clusters` to a particular value in the model specification, set it to `tune()`. Use the "stats" engine again, and set `nstart = 10` so that K-means is run from 10 random starts for each value of $K$.

*Assign your answer to an object named `kmeans_spec_tune`.*
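
To see what random restarts are for, the sketch below compares one start with ten starts using base R's `kmeans()` on made-up data. The data, seeds, and object names are assumptions for illustration. With several starts, `kmeans()` keeps the run with the smallest total within-cluster sum of squares, so the result should be no worse than a single start from the same seed.

```r
# Toy data: 100 random points in two dimensions (not the beer data)
set.seed(7)
toy <- matrix(rnorm(200), ncol = 2)

# One random start
set.seed(99)
single <- kmeans(toy, centers = 5, nstart = 1)

# Ten random starts; the best of the ten runs is returned
set.seed(99)
multi <- kmeans(toy, centers = 5, nstart = 10)

c(single = single$tot.withinss, multi = multi$tot.withinss)
```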

In [None]:
# ... <- ...(... = ...) |>
#        ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_spec_tune

In [None]:
test_2.1()

**Question 2.2**
<br>{points: 1}

Now combine the new model specification and our original recipe into a new `workflow`. Include the `tune_cluster` function in the workflow to run the tuning procedure. In the `tune_cluster` function, specify the `resamples` argument to be `apparent(clean_beer)` so that we use the same full data for each tuning trial. Also specify the `grid` argument to be the data frame of values of $K$ we just created. Finally, include the `collect_metrics` step to gather the results of the tuning procedure.

*Assign your answer to an object named `kmeans_tuning_stats`*.
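
The idea behind this tuning step can be sketched in base R: fit K-means once for each candidate $K$ on the same full data and record the total within-cluster sum of squares each time. The example below uses made-up data and plain `kmeans()` rather than `tune_cluster`. The object names are assumptions for illustration.

```r
# Toy data with three well-separated groups (not the beer data)
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 5, sd = 0.5))
)

ks <- 1:10

# Fit K-means for every candidate K and keep the total WSSD of each fit
total_wssd <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

elbow <- data.frame(num_clusters = ks, total_WSSD = total_wssd)
elbow

# Line plot with points of total WSSD against K
plot(elbow$num_clusters, elbow$total_WSSD, type = "b",
     xlab = "Number of clusters (K)",
     ylab = "Total within-cluster sum of squares")
```

The resulting curve is what the elbow method inspects.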

In [None]:
# DON'T CHANGE THE SEED VALUE!
set.seed(9999)
# 
# ... <- ... |>
#        ...(...) |>
#        ...(...) |>
#        tune_cluster(resamples = ..., grid = ...) |>
#        ...()

# your code here
fail() # No Answer - remove if you provide an answer
kmeans_tuning_stats

In [None]:
test_2.2()

**Question 2.3**
<br> {points: 1}

Now we need to extract the total WSSD results from the `kmeans_tuning_stats` data frame. Recall that we want to look at the `mean` variable for rows where the `.metric` variable is `sse_within_total`. Use the `filter`, `select`, and `mutate` functions to create a data frame containing only two variables: `num_clusters` and `total_WSSD`.

*Assign your answer to an object named `tidy_tuning_stats`.*

In [None]:
# ... <- ... |>
#        mutate(... = ...) |>
#        filter(... == ...) |>
#        select(..., ...)

# your code here
fail() # No Answer - remove if you provide an answer
print(tidy_tuning_stats)

In [None]:
test_2.3()

**Question 2.4**
<br> {points: 1}

We now have the values for total within-cluster sum of squares for each model in a column (`total_WSSD`). Let's use it to create a line plot with points of total within-cluster sum of squares versus $K$, so that we can choose the best number of clusters to use. 

*Assign your plot to an object called `choose_beer_k`. Total within-cluster sum of squares should be on the y-axis and $K$ should be on the x-axis. Remember to follow the best visualization practices, including adding human-readable labels to your plot.*

In [None]:
options(repr.plot.width = 8, repr.plot.height = 7)

# your code here
fail() # No Answer - remove if you provide an answer
choose_beer_k

In [None]:
test_2.4()

**Question 2.5**
<br> {points: 1}

From the plot above, which $K$ should we choose? 

*Assign your answer to an object called `answer2.5`. Make sure your answer is a single numerical character surrounded by quotation marks.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.5()

**Question 2.6** Multiple Choice:
<br> {points: 1}

Why did we choose the $K$ we chose above?

A. It had the greatest total within-cluster sum of squares

B. It had the smallest total within-cluster sum of squares

C. Increasing $K$ further than this only decreased the total within-cluster sum of squares a small amount

D. Increasing $K$ further than this only increased the total within-cluster sum of squares a small amount

*Assign your answer to an object called `answer2.6`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.6()

**Question 2.7** Multiple Choice:
<br> {points: 1}

What can we conclude from our analysis? How many different types of hoppy craft beer are there in this data set using the two variables we have? 


A. 1

B. 2 to 4

C. 5 to 7

D. more than 7

*Assign your answer to an object called `answer2.7`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.7()

**Question 2.8** True or false:
<br> {points: 1}

Our analysis might change if we added additional variables, true or false?

*Assign your answer to an object called `answer2.8`. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. `"true"` or `"false"`).* 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.8()

In [None]:
source("cleanup.R")