# Worksheet 1: Introduction to Data Science

Welcome to DSCI 100: Introduction to Data Science!  

Each week you will complete a lecture assignment like this one. Before we get started, there are some administrative details.

You can't learn technical subjects without hands-on practice. The weekly lecture worksheets and tutorials are an important part of the course. The lecture worksheets and tutorials will automatically be collected on the due date. Attendance in lectures and tutorials is required. There will be participatory activities in both the lecture and tutorial to help support your learning.

- The lecture worksheets: 
    - Each question is worth 1 point. 
- The tutorial assignments: 
    - Each autograded question is worth 1 point. 
    - Each manually graded question is worth 3 points. 

Collaborating on lecture worksheets and tutorial assignments is more than okay -- it's encouraged! You should rarely be stuck for more than a few minutes on questions in lecture or tutorial, so ask a neighbor, TA or an instructor for help (explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it). Please don't just share answers, though. Everyone must submit a copy of their own work.

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* use a Jupyter notebook to execute provided R code
* edit code and markdown cells in a Jupyter notebook
* create new code and markdown cells in a Jupyter notebook
* load the `tidyverse` library into R
* create new variables and objects in R using the assignment symbol
* use the help and documentation tools in R
* match the names of the following functions from the `tidyverse` library to their documentation descriptions: 
    - `read_csv` 
    - `select`
    - `mutate`
    - `filter`
    - `ggplot`
    - `aes`

In this first worksheet you will also learn how to test the answers you write in this worksheet to assess if you answered questions correctly before your assignment is collected.

This worksheet covers parts of [the Introduction chapter](https://datasciencebook.ca/intro.html) of the online textbook. In most worksheets we expect you to read the textbook chapters before completing the worksheet. However, we know that might not have been possible for this worksheet, so we have added a bit more help to get you through. You should still read the chapter to get a deeper understanding of this week's material (it will help you more easily answer the problems in this week's tutorial homework). 

## 1. Jupyter Notebooks
This webpage is called a Jupyter notebook. A notebook is a place to write computer code for analysis, view the results of the analysis, as well as to narrate the analysis with rich formatted text.

### 1.1. Text Cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶ to confirm any changes. (Try not to delete the instructions of the lab.)

**Question 1.1.1**
<br> {points: 0}

This paragraph is in its own text cell.  Try editing it so that all of the sentences following this one are deleted, then click the "run cell" ▶ button.  This sentence, for example, should be deleted.  So should this one.

### 1.2. Code Cells
Other cells contain code in the R language. Running a code cell will execute all of the code it contains.

To run the code in a cell, first click on that cell to activate it.  It will be highlighted with a blue rectangle to the left of it when activated.  Next, either press Run ▶ or hold down the `shift` key and press `return` or `enter`.

Try running the next cell:

In [None]:
print("Hello, World!")

The above code cell contains a single line of code, but cells can also contain multiple lines of code. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [None]:
print("First this line is printed,")
print("and then this one.")

**Question 1.2.1**
<br> {points: 0}

Change the cell above so that it prints out:

    First this line is printed,
    and then the next line, 
    and then this one.

*Hint:* If you're stuck for more than a few minutes, try talking to a neighbor or a TA.  That's a good idea for any worksheet or tutorial problem.

### 1.3. Writing Jupyter Notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar of this tab.  The newly created cell will start out as a code cell.  You can change it to a text cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart and runall button (⏩) in the menu bar of this tab, and changing it from "Code" to "Markdown".

**Question 1.3.1**
<br> {points: 0}

Add a code cell below this one.  Write code in it that prints out:
   
    A whole new code cell!

Run your cell to verify that it works.

**Question 1.3.2**
<br> {points: 0}

Add a text/Markdown cell below this one. Write the text "A whole new Markdown cell" in it.

### 1.4. Comments
Below you see lines like this in code cells:

    # Test cell; please do not change!

That is called a *comment*.  It doesn't make anything happen in R; R ignores anything on a line after a #.  Instead, it's there to communicate something about the code to you, the human reader.  Comments are extremely useful and can help increase how readable our code is.

<img src="http://imgs.xkcd.com/comics/future_self.png">

*Source: https://xkcd.com/1421/*

The below code cell contains comments (one at the start of a line, and one after some other code). Run the cell. You will see that everything after a comment symbol `#` is ignored by R.

In [None]:
# you can use comments to document your code, or make R ignore some code without deleting it entirely
# print("this is a commented line that R will ignore. You won't see this text in the output!")

print("hello!") # you can also put comments at the end of a line of code

### 1.5. Errors
R is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running R code is not smart enough to do that.

Whenever you write code, you'll make mistakes (everyone who writes code does, even your course instructor!).  When you run a code cell that has errors, R will sometimes produce error messages to tell you what you did wrong.

Errors are totally okay; even experienced programmers make many errors. It's a natural part of the coding process.  When you make an error, you just have to find the source of the problem, fix it, and move on. 

We have made an error in the next cell. **Remove the `#` symbol below (i.e., uncomment the line)**, and then run the cell to see what happens.

In [None]:
# print("This line is missing something."

![ws1_error_image.png](images/ws1_error_image.png)

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. Even though the error message can seem cryptic, if you read it carefully you'll often find hints as to what went wrong. For example, above, you'll see the message `unexpected end of input` (among a lot of other technical jargon). In other words, R reached the end of the line of code, and wasn't expecting to reach the end -- it thinks there is still something missing!

Of course, even if you do your best to interpret the error message, sometimes you may get stuck figuring out what went wrong and how to fix it. In that case, ask a neighbor or a TA for help.

Try to fix the code above so that you can run the cell and see the intended message instead of an error.

### 1.6. The Kernel
The kernel is a program that executes the code inside your notebook and outputs the results. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (⚪), the kernel is idle and ready to execute code. If the circle is filled in (⚫), the kernel is busy running some code. 

You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try the following steps:
1. At the top of your screen, click **Kernel**, then **Interrupt Kernel**.
2. If that doesn't help, click **Kernel**, then **Restart Kernel...**. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work!
3. If that doesn't help, restart your server. First, save your work by clicking **File** at the top left of your screen, then **Save Notebook**. Next, from the **File** menu click **Hub Control Panel**. Choose **Stop My Server** to shut it down, then **My Server** to start it back up. Then, navigate back to the notebook you were working on. 

### 1.7. Saving your work

It's important to save your work often so you don't lose your progress! At the top of the screen, go to the **File** menu then **Save Notebook**. There is also a disk icon (<img src="images/disk.png" width="2%">) in the menu of this tab that can be used to save your work as well. Finally, there are keyboard shortcuts for saving your work too: control + s on Windows, or command + s on Mac. Once you've saved your work, you will see a message at the bottom of the screen that says **Saving completed**. 

### 1.8. Submitting your work
All lecture worksheets and tutorial assignments in the course will be distributed as notebooks like this one. You will complete your work in this notebook and at the due date we will copy this notebook and grade that copy. For lecture worksheets we will use a system called nbgrader that checks your work. For tutorial assignments we will use a combination of nbgrader and manual grading of your work. 

**Play the YouTube video below to see how to properly answer questions and save your work in DSCI 100 worksheets or tutorials.** 

In [None]:
# Run this cell and play the YouTube video below to see how to properly answer questions in DSCI 100 worksheets or tutorials.

IRdisplay::display_html('<iframe width="560" height="315" src="https://www.youtube.com/embed/M8W0HbzcK8Q" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


## 2. Numbers

Quantitative information arises everywhere in data science. In addition to representing commands to print out lines, our R code can represent numbers and methods of combining numbers. The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)

In [None]:
3.2500

Notice that we didn't have to write `print()`. When you run a notebook cell, Jupyter helpfully prints out that value for you. 

In [None]:
2
3
4

Above, you should see that the three numbers (2, 3, and 4) are printed out. In R, simply typing numbers on separate lines and running the cell will display every number that you listed. Even though we don't need to use `print`, we will continue to do so in several places in these worksheets so that we are very clear about our intentions.

### 2.1. Arithmetic
The line in the next cell subtracts.  Its value is what you'd expect.  Run it.

In [None]:
2.0 - 1.5

Same with the cell below. Run it.

In [None]:
2 * 2

Many basic arithmetic operations are built in to R.  [This webpage](https://www.statmethods.net/management/operators.html) describes all the arithmetic operators used in the course.  You can refer back to this webpage as you need throughout the term. 
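
As a small illustration of the operators described above, the sketch below tries a few of them on simple numbers. The object names (`sum_value`, `quotient`, and so on) are made up for this example and are not used elsewhere in the worksheet.

```r
# A few of R's built-in arithmetic operators
sum_value <- 7 + 2         # addition: 9
quotient <- 7 / 2          # division: 3.5
power <- 7 ^ 2             # exponentiation: 49
remainder <- 7 %% 2        # remainder after division (modulo): 1
whole_quotient <- 7 %/% 2  # integer division: 3

# R follows the usual order of operations; parentheses change the grouping
without_parentheses <- 2 + 3 * 4    # 14
with_parentheses <- (2 + 3) * 4     # 20
```

Typing any of these names on its own line at the end of a cell would display its value.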

## 3. Names
In natural language, we have terminology that lets us quickly reference very complicated concepts.  We don't say, "That's a large mammal with brown fur and sharp teeth!"  Instead, we just say, "Bear!"

Similarly, an effective strategy for writing code is to define names for data as we compute it, like a lawyer would define terms for complex ideas at the start of a legal document to simplify the rest of the writing.

In R, we do this with *objects*. An object has a name on the left side of an `<-` sign and an expression to be evaluated on the right.

In [None]:
answer <- 3 * 2 + 4

When you run that cell, R first evaluates the first line.  It computes the value of the expression `3 * 2 + 4`, which is the number 10.  Then it gives that value the name `answer`.  At that point, the code in the cell is done running.

After you run that cell, the value 10 is bound to the name `answer`:

In [None]:
answer

We can name our objects anything we'd like. Above we called it `answer`, but we could have named it `value`, `data` or anything else we desired. A good rule of thumb is to name it something that has meaning to a human as it relates to what we are trying to accomplish with our R code.

**Question 3.1**
<br> {points: 0}

Enter a new code cell. Try creating another object using `<- 3 * 2 + 4` with a name different from `answer`. 

A common pattern in Jupyter notebooks is to assign a value to a name and then immediately evaluate the name in the last line in the cell so that the value is displayed as output. 

In [None]:
close_to_pi <- 355/113
close_to_pi

Another common pattern is that a series of lines in a single cell will build up a complex computation in stages, naming the intermediate results.

In [None]:
bimonthly_salary <- 840
monthly_salary <- 2 * bimonthly_salary
number_of_months_in_a_year <- 12
yearly_salary <- number_of_months_in_a_year * monthly_salary
yearly_salary

When naming objects in R there are some rules:
1. Names in R can have letters (upper- and lower-case letters are both okay and count as different letters e.g. "Answer" and "answer" will be treated as different objects), underscores, dots, and numbers. 
2. The first character can't be a number (otherwise a name might look like a number).  
3. Names can't contain spaces, since spaces are used to separate pieces of code from each other. 

Other than those rules, what you name something doesn't matter *to R*.  For example, the next cell does the same thing as the above cell, except everything has a different name:

In [None]:
a <- 840
b <- 2 * a
c <- 12
d <- c * b
d

**However**, names are very important for making your code *readable* to yourself and others.  The cell above is shorter, but it's totally useless without an explanation of what it does. 

There is also cultural style associated with different programming languages. In the modern R style, object names should use only lowercase letters, numbers, and `_`. Underscores (`_`) are typically used to separate words within a name (*e.g.*, `answer_one`).
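
To make the naming rules and style conventions above concrete, here is a short sketch. All of the names in it (`Total`, `total`, `race.Time2`, `race_time_minutes`) are invented for illustration only.

```r
# Upper- and lower-case letters are different, so these are two separate objects
Total <- 1
total <- 2
Total == total  # FALSE

# Valid, but not in the preferred style: uses a dot and mixed case
race.Time2 <- 45

# Preferred modern style: lowercase words separated by underscores
race_time_minutes <- 45

# Invalid names (these lines would cause errors if uncommented):
# 2nd_race <- 50    # starts with a number
# race time <- 50   # contains a space
```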

**Question 3.2** <br> {points: 1}

Assign the name `seconds_in_an_hour` to the number of seconds in an hour. You should do this in two steps. In the first, you calculate the number of seconds in a minute and assign that number the name `seconds_in_a_minute`. Next you should calculate the number of seconds in an hour and assign that number the name `seconds_in_an_hour`.  

*Hint - there are 60 seconds in a minute and 60 minutes in an hour*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

# We've put this line in this cell so that it will print
# the value you've given to seconds_in_an_hour when you
# run it.  You don't need to change this.
seconds_in_an_hour

### 3.2. Checking your code

Now that you know how to name things, you can start using the built-in *tests* to check whether your work is correct. To do this, you will need to run the cell below to set things up. In future worksheets and tutorial assignments you will see this cell at the very top of the notebook:

In [None]:
source("tests.R")
source("cleanup.R")
options(repr.matrix.max.rows = 6)

Below is an example of a test cell for Question 3.2 above (assesses whether you have assigned `seconds_in_an_hour` correctly). If you haven't, this test will tell you that your solution is incorrect. Try not to change the contents of the test cells. Resist the urge to just copy it, and instead try to adjust your expression. (Sometimes the tests will give hints about what went wrong...)

In [None]:
test_3.2()

For this first question we'll provide you the solution:

In [None]:
# Calculate the number of seconds in an hour.

#SOLUTION:
seconds_in_a_minute <- 60
seconds_in_an_hour <- seconds_in_a_minute * 60

# We've put this line in this cell so that it will print
# the value you've given to seconds_in_an_hour when you
# run it.  You don't need to change this.
seconds_in_an_hour

*Note: All autograded questions with visible tests in this course are worth 1 point.*

## 4. Calling Functions

The most common way to combine or manipulate values in R is by calling functions. R comes with many built-in functions that perform common operations.

We used a function `print()` at the beginning of this notebook when we printed text from a code cell. Here we'll demonstrate using another function `toupper()` that converts text to uppercase:

In [None]:
greeting <- toupper("Why, hello there!")
greeting

**Question 4.0** <br> {points: 1} 

Use the function `tolower` to change all the words in the following movie title to lower case text: "The House with a Clock in Its Walls" and assign the lower case text the name `title`.

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer
title

In [None]:
test_4.0()

### 4.1. Multiple Arguments
Some functions take multiple arguments, separated by commas. For example, the built-in `max` function returns the maximum argument passed to it.

In [None]:
biggest <- max(2, 15, 4, 7)
biggest

**Question 4.1** <br> {points: 1}

Use the `min` function to find the minimum value of the numbers in the cell above.

Assign the value to an object called `smallest`.

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer
smallest

In [None]:
test_4.1()
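
As a further sketch of functions with multiple arguments: arguments can also be given by name, which makes it clearer what each value is for. The example below uses the built-in `round` function and its `digits` argument; the object name `rounded_pi` is made up for this illustration.

```r
# round() takes the number to round and, optionally, the number of decimal places
rounded_pi <- round(3.14159, digits = 2)
rounded_pi
```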

## 5. Packages
R has many built-in functions, but we can also use functions that are stored within packages created by other R users. We are going to use a package, called `tidyverse`, to load, modify and plot data.
This package has already been installed for you. Later in the course you will learn how to install packages so you are free to bring in other tools as you need them for your data analysis. 

To use the functions from a package you first need to load it using the `library` function. This needs to be done once per notebook (and a good rule of thumb is to do this at the very top of your notebook so it is easy to see what packages your R code depends on). 

In [None]:
library(tidyverse)

> Note: it is normal and expected that a message is printed out after loading the tidyverse and some other packages. Generally, this message lets you know if functions from different loaded packages share the same name (which is confusing to R), and if so, which one you can access using just its name (and which one you need to refer to using both the package name and the function name; this is called masking). Additionally, the tidyverse is a special R package: it is a meta-package that bundles together several related and commonly used packages. Because of this, the message also lists the packages that the tidyverse loads for you.
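
Referring to a function by both its package name and its function name is done with the `::` operator. The minimal sketch below uses the `base` package, which always comes with R, so it runs without loading anything; the object name `largest` is made up for this example.

```r
# package_name::function_name() calls the function from the named package,
# even when another loaded package has a function with the same name
largest <- base::max(2, 15, 4, 7)
largest
```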

**Question 5.1** <br> {points: 1} 

Use the `library` function to load the `rvest` R package. This package can be used to scrape data from the web. Next week, there is an optional section where you can learn how to do this if you like!

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_5.1()

## 6. Looking for Help

No one, not even an experienced, professional programmer, remembers what every function does or every possible function argument/option. So both experienced and new programmers (like you!) need to look things up, A LOT! 
### 6.1. Help Files
One of the most efficient places to look for help on how a function works is the R help files. Let’s say we wanted to pull up the help file for the `read_csv()` function. We can do this by typing a question mark in front of the function we want to know more about. Run the cell below to find out more about `read_csv`.

In [None]:
?read_csv

At the very top of the file, you will see the function itself and the package it is in (in this case, it is readr). Next is a description of what the function does. You’ll find that the most helpful sections on this page are “Usage”, “Arguments” and "Examples". 

- **Usage** gives you an idea of how you would use the function when coding--what the syntax would be and how the function itself is structured. 
- **Arguments** tells you the different parts that can be added to the function to make it more simple or more complicated. Often the “Usage” and “Arguments” sections don’t provide you with step by step instructions, because there are so many different ways that a person can incorporate a function into their code. Instead, they provide users with a general understanding as to what the function could do and parts that could be added. At the end of the day, the user must interpret the help file and figure out how best to use the functions and which parts are most important to include for their particular task. 
- The **Examples** section is often the most useful part of the help file as it shows how a function could be used with real data. It provides a skeleton code that the users can work off of.
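
Besides the question mark, base R offers a couple of related tools for reaching the same help files. The sketch below uses the built-in `max` function as the topic; any other function name would work the same way.

```r
help("max")     # opens the same help file as ?max
example("max")  # runs the code from the Examples section of that help file
```

Running `example()` can be a quick way to see a function in action before adapting its example code to your own task.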

Beyond the R help files there are many resources that you can use to find help. [Stack Overflow](https://stackoverflow.com/), an online forum, is a great place to go and ask questions such as how to perform a complicated task in R or why a specific error message is popping up. Oftentimes, a previous user will have already asked your question of interest and received helpful advice from fellow R users.

**Question 6.1** Multiple Choice:
<br> {points: 1}

Use `?read_csv` and read the **Description** section to answer the multiple choice question below. To answer the question assign the letter associated with the correct answer to a variable in the code cell below:

Which statement below is accurate?

A. `read_csv2()` uses `;` for separators, instead of `,`

B. `read_delim` is a special case of the `read_csv` function.

C. These functions are useful for reading binary files, such as excel spreadsheets.

D. European countries commonly use `:` as the decimal separator.

*Assign your answer to an object called `answer6.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_6.1()

## 7. Tidyverse Functions 

Now that we have learned a little about Jupyter notebooks and R, let's load a real dataset into R and explore it. As we do this we will learn more about key data loading, wrangling and visualization functions in R. 

### Exercise: Data about Runners!
Researchers Vickers and Vertosick performed [a study in 2016](https://bmcsportsscimedrehabil.biomedcentral.com/articles/10.1186/s13102-016-0052-y) that aimed to identify what factors had a relationship with race performance of recreational runners so that they could better predict future 5 km, 10 km and marathon race times for individual runners. Such predictions (and knowing what drives these predictions) can help runners by suggesting changes they could make to modifiable factors, such as training, to help them improve race time. Unmodifiable factors that contribute to the prediction, such as age or sex, allow for fair comparisons to be made between different runners.

Vickers and Vertosick reasoned that their study is important because all previous research done to predict race times has focused on data from elite athletes. This biased data set means that the predictions generated from it do not necessarily do a good job predicting race times for recreational runners (whose data was not in the dataset that was used to create the model that generates the predictions). Additionally, previous research focused on reporting/measuring factors that require special expertise or equipment that are not freely available to recreational runners. This means that recreational runners may not be able to put their characteristics/measurements for these factors in the race time prediction models and so they will not be able to obtain an accurate prediction, or a prediction at all (in the case of some models).

To make a better model, Vickers and Vertosick performed a large survey. They put their survey on the news website [Slate.com](https://slate.com/) attached to a news story about race time prediction. They were able to obtain 2,497 responses. The survey included questions that allowed them to collect a data set that included: 
- age,
- sex,
- body mass index (BMI) (in kg/m^2),
- whether they are an endurance runner or speed demon,
- what type of shoes they wear,
- what type of training they do,
- race time for 2-3 races they completed in the last 6 months,
- self-rated fitness for each race,
- and race difficulty for each race.


Let's now use this data to explore a question we might be interested in: is there a relationship between 10 km race time and body mass index (BMI) for men runners in this data set? This is an exploratory data analysis question because we are looking for a relationship between measurements within the single data set we have and are not yet interested in interpreting beyond it. We can answer this question by visualizing the data as a scatter plot using R.

If, however, we were aiming to extend our findings to a broader population, make predictions, or analyze causes or mechanisms, we would need to state a different data analysis question and follow up with different analytical methods to answer that question.

To answer our exploratory question (is there a relationship between 10 km race time and body mass index (BMI) for men runners in this data set), we will need to do the following things in R:

1. load the data set into R
2. subset the data we are interested in visualizing from the loaded dataset
3. create a new column to get the unit of time in minutes instead of seconds
4. create a scatter plot using this modified data

> *Note 1 - subsetting the data and converting from seconds to minutes is not absolutely required to answer our question, but it will give us practice manipulating data in R, and make our data tables and figures more readable.*
>
> *Note 2 - many historical datasets treated sex as a variable where the possible values are only binary: male or female. This representation in this question reflects how the data were historically collected and is not meant to imply that we believe that sex is binary.*

**Question 7.0.1** Multiple Choice:
<br> {points: 1}

Which of the following will you *not* find included in Vickers and Vertosick's data set?

A. age

B. what each runner ate before the race 

C. body mass index

D. self-rated fitness for each race



*Assign your answer to an object called `answer7.0.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_7.0.1()

**Question 7.0.2** True or False: 
<br> {points: 1} 

The researchers compiled this data so that they could build better models to predict marathon race times. 

*Assign your answer to an object called `answer7.0.2`. Make sure your answer is in lowercase letters and is surrounded by quotation marks (e.g. `"true"` or `"false"`).* 

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_7.0.2()

**Question 7.0.3** Multiple Choice: 
<br> {points: 1}

What kind of graph will we be creating? Choose the correct answer from the options below. 

A. Bar Graph 

B. Pie Chart

C. Scatter Plot

D. Box Plot 

*Assign your answer to an object called `answer7.0.3`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_7.0.3()

### 7.1. Reading Data

Let's get started with our first step - loading the data set. The data set we are loading is called `marathon_small.csv` and it contains a subset of the data from the study described above. The file is in the same directory/folder as the file for this notebook. It is a comma separated file (meaning the columns are separated by the `,` character). We often refer to these files as `.csv`'s.


```
age,bmi,km5_time_seconds,km10_time_seconds,sex
25.0,21.6221160888672,NA,2798,female
41.0,23.905969619751,1210.0,NA,male
25.0,21.6407279968262,994.0,NA,male
35.0,23.5923233032227,1075.0,2135,male
34.0,22.7064037322998,1186.0,NA,male
45.0,42.0875434875488,3240.0,NA,female
33.0,22.5182952880859,1292.0,NA,male
58.0,25.2340793609619,NA,3420,male
29.0,24.505407333374,1440.0,3240,male
```

We can use the `read_csv` function to do this. Below is an example of reading a `.csv` file that is in the same directory/folder as the file for the notebook that would be reading it in:

<img src="images/ws1_read_csv_gen.png" width="500">

*Note - the quotes around the filename are important and you will get an error if you forget them.*

**Question 7.1.1** <br> {points: 1}

Use the `read_csv()` function to load the data from the `marathon_small.csv` file into R. Save the data to an object called `marathon_small`. If you need additional help try `?read_csv` and/or ask your neighbours or the Instructional team for help.

In [None]:
library(tidyverse)
# your code here
fail() # No Answer - remove if you provide an answer
marathon_small

In [None]:
test_7.1.1()

The pink output under the code cell above tells you a bit about what happened when `read_csv` read the data into R. It tells you that 5 columns were created (names: age, bmi, km5_time_seconds, km10_time_seconds and sex) as well as the type of the data in those columns (*e.g.*, number-type or text-type), specifically:

- `col_double` means that the data in this column is a number-type, specifically real numbers (meaning that these values *can contain decimals*) 
- `col_integer` means that the data in this column is a number-type, specifically integers (whole numbers) 
- `col_character` means that the data in this column contains text (e.g., letter or words)
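
These column types correspond to the basic types R uses for single values. As a rough sketch, the base R function `typeof()` reports the type of a value; the values below are made up for illustration.

```r
typeof(3.5)     # "double": a real number that can contain decimals
typeof(2L)      # "integer": the L suffix tells R to store a whole number as an integer
typeof("male")  # "character": text
```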

**Question 7.1.2** Multiple Choice <br> {points: 1}

From the list below, which is a valid way to store a data frame object read in from `read_csv` to an object in R?

A. `data -> read_csv("example_file.csv")`

B. `data <- read_csv("example_file.csv")`

C. `data <- read_csv"example_file.csv"`

D. `data <- read_csv(example_file.csv)`

*Assign your answer to an object called `answer7.1.2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_7.1.2()

### 7.2. Data frames

The functions from the `tidyverse` package give us a special class of data frame, called a tibble. You will learn more about tibbles in later chapters. For now, just know that a tibble is a type of data frame and we can look at the structure of a tibble by simply writing its name to view the output. 

In [None]:
marathon_small

This returns the first 3 and last 3 rows of the data frame. 

```
age	bmi	km5_time_seconds	km10_time_seconds	sex
<dbl>	<dbl>	<dbl>	<dbl>	<chr>
25	21.62212	NA	2798	female
41	23.90597	1210	NA	male
25	21.64073	994	NA	male
⋮	⋮	⋮	⋮	⋮
42	23.74768	1203	NA	male
23	24.20903	2040	NA	female
58	23.49177	1304	2819	male
```
By default, `read_csv` treats the first row of the file as the **header** and uses it to label the columns. Therefore, the first row contains descriptive names while the rows below contain the actual data. 

This only shows us a small portion of the data set. You can look at more of the data set by using the `print()` function and specifying the number of rows you want to print. 

In [None]:
print(marathon_small, n = 50)

This shows us the first 50 rows of the data set. We could look at the entire data set by changing the `n` argument, but printing many rows produces very long output that is rarely necessary. 
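To see how the `n` argument behaves on something small, here is a sketch that uses a made-up six-row tibble called `toy_data` (a hypothetical name and data set, not part of the marathon data).

```r
library(tidyverse)

# Hypothetical six-row tibble used only for illustration
toy_data <- tibble(
    colour = c("red", "blue", "blue", "red", "blue", "blue"),
    size = c(15, 19, 20, 22, 12, 23)
)

# Print only the first 3 rows
print(toy_data, n = 3)

# Print all 6 rows
print(toy_data, n = 6)
```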

**Question 7.2.1** <br> {points: 1}

To know how many rows there really are, use the function `nrow()`. Replace the `fail()` with your line of code. Assign the number of rows to the object `number_rows`.

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer
print(number_rows)

In [None]:
test_7.2.1()

### 7.3. Filter

One of the most useful functions of `tidyverse` is `filter()`. With this function, it is possible to keep only the observations (rows) whose entries in one or more columns meet a condition. 

For example, if we had a data set (named `data`) that looked like this:

```
  colour size speed
1    red   15  12.3
2   blue   19  34.1
3   blue   20  23.2
4    red   22  21.9
5   blue   12  33.6
6   blue   23  28.8
```

we could use the first line of the code in the image below to filter for rows where the colour has the value of "blue". The second line of code below would let us filter for rows where the size has a value greater than 20.

<img src="images/ws1_filter_gen.png" width="500">
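In case the image does not load, the two filters described above could be written roughly as follows. This is a sketch: the example `data` is rebuilt here as a tibble, and the object names `blue_rows` and `large_rows` are hypothetical.

```r
library(tidyverse)

# Rebuild the example data set shown above
data <- tibble(
    colour = c("red", "blue", "blue", "red", "blue", "blue"),
    size = c(15, 19, 20, 22, 12, 23),
    speed = c(12.3, 34.1, 23.2, 21.9, 33.6, 28.8)
)

# Keep only the rows where colour is "blue" (note the double equals sign)
blue_rows <- filter(data, colour == "blue")
blue_rows

# Keep only the rows where size is greater than 20
large_rows <- filter(data, size > 20)
large_rows
```

The text value `"blue"` needs quotation marks, while the column names `colour` and `size` do not.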

**Question 7.3.1** <br> {points: 1}

Use the function `filter()` to subset your data frame `marathon_small` so it only contains survey data from males. Assign your new filtered data frame to an object called `marathon_filtered`. Replace the `fail()` with your line of code.

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer
marathon_filtered

In [None]:
test_7.3.1()

### 7.4. Select

The `select()` function allows you to zoom in and focus on specific parts of the data. It is particularly helpful when working with extremely large datasets. More specifically, the function allows you to separate one or more columns from your dataset and transfer them into their own data frame.

Remembering our example `data`:

```
  colour size speed
1    red   15  12.3
2   blue   19  34.1
3   blue   20  23.2
4    red   22  21.9
5   blue   12  33.6
6   blue   23  28.8
```

For example, we can use the function `select()` to choose columns of interest (here colour and size). 


<img src="images/ws1_select_gen.png" width="500">


and we would get this smaller data set back:

```
  colour size
1    red   15
2   blue   19
3   blue   20
4    red   22
5   blue   12
6   blue   23
```
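A sketch of code that could produce the smaller data set above is shown here. The example `data` is rebuilt as a tibble, and `data_selected` is a hypothetical object name.

```r
library(tidyverse)

# Rebuild the example data set shown above
data <- tibble(
    colour = c("red", "blue", "blue", "red", "blue", "blue"),
    size = c(15, 19, 20, 22, 12, 23),
    speed = c(12.3, 34.1, 23.2, 21.9, 33.6, 28.8)
)

# Keep the colour and size columns; all rows are retained
data_selected <- select(data, colour, size)
data_selected
```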

**Question 7.4.1** <br> {points: 1}

Use the function `select` to choose the columns `bmi` and `km10_time_seconds` from `marathon_filtered`. Assign your new data frame to an object called `marathon_male`. 

Replace the `fail()` with your line of code. 

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer
marathon_male

In [None]:
test_7.4.1()

**Question 7.4.2** <br> {points: 1}

What are the units of the time taken to complete a run of 10 km? Assign your answer to an object called `answer7.4.2`. Write your answer in lower case. Place your answer between quotation marks.


*Hint: scroll up and look at the introduction to this exercise.* 

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_7.4.2()

**Question 7.4.3**
<br> {points: 1}

What are the units for time (e.g., seconds, minutes, hours) that we would like to use when plotting BMI against time taken to run 10 km? Assign your answer to an object called `answer7.4.3`. Write your answer in lower case. Place your answer between quotation marks.

*Hint: scroll up and look at the introduction to this exercise.*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_7.4.3()

### 7.5. Mutate

The function `mutate()` is used to add columns to a dataset, typically by making use of existing columns to compute a new column. 

<img src="images/ws1_mutate_gen.png">

In the example above, we are creating a new column named `new_column` that is equal to `old_column * 10` and saving the results to an object called `data_mutated`.
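The same idea can be run on a tiny made-up data set. In this sketch the three values in `old_column` are invented for illustration; the column and object names follow the description above.

```r
library(tidyverse)

# Hypothetical data set with a single column
data <- tibble(old_column = c(1, 2, 3))

# Add new_column, computed from old_column, and save the result
data_mutated <- mutate(data, new_column = old_column * 10)
data_mutated
```

The original `data` object is left unchanged; only `data_mutated` contains the extra column.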

**Question 7.5.1**<br> {points: 1}

Add a new column to our `marathon_male` dataset called `km10_time_minutes` that is equal to `km10_time_seconds / 60`. Assign your answer to an object called `marathon_minutes`.

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer
marathon_minutes

In [None]:
test_7.5.1()

### 7.6. Graphing
`ggplot` is a function that works using layers of code. Every time you want to see something new added to your plot, you must add a new layer with each layer being separated by the “+” symbol. The first function we use in this line of code is the `ggplot` function. Here, we indicate the arguments that apply to all layers of the plot. The second function we use is `geom_point()`. This function indicates that we wish to produce a scatterplot and the way we wish to display the data within this scatterplot. 






![ws1_ggplot_male.png](images/ws1_ggplot_male.png)
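As a small illustration of the layering idea, the sketch below plots the made-up colour/size/speed example from earlier rather than the marathon data. The object name `toy_plot` is hypothetical.

```r
library(tidyverse)

# Rebuild the small example data set used in the filter and select sections
data <- tibble(
    colour = c("red", "blue", "blue", "red", "blue", "blue"),
    size = c(15, 19, 20, 22, 12, 23),
    speed = c(12.3, 34.1, 23.2, 21.9, 33.6, 28.8)
)

# First layer: ggplot() sets the data and the x/y mapping shared by all layers
# Second layer: geom_point() draws one point per row, giving a scatterplot
toy_plot <- ggplot(data, aes(x = size, y = speed)) +
    geom_point()
toy_plot
```

Each `+` must come at the end of a line, not the start of the next one, so that R knows the plot is not finished yet.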

Let's plot a scatterplot with `bmi` on the x axis and `km10_time_minutes` on the y axis.

In [None]:
# code to set-up plot size
library(repr)
options(repr.plot.width=8, repr.plot.height=7) #play with width/height to make the plot below a reasonable size

In [None]:
# Run this cell to create a scatterplot of BMI against the time it took to run 10 km. 
ggplot(data = marathon_minutes, aes(x = bmi, y = km10_time_minutes)) + 
    geom_point() + 
    theme(text = element_text(size=20)) #play with the size value to change the text size in the plot

> We also use the `options()` function to set the size of the plot, and the `theme()` function to control the size of the text in the plot. Note that using the `options()` function sets the default plot size for the whole notebook, while the `theme()` function only sets the font size for the current plot created by `ggplot`. So later on in the notebook we will create another plot, but we will only use `theme()` -- we won't use `options()` again.

**Question 7.6.1** Multiple Choice
<br> {points: 1}

Looking at the graph above, choose the statement below that best reflects what we see.

A. There appears to be no relationship between 10 km run time and body mass index. As the value for body mass index increases we see neither an increase nor decrease in the time it takes to run 10 km.

B. There may be a positive relationship between 10 km run time and body mass index. As the value for body mass index increases, so does the time it takes to run 10 km.

C. There may be a negative relationship between 10 km run time and body mass index. As the value for body mass index increases, the time it takes to run 10 km decreases.




*Assign your answer to an object called `answer7.6.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_7.6.1()

The code we listed above for graphics barely scratches the surface of what ggplot, and R as a whole, are capable of. Not only are there far more choices about the kinds of plots available, but there are many, many options for customizing the look and feel of each graph. You can choose the font, the font size, the colors, the style of the axes, etc. 

Let’s dig a little deeper into just a couple of options that you can add to any of your graphs to make them look a little better. For example, you can change the text of the x-axis label or the y-axis label by using `xlab("")` or `ylab("")`. Let’s do that for the scatterplot to make the labels easier to read.

*Notice the formatting of the cell.*

In [None]:
# Run this cell. 
# You can replace the axis labels with whatever text you wish. 
# After running the cell once, try changing the labels to something else. 

ggplot(data = marathon_minutes, aes(x = bmi, y = km10_time_minutes)) + 
    geom_point() + 
    xlab("Body Mass Index") + 
    ylab("10 km run time (minutes)") +
    theme(text = element_text(size = 20))

## Attributions
- UC Berkeley [Data 8 Public Materials](https://github.com/data-8/data8assets)

In [None]:
source("cleanup.R")