As always, the first thing we will do is load the tidyverse.
Note: If you haven’t yet installed the tidyverse, you’ll first have to run the code install.packages(“tidyverse”).
library(tidyverse)
Here’s the dataframe that we’ll analyze in this exercise:
scores <-
tibble(
name = c("mike", "carol", "greg", "marcia", "peter", "jan", "bobby", "cindy", "alice"),
school = c("south", "south", "south", "south", "north", "north", "north", "south", "south"),
teacher = c("johnson", "johnson", "johnson", "johnson", "smith", "smith", "smith", "perry", "perry"),
sex = c("male", "female", "male", "female", "male", "female", "male", "female", "female"),
math_score = c(4, 3, 2, 4, 3, 4, 5, 4, 5),
reading_score = c(1, 5, 2, 4, 5, 4, 1, 5, 4)
)
Let’s first take a look at it:
scores
## # A tibble: 9 x 6
## name school teacher sex math_score reading_score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 mike south johnson male 4 1
## 2 carol south johnson female 3 5
## 3 greg south johnson male 2 2
## 4 marcia south johnson female 4 4
## 5 peter north smith male 3 5
## 6 jan north smith female 4 4
## 7 bobby north smith male 5 1
## 8 cindy south perry female 4 5
## 9 alice south perry female 5 4
Before we get started, I want to make sure you understand the difference between doing something and assigning it to a new name and just doing it without naming it. For example, make sure you understand the following:
# get the first 3 rows
scores %>%
slice(1:3)
## # A tibble: 3 x 6
## name school teacher sex math_score reading_score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 mike south johnson male 4 1
## 2 carol south johnson female 3 5
## 3 greg south johnson male 2 2
# get the first 3 rows, and assign it to a new name "scores_small"
scores_small <- scores %>%
slice(1:3)
# see what's in "scores_small"
scores_small
## # A tibble: 3 x 6
## name school teacher sex math_score reading_score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 mike south johnson male 4 1
## 2 carol south johnson female 3 5
## 3 greg south johnson male 2 2
In this exercise we’ll typically just print the results and not save them, but it’s an option if you want it!
Now we can get to the exercise. Most sections will begin with an example for you to look at. When questions require a written answer, there will be an “Answer” line for you to complete.
Question: Sort the data by math_score from high to low. Who had the best math score?
scores %>%
arrange(desc(math_score))
## # A tibble: 9 x 6
## name school teacher sex math_score reading_score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 bobby north smith male 5 1
## 2 alice south perry female 5 4
## 3 mike south johnson male 4 1
## 4 marcia south johnson female 4 4
## 5 jan north smith female 4 4
## 6 cindy south perry female 4 5
## 7 carol south johnson female 3 5
## 8 peter north smith male 3 5
## 9 greg south johnson male 2 2
Answer: Bobby and alice both tied for the highest math score
Question: Sort the data by name from first to last in the alphabet.
Question: Sort the data by sex so females show up first. Which sex appears to have better reading scores?
Answer:
Question: Sort the data by school, then teacher, then sex, then math_score, and finally by reading_score.
Question: Select only the name, math_score, and reading_score columns.
scores %>%
select(name, math_score, reading_score)
## # A tibble: 9 x 3
## name math_score reading_score
## <chr> <dbl> <dbl>
## 1 mike 4 1
## 2 carol 3 5
## 3 greg 2 2
## 4 marcia 4 4
## 5 peter 3 5
## 6 jan 4 4
## 7 bobby 5 1
## 8 cindy 4 5
## 9 alice 5 4
Question: Select all of the columns except the sex column.
Question: Select all of the columns except the math_score and reading_score columns.
Question: Keep all of the columns but rearrange them so sex is the first column.
Question: Filter to students who are male and went to south.
# Option 1
scores %>%
filter(sex == "male" & school == "south")
## # A tibble: 2 x 6
## name school teacher sex math_score reading_score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 mike south johnson male 4 1
## 2 greg south johnson male 2 2
# Option 2
scores %>%
filter(sex == "male", school == "south")
## # A tibble: 2 x 6
## name school teacher sex math_score reading_score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 mike south johnson male 4 1
## 2 greg south johnson male 2 2
Question: Filter to students who did well in math (you decide what “well” means).
Question: Use filter to figure out how many students had a math score of 4 or more and a reading score of 3 or more.
Answer:
Question: Explain the errors in each of the following code blocks, then fix it to make it right!
# code block 1
scores %>%
filter(school == south)
## Error: object 'south' not found
# code block 2
scores %>%
filter(school = "south")
## Warning: `rlang__backtrace_on_error` is no longer experimental.
## It has been renamed to `rlang_backtrace_on_error`. Please update your RProfile.
## This warning is displayed once per session.
## `school` (`school = "south"`) must not be named, do you need `==`?
# fix it!
Answer:
Question: You are creating a remediation program. Filter to students who got a 3 or worse in either math or reading.
Question: Filter to students who got a reading score of 2, 3, or 4.
Question: Filter to students who have a name that starts with an “m”. Hint: type “?substr” in the console and then scroll to the bottom of the help file to see useful examples.
Question: Filter to teachers whose best math student got a score of 5.
scores %>%
group_by(teacher) %>%
filter(max(math_score) == 5)
## # A tibble: 5 x 6
## # Groups: teacher [2]
## name school teacher sex math_score reading_score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 peter north smith male 3 5
## 2 jan north smith female 4 4
## 3 bobby north smith male 5 1
## 4 cindy south perry female 4 5
## 5 alice south perry female 5 4
Question: Filter to the sex with a mean math score of 4.
Question: Explain why the following code removes students who have perry as their teacher.
Answer:
Question: Both the math and reading scores were actually out of 50 – replace both variables to be 10 times their original values.
scores %>%
mutate(math_score = math_score * 10,
reading_score = reading_score * 10)
## # A tibble: 9 x 6
## name school teacher sex math_score reading_score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 mike south johnson male 40 10
## 2 carol south johnson female 30 50
## 3 greg south johnson male 20 20
## 4 marcia south johnson female 40 40
## 5 peter north smith male 30 50
## 6 jan north smith female 40 40
## 7 bobby north smith male 50 10
## 8 cindy south perry female 40 50
## 9 alice south perry female 50 40
Question: Create a new column called “math_reading_avg” which is the average of a students math and reading scores.
Question: Create a new column “high_math_achiever” that is an indicator of if a student got a 4 or better on their math_score.
Question: Create a new column “reading_score_centered” that is a students reading score with the mean of all students’ reading scores subtracted from it.
Question: Create a new column “science_score”. You can make up what the actual scores are!
Question: Mike and cindy both got a 4 for their math score. Explain why why Mike has a higher “math_score_centered_by_sex” score.
scores %>%
group_by(sex) %>%
mutate(math_score_centered_by_sex = math_score - mean(math_score))
## # A tibble: 9 x 7
## # Groups: sex [2]
## name school teacher sex math_score reading_score math_score_centered…
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 mike south johnson male 4 1 0.5
## 2 carol south johnson fema… 3 5 -1
## 3 greg south johnson male 2 2 -1.5
## 4 marcia south johnson fema… 4 4 0
## 5 peter north smith male 3 5 -0.5
## 6 jan north smith fema… 4 4 0
## 7 bobby north smith male 5 1 1.5
## 8 cindy south perry fema… 4 5 0
## 9 alice south perry fema… 5 4 1
Answer:
Question: Create a “reading_score_centered_by_teacher” column. What can you learn from it?
Answer:
Question: Make a “number_of_students_in_class” column that is number of students in a student’s class. For example, it should be 4 for mike and 3 for peter.
Question: Use the summarize command to find the mean math score for all students.
scores %>%
summarize(math_score_mean = mean(math_score))
## # A tibble: 1 x 1
## math_score_mean
## <dbl>
## 1 3.78
Question: Use the summarize command to find the mean reading score for all students.
Question: Use the summarize command to find the median for both math scores and reading scores.
Question: Look closely at the following code. Why is it throwing an error? How can Rstudio help you see this error?
Answer: We need another “)” at the end of the code. The first “)” is for the min function but we also need a “)” to end the summarize function. Rstudio helps because if you go to the right of a paranthese, it highlights the corresponding closing paranthese.
Question: Find the minimum math score for each school.
scores %>%
group_by(school) %>%
summarize(min_math_score = min(math_score))
## # A tibble: 2 x 2
## school min_math_score
## <chr> <dbl>
## 1 north 3
## 2 south 2
Question: Find the maximum math score for each teacher.
Question: If we grouped by sex, and then summarized with the minimum reading score, how many rows would the resulting data frame have?
Answer:
Question: Remember that mutate always keeps the same number of rows but summarize usually reduces the number of rows. Why doesn’t the following use of summarize reduce the number of rows?
Answer:
Question: Create a data frame with the mean and median reading score by sex, as well as the number of students of that sex.
Question: Select just the name and math_score columns. Then create a new column “math_score_ec” that is a students math score plus 5 extra credit points. Finally, arrange the data frame by math_score_ec from low to high.
scores %>%
select(name, math_score) %>%
mutate(math_score_ec = math_score + 5) %>%
arrange(math_score_ec)
## # A tibble: 9 x 3
## name math_score math_score_ec
## <chr> <dbl> <dbl>
## 1 greg 2 7
## 2 carol 3 8
## 3 peter 3 8
## 4 mike 4 9
## 5 marcia 4 9
## 6 jan 4 9
## 7 cindy 4 9
## 8 bobby 5 10
## 9 alice 5 10
Question: Select every column except the teacher column. Create a new variabled called “mean_score” that is the mean of a student’s math and reading score. Finally, arrange the data frame by mean_score from low to high.
Question: Remove any students with smith as a teacher, then find the mean math_score by sex.
Question: Find the min, max, and median reading_score for female students at south school.
Question: Inspect each of the following code blocks. They both do about the same thing. Which one do you think is preffered from a computer efficiency standpoint?
# code block 1
scores %>%
group_by(school, teacher) %>%
summarize(max_math_score = max(math_score)) %>%
filter(school == "south")
## # A tibble: 2 x 3
## # Groups: school [1]
## school teacher max_math_score
## <chr> <chr> <dbl>
## 1 south johnson 4
## 2 south perry 5
# code block 2
scores %>%
filter(school == "south") %>%
group_by(teacher) %>%
summarize(max_math_score = max(math_score))
## # A tibble: 2 x 2
## teacher max_math_score
## <chr> <dbl>
## 1 johnson 4
## 2 perry 5
Answer:
Play around with these tools. Write a question or two that you think best exposes a misunderstanding you had or drills down on an important thing to remember. I’d love to add these questions in the future! Feel free to email what you came up with to stenhaug@stanford.edu.