1.2 Data Wrangling

Workshop: “Handling Uncertainty in your Data”

Dr. Mario Reutter & Juli Nagel
(slides adapted from Dr. Lea Hildebrandt)

Data Wrangling?

“Preparation” of the data for analysis: cleaning up variables (outliers, erroneous values, recoding…), changing the structure/format of data frames, merging/joining data sets, calculating new variables, reducing/summarizing variables…

You will spend a lot more time wrangling the data than analyzing it!

You could do this manually (e.g. in Excel), but this is tedious, error prone & not reproducible! (+ Datasets can be huge!)

Fortunately, it is easy to do in R

Accessing Variables/Columns

When wrangling your data in R, you often want to access/use different columns, e.g. to calculate new ones. There are a number of ways you can do that:

# create a small data set for this example:
testdata <- data.frame(a = c(1, 2, 3),  # c() creates a vector!
                       b = c("a", "b", "c"),
                       c = c(4, 5, 6),
                       d = c(7, 8, 9),
                       e = c(10, 11, 12))

print(testdata)

  a b c d  e
1 1 a 4 7 10
2 2 b 5 8 11
3 3 c 6 9 12

str(testdata)

'data.frame':   3 obs. of  5 variables:
 $ a: num  1 2 3
 $ b: chr  "a" "b" "c"
 $ c: num  4 5 6
 $ d: num  7 8 9
 $ e: num  10 11 12

Accessing Variables/Columns 2

When wrangling your data in R, you often want to access/use different columns, e.g. to calculate new ones. There are a number of ways you can do that:

# in baseR, we access elements of a data.frame with square brackets
testdata[1, 2] # get cell that is in first row and second column

[1] "a"

testdata[1:2, 4:5] # use a colon to create ranges of values: first two rows and column numbers 4 and 5

  d  e
1 7 10
2 8 11

# we can leave one part empty to select ALL available columns/rows
testdata[1:2, ] # first two rows, all columns

  a b c d  e
1 1 a 4 7 10
2 2 b 5 8 11

testdata[, 4:5] # columns number 4 and 5, all rows

Accessing Variables/Columns 3

When wrangling your data in R, you often want to access/use different columns, e.g. to calculate new ones. There are a number of ways you can do that:

# it is usually better to access columns by their column name:
testdata[c("d", "e")] # columns with names "d" and "e", all rows

# access a column only:
testdata[["a"]] # double square brackets to get a vector (not a data.frame)

[1] 1 2 3

testdata$a # short notation to get column "a" as a vector

[1] 1 2 3

better avoid the comma when accessing by column name:

  a b
1 1 a
2 2 b
3 3 c

[1] 1 2 3

  a b
1 1 a
2 2 b
3 3 c

[1] 1 2 3

=> consistent data format without leading comma

Accessing Variables/Columns 4

When wrangling your data in R, you often want to access/use different columns, e.g. to calculate new ones. There are a number of ways you can do that:

# tidy versions (see next slides)
library(tidyverse) # load tidyverse (if not already done)
pull(testdata, a) # same as testdata$a but can be used better in pipes (see next slide)

[1] 1 2 3

select(testdata, a, b) # get column(s) as a data.frame; no c() needed!

  a b
1 1 a
2 2 b
3 3 c

Tidyverse

You can do all data wrangling in Base R, i.e. without loading any packages. However, there’s a neat collection of packages called tidyverse, which makes data wrangling even easier!

Tidyverse 2

Base R:

output_data1 <- function1(data)

output_data2 <- function2(output_data1, param1)

output_data3 <- function3(output_data2, param2, param3)

Or:

output_data <- function3(function2(function1(data), param1), param2, param3)

Tidyverse:

output_data <- data %>% function1() %>% function2(param1) %>% function3(param2, param3)

You can insert a pipe %>% (including spaces) by pressing Ctrl + Shift + M

Tidyverse 3

library(tidyverse) will load a number of packages, such as dplyr, ggplot2, readr, forcats, tibble etc., which are all usefuls for data wrangling.

We will work mainly with functions from the dplyr package, but also use readr to read in data. We will also use ggplot2 to visualize data.

The most important dplyr functions for data wrangling are:

Function	Description
select()	Include or exclude certain columns (variables)
filter()	Include or exclude certain rows (observations)
mutate()	Create new columns (variables)
summarize()	Create new columns that aggregate data/create summary variables for groups of observations (data frame will become smaller)
group_by()	Organize the rows (observations) into groups
arrange()	Change the order of rows (observations)

Wrangling Babynames

This is Data wrangling 1 of our QuantFun book: https://psyteachr.github.io/quant-fun-v2/data-wrangling-1.html

Setting up libraries

Open your Workshop R project.
Create a new R script and save it, e.g. as “DataWrangling1.R”.
Insert code to make sure the packages “tidyverse” and “babynames” are installed and loaded.

# install.packages("tidyverse")
# install.packages("babynames")

library(babynames)
library(tidyverse)

Look at the Data

Type the word babynames into your console pane and press enter. What kind of information do you get?
- “A tibble: 1,924,665 x 5”
  - tibble is an extension of the data.frame with more convenient output (e.g., values rounded to significant digits)
  - ~1.9 million rows/observations
  - 5 columns/variables
What kind of columns/variables do we have?
- dbl = double/numeric (can take decimals)
- chr = character/string (letters or words)
- int = integer (only whole numbers)

Selecting Variables of Interest

Use select() to choose only the columns year, sex, name, and prop and store it as a new tibble called babynames_reduced.
Remember that you can run ?select in the console if you need help about, e.g., input/arguments to the function.

# my favorite:
babynames_reduced <- 
  babynames %>% 
  select(year, sex, name, prop)

# or alternatively:
babynames_reduced <- 
  babynames %>% 
  select(-n) # remove columns by using -

Removing columns vs. selecting columns: Results may change if the data get updated!

Arranging Data

Change the order of the data (oberservations/rows)!

Using arrange(), try sorting the data according to the names column. What happens?
How can you sort a column in a descending fashion? Check out the help file (?arrange).
Let’s sort by year descendingly and within each year, sort names alphabetically.

sort_asc <- babynames %>% arrange(name)

sort_desc <- babynames %>% arrange(desc(year)) 

babynames %>% arrange(desc(year), name)

# A tibble: 1,924,665 × 5
    year sex   name          n       prop
   <dbl> <chr> <chr>     <int>      <dbl>
 1  2017 M     Aaban        11 0.0000056 
 2  2017 F     Aabriella     6 0.0000032 
 3  2017 M     Aadam        18 0.00000917
 4  2017 M     Aadan         8 0.00000407
 5  2017 M     Aadarsh      15 0.00000764
 6  2017 M     Aaden       240 0.000122  
 7  2017 M     Aadesh        7 0.00000357
 8  2017 M     Aadhav       31 0.0000158 
 9  2017 M     Aadhavan      6 0.00000306
10  2017 M     Aadhi        10 0.00000509
# ℹ 1,924,655 more rows

Filter Observations

We have already used select() to keep only certain variables (columns), but often we also want to keep only certain observations (rows), e.g. babies born in the year 2000 and later.

We use the function filter() for this.

Look at the following code and think about what it might do.

babynames %>% 
  filter(year > 2000)

# A tibble: 562,156 × 5
    year sex   name          n    prop
   <dbl> <chr> <chr>     <int>   <dbl>
 1  2001 F     Emily     25055 0.0127 
 2  2001 F     Madison   22164 0.0112 
 3  2001 F     Hannah    20712 0.0105 
 4  2001 F     Ashley    16526 0.00835
 5  2001 F     Alexis    16401 0.00828
 6  2001 F     Sarah     15896 0.00803
 7  2001 F     Samantha  15862 0.00801
 8  2001 F     Abigail   14807 0.00748
 9  2001 F     Elizabeth 14784 0.00747
10  2001 F     Olivia    13978 0.00706
# ℹ 562,146 more rows

The data starts at 2001! :(

Boolean Expressions

The second argument, year > 2000, is a Boolean or logical expression, which means that it results in a value of either TRUE or FALSE. filter() runs this expression and then removes all values/rows that contain FALSE.

Boolean Expressions 2

Boolean expressions
Operator	Name	is TRUE if and only if
A < B	less than	A is less than B
A <= B	less than or equal	A is less than or equal to B
A > B	greater than	A is greater than B
A >= B	greater than or equal	A is greater than or equal to B
A == B	equivalence	A exactly equals B
A != B	not equal	A does not exactly equal B
A %in% B	in	A is an element of vector B

Filter some more 1

Keep only those observations with the name “Mary”.
Discard all observations with name “Mary” and keep only those from year > 2000.
Keep only those with names of former Queens (Mary, Elizabeth, Victoria).
Discard the ones with the Queen names!

First task:

marys <- 
  babynames %>% 
  filter(name == "Mary")

Filter some more 2

Keep only those observations with the name “Mary”.
Discard all observations with name “Mary” and keep only those from year > 2000.
Keep only those with names of former Queens (Mary, Elizabeth, Victoria).
Discard the ones with the Queen names!

Second task:

This might be difficult because you have two expressions, name != "Mary" and year > 2000. You can simply add several expressions separated by commas in filter (commas are treated like a “logical and” &):

no_marys_young <- 
  babynames %>% 
  filter(name != "Mary", year > 2000)

Filter some more 3

Keep only those observations with the name “Mary”.
Discard all observations with name “Mary” and keep only those from year > 2000.
Keep only those with names of former Queens (Mary, Elizabeth, Victoria).
Discard the ones with the Queen names!

Third task:

queens <- 
  babynames %>% 
  filter(
    name == "Mary" | # the vertical line is a logical OR
    name == "Elizabeth" | 
    name == "Victoria"
  )

A better shorthand exists with the operator %in%:

queens <- 
  babynames %>% 
  filter(name %in% c("Mary", "Elizabeth", "Victoria"))

Filter some more 4

Keep only those observations with the name “Mary”.
Discard all observations with name “Mary” and keep only those from year > 2000.
Keep only those with names of former Queens (Mary, Elizabeth, Victoria).
Discard the ones with the Queen names!

Fourth task:

This is very tricky! You could use three filters in a row with:
name != "Mary", name != "Elizabeth", name != "Victoria".

There is no function “not in” but you can negate the result in two ways:

no_queens <- 
  babynames %>% 
  filter(!name %in% c("Mary", "Elizabeth", "Victoria")) # ! is a negation ("not")

no_queens <- 
  babynames %>% 
  filter(name %in% c("Mary", "Elizabeth", "Victoria") == FALSE)

Your First Plot

In your script, insert and run the following code:

babynames %>% 
  filter(
    sex == "F", # only female babies
    name %in% c("Emily", "Kathleen", "Alexandra", "Beverly") # reduce to these 4 names
  ) %>% 
  ggplot(aes(x = year, y = prop, colour = name)) +
  geom_line(linewidth = 2) # plot data as a line (with increased size)

Alter the code to check for male babies with the same names (change sex == "F" to sex == "M").
Optional: Plot the absolute number n instead of the relative proportion prop.

Create New Variables

If we want to create variables that do not exist yet (i.e. by calculating values, combining other variables, etc.), we can use mutate()!

Add a variable called “country” that contains the value “USA” for all observations

baby_where <- 
  babynames %>% 
  mutate(country = "USA")

But mutate is much more powerful and can create variables that differ per observation, depending on other values in the tibble/data frame:

Create New Variables 2

Create a variable that denotes the decade a baby was born:

# we can only use floor to round down to full numbers => divide year by 10, floor it, and then multiply by 10 again
baby_decades <- 
  babynames %>% 
  mutate(decade = floor(year / 10) * 10) # round(year, -1) works but not floor(year, -1) :(

# A tibble: 10 × 2
    year decade
   <dbl>  <dbl>
 1  1894   1890
 2  1896   1890
 3  1929   1920
 4  1944   1940
 5  1951   1950
 6  1959   1950
 7  1974   1970
 8  1977   1970
 9  1979   1970
10  2009   2000

Summarizing

The goal of data wrangling is often to summarize (or aggregate) the data, e.g. to have an average value per condition. Sometimes you’d also want to calculate descriptive statistics to report.

You can do so using the function summarize():

# run the filter function just like above again:
dat <- 
  babynames %>% 
  filter(
    name %in% c("Emily", "Kathleen", "Alexandra", "Beverly"), 
    sex == "F"
  )

# summarize the data, calculating the number of oberservations:
dat_sum <- dat %>% summarize(total = sum(n))
dat_sum

# A tibble: 1 × 1
    total
    <int>
1 2161374

Summarizing 2

dat_sum

# A tibble: 1 × 1
    total
    <int>
1 2161374

As you can see, a new variable named total is created, which contains the total number of observations (in this case, it is different from the number of rows because each row already contains a count n).
There’s just one row in the resulting data frame, because summarize() reduces the data frame (to only include the necessary information)!

Grouping and Summarizing

Often, we want to summarize data for specific subgroups. For this aim, summarize() has the .by parameter:

group_sum <- 
  dat %>% 
  summarize(total = sum(n), .by = name) 

group_sum

# A tibble: 4 × 2
  name       total
  <chr>      <int>
1 Emily     841491
2 Kathleen  711605
3 Beverly   376914
4 Alexandra 231364

Grouping and Summarizing 2

You can also subgroup by a combination of variables:

babynames %>% 
  filter(name %in% c("Emily", "Kathleen", "Alexandra", "Beverly")) %>% # we start with the 4 names regardless of sex
  summarize(
    total = sum(n),
    .by = c(name, sex) # and then summarize by name, separated for sex
  )

# A tibble: 8 × 3
  name      sex    total
  <chr>     <chr>  <int>
1 Emily     F     841491
2 Kathleen  F     711605
3 Beverly   M       4633
4 Beverly   F     376914
5 Alexandra F     231364
6 Emily     M       1744
7 Kathleen  M       1692
8 Alexandra M        859

Grouping and Summarizing 3

In earlier versions, we had to use summarize() together with group_by():

group_sum <- dat %>% group_by(name) %>% summarize(total = sum(n))

We avoid using group_by() like this because it can have unintended side effects.
It is just part of this class because you will likely encounter it in somebody else’s (old) code.

If you do have to use it, make sure to ungroup() after summarize() (or mutate()) to avoid unintended effects:

group_sum <- dat %>% group_by(name) %>% summarize(total = sum(n)) %>% ungroup()
group_sum <- dat %>% group_by(name) %>% summarize(total = sum(n), .groups = "drop")

Grouping and Summarizing 4

Use the baby_decades data frame to calculate the mean and median number of observations, grouped by sex & decade.

baby_decades %>% 
  summarize(
    mean_year = mean(n),
    median_year = median(n),
    .by = c(sex, decade)
  )

# A tibble: 28 × 4
   sex   decade mean_year median_year
   <chr>  <dbl>     <dbl>       <dbl>
 1 F       1880     111.           13
 2 M       1880     101.           12
 3 F       1890     128.           13
 4 M       1890      93.6          12
 5 F       1900     131.           12
 6 M       1900      94.4          12
 7 F       1910     187.           12
 8 M       1910     181.           12
 9 F       1920     211.           12
10 M       1920     227.           13
# ℹ 18 more rows

Counting Data

There are several ways to get the number of rows per group. You can use the function n() within a call to summarize() (or mutate()). A shortcut is to use count():

dat %>% summarize(n = n(), .by = name)

# A tibble: 4 × 2
  name          n
  <chr>     <int>
1 Emily       138
2 Kathleen    138
3 Beverly     122
4 Alexandra   117

dat %>% count(name)

# A tibble: 4 × 2
  name          n
  <chr>     <int>
1 Alexandra   117
2 Beverly     122
3 Emily       138
4 Kathleen    138

Interestingly, the order of the output may vary. summarize() leaves the data in the original order (i.e., by prop, which (likely) translates to an order by n()). count() arranges the output by the variables for which the counting is done (here: alphabetically by name).

Bigger Pipes!

So far we have often saved intermediate steps in tibbles and used those as input for the next function. With the pipe, we can chain several functions and save relevant results only, no need for crowding the environment with intermediate data.frames or tibbles!

pipe_summary <- 
  babynames %>%
  mutate(decade = floor(year / 10) * 10) %>%
  filter(
    name %in% c("Emily", "Kathleen", "Alexandra", "Beverly"),
    sex == "F"
  ) %>%
  summarize(
    mean_decade = mean(n),
    .by = c(name, decade)
  )

It’s not easy to decide which intermediate steps to save and which not. Usually, it involves some sort of trial and error. Sometimes you go back and break a pipe apart. Sometimes you get overwhelmed by the number of variables in your environment and create bigger pipes.
As a rule of thumb: If an intermediate step is only used once, you should probably delete it (unless it makes the code easier to comprehend).

More serious Data Wrangling

This is Data wrangling 3 of our QuantFun book: https://psyteachr.github.io/quant-fun-v2/data-wrangling-3.html

We will skip Data wrangling 2 here but you will find the solutions in the end of the slides.

Tidy Data

Tidy data: Data that is easily processed by tidyverse functions (also for visualizations and statistical analyses).

Three principles:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

Tidy Data: wide vs. long format

Wide format: Each participant/animal has one row;
repeated observations are in several columns

ID	Time_1	Time_2
a1	230	310
a2	195	220
a3	245	290

Long format: Each observation has its own row;
there are (usually) several rows per participant

ID	Time	Value
a1	1	230
a1	2	310
a2	1	195
a3	2	220
a3	1	245
a3	2	290

Wide format implements a sparser representation of the data but less tidy!
If you want to convert Time from milliseconds into seconds, what do you have to do in both formats?

Tidy Data 2

What do you think, which of the following data sets is tidy?

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

# A tibble: 3 × 5
  country     `1999_cases` `2000_cases` `1999_population` `2000_population`
  <chr>              <dbl>        <dbl>             <dbl>             <dbl>
1 Afghanistan          745         2666          19987071          20595360
2 Brazil             37737        80488         172006362         174504898
3 China             212258       213766        1272915272        1280428583

Analyzing the Autism Spectrum Quotient

For the following activities, we will need the following files:

responses.csv containing the AQ survey responses to each of the 10 questions for the 66 participants
qformats.csv containing information on how a question should be coded - i.e. forward or reverse coding
scoring.csv containing information on how many points a specific response should get; depending on whether it is forward or reverse coded
pinfo.csv containing participant information such as Age, Sex and importantly ID number.

Set Up

Create a new script, e.g. as “DataWrangling3.R” (remember we skipped #2 in the book).
Download the data into your project folder:
responses.csv
qformats.csv
scoring.csv
pinfo.csv
Clear your environment (the brush in the top right pane) and/or restart the R session (Session -> Restart R).
Load the four .csv files into your environment, e.g.:

library(tidyverse)
responses <- read_csv("responses.csv") 
qformats <- read_csv("qformats.csv")
scoring <- read_csv("scoring.csv")
pinfo <- read_csv("pinfo.csv")

Look at the Data

Is the data (responses) in a tidy format?

# A tibble: 6 × 11
     Id Q1                 Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10  
  <dbl> <chr>              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1    16 Slightly Disagree  Defi… Slig… Defi… Slig… Slig… Slig… Defi… Slig… Slig…
2    17 Definitely Agree   Slig… Slig… Defi… Defi… Defi… Slig… Slig… Slig… Slig…
3    18 Definitely Agree   Defi… Slig… Defi… Defi… Defi… Slig… Defi… Defi… Defi…
4    19 Definitely Agree   Defi… Defi… Slig… Defi… Defi… Slig… Slig… Defi… Slig…
5    20 Definitely Disagr… Slig… Defi… Slig… Slig… Slig… Slig… Slig… Slig… Slig…
6    21 Slightly Disagree  Slig… Defi… Slig… Slig… Slig… Defi… Defi… Slig… Slig…

Why is it not tidy?

Reformatting the Data

Let’s bring the wide data in a longer, tidy format!

There are several functions in R to reformat data, but the newest ones are pivot_longer() and pivot_wider().

Run the code and see what changes:

rlong <- 
  responses %>% 
  pivot_longer(
    cols = Q1:Q10, # we can select a range of column names
    # cols = starts_with("Q"), # alternative
    names_to = "Question", 
    values_to = "Response"
  )

Describe what the function does, what does the input/the arguments mean?

Joining the Data

We now want to combine the different data sets: We want to have the information how the questionnaire has to be scored included with the items.

We can find the scoring information (i.e. how the questions are framed, positive or negative/whether they need to be reversed) in the qformats tibble. Furthermore, we can find how many points are given to each item/response in scoring.

We can use the function inner_join() to merge the tibbles into one bigger tibble.

Activity: Replace the NULL values in the below code with the necessary variable names to join rlong and qformats by Question.

rlong2 <- 
  inner_join(x = NULL, y = NULL, by = "NULL")

rlong2 <- 
  inner_join(
    x = rlong, 
    y = qformats, 
    by = "Question"
  )

Combining more Data

You can only join two data frames/tibbles at once.
Now add the scoring data:

rscores <- 
  rlong2 %>% 
  inner_join(
    scoring, 
    c("QFormat", "Response")
  )

You can also let the function figure out by itself which columns should be used for joining:

rscores <- inner_join(rlong2, scoring)

Joining with `by = join_by(Response, QFormat)`

And if you are happy with the result, copy the information into your code to make the join explicit:

rscores <- inner_join(rlong2, scoring, 
                      by = join_by(Response, QFormat)) #same as by = c("QFormat", "Response")

Calculate the Questionnaire Scores

How do we need to group and summarize the data to get a sum score per person? (Ignoring the reverse coding for now!) Add the correct column names instead of the NULL.

aq_scores <- 
  rscores %>% 
  summarize(
    AQ = sum(NULL), 
    .by = NULL
  )

aq_scores <- 
  rscores %>% 
  summarize(
    AQ = sum(Score), # sum column Score to obtain AQ scores.
    .by = Id # separately for each Id (participant)
  )

Pipe it all together!

aq_scores2 <- 
  responses %>% 
  pivot_longer(
    cols = Q1:Q10,
    names_to = "Question", 
    values_to = "Response"
  ) %>%  
  inner_join(qformats, "Question") %>% 
  inner_join(scoring, c("QFormat", "Response")) %>% 
  summarize(AQ = sum(Score), .by = Id)

Optional Exercises

This is Data wrangling 2 of our QuantFun book: https://psyteachr.github.io/quant-fun-v2/data-wrangling-2.html

Background

We’ll use data from a paper that investigates whether the ability to perform an action influences perception. In particular, the authors wondered whether participants who played Pong would perceive the ball to move faster when they have a small paddle.

Download the data, create a new script.
Clear the environment if you prefer.
Look at the data.

Solutions

library("tidyverse")
pong_data <- read_csv("Data/PongBlueRedBack 1-16 Codebook.csv") # I put the data into a separate subfolder "Data"
summary(pong_data)

  Participant     JudgedSpeed      PaddleLength   BallSpeed    TrialNumber    
 Min.   : 1.00   Min.   :0.0000   Min.   : 50   Min.   :2.0   Min.   :  1.00  
 1st Qu.: 4.75   1st Qu.:0.0000   1st Qu.: 50   1st Qu.:3.0   1st Qu.: 72.75  
 Median : 8.50   Median :1.0000   Median :150   Median :4.5   Median :144.50  
 Mean   : 8.50   Mean   :0.5471   Mean   :150   Mean   :4.5   Mean   :144.50  
 3rd Qu.:12.25   3rd Qu.:1.0000   3rd Qu.:250   3rd Qu.:6.0   3rd Qu.:216.25  
 Max.   :16.00   Max.   :1.0000   Max.   :250   Max.   :7.0   Max.   :288.00  
 BackgroundColor      HitOrMiss       BlockNumber   
 Length:4608        Min.   :0.0000   Min.   : 1.00  
 Class :character   1st Qu.:0.0000   1st Qu.: 3.75  
 Mode  :character   Median :1.0000   Median : 6.50  
                    Mean   :0.6866   Mean   : 6.50  
                    3rd Qu.:1.0000   3rd Qu.: 9.25  
                    Max.   :1.0000   Max.   :12.00

# look at the data (can also use summary(), str(), head() etc.)
glimpse(pong_data)

Rows: 4,608
Columns: 8
$ Participant     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ JudgedSpeed     <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, …
$ PaddleLength    <dbl> 50, 250, 50, 250, 250, 50, 250, 50, 250, 50, 50, 250, …
$ BallSpeed       <dbl> 5, 3, 4, 3, 7, 5, 6, 2, 4, 4, 7, 7, 3, 6, 5, 7, 2, 5, …
$ TrialNumber     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ BackgroundColor <chr> "red", "blue", "red", "red", "blue", "blue", "red", "r…
$ HitOrMiss       <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, …
$ BlockNumber     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Solutions 2

new_pong_data <- pong_data %>% 
  select(BallSpeed, HitOrMiss, JudgedSpeed, Participant, TrialNumber) %>% 
  arrange(desc(HitOrMiss), desc(JudgedSpeed)) %>% 
  filter(
    JudgedSpeed == 1,
    BallSpeed %in% c("2", "4", "5", "7"),
    HitOrMiss == 0
  ) %>% 
  filter(TrialNumber > 2) %>% 
  mutate(TrialNumber = TrialNumber -1) 
  
  # summarize (use old data frame because we removed variables)
pong_data_hits <- 
  pong_data %>% 
  summarize(
    total_hits = sum(HitOrMiss, na.rm = TRUE),
    meanhits = mean(HitOrMiss, na.rm = TRUE),
    .by = c(BackgroundColor, PaddleLength)
  )

Thanks!

Learning objectives:

Learn about tidyverse vs. base R
Learn and apply the six basic dplyr “verbs”
Learn how to join data frames

Data Visualization in R