“Preparation” of the data for analysis: cleaning up variables (outliers, erroneous values, recoding…), changing the structure/format of data frames, merging/joining data sets, calculating new variables, reducing/summarizing variables…
You will spend a lot more time wrangling the data than analyzing it!
You could do this manually (e.g. in Excel), but this is tedious, error prone & not reproducible! (+ Datasets can be huge!)
Fortunately, it is easy to do in R
When wrangling your data in R, you often want to access/use different columns, e.g. to calculate new ones. There are a number of ways you can do that:
# create a small data set for this example:
testdata <- data.frame(a = c(1, 2, 3), # c() creates a vector!
b = c("a", "b", "c"),
c = c(4, 5, 6),
d = c(7, 8, 9),
e = c(10, 11, 12))
print(testdata)
a b c d e
1 1 a 4 7 10
2 2 b 5 8 11
3 3 c 6 9 12
'data.frame': 3 obs. of 5 variables:
$ a: num 1 2 3
$ b: chr "a" "b" "c"
$ c: num 4 5 6
$ d: num 7 8 9
$ e: num 10 11 12
When wrangling your data in R, you often want to access/use different columns, e.g. to calculate new ones. There are a number of ways you can do that:
# in baseR, we access elements of a data.frame with square brackets
testdata[1, 2] # get cell that is in first row and second column
[1] "a"
testdata[1:2, 4:5] # use a colon to create ranges of values: first two rows and column numbers 4 and 5
d e
1 7 10
2 8 11
# we can leave one part empty to select ALL available columns/rows
testdata[1:2, ] # first two rows, all columns
a b c d e
1 1 a 4 7 10
2 2 b 5 8 11
d e
1 7 10
2 8 11
3 9 12
When wrangling your data in R, you often want to access/use different columns, e.g. to calculate new ones. There are a number of ways you can do that:
When wrangling your data in R, you often want to access/use different columns, e.g. to calculate new ones. There are a number of ways you can do that:
You can do all data wrangling in Base R, i.e. without loading any packages. However, there’s a neat collection of packages called tidyverse, which makes data wrangling even easier!
Base R:
output_data1 <- function1(data)
output_data2 <- function2(output_data1, param1)
output_data3 <- function3(output_data2, param2, param3)
Or:
output_data <- function3(function2(function1(data), param1), param2, param3)
Tidyverse:
output_data <- data %>% function1() %>% function2(param1) %>% function3(param2, param3)
You can insert a pipe %>%
(including spaces) by pressing Ctrl + Shift + M
library(tidyverse)
will load a number of packages, such as dplyr, ggplot2, readr, forcats, tibble etc., which are all usefuls for data wrangling.
We will work mainly with functions from the dplyr package, but also use readr to read in data. We will also use ggplot2 to visualize data.
The most important dplyr functions for data wrangling are:
Function | Description |
---|---|
select() | Include or exclude certain columns (variables) |
filter() | Include or exclude certain rows (observations) |
mutate() | Create new columns (variables) |
summarize() | Create new columns that aggregate data/create summary variables for groups of observations (data frame will become smaller) |
group_by() | Organize the rows (observations) into groups |
arrange() | Change the order of rows (observations) |
This is Data wrangling 1 of our QuantFun book: https://psyteachr.github.io/quant-fun-v2/data-wrangling-1.html
Open your Workshop R project.
Create a new R script and save it, e.g. as “DataWrangling1.R
”.
Insert code to make sure the packages “tidyverse
” and “babynames
” are installed and loaded.
Type the word babynames
into your console pane and press enter. What kind of information do you get?
“A tibble: 1,924,665 x 5”
tibble is an extension of the data.frame with more convenient output (e.g., values rounded to significant digits)
~1.9 million rows/observations
5 columns/variables
What kind of columns/variables do we have?
dbl = double/numeric (can take decimals)
chr = character/string (letters or words)
int = integer (only whole numbers)
Use select()
to choose only the columns year
, sex
, name
, and prop
and store it as a new tibble called babynames_reduced
.
Remember that you can run ?select
in the console if you need help about, e.g., input/arguments to the function.
Change the order of the data (oberservations/rows)!
arrange()
, try sorting the data according to the names
column. What happens??arrange
).sort_asc <- babynames %>% arrange(name)
sort_desc <- babynames %>% arrange(desc(year))
babynames %>% arrange(desc(year), name)
# A tibble: 1,924,665 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 2017 M Aaban 11 0.0000056
2 2017 F Aabriella 6 0.0000032
3 2017 M Aadam 18 0.00000917
4 2017 M Aadan 8 0.00000407
5 2017 M Aadarsh 15 0.00000764
6 2017 M Aaden 240 0.000122
7 2017 M Aadesh 7 0.00000357
8 2017 M Aadhav 31 0.0000158
9 2017 M Aadhavan 6 0.00000306
10 2017 M Aadhi 10 0.00000509
# ℹ 1,924,655 more rows
We have already used select()
to keep only certain variables (columns), but often we also want to keep only certain observations (rows), e.g. babies born in the year 2000 and later.
We use the function filter()
for this.
# A tibble: 562,156 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 2001 F Emily 25055 0.0127
2 2001 F Madison 22164 0.0112
3 2001 F Hannah 20712 0.0105
4 2001 F Ashley 16526 0.00835
5 2001 F Alexis 16401 0.00828
6 2001 F Sarah 15896 0.00803
7 2001 F Samantha 15862 0.00801
8 2001 F Abigail 14807 0.00748
9 2001 F Elizabeth 14784 0.00747
10 2001 F Olivia 13978 0.00706
# ℹ 562,146 more rows
The data starts at 2001! :(
The second argument, year > 2000
, is a Boolean or logical expression, which means that it results in a value of either TRUE or FALSE. filter()
runs this expression and then removes all values/rows that contain FALSE.
Operator | Name | is TRUE if and only if |
---|---|---|
A < B | less than | A is less than B |
A <= B | less than or equal | A is less than or equal to B |
A > B | greater than | A is greater than B |
A >= B | greater than or equal | A is greater than or equal to B |
A == B | equivalence | A exactly equals B |
A != B | not equal | A does not exactly equal B |
A %in% B | in | A is an element of vector B |
Second task:
This might be difficult because you have two expressions, name != "Mary"
and year > 2000
. You can simply add several expressions separated by commas in filter (commas are treated like a “logical and” &
):
Third task:
Fourth task:
This is very tricky! You could use three filters in a row with:
name != "Mary", name != "Elizabeth", name != "Victoria"
.
There is no function “not in” but you can negate the result in two ways:
In your script, insert and run the following code:
Alter the code to check for male babies with the same names (change sex == "F"
to sex == "M"
).
Optional: Plot the absolute number n
instead of the relative proportion prop
.
If we want to create variables that do not exist yet (i.e. by calculating values, combining other variables, etc.), we can use mutate()
!
But mutate
is much more powerful and can create variables that differ per observation, depending on other values in the tibble/data frame:
# A tibble: 10 × 2
year decade
<dbl> <dbl>
1 1902 1900
2 1909 1900
3 1914 1910
4 1934 1930
5 1966 1960
6 1986 1980
7 1990 1990
8 1999 1990
9 2006 2000
10 2012 2010
The goal of data wrangling is often to summarize (or aggregate) the data, e.g. to have an average value per condition. Sometimes you’d also want to calculate descriptive statistics to report.
You can do so using the function summarize()
:
# run the filter function just like above again:
dat <-
babynames %>%
filter(
name %in% c("Emily", "Kathleen", "Alexandra", "Beverly"),
sex == "F"
)
# summarize the data, calculating the number of oberservations:
dat_sum <- dat %>% summarize(total = sum(n))
dat_sum
# A tibble: 1 × 1
total
<int>
1 2161374
As you can see, a new variable named total is created, which contains the total number of observations (in this case, it is different from the number of rows because each row already contains a count n
).
There’s just one row in the resulting data frame, because summarize()
reduces the data frame (to only include the necessary information)!
Often, we want to summarize data for specific subgroups. For this aim, summarize()
has the .by
parameter:
You can also subgroup by a combination of variables:
babynames %>%
filter(name %in% c("Emily", "Kathleen", "Alexandra", "Beverly")) %>% # we start with the 4 names regardless of sex
summarize(
total = sum(n),
.by = c(name, sex) # and then summarize by name, separated for sex
)
# A tibble: 8 × 3
name sex total
<chr> <chr> <int>
1 Emily F 841491
2 Kathleen F 711605
3 Beverly M 4633
4 Beverly F 376914
5 Alexandra F 231364
6 Emily M 1744
7 Kathleen M 1692
8 Alexandra M 859
In earlier versions, we had to use summarize()
together with group_by()
:
We avoid using group_by()
like this because it can have unintended side effects.
It is just part of this class because you will likely encounter it in somebody else’s (old) code.
If you do have to use it, make sure to ungroup()
after summarize()
(or mutate()
) to avoid unintended effects:
Use the baby_decades
data frame to calculate the mean and median number of observations, grouped by sex & decade.
# A tibble: 28 × 4
sex decade mean_year median_year
<chr> <dbl> <dbl> <dbl>
1 F 1880 111. 13
2 M 1880 101. 12
3 F 1890 128. 13
4 M 1890 93.6 12
5 F 1900 131. 12
6 M 1900 94.4 12
7 F 1910 187. 12
8 M 1910 181. 12
9 F 1920 211. 12
10 M 1920 227. 13
# ℹ 18 more rows
There are several ways to get the number of rows per group. You can use the function n()
within a call to summarize()
(or mutate()
). A shortcut is to use count()
:
# A tibble: 4 × 2
name n
<chr> <int>
1 Emily 138
2 Kathleen 138
3 Beverly 122
4 Alexandra 117
# A tibble: 4 × 2
name n
<chr> <int>
1 Alexandra 117
2 Beverly 122
3 Emily 138
4 Kathleen 138
Interestingly, the order of the output may vary. summarize()
leaves the data in the original order (i.e., by prop
, which (likely) translates to an order by n()
). count()
arranges the output by the variables for which the counting is done (here: alphabetically by name
).
So far we have often saved intermediate steps in tibbles and used those as input for the next function. With the pipe, we can chain several functions and save relevant results only, no need for crowding the environment with intermediate data.frames or tibbles!
It’s not easy to decide which intermediate steps to save and which not. Usually, it involves some sort of trial and error. Sometimes you go back and break a pipe apart. Sometimes you get overwhelmed by the number of variables in your environment and create bigger pipes.
As a rule of thumb: If an intermediate step is only used once, you should probably delete it (unless it makes the code easier to comprehend).
This is Data wrangling 3 of our QuantFun book: https://psyteachr.github.io/quant-fun-v2/data-wrangling-3.html
We will skip Data wrangling 2 here but you will find the solutions in the end of the slides.
Tidy data: Data that is easily processed by tidyverse functions (also for visualizations and statistical analyses).
Three principles:
Each variable has its own column.
Each observation has its own row.
Each value has its own cell.
Wide format: Each participant/animal has one row;
repeated observations are in several columns
ID | Time_1 | Time_2 |
---|---|---|
a1 | 230 | 310 |
a2 | 195 | 220 |
a3 | 245 | 290 |
Long format: Each observation has its own row;
there are (usually) several rows per participant
ID | Time | Value |
---|---|---|
a1 | 1 | 230 |
a1 | 2 | 310 |
a2 | 1 | 195 |
a3 | 2 | 220 |
a3 | 1 | 245 |
a3 | 2 | 290 |
Wide format implements a sparser representation of the data but less tidy!
If you want to convert Time
from milliseconds into seconds, what do you have to do in both formats?
What do you think, which of the following data sets is tidy?
1:
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
2:
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
3:
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
4:
# A tibble: 3 × 5
country `1999_cases` `2000_cases` `1999_population` `2000_population`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 745 2666 19987071 20595360
2 Brazil 37737 80488 172006362 174504898
3 China 212258 213766 1272915272 1280428583
For the following activities, we will need the following files:
responses.csv containing the AQ survey responses to each of the 10 questions for the 66 participants
qformats.csv containing information on how a question should be coded - i.e. forward or reverse coding
scoring.csv containing information on how many points a specific response should get; depending on whether it is forward or reverse coded
pinfo.csv containing participant information such as Age
, Sex
and importantly ID
number.
Create a new script, e.g. as “DataWrangling3.R
” (remember we skipped #2 in the book).
Download the data into your project folder:
responses.csv
qformats.csv
scoring.csv
pinfo.csv
Clear your environment (the brush in the top right pane) and/or restart the R session (Session -> Restart R).
Load the four .csv files into your environment, e.g.:
Is the data (responses
) in a tidy format?
# A tibble: 6 × 11
Id Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 16 Slightly Disagree Defi… Slig… Defi… Slig… Slig… Slig… Defi… Slig… Slig…
2 17 Definitely Agree Slig… Slig… Defi… Defi… Defi… Slig… Slig… Slig… Slig…
3 18 Definitely Agree Defi… Slig… Defi… Defi… Defi… Slig… Defi… Defi… Defi…
4 19 Definitely Agree Defi… Defi… Slig… Defi… Defi… Slig… Slig… Defi… Slig…
5 20 Definitely Disagr… Slig… Defi… Slig… Slig… Slig… Slig… Slig… Slig… Slig…
6 21 Slightly Disagree Slig… Defi… Slig… Slig… Slig… Defi… Defi… Slig… Slig…
Why is it not tidy?
Let’s bring the wide data in a longer, tidy format!
There are several functions in R to reformat data, but the newest ones are pivot_longer()
and pivot_wider()
.
Run the code and see what changes:
Describe what the function does, what does the input/the arguments mean?
We now want to combine the different data sets: We want to have the information how the questionnaire has to be scored included with the items.
We can find the scoring information (i.e. how the questions are framed, positive or negative/whether they need to be reversed) in the qformats
tibble. Furthermore, we can find how many points are given to each item/response in scoring
.
We can use the function inner_join()
to merge the tibbles into one bigger tibble.
Activity: Replace the NULL
values in the below code with the necessary variable names to join rlong
and qformats
by Question
.
You can only join two data frames/tibbles at once.
Now add the scoring data:
You can also let the function figure out by itself which columns should be used for joining:
How do we need to group and summarize the data to get a sum score per person? (Ignoring the reverse coding for now!) Add the correct column names instead of the NULL
.
This is Data wrangling 2 of our QuantFun book: https://psyteachr.github.io/quant-fun-v2/data-wrangling-2.html
We’ll use data from a paper that investigates whether the ability to perform an action influences perception. In particular, the authors wondered whether participants who played Pong would perceive the ball to move faster when they have a small paddle.
Download the data, create a new script.
Clear the environment if you prefer.
Look at the data.
library("tidyverse")
pong_data <- read_csv("Data/PongBlueRedBack 1-16 Codebook.csv") # I put the data into a separate subfolder "Data"
summary(pong_data)
Participant JudgedSpeed PaddleLength BallSpeed TrialNumber
Min. : 1.00 Min. :0.0000 Min. : 50 Min. :2.0 Min. : 1.00
1st Qu.: 4.75 1st Qu.:0.0000 1st Qu.: 50 1st Qu.:3.0 1st Qu.: 72.75
Median : 8.50 Median :1.0000 Median :150 Median :4.5 Median :144.50
Mean : 8.50 Mean :0.5471 Mean :150 Mean :4.5 Mean :144.50
3rd Qu.:12.25 3rd Qu.:1.0000 3rd Qu.:250 3rd Qu.:6.0 3rd Qu.:216.25
Max. :16.00 Max. :1.0000 Max. :250 Max. :7.0 Max. :288.00
BackgroundColor HitOrMiss BlockNumber
Length:4608 Min. :0.0000 Min. : 1.00
Class :character 1st Qu.:0.0000 1st Qu.: 3.75
Mode :character Median :1.0000 Median : 6.50
Mean :0.6866 Mean : 6.50
3rd Qu.:1.0000 3rd Qu.: 9.25
Max. :1.0000 Max. :12.00
Rows: 4,608
Columns: 8
$ Participant <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ JudgedSpeed <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, …
$ PaddleLength <dbl> 50, 250, 50, 250, 250, 50, 250, 50, 250, 50, 50, 250, …
$ BallSpeed <dbl> 5, 3, 4, 3, 7, 5, 6, 2, 4, 4, 7, 7, 3, 6, 5, 7, 2, 5, …
$ TrialNumber <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ BackgroundColor <chr> "red", "blue", "red", "red", "blue", "blue", "red", "r…
$ HitOrMiss <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, …
$ BlockNumber <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
new_pong_data <- pong_data %>%
select(BallSpeed, HitOrMiss, JudgedSpeed, Participant, TrialNumber) %>%
arrange(desc(HitOrMiss), desc(JudgedSpeed)) %>%
filter(
JudgedSpeed == 1,
BallSpeed %in% c("2", "4", "5", "7"),
HitOrMiss == 0
) %>%
filter(TrialNumber > 2) %>%
mutate(TrialNumber = TrialNumber -1)
# summarize (use old data frame because we removed variables)
pong_data_hits <-
pong_data %>%
summarize(
total_hits = sum(HitOrMiss, na.rm = TRUE),
meanhits = mean(HitOrMiss, na.rm = TRUE),
.by = c(BackgroundColor, PaddleLength)
)
Learning objectives:
Learn about tidyverse vs. base R
Learn and apply the six basic dplyr “verbs”
Learn how to join data frames
Next:
Data Visualization in R