Learn R for Psychology Research: A Crash Course

Introduction

I wrote this for psychologists who want to learn how to use R in their research right now. What does a psychologist need to know to use R to import, wrangle, plot, and model their data today? Here we go.

Foundations: People and resources that inspired me

  • David Robinson [.html] convinced me that beginneRs should learn the tidyverse first, not base R. This tutorial uses the tidyverse. All you need to know about the difference is in his blog post. If you’ve learned some R before this, you might understand that difference as you go through this tutorial.
  • If you want a more detailed introduction to R, start with R for Data Science (Wickham & Grolemund, 2017) [.html]. The chapters are short, easy to read, and filled with simple coding examples that demonstrate big principles. And it’s free.
  • Hadley Wickham is a legend in the R community. He’s responsible for the tidyverse, including ggplot2. Read his books and papers (e.g., Wickham, 2014). Watch his talks (e.g., ReedCollegePDX, October 19, 2015). He’s profoundly changed how people think about structuring and visualizing data.

Need-to-Know Basics

Install R and RStudio (you need both, in that order)

Understand all the panels in RStudio

[Image: the four panels in RStudio]

Packages: They’re like apps you download for R

“Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.” [.html]

# before you can load these libraries, you need to install them (remove the # part first):
# install.packages("tidyverse")
# install.packages("haven")
# install.packages("psych")
# install.packages("car")

library(tidyverse)
library(haven)
library(psych)
library(car)

Objects: They save stuff for you

“object <- function(x), which means ‘object is created from function(x)’. An object is anything created in R. It could be a variable, a collection of variables, a statistical model, etc. Objects can be single values (such as the mean of a set of scores) or collections of information; for example, when you run an analysis, you create an object that contains the output of that analysis, which means that this object contains many different values and variables. Functions are the things that you do in R to create your objects.”
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: SAGE Publications. [.html]
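
For example, here’s a minimal sketch of creating an object and then printing it (the name my_mean is arbitrary, chosen just for illustration):

# create an object named my_mean from the function mean()
my_mean <- mean(c(7, 8, 9))

# print my_mean by just executing/running the name of the object
my_mean
## [1] 8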

c() Function: Combine things like thing1, thing2, thing3, …

“c” stands for combine. Use this to combine values into a vector. “A vector is a sequence of data ‘elements’ of the same basic type.” [.html] Below, we create an object called five_numbers. We are naming it for what it is, but we could name it whatever we want: some_numbers, maple_tree, platypus. It doesn’t matter. We’ll use this in the examples in later chunks of code.

# read: combine 1, 2, 3, 4, 5 and "save to", <-, five_numbers
five_numbers <- c(1, 2, 3, 4, 5)

# print five_numbers by just executing/running the name of the object
five_numbers
## [1] 1 2 3 4 5

R Help: help() and ?

“The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages. To access documentation for the standard lm (linear model) function, for example, enter the command help(lm) or help("lm"), or ?lm or ?"lm" (i.e., the quotes are optional).” [.html]
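
For example, both of these lines open the documentation page for the mean() function:

# open the R documentation for the mean() function
help(mean)

# same thing, using the ? operator
?mean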

Piping, %>%: Write code kinda like you write sentences

The %>% operator allows you to “pipe” a value forward into an expression or function; something along the lines of x %>% f, rather than f(x). See the magrittr page [.html] for more details, but check out these examples below.

Compute z-scores for those five numbers, called five_numbers

  • see help(scale) for details
five_numbers %>% scale()
##            [,1]
## [1,] -1.2649111
## [2,] -0.6324555
## [3,]  0.0000000
## [4,]  0.6324555
## [5,]  1.2649111
## attr(,"scaled:center")
## [1] 3
## attr(,"scaled:scale")
## [1] 1.581139

Compute z-scores for five_numbers and then convert the result into only numbers

  • see help(as.numeric) for details
five_numbers %>% scale() %>% as.numeric()
## [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

Compute z-scores for five_numbers and then convert the result into only numbers and then compute the mean

  • see help(mean) for details
five_numbers %>% scale() %>% as.numeric() %>% mean()
## [1] 0

Tangent: most R introductions will teach you to code the example above like this:

mean(as.numeric(scale(five_numbers)))
## [1] 0
  • I think this code is counterintuitive. You’re reading the current sentence from left to right. That’s how I think code should read: like how you read sentences. Forget this “read from the inside out” way of coding for now. You can learn the “read R code inside out” way when you have time and feel motivated to learn harder things. I’m assuming you don’t right now.

Functions: they do things for you

“A function is a piece of code written to carry out a specified task; it can or can not accept arguments or parameters and it can or can not return one or more values.” Functions do things for you. [.html]

Compute the sum of five_numbers

  • see help(sum) for details
five_numbers %>% sum()
## [1] 15

Compute the length of five_numbers

  • see help(length) for details
five_numbers %>% length()
## [1] 5

Compute the sum of five_numbers and divide by the length of five_numbers

  • see help(Arithmetic) for details
five_numbers %>% sum() / five_numbers %>% length()
## [1] 3

Define a new function called compute_mean

compute_mean <- function(some_numbers) {
  some_numbers %>% sum() / some_numbers %>% length()
}

Compute the mean of five_numbers

five_numbers %>% compute_mean()
## [1] 3

Tangent: Functions make assumptions; know what they are

What is the mean of 5 numbers and an unknown number, NA?

  • see help(NA) for details
c(1, 2, 3, 4, 5, NA) %>% mean()
## [1] NA

Is this what you expected? Turns out, this isn’t a quirky feature of R. R was designed by statisticians and mathematicians. NA represents a value that is unknown. Ask yourself, what is the sum of an unknown value and 17? If you don’t know the value, then you don’t know the value of adding it to 17 either. The mean() function gives NA for this reason: the mean of 5 values and an unknown value is NA; it’s unknown; it’s not available; it’s missing. When you use functions in your own research, think about what the functions “assume” or “know”; ask, “What do I want the function to do? What do I expect it to do? Can the function do what I want with the information I gave it?”
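
The same logic applies to sum(). Here’s a minimal illustration:

# the sum of 17 and an unknown value is unknown
sum(c(17, NA))
## [1] NA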

Tell the mean() function to remove missing values

c(1, 2, 3, 4, 5, NA) %>% mean(na.rm = TRUE)
## [1] 3

Create data for psychology-like examples

This is the hardest section of the tutorial. Keep this in mind: we’re making variables that you might see in a simple psychology dataset, and then we’re going to combine them into a dataset. Don’t worry about specifics too much. If you want to understand how a function works, use ?name_of_function or help(name_of_function).

Subject numbers

  • read like this: generate a sequence of values from 1 to 100 by 1
  • see help(seq) for details
subj_num <- seq(from = 1, to = 100, by = 1)

# print subj_num by just executing/running the name of the object
subj_num
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

See those numbers on the left-hand side of the output above? They help you keep track of a sequence (i.e., line) of elements as the output moves to each new line. This is obvious in the example above because the numbers correspond exactly to their position. But if these were letters or random numbers, that index (e.g., [14] “N”) would tell you the position of the element that starts that line.
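
For example, try printing the built-in vector LETTERS (I don’t show the output here because where each new line starts depends on the width of your console):

# the bracketed index at the start of each wrapped line gives the position of
# that line's first element (e.g., LETTERS[14] is "N")
LETTERS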

Condition assignments

  • read like this: replicate each element of c(“control”, “manipulation”) 50 times, and then turn the result into a factor
  • side note: in R, factors are nominal variables (i.e., integers) with value labels (i.e., names for each integer); see the short example after the output below
  • see help(rep) and help(factor) for details
condition <- c("control", "manipulation") %>% rep(each = 50) %>% factor()

# print condition by just executing/running the name of the object
condition
##   [1] control      control      control      control      control     
##   [6] control      control      control      control      control     
##  [11] control      control      control      control      control     
##  [16] control      control      control      control      control     
##  [21] control      control      control      control      control     
##  [26] control      control      control      control      control     
##  [31] control      control      control      control      control     
##  [36] control      control      control      control      control     
##  [41] control      control      control      control      control     
##  [46] control      control      control      control      control     
##  [51] manipulation manipulation manipulation manipulation manipulation
##  [56] manipulation manipulation manipulation manipulation manipulation
##  [61] manipulation manipulation manipulation manipulation manipulation
##  [66] manipulation manipulation manipulation manipulation manipulation
##  [71] manipulation manipulation manipulation manipulation manipulation
##  [76] manipulation manipulation manipulation manipulation manipulation
##  [81] manipulation manipulation manipulation manipulation manipulation
##  [86] manipulation manipulation manipulation manipulation manipulation
##  [91] manipulation manipulation manipulation manipulation manipulation
##  [96] manipulation manipulation manipulation manipulation manipulation
## Levels: control manipulation
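
To peek at how a factor stores those integers and value labels, here’s a minimal sketch using levels() and as.integer():

# the value labels
condition %>% levels()
## [1] "control"      "manipulation"

# the underlying integers (first 3 shown)
condition %>% as.integer() %>% head(n = 3)
## [1] 1 1 1

# the underlying integers (last 3 shown)
condition %>% as.integer() %>% tail(n = 3)
## [1] 2 2 2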

Dependent measure

Save 5 values that represent the sample size and the true means and standard deviations of our pretend conditions

sample_size <- 50
control_mean <- 2.5
control_sd <- 1
manip_mean <- 5.5
manip_sd <- 1

Introduce a neat function in R: rnorm()

rnorm stands for random normal. Tell it the sample size, the true mean, and the true sd, and it’ll draw from that normal population at random and spit out numbers.

  • see help(rnorm) for details

# for example
rnorm(n = 10, mean = 0, sd = 1)
##  [1]  0.49894471 -0.32804955  0.50537413 -0.11556736 -0.08508432
##  [6]  0.08209753 -0.19807435 -0.67604751 -1.40819698  0.65839239

Randomly sample from the populations we made up above

control_values <- rnorm(n = sample_size, mean = control_mean, sd = control_sd)

manipulation_values <- rnorm(n = sample_size, mean = manip_mean, sd = manip_sd)

Combine those and save as our dependent variable

dep_var <- c(control_values, manipulation_values)

# print dep_var by just executing/running the name of the object
dep_var
##   [1]  3.80435908  1.98433834  3.58911265  2.25427353  2.26939479
##   [6]  3.00571297  1.79599758  3.78066987  2.57893735  0.41113928
##  [11]  1.45297236 -0.26800475  0.94876854  2.72645231  2.11293954
##  [16]  0.93341662  2.32122047  2.13377155  2.47441804  4.10472885
##  [21]  2.87566838  1.61589446  1.40305547  1.17898161  3.45501450
##  [26]  1.63901146  3.27007143  2.88237529 -0.01839347  1.34987475
##  [31]  2.81149228  2.43357785  2.00504670  0.92384205  2.12499045
##  [36]  2.81331542  4.16386024  3.37649612  3.00652454  3.56641968
##  [41]  2.79969451  1.69841398  3.34037340  1.58360830  3.27350578
##  [46]  2.21078116  2.28055121  1.24831449  4.25549097  2.28968369
##  [51]  5.53075605  5.88079557  3.98853615  5.40940592  5.67097034
##  [56]  6.57733121  4.95465727  4.74030459  7.07676495  5.84818695
##  [61]  5.58482038  6.63561264  4.22953125  5.26021676  4.18271199
##  [66]  6.28222681  5.09162940  4.15811312  4.22998668  6.46896233
##  [71]  6.39382167  4.30209211  5.17371080  6.37887623  6.65385566
##  [76]  4.99211078  5.80340366  6.05295369  4.41951254  4.76894952
##  [81]  5.97968673  5.18637234  6.27683921  4.14482297  4.88000977
##  [86]  3.52223543  5.10475171  5.12186433  3.99062501  5.86301456
##  [91]  6.55805845  5.03726371  5.37350470  5.01889901  7.13802500
##  [96]  5.18802629  6.02507734  6.59428100  5.23363327  4.44856252

Create a potentially confounding variable or a control variable

In the code below, we multiply every value in dep_var by 0.5 and then we “add noise”: 100 random normal values whose true population has mean = 0 and sd = 1. Every value in dep_var * 0.5 gets a random value added to it.

confound <- dep_var * 0.5 + rnorm(n = 100, mean = 0, sd = 1)

# print confound by just executing/running the name of the object
confound
##   [1]  0.51411880  0.40047444  2.35260247  2.39802673  0.61456036
##   [6]  1.53744814 -0.24856945  2.84688078  1.78000979  0.65793447
##  [11]  0.46038034 -0.55693190  0.15719474  0.55373294  0.65975774
##  [16]  0.10169319  1.74698979  0.12030403  1.32009371  2.27981471
##  [21]  0.34613849  0.43965108  0.66271575  1.20432223 -0.25043490
##  [26]  0.26868572  0.61011480  1.65228582  1.37326363  0.43832942
##  [31]  1.62745456 -0.59358047  1.10697713 -1.36755391  1.59130779
##  [36]  2.91935947  1.99479358  1.07678653  1.00595597  1.62642447
##  [41]  1.28899263  1.82400453  2.51240635 -0.35753691  0.43771414
##  [46]  1.30727552  1.08060064  0.06576004  2.80538226 -0.55497383
##  [51]  0.38169063  3.59617145  2.70618669  1.71982121  3.04889900
##  [56]  5.09499928  3.57930502  2.09590896  2.88080717  2.33411106
##  [61]  4.62382234  1.43874759  2.38724141  1.30176362  1.67788292
##  [66]  4.04790755  2.18167509  2.31908101  1.62504006  2.59734364
##  [71]  2.61518822  2.41878340  1.18822924  4.56433660  3.71618299
##  [76]  2.35777545  3.53622419  1.77834806  1.84902914  2.79422726
##  [81]  3.42986400  2.99422568  3.01973675  2.36812929  2.99174776
##  [86]  2.44949626  2.39993654  2.42561094  2.07900733  4.28085347
##  [91]  3.14719580  2.29168943  2.17551577  3.81201421  4.96003880
##  [96]  2.51509873  3.16018881  5.21212618  3.35937272  2.22616663

Subject gender

  • read like this: repeat the whole vector c(“Woman”, “Man”) sample_size = 50 times (i.e., alternating “Woman”, “Man”, “Woman”, “Man”, …), and then turn the result into a factor
gender <- c("Woman", "Man") %>% rep(times = sample_size) %>% factor()

# print gender by just executing/running the name of the object
gender
##   [1] Woman Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman
##  [12] Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman Man  
##  [23] Woman Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman
##  [34] Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman Man  
##  [45] Woman Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman
##  [56] Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman Man  
##  [67] Woman Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman
##  [78] Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman Man  
##  [89] Woman Man   Woman Man   Woman Man   Woman Man   Woman Man   Woman
## [100] Man  
## Levels: Man Woman

Subject age

  • read like this: generate a sequence of values from 18 to 25 by 1, and take a random sample of 100 values from that sequence (with replacement)
  • see help(sample) for details
age <- seq(from = 18, to = 25, by = 1) %>% sample(size = 100, replace = TRUE)

# print age by just executing/running the name of the object
age
##   [1] 22 23 18 20 22 23 20 22 18 21 21 20 18 21 18 20 19 25 25 24 21 25 23
##  [24] 18 23 18 19 25 19 23 24 19 20 21 18 18 24 19 24 18 24 22 21 18 24 21
##  [47] 18 25 20 18 22 20 18 25 23 23 22 23 24 19 21 25 22 21 23 18 20 23 24
##  [70] 21 22 18 18 19 22 25 22 24 19 25 23 25 18 23 22 23 21 25 18 25 20 20
##  [93] 18 20 22 19 23 25 18 19

data.frame() and tibble()

“The concept of a data frame comes from the world of statistical software used in empirical research; it generally refers to "tabular" data: a data structure representing cases (rows), each of which consists of a number of observations or measurements (columns). Alternatively, each row may be treated as a single observation of multiple "variables". In any case, each row and each column has the same data type, but the row ("record") datatype may be heterogeneous (a tuple of different types), while the column datatype must be homogeneous. Data frames usually contain some metadata in addition to data; for example, column and row names.” [.html]

“Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e., converting character vectors to factors).” [.html]

Put all the variables we made into a tibble

  • every variable we made above is separated by a comma – these will become columns in our dataset
  • see help(data.frame) and help(tibble) for details
example_data <- tibble(subj_num, condition, dep_var, confound, gender, age)

# print example_data by just executing/running the name of the object
example_data
## # A tibble: 100 x 6
##    subj_num condition dep_var confound gender   age
##       <dbl> <fct>       <dbl>    <dbl> <fct>  <dbl>
##  1        1 control     3.80     0.514 Woman     22
##  2        2 control     1.98     0.400 Man       23
##  3        3 control     3.59     2.35  Woman     18
##  4        4 control     2.25     2.40  Man       20
##  5        5 control     2.27     0.615 Woman     22
##  6        6 control     3.01     1.54  Man       23
##  7        7 control     1.80    -0.249 Woman     20
##  8        8 control     3.78     2.85  Man       22
##  9        9 control     2.58     1.78  Woman     18
## 10       10 control     0.411    0.658 Man       21
## # … with 90 more rows

Data wrangling examples

Create new variables in a data.frame or tibble

  • mutate() adds new variables to your tibble.
  • we’re adding new columns to our dataset, and we’re saving this new dataset under the same name as our old one (i.e., like changing an Excel file and saving it with the same name)
  • see help(mutate) for details
example_data <- example_data %>%
  mutate(dep_var_z = dep_var %>% scale() %>% as.numeric(),
         confound_z = confound %>% scale() %>% as.numeric())

# print example_data by just executing/running the name of the object
example_data
## # A tibble: 100 x 8
##    subj_num condition dep_var confound gender   age dep_var_z confound_z
##       <dbl> <fct>       <dbl>    <dbl> <fct>  <dbl>     <dbl>      <dbl>
##  1        1 control     3.80     0.514 Woman     22   -0.0289    -0.995 
##  2        2 control     1.98     0.400 Man       23   -1.03      -1.08  
##  3        3 control     3.59     2.35  Woman     18   -0.147      0.348 
##  4        4 control     2.25     2.40  Man       20   -0.879      0.381 
##  5        5 control     2.27     0.615 Woman     22   -0.871     -0.921 
##  6        6 control     3.01     1.54  Man       23   -0.467     -0.247 
##  7        7 control     1.80    -0.249 Woman     20   -1.13      -1.55  
##  8        8 control     3.78     2.85  Man       22   -0.0419     0.709 
##  9        9 control     2.58     1.78  Woman     18   -0.701     -0.0701
## 10       10 control     0.411    0.658 Man       21   -1.89      -0.890 
## # … with 90 more rows

Select specific columns

  • select() selects your tibble’s variables by name and in the exact order you specify.
  • see help(select) for details
example_data %>% 
  select(subj_num, condition, dep_var)
## # A tibble: 100 x 3
##    subj_num condition dep_var
##       <dbl> <fct>       <dbl>
##  1        1 control     3.80 
##  2        2 control     1.98 
##  3        3 control     3.59 
##  4        4 control     2.25 
##  5        5 control     2.27 
##  6        6 control     3.01 
##  7        7 control     1.80 
##  8        8 control     3.78 
##  9        9 control     2.58 
## 10       10 control     0.411
## # … with 90 more rows

Filter specific rows

  • filter() returns rows that all meet some condition you give it (multiple conditions are combined with AND; see the extra example after the output below).
  • Note, == means “exactly equal to”. See ?Comparison.
  • see help(filter) for details
example_data %>% 
  filter(condition == "control")
## # A tibble: 50 x 8
##    subj_num condition dep_var confound gender   age dep_var_z confound_z
##       <dbl> <fct>       <dbl>    <dbl> <fct>  <dbl>     <dbl>      <dbl>
##  1        1 control     3.80     0.514 Woman     22   -0.0289    -0.995 
##  2        2 control     1.98     0.400 Man       23   -1.03      -1.08  
##  3        3 control     3.59     2.35  Woman     18   -0.147      0.348 
##  4        4 control     2.25     2.40  Man       20   -0.879      0.381 
##  5        5 control     2.27     0.615 Woman     22   -0.871     -0.921 
##  6        6 control     3.01     1.54  Man       23   -0.467     -0.247 
##  7        7 control     1.80    -0.249 Woman     20   -1.13      -1.55  
##  8        8 control     3.78     2.85  Man       22   -0.0419     0.709 
##  9        9 control     2.58     1.78  Woman     18   -0.701     -0.0701
## 10       10 control     0.411    0.658 Man       21   -1.89      -0.890 
## # … with 40 more rows
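
Here’s a minimal sketch with more than one condition (the age cutoff is arbitrary, just for illustration):

# keep rows where condition is exactly "control" AND age is less than 21
example_data %>% 
  filter(condition == "control", age < 21)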

Make your own table of summary data

  • summarize() lets you apply functions to your data to reduce it to single values. Typically, you create new summary values based on groups (e.g., condition, gender, id); for this, use group_by() first.
  • see help(summarize) and help(group_by) for details
example_data %>% 
  group_by(gender, condition) %>% 
  summarize(Mean = mean(confound),
            SD = sd(confound),
            n = length(confound))
## # A tibble: 4 x 5
## # Groups:   gender [?]
##   gender condition     Mean    SD     n
##   <fct>  <fct>        <dbl> <dbl> <int>
## 1 Man    control      0.812 1.13     25
## 2 Man    manipulation 2.84  1.09     25
## 3 Woman  control      1.10  0.816    25
## 4 Woman  manipulation 2.75  1.01     25

Plotting your data with ggplot2

“ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.” [.html]

Make ggplots in layers

  • Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms. [.html]
  • Below, we map condition on our plot’s x-axis and dep_var on its y-axis
  • see help(ggplot) for details
example_data %>%
  ggplot(mapping = aes(x = condition, y = dep_var))

  • Think of this like a blank canvas that we’re going to add pictures to, or like a map of the ocean that we’re going to add landmasses to

Boxplot

  • next, we add—yes, add, with a +—a geom, a geometric element: geom_boxplot()
  • see help(geom_boxplot) for details
example_data %>%
  ggplot(mapping = aes(x = condition, y = dep_var)) +
  geom_boxplot()

  • Here’s the point worth emphasizing: we started with a blank canvas, and we added a boxplot. All ggplots work this way: start with a base and add layers.

QQ-plots

  • Below, we plot the sample quantiles of dep_var against the theoretical quantiles
  • Useful for exploring the distribution of a variable
  • see help(geom_qq) for details
example_data %>%
  ggplot(mapping = aes(sample = dep_var)) +
  geom_qq()

Means and 95% confidence intervals

  • add a new aesthetic, fill, which will fill the geoms with different colors, depending on the variable (e.g., levels of categorical variables are assigned their own fill color)
  • stat_summary() does what its name suggests: it applies statistical summaries to your raw data to make the geoms (bars and error bars in our case below); note that mean_cl_normal requires the Hmisc package to be installed
  • the width argument sets the width of the error bars
  • see help(stat_summary) for details
example_data %>%
  ggplot(mapping = aes(x = condition, y = dep_var, fill = condition)) +
  stat_summary(geom = "bar", fun.data = mean_cl_normal) +
  stat_summary(geom = "errorbar", fun.data = mean_cl_normal, width = 0.1)

Scatterplots

  • we add geom_point() and geom_smooth() below to add points to the scatterplot and fit a linear regression line with 95% confidence ribbons/bands around that line
  • see help(geom_point) and help(geom_smooth) for details
example_data %>%
  ggplot(mapping = aes(x = confound, y = dep_var)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

Descriptive statistics

  • describe(), help(describe)
  • describeBy(), help(describeBy)

For whole sample

example_data %>% 
  select(dep_var, dep_var_z, confound, confound_z) %>% 
  describe()
##            vars   n mean   sd median trimmed  mad   min  max range  skew
## dep_var       1 100 3.86 1.82   4.05    3.89 2.24 -0.27 7.14  7.41 -0.14
## dep_var_z     2 100 0.00 1.00   0.10    0.02 1.23 -2.26 1.80  4.06 -0.14
## confound      3 100 1.88 1.37   1.84    1.84 1.43 -1.37 5.21  6.58  0.17
## confound_z    4 100 0.00 1.00  -0.03   -0.03 1.04 -2.37 2.44  4.81  0.17
##            kurtosis   se
## dep_var        -1.0 0.18
## dep_var_z      -1.0 0.10
## confound       -0.3 0.14
## confound_z     -0.3 0.10

By condition

The code below is a little confusing. First, we pipe our subsetted tibble with only our four variables (dep_var and confound and their z-scored versions) into the first argument of the describeBy() function. But we also need to give data to the group argument, so we give it another subsetted tibble with only our grouping variable, condition.

example_data %>% 
  select(dep_var, dep_var_z, confound, confound_z) %>% 
  describeBy(group = example_data %>% select(condition))
## 
##  Descriptive statistics by group 
## condition: control
##            vars  n  mean   sd median trimmed  mad   min  max range  skew
## dep_var       1 50  2.33 1.06   2.29    2.35 1.05 -0.27 4.26  4.52 -0.27
## dep_var_z     2 50 -0.84 0.58  -0.86   -0.82 0.58 -2.26 0.22  2.48 -0.27
## confound      3 50  0.96 0.99   0.83    0.94 1.05 -1.37 2.92  4.29  0.08
## confound_z    4 50 -0.67 0.72  -0.76   -0.68 0.77 -2.37 0.76  3.13  0.08
##            kurtosis   se
## dep_var       -0.40 0.15
## dep_var_z     -0.40 0.08
## confound      -0.58 0.14
## confound_z    -0.58 0.10
## -------------------------------------------------------- 
## condition: manipulation
##            vars  n mean   sd median trimmed  mad   min  max range skew
## dep_var       1 50 5.39 0.90   5.25    5.39 1.12  3.52 7.14  3.62 0.01
## dep_var_z     2 50 0.84 0.49   0.76    0.84 0.61 -0.18 1.80  1.98 0.01
## confound      3 50 2.80 1.04   2.56    2.73 0.80  0.38 5.21  4.83 0.44
## confound_z    4 50 0.67 0.76   0.50    0.63 0.59 -1.09 2.44  3.53 0.44
##            kurtosis   se
## dep_var       -0.97 0.13
## dep_var_z     -0.97 0.07
## confound      -0.07 0.15
## confound_z    -0.07 0.11

Read in your own data

  • .csv file: read_csv()
  • .txt file: read_delim()
  • SPSS .sav file: read_sav()

SPSS

  • see help(read_sav) for details
# path to where file lives on your computer
coffee_filepath <- "data/coffee.sav"

coffee_data <- read_sav(coffee_filepath)

# print coffee_data by just executing/running the name of the object
coffee_data
## # A tibble: 138 x 3
##    image     brand      freq
##    <dbl+lbl> <dbl+lbl> <dbl>
##  1  1        1            82
##  2  2        1            96
##  3  3        1            72
##  4  4        1           101
##  5  5        1            66
##  6  6        1             6
##  7  7        1            47
##  8  8        1             1
##  9  9        1            16
## 10 10        1            60
## # … with 128 more rows

CSV

  • see help(read_csv) for details
# path to where file lives on your computer
coffee_filepath <- "data/coffee.csv"

coffee_data <- read_csv(coffee_filepath)
## Parsed with column specification:
## cols(
##   image = col_double(),
##   brand = col_double(),
##   freq = col_double()
## )

TXT

  • see help(read_delim) for details
# path to where file lives on your computer
coffee_filepath <- "data/coffee.txt"

coffee_data <- read_delim(coffee_filepath, delim = " ")
## Parsed with column specification:
## cols(
##   image = col_double(),
##   brand = col_double(),
##   freq = col_double()
## )

Modeling your data

t.test(), help(t.test)

t.test(dep_var ~ condition, data = example_data)
## 
##  Welch Two Sample t-test
## 
## data:  dep_var by condition
## t = -15.588, df = 95.663, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.453165 -2.673003
## sample estimates:
##      mean in group control mean in group manipulation 
##                   2.325443                   5.388527

pairs.panels(), help(pairs.panels)

pairs.panels() shows a scatterplot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal (see ?pairs.panels).

example_data %>% 
  select(dep_var, confound, age) %>% 
  pairs.panels()

lm(), help(lm)

lm_fit <- lm(dep_var ~ condition + confound, data = example_data)

# print lm_fit by just executing/running the name of the object
lm_fit
## 
## Call:
## lm(formula = dep_var ~ condition + confound, data = example_data)
## 
## Coefficients:
##           (Intercept)  conditionmanipulation               confound  
##                1.8611                 2.1710                 0.4853

summary(), help(summary)

summary(lm_fit)
## 
## Call:
## lm(formula = dep_var ~ condition + confound, data = example_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54595 -0.60775 -0.04598  0.59644  1.90531 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.8611     0.1457  12.778  < 2e-16 ***
## conditionmanipulation   2.1710     0.2316   9.376 2.99e-15 ***
## confound                0.4853     0.0850   5.709 1.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8544 on 97 degrees of freedom
## Multiple R-squared:  0.7849, Adjusted R-squared:  0.7804 
## F-statistic:   177 on 2 and 97 DF,  p-value: < 2.2e-16

Anova(), help(Anova)

Anova(lm_fit, type = "III")
## Anova Table (Type III tests)
## 
## Response: dep_var
##              Sum Sq Df F value    Pr(>F)    
## (Intercept) 119.195  1 163.280 < 2.2e-16 ***
## condition    64.172  1  87.906 2.988e-15 ***
## confound     23.796  1  32.598 1.237e-07 ***
## Residuals    70.810 97                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

More advanced data wrangling and analysis techniques by psychologists, for psychologists

  • R programming for research [.html], a workshop taught by Nick Michalak and Iris Wang at the University of Michigan

More information about tidyverse and the psych package

  • tidyverse: R packages for data science [.html]
  • Using R and psych for personality and psychological research [.html]

RStudio Cheat Sheets

  • RStudio Cheat Sheets [.html]