Probing and Plotting Interactions in R

Last updated on February 16, 2019 14 min read regression, plotting, interactions

What is moderation?

Moderation refers to how some variable modifies the direction or the strength of the association between two variables. In other words, a moderator variable qualifies the relation between two variables. A moderator is not a part of some proposed causal process; instead, it interacts with the relation between two variables in such a way that their relation is stronger, weaker, or opposite in direction—depending on values of the moderator. For example, as room temperature increases, people may report feeling thirstier. But that may depend on how physically fit people are. Maybe physically fit people don’t report feeling thirsty as room temperature increases, or maybe physically fit people—compared to less physically fit people—have a higher room temperature threshold at which they start feeling thirstier. In this example, the product of one predictor variable and the moderator—their interaction—quantifies the moderator’s effect. Statistically, the product term accounts for variability in thirst or water drinking independently of either predictor variable by itself.

What is a simple slope?

In a 2-way interaction, a simple slope represents the relation between two variables (e.g., x and y) at a specific value of a third variable (e.g., a moderator variable). In this sense, a simple slope is a conditional relationship between two variables. For example, if participants are physically fit, then as room temperature increases, thirst also increases.

Model and Conceptual Assumptions

Correct functional form. Your model variables share linear relationships.

No omitted influences. This one is hard: Your model accounts for all relevant influences on the variables included. All models are wrong, but how wrong is yours?

Accurate measurement. Your measurements are valid and reliable. Note that unreliable measures can’t be valid, and reliable measures don’t necessairly measure just one construct or even your construct.

Well-behaved residuals. Residuals (i.e., prediction errors) aren’t correlated with predictor variables or eachother, and residuals have constant variance across values of your predictor variables.

Libraries

library(tidyverse)
library(knitr)
library(psych)
library(effects)
library(multcomp)

# In the code belo,w I want select from the dplyr package from the tidyverse
select <- dplyr::select

Data: Example 1 (categorical x continuous interaction)

I combined the data from Table 3.1 in Mackinnon (2008, p. 56) [mackinnon_2008_t3.1.csv] with those from Table 10.1 in Mackinnon (2008, p. 291) [mackinnon_2008_t10.1.csv]

thirst_norm <- read_csv("https://github.com/nmmichalak/academic-kickstart/raw/master/static/post/2019-02-14-testing-conditional-indirect-effects-mediation-in-r/2019-02-14-testing-conditional-indirect-effects-mediation-in-r_files/data/mackinnon_2008_t3.1.csv")
thirst_fit <- read_csv("https://github.com/nmmichalak/academic-kickstart/raw/master/static/post/2019-02-13-testing-indirect-effects-mediation-in-r/data/mackinnon_2008_t10.1.csv")

Code new IDs for fit data

thirst_fit$id <- 51:100

Add column in both datasets that identifies fitness group

Unfit = -0.5 and Fit = 0.5

thirst_norm$phys_fit <- -0.5
thirst_fit$phys_fit <- 0.5

Bind unfit and fit data by rows

Imagine stacking these datasets on top of eachother

thirst_data <- bind_rows(thirst_norm, thirst_fit)

Mean-center predictors

i.e., mean-center everything but the consume variable

thirst_data <- thirst_data %>% mutate(room_temp_c = room_temp - mean(room_temp),
                                      thirst_c = thirst - mean(thirst))

Print first and last five observations

thirst_data %>% 
  headTail() %>% 
  kable()

id	room_temp	thirst	consume	phys_fit	room_temp_c	thirst_c
1	70	4	3	-0.5	-0.13	0.87
2	71	4	3	-0.5	0.87	0.87
3	69	1	3	-0.5	-1.13	-2.13
4	70	1	3	-0.5	-0.13	-2.13
…	…	…	…	…	…	…
97	71	4	4	0.5	0.87	0.87
98	71	4	5	0.5	0.87	0.87
99	70	3	3	0.5	-0.13	-0.13
100	71	4	3	0.5	0.87	0.87

Visualize relationships

It’s always a good idea to look at your data. Check some assumptions.

thirst_data %>% 
  select(room_temp, room_temp_c, thirst, thirst_c, consume, phys_fit) %>% 
  pairs.panels()

Fit linear model

flm01 <- lm(thirst ~ room_temp_c * phys_fit, data = thirst_data)

Summarize linear model

summary(flm01)

## 
## Call:
## lm(formula = thirst ~ room_temp_c * phys_fit, data = thirst_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.99905 -0.90020 -0.06908  0.66235  2.00095 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.1406     0.1004  31.273  < 2e-16 ***
## room_temp_c            0.5498     0.1015   5.416  4.5e-07 ***
## phys_fit               0.1950     0.2008   0.971   0.3341    
## room_temp_c:phys_fit   0.4225     0.2030   2.081   0.0401 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.003 on 96 degrees of freedom
## Multiple R-squared:  0.2415, Adjusted R-squared:  0.2178 
## F-statistic: 10.19 on 3 and 96 DF,  p-value: 6.918e-06

Print 95% confidence intervals

confint(flm01)

##                            2.5 %    97.5 %
## (Intercept)           2.94122187 3.3399029
## room_temp_c           0.34833220 0.7513491
## phys_fit             -0.20369699 0.5936651
## room_temp_c:phys_fit  0.01947918 0.8255130

Plot scatterplot with slope estimates for each fitness group

Save table of predicted values

term: Which interaction term are you interested in?

mod: What model are you using the make predictions?

x.var: Which variable would you want to see on your x-axis?

xlevels: Give this argument a list of vectors (i.e., strings of values) whose values you want to make predictions for. If you have two conditions, you probably want to make predictions for the values that represent both conditions. If you have a moderator variable, you probably want to make predictions across a range of relevant, meaningful values of your moderator. Importantly, there should be people in your dataset who have scores at (or very close to) those values. In other words, be careful not to make predictions outside your variables’ observed range of values (unless that’s your goal).

pred_table01 <- effect(term = "room_temp_c:phys_fit", mod = flm01, x.var = "room_temp_c", xlevels = list(phys_fit = c(-0.5, 0.5), room_temp_c = seq(-3, 3, 1))) %>%
  as_tibble %>% 
  mutate(phys_fit_lbl = phys_fit %>% recode(`-0.5` = "Normal", `0.5` = "Fit"))

# print table
kable(pred_table01)

room_temp_c	phys_fit	fit	se	lower	upper	phys_fit_lbl
-3	-0.5	2.0272925	0.4095873	1.2142681	2.840317	Normal
-2	-0.5	2.3658851	0.2946487	1.7810120	2.950758	Normal
-1	-0.5	2.7044778	0.1939502	2.3194896	3.089466	Normal
0	-0.5	3.0430704	0.1419796	2.7612431	3.324898	Normal
1	-0.5	3.3816630	0.1855867	3.0132763	3.750050	Normal
2	-0.5	3.7202556	0.2836712	3.1571727	4.283339	Normal
3	-0.5	4.0588482	0.3977926	3.2692361	4.848460	Normal
-3	0.5	0.9547883	0.4906478	-0.0191399	1.928717	Fit
-2	0.5	1.7158770	0.3413438	1.0383150	2.393439	Fit
-1	0.5	2.4769657	0.2073590	2.0653613	2.888570	Fit
0	0.5	3.2380544	0.1420630	2.9560616	3.520047	Fit
1	0.5	3.9991431	0.2192441	3.5639471	4.434339	Fit
2	0.5	4.7602319	0.3558876	4.0538006	5.466663	Fit
3	0.5	5.5213206	0.5059109	4.5170953	6.525546	Fit

Save character variable with condition labels

thirst_data <- thirst_data %>% 
  mutate(phys_fit_lbl = phys_fit %>% recode("-0.5" = "Normal", "0.5" = "Fit"))

ggplot2

Read this code line by line. The idea is that you’re first generating a scatterplot with your raw values and then you’re “adding (+)” layers which use the predicted values your tabled above. Put another way, geom_line() and geom_ribbon() are using data from the table of predicted values; geom_point() is using data from your dataset.

thirst_data %>% 
  ggplot(mapping = aes(x = room_temp_c, y = thirst)) +
  geom_point(mapping = aes(color = phys_fit_lbl)) +
  geom_line(data = pred_table01, mapping = aes(x = room_temp_c, y = fit, linetype = phys_fit_lbl)) +
  geom_ribbon(data = pred_table01, mapping = aes(x = room_temp_c, y = fit, ymin = lower, ymax = upper, fill = phys_fit_lbl), alpha = 0.5) +
  scale_x_continuous(breaks = pretty(thirst_data$room_temp_c)) +
  scale_y_continuous(breaks = pretty(thirst_data$thirst)) +
  scale_color_manual(values = c("blue", "red")) +
  scale_fill_manual(values = c("blue", "red")) +
  theme(legend.position = "top")

## Warning: Ignoring unknown aesthetics: y

Test simple slopes

Save matrix of contrasts

Each coefficient gets a contrast weight.

0 means don’t use it; cross it out.

intercept = 0, room_temp_c = 1, phys_fit = 0, room_temp_c:phys_fit = -0.5

In words, test the linear effect of room_temp_c when phys_fit = -0.5 (normal participants)

contmat01 <- rbind(normal = c(0, 1, 0, -0.5),
                   fit = c(0, 1, 0, 0.5))

Save general linear hypothesis object output from `glht()`

glht() takes your model and your contrast matrix you made above.

glht01 <- glht(model = flm01, linfct = contmat01)

Contrast summary

test = adjusted("none") means, “Don’t correct for multiple comparisons.”

summary(glht01, test = adjusted("none"))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = thirst ~ room_temp_c * phys_fit, data = thirst_data)
## 
## Linear Hypotheses:
##             Estimate Std. Error t value Pr(>|t|)    
## normal == 0   0.3386     0.1260   2.688  0.00848 ** 
## fit == 0      0.7611     0.1592   4.780 6.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

Interpretation

Holding all other predictors at 0, every 1°F increase in room temperature was associated with a b = 0.55 unit increase in thirst, t(96) = 5.42, 95% CI[0.35, 0.75], p < .001. Holding all other predictors at 0, there was no sufficient evidence that physically fit participants reported different mean thirst units, b = 0.20, than normal participants, t(96) = 0.97, 95% CI[-0.20, 0.59], p = .334. However, the association between room temperature and thirst was different between fitness groups, b = 0.42, t(96) = 2.08, 95% CI[0.02, 0.83], p = .040. Among normal participants, every 1°F increase in room temperature was associated with a b = 0.34 unit increase in thirst, t(96) = 2.69, p = .008; in contrast, among physically fit participants, every 1°F increase in room temperature was associated with a b = 0.76 unit increase in thirst, t(96) = 4.78, p < .001.

Data: Example 1 (continuous x continuous interaction)

Data come from the example in Chapter 7 from Cohen, Cohen, Aiken, & West (2003, p. 276) [C07E01DT.txt]

endurance <- "data/C07E01DT.txt" %>% read_table(col_names = c("id", "xage", "zexer", "yendu"))

Mean-center predictors

i.e., mean-center everything but the yendu variable

endurance <- endurance %>% mutate(xage_c = xage - mean(xage),
                                  zexer_c = zexer - mean(zexer))

Print first and last five observations

endurance %>% 
  headTail() %>% 
  kable()

id	xage	zexer	yendu	xage_c	zexer_c
1	60	10	18	10.82	-0.67
2	40	9	36	-9.18	-1.67
3	29	2	51	-20.18	-8.67
4	47	10	18	-2.18	-0.67
…	…	…	…	…	…
247	45	9	37	-4.18	-1.67
248	60	7	0	10.82	-3.67
249	57	11	18	7.82	0.33
250	56	12	24	6.82	1.33

Visualize relationships

It’s always a good idea to look at your data. Check some assumptions.

endurance %>% 
  select(xage, zexer, yendu, xage_c, zexer_c) %>% 
  pairs.panels()

Fit linear model

flm02 <- lm(yendu ~ zexer_c * xage_c, data = endurance)

Summarize linear model

summary(flm02)

## 
## Call:
## lm(formula = yendu ~ zexer_c * xage_c, data = endurance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.165  -6.939   0.269   6.300  21.299 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    25.88872    0.64662  40.037  < 2e-16 ***
## zexer_c         0.97272    0.13653   7.124 1.20e-11 ***
## xage_c         -0.26169    0.06406  -4.085 6.01e-05 ***
## zexer_c:xage_c  0.04724    0.01359   3.476 0.000604 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.7 on 241 degrees of freedom
## Multiple R-squared:  0.2061, Adjusted R-squared:  0.1962 
## F-statistic: 20.86 on 3 and 241 DF,  p-value: 4.764e-12

Print 95% confidence intervals

confint(flm02)

##                      2.5 %      97.5 %
## (Intercept)    24.61497596 27.16246675
## zexer_c         0.70376546  1.24166505
## xage_c         -0.38788631 -0.13549322
## zexer_c:xage_c  0.02046862  0.07402086

Plot scatterplot with slope estimates for percentiles

Save quantiles of mean-centered age

You’ll use these for plotting and simple slopes.

(xage_cqs <- quantile(endurance$xage_c))

##         0%        25%        50%        75%       100% 
## -29.183673  -6.183673  -1.183673   6.816327  32.816327

Save table of predicted values

xage_cqs[2:4] means, “subset the 2nd, 3rd, and 4th value in xage_cqs”

term: Which interaction term are you interested in?

mod: What model are you using to make predictions?

x.var: Which variable would you want to see on your x-axis?

xlevels: Give this argument a list of vectors (i.e., strings of values) whose values you want to make predictions for. If you have two conditions, you probably want to make predictions for the values that represent both conditions. If you have a moderator variable, you probably want to make predictions across a range of relevant, meaningful values of your moderator. Importantly, there should be people in your dataset who have scores at (or very close to) those values. In other words, be careful not make predictions outside your variables’ observed range of values (unless that’s your goal).

pred_table02 <- effect(term = "zexer_c:xage_c", mod = flm02, x.var = "zexer_c", xlevels = list(zexer_c = pretty(endurance$zexer_c), xage_c = xage_cqs[2:4])) %>%
  as_tibble %>% 
  mutate(xage_c_lbl = xage_c %>% factor(levels = xage_cqs[2:4], labels = c("25th Percentile", "50th Percentile", "75th Percentile")))

# print table
kable(pred_table02)

zexer_c	xage_c	fit	se	lower	upper	xage_c_lbl
-15	-6.183674	17.298387	2.2198327	12.925636	21.67114	25th Percentile
-10	-6.183674	20.701233	1.5254098	17.696395	23.70607	25th Percentile
-5	-6.183674	24.104079	0.9354140	22.261448	25.94671	25th Percentile
0	-6.183674	27.506925	0.7563266	26.017071	28.99678	25th Percentile
5	-6.183674	30.909772	1.1907841	28.564098	33.25545	25th Percentile
10	-6.183674	34.312618	1.8473793	30.673546	37.95169	25th Percentile
15	-6.183674	37.715464	2.5605781	32.671493	42.75943	25th Percentile
20	-6.183674	41.118310	3.2938149	34.629968	47.60665	25th Percentile
-15	-1.183674	12.446583	2.1146888	8.280950	16.61222	50th Percentile
-10	-1.183674	17.030547	1.4844112	14.106471	19.95462	50th Percentile
-5	-1.183674	21.614512	0.9240868	19.794194	23.43483	50th Percentile
0	-1.183674	26.198477	0.6506056	24.916877	27.48008	50th Percentile
5	-1.183674	30.782441	0.9547410	28.901739	32.66314	50th Percentile
10	-1.183674	35.366406	1.5227162	32.366874	38.36594	50th Percentile
15	-1.183674	39.950370	2.1551544	35.705026	44.19571	50th Percentile
20	-1.183674	44.534335	2.8088446	39.001315	50.06735	50th Percentile
-15	6.816326	4.683696	2.9239000	-1.075966	10.44336	75th Percentile
-10	6.816326	11.157450	2.0956730	7.029276	15.28562	75th Percentile
-5	6.816326	17.631204	1.3214234	15.028190	20.23422	75th Percentile
0	6.816326	24.104958	0.7823900	22.563763	25.64615	75th Percentile
5	6.816326	30.578713	0.9948712	28.618959	32.53847	75th Percentile
10	6.816326	37.052467	1.6967803	33.710054	40.39488	75th Percentile
15	6.816326	43.526221	2.5059964	38.589768	48.46267	75th Percentile
20	6.816326	49.999975	3.3455392	43.409743	56.59021	75th Percentile

ggplot2

Read this code line by line. The idea is that you’re first generating a scatterplot with your raw values and then you’re “adding (+)” layers which use the predicted values your tabled above. Put another way, geom_line() and geom_ribbon() are using data from the table of predicted values; geom_point() is using data from your dataset. The colors here are probably overkill, but the idea is that darker red values mean those participants are older; in this way, color gives you a sense of the distribution of participant age without plotting another dimension.

endurance %>% 
  ggplot(mapping = aes(x = zexer_c, y = yendu)) +
  geom_point(mapping = aes(color = xage_c)) +
  geom_line(data = pred_table02, mapping = aes(x = zexer_c, y = fit, linetype = xage_c_lbl)) +
  geom_ribbon(data = pred_table02, mapping = aes(x = zexer_c, y = fit, ymin = lower, ymax = upper, fill = xage_c_lbl), alpha = 0.5) +
  scale_x_continuous(breaks = pretty(endurance$zexer_c)) +
  scale_y_continuous(breaks = pretty(endurance$yendu)) +
  scale_color_gradient(low = "blue", high = "red") +
  scale_fill_manual(values = c("blue", "purple", "red")) +
  theme(legend.position = "top")

## Warning: Ignoring unknown aesthetics: y

Test simple slopes

Save matrix of contrasts

Each coefficient gets a contrast weight.

0 means don’t use it; cross it out.

intercept = 0, zexer_c = 1, xage_c = 0, zexer_c:xage_c = 25th percentile of age

In words, test the linear effect of zexer_c when xage_c = 25th percentile of age (younger participants)

contmat02 <- rbind(xage_c25 = c(0, 1, 0, xage_cqs[2]),
                   xage_c50 = c(0, 1, 0, xage_cqs[3]),
                   xage_c75 = c(0, 1, 0, xage_cqs[4]))

Save general linear hypothesis object output from `glht()`

glht() takes your model and your contrast matrix you made above.

glht02 <- glht(model = flm02, linfct = contmat02)

Contrast summary

test = adjusted("none") means, “Don’t correct for multiple comparisons.”

summary(glht02, test = adjusted("none"))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = yendu ~ zexer_c * xage_c, data = endurance)
## 
## Linear Hypotheses:
##               Estimate Std. Error t value Pr(>|t|)    
## xage_c25 == 0   0.6806     0.1516   4.490 1.11e-05 ***
## xage_c50 == 0   0.9168     0.1356   6.763 1.02e-10 ***
## xage_c75 == 0   1.2948     0.1739   7.446 1.70e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

Interpretation

Holding all other predictors at 0, every 1 year increase in vigorous exercise (mean-centered) was associated with a b = 0.97 minute increase in jogging on a treadmill (i.e., endurance), t(241) = 7.12, 95% CI[0.70, 1.24], p < .001. Holding all other predictors at 0, every 1 year increase in age (mean-centered) was associated with a b = -0.26 minute decrease in jogging on a treadmill, t(241) = 4.09, 95% CI[-0.39,, -0.14], p = .334. However, the association between years of vigorous exercise and minutes jogging on a treadmill was different at different ages, b = 0.45, t(241) = 3.48, 95% CI[0.02, 0.07], p < .001. Among people at the 25th percentile of age in the sample, every 1 year increase in vigorous exercise (mean-centered) was associated with a b = 0.68 minute increase in jogging on a treadmill, t(241) = 4.49, p < .001; in contrast, among people at the 75th percentile of age in the sample, every 1 year increase in vigorous exercise (mean-centered) was associated with a b = 1.29 minute increase in jogging on a treadmill, t(241) = 7.45, p < .001.

Resources

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. New York, NY: Routledge.

MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York, NY: Lawrence Erlbaum Associates.

Revelle, W. (2017) How to use the psych package for mediation/moderation/regression analysis. [.pdf]

General word of caution

Above, I listed resources prepared by experts on these and related topics. Although I generally do my best to write accurate posts, don’t assume my posts are 100% accurate or that they apply to your data or research questions. Trust statistics and methodology experts, not blog posts.

regression plotting interactions

Probing and Plotting Interactions in R

What is moderation?

What is a simple slope?

Model and Conceptual Assumptions

Libraries

Data: Example 1 (categorical x continuous interaction)

Code new IDs for fit data

Add column in both datasets that identifies fitness group

Bind unfit and fit data by rows

Mean-center predictors

Print first and last five observations

Visualize relationships

Fit linear model

Summarize linear model

Print 95% confidence intervals

Plot scatterplot with slope estimates for each fitness group

Save table of predicted values

Save character variable with condition labels

ggplot2

Test simple slopes

Save matrix of contrasts

Save general linear hypothesis object output from glht()

Contrast summary

Interpretation

Data: Example 1 (continuous x continuous interaction)

Mean-center predictors

Print first and last five observations

Visualize relationships

Fit linear model

Summarize linear model

Print 95% confidence intervals

Plot scatterplot with slope estimates for percentiles

Save quantiles of mean-centered age

Save table of predicted values

ggplot2

Test simple slopes

Save matrix of contrasts

Save general linear hypothesis object output from glht()

Contrast summary

Interpretation

Resources

General word of caution

Nicholas M. Michalak

Data Scientist | Ph.D. Social Psychology

Related

Save general linear hypothesis object output from `glht()`

Save general linear hypothesis object output from `glht()`