What is a correlation?

A correlation quantifies the linear association between two variables. From one perspective, a correlation has two parts: one part quantifies the association, and the other part sets the scale of that association.

The first part is the covariance, which is also the numerator of the correlation. It equates to a sort of “average cross-product” of the two variables' deviations from their means:

\(cov_{(X, Y)} = \frac{\sum(X - \bar X)(Y - \bar Y)}{N - 1}\)

It could be easier to interpret the covariance as an “average of the X-Y matches”: Deviations of X scores above the X mean multiplied by deviations of Y scores below the Y mean will be negative, and deviations of X scores above the X mean multiplied by deviations of Y scores above the Y mean will be positive (as will two below-mean deviations multiplied together). More “mismatches” lead to a negative covariance and more “matches” lead to a positive covariance.
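The “matches” reading can be checked by hand. The sketch below uses two made-up vectors (x and y are illustrative values, not taken from any dataset) and compares the manual calculation with the built-in cov().

```r
# Illustrative values only; not taken from any dataset
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Deviations of each score from its own mean
dev_x <- x - mean(x)
dev_y <- y - mean(y)

# Positive products are "matches" (both deviations share a sign);
# negative products are "mismatches" (the deviations have opposite signs)
cross_products <- dev_x * dev_y

# Sum the cross-products and divide by N - 1
cov_manual <- sum(cross_products) / (length(x) - 1)

# The manual value should agree with the built-in function
all.equal(cov_manual, cov(x, y))
```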

The second part—the product of the standard deviations, also the correlation denominator—restricts the association to values from -1.00 to 1.00.

\(\sqrt{var_X var_Y} = \sqrt{\frac{\sum(X - \bar X)^2}{N - 1} \frac{\sum(Y - \bar Y)^2}{N - 1}}\)

Divide the numerator by the denominator and you get a sort of “ratio of the sum of squares”, the Pearson correlation coefficient:

\(r_{XY} = \frac{\frac{\sum(X - \bar X)(Y - \bar Y)}{N - 1}}{\sqrt{\frac{\sum(X - \bar X)^2}{N - 1} \frac{\sum(Y - \bar Y)^2}{N - 1}}} = \frac{cov_{(X, Y)}}{\sqrt{var_X var_Y}}\)
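As a rough check of this ratio, the sketch below divides the covariance by the square root of the product of the variances and compares the result with the built-in cor(). The vectors x and y are made-up illustrative values.

```r
# Illustrative values only; not taken from any dataset
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Numerator: the covariance
cov_xy <- cov(x, y)

# Denominator: square root of the product of the two variances
# (equivalently, the product of the two standard deviations)
denominator <- sqrt(var(x) * var(y))

r_manual <- cov_xy / denominator

# The manual value should agree with the built-in function
all.equal(r_manual, cor(x, y))
```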

Square this “standardized covariance” for an estimate of the proportion of variance of Y that can be accounted for by a linear function of X, \(R^2_{XY}\).
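For the bivariate case, the squared correlation should match the R-squared that lm() reports. A minimal sketch with made-up vectors (x and y are illustrative, not from any dataset):

```r
# Illustrative values only; not taken from any dataset
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Squared Pearson correlation
r_squared <- cor(x, y)^2

# R-squared from a regression of y on x
fit <- lm(y ~ x)
r_squared_lm <- summary(fit)$r.squared

# The two values should agree
all.equal(r_squared, r_squared_lm)
```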

By the way, the correlation equation is very similar to the equation for the slope (beta coefficient) in a bivariate linear regression of Y on X. The difference is in the denominator: the regression slope divides the covariance by the X variance alone, rather than by the product of the X and Y standard deviations:

\(\hat{\beta} = \frac{\frac{\sum(X - \bar X)(Y - \bar Y)}{N - 1}}{\frac{\sum(X - \bar X)^2}{N - 1}} = \frac{cov_{(X, Y)}}{var_X}\)
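A quick numerical check against lm() may help here: the least-squares slope of Y on X equals the covariance divided by the variance of X, which is the same as the correlation rescaled by the ratio of the standard deviations. The vectors below are made-up illustrative values.

```r
# Illustrative values only; not taken from any dataset
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Slope as covariance divided by the variance of the predictor
slope_manual <- cov(x, y) / var(x)

# Same slope expressed through the correlation
slope_from_r <- cor(x, y) * sd(y) / sd(x)

# Slope estimated by lm(); unname() drops the coefficient label
fit <- lm(y ~ x)
slope_lm <- unname(coef(fit)["x"])

all.equal(slope_manual, slope_lm)
all.equal(slope_from_r, slope_lm)
```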

What does it mean to “adjust” a correlation?

An adjusted correlation refers to the (square root of the) change in a regression model’s \(R^2\) after adding a single predictor to the model: \(R^2_{full} - R^2_{reduced}\). This change quantifies that additional predictor’s “unique” contribution to observed variance explained. Put another way, this value quantifies observed variance in Y explained by a linear function of X after removing variance shared between X and the other predictors in the model.
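The change in \(R^2\) can be sketched with two nested models. The example below is only an illustration: it uses the mtcars data that ships with R, and the choice of mpg as the outcome with wt and hp as predictors is arbitrary. Signing the square root by the added predictor's coefficient is a convention assumed here, not something stated above.

```r
# mtcars ships with R; mpg, wt, and hp are used purely for illustration
reduced <- lm(mpg ~ wt, data = mtcars)
full <- lm(mpg ~ wt + hp, data = mtcars)

# Change in R-squared after adding hp to the model
r2_change <- summary(full)$r.squared - summary(reduced)$r.squared

# Square root of the change, signed by the added predictor's coefficient
# (assumed convention)
adjusted_r <- unname(sign(coef(full)["hp"]) * sqrt(r2_change))

# Equivalent route: correlate mpg with the part of hp that wt cannot predict
hp_unique <- resid(lm(hp ~ wt, data = mtcars))
adjusted_r_check <- cor(mtcars$mpg, hp_unique)

all.equal(adjusted_r, adjusted_r_check)
```

The second route makes the “after removing shared variance” idea concrete: hp_unique is what is left of hp once its linear overlap with wt has been taken out.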

Model and Conceptual Assumptions for Linear Regression

  • Correct functional form. Your model variables share linear relationships.
  • No omitted influences. This one is hard: Your model accounts for all relevant influences on the variables included. All models are wrong, but how wrong is yours?
  • Accurate measurement. Your measurements are valid and reliable. Note that unreliable measures can’t be valid, and reliable measures don’t necessarily measure just one construct or even your construct.
  • Well-behaved residuals. Residuals (i.e., prediction errors) aren’t correlated with predictor variables or each other, and residuals have constant variance across values of your predictor variables.
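Some of these assumptions can be eyeballed from a fitted model. The sketch below is one possible check, using the mtcars data that ships with R (mpg and wt are arbitrary illustrative choices). Note that in-sample least-squares residuals are uncorrelated with the predictors by construction, so a near-zero correlation there says nothing about omitted influences; the plot is the more informative part.

```r
# mtcars ships with R; mpg and wt are used purely for illustration
fit <- lm(mpg ~ wt, data = mtcars)
res <- resid(fit)

# Residuals against fitted values: curvature hints at a wrong functional
# form, and a funnel shape hints at non-constant variance
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Essentially zero by construction for in-sample least-squares residuals
res_predictor_cor <- cor(res, mtcars$wt)
res_predictor_cor
```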

Libraries

# library("tidyverse")
# library("knitr")
# library("effects")
# library("psych")
# library("candisc")

library(tidyverse)
library(knitr)
library(effects)
library(psych)
library(candisc)

# make select() and recode() point to the dplyr versions,
# in case another loaded package masks them
select <- dplyr::select
recode <- dplyr::recode

Load data

From help("HSB"): “The High School and Beyond Project was a longitudinal study of students in the U.S. carried out in 1980 by the National Center for Education Statistics. Data were collected from 58,270 high school students (28,240 seniors and 30,030 sophomores) and 1,015 secondary schools. The HSB data frame is sample of 600 observations, of unknown characteristics, originally taken from Tatsuoka (1988).”

HSB <- as_tibble(HSB)

# print a random subset of rows from the dataset
HSB %>% sample_n(size = 15) %>% kable()
| id  | gender | race         | ses    | sch     | prog     | locus | concept | mot  | career     | read | write | math | sci  | ss   |
|-----|--------|--------------|--------|---------|----------|-------|---------|------|------------|------|-------|------|------|------|
| 355 | female |              |        |         |          |       |         |      |            |      |       |      |      |      |
| 222 | male   | white        | low    | public  | general  | -0.40 | -0.89   | 0.33 | operative  | 41.6 | 56.7  | 45.0 | 50.4 | 43.1 |
| 575 | female | white        | middle | private | general  | 0.68  | 0.03    | 0.00 | prof2      | 44.2 | 35.9  | 43.6 | 47.1 | 40.6 |
| 335 |        |              |        |         |          | 0.93  | 0.34    | 1.00 | prof2      | 46.9 | 54.1  | 54.6 | 55.3 | 50.6 |
| 42  | male   | hispanic     | middle | public  | academic | 0.03  | 0.28    | 0.33 | operative  | 52.1 | 54.1  | 52.0 | 55.3 | 50.6 |
| 413 | female | white        | middle | public  | general  | 0.46  | 0.34    | 0.67 | prof1      | 62.7 | 61.9  | 52.9 | 44.4 | 40.6 |
| 221 | male   | white        | low    | public  | general  | -0.36 | -1.67   | 1.00 | farmer     | 44.2 | 43.7  | 56.4 | 58.0 | 60.5 |
| 254 | female | white        | high   | public  | academic | 0.48  | -0.47   | 0.33 | prof1      | 52.1 | 61.9  | 55.5 | 60.7 | 60.5 |
| 463 | male   | white        | middle | public  | general  | -0.11 | 0.25    | 1.00 | prof1      | 44.2 | 44.3  | 45.6 | 39.0 | 50.6 |
| 294 | male   | white        | low    | public  | academic | -0.82 | -0.76   | 0.00 | clerical   | 57.4 | 43.7  | 59.6 | 52.6 | 50.6 |
| 419 | male   | white        | middle | public  | vocation | -0.19 | 0.03    | 0.33 | craftsman  | 54.8 | 51.5  | 42.8 | 60.7 | 50.6 |
| 349 |        |              |        |         |          | 0.51  | 0.03    | 0.33 | manager    | 60.1 | 51.5  | 53.9 | 63.4 | 50.6 |
| 486 | female | white        | middle | public  | academic | 0.53  | 0.81    | 0.67 | prof1      | 54.8 | 59.3  | 61.4 | 47.1 | 55.6 |
| 458 | male   | white        | middle | public  | academic | 0.46  | 0.65    | 1.00 | technical  | 49.5 | 48.9  | 60.5 | 55.3 | 55.6 |
| 137 | male   | african-amer | middle | public  | academic | -0.37 | -1.90   | 0.67 | manager    | 54.8 | 36.5  | 37.7 | 49.8 | 60.5 |
| 200 | male   | white        | high   | public  | academic | -0.27 | 0.88    | 1.00 | sales      | 52.1 | 64.5  | 60.6 | 60.7 | 45.6 |
| 440 | female | white        | high   | public  | academic | 1.36  | 0.94    | 1.00 | homemaker  | 52.1 | 48.9  | 51.3 | 41.7 | 45.6 |
| 367 | male   | white        | high   | public  | vocation | -1.50 | 0.03    | 0.67 | prof1      | 33.6 | 48.9  | 38.6 | 42.3 | 55.6 |
| 354 |        |              |        |         | general  | 0.70  | -0.16   | 0.33 | prof1      | 68.0 | 59.3  | 55.7 | 63.4 | 65.5 |
| 277 | female | white        | high   | public  | academic | -0.60 | -1.18   | 0.67 | clerical   | 54.8 | 59.3  | 68.0 | 49.3 | 65.5 |
| 11  | female | hispanic     | low    | public  | academic | 0.25  | 0.34    | 1.00 | prof1      | 49.5 | 61.9  | 42.9 | 41.7 | 50.6 |
| 390 | female | white        | high   | public  | academic | 0.45  | 0.03    | 0.67 | prof1      | 60.1 | 61.9  | 51.9 | 53.1 | 58.1 |
| 307 | male   |              |        |         |          | 1.11  | 0.34    | 1.00 | prof2      | 73.3 | 67.1  | 62.3 | 58.0 | 65.5 |
| 173 | female | white        | low    | public  | general  | -0.61 | 0.03    | 0.33 | proprietor | 44.2 | 54.1  | 40.3 | 52.6 | 40.6 |
| 524 | female | white        | middle | private | academic | -0.66 | -1.07   | 0.67 | clerical   | 49.5 | 61.9  | 60.4 | 47.1 | 50.6 |
| 264 | female | white        | low    | public  | academic | 0.46  | 0.34    | 1.00 | prof1      | 76.0 | 52.1  | 64.1 | 63.9 | 60.5 |
| 523 | female | white        | high   | private | general  | 0.68  | 0.32    | 1.00 | service    | 36.3 | 56.7  | 41.9 | 49.8 | 40.6 |
| 522 | male   | white        | middle | private | academic | 0.00  | 0.65    | 1.00 | military   | 52.1 | 61.9  | 62.1 | 58.0 | 60.5 |
|     |        |              |        |         | general  | -0.80 | 0.15    | 0.33 | service    | 41.6 | 41.1  | 39.5 | 47.1 | 60.5 |
| 404 | female | white        | middle | public  | general  | -0.38 | -0.47   | 0.67 | homemaker  | 62.7 | 43.7  | 44.7 | 52.6 | 41.9 |
| 493 | male   | white        | low    | public  | vocation | -0.86 | 0.28    | 1.00 | farmer     | 36.3 | 48.9  | 54.4 | 60.7 | 35.6 |

Do students who score higher on a standardized math test tend to score higher on a standardized science test?

Scatterplot

In the code below, alpha sets the points’ transparency (0.5 = 50%), method = "lm" fits a linear model, and se controls whether standard error bands are drawn around the fitted line.

HSB %>% 
  ggplot(mapping = aes(x = math, y = sci)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red")