Correlation

Table of Contents

  1. Correlation
  2. Correlation Coefficient
  3. Interactive with correlations from -0.99 to +0.99
  4. Correlation doesn’t logically imply causation
  5. Correlation can be evidence of causation
  6. Correlation and Linear Regression
    1. Data points for variables {x, y}
    2. Correlation Coefficient
    3. Regression Equation
    4. Plot of Data Points and Regression Equation
    5. Predictions from the Regression Equation
  7. Mill’s Method of Concomitant Variation
  8. Math of the Correlation Coefficient
  9. P-Value for Correlation
    1. Pearson Correlation Test
    2. Calculation using T-value
    3. Monte Carlo Simulation using Permutations of ys

Correlation

The correlation between two variables is the extent to which they vary together, in the same or opposite directions.

Correlation Coefficient

  • The correlation coefficient, a number from -1 to +1, measures the degree of correlation.
  • Consider hypothetical measurements of wind speed and the angular velocity of a wind turbine.
  • A scatter plot of RPM against MPH.
  • The variables are correlated. But by how much?
  • In the 1890s the British statistician Karl Pearson developed the correlation coefficient.
    • The correlation coefficient runs from -1 to +1. 
    • A positive value indicates the variables vary in the same direction.
    • A negative value indicates the variables vary in opposite directions.
    • The greater the absolute value, the greater the correlation.
    • Zero means no linear correlation.
  • The correlation coefficient for wind speed and angular velocity of the wind turbine is 0.957972, a strong positive correlation.
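
The sign and size rules above can be checked with a minimal sketch. The helper name pearson_r and the three small data sets are assumptions made for illustration; they are not the wind turbine measurements.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative values only (hypothetical, symmetric around zero).
xs = [-3, -2, -1, 0, 1, 2, 3]

same_direction = pearson_r(xs, [2 * x + 1 for x in xs])       # y rises with x
opposite_direction = pearson_r(xs, [-2 * x + 1 for x in xs])  # y falls as x rises
no_linear_trend = pearson_r(xs, [x * x for x in xs])          # y depends on x, but not linearly

print(same_direction, opposite_direction, no_linear_trend)
```

The third case is worth noting: y is completely determined by x, yet the coefficient is about zero, because the coefficient only measures how well a straight line describes the relationship.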

Interactive with correlations from -0.99 to +0.99

View Correlation Interactive

Correlation doesn’t logically imply causation

  • In New York City Joe sneezed each day of the first week in August. In Texas Bill cursed each day of the same week. A table of their daily sneezes and curses:
  • The correlation coefficient is 1.0, a perfect positive correlation. But there’s no causal connection. Correlation by itself does not logically imply a causal connection.
  • Inferring causation from mere statistical correlation is the fallacy of cum hoc ergo propter hoc (“with this, therefore because of this”).
  • To infer causation you need evidence beyond mere correlation.
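
A small sketch of how two unrelated series can produce a perfect coefficient. The daily counts below are hypothetical, chosen only so that the two series rise in lockstep; the names joe_sneezes and bill_curses are likewise assumptions.

```python
from math import sqrt

# Hypothetical counts for the seven days (assumed for illustration).
joe_sneezes = [1, 2, 3, 4, 5, 6, 7]
bill_curses = [2, 4, 6, 8, 10, 12, 14]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(joe_sneezes, bill_curses)
print(round(r, 6))  # a perfect positive correlation, with no causal link assumed
```

Nothing in the calculation refers to how the numbers came about, which is why the coefficient alone cannot establish a causal connection.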

Correlation can be evidence of causation

  • Correlation can be evidence of causal connection in conjunction with other evidence, such as:
    • Other correlations
    • Observational studies
    • Laboratory experiments
    • Randomized controlled trials
    • General knowledge
  • Three kinds of causal connections between variables X and Y
    • X causes Y
    • Y causes X
    • A third (confounding) variable causes both X and Y
  • Causal connections may be complex, for example between income and years of education.
    • Education, Income, and Wealth, Federal Reserve
      • There’s a positive correlation between income and years of education.
      • But…
        • Is the correlation because years of education open up high-paying jobs?
        • Or is it because a person with the drive and wherewithal to go to college has the drive and wherewithal to land a high-paying job?
        • Or is it a combination of the two?
  • Ecological correlations are correlations between variables pertaining to groups rather than individuals, for example, correlations between percentages, rates, and averages.
  • Such correlations provide weaker evidence for causal connections among individuals in the groups than correlations among the individuals themselves.
  • Possible explanations for a correlation between X and Y
    • X causes Y
    • Y causes X
    • A third (confounding) variable causes both X and Y
    • The correlation is a coincidence
      • There’s no causal connection between X and Y
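
The confounding explanation can be illustrated with a simulation. This is a sketch under stated assumptions: Z is a made-up confounder, X and Y are each Z plus independent noise, and the seed and noise levels are arbitrary choices.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Z is the hypothetical confounder. X and Y never influence each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)  # strongly positive, driven entirely by Z

# Subtracting Z leaves only the independent noise terms.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson_r(x_resid, y_resid)  # near zero

print(round(r_xy, 3), round(r_resid, 3))
```

In real data the confounder is rarely known this cleanly, so removing its influence usually takes additional evidence of the kinds listed above.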

Correlation and Linear Regression

Data points for variables {x, y}

{6,13},{9,12},{10,21},{21,4},{17,30},{22,23},{13,25},{28,21},{15,27},{29,30},{33,17},{22,33},{13,17},{33,31},{25,25},{36,30},{28,42},{28,19},{36,35},{36,44},{37,27},{27,32},{24,28},{31,34},{36,44},{45,32},{30,49},{50,28},{49,30},{46,42},{34,50},{50,46},{37,39},{54,34},{57,38},{54,46},{48,38},{63,57},{59,58},{53,62},{43,58},{59,64},{63,66},{65,47},{46,67},{66,54},{66,62},{63,58},{62,55},{73,68},{57,62},{66,57},{60,70},{56,77},{76,55},{66,66},{69,68},{70,78},{80,61},{61,80},{72,86},{70,75},{76,86},{80,69},{70,75},{68,72},{84,82},{77,73},{70,85},{78,77},{81,71},{77,73},{93,96},{80,97},{87,97},{94,91},{98,94},{98,97},{104,100},{100,95}

Correlation Coefficient

0.904976

Regression Equation

y = 0.913 x + 4.77

Plot of Data Points and Regression Equation

Predictions from the Regression Equation

  • Prediction[0] = 4.8556
  • Prediction[50] = 50.4126
  • Prediction[100] = 95.9697
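
A sketch of how the coefficient, the regression line, and the predictions could be recomputed from the data points. It assumes the line was fitted by ordinary least squares, which the text does not state; the names points, slope, intercept, and predict are illustrative.

```python
from math import sqrt

# Data points {x, y} copied from the list above.
points = [
    (6, 13), (9, 12), (10, 21), (21, 4), (17, 30), (22, 23), (13, 25), (28, 21),
    (15, 27), (29, 30), (33, 17), (22, 33), (13, 17), (33, 31), (25, 25), (36, 30),
    (28, 42), (28, 19), (36, 35), (36, 44), (37, 27), (27, 32), (24, 28), (31, 34),
    (36, 44), (45, 32), (30, 49), (50, 28), (49, 30), (46, 42), (34, 50), (50, 46),
    (37, 39), (54, 34), (57, 38), (54, 46), (48, 38), (63, 57), (59, 58), (53, 62),
    (43, 58), (59, 64), (63, 66), (65, 47), (46, 67), (66, 54), (66, 62), (63, 58),
    (62, 55), (73, 68), (57, 62), (66, 57), (60, 70), (56, 77), (76, 55), (66, 66),
    (69, 68), (70, 78), (80, 61), (61, 80), (72, 86), (70, 75), (76, 86), (80, 69),
    (70, 75), (68, 72), (84, 82), (77, 73), (70, 85), (78, 77), (81, 71), (77, 73),
    (93, 96), (80, 97), (87, 97), (94, 91), (98, 94), (98, 97), (104, 100), (100, 95),
]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Sums of squared deviations and cross-products.
sxx = sum((x - mean_x) ** 2 for x, _ in points)
syy = sum((y - mean_y) ** 2 for _, y in points)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

r = sxy / sqrt(sxx * syy)             # correlation coefficient
slope = sxy / sxx                     # least-squares slope
intercept = mean_y - slope * mean_x   # line passes through the point of means

def predict(x):
    """Predicted y for a given x on the fitted line."""
    return slope * x + intercept

print(f"r = {r:.6f}")
print(f"y = {slope:.3f} x + {intercept:.2f}")
for x in (0, 50, 100):
    print(f"Prediction[{x}] = {predict(x):.4f}")
```

The printed values can be compared with the figures reported above; small differences may come from rounding or from a different fitting procedure.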

View Regression

Mill’s Method of Concomitant Variation

  • John Stuart Mill, A System of Logic (1874), pages 287-288
    • “Whatever phenomenon varies in any manner whenever another phenomenon varies in some particular manner, is either a cause or an effect of that phenomenon, or is connected with it through some fact of causation.”
  • britannica.com/topic/Mills-methods
    • The method of concomitant variation is used when E can be present in various degrees; if we identify a factor F, such as temperature, whose variations are positively or negatively correlated with variations in E, for instance, size, then we can infer that F is causally connected with E.

Math of the Correlation Coefficient

Correlation coefficient for x and y = covariance of x and y / ( (standard deviation of x) (standard deviation of y) )
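
The formula can be written out directly as a short sketch. It uses the sample covariance and sample standard deviation (both dividing by n - 1); the population versions give the same coefficient because the divisors cancel. The data values are hypothetical.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical values, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 4))
```

Dividing by the standard deviations removes the units of x and y, which is why the result always lies between -1 and +1.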

P-Value for Correlation

Pearson Correlation Test
Calculation using T-value
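
A minimal sketch of the usual t-value for a correlation, t = r * sqrt((n - 2) / (1 - r^2)), assuming the null hypothesis of zero correlation. It is applied here to the coefficient and sample size of the regression data set above; the function name t_statistic is an assumption.

```python
from math import sqrt

def t_statistic(r, n):
    """t-value for testing whether a Pearson correlation differs from zero."""
    return r * sqrt((n - 2) / (1 - r * r))

r = 0.904976  # correlation coefficient reported for the regression data set
n = 80        # number of data points in that set

t = t_statistic(r, n)
df = n - 2    # degrees of freedom for the t distribution

print(f"t = {t:.3f} with {df} degrees of freedom")
```

The p-value is then the tail probability of this t-value under Student's t distribution with n - 2 degrees of freedom, which is left out here to keep the sketch within the standard library.
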
Monte Carlo Simulation using Permutations of ys
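
A sketch of a permutation test, assuming a two-sided comparison on the absolute value of the coefficient. Shuffling the ys breaks any link with the xs, so the shuffled coefficients show what chance alone produces. The function names, the number of trials, the seed, and the data are all illustrative assumptions.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Estimate the p-value as the share of shuffles at least as extreme as observed."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # random permutation of the ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Hypothetical data: strongly, but not perfectly, correlated.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

p = permutation_p_value(xs, ys)
print(p)
```

More trials give a finer estimate; with 2000 trials the smallest p-value this sketch can report is 1/2001.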