Regression Analysis

Regression Analysis
  • Regression Analysis is a set of procedures for finding and evaluating an equation that predicts a dependent variable from a set of independent variables.
Basic Form of Argument
  • Equation E, with a certain degree of accuracy, predicts the observed values of a dependent variable from the observed values of a set of independent variables.
  • Therefore, with the same degree of accuracy, E predicts the dependent variable from the independent variables generally.
Hubble’s Law and Simple Linear Regression
  • Astronomers in the early 1900s observed distant galaxies moving directly away from us.  In the late 1920s Edwin Hubble realized that the more distant a galaxy was, the faster it was receding.
  • Here’s a graph of 735 faraway galaxies, with distances in megaparsecs.
    • A megaparsec is 3.26 million light-years
  • The further away a galaxy is, the faster it recedes (in kilometers per second).
  • Indeed, the correlation coefficient between distance and velocity is 0.818565, on a scale that runs from –1 to +1.
  • In 1929 Hubble set forth Hubble’s Law
    • v = H0 x D, where
      • D is distance from the Milky Way in megaparsecs
      • v is the recessional velocity in kilometers per second
      • H0 is Hubble’s Constant, likely between 67 and 73 kilometers per second per megaparsec.
  • We’ll use simple linear regression to estimate Hubble’s Law.
  • The data are the distances and velocities of 735 galaxies. For example:
  • Using linear regression, with help from Mathematica, we can draw a straight line that best “fits” the data (see the sketch at the end of this section).
  • The equation for the regression line is:
    • y = 59 x + 284, where
      • y is the dependent variable, velocity
      • x is the independent variable, distance
  • The coefficient of x, 59, is less than Hubble’s Constant but in the ballpark
    • Reasons for the discrepancy:
      • Measurement error
      • Gravitational pushes and pulls on the galaxies by other galaxies.
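  • For concreteness, here is a minimal Mathematica sketch of the kind of fit described above. The 735-galaxy table isn’t reproduced in this document, so the short distance–velocity list below is hypothetical placeholder data, not Hubble’s data:
      (* hypothetical placeholder data: {distance in megaparsecs, velocity in km/s} *)
      galaxies = {{10, 900}, {50, 3200}, {120, 7300}, {200, 12200}, {300, 17800}};
      lm = LinearModelFit[galaxies, d, d];    (* fit velocity as a linear function of distance *)
      Normal[lm]                              (* the best-fit straight line *)
      lm["RSquared"]                          (* proportion of variation accounted for *)
      Correlation[galaxies[[All, 1]], galaxies[[All, 2]]]   (* distance-velocity correlation *)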
How Simple Linear Regression Works
  • Suppose the data consists of pairs of x’s and y’s:
    • data {x, y} = {1, 3}, {2, 4}, {3, 8}, {4, 7}, {5, 10}
  • The x’s and y’s are correlated:
  • The goal is to find a simple equation that predicts the y’s from the x’s.
  • In simple linear regression the equation has the form:
    • y = a x + b.
  • The objective, therefore, is to find coefficients a and b that, given the observed x’s, predict y’s as close to the observed y’s as possible.
  • “As close as possible” needs to be quantified.
    • One idea is the least sum of absolute residuals.
      • A residual is the difference between an observed y and a predicted y.
      • The sum of absolute residuals is the sum of the |observed y – predicted y| for all the observed y’s.
    • Statisticians instead use the least sum of squares of residuals, for mathematical reasons.
      • The residual sum of squares is the sum of the (observed y – predicted y)² for all the observed y’s.
      • In mathematical form:
        • Residual sum of squares = Σ (y – (a x + b))², the sum taken over all the (x, y) pairs
  • The Least Squares Algorithm is a way of calculating the coefficients a and b given the data.
    • Given
      • xs = {1, 2, 3, 4, 5}
      • ys = {3, 4, 8, 7, 10}
      • n = 5
    • a and b are:
        • a = (n Σxy – Σx Σy) / (n Σx² – (Σx)²) = (5 × 113 – 15 × 32) / (5 × 55 – 15²) = 1.7
        • b = mean of ys – a × mean of xs = 6.4 – 1.7 × 3 = 1.3
    • So the regression equation is:
      • y = 1.7 x + 1.3
    • And the least residual sum of squares is 4.3
      • So, if you change a or b, the sum of squares increases. For example, with a = 1.8 (and b still 1.3) the sum of squares rises to 4.85 (see the sketch below).
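  • A minimal Mathematica check of the numbers above, using the standard closed-form least-squares formulas:
      xs = {1, 2, 3, 4, 5};  ys = {3, 4, 8, 7, 10};  n = 5;
      a = (n Total[xs ys] - Total[xs] Total[ys])/(n Total[xs^2] - Total[xs]^2)   (* 17/10, i.e. 1.7 *)
      b = Mean[ys] - a Mean[xs]                                                  (* 13/10, i.e. 1.3 *)
      rss[aa_, bb_] := Total[(ys - (aa xs + bb))^2]   (* residual sum of squares *)
      rss[a, b]      (* 43/10, i.e. 4.3 -- the least value *)
      rss[1.8, b]    (* about 4.85 -- larger, as expected *)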
Evaluating Simple Linear Regression
  • Two things to look at
    • Scatter plot with regression line
    • R-Squared and Adjusted R-Squared
      • R-Squared, non-technically speaking, is the proportion of the variation in the observed dependent variable that is accounted for by the regression equation. It runs from 0 to 1.
      • See the R-Squared (Coefficient of Determination) addendum below.
Two Regressions
  • data {x, y} = {1, 2}, {2, 3.5}, {3, 6}, {4, 8.5}, {5, 10}, {6, 12.5}, {7, 14}, {8,16}, {9, 17.5}, {10, 20}
  • Regression Equation: y = 2 x + 0.03
  • Residuals = {-0.02727, -0.521212, -0.01515, 0.4909, -0.00303, 0.50303, 0.00909, 0.01515, -0.47879, 0.02727}
  • R-Squared = 0.99697
  • Adjusted R-Squared = 0.996591

  • data {x, y} = {4, 2}, {2, 3.5}, {3, 6}, {6, 8.5}, {7, 10}, {5, 12.5}, {9, 14}, {8, 16}, {1, 17.5}, {9, 20}
  • Regression Equation: y = x + 5.63
  • Residuals = {-7.60753, -4.11828, -2.6129, -3.09677, -2.5914, 1.89785, -0.580645, 2.41398, 10.8763, 5.41935}
  • R-Squared = 0.223715
  • Adjusted R-Squared = 0.126679
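  • Both fits can be reproduced with Mathematica’s LinearModelFit; a minimal sketch:
      data1 = {{1, 2}, {2, 3.5}, {3, 6}, {4, 8.5}, {5, 10}, {6, 12.5}, {7, 14}, {8, 16}, {9, 17.5}, {10, 20}};
      data2 = {{4, 2}, {2, 3.5}, {3, 6}, {6, 8.5}, {7, 10}, {5, 12.5}, {9, 14}, {8, 16}, {1, 17.5}, {9, 20}};
      lm1 = LinearModelFit[data1, x, x];
      lm2 = LinearModelFit[data2, x, x];
      {Normal[lm1], lm1["RSquared"], lm1["AdjustedRSquared"]}   (* tight fit: R-Squared near 1 *)
      {Normal[lm2], lm2["RSquared"], lm2["AdjustedRSquared"]}   (* poor fit: R-Squared near 0.22 *)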
More Advanced Regressions
  • Nonlinear Regression
    • The regression line needn’t be straight.
  • Multiple Regression
    • The dependent variable may be a function of more than one independent variable.
    • The chart below represents a linear regression with two independent variables, x and y
      • the spheres are the data points, at locations (x, y, z)
      • the checkerboard is the regression plane for z = 5 x + 5 y – 0.262
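  • A sketch of the two Mathematica calls involved; neither dataset below appears in this document, so both are hypothetical:
      (* a curved fit: the regression "line" needn't be straight *)
      curved = {{1, 1.2}, {2, 3.9}, {3, 9.1}, {4, 15.8}, {5, 25.2}};
      Normal[LinearModelFit[curved, {1, x, x^2}, x]]    (* best-fit quadratic *)
      (* two independent variables: fit a plane z = a + b x + c y *)
      pts3d = {{0, 0, 1.1}, {1, 0, 5.9}, {0, 1, 6.2}, {1, 1, 10.8}, {2, 1, 15.9}};
      Normal[LinearModelFit[pts3d, {x, y}, {x, y}]]     (* best-fit plane *)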
Multiple Linear Regression
Data
  • You take two medications to lower your blood pressure. To determine which is more effective you keep a daily log, with dosages and blood pressure.
  • The dosage for Drug Alpha runs from 0.1 to 1 mg.
  • The dosage for Beta can be anywhere from 1 mg to 10 mg.
  • Data in 3D:
Multiple Regression
  • Multiple regression enables you to compare the functional impact of the two drugs on blood pressure.
  • The regression equation from multiple linear regression on the data:
      • z = 69.52 – 20.47 x – 0.978 y, where
        • z is blood pressure
        • x is the dosage of Drug Alpha
        • y is the dosage of Drug Beta
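  • A sketch of the fit, assuming the daily log is the seven {x, y, z} records listed in the ANOVA section below:
      (* {Drug Alpha dose, Drug Beta dose, blood pressure} *)
      bp = {{0, 0, 70}, {0.5, 0, 60}, {0, 5, 65}, {0.5, 3, 55}, {0.3, 5, 58}, {0.5, 5, 53}, {0.7, 7, 50}};
      mlm = LinearModelFit[bp, {x, y}, {x, y}];
      Normal[mlm]        (* approximately 69.52 - 20.47 x - 0.978 y *)
      mlm["RSquared"]    (* 0.973826 *)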
Evaluating Multiple Linear Regression
R-Squared for the Regression
  • R-Squared = 0.973826
  • Adjusted R-Squared = 0.960739
Scatter Plots and R-Squared for Comparing Independent Variables
  • Multiple linear regression for Drugs Alpha and Beta combined
    • R-Squared = 0.973826
  • Simple linear regression for Drug Alpha alone
    • R-Squared = 0.846676
  • Simple linear regression for Drug Beta alone
    • R-Squared = 0.416667
  • R-Squared for the correlation between the dosages of Drugs Alpha and Beta
    • R-Squared = 0.1133
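  • All four R-Squared values can be reproduced from those same seven records; a sketch:
      bp = {{0, 0, 70}, {0.5, 0, 60}, {0, 5, 65}, {0.5, 3, 55}, {0.3, 5, 58}, {0.5, 5, 53}, {0.7, 7, 50}};
      alphaBP = bp[[All, {1, 3}]];   (* {Alpha dose, blood pressure} *)
      betaBP  = bp[[All, {2, 3}]];   (* {Beta dose, blood pressure} *)
      doses   = bp[[All, {1, 2}]];   (* {Alpha dose, Beta dose} *)
      LinearModelFit[bp, {x, y}, {x, y}]["RSquared"]   (* 0.973826 -- both drugs *)
      LinearModelFit[alphaBP, x, x]["RSquared"]        (* 0.846676 -- Alpha alone *)
      LinearModelFit[betaBP, x, x]["RSquared"]         (* 0.416667 -- Beta alone *)
      LinearModelFit[doses, x, x]["RSquared"]          (* 0.1133 -- Beta dose against Alpha dose *)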
3D Plot
  • In the plot below, as x goes from 0.0 to 1.0 the regression plane slopes downward much more than it does as y goes from 0 to 10, indicating that Drug Alpha has more of an effect than Drug Beta.
Standardized Independent Variables
  • Drug Alpha’s dosage is rescaled by a factor of 10, from a 0–1 mg scale to a 0–10 mg scale comparable to Drug Beta’s.
    • data {x,y,z} = {0, 0, 70}, {5, 0, 60}, {0, 5, 65}, {5, 3, 55}, {3, 5, 58}, {5, 5, 53}, {7, 7, 50}
  • The coefficients of the regression equation are now comparable.
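  • Rescaling Drug Alpha’s dosage by a factor of 10 divides its coefficient by 10 and leaves the fit otherwise unchanged; a quick check:
      scaled = {{0, 0, 70}, {5, 0, 60}, {0, 5, 65}, {5, 3, 55}, {3, 5, 58}, {5, 5, 53}, {7, 7, 50}};
      lmScaled = LinearModelFit[scaled, {x, y}, {x, y}];
      Normal[lmScaled]       (* approximately 69.52 - 2.05 x - 0.978 y: the coefficients are now on comparable scales *)
      lmScaled["RSquared"]   (* 0.973826, unchanged *)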
Addenda
Variance, Standard Deviation, Mean Absolute Deviation
  • These data sets have the same mean, 5:
    • A = {1, 6, 2, 7, 2, 8, 3, 8, 8, 5}
    • B = {4, 4, 4, 6, 4, 7, 3, 6, 5, 7}
  • But the values of A are more spread out; their average distance from the mean is larger.
  • One way of quantifying average distance from the mean is the mean absolute deviation, e.g. for A: (|1 – 5| + |6 – 5| + … + |5 – 5|) / 10 = 24/10 = 2.4 (versus 1.2 for B).
  • For mathematical reasons, statisticians prefer the variance and its square root, the standard deviation.
    • The population variance is the sum of squares of deviations from the mean, divided by the number of data points.
    • The sample variance is like the population variance except the sum of squares is divided by the number of data points minus one.
    • In some circumstances, the variance is calculated by dividing the sum of squares by what are called degrees of freedom.
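  • A quick Mathematica comparison of the two data sets (note that Mathematica’s Variance and StandardDeviation are the sample versions, dividing by n – 1):
      dataA = {1, 6, 2, 7, 2, 8, 3, 8, 8, 5};
      dataB = {4, 4, 4, 6, 4, 7, 3, 6, 5, 7};
      Mean /@ {dataA, dataB}                 (* both 5 *)
      MeanDeviation /@ {dataA, dataB}        (* mean absolute deviations: 2.4 and 1.2 *)
      Variance /@ {dataA, dataB}             (* sample variances: 70/9 and 2 *)
      StandardDeviation /@ {dataA, dataB}    (* their square roots *)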
ANOVA Table for Simple Linear Regression
  • Data {x, y} = {1, 3}, {2, 4}, {3, 8}, {4, 7}, {5, 10}
  • Regression Equation: y = 1.7 x + 1.3
  • ANOVA Table
                DF    SS      MS        F-Statistic   P-Value
          x     1     28.9    28.9      20.1628       0.0206096
          Error 3     4.3     1.43333
          Total 4     33.2
  • ANOVA is all about variance:
    • The Sums of Squares (SS) are numerators of variances.
    • The Degrees of Freedom (DF) are denominators of variances.
    • The Mean Squares (MS) are variances
      • 28.9 / 1 = 28.9
      • 4.3 / 3 = 1.43333
    • The F-Statistic is the ratio of one variance to another
      • 28.9 / 1.43333 = 20.1628.
    • The P-Value is the probability of getting an F-Statistic as high or higher than 20.1628 by chance.
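  • The whole table can be generated in Mathematica, and its entries recomputed by hand; a sketch:
      data = {{1, 3}, {2, 4}, {3, 8}, {4, 7}, {5, 10}};
      lm = LinearModelFit[data, x, x];
      lm["ANOVATable"]                         (* DF, SS, MS, F-Statistic, and P-Value in one table *)
      ssError = Total[lm["FitResiduals"]^2];   (* 4.3 *)
      msError = ssError/3;                     (* 1.43333 *)
      28.9/msError                             (* the F-Statistic, 20.1628 *)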
Sum of Squares
  • Principle of Sum of Squares
    • SS(predicted DV – mean of predicted DV) + SS(observed DV – predicted DV) = SS(observed DV – mean of observed DV)
      • where
        • SS = sum of squares
        • DV = dependent variable
    • Sum of squares for predicted DV + sum of squares for residuals = sum of squares for observed values
    • In the ANOVA table these are the x, Error, and Total sums of squares, respectively.
  • The sum of squares of predicted DV = the sum of squares of (predicted dependent variable – mean of predicted dependent variable)
  • The sum of squares of residuals is the sum of squares of (observed dependent variable – predicted dependent variable).
  • That is, the residual sum of squares.
  • The sum of squares of observed DVs is the sum of squares of (observed dependent variable – mean of the observed dependent variable).
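  • A quick check of the principle for the toy data above, using the predictions of y = 1.7 x + 1.3:
      xs = {1, 2, 3, 4, 5};  ys = {3, 4, 8, 7, 10};
      predicted = 1.7 xs + 1.3;
      ssModel = Total[(predicted - Mean[predicted])^2]   (* 28.9 *)
      ssError = Total[(ys - predicted)^2]                (* 4.3 *)
      ssTotal = Total[(ys - Mean[ys])^2]                 (* 33.2 *)
      ssModel + ssError == ssTotal                       (* True, up to machine rounding *)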
Mean Squares and Degrees of Freedom
  • The mean squares are the sums of squares divided by their degrees of freedom, making them variances.
    • 28.9 / 1 = 28.9
    • 4.3 / 3 = 1.43333
  • The Total Degrees of Freedom = 4, the number of data points minus one.
    • The total is divided up between x and Error, x getting 1 and Error the rest.
F-Statistic and its P-Value
  • An F-Statistic is the ratio of one variance to another.
    • MS x / MS Error = 28.9 / 1.43333 = 20.1628
  • The P-Value for a given F-Statistic is the probability of getting an F-Statistic as high or higher by chance.
    • Probability[x ≥ 20.1628, FRatioDistribution[1, 3]] = 0.0206096
      • (1 and 3 are the degrees of freedom for 28.9 and 1.43333)
  • The F-Distribution begins high and approaches zero as the F-Statistic increases.
  • The area under the curve to the right of the 20.1628 line = 0.0206096.
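  • In actual Wolfram Language syntax, that probability is the upper tail of the F-distribution with 1 and 3 degrees of freedom:
      1 - CDF[FRatioDistribution[1, 3], 20.1628]                               (* 0.0206096 *)
      NProbability[f >= 20.1628, f \[Distributed] FRatioDistribution[1, 3]]    (* equivalently *)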
ANOVA Tables for Multiple Linear Regression
  • data {x,y,z} = {0, 0, 70}, {0.5, 0, 60}, {0, 5, 65}, {0.5, 3, 55}, {0.3, 5, 58}, {0.5, 5, 53}, {0.7, 7, 50}
  • The ANOVA table for the regression equation:
      • where
        • x = Drug Alpha
        • y = Drug Beta
  • The ANOVA table is different when the order of x and y is reversed.
    • data {x,y,z} = {0, 0, 70}, {0, 0.5, 60}, {5, 0, 65}, {3, 0.5, 55}, {5, 0.3, 58}, {5, 0.5, 53}, {7, 0.7, 50}
  • The regression equation becomes:
      • z = 69.52 – 0.978 x – 20.47 y, where
        • x = Drug Beta
        • y = Drug Alpha
    • The x and y lines in the ANOVA Table are different
  • ANOVA calculates the first variable, x, differently from the second.
    • The sum of squares for x is the model sum of squares from the simple linear regression of the dependent variable on x alone:
  • ANOVA calculates the sum of squares for y conditionally, given its calculation for x. This is called a partial or adjusted sum of squares.
  • Here’s an ANOVA table in which both sums of squares are partial, each calculated given the calculation of the other. Its x and y lines match the y lines of the first two multiple regressions.
  • Regression Equation:
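  • A sketch of how the tables could be generated, and of how a partial (adjusted) sum of squares can be computed by comparing the full model with the model that omits the term (the helper ssErr below is illustrative, not part of the original):
      bp     = {{0, 0, 70}, {0.5, 0, 60}, {0, 5, 65}, {0.5, 3, 55}, {0.3, 5, 58}, {0.5, 5, 53}, {0.7, 7, 50}};
      bpSwap = bp[[All, {2, 1, 3}]];   (* the same data with the x and y columns reversed *)
      (* LinearModelFit's ANOVA table uses sequential sums of squares, so term order matters *)
      LinearModelFit[bp, {x, y}, {x, y}]["ANOVATable"]
      LinearModelFit[bpSwap, {x, y}, {x, y}]["ANOVATable"]
      (* partial SS for a term = error SS of the model without the term - error SS of the full model *)
      ssErr[d_, terms_] := Total[LinearModelFit[d, terms, {x, y}]["FitResiduals"]^2];
      ssErr[bp, {y}] - ssErr[bp, {x, y}]    (* partial SS for x, Drug Alpha: about 162.4 *)
      ssErr[bp, {x}] - ssErr[bp, {x, y}]    (* partial SS for y, Drug Beta: about 37.06 *)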
Least Squares Derivation of Simple Linear Regression Equation
  • Suppose the data consists of pairs of x’s and y’s:
    • data {x, y} = {1, 3}, {2, 4}, {3, 8}, {4, 7}, {5, 10}
    • So:
      • xs = {1, 2, 3, 4, 5}
      • ys = {3, 4, 8, 7, 10}
      • n = 5
  • The objective is to find an a and b that make y = a x + b yield the lowest residual sum of squares.
  • Let’s vary a and keep b constant, arbitrarily at b = 0:
  • As a increases, the residual sum of squares decreases, hits a low around a = 2, and increases.
  • At the low point, the rate of change of the residual sum with respect to a = 0.
  • In the language of calculus:
    • ∂/∂a Σ (y – (a x + b))² = 0
  • Now let’s vary b and keep a constant, arbitrarily at a = 1
  • As b increases, the residual sum of squares decreases, hits a low around b = 3.5, and increases.
  • At the low point, the rate of change of the residual sum with respect to b = 0.
  • That is:
    • ∂/∂b Σ (y – (a x + b))² = 0
  • The solutions of the equations are:
    • a = (Σ xy – b Σ x) / Σ x²
    • b = (Σ y – a Σ x) / n
  • Taken together (solved simultaneously) they yield:
    • a = 1.7
    • b = 1.3
  • The regression equation is therefore:
    • y = 1.7 x + 1.3
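  • The same derivation can be done symbolically in Mathematica; a minimal sketch:
      Clear[a, b];
      xs = {1, 2, 3, 4, 5};  ys = {3, 4, 8, 7, 10};
      rss = Total[(ys - (a xs + b))^2];                  (* residual sum of squares as a function of a and b *)
      Solve[{D[rss, a] == 0, D[rss, b] == 0}, {a, b}]    (* {{a -> 17/10, b -> 13/10}} *)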
Parameter Table Calculations
  • Data {x, y} = {1, 3}, {2, 4}, {3, 8}, {4, 7}, {5, 10}
  • Regression Equation: y = 1.7 x + 1.3
  • Parameter Table
                Estimate   Standard Error   t-Statistic   P-Value
          1     1.3        1.25565          1.03532       0.376655
          x     1.7        0.378594         4.4903        0.0206096
  • From the ANOVA Table
    • The Error Sum of Squares = 4.3
      • This is the residual sum of squares
        • Total[lm[“FitResiduals”]^2] = 4.3
    • The Mean Square Error = 1.43333
      • This is the variance of the residuals, dividing by n-2 rather than n-1.
        • Total[lm[“FitResiduals”]^2]/(n – 2) = 1.43333
  • The square root of the Mean Square Error = 1.19722
    • Sqrt[1.43333] = 1.19722
    • Sqrt[Total[lm[“FitResiduals”]^2]/(n – 2)] = 1.19722
    • This is the standard deviation of residuals
  • The standard error of the mean =  0.535413
    • The standard error of the mean = σ/√n
      • (standard deviation of residuals) /Sqrt[n] = 0.535413
  • The standard error of x, 1.7, = 0.378594
    • 0.535413 (1 /Sqrt[Total[(xs – Mean[xs])^2]/n] ) = 0.378594
  • The standard error of the intercept, 1.3, = 1.25565
    • 0.535413  Sqrt[1 + (Mean[xs ]^2/ (Total[(xs – Mean[xs])^2]/n) )] = 1.25565
  • The t-statistics are the estimates (minus 0) divided by the standard errors
    • (1.3 – 0) / 1.25565 = 1.03532
    • (1.7 – 0) / 0.378594 = 4.4903
  • The p-values for the t-statistics are the two-tailed probabilities of getting t-statistics as extreme or more extreme than 1.03532 and 4.4903.
    • Probability[x ≥ 1.03532 || x ≤ -1.03532, StudentTDistribution[3]] = 0.376655
    • Probability[x ≥ 4.4903 || x ≤ -4.4903, StudentTDistribution[3]] = 0.0206096
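  • The lm in the property calls above is the fitted model returned by Mathematica’s LinearModelFit; a sketch reproducing the quantities in this section:
      data = {{1, 3}, {2, 4}, {3, 8}, {4, 7}, {5, 10}};
      n = Length[data];
      lm = LinearModelFit[data, x, x];
      lm["ParameterTable"]                               (* estimates, standard errors, t-statistics, p-values *)
      N[Sqrt[Total[lm["FitResiduals"]^2]/(n - 2)]]       (* about 1.19722, the standard deviation of the residuals *)
      lm["ParameterErrors"]                              (* about {1.25565, 0.378594} *)
      2 (1 - CDF[StudentTDistribution[n - 2], 4.4903])   (* two-tailed p-value, about 0.0206 *)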
R-Squared (Coefficient of Determination)
  • R-Squared
    • R-Squared, non-technically speaking, is the proportion of the variation in the observed dependent variable that is accounted for by the regression equation. It runs from 0 to 1.
    • SS(predicted DV – mean of predicted DV) / SS(observed DV – mean of observed DV)
      • Where
        • SS = sum of squares
        • DV = dependent variable
    • R-Squared is the  sum of squares of (predicted dependent variable – mean of predicted dependent variable) / the sum of squares of (observed dependent variable – mean of observed dependent variable).
  • Adjusted R-Squared
    • Adjusted R-Squared adjusts R-Squared for sample size and the number of independent variables.
    • Adjusted R-Squared = 1 – ( ((n – 1)/(n – k – 1)) (1 – R-squared) )
      • Where
        • n = the sample size
        • k = the number of independent variables.
    • people.duke.edu/~rnau/rsquared.htm
  • R-Squared and Adjusted R-Squared can be calculated from the ANOVA Table
    • R-Squared = (SS x + SS y) / SS Total
      • (246.746 + 37.0552) / 291.429 = 0.973826
    • Adjusted R-Squared = 1 – ( ( DF Total / (DF Total – DF x – DF y)) (1 – R-Squared)  )
      • 1 – ( (6/(6 – 1 – 1)) ( 1 – 0.973826)  ) = 0.960739
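  • A quick check of the two formulas with the numbers from the multiple-regression ANOVA table:
      rsq = (246.746 + 37.0552)/291.429      (* 0.973826, R-Squared from the sums of squares *)
      n = 7; k = 2;                          (* 7 data points, 2 independent variables *)
      1 - ((n - 1)/(n - k - 1)) (1 - rsq)    (* 0.960739, the Adjusted R-Squared *)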
Regression to the Mean
  • Regression to the Mean
    • Suppose that the values of two variables are paired so that the value of the second is near the value of the first, its precise value a matter of chance.
      • Examples:
        • The heights of a father and son.
        • A student’s scores on a test and retest.
        • Initial and subsequent measurements of untreated high blood pressure
    • Then
      • For values of the first variable way above average, the value of the second is likely to be lower than the first.
      • For values of the first variable way below average, the value of the second is likely to be higher than the first.
  • Regression Fallacy
    • Thinking that an instance of regression to the mean is due to something other than chance, e.g. medication.
Example: Test and Retest
  • Students take a test.
  • Test scores are a function of what students know and luck.
  • Scores based solely on knowledge are normally distributed with a mean of 100 and a standard deviation of 15
    • See graph below
  • Luck is either +5 or -5, with equal probability.
  • For example, a student with a knowledge score of 120 will get either 115 or 125 on the test.
  • Suppose a student does well on the test, getting a 140.
  • There are two possibilities:
    • Student’s knowledge score = 135 and luck = +5
    • Student’s knowledge score = 145 and luck = -5
  • The first possibility is more likely, since a knowledge score of 135 is more likely than one of 145 given their normal distribution with mean 100.
    • Probability (knowledge score = 135 and luck = +5) = 0.00175 * 0.5 = 0.00087
    • Probability (knowledge score = 145 and luck = -5) = 0.000296 * 0.5 = 0.00015
  • Hence a student who does well on the test is more likely to have a knowledge score of 135, meaning on subsequent tests they’ll get 130 or 140 with equal probability.
  • The example is based on one in Statistics, 4th ed, by Freedman, Pisani, and Purves, page 173.
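  • The two likelihoods come from the normal curve for knowledge scores; in Mathematica:
      knowledge = NormalDistribution[100, 15];
      PDF[knowledge, 135.]        (* about 0.00175 -- relative likelihood of a knowledge score of 135 *)
      PDF[knowledge, 145.]        (* about 0.000296 *)
      0.5 PDF[knowledge, 135.]    (* about 0.00087 -- knowledge 135 and luck +5 *)
      0.5 PDF[knowledge, 145.]    (* about 0.00015 -- knowledge 145 and luck -5 *)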
Example: Father and Son
  • Assumptions
    • The average male height is 70 inches (5’10”) with a standard deviation of 4 inches.
    • A son is likely to be approximately his father’s height.
  • In a Monte Carlo simulation the fathers’ heights are randomly distributed per the normal distribution with mean = 70 inches (5’10”) and standard deviation 4 inches.
  • The sons’ heights are randomly distributed around their fathers’ heights, given the appropriate segment of the normal distribution above.
  • For example, the probability distribution for the height of a son whose father is 78 inches tall (6’6”):
  • The result of the simulation is that a father who is 6.5 feet or over is taller than his son in 84.9 percent of 10,000 iterations.
  • Mathematica code for the simulation:
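    • (The original listing isn’t reproduced here. Below is a minimal sketch under the assumptions above; the rule for the son’s height, the line shown in red in the original, is modeled as the population distribution restricted to a ±4-inch window around the father’s height. That window is an assumption of the sketch, so the exact percentage may differ.)
      iterations = 10000; tallerCount = 0;
      Do[
        (* draw fathers from Normal(70, 4) until one is at least 78 in. (6.5 ft) tall *)
        father = RandomVariate[NormalDistribution[70, 4]];
        While[father < 78, father = RandomVariate[NormalDistribution[70, 4]]];
        (* son's height: the population distribution restricted to a window around the father's height *)
        son = RandomVariate[TruncatedDistribution[{father - 4, father + 4}, NormalDistribution[70, 4]]];
        If[father > son, tallerCount++],
        {iterations}];
      N[tallerCount/iterations]   (* fraction of iterations in which the tall father is taller than his son *)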
  • The line in red randomly determines the son’s height per the graph above. The reason for regression to the mean is the downward slope of the graph.
  • The regression fallacy is mistakenly thinking that the son’s height is determined by a probability distribution like this:
  • If we replace the red line in the code with son = RandomVariate[NormalDistribution[father, 4]], thus using the wrong probability distribution, the result of the simulation is that a father who is 6.5 feet or over is taller than his son in 48.7 percent of 10,000 iterations.

History
  • britannica.com/topic/regression-to-the-mean
    • “An early example of RTM may be found in the work of Sir Francis Galton on heritability of height. He observed that tall parents tended to have somewhat shorter children than would be expected given their parents’ extreme height. Seeking an empirical answer, Galton measured the height of 930 adult children and their parents and calculated the average height of the parents. He noted that when the average height of the parents was greater than the mean of the population, the children were shorter than their parents. Likewise, when the average height of the parents was shorter than the population mean, the children were taller than their parents. Galton called this phenomenon regression toward mediocrity; it is now called RTM. This is a statistical, not a genetic, phenomenon.”