Descriptive Statistics

Data, Variables, and Values

  • Descriptive Statistics is concerned with summarizing and presenting data.
  • Data are recorded facts, such as census records, economic statistics, seismographic logs and the results of studies, clinical trials, surveys and experiments.
  • A dataset is an organized collection of data, typically arranged as a table with rows (records) and columns (variables).
  • For example
  • The variables of the dataset are: Name, Population (in millions), Area (in thousands of square miles) and GDP (in billions of dollars). The variable Name identifies the entity (country) to which the other variables pertain.
  • The values of the variable Population, for example, are 1,423, 1,438, 145, and 343, and the values of Area are 3,601, 1,148, 6,323, and 3,537.
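  • As a rough sketch, a dataset like this can be held in Python as a mapping from variable names to columns of values. The Population and Area values are the ones quoted above; the entity names are hypothetical placeholders, and GDP is omitted because its values are not quoted.

```python
# Columns keyed by variable name. Population is in millions and Area is in
# thousands of square miles, as in the text. The names are placeholders, and
# GDP is omitted because its values are not listed above.
dataset = {
    "Name": ["Country 1", "Country 2", "Country 3", "Country 4"],
    "Population": [1423, 1438, 145, 343],
    "Area": [3601, 1148, 6323, 3537],
}

# The variables are the column names.
variables = list(dataset)

# Each row (record) pairs every variable with one value.
records = [dict(zip(variables, row)) for row in zip(*dataset.values())]

# The values of a single variable are one column.
population_values = dataset["Population"]

print(variables)
print(records[0])
```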

Categorical, Ordinal, Interval, and Ratio Variables

  • Which mathematical relations and operations are meaningful depends on the type of variable, ranging from identity (x = y) and order (x > y) to addition and division.

Categorical (Qualitative, Nominal) Variables

  • The values of categorical variables are labels with no meaningful order. For example:
    • Eye color: blue, brown, green
    • Blood type: A, B, AB, O
    • Car brand: Toyota, Ford, Honda
  • Categorical values may be numeric, as with zip codes. In that case x = y and x ≠ y are meaningful, but x > y and x < y are not.
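  • A minimal sketch of what can be done with categorical values: test them for equality and count them. The blood-type observations and zip codes below are made-up illustrations. Storing zip codes as strings is one way to signal that arithmetic and ordering are not intended.

```python
from collections import Counter

# Hypothetical observations of a categorical variable.
blood_types = ["A", "O", "B", "O", "AB", "A", "O"]

# Counting relies only on equality of labels.
counts = Counter(blood_types)
most_common_type, frequency = counts.most_common(1)[0]

# Numeric-looking labels: equality is meaningful, order is not.
zip_a, zip_b = "02139", "94305"
same_zip = zip_a == zip_b

print(counts)
print(most_common_type, frequency, same_zip)
```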

Ordinal Variables

  • The order of the values of ordinal variables, and only the order, is meaningful. That is,  x > y and x < y are meaningful in addition to x = y and x ≠ y.  But x – y and x + y are meaningless.
  • For example, the values of Education Level might be the numbers 1 to 6, assigned as follows:
    • 1 = No high school diploma
    • 2 = High school diploma or GED
    • 3 = Some college credits earned but no degree
    • 4 = Associate degree
    • 5 = Bachelor’s degree
    • 6 = Master’s, doctorate, or professional degree
  • Thus, if Amy has education level x and Mike has education level y, then x > y means that Amy has had more formal education than Mike.
  • But adding education levels makes no sense. The mathematical fact that 1 + 2 + 3 = 6 is meaningless when applied to education levels. The problem is that defining education levels does not thereby define “educational units” such that, for example, Level 1 = 5 units, Level 2 = 10 units, and so on. Without units, addition and subtraction make no sense.
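  • The ordinal operations described above can be sketched in Python. The level codes come from the list of education levels; the people and their levels are hypothetical. Only comparisons, sorting, and a middle value are used, since each depends on order alone.

```python
from statistics import median_low

# Hypothetical people with education levels coded 1 to 6 as listed above.
people = {"Amy": 5, "Mike": 2, "Ravi": 4}

# Order is meaningful: a higher code means more formal education.
amy_has_more = people["Amy"] > people["Mike"]

# Ranking depends only on order.
ranking = sorted(people, key=people.get)

# The middle level also depends only on order. median_low returns an actual
# data value rather than averaging two levels.
middle_level = median_low(list(people.values()))

# By contrast, people["Amy"] + people["Mike"] would evaluate to 7, a number
# that corresponds to no education level and has no interpretation.
print(amy_has_more, ranking, middle_level)
```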

Interval Variables

  • The values of interval variables have equal intervals as well as meaningful order. That is, differences such as x – y are meaningful in addition to x > y and x < y. But x / y is not meaningful, because the zero point of an interval scale is arbitrary.
  • Consider year of birth. Suppose Amy was born in year x and Mike in year y. Then:
    • If x > y, Amy is younger than Mike.
    • If x – y = 2, Amy is two years younger than Mike.
    • But x / y = 2 does not mean that Amy is twice as old as Mike.
  • Other examples of interval variables are temperature in Fahrenheit or Celsius, IQ scores, and dates.
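  • A short sketch of why ratios fail for interval variables, using temperature. The three readings are made-up values. Converting Celsius to Fahrenheit changes the ratio of two readings, while the ratio of two differences survives the conversion.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32

# Hypothetical readings in Celsius.
low_c, mid_c, high_c = 10.0, 20.0, 40.0
low_f, mid_f, high_f = (celsius_to_fahrenheit(t) for t in (low_c, mid_c, high_c))

# Ratios of values depend on the scale, so "twice as hot" is not meaningful.
ratio_c = mid_c / low_c   # 20 / 10
ratio_f = mid_f / low_f   # 68 / 50

# Ratios of differences are the same on both scales.
diff_ratio_c = (high_c - mid_c) / (mid_c - low_c)
diff_ratio_f = (high_f - mid_f) / (mid_f - low_f)

print(ratio_c, ratio_f)
print(diff_ratio_c, diff_ratio_f)
```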

Ratio Variables

  • The values of ratio variables, like those of interval variables, have equal intervals and meaningful order. They also have meaningful ratios and a “true zero,” denoting a complete absence of the quantity being measured.
  • Consider a person’s age in years.  Suppose Amy is x years of age and Mike is y years of age. Then:
    • If x > y, Amy is older than Mike.
    • If x – y = 2, Amy is two years older than Mike.
    • If x / y = 2, Amy is twice as old as Mike. (Meaningful Ratio)
    • If x = 1, Amy is one year old.
    • If x = 1/365, Amy is one day old. (Meaningful Ratio)
    • If x = 0, Amy has no age. (True Zero)
  • Other examples are: weight, height, income, distance, duration of time, and temperature in Kelvin.
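  • By contrast, ratios of a ratio variable do not depend on the unit. The sketch below uses hypothetical ages and the simplification of 365 days per year implied by the one-day-old example above.

```python
DAYS_PER_YEAR = 365  # simplification: leap years are ignored

# Hypothetical ages in years.
amy_years, mike_years = 30, 15

# Changing the unit rescales every value but leaves zero at zero.
amy_days = amy_years * DAYS_PER_YEAR
mike_days = mike_years * DAYS_PER_YEAR

# The ratio is the same in either unit, so "twice as old" is meaningful.
ratio_in_years = amy_years / mike_years
ratio_in_days = amy_days / mike_days

print(ratio_in_years, ratio_in_days)
```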

Statistics of a Single Variable

Dataset

  • Contents: Life expectancies (at birth) of 233 countries.
  • Source: Wolfram Country Data
  • Sample Entries
    • 80.626, 82.479, 72.15, 69.887, 68.002, 78.5, 64.045, 82.784, 73.082, 74.8
  • Variable Type: Ratio

Mean and Standard Deviation

  • Mean = 73.91.
    • That is, the average life expectancy is 73.91 years.
  • Standard Deviation = 7.28.
    • The standard deviation of 7.28 is a measure of the spread of life expectancies around the mean of 73.91. Here’s what a standard deviation of 7.28 looks like, compared to standard deviations of 3, 5, and 10 (assuming a normal distribution):
    • The standard deviation is something like the average distance of data values from the mean, but not exactly. The exact average distance from the mean is the Mean Absolute Deviation (MAD), which is 5.88 for the dataset of life expectancies. MAD is easily calculated and understood. For example, the mean of the numbers {2, 4, 6} is 4, so their MAD is ((4 – 2) + (4 – 4) + (6 – 4)) / 3 = 4/3. Statisticians nevertheless use the standard deviation rather than MAD for theoretical reasons (in particular the Central Limit Theorem).
    • Mathematically, the standard deviation is the square root of the variance. The variance in turn is the sum of the squared deviations of data values from the mean, divided by the number of values minus 1. So for the numbers {2, 4, 6}:
      • the sum of squared deviations = (4 – 2)² + (4 – 4)² + (6 – 4)² = 8
      • the variance = 8 / (3 – 1) = 4
      • the standard deviation = √4 = 2.
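    • The arithmetic for {2, 4, 6} can be checked with a short Python sketch. The step-by-step values follow the definitions above; the last lines compare them with the standard library's statistics.variance and statistics.stdev, which also divide by the number of values minus 1.

```python
from math import sqrt
from statistics import mean, stdev, variance

values = [2, 4, 6]
n = len(values)
m = mean(values)  # 4

# Mean Absolute Deviation: average distance from the mean.
mad = sum(abs(v - m) for v in values) / n  # 4/3

# Variance: sum of squared deviations divided by (n - 1).
sum_of_squares = sum((v - m) ** 2 for v in values)  # 8
var = sum_of_squares / (n - 1)                      # 4.0

# Standard deviation: square root of the variance.
sd = sqrt(var)  # 2.0

# Cross-check against the library functions.
library_var = variance(values)
library_sd = stdev(values)

print(m, mad, sum_of_squares, var, sd)
```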

Order Statistics

  • Order statistics, calculated on a sorted dataset, are dividing lines between lower and higher data values. The chief order statistics for the life expectancies dataset are:
    • Minimum = 53.68
    • Quartiles = {69.77, 75.22, 79.55}
      • Median = 75.22
    • Maximum = 89.4
  • A box-and-whisker chart summarizes the statistics:
  • Key:
    • Minimum (54) is the end of the lower whisker
    • First Quartile (70) is the bottom edge of the box
    • Median (75) is the white line through the box
    • Third Quartile (80) is the top edge of the box
    • Maximum (89) is the end of the upper whisker
  • The idea underlying order statistics is that of a percentile (or quantile).
    • The nth percentile of a dataset is the data value P such that n percent of the data items have values less than P and the rest have values greater than P.
  • Thus:
    • The median of 75.22 (the 50th percentile) is the dividing line between 116 lower life expectancies (about 50% of the total 233) and 117 higher expectancies (about 50% of the total).
    • The third quartile of 79.55 (the 75th percentile) is the dividing line between 175 lower life expectancies (about 75%) and 58 higher expectancies (about 25%).
    • The maximum of 89.4 (the 100th percentile) is the dividing line between 233 lower life expectancies (100%) and 0 higher expectancies (0%).
  • Note: There are different ways of drawing the dividing lines. For example, I use an algorithm that yields 79.55 for the 75th percentile. Other algorithms yield 79.5248, 79.526 and 79.558.
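  • To illustrate how the choice of algorithm moves the dividing lines, here is a sketch using Python's statistics.quantiles, which offers two interpolation methods named "exclusive" and "inclusive". These are not necessarily the algorithms referred to in the note above. The data are only the ten sample entries listed earlier, not the full 233-country dataset, so the results differ from the figures quoted in this section.

```python
from statistics import quantiles

# The ten sample entries from the Dataset section.
sample = [80.626, 82.479, 72.15, 69.887, 68.002,
          78.5, 64.045, 82.784, 73.082, 74.8]

lowest, highest = min(sample), max(sample)

# n=4 asks for the three cut points that split the data into quarters.
q_exclusive = quantiles(sample, n=4, method="exclusive")
q_inclusive = quantiles(sample, n=4, method="inclusive")

print("min/max:  ", lowest, highest)
print("exclusive:", q_exclusive)  # third quartile is about 81.09
print("inclusive:", q_inclusive)  # third quartile is about 80.09
```

  • On this sample the two methods agree on the median but disagree on the first and third quartiles. With larger datasets the differences tend to shrink.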

Frequency Distributions and Histograms

  • The best tools for getting a feel for the data of a single variable are the frequency distribution and its graphical representation, the histogram.
  • Percentiles and quartiles, on the one hand, and frequency distributions and histograms, on the other, are in a sense inverse operations.
  • Both separate the data into bins.
  • But percentiles and quartiles first fix the size of each bin (the number of data items it holds) and then calculate the data values that divide the bins.
  • Frequency distributions and histograms, on the other hand, first fix intervals of data values and then count the data items in each bin so defined.
  • Compare, for example:
  • Quartiles define equal-size bins and yield different intervals of life expectancy.
  • Histograms define equal intervals of life expectancy and yield different-size bins.
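  • The contrast can be sketched in Python on the ten sample entries from the Dataset section (not the full dataset). The 5-year interval width is an arbitrary choice for illustration, and the quartile cut points come from statistics.quantiles with its default method.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

sample = [80.626, 82.479, 72.15, 69.887, 68.002,
          78.5, 64.045, 82.784, 73.082, 74.8]

# Histogram: fix equal intervals first, then count the items in each.
WIDTH = 5  # years per interval, chosen for illustration
histogram = Counter(int(x // WIDTH) * WIDTH for x in sample)

# Quartiles: fix (roughly) equal counts first, then find the cut points.
cuts = quantiles(sample, n=4)
bin_index = Counter(bisect_right(cuts, x) for x in sample)
quartile_counts = [bin_index[i] for i in range(4)]

for start in sorted(histogram):
    print(f"{start} to {start + WIDTH}: {histogram[start]}")
print("quartile cut points:", cuts)
print("items per quartile bin:", quartile_counts)
```

  • The histogram bins are all 5 years wide but hold 1, 2, 3, 1, and 3 items. The quartile bins hold 2, 3, 3, and 2 items but span intervals of different widths.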