  The Central Limit Theorem – How to Tame Wild Populations

People come in a variety of shapes and sizes. Get a few million people together in one place, say in Rhode Island or South Carolina, and it would be impossible to predict what a single person selected from either state would be like. Try to compare all Rhode Islanders to all South Carolinians and the task gets even more complex. Obviously, something is needed to simplify the process, and that’s why we have statistics.

The first step is to decide on a measurement, such as weight. Yes, this ignores that Mary Jane in SC has freckles and John in RI has a tattoo, but we have to focus on something or we can't make a comparison. Unfortunately, even after focusing on a single measurement, we still have two different populations of data points consisting of millions of wildly different numbers. These populations include everything from two-pound preemies to 400+ pound bubbas.

Again, we must simplify, and so we'll focus on a parameter that can characterize the weights of all individuals in a population. A parameter is a number that summarizes a specific characteristic, calculated from measurements of every member of a population. Using a parameter, it's possible to represent a property of an entire population with a single number instead of millions of individual data points.

There are many possible parameters to choose from, such as the median, mode, or interquartile range. Each is calculated in a different manner and illuminates the data from a different point of view. We'll use the mean, because it's one of the most useful and widely used. In spite of its harsh-sounding name, it's very helpful for understanding populations.

The mean, or average, summarizes something called central tendency. This is a fancy way of saying what’s typical or expected in a population. If all the data points were plotted on a line segment the most typical values would usually be found somewhere near the center of the line segment, hence, the term central tendency.

Of course, central tendency isn't the only issue. There's also the issue of variability or spread. We like to call this wildness although, admittedly, wildness is not a standard term. Still, wildness seems appropriate, since the characteristics of individuals drawn from populations with lots of variability or spread tend to be wildly unpredictable.

While the mean gives us a single number to describe a complex population, it is, unfortunately, a parameter. We have to use every single data point in a population to calculate a parameter. With millions of data points to collect, this could be a real problem. By the time we got all the measurements, the two-pound preemies might have turned into 400+ pound adults.

The solution is to use a randomly chosen sample of the population and calculate a statistic. Statistics are always based on samples. We would carefully collect some data points chosen at random and calculate a sample mean. We’ll call this statistic x-bar. Clearly, x-bar is not a parameter since it’s not calculated from the entire population. It’s only an estimate of the parameter.
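The distinction between the parameter and the statistic can be sketched in a few lines of Python. (This is only an illustration: the "population" here is 16,000 simulated measurements between 0 and 10, not real weights.)

```python
import random

random.seed(42)  # for reproducibility

# Hypothetical population: 16,000 simulated measurements between 0 and 10
population = [random.uniform(0, 10) for _ in range(16_000)]

# The parameter: the true population mean, which needs every data point
mu = sum(population) / len(population)

# The statistic: x-bar, computed from a random sample of just 100 points
sample = random.sample(population, 100)
x_bar = sum(sample) / len(sample)

# x-bar is only an estimate of mu, but usually a close one
print(mu, x_bar)
```

Notice that x-bar required 100 data points rather than 16,000, yet lands close to the true mean.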

If we selected another sample and calculated a second x-bar, we might find that it differs considerably from the first. By their nature, random samples can sometimes give unexpected results even when flawlessly collected. For example, the first could be made up mostly of preemies while the second could be made up of sumo wrestlers. Such extremes are unlikely to begin with, and selecting large samples makes them all but impossible. After all, there are only so many preemies and sumo wrestlers available. A really large sample could never be made up entirely of either.

Ultimately, if one sample mean or x-bar is wildly different from another we would not be any better off than trying to look at the entire population. Obviously, we need to understand just how wild or variable sample means or x-bars are likely to be.

Central Limit Theorem Applet

The attached applet simulates a population by generating 16,000 floating-point random numbers between 0 and 10. Each time the "New Population" button is pressed, it generates a new set of random numbers. The plot labeled Population Distribution shows a histogram of the 16,000 data points.

The applet uses two different pseudo-random number generators (PRNGs). The "Uniform Distr" option uses Java's standard PRNG, in which every value has an equal probability. The "Normal Distr" option uses a PRNG from Java in which the probability of generating a particular value is determined by a normal distribution. The skewed distribution uses the same PRNG with the left side truncated. The bimodal distribution is simply two skewed distributions that are mirror images of each other, scaled appropriately. Both generators are referred to as pseudo-random number generators since they produce numbers from equations that repeat over a long enough period; however, the numbers they generate are very similar to ideal random numbers.

Below the population histogram is a histogram representing the sampling distribution of x-bar. Each time the resample button is pressed, a new set of samples is obtained from the population and the sampling distribution histogram is re-plotted. The slider called "Sample Size" helps illustrate the central limit theorem: when the sample size is increased, the sampling distribution becomes narrower, as predicted by equation (1).

As mentioned earlier, the central limit theorem is arguably the second most important principle in statistics; the law of large numbers is arguably even more important. It says, essentially, that probability and statistics can only predict overall results for a large number of data points or trials. For example, toss a coin thousands of times and it will come up heads in almost exactly 50% of the tosses. Toss it three times and it can never come up heads 50% of the time.
Each sampling distribution data point is an x-bar calculated from an individual sample. The law of large numbers says it will take lots of these x-bars to make the sampling distribution look like it’s normally distributed. The "Number of Samples" slider can be used to illustrate this point. Reduce the number of samples to the minimum and the sampling distribution starts to look like a sprinkling of random points even when the sample size is large. Increase it to the maximum and the normal distribution shape appears. When playing with the applet remember to think "central limit theorem" when changing the sample size and to think "law of large numbers" when changing the number of samples.
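The coin-toss claim above is easy to check with a short Python sketch (the exact fractions depend on the random seed):

```python
import random

random.seed(0)

def heads_fraction(tosses):
    """Fraction of heads in a run of fair coin tosses (1 = heads, 0 = tails)."""
    return sum(random.randint(0, 1) for _ in range(tosses)) / tosses

# Three tosses can only yield 0, 1/3, 2/3, or 1 -- never exactly 50%.
few = heads_fraction(3)

# Thousands of tosses land very close to 50%, as the law of large numbers predicts.
many = heads_fraction(10_000)

print(few, many)
```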

To understand the wildness of samples, we would choose thousands of samples, calculate an x-bar for each, and display the x-bars in a histogram. This histogram represents a sampling distribution, and when we look at it we see something truly amazing. Sampling distributions tend to be far less variable or wild than the populations they are drawn from (see Figs. 1A-1D). They also have essentially the same mean as the population.
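This experiment can be reproduced in a few lines of Python. (A sketch using a simulated uniform population; the numbers are illustrative, not the applet's exact output.)

```python
import random
import statistics

random.seed(1)
population = [random.uniform(0, 10) for _ in range(16_000)]

# Build a sampling distribution: 4,000 x-bars, each from a sample of size 10
x_bars = [statistics.mean(random.sample(population, 10)) for _ in range(4000)]

pop_sd = statistics.pstdev(population)   # roughly 2.89 for uniform(0, 10)
dist_sd = statistics.stdev(x_bars)       # far smaller: the x-bars are less "wild"

# The two means are essentially identical; the spreads are not.
print(statistics.mean(population), statistics.mean(x_bars))
print(pop_sd, dist_sd)
```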

Sampling distributions drawn from a uniformly distributed population start to look like normal distributions even with a sample size as small as 2 (see Fig. 1B). If the sample size is large enough, they form nearly perfect normal distributions (see Fig. 1C). Make the sample size larger and the variability of the sampling distribution drops even more. The sampling distribution starts looking spike-like, because a normal distribution with a very small standard deviation is extremely narrow (see Fig. 1D).

This may not seem Earth-shattering, but it's really quite profound. Anytime we know data follows a normal distribution, we immediately have a lot more confidence that we can predict how the data will behave.

The situation is similar to hiring Mary Jane, who has a master’s degree in computer science versus Jim Bob who says he can compute. Jim Bob may turn out to be better at creating software. However, by knowing that Mary Jane has a master’s degree we feel a lot more confident in our ability to predict her programming capability.

It would be highly annoying if we had to generate an entire sampling distribution every time we wanted to be sure that our statistic based on a sample really is less wild than the data points in the population. Fortunately, we know this ahead of time, thanks to the (arguably) second most profound principle in all of statistics, the central limit theorem (the law of large numbers being the first).

The central limit theorem tells us that a sampling distribution always has significantly less wildness or variability, as measured by standard deviation, than the population it's drawn from. Additionally, the sampling distribution will look more and more like a normal distribution as the sample size is increased, even when the population itself is not normally distributed!

Although the central limit theorem works regardless of the population's distribution, really strange distributions require sample sizes well above two before the effect appears. For example, a bimodal distribution (see Fig. 2A) will not look normally distributed with a sample size of only two. The sampling distribution in this case will have a third high, narrow peak in the center with a lower, wider peak on either side (see Fig. 2B). Increasing the sample size by one adds another peak (see Fig. 2C). Eventually, with a large enough sample size, there are so many peaks that they run together and the sampling distribution starts to look like a typical normal distribution.
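A rough way to watch this happen in code is sketched below. (The bimodal population here is crudely built from two well-separated uniform clusters, not the applet's mirrored skewed distributions, but the effect is the same.)

```python
import random
import statistics

random.seed(2)

# Crude bimodal population: two well-separated uniform clusters
population = ([random.uniform(0, 2) for _ in range(8_000)] +
              [random.uniform(8, 10) for _ in range(8_000)])

def xbar_sd(n, trials=2000):
    """Standard deviation of the sampling distribution for sample size n."""
    return statistics.stdev(
        statistics.mean(random.sample(population, n)) for _ in range(trials))

# Even from this decidedly non-normal population, the sampling distribution
# narrows as n grows, just as the central limit theorem predicts.
sd2 = xbar_sd(2)
sd30 = xbar_sd(30)
print(sd2, sd30)
```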

The applet included in this article illustrates this nicely. It contains a highly skewed as well as a bimodal population. All of the populations in it are created using various random number generators that more closely simulate what real populations are like than simply generating populations from smooth density curve equations.

If we use standard deviation as a measurement of wildness, the standard deviation of a sampling distribution (sometimes called the standard error) can be predicted from the population standard deviation as follows:

ss = sp / √n        equation (1)

where:
  ss = standard deviation of the sampling distribution (the standard error)
  sp = standard deviation of the population
  n  = sample size
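Equation (1) is easy to verify empirically. (A sketch: for a uniform population on 0 to 10, sp is about 2.89, so with n = 80 the predicted standard error is about 0.32, in line with Fig. 1D.)

```python
import random
import statistics

random.seed(3)
population = [random.uniform(0, 10) for _ in range(16_000)]
sp = statistics.pstdev(population)   # population standard deviation

n = 80
predicted = sp / n ** 0.5            # equation (1)

# Measure the standard error directly from 4,000 x-bars
x_bars = [statistics.mean(random.sample(population, n)) for _ in range(4000)]
observed = statistics.stdev(x_bars)

print(predicted, observed)           # the two agree closely
```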

When we look at equation (1), what's missing is as profound as what's there: it has no term for the size of the population. This means that the reduction in wildness depends only on the sample size. In other words, a statistic based on a sample size of, say, 2000 will be just as meaningful if it's drawn from a population of 20,000,000,000 as it will be if it's drawn from a population of 20,000. Population size does not matter as long as it's at least 10 times larger than the sample.
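This claim can be checked directly. (A sketch; population sizes are kept modest here so the simulation runs quickly, but the same sample size yields essentially the same standard error from both.)

```python
import random
import statistics

random.seed(4)

def standard_error(pop_size, n=100, trials=2000):
    """Empirical standard error of x-bar for a given population size."""
    pop = [random.uniform(0, 10) for _ in range(pop_size)]
    x_bars = [statistics.mean(random.sample(pop, n)) for _ in range(trials)]
    return statistics.stdev(x_bars)

# Same sample size, populations differing 10x in size:
# both standard errors hover near 2.89 / sqrt(100), about 0.29.
se_small = standard_error(20_000)
se_large = standard_error(200_000)
print(se_small, se_large)
```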

If the central limit theorem didn't exist, it would not be possible to use statistics. We would be unable to reliably estimate a parameter like the mean by using an average derived from a much smaller sample. This would all but shut down research in the social sciences and the evaluation of new drugs, since these depend on statistics. It would invalidate the use of polls and completely alter the nature of marketing research, not to mention politics.

Thanks to the central limit theorem, we can be sure that a mean or x-bar based on a reasonably large randomly chosen sample will be remarkably close to the true mean of the population. If we need more certainty we need only increase the sample size. What’s more, it does not matter if we are characterizing a city, state, or the entire United States, we can use the same sample size. It will give the same level of certainty regardless of the population size.

For Further Information

Sampling Distribution Applet: This applet can also be very helpful for understanding sampling distributions. However, be aware that in it the sampling distribution's vertical scale changes with the number of samples, and the sample size only goes up to 25. This hides the dramatic narrowing effect on the sampling distribution caused by increasing sample size. The applet provided in the above article plots population and sampling distributions on exactly the same scales and allows sample sizes of 100, all of which make the narrowing effect highly visible. Both applets are correct although they look different due to differences in scales.

Fig. 1A) Histogram of Population - Uniform Distribution: all values in the population are randomly determined and equally likely, approximating a uniform distribution. Data points in population = 16,000; mean = 4.996; std dev = 2.882
Fig. 1B) Sampling Distribution (from a uniform population), n = 2: number of samples = 4000; mean = 5.01; std dev = 2.048; std error = 2.037
Fig. 1C) Sampling Distribution (from a uniform population), n = 10: number of samples = 2010; mean = 5.015; std dev = 0.906; std error = 0.911
Fig. 1D) Sampling Distribution (from a uniform population), n = 80: number of samples = 4000; mean = 4.989; std dev = 0.322; std error = 0.322
Fig. 2A) Histogram of Population - Bimodal Distribution: population = 16,000; mean = 5.002; std dev = 4.242
Fig. 2B) Sampling Distribution (from a bimodal population), n = 2: number of samples = 4000; mean = 4.977; std dev = 3.017; std error = 2.999
Fig. 2C) Sampling Distribution (from a bimodal population), n = 3: number of samples = 4000; mean = 4.946; std dev = 2.425; std error = 2.449
Fig. 2D) Sampling Distribution (from a bimodal population), n = 30: number of samples = 4000; mean = 5.032; std dev = 0.722; std error = 0.722