7.3 - Comparing Two Population Means

In this section, we construct confidence intervals and develop hypothesis tests for the difference between two population means, following the same approach we used for the difference between two proportions.

There are a few extra steps we need to take, however. First, we need to consider whether the two populations are independent. When considering the sample mean, there were two parameters we had to consider, \(\mu\) the population mean, and \(\sigma\) the population standard deviation. Therefore, the second step is to determine if we are in a situation where the population standard deviations are the same or if they are different.

Independent and Dependent Samples

It is important to be able to distinguish between independent samples and dependent samples.

The samples from two populations are independent if the samples selected from one of the populations have no relationship with the samples selected from the other population.

The samples are dependent (also called paired data) if each measurement in one sample is matched or paired with a particular measurement in the other sample. Another way to think about this is to ask how many measurements are taken from each subject: if only one, the samples are independent; if two, they are paired. Exceptions arise in familial situations, such as studies of spouses or twins. In such cases, the data are almost always treated as paired data.

The following are examples to illustrate the two types of samples.

Example 7-3: Gas Mileage

We want to compare the gas mileage of two brands of gasoline. Describe how to design a study involving independent samples and a study involving dependent samples.

Answer (independent samples): Randomly assign 12 cars to use Brand A and another 12 cars to use Brand B.

Answer (dependent samples): Using 12 cars, have each car use Brand A and Brand B. Compare the differences in mileage for each car.
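To make the two designs concrete, here is a minimal Python sketch of how the data would be laid out in each case. The mileage figures are hypothetical, invented only for illustration, and only three cars per group are used (rather than the 12 in the example) to keep it short.

```python
from statistics import mean

# Hypothetical mileage figures (mpg), invented purely for illustration.

# Independent design: three cars use Brand A, three DIFFERENT cars use Brand B.
# The two lists have no car-by-car correspondence.
brand_a_cars = [28.1, 30.4, 25.9]
brand_b_cars = [27.5, 29.0, 26.2]
independent_estimate = mean(brand_a_cars) - mean(brand_b_cars)

# Paired design: the SAME three cars are driven on both brands.
# Each tuple is (Brand A mileage, Brand B mileage) for one car.
paired = [(28.1, 27.5), (30.4, 29.0), (25.9, 26.2)]
differences = [a - b for a, b in paired]
paired_estimate = mean(differences)

print(independent_estimate, paired_estimate)
```

With these particular numbers the two point estimates coincide, because the mean of the differences equals the difference of the means. What changes between the designs is how the variability of the estimate is assessed: the paired analysis works with the single list of per-car differences.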

  1. We want to compare whether people give a higher taste rating to Coke or Pepsi. To avoid a possible psychological effect, the subjects should taste the drinks blind (i.e., they don't know the identity of the drink). Describe how to design a study involving independent samples and dependent samples.
    1. Design involving independent samples
       Answer: Randomly assign half of the subjects to taste Coke and the other half to taste Pepsi.
    2. Design involving dependent samples
       Answer: Allow all the subjects to rate both Coke and Pepsi. The drinks should be given in random order. The same subject's ratings of the Coke and the Pepsi form a paired data set.
  2. We randomly select 20 males and 20 females and compare the average time they spend watching TV. Is this an independent sample or paired sample?
     Answer: Independent sample
  3. We randomly select 20 couples and compare the time the husbands and wives spend watching TV. Is this an independent sample or paired sample?
     Answer: Paired sample

    The two types of samples require a different theory to construct a confidence interval and develop a hypothesis test. We consider each case separately, beginning with independent samples.

    7.3.1 - Inference for Independent Means

    Two Cases for Independent Means

    As with comparing two population proportions, when we compare two population means from independent populations, the interest is in the difference of the two means. In other words, if \(\mu_1\) is the population mean from population 1 and \(\mu_2\) is the population mean from population 2, then the difference is \(\mu_1-\mu_2\). If \(\mu_1-\mu_2=0\) then there is no difference between the two population parameters.

    If each population is normal, then the sampling distribution of \(\bar{x}_i\) is normal with mean \(\mu_i\), standard error \(\dfrac{\sigma_i}{\sqrt{n_i}}\), and estimated standard error \(\dfrac{s_i}{\sqrt{n_i}}\), for \(i=1, 2\).

    Using the Central Limit Theorem, if the population is not normal, then with a large sample, the sampling distribution is approximately normal.

    The theorem presented in this Lesson says that if either of the above is true, then \(\bar{x}_1-\bar{x}_2\) is approximately normal with mean \(\mu_1-\mu_2\) and standard error \(\sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}\).
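As a small illustration, the standard error of the difference of two independent sample means, the square root of \(\sigma_1^2/n_1+\sigma_2^2/n_2\), can be computed as below. This is only a sketch: the function name `se_diff_means` and the input values are hypothetical, and it assumes the population standard deviations are known.

```python
import math

def se_diff_means(sigma1, n1, sigma2, n2):
    """Standard error of xbar1 - xbar2 for two independent samples.

    Assumes the population standard deviations sigma1 and sigma2 are known.
    """
    return math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Hypothetical inputs: sigma1 = 3 with n1 = 9, sigma2 = 4 with n2 = 16.
# Each variance term equals 1, so the result is sqrt(2).
print(se_diff_means(3, 9, 4, 16))
```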

    However, in most cases, \(\sigma_1\) and \(\sigma_2\) are unknown, and they have to be estimated. It seems natural to estimate \(\sigma_1\) by \(s_1\) and \(\sigma_2\) by \(s_2\). When the sample sizes are small, the estimates may not be that accurate and one may get a better estimate for the common standard deviation by pooling the data from both populations if the standard deviations for the two populations are not that different.

    Given this, there are two options for estimating the variances for the independent samples:

    1. Using pooled variances
    2. Using unpooled (or unequal) variances

    When to use which? When we are reasonably sure that the two populations have nearly equal variances, then we use the pooled variances test. Otherwise, we use the unpooled (or separate) variance test.

    7.3.1.1 - Pooled Variances

    Confidence Intervals for \(\boldsymbol{\mu_1-\mu_2}\): Pooled Variances

    When we have good reason to believe that the variance for population 1 is equal to that of population 2, we can estimate the common variance by pooling information from samples from population 1 and population 2.

    An informal check for this is to compare the ratio of the two sample standard deviations. If the two are equal, the ratio would be 1, i.e. \(\frac{s_1}{s_2}=1\). However, since these are samples and therefore involve error, we cannot expect the ratio to be exactly 1. When the sample sizes are nearly equal (admittedly "nearly equal" is somewhat ambiguous, so often if sample sizes are small one requires they be equal), then a good Rule of Thumb to use is to see if the ratio falls between 0.5 and 2. That is, neither sample standard deviation is more than twice the other.

    If this rule of thumb is satisfied, we can assume the variances are equal. Later in this lesson, we will examine a more formal test for equality of variances.
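The rule of thumb can be expressed as a one-line check. The sketch below is illustrative only: the function name `ratio_rule_of_thumb` and the sample values are hypothetical, and the check treats the endpoints 0.5 and 2 as acceptable, which is an assumption about how to read the rule.

```python
def ratio_rule_of_thumb(s1, s2, low=0.5, high=2.0):
    """Return True if the ratio s1/s2 of sample standard deviations lies in [low, high].

    Informal check only; it does not replace a formal test of equal variances.
    """
    if s1 <= 0 or s2 <= 0:
        raise ValueError("sample standard deviations must be positive")
    return low <= s1 / s2 <= high

# Hypothetical sample standard deviations.
print(ratio_rule_of_thumb(4.1, 3.2))   # ratio is about 1.28, inside the range
print(ratio_rule_of_thumb(10.0, 4.0))  # ratio is 2.5, outside the range
```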

    Then the common standard deviation can be estimated by the pooled standard deviation:

    \[s_p=\sqrt{\dfrac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}\]
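A minimal Python sketch of the usual pooled standard deviation, the square root of the average of the two sample variances weighted by \(n_1-1\) and \(n_2-1\), is shown below. The function name `pooled_sd` and the inputs are hypothetical.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation from two sample SDs and their sample sizes."""
    pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_variance)

# Hypothetical inputs: s1 = 3, s2 = 4, with five observations in each sample.
# The pooled variance is (4*9 + 4*16) / 8 = 12.5.
print(pooled_sd(3, 5, 4, 5))
```

When the two sample standard deviations are equal, the pooled value equals that common value regardless of the sample sizes, which is a quick sanity check on the formula.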

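    If we can assume the populations are independent, that each population is normal or has a large sample size, and that the population variances are the same, then it can be shown that

    \[t=\dfrac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}}\]

    follows a t-distribution with \(n_1+n_2-2\) degrees of freedom, where \(s_p\) is the pooled standard deviation.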
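The pooled t statistic can be sketched in Python as follows. The function name `pooled_t_statistic`, its parameter names, and the summary statistics are all hypothetical. The `hypothesized_diff` argument stands for the value of \(\mu_1-\mu_2\) being considered, which is 0 when testing for no difference.

```python
import math

def pooled_t_statistic(xbar1, s1, n1, xbar2, s2, n2, hypothesized_diff=0.0):
    """Return (t, df) for the pooled-variance comparison of two means."""
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the sample variances.
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    # Estimated standard error of xbar1 - xbar2 under equal variances.
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    t = (xbar1 - xbar2 - hypothesized_diff) / se
    return t, df

# Hypothetical summary statistics: means 10 and 8, both SDs 2, n = 8 per group.
# Here sp = 2 and se = 2 * sqrt(1/4) = 1, so t = 2 with 14 degrees of freedom.
print(pooled_t_statistic(10, 2, 8, 8, 2, 8))
```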

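    Now, we can construct a \((1-\alpha)100\%\) confidence interval for the difference of two means, \(\mu_1-\mu_2\):

    \[\bar{x}_1-\bar{x}_2\pm t_{\alpha/2}\,s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}\]

    where \(s_p\) is the pooled standard deviation and \(t_{\alpha/2}\) comes from a t-distribution with \(n_1+n_2-2\) degrees of freedom.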
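Putting the pieces together, a pooled-variance confidence interval can be sketched as below. The function name `pooled_ci` and the summary statistics are hypothetical. To keep the sketch free of third-party dependencies, the critical value `t_crit` is passed in by the caller, for example read from a t table (with SciPy installed, `scipy.stats.t.ppf(0.975, df)` would give the 95% value).

```python
import math

def pooled_ci(xbar1, s1, n1, xbar2, s2, n2, t_crit):
    """Confidence interval for mu1 - mu2 using pooled variances.

    t_crit is the t critical value for n1 + n2 - 2 degrees of freedom,
    supplied by the caller (for example, from a t table).
    """
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    margin = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = xbar1 - xbar2
    return diff - margin, diff + margin

# Hypothetical summary statistics: means 10 and 8, both SDs 2, n = 8 per group.
# With 14 degrees of freedom, the tabled 95% critical value is about 2.145.
lower, upper = pooled_ci(10, 2, 8, 8, 2, 8, t_crit=2.145)
print(lower, upper)
```

In this hypothetical case the interval runs from slightly below 0 to slightly above 4, so it contains 0 and the data would not rule out equal population means at the 95% level.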