In this section, we construct the confidence interval and develop the hypothesis test for the difference in two means much as we did for the difference in two proportions.
There are a few extra steps we need to take, however. First, we need to consider whether the two populations are independent. When considering the sample mean, there were two parameters we had to consider, \(\mu\) the population mean, and \(\sigma\) the population standard deviation. Therefore, the second step is to determine if we are in a situation where the population standard deviations are the same or if they are different.
It is important to be able to distinguish between independent samples and dependent samples.
The samples from two populations are independent if the sample selected from one population has no relationship with the sample selected from the other population.
The samples are dependent (also called paired data) if each measurement in one sample is matched or paired with a particular measurement in the other sample. Another way to consider this is how many measurements are taken from each subject. If only one measurement is taken, the samples are independent; if two measurements are taken, the data are paired. Exceptions are familial situations, such as a study of spouses or twins. In such cases, the data is almost always treated as paired data.
The following are examples to illustrate the two types of samples.
We want to compare the gas mileage of two brands of gasoline. Describe how to design a study involving independent samples, and a study involving paired samples.
Answer (independent samples): Randomly assign 12 cars to use Brand A and another 12 cars to use Brand B.
Answer (paired samples): Using 12 cars, have each car use Brand A and Brand B. Compare the differences in mileage for each car.
The two types of samples require a different theory to construct a confidence interval and develop a hypothesis test. We consider each case separately, beginning with independent samples.
As with comparing two population proportions, when we compare two population means from independent populations, the interest is in the difference of the two means. In other words, if \(\mu_1\) is the population mean from population 1 and \(\mu_2\) is the population mean from population 2, then the difference is \(\mu_1-\mu_2\). If \(\mu_1-\mu_2=0\) then there is no difference between the two population parameters.
If each population is normal, then the sampling distribution of \(\bar{x}_i\) is normal with mean \(\mu_i\), standard error \(\dfrac{\sigma_i}{\sqrt{n_i}}\), and estimated standard error \(\dfrac{s_i}{\sqrt{n_i}}\), for \(i=1, 2\).
Using the Central Limit Theorem, if the population is not normal, then with a large sample, the sampling distribution is approximately normal.
The theorem presented in this Lesson says that if either of the above is true, then \(\bar{x}_1-\bar{x}_2\) is approximately normal with mean \(\mu_1-\mu_2\) and standard error \(\sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}\).
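As a quick numeric illustration, here is a minimal sketch (with hypothetical values for the population standard deviations and sample sizes) of the standard error of the difference, \(\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2}\):

```python
import math

# Hypothetical population standard deviations and sample sizes
sigma1, n1 = 4.0, 25
sigma2, n2 = 3.0, 36

# Standard error of the difference of sample means:
# sqrt(sigma1^2/n1 + sigma2^2/n2)
se_diff = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
print(round(se_diff, 4))
```

Note that the standard error of the difference is larger than either individual standard error, since the variances add.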
However, in most cases, \(\sigma_1\) and \(\sigma_2\) are unknown and must be estimated. It seems natural to estimate \(\sigma_1\) by \(s_1\) and \(\sigma_2\) by \(s_2\). When the sample sizes are small, these estimates may not be very accurate, and one may get a better estimate for the common standard deviation by pooling the data from both populations, provided the standard deviations for the two populations are not very different.
Given this, there are two options for estimating the variances for the independent samples: the pooled variance estimate and the unpooled (separate) variance estimates.
When to use which? When we are reasonably sure that the two populations have nearly equal variances, then we use the pooled variances test. Otherwise, we use the unpooled (or separate) variance test.
When we have good reason to believe that the variance for population 1 is equal to that of population 2, we can estimate the common variance by pooling information from samples from population 1 and population 2.
An informal check for this is to compare the ratio of the two sample standard deviations. If the two were equal, the ratio would be 1, i.e. \(\frac{s_1}{s_2}=1\). However, since these are samples and therefore involve error, we cannot expect the ratio to be exactly 1. When the sample sizes are nearly equal (admittedly "nearly equal" is somewhat ambiguous, so often if sample sizes are small one requires they be equal), a good Rule of Thumb is to check whether the ratio falls between 0.5 and 2. That is, neither sample standard deviation is more than twice the other.
If this rule of thumb is satisfied, we can assume the variances are equal. Later in this lesson, we will examine a more formal test for equality of variances.
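The rule of thumb above can be sketched as a small helper function; the sample standard deviations passed in below are hypothetical values for illustration:

```python
def variances_look_equal(s1, s2):
    """Rule of thumb: the ratio s1/s2 falls between 0.5 and 2,
    i.e. neither sample SD is more than twice the other."""
    ratio = s1 / s2
    return 0.5 <= ratio <= 2

print(variances_look_equal(3.2, 2.1))  # ratio ~1.52, within [0.5, 2]
print(variances_look_equal(5.0, 1.8))  # ratio ~2.78, outside [0.5, 2]
```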
Then the common standard deviation can be estimated by the pooled standard deviation:
\(s_p=\sqrt{\dfrac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}\)
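As a quick numeric check, here is a minimal sketch of the pooled standard deviation \(s_p=\sqrt{((n_1-1)s_1^2+(n_2-1)s_2^2)/(n_1+n_2-2)}\), using hypothetical sample values:

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation:
    sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))"""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Sanity check: two samples with the same SD pool to that same SD
print(pooled_sd(2.0, 10, 2.0, 15))
```

The pooled estimate is a weighted average of the two sample variances, with weights proportional to each sample's degrees of freedom, so the larger sample has more influence.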
If we can assume the populations are independent, that each population is normal or has a large sample size, and that the population variances are equal, then it can be shown that
\(t=\dfrac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}}\)
follows a t-distribution with \(n_1+n_2-2\) degrees of freedom.
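A minimal sketch of computing the pooled two-sample t statistic under \(H_0:\mu_1-\mu_2=0\); the summary statistics below are hypothetical:

```python
import math

# Hypothetical summary statistics from two independent samples
xbar1, s1, n1 = 10.3, 2.0, 15
xbar2, s2, n2 = 9.1, 2.2, 12

# Pooled standard deviation
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Test statistic under H0: mu1 - mu2 = 0
t = (xbar1 - xbar2) / (sp * math.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(round(t, 3), df)
```

The resulting t value would then be compared against a t-distribution with \(n_1+n_2-2\) degrees of freedom to find the p-value.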
Now, we can construct a confidence interval for the difference of two means, \(\mu_1-\mu_2\):
\(\bar{x}_1-\bar{x}_2\pm t_{\alpha/2}\, s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}\)
where \(t_{\alpha/2}\) comes from a t-distribution with \(n_1+n_2-2\) degrees of freedom.
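A minimal sketch of the pooled-variance confidence interval \(\bar{x}_1-\bar{x}_2\pm t_{\alpha/2}\,s_p\sqrt{1/n_1+1/n_2}\), again with hypothetical summary statistics; the critical value is taken from a t-table rather than computed:

```python
import math

# Hypothetical summary statistics (df = 15 + 12 - 2 = 25)
xbar1, s1, n1 = 10.3, 2.0, 15
xbar2, s2, n2 = 9.1, 2.2, 12

sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

t_crit = 2.060  # t_{0.025} with 25 df (95% confidence), from a t-table
margin = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
lo, hi = (xbar1 - xbar2) - margin, (xbar1 - xbar2) + margin
print(round(lo, 3), round(hi, 3))
```

Since this interval contains 0, at the 5% significance level we would not reject the hypothesis that the two population means are equal.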