Wednesday, November 8, 2017

Assignment 4- Hypothesis Testing



Background:

One of the basic concepts in science is testing a hypothesis. A hypothesis is a proposed explanation for a phenomenon based on observations.  Researchers often want to test a hypothesis in order to make inferences about a population from a sample.  This is called hypothesis testing, which is the focus of this exercise.  Hypothesis testing uses two hypotheses: the null hypothesis and the alternative hypothesis.  The null hypothesis states that there is no difference between the sample mean and the hypothesized mean; it is either rejected or the researcher fails to reject it.  The alternative hypothesis states that there is a difference between the sample mean and the hypothesized mean.  There are two main tests for these hypotheses: z tests and t tests.  A z test compares a sample mean to a hypothesized mean using the normal distribution and is used when the sample size is larger than 30.  A t test makes the same comparison using the t distribution, whose shape depends on the degrees of freedom, and is used when the sample size is less than 30.  Degrees of freedom are the sample size minus one; they let the t distribution account for the extra uncertainty that comes with a smaller sample size.  The sample mean is the average of the sample data, and the hypothesized mean is the population average that the sample mean is compared to.  Both tests only tell the reader whether there is a significant difference between the sample mean and the hypothesized mean.  This is important because, if there is a difference, further analyses can be done to explain why the result is different, what causes the difference, or what the implications of the difference are.
In this lab there are two parts.  In part one question one, a basic table is filled out to show how significance levels, z or t tests, and z or t values are determined.  A significance level is the confidence level subtracted from 100 and divided by 100.  This must be divided by 2 if the test is two tailed (both sides of the distribution curve are tested) or kept as is if a one tailed test is used.  The start of the area defined by the significance level is the critical value, also called the z or t value.  The confidence interval is the range of values the true value likely falls within.  For question two, a hypothesis test is conducted on three different crop yield means using data from a Department of Agriculture and Live Stock Development organization in Kenya and survey results of farmers in Kenya to determine if yields in a certain district approach the country averages for yields.  Question three uses a hypothesis test to look at levels of a particular stream pollutant.
For part two, two shapefiles, block groups for the City of Eau Claire and block groups for all of Eau Claire County, are used to decide if the average value of homes for the City of Eau Claire block groups is significantly different from the average value of homes for the Eau Claire County block groups.

Methods:

For part one question one, the significance level, z or t test determination, and z or t value were recorded based on the information given.  The significance level is found by subtracting the confidence level from 100 and then dividing by 100.  This is the answer for a one tailed test.  For a two tailed test, the resulting number is then divided by two because the test is being conducted on both ends of the distribution.  To determine if a z or t test should be used, the n value was taken into account.  If n is greater than 30, a z test is used.  If n is less than 30, a t test is used.  Finally, the z and t values, or critical values, were found by looking up each significance level in a z or t table of critical values.
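The steps above can be sketched in Python. This is only an illustration of the rules as described, not the procedure used to fill in the table: the function names are made up for this example, the standard library's `NormalDist` stands in for a printed z table, and treating a sample of exactly 30 as a z test is an assumption, since the rule above only covers sizes above or below 30.

```python
from statistics import NormalDist

def significance_per_tail(confidence_pct, tails):
    """Area in each rejection tail for a confidence level given in percent."""
    alpha = (100 - confidence_pct) / 100      # total significance level
    return alpha / 2 if tails == 2 else alpha

def choose_test(n):
    """Pick 'z' or 't' from the sample size (n == 30 -> 'z' is an assumption)."""
    return "z" if n >= 30 else "t"

def z_critical(confidence_pct, tails):
    """Positive critical z value, in place of a printed z table lookup."""
    tail_area = significance_per_tail(confidence_pct, tails)
    return NormalDist().inv_cdf(1 - tail_area)

print(significance_per_tail(95, 2))     # 0.025
print(choose_test(23))                  # t
print(round(z_critical(95, 2), 2))      # 1.96
```

Critical t values also depend on the degrees of freedom, so they still have to come from a t table or a statistics package.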
For part one question two, the null and alternative hypotheses for the yields of ground nuts, cassava, and beans were first stated to frame the question.  For each crop, the null hypothesis states that there is no difference between the sample yield of that crop and the country average yield of that crop, and the alternative hypothesis states that there is a difference between the two.  Then, for all three crops, a two tailed t test with a 95% confidence level (a significance level of 0.05, or 0.025 in each tail) was used to test the hypotheses because the sample size of 23 is less than 30.  The results for all three crops are given below in figure 1.
Figure 1. Part 1 question 2: hypothesis tests of ground nuts, cassava, and beans' yields in a certain district compared to the country of Kenya.

For part one question three, the null hypothesis states that there is no difference between the allowable limit of 4.4 mg/l of a stream pollutant and the sample mean pollutant level of 6.8 mg/l.  The alternative hypothesis states that the sample mean pollutant level of 6.8 mg/l is higher than the allowable limit of 4.4 mg/l.  The sample size is only 17, so a one tailed t test was used with a 95% confidence level and a significance level of 0.05.  The methods and results of the test are shown below in figure 2.
Figure 2. Part 1 question 3 and part 2: hypothesis test of stream pollutant levels compared to the allowable limit of the pollutant, and average home value comparison in the City of Eau Claire and Eau Claire County.

For part two, the average value of homes in the City of Eau Claire, the average value of homes in Eau Claire County, the standard deviation of the average value of homes in the City of Eau Claire, and the number of block groups in the City of Eau Claire were all obtained from the shapefiles in ArcMap.  Then, all these values were used in a two tailed z test with a 95% confidence level and a significance level of 0.05 (0.025 in each tail).  The results are shown in figure 2 above.

Results:

Figure 3. Part 1 question 1: significance levels, z or t determinations, and z or t values for given interval types and confidence levels.

Figure 3 shows the results from part 1 question 1 in a table format. The last column demonstrates that the z or t value varies depending on the significance level that is given for either test.
Figure 1 (methods section) shows the t tests for the three crops grown in a certain district of Kenya, including the critical values for each hypothesis test, which came out to -2.07 and +2.07.  The t value for ground nuts was -0.64 and the probability of this score was 26.4%.  For ground nuts, we fail to reject the null hypothesis because the t value of -0.64 does not fall below the critical value of -2.07 and the probability is larger than the 2.5% cutoff for the lower tail.  This means that there is no significant difference between the sample yield of ground nuts and the country average of ground nuts yield.  The t value for cassava was -2.59 and the probability of this score was 0.84%.  For cassava, we reject the null hypothesis because the t value of -2.59 falls below the critical value of -2.07 and the probability of 0.84% is lower than the 2.5% cutoff.  This means there is a significant difference between the sample yield of cassava and the country average of cassava yield; because the t value is negative, the sample mean for cassava is lower than the country average.  The t value for beans was 1.84 and the probability of this score was 96.03%.  For beans, we also fail to reject the null hypothesis because the t value of 1.84 does not exceed the critical value of 2.07 and the probability of 96.03% is below the 97.5% cutoff for the upper tail.  This means that there is no significant difference between the sample yield of beans and the country average beans yield.  Out of the three crops, only cassava failed to approach the country average for yield.  Ground nuts and beans both showed no significant difference between the sample mean and country mean yields, so they are consistent with the estimation of the Department of Agriculture and Live Stock Development organization in Kenya that yields in this certain district should approach the country averages.

Figure 2 (methods section) shows the results of the t test to determine if pollutant levels in a stream are significantly higher than the allowable limit.  The critical value for the calculation is 1.75, the t value is 2.36, and the probability value for this t value is 98.6%.  With a t value of 2.36 and a probability of 98.6%, we reject the null hypothesis because the t value is larger than the critical value and the probability is above the 95% cutoff for a one tailed test at the 0.05 significance level.  Thus the sample mean pollutant level of 6.8 mg/l is significantly higher than the allowable limit of 4.4 mg/l, so the researcher can advocate for measures to be taken to reduce the level of the pollutant.
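The t test decisions reported above can be cross-checked with a short Python sketch. It is not how the figures were produced. It takes the reported t values and degrees of freedom (sample size minus one: 22 for the crops, 16 for the pollutant), approximates the cumulative probability of the t distribution by numerical integration, and applies the two tailed or upper tailed rejection rule. The function names are made up for this example.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), integrating the density from 0 to |t| with Simpson's rule."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # P(0 <= T <= |t|)
    return 0.5 + area if t >= 0 else 0.5 - area

def decide(t, df, alpha, tails):
    """Return (cumulative probability, reject?) for a reported t value."""
    p = t_cdf(t, df)
    if tails == 2:
        reject = p < alpha / 2 or p > 1 - alpha / 2
    else:                              # upper tailed, as in the pollutant test
        reject = p > 1 - alpha
    return p, reject

# (label, reported t value, degrees of freedom, number of tails)
reported = [
    ("ground nuts", -0.64, 22, 2),
    ("cassava", -2.59, 22, 2),
    ("beans", 1.84, 22, 2),
    ("stream pollutant", 2.36, 16, 1),
]
for name, t, df, tails in reported:
    p, reject = decide(t, df, alpha=0.05, tails=tails)
    print(f"{name}: cumulative probability {p:.3f}, reject null: {reject}")
```

Because the inputs are t values rounded to two decimals, the printed probabilities may differ slightly from those in the figures, but the reject or fail to reject decisions should agree with the ones described above.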
Figure 4. Map of the average value of homes in Eau Claire County and the City of Eau Claire block groups.

Figure 4 shows a map of the average value of homes for the City of Eau Claire and Eau Claire County block groups.  Based on the results of the z test in figure 2, the z value of -2.57 is smaller than the critical value of -1.96 and thus we reject the null hypothesis.  Therefore, there is a significant difference between the average value of homes in the City of Eau Claire and the average value of homes in Eau Claire County.  Based on the means for the city and the county, the homes in the City of Eau Claire have a lower average value than the homes in all of Eau Claire County.  The map supports this analysis because some block groups inside the City of Eau Claire are a lighter purple than the block groups outside of the city, which denotes a lower home value.  There are also more dark purple block groups outside of the City of Eau Claire than inside the city.  The z test provides quantitative support for the visual trend seen in the map.
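The z test decision can be checked the same way with the standard library's normal distribution. The sketch below only uses the reported z value of -2.57 and the 0.05 significance level. The function name is made up for this example, and the two tailed p-value it returns is an extra quantity that is not reported in figure 2.

```python
from statistics import NormalDist

def z_two_tailed(z, alpha=0.05):
    """Return (critical value, two tailed p-value, reject?) for a z value."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    p_value = 2 * NormalDist().cdf(-abs(z))          # area in both tails beyond |z|
    return critical, p_value, abs(z) > critical

critical, p_value, reject = z_two_tailed(-2.57)
print(f"critical = +/-{critical:.2f}, p = {p_value:.4f}, reject null: {reject}")
```

Comparing |z| to the critical value and comparing the p-value to the significance level are two views of the same rule, so they always give the same decision.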

Discussion:

Z and t tests are a simple way to determine if a sample mean differs from the population mean.  Z tests are suited to large samples (greater than 30) and t tests are best used with samples of less than 30.  What makes t tests work well with small samples is their dependence on degrees of freedom: with fewer degrees of freedom the critical value is larger, so stronger evidence is needed to reject the null hypothesis.  This accounts for some of the uncertainty that comes with small sample sizes, such as the influence of outliers.  Although small samples remain less reliable than large ones, t tests are still a great tool to use when the sample size is small.  There are limitations to both of these tests.  Z and t tests merely determine if a sample mean differs from the population mean.  Further analyses are needed to infer more from the data, such as why the sample mean differs, where it differs, and what factors caused it to differ.  These tests are, however, a good start for determining if results obtained from a sample are significant enough for further analysis, and they are widely applicable, as shown by the crop yield and stream pollution examples in part 1 and the home value question in part 2.

Sources:

Definitions of statistical concepts and shapefiles were provided by Ryan Weichelt of the University of Wisconsin-Eau Claire. 
