5.E: The Chi-Square Distribution (Exercises) - Mathematics

These are homework exercises to accompany the Textmap created for "Introductory Statistics" by OpenStax.

11.2: Facts about the Chi-Square Distribution

Decide whether the following statements are true or false.

Q 11.2.1

As the number of degrees of freedom increases, the graph of the chi-square distribution looks more and more symmetrical.

Q 11.2.2

The standard deviation of the chi-square distribution is twice the mean.

Q 11.2.3

The mean and the median of the chi-square distribution are the same if \(df = 24\).

11.3: Goodness-of-Fit Test

For each problem, use a solution sheet to solve the hypothesis test problem. Go to [link] for the chi-square solution sheet. Round expected frequency to two decimal places.

Q 11.3.1

A six-sided die is rolled 120 times. Fill in the expected frequency column. Then, conduct a hypothesis test to determine if the die is fair. The data in Table are the result of the 120 rolls.

Face Value | Frequency | Expected Frequency
1 | 15 |
2 | 29 |
3 | 16 |
4 | 15 |
5 | 30 |
6 | 15 |

Q 11.3.2

The marital status distribution of the U.S. male population, ages 15 and older, is as shown in Table.

Marital Status | Percent | Expected Frequency
never married | 31.3 |
married | 56.1 |
widowed | 2.5 |
divorced/separated | 10.1 |

Suppose that a random sample of 400 U.S. young adult males, 18 to 24 years old, yielded the following frequency distribution. We are interested in whether this age group of males fits the distribution of the U.S. adult population. Calculate the frequency one would expect when surveying 400 people. Fill in Table, rounding to two decimal places.

Marital Status | Frequency
never married | 140
married | 238
widowed | 2
divorced/separated | 20

S 11.3.2

Marital Status | Percent | Expected Frequency
never married | 31.3 | 125.2
married | 56.1 | 224.4
widowed | 2.5 | 10
divorced/separated | 10.1 | 40.4

1. The data fits the distribution.
2. The data does not fit the distribution.
3. 3
4. chi-square distribution with \(df = 3\)
5. 19.27
6. 0.0002
7. Check student’s solution.
1. \(\alpha = 0.05\)
2. Decision: Reject null
3. Reason for decision: \(p\text{-value} < \alpha\)
4. Conclusion: Data does not fit the distribution.
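
The expected frequencies and the test statistic in S 11.3.2 can be checked with a few lines of code. The following is a minimal sketch in plain Python; the function name `goodness_of_fit` and its return layout are illustrative choices, not part of the textbook's solution sheet.

```python
def goodness_of_fit(observed, proportions):
    """Return (expected frequencies, chi-square statistic, degrees of freedom)."""
    n = sum(observed)
    # Expected frequency = sample size * hypothesized proportion.
    expected = [n * p for p in proportions]
    # Sum of (O - E)^2 / E over all categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return expected, statistic, df


# Data from Q 11.3.2: never married, married, widowed, divorced/separated.
observed = [140, 238, 2, 20]
proportions = [0.313, 0.561, 0.025, 0.101]

expected, statistic, df = goodness_of_fit(observed, proportions)
print([round(e, 2) for e in expected])
print(round(statistic, 2), df)
```

The printed values should agree with the expected-frequency column and with items 3 and 5 of the solution above.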

Use the following information to answer the next two exercises: The columns in Table contain the Race/Ethnicity of U.S. Public Schools for a recent year, the percentages for the Advanced Placement Examinee Population for that class, and the Overall Student Population. Suppose the right column contains the result of a survey of 1,000 local students from that year who took an AP Exam.

Race/Ethnicity | AP Examinee Population | Overall Student Population | Survey Frequency
Asian, Asian American, or Pacific Islander | 10.2% | 5.4% | 113
Black or African-American | 8.2% | 14.5% | 94
Hispanic or Latino | 15.5% | 15.9% | 136
White | 59.4% | 61.6% | 604
Not reported/other | 6.1% | 1.4% | 43

Q 11.3.3

Perform a goodness-of-fit test to determine whether the local results follow the distribution of the U.S. overall student population based on ethnicity.

Q 11.3.4

Perform a goodness-of-fit test to determine whether the local results follow the distribution of U.S. AP examinee population, based on ethnicity.

S 11.3.4

1. \(H_{0}\): The local results follow the distribution of the U.S. AP examinee population
2. \(H_{a}\): The local results do not follow the distribution of the U.S. AP examinee population
3. \(df = 5\)
4. chi-square distribution with \(df = 5\)
5. chi-square test statistic = 13.4
6. \(p\text{-value} = 0.0199\)
7. Check student’s solution.
1. \(\alpha = 0.05\)
2. Decision: Reject null when \(\alpha = 0.05\)
3. Reason for Decision: \(p\text{-value} < \alpha\)
4. Conclusion: Local data do not fit the AP Examinee Distribution.
5. Decision: Do not reject null when \(\alpha = 0.01\)
6. Conclusion: There is insufficient evidence to conclude that local data do not follow the distribution of the U.S. AP examinee population.

Q 11.3.5

The City of South Lake Tahoe, CA, has an Asian population of 1,419 people, out of a total population of 23,609. Suppose that a survey of 1,419 self-reported Asians in the Manhattan, NY, area yielded the data in Table. Conduct a goodness-of-fit test to determine if the self-reported sub-groups of Asians in the Manhattan area fit that of the Lake Tahoe area.

Race | Lake Tahoe Frequency | Manhattan Frequency
Asian Indian | 131 | 174
Chinese | 118 | 557
Filipino | 1,045 | 518
Japanese | 80 | 54
Korean | 12 | 29
Vietnamese | 9 | 21
Other | 24 | 66

Use the following information to answer the next two exercises: UCLA conducted a survey of more than 263,000 college freshmen from 385 colleges in fall 2005. The results of students' expected majors by gender were reported in The Chronicle of Higher Education (2/2/2006). Suppose a survey of 5,000 graduating females and 5,000 graduating males was done as a follow-up last year to determine what their actual majors were. The results are shown in the tables for Exercise and Exercise. The second column in each table does not add to 100% because of rounding.

Q 11.3.6

Conduct a goodness-of-fit test to determine if the actual college majors of graduating females fit the distribution of their expected majors.

Major | Women - Expected Major | Women - Actual Major
Arts & Humanities | 14.0% | 670
Biological Sciences | 8.4% | 410
Education | 13.0% | 650
Engineering | 2.6% | 145
Physical Sciences | 2.6% | 125
Professional | 18.9% | 975
Social Sciences | 13.0% | 605
Technical | 0.4% | 15
Other | 5.8% | 300
Undecided | 8.0% | 420

S 11.3.6

1. \(H_{0}\): The actual college majors of graduating females fit the distribution of their expected majors
2. \(H_{a}\): The actual college majors of graduating females do not fit the distribution of their expected majors
3. \(df = 10\)
4. chi-square distribution with \(df = 10\)
5. \(\text{test statistic} = 11.48\)
6. \(p\text{-value} = 0.3211\)
7. Check student’s solution.
1. \(\alpha = 0.05\)
2. Decision: Do not reject null when \(\alpha = 0.05\) and \(\alpha = 0.01\)
3. Reason for decision: \(p\text{-value} > \alpha\)
4. Conclusion: There is insufficient evidence to conclude that the distribution of actual college majors of graduating females does not fit the distribution of their expected majors.

Q 11.3.7

Conduct a goodness-of-fit test to determine if the actual college majors of graduating males fit the distribution of their expected majors.

Major | Men - Expected Major | Men - Actual Major
Arts & Humanities | 11.0% | 600
Biological Sciences | 6.7% | 330
Education | 5.8% | 305
Engineering | 15.6% | 800
Physical Sciences | 3.6% | 175
Professional | 9.3% | 460
Social Sciences | 7.6% | 370
Technical | 1.8% | 90
Other | 8.2% | 400
Undecided | 6.6% | 340

Read the statement and decide whether it is true or false.

Q 11.3.8

In a goodness-of-fit test, the expected values are the values we would expect if the null hypothesis were true.

Q 11.3.9

In general, if the observed values and expected values of a goodness-of-fit test are not close together, then the test statistic can get very large and on a graph will be way out in the right tail.

Q 11.3.10

Use a goodness-of-fit test to determine if high school principals believe that students are absent equally during the week or not.

Q 11.3.11

The test to use to determine if a six-sided die is fair is a goodness-of-fit test.

Q 11.3.12

In a goodness-of-fit test, if the p-value is 0.0113, in general, do not reject the null hypothesis.

Q 11.3.13

A sample of 212 commercial businesses was surveyed for recycling one commodity; a commodity here means any one type of recyclable material such as plastic or aluminum. Table shows the business categories in the survey, the sample size of each category, and the number of businesses in each category that recycle one commodity. Based on the study, on average half of the businesses were expected to be recycling one commodity. As a result, the last column shows the expected number of businesses in each category that recycle one commodity. At the 5% significance level, perform a hypothesis test to determine if the observed number of businesses that recycle one commodity follows the uniform distribution of the expected values.

Business Type | Number in class | Observed Number that recycle one commodity | Expected number that recycle one commodity
Office | 35 | 19 | 17.5
Retail/Wholesale | 48 | 27 | 24
Food/Restaurants | 53 | 35 | 26.5
Manufacturing/Medical | 52 | 21 | 26
Hotel/Mixed | 24 | 9 | 12

Q 11.3.14

Table contains information from a survey among 499 participants classified according to their age groups. The second column shows the percentage of obese people per age class among the study participants. The last column comes from a different study at the national level that shows the corresponding percentages of obese people in the same age classes in the USA. Perform a hypothesis test at the 5% significance level to determine whether the survey participants are a representative sample of the USA obese population.

Age Class (Years) | Obese (Percentage) | Expected USA average (Percentage)
20–30 | 75.0 | 32.6
31–40 | 26.5 | 32.6
41–50 | 13.6 | 36.6
51–60 | 21.9 | 36.6
61–70 | 21.0 | 39.7

S 11.3.14

1. \(H_{0}\): Surveyed obese fit the distribution of expected obese
2. \(H_{a}\): Surveyed obese do not fit the distribution of expected obese
3. \(df = 4\)
4. chi-square distribution with \(df = 4\)
5. \(\text{test statistic} = 54.01\)
6. \(p\text{-value} = 0\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Reject the null hypothesis.
3. Reason for decision: \(p\text{-value} < \alpha\)
4. Conclusion: At the 5% level of significance, from the data, there is sufficient evidence to conclude that the surveyed obese do not fit the distribution of expected obese.

11.4: Test of Independence

For each problem, use a solution sheet to solve the hypothesis test problem. Go to Appendix E for the chi-square solution sheet. Round expected frequency to two decimal places.

Q 11.4.1

A recent debate about where in the United States skiers believe the skiing is best prompted the following survey. Test to see if the best ski area is independent of the level of the skier.

Tahoe | 20 | 30 | 40
Utah | 10 | 30 | 60

Q 11.4.2

Car manufacturers are interested in whether there is a relationship between the size of car an individual drives and the number of people in the driver’s family (that is, whether car size and family size are independent). To test this, suppose that 800 car owners were randomly surveyed with the results in Table. Conduct a test of independence.

Family Size | Sub & Compact | Mid-size | Full-size | Van & Truck
1 | 20 | 35 | 40 | 35
2 | 20 | 50 | 70 | 80
3–4 | 20 | 50 | 100 | 90
5+ | 20 | 30 | 70 | 70

S 11.4.2

1. \(H_{0}\): Car size is independent of family size.
2. \(H_{a}\): Car size is dependent on family size.
3. \(df = 9\)
4. chi-square distribution with \(df = 9\)
5. \(\text{test statistic} = 15.8284\)
6. \(p\text{-value} = 0.0706\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Do not reject the null hypothesis.
3. Reason for decision: \(p\text{-value} > \alpha\)
4. Conclusion: At the 5% significance level, there is insufficient evidence to conclude that car size and family size are dependent.
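
The test statistic and degrees of freedom in S 11.4.2 follow from the rule that each expected count is the row total times the column total divided by the total surveyed. A minimal sketch in plain Python is shown here; the function name `independence_statistic` is a hypothetical helper introduced only for illustration.

```python
def independence_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    # df = (number of rows - 1) * (number of columns - 1).
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: family size 1, 2, 3-4, 5+.
# Columns: sub & compact, mid-size, full-size, van & truck.
car_table = [
    [20, 35, 40, 35],
    [20, 50, 70, 80],
    [20, 50, 100, 90],
    [20, 30, 70, 70],
]

statistic, df = independence_statistic(car_table)
print(round(statistic, 4), df)
```

The same computation applies to the other test-of-independence and test-for-homogeneity tables in this section, since both tests use the same statistic.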

Q 11.4.3

College students may be interested in whether or not their majors have any effect on starting salaries after graduation. Suppose that 300 recent graduates were surveyed as to their majors in college and their starting salaries after graduation. Table shows the data. Conduct a test of independence.

Major | < $50,000 | $50,000 – $68,999 | $69,000 +
English | 5 | 20 | 5
Engineering | 10 | 30 | 60
Nursing | 10 | 15 | 15
Psychology | 20 | 30 | 20

Q 11.4.4

Some travel agents claim that honeymoon hot spots vary according to age of the bride. Suppose that 280 recent brides were interviewed as to where they spent their honeymoons. The information is given in Table. Conduct a test of independence.

Location | 20–29 | 30–39 | 40–49 | 50 and over
Niagara Falls | 15 | 25 | 25 | 20
Poconos | 15 | 25 | 25 | 10
Europe | 10 | 25 | 15 | 5
Virgin Islands | 20 | 25 | 15 | 5

S 11.4.4

1. \(H_{0}\): Honeymoon locations are independent of bride’s age.
2. \(H_{a}\): Honeymoon locations are dependent on bride’s age.
3. \(df = 9\)
4. chi-square distribution with \(df = 9\)
5. \(\text{test statistic} = 15.7027\)
6. \(p\text{-value} = 0.0734\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Do not reject the null hypothesis.
3. Reason for decision: \(p\text{-value} > \alpha\)
4. Conclusion: At the 5% significance level, there is insufficient evidence to conclude that honeymoon location and bride age are dependent.

Q 11.4.5

A manager of a sports club keeps information concerning the main sport in which members participate and their ages. To test whether there is a relationship between the age of a member and his or her choice of sport, 643 members of the sports club are randomly selected. Conduct a test of independence.

Sport | 18 - 25 | 26 - 30 | 31 - 40 | 41 and over
racquetball | 42 | 58 | 30 | 46
tennis | 58 | 76 | 38 | 65
swimming | 72 | 60 | 65 | 33

Q 11.4.6

A major food manufacturer is concerned that the sales for its skinny french fries have been decreasing. As a part of a feasibility study, the company conducts research into the types of fries sold across the country to determine if the type of fries sold is independent of the area of the country. The results of the study are shown in Table. Conduct a test of independence.

Type of Fries | Northeast | South | Central | West
skinny fries | 70 | 50 | 20 | 25
curly fries | 100 | 60 | 15 | 30
steak fries | 20 | 40 | 10 | 10

S 11.4.6

1. \(H_{0}\): The types of fries sold are independent of the location.
2. \(H_{a}\): The types of fries sold are dependent on the location.
3. \(df = 6\)
4. chi-square distribution with \(df = 6\)
5. \(\text{test statistic} = 18.8369\)
6. \(p\text{-value} = 0.0044\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Reject the null hypothesis.
3. Reason for decision: \(p\text{-value} < \alpha\)
4. Conclusion: At the 5% significance level, there is sufficient evidence that types of fries and location are dependent.

Q 11.4.7

According to Dan Lenard, an independent insurance agent in the Buffalo, N.Y. area, the following is a breakdown of the amount of life insurance purchased by males in the following age groups. He is interested in whether the age of the male and the amount of life insurance purchased are independent events. Conduct a test for independence.

Age of Males | None | < $200,000 | $200,000–$400,000 | $401,001–$1,000,000 | $1,000,001+
20–29 | 40 | 15 | 40 | 0 | 5
30–39 | 35 | 5 | 20 | 20 | 10
40–49 | 20 | 0 | 30 | 0 | 30
50+ | 40 | 30 | 15 | 15 | 10

Q 11.4.8

Suppose that 600 thirty-year-olds were surveyed to determine whether or not there is a relationship between the level of education an individual has and salary. Conduct a test of independence.

< $30,000 | 15 | 25 | 10 | 5
$30,000–$40,000 | 20 | 40 | 70 | 30
$40,000–$50,000 | 10 | 20 | 40 | 55
$50,000–$60,000 | 5 | 10 | 20 | 60
$60,000+ | 0 | 5 | 10 | 150

S 11.4.8

1. \(H_{0}\): Salary is independent of level of education.
2. \(H_{a}\): Salary is dependent on level of education.
3. \(df = 12\)
4. chi-square distribution with \(df = 12\)
5. \(\text{test statistic} = 255.7704\)
6. \(p\text{-value} = 0\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Reject the null hypothesis.
3. Reason for decision: \(p\text{-value} < \alpha\)
4. Conclusion: At the 5% significance level, there is sufficient evidence that salary and level of education are dependent.

Read the statement and decide whether it is true or false.

Q 11.4.9

The number of degrees of freedom for a test of independence is equal to the sample size minus one.

Q 11.4.10

The test for independence uses tables of observed and expected data values.

Q 11.4.11

The test to use when determining if the college or university a student chooses to attend is related to his or her socioeconomic status is a test for independence.

Q 11.4.12

In a test of independence, the expected number is equal to the row total multiplied by the column total divided by the total surveyed.

Q 11.4.13

An ice cream maker performs a nationwide survey about favorite flavors of ice cream in different geographic areas of the U.S. Based on Table, do the numbers suggest that geographic location is independent of favorite ice cream flavors? Test at the 5% significance level.

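U.S. region/Flavor | Strawberry | Chocolate | Vanilla | Rocky Road | Mint Chocolate Chip | Pistachio | Row total
East | 8 | 31 | 27 | 8 | 15 | 7 | 96
Midwest | 10 | 32 | 22 | 11 | 15 | 6 | 96
West | 12 | 21 | 22 | 19 | 15 | 8 | 97
South | 15 | 28 | 30 | 8 | 15 | 6 | 102
Column Total | 45 | 112 | 101 | 46 | 60 | 27 | 391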

Q 11.4.14

Table provides a recent survey of the youngest online entrepreneurs whose net worth is estimated at one million dollars or more. Their ages range from 17 to 30. Each cell in the table illustrates the number of entrepreneurs who correspond to the specific age group and their net worth. Are the ages and net worth independent? Perform a test of independence at the 5% significance level.

Age Group / Net Worth Value (in millions of US dollars) | 1–5 | 6–24 | ≥25 | Row Total
17–25 | 8 | 7 | 5 | 20
26–30 | 6 | 5 | 9 | 20
Column Total | 14 | 12 | 14 | 40

S 11.4.14

1. \(H_{0}\): Age is independent of the youngest online entrepreneurs’ net worth.
2. \(H_{a}\): Age is dependent on the net worth of the youngest online entrepreneurs.
3. \(df = 2\)
4. chi-square distribution with \(df = 2\)
5. \(\text{test statistic} = 1.76\)
6. \(p\text{-value} = 0.4144\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Do not reject the null hypothesis.
3. Reason for decision: \(p\text{-value} > \alpha\)
4. Conclusion: At the 5% significance level, there is insufficient evidence to conclude that age and net worth for the youngest online entrepreneurs are dependent.
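
When \(df = 2\), the chi-square distribution coincides with an exponential distribution with mean 2, so the right-tail p-value has the closed form \(e^{-x/2}\). The sketch below, in plain Python, recomputes the statistic and p-value of S 11.4.14 under that assumption; the shortcut is valid only for \(df = 2\), and the variable names are illustrative.

```python
import math

# Observed counts from Q 11.4.14.
# Rows: ages 17-25 and 26-30. Columns: net worth 1-5, 6-24, >= 25 (millions of US dollars).
observed = [[8, 7, 5], [6, 5, 9]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Expected count under independence.
        expected = row_totals[i] * col_totals[j] / grand_total
        statistic += (count - expected) ** 2 / expected

# Closed form that holds only for df = 2: P(chi-square > x) = exp(-x / 2).
p_value = math.exp(-statistic / 2)
print(round(statistic, 2), round(p_value, 4))
```

For other degrees of freedom, a statistics table, calculator, or library is needed to obtain the p-value.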

Q 11.4.15

A 2013 poll in California surveyed people about taxing sugar-sweetened beverages. The results are presented in Table, and are classified by ethnic group and response type. Are the poll responses independent of the participants’ ethnic group? Conduct a test of independence at the 5% significance level.

Opinion/Ethnicity | Asian-American | White/Non-Hispanic | African-American | Latino | Row Total
Against tax | 48 | 433 | 41 | 160 | 628
In Favor of tax | 54 | 234 | 24 | 147 | 459
No opinion | 16 | 43 | 16 | 19 | 84
Column Total | 118 | 710 | 71 | 272 | 1171

11.5: Test for Homogeneity

For each word problem, use a solution sheet to solve the hypothesis test problem. Round expected frequency to two decimal places.

Q 11.5.1

A psychologist is interested in testing whether there is a difference in the distribution of personality types for business majors and social science majors. The results of the study are shown in Table. Conduct a test of homogeneity. Test at a 5% level of significance.

 | Open | Conscientious | Extrovert | Agreeable | Neurotic
Business | 41 | 52 | 46 | 61 | 58
Social Science | 72 | 75 | 63 | 80 | 65

S 11.5.1

1. \(H_{0}\): The distribution for personality types is the same for both majors
2. \(H_{a}\): The distribution for personality types is not the same for both majors
3. \(df = 4\)
4. chi-square with \(df = 4\)
5. \(\text{test statistic} = 3.01\)
6. \(p\text{-value} = 0.5568\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Do not reject the null hypothesis.
3. Reason for decision: \(p\text{-value} > \alpha\)
4. Conclusion: There is insufficient evidence to conclude that the distribution of personality types is different for business and social science majors.

Q 11.5.2

Do men and women select different breakfasts? The breakfasts ordered by randomly selected men and women at a popular breakfast place is shown in Table. Conduct a test for homogeneity at a 5% level of significance.

 | French Toast | Pancakes | Waffles | Omelettes
Men | 47 | 35 | 28 | 53
Women | 65 | 59 | 55 | 60

Q 11.5.3

A fisherman is interested in whether the distribution of fish caught in Green Valley Lake is the same as the distribution of fish caught in Echo Lake. Of the 191 randomly selected fish caught in Green Valley Lake, 105 were rainbow trout, 27 were other trout, 35 were bass, and 24 were catfish. Of the 293 randomly selected fish caught in Echo Lake, 115 were rainbow trout, 58 were other trout, 67 were bass, and 53 were catfish. Perform a test for homogeneity at a 5% level of significance.

S 11.5.3

1. \(H_{0}\): The distribution for fish caught is the same in Green Valley Lake and in Echo Lake.
2. \(H_{a}\): The distribution for fish caught is not the same in Green Valley Lake and in Echo Lake.
3. \(df = 3\)
4. chi-square with \(df = 3\)
5. \(\text{test statistic} = 11.75\)
6. \(p\text{-value} = 0.0083\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Reject the null hypothesis.
3. Reason for decision: \(p\text{-value} < \alpha\)
4. Conclusion: There is evidence to conclude that the distribution of fish caught is different in Green Valley Lake and in Echo Lake.

Q 11.5.4

In 2007, the United States had 1.5 million homeschooled students, according to the U.S. National Center for Education Statistics. In Table you can see that parents decide to homeschool their children for different reasons, and some reasons are ranked by parents as more important than others. According to the survey results shown in the table, is the distribution of applicable reasons the same as the distribution of the most important reason? Provide your assessment at the 5% significance level. Did you expect the result you obtained?

Reasons for Homeschooling | Applicable Reason (in thousands of respondents) | Most Important Reason (in thousands of respondents) | Row Total
Concern about the environment of other schools | 1,321 | 309 | 1,630
Dissatisfaction with academic instruction at other schools | 1,096 | 258 | 1,354
To provide religious or moral instruction | 1,257 | 540 | 1,797
Child has special needs, other than physical or mental | 315 | 55 | 370
Other reasons (e.g., finances, travel, family time, etc.) | 485 | 216 | 701
Column Total | 5,458 | 1,477 | 6,935

Q 11.5.5

When looking at energy consumption, we are often interested in detecting trends over time and how they correlate among different countries. The information in Table shows the average energy use (in units of kg of oil equivalent per capita) in the USA and the joint European Union countries (EU) for the six-year period 2005 to 2010. Do the energy use values in these two areas come from the same distribution? Perform the analysis at the 5% significance level.

Year | European Union | United States | Row Total
2010 | 3,413 | 7,164 | 10,577
2009 | 3,302 | 7,057 | 10,359
2008 | 3,505 | 7,488 | 10,993
2007 | 3,537 | 7,758 | 11,295
2006 | 3,595 | 7,697 | 11,292
2005 | 3,613 | 7,847 | 11,460
Column Total | 20,965 | 45,011 | 65,976

S 11.5.5

1. \(H_{0}\): The distribution of average energy use in the USA is the same as in Europe between 2005 and 2010.
2. \(H_{a}\): The distribution of average energy use in the USA is not the same as in Europe between 2005 and 2010.
3. \(df = 5\)
4. chi-square with \(df = 5\)
5. \(\text{test statistic} = 2.7434\)
6. \(p\text{-value} = 0.7395\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Do not reject the null hypothesis.
3. Reason for decision: \(p\text{-value} > \alpha\)
4. Conclusion: At the 5% significance level, there is insufficient evidence to conclude that the average energy use values in the US and EU are derived from different distributions for the period from 2005 to 2010.

Q 11.5.6

The Insurance Institute for Highway Safety collects safety information about all types of cars every year, and publishes a report of Top Safety Picks among all cars, makes, and models. Table presents the number of Top Safety Picks in six car categories for the two years 2009 and 2013. Analyze the table data to conclude whether the distribution of cars that earned the Top Safety Picks safety award has remained the same between 2009 and 2013. Derive your results at the 5% significance level.

Year / Car Type | Small | Mid-Size | Large | Small SUV | Mid-Size SUV | Large SUV | Row Total
2009 | 12 | 22 | 10 | 10 | 27 | 6 | 87
2013 | 31 | 30 | 19 | 11 | 29 | 4 | 124
Column Total | 43 | 52 | 29 | 21 | 56 | 10 | 211

11.6: Comparison of the Chi-Square Tests

For each word problem, use a solution sheet to solve the hypothesis test problem. Round expected frequency to two decimal places.

Q 11.6.1

Is there a difference between the distribution of community college statistics students and the distribution of university statistics students in what technology they use on their homework? Of some randomly selected community college students, 43 used a computer, 102 used a calculator with built in statistics functions, and 65 used a table from the textbook. Of some randomly selected university students, 28 used a computer, 33 used a calculator with built in statistics functions, and 40 used a table from the textbook. Conduct an appropriate hypothesis test using a 0.05 level of significance.

S 11.6.1

1. \(H_{0}\): The distribution for technology use is the same for community college students and university students.
2. \(H_{a}\): The distribution for technology use is not the same for community college students and university students.
3. \(df = 2\)
4. chi-square with \(df = 2\)
5. \(\text{test statistic} = 7.05\)
6. \(p\text{-value} = 0.0294\)
7. Check student’s solution.
1. \(\alpha: 0.05\)
2. Decision: Reject the null hypothesis.
3. Reason for decision: \(p\text{-value} < \alpha\)
4. Conclusion: There is sufficient evidence to conclude that the distribution of technology use for statistics homework is not the same for statistics students at community colleges and at universities.

Read the statement and decide whether it is true or false.

Q 11.6.2

If \(df = 2\), the chi-square distribution has a shape that reminds us of the exponential distribution.

11.7: Test of a Single Variance

Use the following information to answer the next twelve exercises: Suppose an airline claims that its flights are consistently on time with an average delay of at most 15 minutes. It claims that the average delay is so consistent that the variance is no more than 150 minutes. Doubting the consistency part of the claim, a disgruntled traveler calculates the delays for his next 25 flights. The average delay for those 25 flights is 22 minutes with a standard deviation of 15 minutes.

Q 11.7.1

Is the traveler disputing the claim about the average or about the variance?

Q 11.7.2

A sample standard deviation of 15 minutes is the same as a sample variance of __________ minutes.

Q 11.7.3

Is this a right-tailed, left-tailed, or two-tailed test?

Q 11.7.4

\(H_{0}\): __________

S 11.7.4

\(H_{0}: \sigma^{2} \leq 150\)

Q 11.7.5

\(df =\) ________

Q 11.7.6

chi-square test statistic = ________

Q 11.7.7

\(p\text{-value} =\) ________

Q 11.7.8

Graph the situation. Label and scale the horizontal axis. Mark the mean and test statistic. Shade the \(p\text{-value}\).

S 11.7.8

Check student’s solution.

Q 11.7.9

Let \(\alpha = 0.05\)

Decision: ________

Conclusion (write out in a complete sentence.): ________

Q 11.7.10

How did you know to test the variance instead of the mean?

S 11.7.10

The claim is that the variance is no more than 150 minutes.

Q 11.7.11

If an additional test were done on the claim of the average delay, which distribution would you use?

Q 11.7.12

If an additional test were done on the claim of the average delay, but 45 flights were surveyed, which distribution would you use?

S 11.7.12

a Student's \(t\)- or normal distribution

For each word problem, use a solution sheet to solve the hypothesis test problem. Round expected frequency to two decimal places.

Q 11.7.13

A plant manager is concerned her equipment may need recalibrating. It seems that the actual weight of the 15 oz. cereal boxes it fills has been fluctuating. The standard deviation should be at most 0.5 oz. In order to determine if the machine needs to be recalibrated, 84 randomly selected boxes of cereal from the next day’s production were weighed. The standard deviation of the 84 boxes was 0.54. Does the machine need to be recalibrated?

S 11.7.20

1. \(H_{0}: \sigma^{2} = 25^{2}\)
2. \(H_{a}: \sigma^{2} > 25^{2}\)
3. \(df = n - 1 = 7\)
4. test statistic: \(\chi^{2} = \chi^{2}_{7} = \frac{(n-1)s^{2}}{25^{2}} = \frac{(8-1)(34.29)^{2}}{25^{2}} = 13.169\)
5. \(p\text{-value}: P(\chi^{2}_{7} > 13.169) = 1 - P(\chi^{2}_{7} \leq 13.169) = 0.0681\)
1. \(\alpha: 0.05\)
2. Decision: Do not reject the null hypothesis
3. Reason for decision: \(p\text{-value} > \alpha\)
4. Conclusion: At the 5% level, there is insufficient evidence to conclude that the variance is more than 625.
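
The test statistic for a test of a single variance is \((n-1)s^{2}\) divided by the hypothesized variance. As a quick check of the arithmetic in S 11.7.20, here is a minimal sketch in plain Python; the function name `variance_test_statistic` is a hypothetical helper, and the inputs are the values quoted in the solution above.

```python
def variance_test_statistic(n, sample_sd, hypothesized_variance):
    """Return the chi-square statistic (n - 1) * s^2 / sigma_0^2 and its degrees of freedom."""
    df = n - 1
    statistic = df * sample_sd ** 2 / hypothesized_variance
    return statistic, df


# Values taken from S 11.7.20: n = 8, s = 34.29, hypothesized variance 25^2 = 625.
statistic, df = variance_test_statistic(8, 34.29, 25 ** 2)
print(round(statistic, 3), df)
```

The same function applies to the airline exercises earlier in this section, with the sample size, sample standard deviation, and claimed variance substituted in.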

Q 11.7.21

A company packages apples by weight. One of the weight grades is Class A apples. Class A apples have a mean weight of 150 g, and there is a maximum allowed weight tolerance of 5% above or below the mean for apples in the same consumer package. A batch of apples is selected to be included in a Class A apple package. Given the following apple weights of the batch, does the fruit comply with the Class A grade weight tolerance requirements? Conduct an appropriate hypothesis test.

1. at the 5% significance level
2. at the 1% significance level

Weights in selected apple batch (in grams): 158; 167; 149; 169; 164; 139; 154; 150; 157; 171; 152; 161; 141; 166; 172

Practice

If the number of degrees of freedom for a chi-square distribution is 25, what is the population mean and standard deviation?

If df > 90, the distribution is _____________. If df = 15, the distribution is ________________.

When does the chi-square curve approximate a normal distribution?

Where is μ located on a chi-square curve?

Is it more likely the df is 90, 20, or two in the graph?

11.2 Goodness-of-Fit Test

Determine the appropriate test to be used in the next three exercises.

An archeologist is calculating the distribution of the frequency of the number of artifacts she finds in a dig site. Based on previous digs, the archeologist creates an expected distribution broken down by grid sections in the dig site. Once the site has been fully excavated, she compares the actual number of artifacts found in each grid section to see if her expectation was accurate.

An economist is deriving a model to predict outcomes on the stock market. He creates a list of expected points on the stock market index for the next two weeks. At the close of each day’s trading, he records the actual points on the index. He wants to see how well his model matched what actually happened.

A personal trainer is putting together a weight-lifting program for her clients. For a 90-day program, she expects each client to lift a specific maximum weight each week. As she goes along, she records the actual maximum weights her clients lifted. She wants to know how well her expectations met with what was observed.

Use the following information to answer the next five exercises: A teacher predicts the distribution of grades on the final exam; the predicted distribution is recorded in Table 11.27.

The actual distribution for a class of 20 is in Table 11.28.

State the null and alternative hypotheses.

χ² test statistic = ______

At the 5% significance level, what can you conclude?

Use the following information to answer the next nine exercises: The following data are real. The cumulative number of AIDS cases reported for Santa Clara County is broken down by ethnicity as in Table 11.29.

Ethnicity | Number of Cases
White | 2,229
Hispanic | 1,157
Black/African-American | 457
Asian, Pacific Islander | 232
Total = 4,075

The percentage of each ethnic group in Santa Clara County is as in Table 11.30.

Ethnicity | Percentage of total county population | Number expected (round to two decimal places)
White | 42.9% | 1748.18
Hispanic | 26.7% |
Black/African-American | 2.6% |
Asian, Pacific Islander | 27.8% |
Total = 100%

If the ethnicities of AIDS victims followed the ethnicities of the total county population, fill in the expected number of cases per ethnic group.
Perform a goodness-of-fit test to determine whether the occurrence of AIDS cases follows the ethnicities of the general population of Santa Clara County.

Is this a right-tailed, left-tailed, or two-tailed test?

degrees of freedom = _______

χ² test statistic = _______

Graph the situation. Label and scale the horizontal axis. Mark the mean and test statistic. Shade in the region corresponding to the p-value.

Reason for the Decision: ________________

Conclusion (write out in complete sentences): ________________

Does it appear that the pattern of AIDS cases in Santa Clara County corresponds to the distribution of ethnic groups in this county? Why or why not?

11.3 Test of Independence

Determine the appropriate test to be used in the next three exercises.

A pharmaceutical company is interested in the relationship between age and presentation of symptoms for a common viral infection. A random sample is taken of 500 people with the infection across different age groups.

The owner of a baseball team is interested in the relationship between player salaries and team winning percentage. He takes a random sample of 100 players from different organizations.

A marathon runner is interested in the relationship between the brand of shoes runners wear and their run times. She takes a random sample of 50 runners and records their run times as well as the brand of shoes they were wearing.

Use the following information to answer the next seven exercises: Transit Railroads is interested in the relationship between travel distance and the ticket class purchased. A random sample of 200 passengers is taken. Table 11.31 shows the results. The railroad wants to know if a passenger’s choice in ticket class is independent of the distance they must travel.

Traveling Distance Third class Second class First class Total
1–100 miles 21 14 6 41
101–200 miles 18 16 8 42
201–300 miles 16 17 15 48
301–400 miles 12 14 21 47
401–500 miles 6 6 10 22
Total 73 67 60 200

How many passengers are expected to travel between 201 and 300 miles and purchase second-class tickets?

How many passengers are expected to travel between 401 and 500 miles and purchase first-class tickets?

What is the test statistic?

What can you conclude at the 5% level of significance?
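For a test of independence, each expected count is (row total × column total) / n. The sketch below applies that rule to Table 11.31; the list layout and variable names are assumptions made for illustration, and the 5% critical value for 8 degrees of freedom (15.507) is a standard table value that the exercise itself does not state.

```python
# Rows: distance bands (1-100, 101-200, 201-300, 301-400, 401-500 miles).
# Columns: third class, second class, first class (Table 11.31).
table = [
    [21, 14, 6],
    [18, 16, 8],
    [16, 17, 15],
    [12, 14, 21],
    [6, 6, 10],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count in each cell under independence: E = R * C / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Test statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(table) - 1) * (len(table[0]) - 1)

critical_value = 15.507  # chi-square table value for alpha = 0.05, df = 8

print(f"201-300 miles, second class: E = {expected[2][1]:.2f}")
print(f"401-500 miles, first class:  E = {expected[4][2]:.2f}")
print(f"chi-square = {chi_square:.2f}, df = {df}")
print("reject H0" if chi_square >= critical_value else "do not reject H0")
```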

Use the following information to answer the next eight exercises: An article in the New England Journal of Medicine discussed a study on smokers in California and Hawaii. In one part of the report, the self-reported ethnicity and smoking levels per day were given. Of the people smoking at most ten cigarettes per day, there were 9,886 African Americans, 2,745 Native Hawaiians, 12,831 Latinos, 8,378 Japanese Americans and 7,650 Whites. Of the people smoking 11 to 20 cigarettes per day, there were 6,514 African Americans, 3,062 Native Hawaiians, 4,932 Latinos, 10,680 Japanese Americans, and 9,877 Whites. Of the people smoking 21 to 30 cigarettes per day, there were 1,671 African Americans, 1,419 Native Hawaiians, 1,406 Latinos, 4,715 Japanese Americans, and 6,062 Whites. Of the people smoking at least 31 cigarettes per day, there were 759 African Americans, 788 Native Hawaiians, 800 Latinos, 2,305 Japanese Americans, and 3,970 Whites.

The Uniform Distribution

For each probability and percentile problem, draw the picture.

Q 5.3.1

Births are approximately uniformly distributed between the 52 weeks of the year. They can be said to follow a uniform distribution from one to 53 (spread of 52 weeks).

1. \(X \sim\) _________
2. Graph the probability distribution.
3. \(f(x) =\) _________
4. \(\mu =\) _________
5. \(\sigma =\) _________
6. Find the probability that a person is born at the exact moment week 19 starts. That is, find \(P(x = 19) =\) _________
7. \(P(2 < x < 31) =\) _________
8. Find the probability that a person is born after week 40.
9. \(P(12 < x \mid x < 28) =\) _________
10. Find the 70th percentile.
11. Find the minimum for the upper quarter.

Q 5.3.2

A random number generator picks a number from one to nine in a uniform manner.

1. \(X \sim\) _________
2. Graph the probability distribution.
3. \(f(x) =\) _________
4. \(\mu =\) _________
5. \(\sigma =\) _________
6. \(P(3.5 < x < 7.25) =\) _________
7. \(P(x > 5.67) =\) _________
8. \(P(x > 5 \mid x > 3) =\) _________
9. Find the 90th percentile.

S 5.3.2

1. \(X \sim U(1, 9)\)
2. Check student's solution.
3. \(f(x) = \frac{1}{8}\) where \(1 \leq x \leq 9\)
4. five
5. 2.3
6. \(\frac{15}{32}\)
7. \(\frac{333}{800}\)
8. \(\frac{2}{3}\)
9. 8.2
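The answers above follow directly from the uniform-distribution formulas: density 1/(b − a), mean (a + b)/2, standard deviation (b − a)/√12, and probabilities as interval length divided by (b − a). The sketch below recomputes them with exact fractions; the helper `prob` and the variable names are illustrative assumptions, not part of the exercise.

```python
from fractions import Fraction
import math

# X ~ U(a, b) with a = 1 and b = 9, as in the exercise.
a, b = Fraction(1), Fraction(9)
width = b - a

density = 1 / width                       # f(x) on [a, b]
mean = (a + b) / 2
sd = math.sqrt(float(width) ** 2 / 12)    # (b - a) / sqrt(12)


def prob(lo, hi):
    """P(lo < X < hi): length of the interval clipped to [a, b], over (b - a)."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, Fraction(0)) / width


p_between = prob(Fraction("3.5"), Fraction("7.25"))
p_above = prob(Fraction("5.67"), b)
p_conditional = prob(5, b) / prob(3, b)   # P(X > 5 | X > 3)
pct_90 = a + Fraction(9, 10) * width      # 90th percentile

print(density, mean, round(sd, 2))
print(p_between, p_above, p_conditional, float(pct_90))
```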

Q 5.3.3

According to a study by Dr. John McDougall of his live-in weight loss program at St. Helena Hospital, the people who follow his program lose between six and 15 pounds a month until they approach trim body weight. Let's suppose that the weight loss is uniformly distributed. We are interested in the weight loss of a randomly selected individual following the program for one month.

1. Define the random variable. \(X =\) _________
2. \(X \sim\) _________
3. Graph the probability distribution.
4. \(f(x) =\) _________
5. \(\mu =\) _________
6. \(\sigma =\) _________
7. Find the probability that the individual lost more than ten pounds in a month.
8. Suppose it is known that the individual lost more than ten pounds in a month. Find the probability that he lost less than 12 pounds in the month.
9. \(P(7 < x < 13 \mid x > 9) =\) __________. State this in a probability question, similarly to parts 7 and 8, draw the picture, and find the probability.

Q 5.3.4

A subway train on the Red Line arrives every eight minutes during rush hour. We are interested in the length of time a commuter must wait for a train to arrive. The time follows a uniform distribution.

1. Define the random variable. \(X =\) _______
2. \(X \sim\) _______
3. Graph the probability distribution.
4. \(f(x) =\) _______
5. \(\mu =\) _______
6. \(\sigma =\) _______
7. Find the probability that the commuter waits less than one minute.
8. Find the probability that the commuter waits between three and four minutes.
9. Sixty percent of commuters wait more than how long for the train? State this in a probability question, similarly to parts 7 and 8, draw the picture, and find the probability.

S 5.3.5

1. \(X\) represents the length of time a commuter must wait for a train to arrive on the Red Line.
2. \(X \sim U(0, 8)\)
3. \(f(x) = \frac{1}{8}\) where \(0 \leq x \leq 8\)
4. four
5. 2.31
6. \(\frac{1}{8}\)
7. \(\frac{1}{8}\)
8. 3.2

Q 5.3.6

The age of a first grader on September 1 at Garden Elementary School is uniformly distributed from 5.8 to 6.8 years. We randomly select one first grader from the class.

1. Define the random variable. \(X =\) _________
2. \(X \sim\) _________
3. Graph the probability distribution.
4. \(f(x) =\) _________
5. \(\mu =\) _________
6. \(\sigma =\) _________
7. Find the probability that she is over 6.5 years old.
8. Find the probability that she is between four and six years old.
9. Find the 70th percentile for the age of first graders on September 1 at Garden Elementary School.

Use the following information to answer the next three exercises. The Sky Train from the terminal to the rental-car and long-term parking center is supposed to arrive every eight minutes. The waiting times for the train are known to follow a uniform distribution.

Computational Exercises

Suppose that a missile is fired at a target at the origin of a plane coordinate system, with units in meters. The missile lands at \((X, Y)\) where \(X\) and \(Y\) are independent and each has the normal distribution with mean 0 and variance 100. The missile will destroy the target if it lands within 20 meters of the target. Find the probability of this event.

Let \(Z\) denote the distance from the missile to the target. \(P(Z < 20) = 1 - e^{-2} \approx 0.8647\)

Suppose that \(X\) has the chi-square distribution with \(n = 18\) degrees of freedom. For each of the following, compute the true value using the special distribution calculator and then compute the normal approximation. Compare the results.
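The stated answer can be reached through the chi-square distribution: \((X/10)^2 + (Y/10)^2\) has the chi-square distribution with 2 degrees of freedom, whose distribution function is \(1 - e^{-x/2}\), so \(P(Z < 20) = P(Z^2/100 < 4) = 1 - e^{-2}\). The sketch below evaluates that closed form and compares it with a seeded simulation; the sample size and seed are arbitrary choices made for illustration.

```python
import math
import random

sigma = 10.0    # standard deviation of X and Y (variance 100)
radius = 20.0   # the target is destroyed within this distance

# Closed form: Z^2 / sigma^2 is chi-square with 2 degrees of freedom,
# so P(Z < r) = 1 - exp(-r^2 / (2 * sigma^2)).
exact = 1.0 - math.exp(-radius ** 2 / (2.0 * sigma ** 2))

# Monte Carlo check with a fixed seed so the run is repeatable.
random.seed(0)
trials = 200_000
hits = sum(
    1
    for _ in range(trials)
    if math.hypot(random.gauss(0.0, sigma), random.gauss(0.0, sigma)) < radius
)
estimate = hits / trials

print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```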

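The chi-square distribution with \(\nu\) degrees of freedom has the probability density function

\[ f(x) = \frac{e^{-x/2}\, x^{\nu/2 - 1}}{2^{\nu/2}\, \Gamma(\nu/2)} \quad \text{for } x \geq 0, \]

where \(\nu\) is the shape parameter and \(\Gamma\) is the gamma function. The formula for the gamma function is

\[ \Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\, dt. \]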

In a testing context, the chi-square distribution is treated as a "standardized distribution" (i.e., no location or scale parameters). However, in a distributional modeling context (as with other probability distributions), the chi-square distribution itself can be transformed with a location parameter, \(\mu\), and a scale parameter, \(\sigma\).

The following is the plot of the chi-square probability density function for 4 different values of the shape parameter.

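Cumulative Distribution Function

The formula for the cumulative distribution function of the chi-square distribution is

\[ F(x) = \frac{\gamma(\nu/2,\, x/2)}{\Gamma(\nu/2)} \quad \text{for } x \geq 0, \]

where \(\Gamma\) is the gamma function defined above and \(\gamma\) is the incomplete gamma function. The formula for the incomplete gamma function is

\[ \gamma(a, x) = \int_0^{x} t^{a-1} e^{-t}\, dt. \]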

The following is the plot of the chi-square cumulative distribution function with the same values of \(\nu\) as the pdf plots above.

Percent Point Function

The formula for the percent point function of the chi-square distribution does not exist in a simple closed form. It is computed numerically.

The following is the plot of the chi-square percent point function with the same values of \(\nu\) as the pdf plots above.
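The density and distribution function can be evaluated with the standard library alone. The sketch below is one possible implementation, not a reference one: it computes the density on the log scale with `math.lgamma` and sums the usual power series for the incomplete gamma function. The function names are hypothetical, and the series is intended for the moderate arguments that arise in these exercises.

```python
import math


def chi2_pdf(x, nu):
    """Chi-square density with nu degrees of freedom (taken as 0 for x <= 0)."""
    if x <= 0:
        return 0.0
    half = nu / 2.0
    log_density = ((half - 1.0) * math.log(x) - x / 2.0
                   - half * math.log(2.0) - math.lgamma(half))
    return math.exp(log_density)


def regularized_lower_gamma(a, x, tol=1e-14, max_terms=10_000):
    """gamma(a, x) / Gamma(a), using the series
    gamma(a, x) = x^a e^(-x) * sum_k x^k / (a (a+1) ... (a+k))."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, max_terms):
        term *= x / (a + k)
        total += term
        if term < tol * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def chi2_cdf(x, nu):
    """Chi-square distribution function F(x) = gamma(nu/2, x/2) / Gamma(nu/2)."""
    return regularized_lower_gamma(nu / 2.0, x / 2.0)


# Right-tail area beyond a tabulated critical value (9.236 for df = 5).
tail = 1.0 - chi2_cdf(9.236, 5)
print(f"pdf(2, nu=2) = {chi2_pdf(2.0, 2):.5f}")
print(f"P(X > 9.236) for nu = 5: {tail:.4f}")
```

The right-tail area `1 - chi2_cdf(x, nu)` is the p-value of a right-tailed chi-square test with observed statistic `x`.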

Suppose we wish to determine if an ordinary-looking six-sided die is fair, or balanced, meaning that every face has probability 1/6 of landing on top when the die is tossed. We could toss the die dozens, maybe hundreds, of times and compare the actual number of times each face landed on top to the expected number, which would be 1/6 of the total number of tosses. We wouldn’t expect each number to be exactly 1/6 of the total, but it should be close. To be specific, suppose the die is tossed \(n = 60\) times with the results summarized in Table 11.8 "Die Contingency Table". For ease of reference we add a column of expected frequencies, which in this simple example is simply a column of 10s. The result is shown as Table 11.9 "Updated Die Contingency Table". In analogy with the previous section we call this an “updated” table. A measure of how much the data deviate from what we would expect to see if the die really were fair is the sum of the squares of the differences between the observed frequency \(O\) and the expected frequency \(E\) in each row, or, standardizing by dividing each square by the expected number, the sum \(\sum (O - E)^2 / E\). If we formulate the investigation as a test of hypotheses, the test is

\(H_0\): The die is fair vs. \(H_a\): The die is not fair

Table 11.8 Die Contingency Table

Die Value Assumed Distribution Observed Frequency
1 1/6 9
2 1/6 15
3 1/6 9
4 1/6 8
5 1/6 6
6 1/6 13

Table 11.9 Updated Die Contingency Table

Die Value Assumed Distribution Observed Freq. Expected Freq.
1 1/6 9 10
2 1/6 15 10
3 1/6 9 10
4 1/6 8 10
5 1/6 6 10
6 1/6 13 10

We would reject the null hypothesis that the die is fair only if the number \(\sum (O - E)^2 / E\) is large, so the test is right-tailed. In this example the random variable \(\sum (O - E)^2 / E\) has the chi-square distribution with five degrees of freedom. If we had decided at the outset to test at the 10% level of significance, the critical value defining the rejection region would be, reading from Figure 12.4 "Critical Values of Chi-Square Distributions", \(\chi^2_{\alpha} = \chi^2_{0.10} = 9.236\), so that the rejection region would be the interval \([9.236, \infty)\). When we compute the value of the standardized test statistic using the numbers in the last two columns of Table 11.9 "Updated Die Contingency Table", we obtain

\[ \sum \frac{(O - E)^2}{E} = \frac{(-1)^2}{10} + \frac{5^2}{10} + \frac{(-1)^2}{10} + \frac{(-2)^2}{10} + \frac{(-4)^2}{10} + \frac{3^2}{10} = 0.1 + 2.5 + 0.1 + 0.4 + 1.6 + 0.9 = 5.6 \]

Since 5.6 < 9.236 the decision is not to reject \(H_0\). See Figure 11.5 "Balanced Die". The data do not provide sufficient evidence, at the 10% level of significance, to conclude that the die is loaded.
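The arithmetic of the die example can be reproduced in a few lines. This is a small sketch using the observed frequencies from Table 11.8 and the critical value 9.236 quoted above; the variable names are illustrative.

```python
# Observed frequencies for faces 1 to 6 from Table 11.8 (n = 60 tosses).
observed = [9, 15, 9, 8, 6, 13]
n = sum(observed)

# A fair die gives each face probability 1/6, so every expected count is n / 6.
expected = [n / 6] * len(observed)

# Standardized sum of squared deviations: sum of (O - E)^2 / E.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

critical_value = 9.236  # chi-square table value for alpha = 0.10, df = 5
reject = chi_square >= critical_value  # right-tailed rejection region

print(f"chi-square = {chi_square:.1f}, df = {df}, reject H0: {reject}")
```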

In the general situation we consider a discrete random variable that can take \(I\) different values, \(x_1, x_2, \ldots, x_I\), for which the default assumption is that the probability distribution is

x       x1   x2   …   xI
P(x)    p1   p2   …   pI

We wish to test the hypotheses

\(H_0\): The assumed probability distribution for \(X\) is valid vs. \(H_a\): The assumed probability distribution for \(X\) is not valid

We take a sample of size \(n\) and obtain a list of observed frequencies. This is shown in Table 11.10 "General Contingency Table". Based on the assumed probability distribution we also have a list of expected frequencies, each of which is defined and computed by the formula

\[ E_i = n \times p_i \]

Table 11.10 General Contingency Table

Factor Levels Assumed Distribution Observed Frequency
1 p1 O1
2 p2 O2
I pI OI

Table 11.10 "General Contingency Table" is updated to Table 11.11 "Updated General Contingency Table" by adding the expected frequency for each value of X. To simplify the notation we drop indices for the observed and expected frequencies and represent Table 11.11 "Updated General Contingency Table" by Table 11.12 "Simplified Updated General Contingency Table".

Table 11.11 Updated General Contingency Table

Factor Levels Assumed Distribution Observed Freq. Expected Freq.
1 p1 O1 E1
2 p2 O2 E2
I pI OI EI

Table 11.12 Simplified Updated General Contingency Table

Factor Levels Assumed Distribution Observed Freq. Expected Freq.
1 p1 O E
2 p2 O E
I pI O E

Here is the test statistic for the general hypothesis based on Table 11.12 "Simplified Updated General Contingency Table", together with the conditions that it follow a chi-square distribution.

Test Statistic for Testing Goodness of Fit to a Discrete Probability Distribution

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

where the sum is over all the rows of the table (one for each value of X).

If

1. the true probability distribution of X is as assumed, and
2. the observed count O of each cell in Table 11.12 "Simplified Updated General Contingency Table" is at least 5,

then \(\chi^2\) approximately follows a chi-square distribution with \(df = I - 1\) degrees of freedom.

The test is known as a goodness-of-fit \(\chi^2\) test since it tests the null hypothesis that the sample fits the assumed probability distribution well. It is always right-tailed, since deviation from the assumed probability distribution corresponds to large values of \(\chi^2\).

Testing is done using either of the usual five-step procedures.

Example 2

Table 11.13 "Ethnic Groups in the Census Year" shows the distribution of various ethnic groups in the population of a particular state based on a decennial U.S. census. Five years later a random sample of 2,500 residents of the state was taken, with the results given in Table 11.14 "Sample Data Five Years After the Census Year" (along with the probability distribution from the census year). Test, at the 1% level of significance, whether there is sufficient evidence in the sample to conclude that the distribution of ethnic groups in this state five years after the census had changed from that in the census year.

Table 11.13 Ethnic Groups in the Census Year

Ethnicity White Black Amer.-Indian Hispanic Asian Others
Proportion 0.743 0.216 0.012 0.012 0.008 0.009

Table 11.14 Sample Data Five Years After the Census Year

Ethnicity Assumed Distribution Observed Frequency
White 0.743 1732
Black 0.216 538
American-Indian 0.012 32
Hispanic 0.012 42
Asian 0.008 133
Others 0.009 23

We test using the critical value approach.

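Step 1. The hypotheses of interest in this case can be expressed as

\(H_0\): The distribution of ethnic groups has not changed vs. \(H_a\): The distribution of ethnic groups has changed, tested at \(\alpha = 0.01\).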

Step 3. To compute the value of the test statistic we must first compute the expected number for each row of Table 11.14 "Sample Data Five Years After the Census Year". Since \(n = 2500\), using the formula \(E_i = n \times p_i\) and the values of \(p_i\) from either Table 11.13 "Ethnic Groups in the Census Year" or Table 11.14 "Sample Data Five Years After the Census Year",

\[
\begin{aligned}
E_1 &= 2500 \times 0.743 = 1857.5 \\
E_2 &= 2500 \times 0.216 = 540 \\
E_3 &= 2500 \times 0.012 = 30 \\
E_4 &= 2500 \times 0.012 = 30 \\
E_5 &= 2500 \times 0.008 = 20 \\
E_6 &= 2500 \times 0.009 = 22.5
\end{aligned}
\]

Table 11.15 Observed and Expected Frequencies Five Years After the Census Year

Ethnicity Assumed Dist. Observed Freq. Expected Freq.
White 0.743 1732 1857.5
Black 0.216 538 540
American-Indian 0.012 32 30
Hispanic 0.012 42 30
Asian 0.008 133 20
Others 0.009 23 22.5

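The value of the test statistic is

\[ \chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(1732 - 1857.5)^2}{1857.5} + \frac{(538 - 540)^2}{540} + \frac{(32 - 30)^2}{30} + \frac{(42 - 30)^2}{30} + \frac{(133 - 20)^2}{20} + \frac{(23 - 22.5)^2}{22.5} = 651.881 \]

Since the random variable takes six values, \(I = 6\). Thus the test statistic follows the chi-square distribution with \(df = 6 - 1 = 5\) degrees of freedom.

Since the test is right-tailed, the critical value is \(\chi^2_{0.01}\). Reading from Figure 12.4 "Critical Values of Chi-Square Distributions", \(\chi^2_{0.01} = 15.086\), so the rejection region is \([15.086, \infty)\).

Since 651.881 > 15.086, the test statistic falls in the rejection region and the decision is to reject \(H_0\). The data provide sufficient evidence, at the 1% level of significance, to conclude that the distribution of ethnic groups in this state five years after the census had changed from that in the census year.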
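The computation in this example can be checked with a short script. The sketch below takes the assumed proportions and observed frequencies from Table 11.14 and the critical value 15.086 from the text; the dictionary layout and names are assumptions made for illustration.

```python
# Assumed census-year proportions and observed sample counts (Table 11.14).
assumed = {
    "White": 0.743,
    "Black": 0.216,
    "American-Indian": 0.012,
    "Hispanic": 0.012,
    "Asian": 0.008,
    "Others": 0.009,
}
observed = {
    "White": 1732,
    "Black": 538,
    "American-Indian": 32,
    "Hispanic": 42,
    "Asian": 133,
    "Others": 23,
}

n = sum(observed.values())

# Expected frequency for each group: E_i = n * p_i.
expected = {group: n * p for group, p in assumed.items()}

# Contribution of each row, (O - E)^2 / E, and the total test statistic.
contributions = {
    group: (observed[group] - expected[group]) ** 2 / expected[group]
    for group in observed
}
chi_square = sum(contributions.values())
df = len(observed) - 1

critical_value = 15.086  # chi-square table value for alpha = 0.01, df = 5
reject = chi_square >= critical_value

for group, value in contributions.items():
    print(f"{group:<16}{value:>10.3f}")
print(f"chi-square = {chi_square:.3f}, df = {df}, reject H0: {reject}")
```

Printing the row contributions shows which categories drive the statistic; here the Asian row accounts for almost all of it.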

Key Takeaway

• The chi-square goodness-of-fit test is a test based on a chi-square statistic. It can be used to evaluate the hypothesis that a sample is taken from a population with an assumed specific probability distribution.

Basic

A data sample is sorted into four categories with an assumed probability distribution.

Factor Levels Assumed Distribution Observed Frequency
1 p1 = 0.1 10
2 p2 = 0.4 35
3 p3 = 0.4 45
4 p4 = 0.1 10

1. Find the size \(n\) of the sample.
2. Find the expected number \(E\) of observations for each level, if the sampled population has a probability distribution as assumed (that is, just use the formula \(E_i = n \times p_i\)).
3. Find the chi-square test statistic \(\chi^2\).
4. Find the number of degrees of freedom of the chi-square test statistic.

A data sample is sorted into five categories with an assumed probability distribution.

Factor Levels Assumed Distribution Observed Frequency
1 p1 = 0.3 23
2 p2 = 0.3 30
3 p3 = 0.2 19
4 p4 = 0.1 8
5 p5 = 0.1 10

1. Find the size \(n\) of the sample.
2. Find the expected number \(E\) of observations for each level, if the sampled population has a probability distribution as assumed (that is, just use the formula \(E_i = n \times p_i\)).
3. Find the chi-square test statistic \(\chi^2\).
4. Find the number of degrees of freedom of the chi-square test statistic.

Applications

Retailers of collectible postage stamps often buy their stamps in large quantities by weight at auctions. The prices the retailers are willing to pay depend on how old the postage stamps are. Many collectible postage stamps at auctions are described by the proportions of stamps issued at various periods in the past. Generally the older the stamps the higher the value. At one particular auction, a lot of collectible stamps is advertised to have the age distribution given in the table provided. A retail buyer took a sample of 73 stamps from the lot and sorted them by age. The results are given in the table provided. Test, at the 5% level of significance, whether there is sufficient evidence in the data to conclude that the age distribution of the lot is different from what was claimed by the seller.

Year Claimed Distribution Observed Frequency
Before 1940 0.10 6
1940 to 1959 0.25 15
1960 to 1979 0.45 30
After 1979 0.20 22

The litter size of Bengal tigers is typically two or three cubs, but it can vary between one and four. Based on long-term observations, the litter size of Bengal tigers in the wild has the distribution given in the table provided. A zoologist believes that Bengal tigers in captivity tend to have different (possibly smaller) litter sizes from those in the wild. To verify this belief, the zoologist searched all data sources and found 316 litter size records of Bengal tigers in captivity. The results are given in the table provided. Test, at the 5% level of significance, whether there is sufficient evidence in the data to conclude that the distribution of litter sizes in captivity differs from that in the wild.

An online shoe retailer sells men’s shoes in sizes 8 to 13. In the past orders for the different shoe sizes have followed the distribution given in the table provided. The management believes that recent marketing efforts may have expanded their customer base and, as a result, there may be a shift in the size distribution for future orders. To have a better understanding of its future sales, the shoe seller examined 1,040 sales records of recent orders and noted the sizes of the shoes ordered. The results are given in the table provided. Test, at the 1% level of significance, whether there is sufficient evidence in the data to conclude that the shoe size distribution of future sales will differ from the historic one.

Shoe Size Past Size Distribution Recent Size Frequency
8.0 0.03 25
8.5 0.06 43
9.0 0.09 88
9.5 0.19 221
10.0 0.23 272
10.5 0.14 150
11.0 0.10 107
11.5 0.06 51
12.0 0.05 37
12.5 0.03 35
13.0 0.02 11

An online shoe retailer sells women’s shoes in sizes 5 to 10. In the past orders for the different shoe sizes have followed the distribution given in the table provided. The management believes that recent marketing efforts may have expanded their customer base and, as a result, there may be a shift in the size distribution for future orders. To have a better understanding of its future sales, the shoe seller examined 1,174 sales records of recent orders and noted the sizes of the shoes ordered. The results are given in the table provided. Test, at the 1% level of significance, whether there is sufficient evidence in the data to conclude that the shoe size distribution of future sales will differ from the historic one.

Shoe Size Past Size Distribution Recent Size Frequency
5.0 0.02 20
5.5 0.03 23
6.0 0.07 88
6.5 0.08 90
7.0 0.20 222
7.5 0.20 258
8.0 0.15 177
8.5 0.11 121
9.0 0.08 91
9.5 0.04 53
10.0 0.02 31

A chess opening is a sequence of moves at the beginning of a chess game. There are many well-studied named openings in chess literature. French Defense is one of the most popular openings for black, although it is considered a relatively weak opening since it gives black probability 0.344 of winning, probability 0.405 of losing, and probability 0.251 of drawing. A chess master believes that he has discovered a new variation of French Defense that may alter the probability distribution of the outcome of the game. In his many Internet chess games in the last two years, he was able to apply the new variation in 77 games. The wins, losses, and draws in the 77 games are given in the table provided. Test, at the 5% level of significance, whether there is sufficient evidence in the data to conclude that the newly discovered variation of French Defense alters the probability distribution of the result of the game.

Result for Black Probability Distribution New Variation Wins
Win 0.344 31
Loss 0.405 25
Draw 0.251 21

The Department of Parks and Wildlife stocks a large lake with fish every six years. It is determined that a healthy diversity of fish in the lake should consist of 10% largemouth bass, 15% smallmouth bass, 10% striped bass, 10% trout, and 20% catfish. Therefore each time the lake is stocked, the fish population in the lake is restored to maintain that particular distribution. Every three years, the department conducts a study to see whether the distribution of the fish in the lake has shifted away from the target proportions. In one particular year, a research group from the department observed a sample of 292 fish from the lake with the results given in the table provided. Test, at the 5% level of significance, whether there is sufficient evidence in the data to conclude that the fish population distribution has shifted since the last stocking.

Fish Target Distribution Fish in Sample
Largemouth Bass 0.10 14
Smallmouth Bass 0.15 49
Striped Bass 0.10 21
Trout 0.10 22
Catfish 0.20 75
Other 0.35 111

Large Data Set Exercise

Large Data Set 4 records the results of 500 tosses of a six-sided die. Test, at the 10% level of significance, whether there is sufficient evidence in the data to conclude that the die is not “fair” (or “balanced”), that is, that the probability distribution differs from probability 1/6 for each of the six faces on the die.

Tests for Independence

Hypothesis tests encountered earlier in the book had to do with how the numerical values of two population parameters compared. In this subsection we will investigate hypotheses that have to do with whether or not two random variables take their values independently, or whether the value of one has a relation to the value of the other. Thus the hypotheses will be expressed in words, not mathematical symbols. We build the discussion around the following example.

There is a theory that the gender of a baby in the womb is related to the baby’s heart rate: baby girls tend to have higher heart rates. Suppose we wish to test this theory. We examine the heart rate records of 40 babies taken during their mothers’ last prenatal checkups before delivery, and for each of these 40 randomly selected records we record the values of two random measures: 1) gender and 2) heart rate. In this context these two random measures are often called factors, that is, variables with several qualitative levels. Since the burden of proof is that heart rate and gender are related, not that they are unrelated, the problem of testing the theory on baby gender and heart rate can be formulated as a test of the following hypotheses:

\(H_0\): Baby gender and baby heart rate are independent vs. \(H_a\): Baby gender and baby heart rate are not independent

The factor gender has two natural categories or levels: boy and girl. We divide the second factor, heart rate, into two levels, low and high, by choosing some heart rate, say 145 beats per minute, as the cutoff between them. A heart rate below 145 beats per minute will be considered low and 145 and above considered high. The 40 records give rise to a \(2 \times 2\) contingency table. By adjoining row totals, column totals, and a grand total we obtain the table shown as Table 11.1 "Baby Gender and Heart Rate". The four entries in the body of the table are counts of observations from the sample of \(n = 40\). There were 11 girls with low heart rate, 17 boys with low heart rate, and so on. They form the core of the expanded table.

Table 11.1 Baby Gender and Heart Rate

Gender          Heart Rate: Low   Heart Rate: High   Row Total
Girl            11                7                  18
Boy             17                5                  22
Column Total    28                12                 Total = 40

In analogy with the fact that the probability of independent events is the product of the probabilities of each event, if heart rate and gender were independent then we would expect the number in each core cell to be close to the product of the row total R and column total C of the row and column containing it, divided by the sample size n. Denoting such an expected number of observations E, these four expected values are:

• 1st row and 1st column: \(E = (R \times C) / n = 18 \times 28 / 40 = 12.6\)
• 1st row and 2nd column: \(E = (R \times C) / n = 18 \times 12 / 40 = 5.4\)
• 2nd row and 1st column: \(E = (R \times C) / n = 22 \times 28 / 40 = 15.4\)
• 2nd row and 2nd column: \(E = (R \times C) / n = 22 \times 12 / 40 = 6.6\)

We update Table 11.1 "Baby Gender and Heart Rate" by placing each expected value in its corresponding core cell, right under the observed value in the cell. This gives the updated table Table 11.2 "Updated Baby Gender and Heart Rate".

Table 11.2 Updated Baby Gender and Heart Rate

Gender          Heart Rate: Low       Heart Rate: High     Row Total
Girl            O = 11, E = 12.6      O = 7, E = 5.4       R = 18
Boy             O = 17, E = 15.4      O = 5, E = 6.6       R = 22
Column Total    C = 28                C = 12               n = 40

A measure of how much the data deviate from what we would expect to see if the factors really were independent is the sum of the squares of the differences between the observed and expected numbers in each core cell, or, standardizing by dividing each square by the expected number in the cell, the sum \(\sum (O - E)^2 / E\). We would reject the null hypothesis that the factors are independent only if this number is large, so the test is right-tailed. In this example the random variable \(\sum (O - E)^2 / E\) has the chi-square distribution with one degree of freedom. If we had decided at the outset to test at the 10% level of significance, the critical value defining the rejection region would be, reading from Figure 12.4 "Critical Values of Chi-Square Distributions", \(\chi^2_{\alpha} = \chi^2_{0.10} = 2.706\), so that the rejection region would be the interval \([2.706, \infty)\). When we compute the value of the standardized test statistic we obtain

\[ \sum \frac{(O - E)^2}{E} = \frac{(11 - 12.6)^2}{12.6} + \frac{(7 - 5.4)^2}{5.4} + \frac{(17 - 15.4)^2}{15.4} + \frac{(5 - 6.6)^2}{6.6} = 1.231 \]

Since 1.231 < 2.706, the decision is not to reject \(H_0\). See Figure 11.3 "Baby Gender Prediction". The data do not provide sufficient evidence, at the 10% level of significance, to conclude that heart rate and gender are related.
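The expected values and the statistic for the baby example can be wrapped in a small reusable function. This is a sketch only: `independence_statistic` is a hypothetical helper name, and the critical value 2.706 is the one read from the table in the text.

```python
def independence_statistic(table):
    """Return (expected counts, chi-square statistic, degrees of freedom)
    for a contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: E = R * C / n for every core cell.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df


# Table 11.1: rows are Girl, Boy; columns are Low, High heart rate.
baby_table = [
    [11, 7],
    [17, 5],
]
expected, chi_square, df = independence_statistic(baby_table)

critical_value = 2.706  # chi-square table value for alpha = 0.10, df = 1
reject = chi_square >= critical_value

print(expected)
print(f"chi-square = {chi_square:.3f}, df = {df}, reject H0: {reject}")
```

The same function applies unchanged to any I × J table, since the row totals, column totals, and degrees of freedom are derived from the shape of the input.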

Figure 11.3 Baby Gender Prediction

With this specific example in mind, now turn to the general situation. In the general setting of testing the independence of two factors, call them Factor 1 and Factor 2, the hypotheses to be tested are

\(H_0\): The two factors are independent vs. \(H_a\): The two factors are not independent

As in the example each factor is divided into a number of categories or levels. These could arise naturally, as in the boy-girl division of gender, or somewhat arbitrarily, as in the high-low division of heart rate. Suppose Factor 1 has \(I\) levels and Factor 2 has \(J\) levels. Then the information from a random sample gives rise to a general \(I \times J\) contingency table, which with row totals, column totals, and a grand total would appear as shown in Table 11.3 "General Contingency Table". Each cell may be labeled by a pair of indices \((i, j)\). \(O_{ij}\) stands for the observed count of observations in the cell in row \(i\) and column \(j\), \(R_i\) for the \(i\)th row total and \(C_j\) for the \(j\)th column total. To simplify the notation we will drop the indices so Table 11.3 "General Contingency Table" becomes Table 11.4 "Simplified General Contingency Table". Nevertheless it is important to keep in mind that the Os, the Rs and the Cs, though denoted by the same symbols, are in fact different numbers.

Table 11.3 General Contingency Table

                     Factor 2 Levels
                     1     · · ·   j     · · ·   J     Row Total
Factor 1 Levels  1   O11   · · ·   O1j   · · ·   O1J   R1
                 i   Oi1   · · ·   Oij   · · ·   OiJ   Ri
                 I   OI1   · · ·   OIj   · · ·   OIJ   RI
Column Total         C1    · · ·   Cj    · · ·   CJ    n

Table 11.4 Simplified General Contingency Table

Factor 2 Levels
1 · · · j · · · J Row Total
Factor 1 Levels 1 O · · · O · · · O R
i O · · · O · · · O R
I O · · · O · · · O R
Column Total C · · · C · · · C n

As in the example, for each core cell in the table we compute what would be the expected number \(E\) of observations if the two factors were independent. \(E\) is computed for each core cell (each cell with an O in it) of Table 11.4 "Simplified General Contingency Table" by the rule applied in the example:

\[ E = \frac{R \times C}{n} \]

where \(R\) is the row total and \(C\) is the column total corresponding to the cell, and \(n\) is the sample size.

After the expected number is computed for every cell, Table 11.4 "Simplified General Contingency Table" is updated to form Table 11.5 "Updated General Contingency Table" by inserting the computed value of E into each core cell.

Table 11.5 Updated General Contingency Table

Factor 2 Levels
1 · · · j · · · J Row Total
Factor 1 Levels 1 O E · · · O E · · · O E R
i O E · · · O E · · · O E R
I O E · · · O E · · · O E R
Column Total C · · · C · · · C n

Here is the test statistic for the general hypothesis based on Table 11.5 "Updated General Contingency Table", together with the conditions under which it follows a chi-square distribution.

Test Statistic for Testing the Independence of Two Factors

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where the sum is over all core cells of the table.

If

1. the two study factors are independent, and
2. the observed count O of each cell in Table 11.5 "Updated General Contingency Table" is at least 5,

then $\chi^2$ approximately follows a chi-square distribution with df = (I − 1) × (J − 1) degrees of freedom.
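As an illustration of how the statistic and its degrees of freedom come out of an I × J table, here is a short Python sketch. It assumes the statistic is the sum of (O − E)²/E over the core cells with E = (R × C) / n; the function name `chi_square_independence` and the 2 × 3 table are hypothetical, not taken from the text.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for an I x J table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            statistic += (o - e) ** 2 / e           # contribution of this core cell
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical 2 x 3 table; every observed count is at least 5.
observed = [[20, 30, 50],
            [30, 30, 40]]
statistic, df = chi_square_independence(observed)
print(f"chi-square = {statistic:.3f}, df = {df}")
```

The sketch does not check the condition that every observed count is at least 5; that check is left to the reader.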

The same five-step procedures, either the critical value approach or the p-value approach, that were introduced in Section 8.1 "The Elements of Hypothesis Testing" and Section 8.3 "The Observed Significance of a Test" of Chapter 8 "Testing Hypotheses" are used to perform the test, which is always right-tailed.

Example 1

A researcher wishes to investigate whether students’ scores on a college entrance examination (CEE) have any indicative power for future college performance as measured by GPA. In other words, he wishes to investigate whether the factors CEE and GPA are independent or not. He randomly selects n = 100 students in a college and notes each student’s score on the entrance examination and his grade point average at the end of the sophomore year. He divides entrance exam scores into two levels and grade point averages into three levels. Sorting the data according to these divisions, he forms the contingency table shown as Table 11.6 "CEE versus GPA Contingency Table", in which the row and column totals have already been computed.

Table 11.6 CEE versus GPA Contingency Table

| CEE | GPA < 2.7 | GPA 2.7 to 3.2 | GPA > 3.2 | Row Total |
|---|---|---|---|---|
| < 1800 | 35 | 12 | 5 | 52 |
| ≥ 1800 | 6 | 24 | 18 | 48 |
| Column Total | 41 | 36 | 23 | Total = 100 |

Test, at the 1% level of significance, whether these data provide sufficient evidence to conclude that CEE scores indicate future performance levels of incoming college freshmen as measured by GPA.

We perform the test using the critical value approach, following the usual five-step method outlined at the end of Section 8.1 "The Elements of Hypothesis Testing" in Chapter 8 "Testing Hypotheses".

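Step 1. The hypotheses are

H0: CEE and GPA are independent factors
vs. Ha: CEE and GPA are not independent factors, at α = 0.01

Step 2. The distribution is chi-square.

Step 3. To compute the value of the test statistic we must first compute the expected number for each of the six core cells (the cells that hold observed counts rather than totals):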

• 1st row and 1st column: E = (R × C) / n = (52 × 41) / 100 = 21.32
• 1st row and 2nd column: E = (R × C) / n = (52 × 36) / 100 = 18.72
• 1st row and 3rd column: E = (R × C) / n = (52 × 23) / 100 = 11.96
• 2nd row and 1st column: E = (R × C) / n = (48 × 41) / 100 = 19.68
• 2nd row and 2nd column: E = (R × C) / n = (48 × 36) / 100 = 17.28
• 2nd row and 3rd column: E = (R × C) / n = (48 × 23) / 100 = 11.04

Table 11.7 Updated CEE versus GPA Contingency Table

| CEE | GPA < 2.7 | GPA 2.7 to 3.2 | GPA > 3.2 | Row Total |
|---|---|---|---|---|
| < 1800 | O = 35, E = 21.32 | O = 12, E = 18.72 | O = 5, E = 11.96 | R = 52 |
| ≥ 1800 | O = 6, E = 19.68 | O = 24, E = 17.28 | O = 18, E = 11.04 | R = 48 |
| Column Total | C = 41 | C = 36 | C = 23 | n = 100 |
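With the expected counts in place, the test statistic is the sum of (O − E)²/E over the six core cells of Table 11.7. The arithmetic below is a worked check using the table entries, with each term rounded to three decimal places:

$$
\begin{aligned}
\chi^2 &= \sum \frac{(O - E)^2}{E} \\
&= \frac{(35 - 21.32)^2}{21.32} + \frac{(12 - 18.72)^2}{18.72} + \frac{(5 - 11.96)^2}{11.96} + \frac{(6 - 19.68)^2}{19.68} + \frac{(24 - 17.28)^2}{17.28} + \frac{(18 - 11.04)^2}{11.04} \\
&\approx 8.778 + 2.412 + 4.050 + 9.509 + 2.613 + 4.388 \\
&\approx 31.75
\end{aligned}
$$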

Step 4. Since the CEE factor has two levels and the GPA factor has three, I = 2 and J = 3. Thus the test statistic follows the chi-square distribution with df = (2 − 1) × (3 − 1) = 2 degrees of freedom.

Since the test is right-tailed, the critical value is $\chi^2_{0.01}$. Reading from Figure 12.4 "Critical Values of Chi-Square Distributions", $\chi^2_{0.01} = 9.210$, so the rejection region is [9.210, ∞).
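The comparison against the rejection region can be sketched in Python using only the standard library. The sketch relies on the fact that, for df = 2 only, the right-tail probability of the chi-square distribution is P(χ² ≥ x) = exp(−x/2), so the critical value has the closed form −2 ln α. For other degrees of freedom this shortcut does not apply and a table or a statistics library is needed. The observed and expected counts are those of Table 11.7; the variable names are illustrative.

```python
import math

# Observed and expected counts from Table 11.7, listed row by row.
observed = [35, 12, 5, 6, 24, 18]
expected = [21.32, 18.72, 11.96, 19.68, 17.28, 11.04]

# Chi-square statistic: sum of (O - E)^2 / E over the six core cells.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

alpha = 0.01
# Valid for df = 2 only: the right-tail probability is exp(-x / 2),
# so the critical value solves exp(-x / 2) = alpha.
critical_value = -2.0 * math.log(alpha)
p_value = math.exp(-statistic / 2.0)

# Right-tailed test: reject H0 when the statistic lies in [critical_value, infinity).
reject_h0 = statistic >= critical_value
print(f"statistic = {statistic:.2f}")
print(f"critical value = {critical_value:.3f}")
print(f"reject H0: {reject_h0}")
```

Under these assumptions the statistic, about 31.75, falls well inside the rejection region [9.210, ∞), so H0 would be rejected at the 1% level. The data therefore provide sufficient evidence to conclude that CEE score and GPA are not independent.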

Key Takeaways

• Critical values of a chi-square distribution with degrees of freedom df are found in Figure 12.4 "Critical Values of Chi-Square Distributions".
• A chi-square test can be used to evaluate the hypothesis that two random variables or factors are independent.

Basic

Find $\chi^2_{0.01}$ for each of the following numbers of degrees of freedom.

Find $\chi^2_{0.05}$ for each of the following numbers of degrees of freedom.

Find $\chi^2_{0.10}$ for each of the following numbers of degrees of freedom.

Find $\chi^2_{0.01}$ for each of the following numbers of degrees of freedom.

For df = 7 and α = 0.05, find

For df = 17 and α = 0.01, find

A data sample is sorted into a 2 × 2 contingency table based on two factors, each of which has two levels.

1. Find the column totals, the row totals, and the grand total, n, of the table.
2. Find the expected number E of observations for each cell based on the assumption that the two factors are independent (that is, just use the formula E = (R × C) / n).
3. Find the value of the chi-square test statistic $\chi^2$.
4. Find the number of degrees of freedom of the chi-square test statistic.

A data sample is sorted into a 3 × 2 contingency table based on two factors, one of which has three levels and the other of which has two levels.

Rows are Factor 2 levels; columns are Factor 1 levels.

| Factor 2 | Factor 1 Level 1 | Factor 1 Level 2 | Row Total |
|---|---|---|---|
| Level 1 | 20 | 10 | R |
| Level 2 | 15 | 5 | R |
| Level 3 | 10 | 20 | R |
| Column Total | C | C | n |

1. Find the column totals, the row totals, and the grand total, n, of the table.
2. Find the expected number E of observations for each cell based on the assumption that the two factors are independent (that is, just use the formula E = (R × C) / n).
3. Find the value of the chi-square test statistic $\chi^2$.
4. Find the number of degrees of freedom of the chi-square test statistic.

Applications

A child psychologist believes that children perform better on tests when they are given perceived freedom of choice. To test this belief, the psychologist carried out an experiment in which 200 third graders were randomly assigned to two groups, A and B. Each child was given the same simple logic test. However, in group B, each child was given the freedom to choose a test booklet from many with various drawings on the covers. The performance of each child was rated as Very Good, Good, or Fair. The results are summarized in the table provided. Test, at the 5% level of significance, whether there is sufficient evidence in the data to support the psychologist’s belief.

In regard to wine tasting competitions, many experts claim that the first glass of wine served sets a reference taste and that a different reference wine may alter the relative ranking of the other wines in competition. To test this claim, three wines, A, B and C, were served at a wine tasting event. Each person was served a single glass of each wine, but in different orders for different guests. At the close, each person was asked to name the best of the three. One hundred seventy-two people were at the event and their top picks are given in the table provided. Test, at the 1% level of significance, whether there is sufficient evidence in the data to support the claim that wine experts’ preference is dependent on the first served wine.

Is being left-handed hereditary? To answer this question, 250 adults are randomly selected and their handedness and their parents’ handedness are noted. The results are summarized in the table provided. Test, at the 1% level of significance, whether there is sufficient evidence in the data to conclude that there is a hereditary element in handedness.

Some geneticists claim that the genes that determine left-handedness also govern development of the language centers of the brain. If this claim is true, then it would be reasonable to expect that left-handed people tend to have stronger language abilities. A study designed to test this claim randomly selected 807 students who took the Graduate Record Examination (GRE). Their scores on the language portion of the examination were classified into three categories: low, average, and high, and their handedness was also noted. The results are given in the table provided. Test, at the 5% level of significance, whether there is sufficient evidence in the data to conclude that left-handed people tend to have stronger language abilities.

It is generally believed that children brought up in stable families tend to do well in school. To verify such a belief, a social scientist examined 290 randomly selected students’ records in a public high school and noted each student’s family structure and academic status four years after entering high school. The data were then sorted into a 2 × 3 contingency table with two factors. Factor 1 has two levels: graduated and did not graduate. Factor 2 has three levels: no parent, one parent, and two parents. The results are given in the table provided. Test, at the 1% level of significance, whether there is sufficient evidence in the data to conclude that family structure matters in school performance of the students.

| Family | Graduated | Did not graduate |
|---|---|---|
| No parent | 18 | 31 |
| One parent | 101 | 44 |
| Two parents | 70 | 26 |

An administrator at a large middle school wishes to use celebrity influence to encourage students to make healthier choices in the school cafeteria. The cafeteria is situated at the center of an open space. Every day at lunchtime students get their lunch and a drink in three separate lines leading to three separate serving stations. As an experiment, the school administrator displayed a poster of a popular teen pop star drinking milk at each of the three areas where drinks are provided, except that the milk in the poster is different at each location: one shows white milk, one shows strawberry-flavored pink milk, and one shows chocolate milk. After the first day of the experiment the administrator noted the students’ milk choices separately for the three lines. The data are given in the table provided. Test, at the 1% level of significance, whether there is sufficient evidence in the data to conclude that the posters had some impact on the students’ drink choices.

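Rows are the poster shown; columns are the student's choice.

| Poster Choice | Regular | Strawberry | Chocolate |
|---|---|---|---|
| Regular | 38 | 28 | 40 |
| Strawberry | 18 | 51 | 24 |
| Chocolate | 32 | 32 | 53 |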

Large Data Set Exercise

Large Data Set 8 records the result of a survey of 300 randomly selected adults who go to movie theaters regularly. For each person the gender and preferred type of movie were recorded. Test, at the 5% level of significance, whether there is sufficient evidence in the data to conclude that the factors “gender” and “preferred type of movie” are dependent.

Basic exercises for lognormal distribution

This post presents exercises on the lognormal distribution. These exercises are to reinforce the basic properties discussed in this companion blog post.

Exercise 1
Let be a normal random variable with mean 6.5 and standard deviation 0.8. Consider the random variable . What is the probability ?

Exercise 2
Suppose follows a lognormal distribution with parameters and . Let . Determine the following:

• The probability that exceeds 1.
• The 40th percentile of .
• The 80th percentile of .

Exercise 3
Suppose follows a lognormal distribution with parameters and . Compute the mean, second moment, variance, third moment, and fourth moment.

Exercise 4
Let be the same lognormal distribution as in Exercise 3. Use the results in Exercise 3 to compute the coefficient of variation, coefficient of skewness and the kurtosis.

Exercise 5
Given the following facts about a lognormal distribution:

• The lower quartile (i.e. the 25th percentile) is 1000.
• The upper quartile (i.e. the 75th percentile) is 4000.

Determine the mean and variance of the given lognormal distribution.

Exercise 6
Suppose that a random variable follows a lognormal distribution with mean 149.157 and variance 223.5945. Determine the probability .

Exercise 7
Suppose that a random variable follows a lognormal distribution with mean 1200 and median 1000. Determine the probability .

Exercise 8
Customers of a very popular restaurant usually have to wait in line for a table. Suppose that the wait time (in minutes) for a table follows a lognormal distribution with parameters and . Concerned about long wait times, the restaurant owner shortens the wait by expanding the facility and hiring more staff. As a result, the wait time for a table is cut in half. After the restaurant expansion,