Summaries

Session 1.1

19th August 2009

Sampling a Simple Population

We use random sampling to estimate an empirical model of a population. We check the empirical model by direct inspection of the population. We repeat sampling with replacement, obtaining multiple random samples from the same population by the same process. We combine (pool) compatible samples to form larger samples. Pooling samples of size 50, we obtain samples of size 100, 150 and 300. In general, as sample size increases, samples become more precise and reliable, provided that the sampling process is reliable.

Random sampling is the basis for obtaining information in statistical activities. Sampling is necessary, but it is also tedious, time-consuming and expensive. Random sampling incorporates reliability, precision and uncertainty.

Session Overview

In this session, we begin the study of probability. We begin with a very basic example of a population, and explore the process of sampling a population.

We examine two modes of sampling a population: census (total enumeration), in which every member of the population is examined; and random sampling with replacement (SRS/WR), in which single members are repeatedly selected from the population. One practical reason why we would want a sampling process is that we wish to estimate some property of the population. Total enumeration allows a definitive settling of the question, and random sampling allows an approximate answer. In most practical settings, the populations of interest are too difficult to totally enumerate – the population is too large, or too complex, or cannot be accessed in total. In practical applications, it is sufficient (and usually necessary) to use a suitable random sample in lieu of the total population.

In our first case, we begin with a color bowl whose true color frequencies are not known. We obtain six (6) random samples, each consisting of 50 draws with replacement (SRS/WR). We then compute sample color frequencies in order to estimate the population color frequencies, and then we check the estimates against the true structure of the bowl.

We then explore a bit of decision theory by playing with Ellsberg’s Urns.

Prediction and Probabilistic Randomness: Predicting the Behavior of a Six-sided Die

Samples – Face Values and Predictions

You should be able to begin with the counts in the table and work out the proportions and percentages.

6.30 Samples

	Samples			Samples			Pooled
	#1			#2			12
Face Value	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
1	9	0.18	18	8	0.16	16	17	0.17	17
2	7	0.14	14	10	0.2	20	17	0.17	17
3	8	0.16	16	9	0.18	18	17	0.17	17
4	7	0.14	14	6	0.12	12	13	0.13	13
5	10	0.2	20	11	0.22	22	21	0.21	21
6	9	0.18	18	6	0.12	12	15	0.15	15
Total	50	1	100	50	1	100	100	1	100
Prediction
Hit	11	0.22	22	8	0.16	16	19	0.19	19
Miss	39	0.78	78	42	0.84	84	81	0.81	81
Total	50	1	100	50	1	100	100	1	100

	Samples			Samples			Pooled
	#3			#4			34
Face Value	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
1	9	0.18	18	9	0.18	18	18	0.18	18
2	7	0.14	14	8	0.16	16	15	0.15	15
3	7	0.14	14	11	0.22	22	18	0.18	18
4	6	0.12	12	5	0.1	10	11	0.11	11
5	12	0.24	24	10	0.2	20	22	0.22	22
6	9	0.18	18	7	0.14	14	16	0.16	16
Total	50	1	100	50	1	100	100	1	100
Prediction
Hit	8	0.16	16	13	0.26	26	21	0.21	21
Miss	42	0.84	84	37	0.74	74	79	0.79	79
Total	50	1	100	50	1	100	100	1	100

	Samples			Samples			Pooled
	#5			#6			56
Face Value	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
1	9	0.18	18	6	0.12	12	15	0.15	15
2	6	0.12	12	11	0.22	22	17	0.17	17
3	5	0.1	10	6	0.12	12	11	0.11	11
4	7	0.14	14	12	0.24	24	19	0.19	19
5	9	0.18	18	7	0.14	14	16	0.16	16
6	14	0.28	28	8	0.16	16	22	0.22	22
Total	50	1	100	50	1	100	100	1	100
Prediction
Hit	10	0.2	20	12	0.24	24	22	0.22	22
Miss	40	0.8	80	38	0.76	76	78	0.78	78
Total	50	1	100	50	1	100	100	1	100


	Pooled			Pooled			Pooled
	135			246			All
Face Value	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
1	27	0.18	18.00	23	0.15	15.33	50	0.17	16.67
2	20	0.13	13.33	29	0.19	19.33	49	0.16	16.33
3	20	0.13	13.33	26	0.17	17.33	46	0.15	15.33
4	20	0.13	13.33	23	0.15	15.33	43	0.14	14.33
5	31	0.21	20.67	28	0.19	18.67	59	0.20	19.67
6	32	0.21	21.33	21	0.14	14.00	53	0.18	17.67
Total	150	1.00	100.00	150	1.00	100.00	300	1.00	100.00
Prediction
Hit	29	0.19	19.33	33	0.22	22.00	62	0.21	20.67
Miss	121	0.81	80.67	117	0.78	78.00	238	0.79	79.33
Total	150	1	100	150	1.00	100.00	300	1.00	100.00

8.00 Samples

	Samples			Samples			Pooled
	#1			#2			12
Face Value	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
1	9	0.18	18	7	0.14	14	16	0.16	16
2	11	0.22	22	11	0.22	22	22	0.22	22
3	4	0.08	8	9	0.18	18	13	0.13	13
4	9	0.18	18	3	0.06	6	12	0.12	12
5	10	0.2	20	10	0.2	20	20	0.2	20
6	7	0.14	14	10	0.2	20	17	0.17	17
Total	50	1	100	50	1	100	100	1	100
Prediction
Hit	5	0.1	10	6	0.12	12	11	0.11	11
Miss	45	0.9	90	44	0.88	88	89	0.89	89
Total	50	1	100	50	1	100	100	1	100

	Samples			Samples			Pooled
	#3			#4			34
Face Value	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
1	8	0.16	16	10	0.2	20	18	0.18	18
2	13	0.26	26	12	0.24	24	25	0.25	25
3	11	0.22	22	7	0.14	14	18	0.18	18
4	5	0.1	10	13	0.26	26	18	0.18	18
5	6	0.12	12	3	0.06	6	9	0.09	9
6	7	0.14	14	5	0.1	10	12	0.12	12
Total	50	1	100	50	1	100	100	1	100
Prediction
Hit	11	0.22	22	13	0.26	26	24	0.24	24
Miss	39	0.78	78	37	0.74	74	76	0.76	76
Total	50	1	100	50	1	100	100	1	100

	Samples			Samples			Pooled
	#5			#6			56
Face Value	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
1	9	0.18	18	6	0.12	12	15	0.15	15
2	7	0.14	14	11	0.22	22	18	0.18	18
3	9	0.18	18	9	0.18	18	18	0.18	18
4	7	0.14	14	8	0.16	16	15	0.15	15
5	4	0.08	8	10	0.2	20	14	0.14	14
6	14	0.28	28	6	0.12	12	20	0.2	20
Total	50	1	100	50	1	100	100	1	100
Prediction
Hit	9	0.18	18	7	0.14	14	16	0.16	16
Miss	41	0.82	82	43	0.86	86	84	0.84	84
Total	50	1	100	50	1	100	100	1	100


	Pooled			Pooled			Pooled
	135			246			All
Face Value	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
1	26	0.17	17.33	23	0.15	15.33	49	0.16	16.33
2	31	0.21	20.67	34	0.23	22.67	65	0.22	21.67
3	24	0.16	16.00	25	0.17	16.67	49	0.16	16.33
4	21	0.14	14.00	24	0.16	16.00	45	0.15	15.00
5	20	0.13	13.33	23	0.15	15.33	43	0.14	14.33
6	28	0.19	18.67	21	0.14	14.00	49	0.16	16.33
Total	150	1.00	100.00	150	1.00	100.00	300	1.00	100.00
Prediction
Hit	25	0.17	16.67	26	0.17	17.33	51	0.17	17.00
Miss	125	0.83	83.33	124	0.83	82.67	249	0.83	83.00
Total	150	1.00	100.00	150	1.00	100.00	300	1.00	100.00
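
The pooled columns in the tables above come from adding counts face by face across compatible samples. As a minimal sketch in Python (the function name pool_counts is our own illustrative choice, not part of the session materials), the 6.30 samples #1 and #2 can be pooled like this:

```python
from collections import Counter

def pool_counts(*samples):
    """Pool compatible samples by adding their counts class by class."""
    pooled = Counter()
    for sample in samples:
        pooled.update(sample)  # Counter.update adds counts, it does not overwrite
    return dict(pooled)

# Face-value counts for the 6.30 samples #1 and #2, copied from the tables above.
sample_1 = {1: 9, 2: 7, 3: 8, 4: 7, 5: 10, 6: 9}
sample_2 = {1: 8, 2: 10, 3: 9, 4: 6, 5: 11, 6: 6}

pooled_12 = pool_counts(sample_1, sample_2)
n_total = sum(pooled_12.values())  # 50 + 50 tosses

# Print face value, pooled count, proportion and percent.
for face in sorted(pooled_12):
    count = pooled_12[face]
    print(face, count, round(count / n_total, 2), round(100 * count / n_total))
```

Passing three samples to the same function gives the size-150 pools, and passing all six gives the size-300 pool.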

In the fair die model for this case, in long runs of tosses of the die, approximately 16⅔% of tosses show each of the six face values (“1” through “6”). The sample data are generally compatible with the fair die assumption (equally likely face values) and with a baseline expected prediction success rate of 1/6, or 16⅔%. Sample performance seems to improve with increasing sample size, but the samples do not exactly fit the fair model.

Sample versus Fair Model

6.30

Face Value 1: 16.7% versus 16.7%

Face Value 2: 16.3% versus 16.7%

Face Value 3: 15.3% versus 16.7%

Face Value 4: 14.3% versus 16.7%

Face Value 5: 19.7% versus 16.7%

Face Value 6: 17.7% versus 16.7%

Prediction “Hit”: 20.7% versus 16.7%

Prediction “Miss”: 79.3% versus 83.3%

8.00

Face Value 1: 16.3% versus 16.7%

Face Value 2: 21.7% versus 16.7%

Face Value 3: 16.3% versus 16.7%

Face Value 4: 15.0% versus 16.7%

Face Value 5: 14.3% versus 16.7%

Face Value 6: 16.3% versus 16.7%

Prediction “Hit”: 17.0% versus 16.7%

Prediction “Miss”: 83.0% versus 83.3%
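
The comparison above can be reproduced from the pooled counts. The following Python sketch is one way to do it; the helper name percent_table and the layout of the output are our own assumptions, and the counts are copied from the "Pooled All" columns of the die tables.

```python
# Pooled counts over all six samples (300 tosses per session).
pooled_all = {
    "6.30": {1: 50, 2: 49, 3: 46, 4: 43, 5: 59, 6: 53},
    "8.00": {1: 49, 2: 65, 3: 49, 4: 45, 5: 43, 6: 49},
}
hits = {"6.30": 62, "8.00": 51}  # pooled prediction hits per session

FAIR_PCT = 100 / 6  # fair-die model: about 16.7% per face and per prediction hit

def percent_table(counts):
    """Return {class: percent of sample}, rounded to one decimal place."""
    n_total = sum(counts.values())
    return {k: round(100 * c / n_total, 1) for k, c in counts.items()}

for session, counts in pooled_all.items():
    for face, pct in percent_table(counts).items():
        diff = pct - FAIR_PCT
        print(f"{session} face {face}: {pct:.1f}% versus {FAIR_PCT:.1f}% ({diff:+.1f})")
    hit_pct = round(100 * hits[session] / sum(counts.values()), 1)
    print(f"{session} hit: {hit_pct:.1f}% versus {FAIR_PCT:.1f}%")
```

The signed difference in the last column of each printed line makes it easy to spot which faces run above or below the fair model.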

Case Study 1.1: A Color Bowl

In random sampling, we might not get a complete list of colors; only a total sample (census) guarantees that kind of listing. The sample proportions of each listed color approximate the corresponding model proportion in the bowl itself. In census sampling, every object in the bowl is counted. The listing is complete, and the model proportions may be calculated directly.

The basic idea in case study 1.1 is that random samples give imperfect pictures of what is being sampled. However, with sufficiently large samples, these samples can reliably yield good pictures of the processes or populations being sampled. And the essence of many statistical applications is the study of selected processes or populations. For a sense of the efficiency of the samples, compare sample and true percentages.

Some Formulas – Proportions, Percentages, Counts

The class represents some property or attribute, for example, blue, or red. Each member, or unit, of a sample can be classified – the result of the classification of the unit is the unit’s class.

Sample Proportion (p)

n_class ~ number of units of sample in class

n_total ~ total number of units in sample

p_class = n_class / n_total

p_class ~ proportion of sample in class

Sample Percent (pct)

n_class ~ number of units of sample in class

n_total ~ total number of units in sample

p_class = n_class / n_total

pct_class = 100*(n_class / n_total)

pct_class = 100* p_class

pct_class ~ percent of sample in class

Population Proportion (P)

N_class ~ number of units of population in class

N_total ~ total number of units in population

P_class = N_class / N_total

P_class ~ proportion of population in class

Population Percent (PCT)

N_class ~ number of units of population in class

N_total ~ total number of units in population

P_class = N_class / N_total

PCT_class = 100*(N_class / N_total)

PCT_class = 100* P_class

PCT_class ~ percent of population in class

In this setting,

n_blue ~ number of blue draws in sample

n_total ~ total number of draws per sample

p_blue = n_blue / n_total

p_blue ~ proportion of sample draws showing blue

pct_blue = 100*p_blue

pct_blue ~ percent of sample draws showing blue

N_blue ~ number of blue marbles in bowl

N_total ~ total number of marbles in bowl

P_blue = N_blue / N_total

P_blue ~ proportion of marbles in bowl that are blue

n_green ~ number of green draws in sample

n_total ~ total number of draws per sample

p_green = n_green / n_total

p_green ~ proportion of sample draws showing green

pct_green = 100*p_green

pct_green ~ percent of sample draws showing green

N_green ~ number of green marbles in bowl

N_total ~ total number of marbles in bowl

P_green = N_green / N_total

P_green ~ proportion of marbles in bowl that are green

n_red ~ number of red draws in sample

n_total ~ total number of draws per sample

p_red = n_red / n_total

p_red ~ proportion of sample draws showing red

pct_red = 100*p_red

pct_red ~ percent of sample draws showing red

N_red ~ number of red marbles in bowl

N_total ~ total number of marbles in bowl

P_red = N_red / N_total

P_red ~ proportion of marbles in bowl that are red

n_yellow ~ number of yellow draws in sample

n_total ~ total number of draws per sample

p_yellow = n_yellow / n_total

p_yellow ~ proportion of sample draws showing yellow

pct_yellow = 100*p_yellow

pct_yellow ~ percent of sample draws showing yellow

N_yellow ~ number of yellow marbles in bowl

N_total ~ total number of marbles in bowl

P_yellow = N_yellow / N_total

P_yellow ~ proportion of marbles in bowl that are yellow
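
As a minimal sketch, the formulas in this section can be written as two small Python functions. The function names are our own; the sample counts are those of the 6.30 bowl sample #1, and the population counts are those of the census of the bowl, both taken from the tables in this summary.

```python
def proportion(n_class, n_total):
    """p_class = n_class / n_total (the same form gives P_class for a population)."""
    return n_class / n_total

def percent(n_class, n_total):
    """pct_class = 100 * (n_class / n_total)."""
    return 100 * proportion(n_class, n_total)

# Sample: the 6.30 bowl sample #1, 50 draws with replacement.
sample_counts = {"Blue": 4, "Green": 2, "Red": 18, "Yellow": 26}
n_total = sum(sample_counts.values())

# Population: census of the bowl, 22 marbles.
bowl_counts = {"Blue": 3, "Green": 1, "Red": 9, "Yellow": 9}
N_total = sum(bowl_counts.values())

for color in sample_counts:
    p = proportion(sample_counts[color], n_total)   # sample proportion
    P = proportion(bowl_counts[color], N_total)     # population proportion
    print(f"{color}: p = {p:.2f} ({100 * p:.0f}%), P = {P:.4f} ({100 * P:.2f}%)")
```

The denominator is always the total (n_total or N_total), never the class count itself.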

Samples – Bowl

6.30 Samples

	#1			#2			Pooled 12
Color	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
Blue	4	0.08	8	7	0.14	14	11	0.11	11
Green	2	0.04	4	3	0.06	6	5	0.05	5
Red	18	0.36	36	22	0.44	44	40	0.4	40
Yellow	26	0.52	52	18	0.36	36	44	0.44	44
Total	50	1	100	50	1	100	100	1	100

	#3			#4			Pooled 34
Color	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
Blue	8	0.16	16	9	0.18	18	17	0.17	17
Green	4	0.08	8	0	0	0	4	0.04	4
Red	16	0.32	32	22	0.44	44	38	0.38	38
Yellow	22	0.44	44	19	0.38	38	41	0.41	41
Total	50	1	100	50	1	100	100	1	100

	#5			#6			Pooled 56
Color	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
Blue	6	0.12	12	9	0.18	18	15	0.15	15
Green	3	0.06	6	7	0.14	14	10	0.1	10
Red	21	0.42	42	17	0.34	34	38	0.38	38
Yellow	20	0.4	40	17	0.34	34	37	0.37	37
Total	50	1	100	50	1	100	100	1	100

	Pooled 135			Pooled 246			Pooled All
Color	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
Blue	18	0.12	12.00	25	0.17	16.67	43	0.14	14.33
Green	9	0.06	6.00	10	0.07	6.67	19	0.06	6.33
Red	55	0.37	36.67	61	0.41	40.67	116	0.39	38.67
Yellow	68	0.45	45.33	54	0.36	36.00	122	0.41	40.67
Total	150	1	100	150	1	100	300	1	100

8.00 Samples

	#1			#2			Pooled 12
Color	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
Blue	8	0.16	16	9	0.18	18	17	0.17	17
Green	2	0.04	4	4	0.08	8	6	0.06	6
Red	24	0.48	48	17	0.34	34	41	0.41	41
Yellow	16	0.32	32	20	0.4	40	36	0.36	36
Total	50	1	100	50	1	100	100	1	100

	#3			#4			Pooled 34
Color	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
Blue	12	0.24	24	12	0.24	24	24	0.24	24
Green	3	0.06	6	1	0.02	2	4	0.04	4
Red	18	0.36	36	20	0.4	40	38	0.38	38
Yellow	17	0.34	34	17	0.34	34	34	0.34	34
Total	50	1	100	50	1	100	100	1	100

	#5			#6			Pooled 56
Color	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
Blue	5	0.1	10	5	0.1	10	10	0.1	10
Green	0	0	0	3	0.06	6	3	0.03	3
Red	20	0.4	40	22	0.44	44	42	0.42	42
Yellow	25	0.5	50	20	0.4	40	45	0.45	45
Total	50	1	100	50	1	100	100	1	100

	Pooled 135			Pooled 246			Pooled All
Color	Count	Proportion	Percent	Count	Proportion	Percent	Count	Proportion	Percent
Blue	25	0.17	16.67	26	0.17	17.33	51	0.17	17.00
Green	5	0.03	3.33	8	0.05	5.33	13	0.04	4.33
Red	62	0.41	41.33	59	0.39	39.33	121	0.40	40.33
Yellow	58	0.39	38.67	57	0.38	38.00	115	0.38	38.33
Total	150	1	100	150	1	100	300	1	100

You should be able to begin with the counts in the table and work out the proportions and percentages.

The True State of the Bowl

Color	Count	Proportion	Percent	E50	E100	E150	E200	E250	E300
Blue	3	0.1364	13.64	6.82	13.64	20.45	27.27	34.09	40.91
Green	1	0.0455	4.55	2.27	4.55	6.82	9.09	11.36	13.64
Red	9	0.4091	40.91	20.45	40.91	61.36	81.82	102.27	122.73
Yellow	9	0.4091	40.91	20.45	40.91	61.36	81.82	102.27	122.73
Total	22	1	100	50	100	150	200	250	300

Sample versus Population

6.30

Blue: 14.3% versus 13.6%

Green: 6.3% versus 4.5%

Red: 38.7% versus 40.9%

Yellow: 40.7% versus 40.9%

8.00

Blue: 17.0% versus 13.6%

Green: 4.3% versus 4.5%

Red: 40.3% versus 40.9%

Yellow: 38.3% versus 40.9%

The true proportions are probabilities:

In long runs of draws with replacement from the bowl, approximately 13.6 percent of draws show blue.

In long runs of draws with replacement from the bowl, approximately 4.5 percent of draws show green.

In long runs of draws with replacement from the bowl, approximately 40.9 percent of draws show red.

In long runs of draws with replacement from the bowl, approximately 40.9 percent of draws show yellow.

We see reasonable, but not exact matches between the sample proportions (p) and the probabilities (P).
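
The long-run statements above can be explored by simulation. The sketch below uses Python's standard random module to draw with replacement from a bowl with the true composition. The function names, the seed and the sample sizes are illustrative assumptions, and a simulated run will not reproduce the classroom samples.

```python
import random

COLORS = ("Blue", "Green", "Red", "Yellow")

# True state of the bowl (from the census): 3 blue, 1 green, 9 red, 9 yellow.
bowl = ["Blue"] * 3 + ["Green"] * 1 + ["Red"] * 9 + ["Yellow"] * 9

def draw_sample(bowl, n_draws, rng):
    """Random sampling with replacement (SRS/WR): n_draws independent draws."""
    return [rng.choice(bowl) for _ in range(n_draws)]

def sample_percents(sample):
    """Percent of draws showing each color, rounded to two decimals."""
    n_total = len(sample)
    return {c: round(100 * sample.count(c) / n_total, 2) for c in COLORS}

rng = random.Random(2009)  # fixed, arbitrary seed so the run is repeatable
for n_draws in (50, 300, 30000):
    print(n_draws, sample_percents(draw_sample(bowl, n_draws, rng)))
```

Small samples typically wander around the true percentages, while the very large sample should sit close to 13.6, 4.5, 40.9 and 40.9 percent.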

The probabilities imply expected counts, E = n*P:

In samples of 50 draws with replacement from the bowl, approximately 6 or 7 draws show blue.

In samples of 50 draws with replacement from the bowl, approximately 2 or 3 draws show green.

In samples of 50 draws with replacement from the bowl, approximately 20 or 21 draws show red.

In samples of 50 draws with replacement from the bowl, approximately 20 or 21 draws show yellow.

In samples of 100 draws with replacement from the bowl, approximately 13 or 14 draws show blue.

In samples of 100 draws with replacement from the bowl, approximately 4 or 5 draws show green.

In samples of 100 draws with replacement from the bowl, approximately 40 or 41 draws show red.

In samples of 100 draws with replacement from the bowl, approximately 40 or 41 draws show yellow.

In samples of 150 draws with replacement from the bowl, approximately 20 or 21 draws show blue.

In samples of 150 draws with replacement from the bowl, approximately 6 or 7 draws show green.

In samples of 150 draws with replacement from the bowl, approximately 61 or 62 draws show red.

In samples of 150 draws with replacement from the bowl, approximately 61 or 62 draws show yellow.

In samples of 200 draws with replacement from the bowl, approximately 27 or 28 draws show blue.

In samples of 200 draws with replacement from the bowl, approximately 9 or 10 draws show green.

In samples of 200 draws with replacement from the bowl, approximately 81 or 82 draws show red.

In samples of 200 draws with replacement from the bowl, approximately 81 or 82 draws show yellow.

In samples of 250 draws with replacement from the bowl, approximately 34 or 35 draws show blue.

In samples of 250 draws with replacement from the bowl, approximately 11 or 12 draws show green.

In samples of 250 draws with replacement from the bowl, approximately 102 or 103 draws show red.

In samples of 250 draws with replacement from the bowl, approximately 102 or 103 draws show yellow.

In samples of 300 draws with replacement from the bowl, approximately 40 or 41 draws show blue.

In samples of 300 draws with replacement from the bowl, approximately 13 or 14 draws show green.

In samples of 300 draws with replacement from the bowl, approximately 122 or 123 draws show red.

In samples of 300 draws with replacement from the bowl, approximately 122 or 123 draws show yellow.
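
The expected counts listed above follow directly from E = n*P. A short Python sketch, assuming only the census counts of the bowl (the function name expected_count is our own), recomputes them and the "approximately a or b draws" brackets:

```python
import math

# Census of the bowl: 22 marbles.
bowl_counts = {"Blue": 3, "Green": 1, "Red": 9, "Yellow": 9}
N_total = sum(bowl_counts.values())

def expected_count(n_draws, N_class, N_total):
    """E = n * P, with P = N_class / N_total."""
    return n_draws * N_class / N_total

for n_draws in (50, 100, 150, 200, 250, 300):
    for color, N_class in bowl_counts.items():
        e = expected_count(n_draws, N_class, N_total)
        # The expected count is rarely a whole number, so bracket it
        # by the whole numbers just below and just above.
        low, high = math.floor(e), math.ceil(e)
        print(f"n = {n_draws}, {color}: E = {e:.2f}, approximately {low} or {high} draws")
```

For any sample size, the expected counts across the four colors add up to n, since the proportions add up to 1.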

We see reasonable, but not exact, matches between the sample counts and the expected counts.

We didn’t get to these, but look up the Ellsberg games.

Regarding Ellsberg I

The 1st Game: The first bowl is 50%/50% split between blue and green. The best we can do is break even, regardless of strategy. The simplest strategy involves picking one of the colors and always betting on that color.

The 2nd Game: The second bowl is an unknown composite of red and yellow. We might be able to win this game if 1) there is a dominant color and 2) we can determine that dominant color. A simple strategy here is to pick one color and stay with it for a while. Then stop betting and check the number of winning bets. If the color being bet on is losing on a regular basis, switch colors.

The 3rd Game: This game only makes sense if the second bowl is dominant in red. Bet on red; if red consistently shows, stay on the second bowl. Otherwise, either stop playing or stick with the first bowl.

Regarding Ellsberg II

The 1st Game: The first bowl is 20% red / 40% black / 40% white. The simplest strategy involves picking one of the colors and always betting on that color. Regardless of betting choice, there is a 40% chance of losing the single bet, and a 20% chance of getting kicked out of the game.

The 2nd Game: The second bowl is 20% red / 80% black or white. The simplest strategy involves picking one of the colors and always betting on that color. If either white or black is sufficiently dominant, this game might be worth playing. The problem is that regardless of the possible advantage in the white/black part of the bowl, there is still a 20% chance of getting killed (permanently losing). But to detect this advantage, one is forced to pick a betting color (white or black) and spend some money.
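
In both Ellsberg II bowls, red makes up 20% of the bowl, and drawing red ends the game. If we assume independent draws with replacement (an assumption on our part, since these notes do not spell out the drawing rule), the chance of still being in the game after k bets is 0.8^k. A small Python sketch, with an illustrative function name:

```python
P_RED = 0.20  # share of red in both Ellsberg II bowls, as stated in the notes

def survival_probability(k_bets, p_red=P_RED):
    """Chance of drawing no red in k_bets independent draws with replacement."""
    return (1 - p_red) ** k_bets

# Chance of still being in the game after 1, 5, 10 and 20 bets.
for k in (1, 5, 10, 20):
    print(k, round(survival_probability(k), 4))
```

Under this assumption the survival chance shrinks quickly with the number of bets, which is why spending bets to detect a white/black advantage in the second game is costly.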

The idea underlying the Ellsberg games is to illustrate making decisions about selected processes or populations on the basis of random samples.