Summaries
Session 0.1/1.1
1st June 2009
Sampling a Simple Population
We use random sampling to estimate
an empirical model of a population. We check the empirical model by direct
inspection of the population. We repeat the sampling with replacement, obtaining
multiple random samples from the same population by the same process.
We combine (pool) compatible samples to form larger samples. Pooling samples of
size 50, we obtain samples of size 100, 150 and 300. In general, as sample size
increases, samples become more precise and reliable, provided that the sampling
process is reliable.
Random sampling is the basis for
obtaining information in statistical activities. Sampling is necessary,
tedious, time-consuming and expensive. Random sampling incorporates
reliability, precision and uncertainty.
Session Overview
In this session, we begin the study
of probability. We begin with a very basic example of a population, and explore
the process of sampling a population.
We examine two modes of sampling a
population: census (total enumeration), in which every member of the
population is examined; and random sampling with replacement (SRS/WR),
in which single members are repeatedly selected from the population. One
practical reason why we would want a sampling process is that we wish to
estimate some property of the population. Total enumeration allows a definitive
settling of the question, and random sampling allows an approximate answer. In
most practical settings, the populations of interest are too difficult to totally
enumerate – the population is too large, or too complex, or cannot be accessed
in total. In practical applications, it is sufficient (and usually necessary)
to use a suitable random sample in lieu of the total population.
In our first case, we begin with a
color bowl whose true color frequencies are not known. We obtain six (6) random
samples, each consisting of 50 draws with replacement (SRS/WR). We then compute
sample color frequencies in order to estimate the population color frequencies,
and then we check the estimates against the true structure of the bowl.
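The sampling process just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bowl contents (8 blue, 6 green, 8 red, 2 yellow) are the true counts tabulated later in this summary, while the function name `draw_sample` and the seed are made up for the example. The simulated counts will not reproduce the class samples.

```python
import random
from collections import Counter

# True bowl contents, taken from "The True State of the Bowl" table.
BOWL = ["Blue"] * 8 + ["Green"] * 6 + ["Red"] * 8 + ["Yellow"] * 2

def draw_sample(rng, n_draws=50):
    """Simple random sample with replacement: every draw goes back in the bowl."""
    return Counter(rng.choice(BOWL) for _ in range(n_draws))

rng = random.Random(2009)  # arbitrary seed (an assumption) so the run is repeatable
samples = [draw_sample(rng) for _ in range(6)]  # six samples of 50 draws each

for i, counts in enumerate(samples, start=1):
    row = ", ".join(f"{c}: {counts[c]}" for c in ("Blue", "Green", "Red", "Yellow"))
    print(f"Sample #{i}: {row}")
```

Because each draw is replaced, the bowl is the same for every draw, so a color can appear more often in a sample than it exists in the bowl.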
We then explore a bit of decision
theory by playing with Ellsberg’s Urns.
Prediction and Probabilistic Randomness: Predicting the Behavior of a Six-sided Die
Samples – Face Values and Predictions
You should be able to begin with the counts in the table and work out the
proportions and percentages.
Prediction and the Fair Die
Samples #1 and #2, Pooled 12
Face Value | #1 Count | #1 Proportion | #1 Percent | #2 Count | #2 Proportion | #2 Percent | Pooled 12 Count | Pooled 12 Proportion | Pooled 12 Percent
1 | 7 | 0.14 | 14 | 15 | 0.3 | 30 | 22 | 0.22 | 22
2 | 8 | 0.16 | 16 | 7 | 0.14 | 14 | 15 | 0.15 | 15
3 | 5 | 0.1 | 10 | 7 | 0.14 | 14 | 12 | 0.12 | 12
4 | 8 | 0.16 | 16 | 9 | 0.18 | 18 | 17 | 0.17 | 17
5 | 14 | 0.28 | 28 | 8 | 0.16 | 16 | 22 | 0.22 | 22
6 | 8 | 0.16 | 16 | 4 | 0.08 | 8 | 12 | 0.12 | 12
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Prediction | #1 Count | #1 Proportion | #1 Percent | #2 Count | #2 Proportion | #2 Percent | Pooled 12 Count | Pooled 12 Proportion | Pooled 12 Percent
Hit | 12 | 0.24 | 24 | 5 | 0.1 | 10 | 17 | 0.17 | 17
Miss | 38 | 0.76 | 76 | 45 | 0.9 | 90 | 83 | 0.83 | 83
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Samples #3 and #4, Pooled 34
Face Value | #3 Count | #3 Proportion | #3 Percent | #4 Count | #4 Proportion | #4 Percent | Pooled 34 Count | Pooled 34 Proportion | Pooled 34 Percent
1 | 10 | 0.2 | 20 | 7 | 0.14 | 14 | 17 | 0.17 | 17
2 | 9 | 0.18 | 18 | 3 | 0.06 | 6 | 12 | 0.12 | 12
3 | 7 | 0.14 | 14 | 15 | 0.3 | 30 | 22 | 0.22 | 22
4 | 10 | 0.2 | 20 | 9 | 0.18 | 18 | 19 | 0.19 | 19
5 | 10 | 0.2 | 20 | 10 | 0.2 | 20 | 20 | 0.2 | 20
6 | 4 | 0.08 | 8 | 6 | 0.12 | 12 | 10 | 0.1 | 10
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Prediction | #3 Count | #3 Proportion | #3 Percent | #4 Count | #4 Proportion | #4 Percent | Pooled 34 Count | Pooled 34 Proportion | Pooled 34 Percent
Hit | 6 | 0.12 | 12 | 7 | 0.14 | 14 | 13 | 0.13 | 13
Miss | 44 | 0.88 | 88 | 43 | 0.86 | 86 | 87 | 0.87 | 87
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Samples #5 and #6, Pooled 56
Face Value | #5 Count | #5 Proportion | #5 Percent | #6 Count | #6 Proportion | #6 Percent | Pooled 56 Count | Pooled 56 Proportion | Pooled 56 Percent
1 | 7 | 0.14 | 14 | 7 | 0.14 | 14 | 14 | 0.14 | 14
2 | 12 | 0.24 | 24 | 8 | 0.16 | 16 | 20 | 0.2 | 20
3 | 5 | 0.1 | 10 | 9 | 0.18 | 18 | 14 | 0.14 | 14
4 | 11 | 0.22 | 22 | 6 | 0.12 | 12 | 17 | 0.17 | 17
5 | 9 | 0.18 | 18 | 7 | 0.14 | 14 | 16 | 0.16 | 16
6 | 6 | 0.12 | 12 | 13 | 0.26 | 26 | 19 | 0.19 | 19
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Prediction | #5 Count | #5 Proportion | #5 Percent | #6 Count | #6 Proportion | #6 Percent | Pooled 56 Count | Pooled 56 Proportion | Pooled 56 Percent
Hit | 10 | 0.2 | 20 | 4 | 0.08 | 8 | 14 | 0.14 | 14
Miss | 40 | 0.8 | 80 | 46 | 0.92 | 92 | 86 | 0.86 | 86
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Pooled 135, Pooled 246, Pooled All
Face Value | Pooled 135 Count | Pooled 135 Proportion | Pooled 135 Percent | Pooled 246 Count | Pooled 246 Proportion | Pooled 246 Percent | Pooled All Count | Pooled All Proportion | Pooled All Percent
1 | 24 | 0.16 | 16 | 29 | 0.19 | 19.33 | 53 | 0.18 | 17.7
2 | 29 | 0.193 | 19.33 | 18 | 0.12 | 12 | 47 | 0.16 | 15.7
3 | 17 | 0.113 | 11.33 | 31 | 0.21 | 20.67 | 48 | 0.16 | 16
4 | 29 | 0.193 | 19.33 | 24 | 0.16 | 16 | 53 | 0.18 | 17.7
5 | 33 | 0.22 | 22 | 25 | 0.17 | 16.67 | 58 | 0.19 | 19.3
6 | 18 | 0.12 | 12 | 23 | 0.15 | 15.33 | 41 | 0.14 | 13.7
Total | 150 | 1 | 100 | 150 | 1 | 100 | 300 | 1 | 100
Prediction | Pooled 135 Count | Pooled 135 Proportion | Pooled 135 Percent | Pooled 246 Count | Pooled 246 Proportion | Pooled 246 Percent | Pooled All Count | Pooled All Proportion | Pooled All Percent
Hit | 28 | 0.187 | 18.67 | 16 | 0.11 | 10.67 | 44 | 0.15 | 14.7
Miss | 122 | 0.813 | 81.33 | 134 | 0.89 | 89.33 | 256 | 0.85 | 85.3
Total | 150 | 1 | 100 | 150 | 1 | 100 | 300 | 1 | 100
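Pooling is just adding counts face by face and then recomputing the proportions. The sketch below redoes the Pooled 12 column from the counts of Samples #1 and #2 in the table above; the helper names `pool` and `summarize` are hypothetical, chosen only for this example.

```python
FACES = [1, 2, 3, 4, 5, 6]

# Face counts for Samples #1 and #2, copied from the table above.
sample_1 = {1: 7, 2: 8, 3: 5, 4: 8, 5: 14, 6: 8}
sample_2 = {1: 15, 2: 7, 3: 7, 4: 9, 5: 8, 6: 4}

def pool(*samples):
    """Pool compatible samples by adding their counts face by face."""
    return {face: sum(s[face] for s in samples) for face in FACES}

def summarize(counts):
    """Return {face: (count, proportion, percent)} for one sample."""
    n_total = sum(counts.values())
    return {face: (n, n / n_total, 100 * n / n_total) for face, n in counts.items()}

pooled_12 = pool(sample_1, sample_2)
for face, (n, p, pct) in summarize(pooled_12).items():
    print(f"Face {face}: count={n}, proportion={p:.2f}, percent={pct:.0f}")
```

The same two helpers apply to any of the other pooled columns, for example `pool(sample_1, sample_3, sample_5)` once the remaining counts are entered.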
In the fair die model
for this case, in long runs of tosses of the die: approximately 16⅔% of
tosses show “1”, approximately 16⅔% of tosses show “2”, approximately 16⅔%
of tosses show “3”, approximately 16⅔% of tosses show “4”, approximately
16⅔% of tosses show “5”, and approximately 16⅔% of tosses show “6.”
The sample data are generally compatible with a fair die assumption
(equally-likely face values) and with a baseline expected prediction success
rate of (1/6), or 16⅔%. Sample performance seems to improve with
increasing sample size – but the samples do not exactly fit the fair
assumption.
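The 1/6 baseline can be checked with a small simulation. This is a sketch, not the class procedure: it assumes the prediction is made before each toss, and it tries two hypothetical predictors (always guess "3", or guess at random). Under a fair die either predictor should hit about one toss in six in long runs.

```python
import random

def play(n_tosses, predict, rng):
    """Toss a simulated fair die n_tosses times; count predictions that match."""
    hits = 0
    for _ in range(n_tosses):
        guess = predict(rng)       # prediction is made before the toss
        toss = rng.randint(1, 6)   # fair die: faces 1..6 equally likely
        hits += (guess == toss)
    return hits

rng = random.Random(1)  # arbitrary seed (an assumption) for a repeatable run
for n in (50, 300, 30000):
    always_three = play(n, lambda r: 3, rng)
    random_guess = play(n, lambda r: r.randint(1, 6), rng)
    print(f"n={n}: always '3' hits {100 * always_three / n:.1f}%, "
          f"random guess hits {100 * random_guess / n:.1f}%")
```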
Sample versus Fair Model
Face Value 1: 17.7% versus 16.7%
Face Value 2: 15.7% versus 16.7%
Face Value 3: 16% versus 16.7%
Face Value 4: 17.7% versus 16.7%
Face Value 5: 19.3% versus 16.7%
Face Value 6: 13.7% versus 16.7%
Prediction “Hit”: 14.7% versus 16.7%
Prediction “Miss”: 85.3% versus 83.3%
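The comparison above uses only the full pooled sample of 300. To see how the fit changes with sample size, the sketch below computes the largest gap between any face's sample percent and the fair-model 16⅔% for four of the tabulated samples. The choice of these four samples and the name `max_gap_pct` are assumptions made for illustration.

```python
from fractions import Fraction

FAIR = Fraction(1, 6)  # fair-die model: each face has probability 1/6

# Face counts (faces 1..6) copied from the tables above.
samples = {
    "Sample #1 (n=50)": [7, 8, 5, 8, 14, 8],
    "Pooled 12 (n=100)": [22, 15, 12, 17, 22, 12],
    "Pooled 135 (n=150)": [24, 29, 17, 29, 33, 18],
    "Pooled All (n=300)": [53, 47, 48, 53, 58, 41],
}

def max_gap_pct(counts):
    """Largest gap, in percentage points, between a face's percent and 16 2/3%."""
    n_total = sum(counts)
    return max(abs(100 * (Fraction(n, n_total) - FAIR)) for n in counts)

for label, counts in samples.items():
    print(f"{label}: largest gap = {float(max_gap_pct(counts)):.2f} points")
```

For these counts the largest gaps are about 11.3, 5.3, 5.3 and 3.0 percentage points: the trend is downward but not strictly so, which matches the remark that performance seems to improve with sample size without ever fitting exactly.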
Case Study 1.1: A Color Bowl
In random sampling, we
might not get a complete list of colors; we'd need a total sample (census) for
that kind of listing. The sample proportion of each listed color approximates
the corresponding model proportion in the bowl itself. In census sampling,
every object in the bowl is counted. The listing is complete, and the model
proportions may be calculated directly.
The basic idea in case study 1.1 is
that random samples give imperfect pictures of what is being sampled. However,
with sufficiently large samples, these samples can reliably yield good pictures
of the processes or populations being sampled. And the essence of many
statistical applications is the study of selected processes or populations. For
a sense of the efficiency of the samples, compare sample and true percentages.
Some Formulas – Proportions, Percentages, Counts
The class represents
some property or attribute, for example, blue, or red.
Each member, or unit, of a sample can be classified – the result of the
classification of the unit is the unit’s class.
Sample Proportion (p)
nclass ~ number of units of sample in class
ntotal ~ total number of units in sample
pclass = nclass / ntotal
pclass ~ proportion of sample in class
Sample Percent (pct)
nclass ~ number of units of sample in class
ntotal ~ total number of units in sample
pclass = nclass / ntotal
pctclass = 100*(nclass / ntotal)
pctclass = 100*pclass
pctclass ~ percent of sample in class
Population Proportion (P)
Nclass ~ number of units of population in class
Ntotal ~ total number of units in population
Pclass = Nclass / Ntotal
Pclass ~ proportion of population in class
Population Percent (PCT)
Nclass ~ number of units of population in class
Ntotal ~ total number of units in population
Pclass = Nclass / Ntotal
PCTclass = 100*(Nclass / Ntotal)
PCTclass = 100*Pclass
PCTclass ~ percent of population in class
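The four formulas share one computation, applied either to sample counts or to population counts. A minimal sketch follows; the function names `proportion` and `percent` and the validity check are assumptions added for the example, and the arguments mirror nclass and ntotal (or Nclass and Ntotal).

```python
def proportion(n_class, n_total):
    """p_class = n_class / n_total (same formula for a sample or a population)."""
    if n_total <= 0 or not 0 <= n_class <= n_total:
        raise ValueError("need 0 <= n_class <= n_total and n_total > 0")
    return n_class / n_total

def percent(n_class, n_total):
    """pct_class = 100 * (n_class / n_total)."""
    proportion(n_class, n_total)  # reuse the validity check
    return 100 * n_class / n_total

# Sample: 23 blue draws out of 50 (Sample #1 of the color bowl).
print(proportion(23, 50), percent(23, 50))
# Population: 8 blue marbles out of 24 in the bowl.
print(round(proportion(8, 24), 4), round(percent(8, 24), 2))
```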
In this setting,
nblue ~ number of blue draws in sample
ntotal ~ total number of draws per sample
pblue = nblue / ntotal
pblue ~ proportion of sample draws showing blue
pctblue = 100*pblue
pctblue ~ percent of sample draws showing blue
Nblue ~ number of blue marbles in bowl
Ntotal ~ total number of marbles in bowl
Pblue = Nblue / Ntotal
Pblue ~ proportion of marbles in bowl that are blue
ngreen ~ number of green draws in sample
ntotal ~ total number of draws per sample
pgreen = ngreen / ntotal
pgreen ~ proportion of sample draws showing green
pctgreen = 100*pgreen
pctgreen ~ percent of sample draws showing green
Ngreen ~ number of green marbles in bowl
Ntotal ~ total number of marbles in bowl
Pgreen = Ngreen / Ntotal
Pgreen ~ proportion of marbles in bowl that are green
nred ~ number of red draws in sample
ntotal ~ total number of draws per sample
pred = nred / ntotal
pred ~ proportion of sample draws showing red
pctred = 100*pred
pctred ~ percent of sample draws showing red
Nred ~ number of red marbles in bowl
Ntotal ~ total number of marbles in bowl
Pred = Nred / Ntotal
Pred ~ proportion of marbles in bowl that are red
nyellow ~ number of yellow draws in sample
ntotal ~ total number of draws per sample
pyellow = nyellow / ntotal
pyellow ~ proportion of sample draws showing yellow
pctyellow = 100*pyellow
pctyellow ~ percent of sample draws showing yellow
Nyellow ~ number of yellow marbles in bowl
Ntotal ~ total number of marbles in bowl
Pyellow = Nyellow / Ntotal
Pyellow ~ proportion of marbles in bowl that are yellow
Samples – Bowl
You should be able to begin with the counts in the table and work out the
proportions and percentages.
Samples #1 and #2, Pooled 12
Color | #1 Count | #1 Proportion | #1 Percent | #2 Count | #2 Proportion | #2 Percent | Pooled 12 Count | Pooled 12 Proportion | Pooled 12 Percent
Blue | 23 | 0.46 | 46 | 19 | 0.38 | 38 | 42 | 0.42 | 42
Green | 8 | 0.16 | 16 | 13 | 0.26 | 26 | 21 | 0.21 | 21
Red | 19 | 0.38 | 38 | 13 | 0.26 | 26 | 32 | 0.32 | 32
Yellow | 0 | 0 | 0 | 5 | 0.1 | 10 | 5 | 0.05 | 5
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Samples #3 and #4, Pooled 34
Color | #3 Count | #3 Proportion | #3 Percent | #4 Count | #4 Proportion | #4 Percent | Pooled 34 Count | Pooled 34 Proportion | Pooled 34 Percent
Blue | 22 | 0.44 | 44 | 17 | 0.34 | 34 | 39 | 0.39 | 39
Green | 6 | 0.12 | 12 | 15 | 0.3 | 30 | 21 | 0.21 | 21
Red | 19 | 0.38 | 38 | 13 | 0.26 | 26 | 32 | 0.32 | 32
Yellow | 3 | 0.06 | 6 | 5 | 0.1 | 10 | 8 | 0.08 | 8
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Samples #5 and #6, Pooled 56
Color | #5 Count | #5 Proportion | #5 Percent | #6 Count | #6 Proportion | #6 Percent | Pooled 56 Count | Pooled 56 Proportion | Pooled 56 Percent
Blue | 13 | 0.26 | 26 | 23 | 0.46 | 46 | 36 | 0.36 | 36
Green | 19 | 0.38 | 38 | 10 | 0.2 | 20 | 29 | 0.29 | 29
Red | 10 | 0.2 | 20 | 15 | 0.3 | 30 | 25 | 0.25 | 25
Yellow | 8 | 0.16 | 16 | 2 | 0.04 | 4 | 10 | 0.1 | 10
Total | 50 | 1 | 100 | 50 | 1 | 100 | 100 | 1 | 100
Pooled 135, Pooled 246, Pooled All
Color | Pooled 135 Count | Pooled 135 Proportion | Pooled 135 Percent | Pooled 246 Count | Pooled 246 Proportion | Pooled 246 Percent | Pooled All Count | Pooled All Proportion | Pooled All Percent
Blue | 58 | 0.386667 | 38.67 | 59 | 0.393333 | 39.333 | 117 | 0.39 | 39
Green | 33 | 0.22 | 22 | 38 | 0.253333 | 25.333 | 71 | 0.236667 | 23.667
Red | 48 | 0.32 | 32 | 41 | 0.273333 | 27.333 | 89 | 0.296667 | 29.667
Yellow | 11 | 0.073333 | 7.333 | 12 | 0.08 | 8 | 23 | 0.076667 | 7.6667
Total | 150 | 1 | 100 | 150 | 1 | 100 | 300 | 1 | 100
The True State of the Bowl
Color | Count | Proportion | Percent | E50 | E100 | E150 | E200 | E250 | E300
Blue | 8 | 0.333333 | 33.33 | 16.667 | 33.3 | 50 | 66.7 | 83.3 | 100
Green | 6 | 0.25 | 25 | 12.5 | 25 | 37.5 | 50 | 62.5 | 75
Red | 8 | 0.333333 | 33.33 | 16.667 | 33.3 | 50 | 66.7 | 83.3 | 100
Yellow | 2 | 0.083333 | 8.333 | 4.1667 | 8.33 | 12.5 | 16.7 | 20.8 | 25
Total | 24 | 1 | 100 | 50 | 100 | 150 | 200 | 250 | 300
Sample versus Population
Blue: 39% versus 33.3%
Green: 23.7% versus 25%
Red: 29.7% versus 33.3%
Yellow: 7.7% versus 8.3%
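These four comparisons set the Pooled All sample (300 draws) against the census of the bowl (24 marbles). A short sketch that recomputes them from the tabulated counts is shown here; the variable names are illustrative only.

```python
# Pooled All sample counts (n = 300) and census counts (N = 24), from the tables.
sample_counts = {"Blue": 117, "Green": 71, "Red": 89, "Yellow": 23}
bowl_counts = {"Blue": 8, "Green": 6, "Red": 8, "Yellow": 2}

n_total = sum(sample_counts.values())
N_total = sum(bowl_counts.values())

comparison = {}
for color in bowl_counts:
    sample_pct = 100 * sample_counts[color] / n_total   # pct = 100 * n / ntotal
    true_pct = 100 * bowl_counts[color] / N_total       # PCT = 100 * N / Ntotal
    comparison[color] = (sample_pct, true_pct, sample_pct - true_pct)
    print(f"{color}: {sample_pct:.1f}% versus {true_pct:.1f}% "
          f"(difference {sample_pct - true_pct:+.1f} points)")
```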
The true proportions are probabilities:
In long runs of draws
with replacement from the bowl, approximately 33.3 percent of draws show blue.
In long runs of draws with
replacement from the bowl, approximately 25 percent of draws show green.
In long runs of draws with
replacement from the bowl, approximately 33.3 percent of draws show red.
In long runs of draws with replacement
from the bowl, approximately 8.3 percent of draws show yellow.
We see reasonable, but not exact
matches between the sample proportions (p) and the probabilities (P).
The probabilities imply perfect or expected counts: E=n*P:
In samples of 50 draws
with replacement from the bowl, approximately 16 or 17 draws show blue.
In samples of 50 draws with
replacement from the bowl, approximately 12 or 13 draws show green.
In samples of 50 draws
with replacement from the bowl, approximately 16 or 17 draws show red.
In samples of 50 draws with replacement
from the bowl, approximately 4 or 5 draws show yellow.
In samples of 100 draws
with replacement from the bowl, approximately 33 or 34 draws show blue.
In samples of 100 draws with
replacement from the bowl, approximately 25 draws show green.
In samples of 100 draws
with replacement from the bowl, approximately 33 or 34 draws show red.
In samples of 100 draws with
replacement from the bowl, approximately 8 or 9 draws show yellow.
In samples of 150 draws
with replacement from the bowl, approximately 50 draws show blue.
In samples of 150 draws with
replacement from the bowl, approximately 37 or 38 draws show green.
In samples of 150 draws
with replacement from the bowl, approximately 50 draws show red.
In samples of 150 draws with
replacement from the bowl, approximately 12 or 13 draws show yellow.
In samples of 200 draws
with replacement from the bowl, approximately 66 or 67 draws show blue.
In samples of 200 draws with
replacement from the bowl, approximately 50 draws show green.
In samples of 200 draws
with replacement from the bowl, approximately 66 or 67 draws show red.
In samples of 200 draws with
replacement from the bowl, approximately 16 or 17 draws show yellow.
In samples of 250 draws
with replacement from the bowl, approximately 83 or 84 draws show blue.
In samples of 250 draws with
replacement from the bowl, approximately 62 or 63 draws show green.
In samples of 250 draws
with replacement from the bowl, approximately 83 or 84 draws show red.
In samples of 250 draws with
replacement from the bowl, approximately 20 or 21 draws show yellow.
In samples of 300 draws
with replacement from the bowl, approximately 100 draws show blue.
In samples of 300 draws with
replacement from the bowl, approximately 75 draws show green.
In samples of 300 draws
with replacement from the bowl, approximately 100 draws show red.
In samples of 300 draws with
replacement from the bowl, approximately 25 draws show yellow.
We see reasonable, but not exact
matches between the sample proportions and the probabilities.
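The expected counts listed above all come from E = n*P. A minimal sketch, assuming the census counts from the bowl table and using exact fractions so that values such as 16⅔ are not rounded early (the function name `expected_count` is hypothetical):

```python
from fractions import Fraction

# Census of the bowl, from "The True State of the Bowl".
bowl_counts = {"Blue": 8, "Green": 6, "Red": 8, "Yellow": 2}
N_total = sum(bowl_counts.values())

def expected_count(color, n_draws):
    """E = n * P, where P = N_color / N_total, kept as an exact fraction."""
    return n_draws * Fraction(bowl_counts[color], N_total)

for n_draws in (50, 100, 150, 200, 250, 300):
    cells = ", ".join(
        f"{color} {float(expected_count(color, n_draws)):.1f}" for color in bowl_counts
    )
    print(f"n={n_draws}: {cells}")
```

Expected counts need not be whole numbers, which is why the statements above say, for example, "16 or 17 draws" for blue in samples of 50.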
We didn’t get to these, so read up.
Regarding Ellsberg I
The 1st Game: The first bowl is 50%/50% split between blue and green. The best we can do is break even, regardless of strategy.
The simplest strategy involves picking one of the colors and always betting on
that color.
The 2nd Game: The second bowl is an unknown mixture of red and yellow. We might be able to win this game if 1) there is a dominant
color and 2) we can determine that dominant color. A simple strategy here is to
pick one color and ride it for a while. Then stop betting and check the number
of winning bets. If the chosen color is losing on a regular basis, switch
colors.
The 3rd Game: This game only makes sense if the second bowl is dominant
in red. Bet on red; if red consistently shows, stay on the second
bowl. Otherwise, either stop playing or stick with the first bowl.
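The "ride one color, check, then switch" strategy from the 2nd Game can be sketched as a simulation. Everything specific here is an assumption for illustration: even-money one-unit bets, a trial period of `n_trial` bets on red, a switch to yellow if red won fewer than half of the trial bets, and a hidden red share `p_red` that the player never sees.

```python
import random

def ride_then_switch(p_red, n_trial, n_total, rng):
    """Bet red for n_trial draws, then keep red only if it won at least half.

    Returns (final_color, net_units) for even-money one-unit bets.
    """
    color, net, red_wins = "red", 0, 0
    for i in range(n_total):
        if i == n_trial and red_wins < n_trial / 2:
            color = "yellow"            # red was losing regularly: switch
        draw = "red" if rng.random() < p_red else "yellow"
        won = (draw == color)
        net += 1 if won else -1
        if i < n_trial and won:
            red_wins += 1               # during the trial, every win is a red draw
    return color, net

rng = random.Random(7)  # arbitrary seed (an assumption) for a repeatable run
for p_red in (0.5, 0.7, 0.2):
    print(p_red, ride_then_switch(p_red, n_trial=20, n_total=200, rng=rng))
```

The trial bets are themselves a random sample from the bowl, so a short trial can point to the wrong color when the bowl is close to evenly split.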
Regarding Ellsberg II
The 1st Game: The first bowl is 20% red / 40% black / 40% white. The simplest strategy
involves picking one of the colors and always betting on that color. Regardless
of betting choice, there is a 40% chance of losing a single bet and a 20%
chance of getting kicked out of the game.
The 2nd Game: The second bowl is 20% red / 80% black or white. The simplest strategy involves
picking one of the colors and always betting on that color. If either white or
black is sufficiently dominant, this game might be worth playing. The problem
is that regardless of the possible advantage in the white/black part of the
bowl, there is still a 20% chance of getting killed (permanently losing). But
to detect this advantage, one is forced to pick a betting color (white or
black) and spend some money.
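The cost of the 20% red share can be made concrete. The sketch below assumes that draws are independent and made with replacement, and that a red draw ends the game for good; the payoffs of the bets are not modeled. Under those assumptions the chance of still playing after k draws is 0.8 raised to the power k.

```python
P_RED = 0.20   # share of red in either Ellsberg II bowl, as stated above

def still_playing(k):
    """Probability that none of the first k draws is red."""
    return (1 - P_RED) ** k

# Mean of a geometric distribution: average draw on which the first red shows.
expected_draws_until_red = 1 / P_RED

for k in (1, 5, 10, 20):
    print(f"after {k} draws: {100 * still_playing(k):.1f}% chance of still playing")
print(f"on average the first red shows on draw {expected_draws_until_red:.0f}")
```

This is why detecting a black or white advantage in the second bowl is costly: the bets needed to detect it are exposed to the same 20% risk on every draw.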
The idea underlying the Ellsberg
games is to illustrate how decisions about selected processes
or populations can be made on the basis of random samples.