Summaries

Session 1.1

13th January 2010

Sampling a Simple Population

We use random sampling to estimate an empirical model of a population, and we check the empirical model by direct inspection of the population. We repeat the sampling with replacement, obtaining multiple random samples from the same population by the same process. We combine (pool) compatible samples to form larger samples: pooling samples of size 50, we obtain samples of size 100, 150 and 300. In general, as sample size increases, sample estimates become more precise and reliable, provided that the sampling process is reliable.

Random sampling is the basis for obtaining information in statistical activities. Sampling is necessary, tedious, time-consuming and expensive. Random sampling incorporates reliability, precision and uncertainty.
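
As a minimal sketch of the pooling step, the Python fragment below adds the class counts of two compatible samples and recomputes the proportions. The counts are the die-face counts recorded for Samples #1 and #2 of the 6:30 session later in these notes; the function name pool_counts is an illustrative choice and not part of the course material.

```python
from collections import Counter

def pool_counts(*samples):
    """Add class counts across compatible samples (same population, same process)."""
    pooled = Counter()
    for sample in samples:
        pooled.update(sample)   # Counter.update adds counts key by key
    return dict(pooled)

# Die-face counts for Samples #1 and #2 (6:30 session), 50 tosses each.
sample_1 = {1: 10, 2: 8, 3: 5, 4: 9, 5: 8, 6: 10}
sample_2 = {1: 10, 2: 8, 3: 8, 4: 14, 5: 5, 6: 5}

pooled_12 = pool_counts(sample_1, sample_2)
n_total = sum(pooled_12.values())   # pooled sample size
proportions = {face: n / n_total for face, n in pooled_12.items()}

for face in sorted(pooled_12):
    print(face, pooled_12[face], proportions[face])
```

Pooling only adds counts; the pooled proportion is then recomputed from the pooled total rather than averaged from the separate proportions.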

Session Overview

In this session, we begin the study of probability. We start with a very basic example of a population and explore the process of sampling it.

We examine two modes of sampling a population: census (total enumeration), in which every member of the population is examined; and simple random sampling with replacement (SRS/WR), in which single members are repeatedly selected from the population. One practical reason for wanting a sampling process is that we wish to estimate some property of the population. Total enumeration settles the question definitively, while random sampling gives an approximate answer. In most practical settings, the populations of interest are too difficult to enumerate totally: the population is too large, or too complex, or cannot be accessed in total. In practical applications, it is sufficient (and usually necessary) to use a suitable random sample in lieu of the total population.

In our first case, we begin with a color bowl whose true color frequencies are not known. We obtain six (6) random samples, each consisting of 50 draws with replacement (SRS/WR). We then compute sample color frequencies in order to estimate the population color frequencies, and then we check the estimates against the true structure of the bowl.

We then explore a bit of decision theory by playing with Ellsberg’s Urns.

Prediction and Probabilistic Randomness: Predicting the Behavior of a Six-sided Die

Samples – Face Values and Predictions

You should be able to begin with the counts in the table and work out the proportions and percentages.

6:30 Samples

            Sample #1                    Sample #2                    Pooled 1+2
Face Value  Count  Proportion  Percent   Count  Proportion  Percent   Count  Proportion  Percent
1           10     0.2         20        10     0.2         20        20     0.2         20
2           8      0.16        16        8      0.16        16        16     0.16        16
3           5      0.1         10        8      0.16        16        13     0.13        13
4           9      0.18        18        14     0.28        28        23     0.23        23
5           8      0.16        16        5      0.1         10        13     0.13        13
6           10     0.2         20        5      0.1         10        15     0.15        15
Total       50     1           100       50     1           100       100    1           100

Prediction
Hit         11     0.22        22        18     0.36        36        29     0.29        29
Miss        39     0.78        78        32     0.64        64        71     0.71        71
Total       50     1           100       50     1           100       100    1           100

            Sample #3                    Sample #4                    Pooled 3+4
Face Value  Count  Proportion  Percent   Count  Proportion  Percent   Count  Proportion  Percent
1           8      0.16        16        6      0.12        12        14     0.14        14
2           8      0.16        16        15     0.3         30        23     0.23        23
3           9      0.18        18        6      0.12        12        15     0.15        15
4           8      0.16        16        4      0.08        8         12     0.12        12
5           8      0.16        16        8      0.16        16        16     0.16        16
6           9      0.18        18        11     0.22        22        20     0.2         20
Total       50     1           100       50     1           100       100    1           100

Prediction
Hit         7      0.14        14        9      0.18        18        16     0.16        16
Miss        43     0.86        86        41     0.82        82        84     0.84        84
Total       50     1           100       50     1           100       100    1           100

            Sample #5                    Sample #6                    Pooled 5+6
Face Value  Count  Proportion  Percent   Count  Proportion  Percent   Count  Proportion  Percent
1           6      0.12        12        9      0.18        18        15     0.15        15
2           13     0.26        26        11     0.22        22        24     0.24        24
3           6      0.12        12        12     0.24        24        18     0.18        18
4           9      0.18        18        5      0.1         10        14     0.14        14
5           8      0.16        16        6      0.12        12        14     0.14        14
6           8      0.16        16        7      0.14        14        15     0.15        15
Total       50     1           100       50     1           100       100    1           100

Prediction
Hit         13     0.26        26        9      0.18        18        22     0.22        22
Miss        37     0.74        74        41     0.82        82        78     0.78        78
Total       50     1           100       50     1           100       100    1           100

            Pooled 1+3+5                 Pooled 2+4+6                 Pooled 1-6 (All)
Face Value  Count  Proportion  Percent   Count  Proportion  Percent   Count  Proportion  Percent
1           24     0.16        16.00     25     0.17        16.67     49     0.16        16.33
2           29     0.19        19.33     34     0.23        22.67     63     0.21        21.00
3           20     0.13        13.33     26     0.17        17.33     46     0.15        15.33
4           26     0.17        17.33     23     0.15        15.33     49     0.16        16.33
5           24     0.16        16.00     19     0.13        12.67     43     0.14        14.33
6           27     0.18        18.00     23     0.15        15.33     50     0.17        16.67
Total       150    1.00        100.00    150    1.00        100.00    300    1.00        100.00

Prediction
Hit         31     0.21        20.67     36     0.24        24.00     67     0.22        22.33
Miss        119    0.79        79.33     114    0.76        76.00     233    0.78        77.67
Total       150    1.00        100.00    150    1.00        100.00    300    1.00        100.00

8:00 Samples

            Sample #1                    Sample #2                    Pooled 1+2
Face Value  Count  Proportion  Percent   Count  Proportion  Percent   Count  Proportion  Percent
1           13     0.26        26        8      0.16        16        21     0.21        21
2           9      0.18        18        7      0.14        14        16     0.16        16
3           12     0.24        24        10     0.2         20        22     0.22        22
4           7      0.14        14        10     0.2         20        17     0.17        17
5           3      0.06        6         8      0.16        16        11     0.11        11
6           6      0.12        12        7      0.14        14        13     0.13        13
Total       50     1           100       50     1           100       100    1           100

Prediction
Hit         9      0.18        18        14     0.28        28        23     0.23        23
Miss        41     0.82        82        36     0.72        72        77     0.77        77
Total       50     1           100       50     1           100       100    1           100

            Sample #3                    Sample #4                    Pooled 3+4
Face Value  Count  Proportion  Percent   Count  Proportion  Percent   Count  Proportion  Percent
1           9      0.18        18        7      0.14        14        16     0.16        16
2           8      0.16        16        8      0.16        16        16     0.16        16
3           10     0.2         20        13     0.26        26        23     0.23        23
4           8      0.16        16        7      0.14        14        15     0.15        15
5           6      0.12        12        9      0.18        18        15     0.15        15
6           9      0.18        18        6      0.12        12        15     0.15        15
Total       50     1           100       50     1           100       100    1           100

Prediction
Hit         8      0.16        16        8      0.16        16        16     0.16        16
Miss        42     0.84        84        42     0.84        84        84     0.84        84
Total       50     1           100       50     1           100       100    1           100

            Sample #5                    Sample #6                    Pooled 5+6
Face Value  Count  Proportion  Percent   Count  Proportion  Percent   Count  Proportion  Percent
1           9      0.18        18        11     0.22        22        20     0.2         20
2           6      0.12        12        12     0.24        24        18     0.18        18
3           8      0.16        16        5      0.1         10        13     0.13        13
4           13     0.26        26        5      0.1         10        18     0.18        18
5           8      0.16        16        11     0.22        22        19     0.19        19
6           6      0.12        12        6      0.12        12        12     0.12        12
Total       50     1           100       50     1           100       100    1           100

Prediction
Hit         8      0.16        16        7      0.14        14        15     0.15        15
Miss        42     0.84        84        43     0.86        86        85     0.85        85
Total       50     1           100       50     1           100       100    1           100

            Pooled 1+3+5                 Pooled 2+4+6                 Pooled 1-6 (All)
Face Value  Count  Proportion  Percent   Count  Proportion  Percent   Count  Proportion  Percent
1           31     0.21        20.67     26     0.17        17.33     57     0.19        19.00
2           23     0.15        15.33     27     0.18        18.00     50     0.17        16.67
3           30     0.20        20.00     28     0.19        18.67     58     0.19        19.33
4           28     0.19        18.67     22     0.15        14.67     50     0.17        16.67
5           17     0.11        11.33     28     0.19        18.67     45     0.15        15.00
6           21     0.14        14.00     19     0.13        12.67     40     0.13        13.33
Total       150    1.00        100.00    150    1.00        100.00    300    1.00        100.00

Prediction
Hit         25     0.17        16.67     29     0.19        19.33     54     0.18        18.00
Miss        125    0.83        83.33     121    0.81        80.67     246    0.82        82.00
Total       150    1.00        100.00    150    1.00        100.00    300    1.00        100.00

 

In the fair die model for this case, in long runs of tosses of the die, approximately 16⅔% of tosses show each of the face values “1”, “2”, “3”, “4”, “5” and “6”. The sample data are generally compatible with a fair die assumption (equally likely face values) and with a baseline expected prediction success rate of 1/6, or 16⅔%. Agreement with the model seems to improve with increasing sample size, but the samples do not fit the fair assumption exactly.
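
The fair die model can be explored with a short simulation. The Python sketch below is illustrative only: the notes do not say how predictions were made in class, so the uniform random guess used here is an assumption, and the function name simulate_predictions and the seed value are hypothetical choices.

```python
import random

def simulate_predictions(n_tosses, seed=None):
    """Toss a simulated fair die n_tosses times, guessing a face before each toss."""
    rng = random.Random(seed)
    counts = {face: 0 for face in range(1, 7)}
    hits = 0
    for _ in range(n_tosses):
        guess = rng.randint(1, 6)   # assumed guessing rule: uniform random guess
        toss = rng.randint(1, 6)    # fair die: each face equally likely
        counts[toss] += 1
        hits += (guess == toss)
    return counts, hits

# Sample sizes used in the session (50, 100, 300) plus one much longer run.
for n in (50, 100, 300, 30000):
    counts, hits = simulate_predictions(n, seed=2010)
    pct = {face: round(100 * c / n, 2) for face, c in counts.items()}
    print(n, pct, "hit %:", round(100 * hits / n, 2))
```

Under the fair die model, any guessing rule that does not see the toss in advance has the same long-run hit rate of 1/6, so the choice of rule in the sketch should not matter much.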

Case Study 1.1: A Color Bowl

In random sampling, we might not get a complete list of colors; only a total sample (census) guarantees that kind of listing. The sample proportion of each listed color approximates the corresponding model proportion in the bowl itself. In census sampling, every object in the bowl is counted. The listing is complete, and the model proportions may be calculated directly.

The basic idea in case study 1.1 is that random samples give imperfect pictures of what is being sampled. However, with sufficiently large samples, these samples can reliably yield good pictures of the processes or populations being sampled. And the essence of many statistical applications is the study of selected processes or populations. For a sense of the efficiency of the samples, compare sample and true percentages.

Some Formulas – Proportions, Percentages, Counts

The class represents some property or attribute, for example, blue, or red. Each member, or unit, of a sample can be classified – the result of the classification of the unit is the unit’s class.

Sample Proportion (p)

nclass ~ number of units of sample in class

ntotal ~ total number of units in sample

pclass = nclass / ntotal

pclass ~ proportion of sample in class

 

Sample Percent (pct)

nclass ~ number of units of sample in class

ntotal ~ total number of units in sample

pclass = nclass / ntotal

pctclass = 100*(nclass / ntotal)

pctclass = 100* pclass

pctclass ~ percent of sample in class

 

Population Proportion (P)

Nclass ~ number of units of population in class

Ntotal ~ total number of units in population

Pclass = Nclass / Ntotal

Pclass ~ proportion of population in class

 

Population Percent (PCT)

Nclass ~ number of units of population in class

Ntotal ~ total number of units in population

Pclass = Nclass / Ntotal

PCTclass = 100*(Nclass / Ntotal)

PCTclass = 100* Pclass

PCTclass ~ percent of population in class

 

In this setting,

 

nblue ~ number of blue draws in sample

ntotal ~ total number of draws per sample

pblue = nblue / ntotal

pblue ~ proportion of sample draws showing blue

pctblue = 100*pblue

pctblue ~ percent of sample draws showing blue

 

Nblue ~ number of blue marbles in bowl

Ntotal ~ total number of marbles in bowl

Pblue = Nblue / Ntotal

Pblue ~ proportion of marbles in bowl that are blue

 

ngreen ~ number of green draws in sample

ntotal ~ total number of draws per sample

pgreen = ngreen / ntotal

pgreen ~ proportion of sample draws showing green

pctgreen = 100*pgreen

pctgreen ~ percent of sample draws showing green

 

Ngreen ~ number of green marbles in bowl

Ntotal ~ total number of marbles in bowl

Pgreen = Ngreen / Ntotal

Pgreen ~ proportion of marbles in bowl that are green

 

nred ~ number of red draws in sample

ntotal ~ total number of draws per sample

pred = nred / ntotal

pred ~ proportion of sample draws showing red

pctred = 100*pred

pctred ~ percent of sample draws showing red

 

Nred ~ number of red marbles in bowl

Ntotal ~ total number of marbles in bowl

Pred = Nred / Ntotal

Pred ~ proportion of marbles in bowl that are red

 

nyellow ~ number of yellow draws in sample

ntotal ~ total number of draws per sample

pyellow = nyellow / ntotal

pyellow ~ proportion of sample draws showing yellow

pctyellow = 100*pyellow

pctyellow ~ percent of sample draws showing yellow

 

Nyellow ~ number of yellow marbles in bowl

Ntotal ~ total number of marbles in bowl

Pyellow = Nyellow / Ntotal

Pyellow ~ proportion of marbles in bowl that are yellow
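
The proportion and percent formulas above can be written as two small functions. This is a minimal Python sketch; the function names proportion and percent are illustrative, and the counts are those recorded for Sample #1 of the 6:30 bowl session.

```python
def proportion(n_class, n_total):
    """p_class = n_class / n_total"""
    if n_total <= 0:
        raise ValueError("n_total must be positive; an empty sample has no proportions")
    return n_class / n_total

def percent(n_class, n_total):
    """pct_class = 100 * p_class"""
    return 100 * proportion(n_class, n_total)

# Color counts for Sample #1 (6:30 session), 50 draws with replacement.
sample_1 = {"Blue": 4, "Green": 11, "Red": 20, "Yellow": 15}
n_total = sum(sample_1.values())

for color, n_class in sample_1.items():
    print(color, n_class, proportion(n_class, n_total), percent(n_class, n_total))
```

The same two functions apply to the population quantities (P, PCT) when the bowl counts are used in place of the sample counts.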

Samples – Bowl

6:30

        Sample #1                         Sample #2                         Pooled12
Color   n     p=n/50       pct=100*p      n     p=n/50       pct=100*p      n12   p12=n12/100  pct12=100*p12
Blue    4     0.08         8              4     0.08         8              8     0.08         8
Green   11    0.22         22             8     0.16         16             19    0.19         19
Red     20    0.4          40             14    0.28         28             34    0.34         34
Yellow  15    0.3          30             24    0.48         48             39    0.39         39
Total   50    1            100            50    1            100            100   1            100

        Sample #3                         Sample #4                         Pooled34
Color   n     p=n/50       pct=100*p      n     p=n/50       pct=100*p      n34   p34=n34/100  pct34=100*p34
Blue    8     0.16         16             6     0.12         12             14    0.14         14
Green   8     0.16         16             14    0.28         28             22    0.22         22
Red     18    0.36         36             21    0.42         42             39    0.39         39
Yellow  16    0.32         32             9     0.18         18             25    0.25         25
Total   50    1            100            50    1            100            100   1            100

Sample #5, Sample #6, Pooled56: no data recorded in the 6:30 table (all counts are 0, so the proportions and percentages are undefined).

        Pooled13             Pooled24             PooledAll
Color   n13   p13    pct13   n24   p24    pct24   nAll  pAll=nAll/200  pctAll=100*pAll
Blue    12    0.12   12      10    0.1    10      22    0.11           11
Green   19    0.19   19      22    0.22   22      41    0.205          20.5
Red     38    0.38   38      35    0.35   35      73    0.365          36.5
Yellow  31    0.31   31      33    0.33   33      64    0.32           32
Total   100   1      100     100   1      100     200   1              100

Truth

Color   n    p       pct
Blue    3    0.125   12.5
Green   5    0.2083  20.83333
Red     9    0.375   37.5
Yellow  7    0.2917  29.16667
Total   24   1       100

8:00

        Sample #1                         Sample #2                         Pooled12
Color   n     p=n/50       pct=100*p      n     p=n/50       pct=100*p      n12   p12=n12/100  pct12=100*p12
Blue    11    0.22         22             7     0.14         14             18    0.18         18
Green   4     0.08         8              5     0.1          10             9     0.09         9
Red     11    0.22         22             14    0.28         28             25    0.25         25
Yellow  24    0.48         48             24    0.48         48             48    0.48         48
Total   50    1            100            50    1            100            100   1            100

        Sample #3                         Sample #4                         Pooled34
Color   n     p=n/50       pct=100*p      n     p=n/50       pct=100*p      n34   p34=n34/100  pct34=100*p34
Blue    11    0.22         22             12    0.24         24             23    0.23         23
Green   6     0.12         12             2     0.04         4              8     0.08         8
Red     13    0.26         26             18    0.36         36             31    0.31         31
Yellow  20    0.4          40             18    0.36         36             38    0.38         38
Total   50    1            100            50    1            100            100   1            100

        Sample #5                         Sample #6                         Pooled56
Color   n     p=n/50       pct=100*p      n     p=n/50       pct=100*p      n56   p56=n56/100  pct56=100*p56
Blue    10    0.2          20             10    0.2          20             20    0.2          20
Green   8     0.16         16             4     0.08         8              12    0.12         12
Red     11    0.22         22             10    0.2          20             21    0.21         21
Yellow  21    0.42         42             26    0.52         52             47    0.47         47
Total   50    1            100            50    1            100            100   1            100

        Pooled135                             Pooled246                             PooledAll
Color   n135  p135=n135/150  pct135=100*p135  n246  p246=n246/150  pct246=100*p246  nAll  pAll=nAll/300  pctAll=100*pAll
Blue    32    0.213          21.333           29    0.193          19.333           61    0.203          20.333
Green   18    0.120          12.000           11    0.073          7.333            29    0.097          9.667
Red     35    0.233          23.333           42    0.280          28.000           77    0.257          25.667
Yellow  65    0.433          43.333           68    0.453          45.333           133   0.443          44.333
Total   150   1              100              150   1              100              300   1              100

Truth

Color   n    p       pct
Blue    4    0.182   18.182
Green   2    0.091   9.091
Red     6    0.273   27.273
Yellow  10   0.455   45.455
Total   22   1       100

You should be able to begin with the counts in the table and work out the proportions and percentages.

The True State of the Bowl

6:30

Color   N    P       PCT
Blue    3    0.125   12.5
Green   5    0.2083  20.83333
Red     9    0.375   37.5
Yellow  7    0.2917  29.16667
Total   24   1       100

The true proportions are probabilities:

In long runs of draws with replacement from the bowl, approximately 12.5 percent of draws show blue.

In long runs of draws with replacement from the bowl, approximately 20.8 percent of draws show green.

In long runs of draws with replacement from the bowl, approximately 37.5 percent of draws show red.

In long runs of draws with replacement from the bowl, approximately 29.2 percent of draws show yellow.
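
The long-run reading of these probabilities can be illustrated by simulation. The Python sketch below draws with replacement from a simulated copy of the 6:30 bowl (3 blue, 5 green, 9 red, 7 yellow); the function name draw_with_replacement, the seed and the run lengths are illustrative assumptions, and the simulated draws are not the class data.

```python
import random

# Simulated copy of the 6:30 bowl: 24 marbles.
BOWL = ["Blue"] * 3 + ["Green"] * 5 + ["Red"] * 9 + ["Yellow"] * 7

def draw_with_replacement(bowl, n_draws, seed=None):
    """Count colors over n_draws single draws, returning the marble each time."""
    rng = random.Random(seed)
    counts = {color: 0 for color in dict.fromkeys(bowl)}
    for _ in range(n_draws):
        counts[rng.choice(bowl)] += 1   # the bowl is unchanged between draws
    return counts

for n in (50, 300, 100_000):
    counts = draw_with_replacement(BOWL, n, seed=630)
    print(n, {color: round(100 * k / n, 1) for color, k in counts.items()})
```

Short runs typically wander around the true percentages, while the longest run should sit close to 12.5, 20.8, 37.5 and 29.2 percent.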

8:00

Color   N    P       PCT
Blue    4    0.182   18.182
Green   2    0.091   9.091
Red     6    0.273   27.273
Yellow  10   0.455   45.455
Total   22   1       100

The true proportions are probabilities:

In long runs of draws with replacement from the bowl, approximately 18.2 percent of draws show blue.

In long runs of draws with replacement from the bowl, approximately 9.1 percent of draws show green.

In long runs of draws with replacement from the bowl, approximately 27.3 percent of draws show red.

In long runs of draws with replacement from the bowl, approximately 45.5 percent of draws show yellow.

Sample versus Population

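6:30 (pooled sample percent versus population percent)

Blue: 11.0% versus 12.5%

Green: 20.5% versus 20.8%

Red: 36.5% versus 37.5%

Yellow: 32.0% versus 29.2%

8:00 (pooled sample percent versus population percent)

Blue: 20.3% versus 18.2%

Green: 9.7% versus 9.1%

Red: 25.7% versus 27.3%

Yellow: 44.3% versus 45.5%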

We see reasonable, but not exact matches between the sample proportions (p) and the probabilities (P).
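
The comparison can be reproduced directly from the counts. The Python sketch below uses the 6:30 pooled counts (200 draws) and the true bowl counts (24 marbles) from the tables in these notes; the helper name percentages is an illustrative choice.

```python
# 6:30 session: pooled sample counts (200 draws) and true bowl counts (24 marbles).
pooled_counts = {"Blue": 22, "Green": 41, "Red": 73, "Yellow": 64}
bowl_counts = {"Blue": 3, "Green": 5, "Red": 9, "Yellow": 7}

def percentages(counts):
    """Convert class counts to percentages of the total count."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}

sample_pct = percentages(pooled_counts)   # pct: sample percentages
true_pct = percentages(bowl_counts)       # PCT: population percentages
errors = {color: sample_pct[color] - true_pct[color] for color in bowl_counts}

for color in bowl_counts:
    print(f"{color}: {sample_pct[color]:.1f}% versus {true_pct[color]:.1f}% "
          f"(difference {errors[color]:+.1f} points)")
```

For these data, every pooled sample percentage lies within three percentage points of the corresponding population percentage.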

We did not get to the Ellsberg games in class, but look them up.

Regarding Ellsberg I 

The 1st Game: The first bowl is split 50%/50% between blue and green. The best we can do is break even, regardless of strategy. The simplest strategy involves picking one of the colors and always betting on that color.

The 2nd Game: The second bowl is an unknown composite of red and yellow. We might be able to win this game if 1) there is a dominant color and 2) we can determine that dominant color. A simple strategy here is to pick one color and ride it for a while. Then stop betting and check the number of winning bets. If the color being bet on is losing on a regular basis, switch colors.

The 3rd Game: This game only makes sense if the second bowl is dominant in red. Bet on red; if red consistently shows, stay on the second bowl. Otherwise, either stop playing or stick with the first bowl.
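
The break-even claim for the first game, and the possible advantage in the second, can be checked with a small expected-value calculation. The notes do not state the payoffs, so the Python sketch below assumes an even-money bet (win the stake or lose the stake) on draws with replacement; the function name expected_gain_per_bet and the example proportions are hypothetical.

```python
from fractions import Fraction

def expected_gain_per_bet(p_win, stake=1):
    """Expected gain of one bet, assuming even money: +stake on a win, -stake on a loss."""
    p = Fraction(p_win)
    return stake * p - stake * (1 - p)

# 1st game: the bowl is 50%/50%, so either color wins with probability 1/2.
print("50/50 bowl:", expected_gain_per_bet(Fraction(1, 2)))

# 2nd game: the red share is unknown; these values are illustrative only.
for p_red in (Fraction(3, 10), Fraction(1, 2), Fraction(7, 10)):
    print("red share", p_red,
          "| bet red:", expected_gain_per_bet(p_red),
          "| bet yellow:", expected_gain_per_bet(1 - p_red))
```

Under the even-money assumption the expected gain is 2p - 1 per unit staked, which is zero at p = 1/2 and positive only when betting on a color that makes up more than half of the bowl.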

Regarding Ellsberg II

The 1st Game: The first bowl is 20% red / 40% black / 40% white. The simplest strategy involves picking one of the colors and always betting on that color. Regardless of betting choice, each bet carries a 40% chance of losing the bet and a 20% chance of getting kicked out of the game.

The 2nd Game: The second bowl is 20% red / 80% black or white. The simplest strategy involves picking one of the colors and always betting on that color. If either white or black is sufficiently dominant, this game might be worth playing. The problem is that regardless of the possible advantage in the white/black part of the bowl, there is still a 20% chance of getting killed (permanently losing). But to detect this advantage, one is forced to pick a betting color (white or black) and spend some money.
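
The cost of the 20% red share grows with the number of bets. As a rough Python sketch, assuming independent draws with replacement and that a single red draw ends the game permanently, the chance of still being in the game after a given number of bets is computed below; the function name survival_probability is illustrative.

```python
def survival_probability(n_bets, p_red=0.2):
    """Chance that none of n_bets independent draws shows red."""
    return (1 - p_red) ** n_bets

for n_bets in (1, 3, 5, 10, 20):
    print(n_bets, round(survival_probability(n_bets), 3))
```

This is why spending bets to detect a black/white advantage is costly: every exploratory bet carries the same 20% risk of elimination.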

The idea underlying the Ellsberg games is to illustrate how decisions about selected processes or populations are made on the basis of random samples.