Chapter 1
Introduction

Chapter 2
Plotting data

Chapter 3
Summary statistics

Chapter 4
Probability

Chapter 5
The binomial distribution

Chapter 6
The Poisson distribution

Chapter 7
The normal distribution

Chapter 8
Error estimation

Chapter 9
Comparing distributions

Chapter 10
Regression

Chapter 11
Fractals and chaos

Chapter 12
Unsupervised learning

Chapter 13
Supervised learning

Chapter 14
Compositional data

Chapter 15
Directional data

Chapter 16
Spatial data

Contents

Chapter 1Introduction

Chapter 2Plotting data

Chapter 3Summary statistics

Chapter 4Probability

Chapter 5The binomial distribution

Chapter 6The Poisson distribution

Chapter 7The normal distribution

Chapter 8Error estimation

Chapter 9Comparing distributions

Chapter 10Regression

Chapter 11Fractals and chaos

Chapter 12Unsupervised learning

Chapter 13Supervised learning

Chapter 14Compositional data

Chapter 15Directional data

Chapter 1
Introduction

Chapter 2
Plotting data

Chapter 3
Summary statistics

Chapter 4
Probability

Chapter 5
The binomial distribution

Chapter 6
The Poisson distribution

Chapter 7
The normal distribution

Chapter 8
Error estimation

Chapter 9
Comparing distributions

Chapter 10
Regression

Chapter 11
Fractals and chaos

Chapter 12
Unsupervised learning

Chapter 13
Supervised learning

Chapter 14
Compositional data

Chapter 15
Directional data

Chapter 16
Spatial data

Chapter 17
An introduction to R

Chapter 18
Exercises

Chapter 19
Solutions

Chapter 20
Tables

2.1 Data types

2.2 Histograms

2.3 Kernel density estimation

2.4 Data transformations

2.5 Multivariate distributions

2.6 Empirical cumulative distribution functions

3.1 Location

3.2 Dispersion

3.3 Shape

3.4 Box-and-whisker plots

4.1 Permutations

4.2 Combinations

4.3 Conditional probability

5.1 Parameter estimation

5.2 Hypothesis tests

5.3 Statistical power

5.4 Type-I and type-II errors

5.5 Pitfalls of statistical hypothesis testing

5.6 Confidence intervals

6.1 Probability mass function

6.2 Parameter estimation

6.3 Hypothesis tests

6.4 Multiple testing

6.5 Confidence intervals

7.1 The Central Limit Theorem

7.2 The multivariate normal distribution

7.3 Properties

7.4 Parameter estimation

8.1 Error propagation

8.2 Examples

8.3 Standard deviation vs. standard error

8.4 Fisher Information

9.1 Q-Q plots

9.2 The t-test

9.3 Confidence intervals

9.4 The χ²-test

9.5 Comparing two or more samples

9.6 Cherry picking (Type-I errors revisited)

9.7 Effect size (Type-II errors revisited)

9.8 Non-parametric tests

10.1 The correlation coefficient

10.2 Least Squares

10.3 Maximum Likelihood

10.4 Common mistakes

10.5 Weighted regression

11.1 Power law distributions

11.2 How long is the coast of Britain?

11.3 Fractals

11.4 Chaos

12.1 Principal Component Analysis

12.2 Multidimensional Scaling

12.3 K-means clustering

12.4 Hierarchical clustering

13.1 Discriminant Analysis

13.2 Decision trees

14.1 Ratios of Jeffreys quantities

14.2 Logratio transformations

14.3 PCA of compositional data

14.4 LDA of compositional data

14.5 Logratio processes

15.1 Circular data

15.2 Circular distributions

15.3 Spherical data

15.4 Spherical distributions

15.5 Orientations

2.1 Data types

2.2 Histograms

2.3 Kernel density estimation

2.4 Data transformations

2.5 Multivariate distributions

2.6 Empirical cumulative distribution functions

3.1 Location

3.2 Dispersion

3.3 Shape

3.4 Box-and-whisker plots

4.1 Permutations

4.2 Combinations

4.3 Conditional probability

5.1 Parameter estimation

5.2 Hypothesis tests

5.3 Statistical power

5.4 Type-I and type-II errors

5.5 Pitfalls of statistical hypothesis testing

5.6 Confidence intervals

6.1 Probability mass function

6.2 Parameter estimation

6.3 Hypothesis tests

6.4 Multiple testing

6.5 Confidence intervals

7.1 The Central Limit Theorem

7.2 The multivariate normal distribution

7.3 Properties

7.4 Parameter estimation

8.1 Error propagation

8.2 Examples

8.3 Standard deviation vs. standard error

8.4 Fisher Information

9.1 Q-Q plots

9.2 The t-test

9.3 Confidence intervals

9.4 The χ2-test

9.5 Comparing two or more samples

9.6 Cherry picking (Type-I errors revisited)

9.7 Effect size (Type-II errors revisited)

9.8 Non-parametric tests

10.1 The correlation coefficient

10.2 Least Squares

10.3 Maximum Likelihood

10.4 Common mistakes

10.5 Weighted regression

11.1 Power law distributions

11.2 How long is the coast of Britain?

11.3 Fractals

11.4 Chaos

12.1 Principal Component Analysis

12.2 Multidimensional Scaling

12.3 K-means clustering

12.4 Hierarchical clustering

13.1 Discriminant Analysis

13.2 Decision trees

14.1 Ratios of Jeffreys quantities

14.2 Logratio transformations

14.3 PCA of compositional data

14.4 LDA of compositional data

14.5 Logratio processes

15.1 Circular data

15.2 Circular distributions

15.3 Spherical data

15.4 Spherical distributions

9.4 The χ²-test

Statistics for geoscientists

Pieter Vermeesch
Department of Earth Sciences
University College London
p.vermeesch@ucl.ac.uk
______________

1 Introduction
2 Plotting data
2.1 Data types
2.2 Histograms
2.3 Kernel density estimation
2.4 Data transformations
2.5 Multivariate distributions
2.6 Empirical cumulative distribution functions
3 Summary statistics
3.1 Location
3.2 Dispersion
3.3 Shape
3.4 Box-and-whisker plots
4 Probability
4.1 Permutations
4.2 Combinations
4.3 Conditional probability
5 The binomial distribution
5.1 Parameter estimation
5.2 Hypothesis tests
5.3 Statistical power
5.4 Type-I and type-II errors
5.5 Pitfalls of statistical hypothesis testing
5.6 Confidence intervals
6 The Poisson distribution
6.1 Probability mass function
6.2 Parameter estimation
6.3 Hypothesis tests
6.4 Multiple testing
6.5 Confidence intervals
7 The normal distribution
7.1 The Central Limit Theorem
7.2 The multivariate normal distribution
7.3 Properties
7.4 Parameter estimation
8 Error estimation
8.1 Error propagation
8.2 Examples
8.3 Standard deviation vs. standard error
8.4 Fisher Information
9 Comparing distributions
9.1 Q-Q plots
9.2 The t-test
9.3 Confidence intervals
9.4 The χ²-test
9.5 Comparing two or more samples
9.6 Cherry picking (Type-I errors revisited)
9.7 Effect size (Type-II errors revisited)
9.8 Non-parametric tests
10 Regression
10.1 The correlation coefficient
10.2 Least Squares
10.3 Maximum Likelihood
10.4 Common mistakes
10.5 Weighted regression
11 Fractals and chaos
11.1 Power law distributions
11.2 How long is the coast of Britain?
11.3 Fractals
11.4 Chaos
12 Unsupervised learning
12.1 Principal Component Analysis
12.2 Multidimensional Scaling
12.3 K-means clustering
12.4 Hierarchical clustering
13 Supervised learning
13.1 Discriminant Analysis
13.2 Decision trees
14 Compositional data
14.1 Ratios of Jeffreys quantities
14.2 Logratio transformations
14.3 PCA of compositional data
14.4 LDA of compositional data
14.5 Logratio processes
15 Directional data
15.1 Circular data
15.2 Circular distributions
15.3 Spherical data
15.4 Spherical distributions
15.5 Orientations
16 Spatial data
16.1 The semivariogram
16.2 Semivariogram models
16.3 Kriging interpolation
17 An introduction to R
17.1 The basics
17.2 Plotting data
17.3 Summary Statistics
17.4 Probability
17.5 The binomial distribution
17.6 The Poisson distribution
17.7 The normal distribution
17.8 Error propagation
17.9 Comparing distributions
17.10 Regression
17.11 Fractals and chaos
17.12 Unsupervised learning
17.13 Supervised learning
17.14 Compositional data
17.15 Directional data
17.16 Spatial data
18 Exercises
18.1 The basics
18.2 Plotting data
18.3 Summary statistics
18.4 Probability
18.5 The binomial distribution
18.6 The Poisson distribution
18.7 The normal distribution
18.8 Error propagation
18.9 Comparing distributions
18.10 Regression
18.11 Fractals and chaos
18.12 Unsupervised learning
18.13 Supervised learning
18.14 Compositional data
18.15 Directional data
18.16 Spatial data
19 Solutions
19.1 The basics
19.2 Plotting data
19.3 Summary statistics
19.4 Probability
19.5 The binomial distribution
19.6 The Poisson distribution
19.7 The normal distribution
19.8 Error propagation
19.9 Comparing distributions
19.10 Regression
19.11 Fractals and chaos
19.12 Unsupervised learning
19.13 Supervised learning
19.14 Compositional data
19.15 Directional data
19.16 Spatial data
20 Tables

According to the Oxford dictionary of English, the definition of ‘statistics’ is:

The practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

The words ‘inferring’ and ‘sample’ are written in bold face because they are really central to the practice and purpose of Science in general, and Geology as a whole. For example:

1.: The true proportion of quartz grains in a sand deposit is unknown but can be estimated by counting a number of grains from a representative sample.
2.: The true crystallisation age of a rock is unknown but can be estimated by measuring the ²⁰⁶Pb/²³⁸U-ratio in a representative number of U-bearing mineral grains from that rock.
3.: The true ²⁰⁶Pb/²³⁸U-ratio of a U-bearing mineral is unknown but can be estimated by repeatedly measuring the ratio of ²⁰⁶Pb- and ²³⁸U- ions extracted from that mineral in a mass spectrometer.
4.: The spatial distribution of arsenic in groundwater is unknown but can be estimated by measuring the arsenic content of a finite number of water wells.

Thus, pretty much everything that we do as Earth Scientists involves statistics in one way or another. We need statistics to plot, summarise, average, interpolate and extrapolate data; to quantify the precision of quantitative inferences; to fit analytical models to our measurements; to classify data and to extract patterns from it.

The field of statistics is as vast as it is deep and it is simply impossible to cover all its aspects in a single text. These notes are meant as a basic statistical ‘survival guide’ for geoscientists, which may be used as a starting point for further study. There are dozens of statistics textbooks that could be used for this purpose. Two books that I have consulted whilst writing these notes are “Mathematical Statistics and Data Analysis” by John A. Rice (Duxbury Press), which provides an in-depth introduction to general statistics; and “Statistics and Data Analysis in Geology” by John C. Davis (Wiley), which presents an excellent and very detailed introduction to geoscience-specific statistics.

Important though a basic understanding of statistics may be, it is only when this understanding is combined with some programming skills that it becomes truly useful. These notes take a hands-on approach, using a popular statistical programming language called R. To make life easier and lower the barrier to using the statistical methods described herein, a large number of datasets and predefined R functions have been bundled in an accompanying R-package called geostats. To demonstrate the versatility of R, all the 189 figures in these notes were made with R, without involving any other graphics software. Most of these figures can be reproduced using the geostats package.

These notes begin by introducing the basic principles of general statistics, before moving on to ‘geological’ data. Their main purpose is to instill a critical attitude in the student, and to raise an awareness of the many pitfalls of blindly applying statistical ‘black boxes’ to geological problems. For example, Chapters 2 and 3 introduce basic notions of exploratory data analysis using plots and summary statistics. Right from the start they will show that geological data, such as sedimentary clast sizes or reservoir porosity measurements, cannot readily be analysed using these techniques without some simple pre-treatment.

Chapters 4, 5 and 6 introduce the basic principles of probability theory, and uses these to derive two classical examples of discrete probability distributions, namely the binomial and Poisson distributions. These two distributions are used as a vehicle to introduce the important concepts of parameter estimation, hypothesis tests and confidence intervals. Here the text goes into considerably more detail than it does elsewhere. In a departure from many other introductory texts, Section 5.1 briefly introduces the method of maximum likelihood. This topic is revisited again in Sections 6.2, 7.4, 8.4 and 10.3. Even though few geoscientists may use maximum likelihood theory themselves, it is nevertheless useful that they are aware of its main principles, because these principles underlie many common statistical tools.

Chapter 7 discusses the normal or Gaussian distribution. It begins with an empirical demonstration of the central limit theorem in one and two dimensions. This demonstration serves as an explanation of “what is so normal about the Gaussian distribution?”. This is an important question because much of the remainder of the notes (including Chapters 11, 14 and 15) point out that many geological datasets are not normal. Ignoring this non-normality can lead to biased and sometimes nonsensical results.

Scientists use measurements to make quantitative inferences about the world. Random sampling fluctuations can have a significant effect on these inferences, and it is important that these are evaluated before drawing any scientific conclusions. In fact, it may be argued that estimating the precision of a quantitative inference is as important as the inference itself. Chapter 8 uses basic calculus to derive some generic error propagation formulas. However, these formulas are useful regardless of whether one is able to derive them or not. The error propagation chapter points out a strong dependency of statistical precision on sample size.

This sample size dependency is further explored in Chapter 9, which introduces a number of statistical tests to compare one dataset with a theoretical distribution, or to compare two datasets with each other. Useful though these tests are in principle, in practice their outcome should be treated with caution. Here we revisit an important caveat that is also discussed in Section 5.5, namely that the power of statistical tests to detect even the tiniest violation of a null hypothesis, monotonically increases with sample size. This is the reason why statistical hypothesis tests have come under increasing scrutiny in recent years. Chapter 9 advocates that, rather than testing whether a hypothesis is true or false, it is much more productive to quantify ‘how false’ said hypothesis is.

Chapter 10 introduces linear regression using the method of least squares. But whereas most introductory statistics texts only use least squares, these notes also demonstrate the mathematical equivalence of least squares with the method of maximum likelihood using normal residuals. The advantage of this approach is that it facilitates the derivation of confidence intervals and prediction intervals for linear fits, using the error propagation methods of Chapter 8. The method of maximum likelihood also provides a natural route towards more advanced weighted regression algorithms that account for correlated uncertainties in both variables (Section 10.5).

Chapter 11 covers the subjects of fractals and chaos. These topics may seem a bit esoteric at first glance, but do play a key role in the geosciences. We will see that the scale invariance of many geological processes gives rise to data that cannot be captured by conventional probability distributions. Earthquake magnitudes, fracture networks and the grain size distribution of glacial till are just three examples of quantities that follow power law distributions, which do not have a well defined mean. In the briefest of introductions to the fascinating world of deterministic chaos, Section 11.4 uses a simple magnetic pendulum example to introduce the ‘butterfly effect’, which describes the phenomenon whereby even simple systems of nonlinear equations can give rise to unpredictable and highly complex behaviour.

Chapters 12 and 13 introduce some simple algorithms for unsupervised and supervised machine learning, respectively. As the name suggests, unsupervised learning helps scientists visualise and analyse large datasets without requiring any prior knowledge of their underlying structure. It includes multivariate ordination techniques such as Principal Component Analysis (PCA, Section 12.1) and Multidimensional Scaling (MDS, Section 12.2); as well as the k-means (Section 12.3) and hierarchical (Section 12.4) clustering algorithms. Supervised learning algorithms such as discriminant analysis (Section 13.1) and decision trees (Section 13.2) define decision rules to identify pre-defined groups in some training data. These rules can then be used to assign a dataset of new samples to one of these same groups.

Chapter 14 discusses compositional data, which are one of the most common data types in the Earth Sciences. Chemical compositions and the relative abundances of minerals, isotopes and animal species are all strictly positive quantities that can be renormalised to a constant sum. They cannot be safely analysed by statistical methods that assume normal distributions. Even the simplest statistical operations such as the calculation of an arithmetic mean or a standard deviation can go horribly wrong when applied to compositional data. Fortunately, the compositional data conundrum can be solved using a simple data transformation using logratios (Section 14.2). Following logratio transformation, compositional datasets can be safely averaged or analysed by more sophisticated statistical methods.

Directional data such as compass bearings and geographical coordinates are another type of measurement that is probably more common in the Earth Sciences than in any other discipline. Chapter 15 shows that, like the compositional data of Chapter 14, also directional data cannot be safely analysed with ‘normal statistics’. For example, the arithmetic mean of two angles 1^∘ and 359^∘ is 180^∘, which is clearly a nonsensical result. Chapter 15 introduces the vector sum as a better way to average directional data, both in one dimension (i.e., a circle, Sections 15.1–15.2) and two dimensions (i.e., a sphere, Sections 15.3–15.4).

The spatial dimension is crucially important for the Earth Sciences. Geostatistics is a subfield of statistics that specifically deals with this dimension. Chapter 16 provides a basic introduction to kriging interpolation, which is a widely used method to create topographical and compositional maps from a set of sparsely distributed measurements. This chapter concludes the theoretical half of these notes. The remainder of the text (Chapters 17–19) is dedicated to practical exercises.

Chapter 17 provides an extensive introduction to the R programming language. It comprises 16 sections, one for each preceding chapter, and reproduces most of the examples that were covered in the first half of the text. Fortunately, several statistical procedures that are quite complicated to do manually, turn out to be really simple in R. For example, the binomial hypothesis test of Section 5.2 involves six different manual steps, or one simple R command called binom.test. And the Poisson test of Section 6.3 is implemented in an equally simple function called poisson.test.

Despite this simplicity, learning R does present the reader with a steep learning curve. The only way to climb this curve is to practice. One way to do so is to solve the exercises in Chapter 18. Like the R tutorials of Chapter 17, these exercises are also grouped in 16 sections, one for each of the 16 theory chapters. Most of the exercises can be solved with R, using the functions provided in Chapter 17. Fully worked solutions are provided in Chapter 19, which again comprises 16 sections.

I am grateful to my students Peter Elston, Veerle Troelstra and Shiqi Zheng for pointing out mistakes and inconsistencies in earlier versions of these notes. I would also like to thank Jianmei Wang for thoroughly checking Chapters 1–6. The source code of the geostats package and of these notes are provided on GitHub at https://github.com/pvermees/geostats/. Should you find any remaining mistakes in the text or bugs in the software, then you can correct them there and issue a pull request. Or alternatively, you can simply drop me an email. Thanks in advance for your help!

Pieter Vermeesch
University College London
November 2025

A picture says more than a thousand words and nowhere is this more true than in statistics. So before we explore the more quantitative aspects of data analysis, it is useful to visualise the data. Consider, for example, the following four bivariate datasets (Anscombe’s quartet¹ ):

I		II		III		IV
x	y	x	y	x	y	x	y

10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

Table 2.1: Anscombe’s quartet of bivariate data pairs.

For all four datasets (I – IV):

the mean (see Chapter 3) of x is 9
the variance (see Chapter 3) of x is 11
the mean of y is 7.50
the variance of y is 4.125
the correlation coefficient (see Chapter 10) between x and y is 0.816
the best fit line (see Chapter 10) is given by y = 3.00 + 0.500x

So at the summary level, it would appear that all four datasets are identical. However, when we visualise the data as bivariate scatter plots, they turn out to be very different:

Figure 2.1: Anscombe’s quartet shown as bivariate scatter plots.

Bivariate scatter plots are just one way to visualise analytical data. Many other graphical devices exist, each of which is appropriate for a particular type of data. The following section of this chapter will introduce a number of these data types, and the associated plots.

Consider the following data table, which reports six different properties of twenty fictitious river catchments:

#	lithology	age	springs	pH	Ca/Mg	vegetation

1	basalt	Cz	0	6.2	0.35	5.8
2	granite	Mz	1	4.4	11.00	28
3	basalt	Mz	0	5.6	6.00	12
4	sandstone	Cz	6	5.2	1.80	27
5	shale	pC	3	4.5	2.30	40
6	basalt	Mz	1	5.4	0.59	12
7	shale	Pz	2	4.8	8.40	3.8
8	basalt	Pz	0	5.9	2.90	6.3
9	shale	Mz	1	3.8	5.90	17
10	shale	Mz	2	4.0	2.10	16
11	sandstone	Pz	4	5.0	1.20	95
12	shale	pC	2	4.1	2.10	94
13	sandstone	Mz	4	5.2	1.10	92
14	basalt	Cz	1	5.5	1.60	88
15	sandstone	Pz	4	5.3	0.90	88
16	granite	Pz	0	4.6	1.70	70
17	basalt	Mz	0	5.7	3.40	92
18	shale	Pz	1	4.6	0.53	72
19	granite	Mz	0	4.6	2.20	74
20	basalt	Cz	1	5.6	7.70	84

Table 2.2: Twenty imaginary river catchments have been analysed for six different types of data: the lithology of the underlying bedrock and its stratigraphic age (where ‘Cz’ is Cenozoic, ‘Mz’ is Mesozoic, ‘Pz’ is Palaeozoic, and ‘pC’ is Precambrian); the number of natural springs in the catchments; the pH of the river water; its Ca/Mg ratio; and the percentage of the catchment area that is covered by vegetation.

The first three columns contain discrete data values and can be visualised as bar charts or histograms.

1.: Categorical data take on a limited number of values, assigning each ‘object’ to a particular unordered class or category. Geological examples of categorical data include animal species in a bone bed; the modal composition of a thin section; and lithologies in a mapping area (Figure 2.2.a).
2.: Ordinal data are a special type of categorical data, in which the classes are ordered but the distances between them are either irregular or unknown. Geological examples of ordinal quantities include Moh’s hardness scale; metamorphic grade; and the geologic timescale (Figure 2.2.b).
3.: Count data are a special type of ordinal data in which the categories are evenly spaced into integer intervals. Geological examples of count data include the number of gold chips found in a panning session; the annual number of earthquakes that exceed a certain magnitude; the number of dry wells in a wildcat drilling survey; and the number of natural springs in a river catchment (Figure 2.2.c).

Figure 2.2: Examples of discrete datasets, shown as bar charts. a) Categorical dataset of the dominant lithologies in 20 river catchments. The order of the categories along the horizontal axis is completely arbitrary and can be changed without loss of information. b) Ordinal dataset summarising the stratigraphic age of the catchments, as defined in the caption of Table 2.2. The five categories are listed in chronological order, but span vastly different lengths of time; c) The number of natural springs per catchment.

Not all geological or geophysical measurements take on discrete values. Many are free to take on decimal values. We can also subdivide these continuous data into further classes:

4.: Cartesian quantities² can have any decimal value, including positive and negative ones. Geoscientific examples of this data type include the magnitude of earthquakes; the spontaneous electrical potential between geological strata; or the pH of aqueous solutions (Figure 2.3.a).
5.: Jeffreys quantities² can only have positive values. Examples of this include mass, volume, density, speed, or the ratios of chemical concentrations (Figure 2.3.b).
6.: Proportions are quantities that are constrained to a finite interval from 0 to 1 (or from 0 to 100%). Examples of this include chemical concentrations; porosities; and volume or area fractions (Figure 2.3.c).

Figure 2.3: Three continuous datasets, shown as histograms, reporting the following properties of the 20 river catchments of Table 2.2: a) pH of the river water (free to take on positive and negative values); b) the Ca/Mg ratio of the water (non-negative numbers); and c) the proportion of the catchment area covered by vegetation (values between 0 and 100%).

Discrete (categorical, ordinal and count) data naturally fall into bins and the most obvious way to visualise them is as bar charts (Figure 2.2). Likewise, continuous (Cartesian, Jeffreys, proportions) data can also be collected into bins and plotted as histograms (Figure 2.3). However, this binning exercise poses two practical problems.

1.

How many bins should be used, and how wide should they be?

The number of bins strongly affects the appearance of a histogram. Figure 2.4 shows the pH data on two histograms with the individual measurements marked as vertical ticks underneath. This is also known as a rug plot, and allows us to better assess the effect of bin width on the appearance of histograms:

Figure 2.4: Histogram a) uses a bin width of 1 pH unit whereas histogram b) uses a bin width of 0.5 pH units. The two histograms look considerably different and it is not immediately clear which choice of bin width is best.

A number of rules of thumb are available to choose the optimal number of bins. For example, Excel uses a simple square root rule:

√ -- #bins = ⌈ n⌉

(2.1)

where n is the number of observations (i.e. n = 20 for the pH example) and ⌈∗⌉ means the ‘ceiling’ of ∗, i.e. the value rounded up to the next integer. R uses Sturges’ Rule:

#bins = ⌈log2[n]+ 1⌉

(2.2)

however no rule of thumb is optimal in all situations.

2.

Where to place the bins?

Even when the number of bins has been fixed, just shifting them slightly to the left or to the right can have a significant effect on the appearance of the histogram. For example:

Figure 2.5: Two histograms of the pH data whose bin widths are the same, but whose bins have been offset by 0.25 pH units. This arbitrary decision strongly affects the appearance of the histogram.

There is no optimal solution to the bin placement problem. This reflects a limitation of histograms as a way to present continuous data.

To overcome the bin placement problem, let us explore a variant of the ordinary histogram that is constructed as follows:

1.: Rank the measurements from low to high along a line.
2.: Place a rectangular ‘box’ on top of each measurement.
3.: Stack the boxes to create one connected line.

Figure 2.6: Construction of a stacked histogram of three data points, shown as a rug plot along the bottom axis. The rectangular ‘kernels’ (with bandwidth h) shown in grey are centred around each of these data points. The black step function is obtained by taking the sum of these boxes. This procedure removes the need to choose bin locations.

Normalising the area under the resulting curve produces a so-called Kernel Density Estimate (KDE). The mathematical definition of this function is:

n ( ) 1--∑ x−--xi KDE (x) = nh K h i=1

(2.3)

where x_i is the i^th measurement (out of n), h is the ‘bandwidth’ of the kernel density estimator (h = 1 in Figure 2.6), and K(u) is the ‘kernel’ function. For the rectangular kernel:

K (u ) = 1∕2 if |u| ≤ 1, and K (u) = 0 otherwise

(2.4)

where ‘|∗|’ stands for “the absolute value of ∗”. Applying this method to the pH data:

Figure 2.7: Rectangular KDE of the pH data, constructed using the same procedure as shown in Figure 2.6, but using a bandwidth h = 0.3276. To turn the curve into a ‘density’, the area under it has been normalised to unity.

Instead of a rectangular kernel, we could also use triangles to construct the KDE curve, or any other (symmetric) function. One popular choice is the Gaussian function:

1 [ u2] K (u) = √----exp − --- 2π 2

(2.5)

which produces a continuous KDE function:

Figure 2.8: Using a Gaussian kernel instead of a rectangular kernel on the three data points of Figure 2.6, with bandwidth h = 1. This produces a smooth KDE. The area under each of the kernels has been reduced to 1/3 so that the area under the KDE is 1.

Using the Gaussian kernel to plot the pH data:

Figure 2.9: Gaussian KDE of the pH data, with bandwidth h = 0.3276. The continuous curve does more justice to the continuous data than the discrete step function of Figures 2.4, 2.5 or 2.7.

Although kernel density estimation solves the bin placement problem, it is not entirely free of design decisions. The bandwidth h of a KDE fulfils a similar role as the bin width of a histogram. Changes in h affect the smoothness of the KDE curve:

Figure 2.10: KDEs of the pH data with a) a kernel bandwidth of h = 0.1; and b) a bandwidth of h = 1. Using a narrow bandwidth undermooths the data, whereas a wide bandwidth produces an oversmoothed distribution.

Bandwidth selection is a similar problem to bin width selection. A deeper discussion of this problem falls outside the scope of text. Suffice it to say that most statistical software (including R) use equivalent rules of thumb to Sturges’ Rule to set the bandwidth. But these values can be easily overruled by the user.

As discussed in Section 2.1, Jeffreys quantities such as the Ca/Mg ratios of Figure 2.3.b exist within an infinite half space between 0 and +∞. However, when we visualise them as a KDE, the left tail of this curve may extend into negative data space, implying that there is a finite chance of observing negative Ca/Mg ratios:

Figure 2.11: KDE of 20 Ca/Mg ratio measurements. Even though all the measurements are strictly positive, the curve extends into negative data space.

This problem can be solved by transforming the Jeffreys quantities to the entire infinite space of numbers by applying a logarithmic transformation:

u = ln[x ]

(2.6)

After constructing the KDE, the results can then be mapped³ back to linear space using an exponential transformation:

x = exp [u]

(2.7)

Figure 2.12: a) KDE of the Ca/Mg ratio measurements, after applying a logarithmic transformation. Note how the distribution has become more symmetric compared to the linear scale of Figure 2.11. b) The same KDE mapped back to the linear scale. Unlike Figure 2.11, the mapped distribution does not cross over into negative values.

Jeffreys quantities are just one example of constrained measurements. As another example, consider the vegetation data of Figure 2.3.c. Again, the Gaussian KDE of the data plots into physically impossible values of < 0 and > 100%:

Figure 2.13: KDE of 20 vegetation measurements. Even though all the measurements are between 0 and 100%, the KDE extends beyond these hard limits.

Using a similar approach as before, the dataspace can be opened up from the constraints of the 0 to 1 interval to the entire line of numbers, from −∞ to +∞. For proportions, this is achieved by the logistic transformation:

[ ] --x-- u = logit(x) = ln 1 − x

(2.8)

After constructing the density estimate (or carrying out any other numerical manipulation), the results can be mapped back to the 0 to 1 interval with the inverse logit transformation:

exp[u] x = logit−1(u) = ---------- exp [u] + 1

(2.9)

Figure 2.14: a) KDE of the vegetation data, after applying a logistic transformation. Note the two horizonal axes. The top axis marks the transformed values on a linear scale that extends from −∞ to +∞. The bottom axis is labelled by the actual vegetation values on a non-linear scale that extends from 0 to 100%. b) The same distribution mapped back to the 0 – 100% interval.

We will see in Chapter 14 that the logistic transformation is a special case of a general class of logratio transformations that are useful for the analysis of compositional data.

KDEs can be generalised from one to two dimensions. For example, consider a dataset of eruption timings from the Old Faithful geyser in Yellowstone national park (Wyoming, USA):

Figure 2.15: Old Faithful eruption measurements. The dataset records 272 observations of 2 variables: the duration of each eruption, and the waiting time between them. Both variables are expressed in minutes. The lower left panel shows the bivariate measurements as grey circles. The contour lines represent a 2-dimensional KDE. The marginal distributions of the waiting times (top) and eruption durations (right) are shown as 1-dimensional KDEs.

It is generally not possible to visualise datasets of more than two dimensions in a single graphic. In this case there are two options:

1.: plot the data as a series of 1- or 2-dimensional marginal plots; or
2.: extract the most important patterns or trends in the data by projection onto a lower dimensional plane. Then show these projected data as a lower dimensional graphic.

The second strategy is also known as “ordination” and will be discussed in detail in Section 12.1.

Both histograms and kernel density estimates require the selection of a ‘smoothing parameter’. For the histogram, this is the bin width; for the KDE, it is the bandwidth (h in Equation 2.3). Despite the existence of rules of thumb to automatically choose an appropriate value for the smoothing parameter, there nevertheless is a level of arbitrariness associated with them. The empirical cumulative distribution function (ECDF) is an alternative data visualisation device that does not require smoothing. An ECDF is step function that jumps up by 1∕n at each of n data points. The mathematical formula for this procedure can be written as:

n ∑ F (x) = I(xi < x)∕n i=1

(2.10)

where I(∗) = 1 if ∗ is ‘true’ and I(∗) = 0 if ∗ is ‘false’. The y-coordinates of the ECDF are values from 0 to 1 that mark the fraction of the measurements that are less than a particular value. Plotting the pH, Ca/Mg, vegetation and geyser data as ECDFs:

Figure 2.16: Empirical cumulative distribution functions (ECDFs) of, from left to right: a) the pH data (whose KDE is shown in Figure 2.9); b) the Ca/Mg ratios of Figure 2.11; c) the vegetation data of Figure 2.13; and d) the eruption time data of Figure 2.15. Note that ECDFs are only applicable to 1-dimensional datasets.

ECDFs do not require binning or selecting a bandwidth. Because they do not require smoothing, they do not spill over into physically impossible values for the Ca/Mg and vegetation data. The ECDF is completely determined by the raw data.

The visual interpretation of ECDFs is different from that of histograms or KDEs. Whereas different clusters of values stand out as ‘peaks’ in a histogram or KDE, they are marked by steep segments of the ECDF. For example, the two peaks in the KDE of the geyser data (Figure 2.15) correspond to two steps in the ECDF (Figure 2.16.d).

After a purely qualitative inspection of the data, we can now move on to a more quantitative description. This chapter will introduce a number of summary statistics to summarise larger datasets using just a few numerical values, including:

1.: a measure of location, representing the ‘average’ of the data;
2.: a measure of statistical dispersion, quantifying the spread of the data; and
3.: a measure of the shape of the distribution.

Before proceeding with this topic, it is useful to bear in mind that these summary statistics have limitations. The Anscombe quartet of Table 2.1 and Figure 2.1 showed that very different looking datasets can have identical summary statistics. But with this caveat in mind, summary statistics are an essential component of data analysis provided that they are preceded by a visual inspection of the data.

There are many ways to define the ‘average’ value of a multi-value dataset. In this chapter, we will introduce three of these but later chapters will introduce a few more.

1.

Mean. Given a dataset x = {x₁,…,x_i,…,x_n} comprising n values, the arithmetic mean (x) is defined as:

∑n ¯x = 1- xi n i=1

(3.1)

2.

Median. The median is obtained by ranking the observations according to size, and selecting the middle value. If n is an odd number, then:

med(x) = x < ...< x < ...< x 1 (n+1)∕2 n

(3.2)

If n is an even number, then the median is the average of the two numbers on either side of the middle. Graphically, the median can be identified as the value that corresponds to the halfway point of the ECDF.

3.

Mode. For discrete datasets, the mode is the most frequently occurring value. For continuous datasets, it can be roughly identified as the steepest region on the ECDF, or as the highest point on a histogram or KDE. Note that the outcome of the latter procedure may depend on the histogram bandwidth or KDE bandwidth.

Applying these three concepts to the pH data, the mean is given by:

6.2+ 4.4 + 5.6 + 5.2+ 4.5+ 5.4 + 4.8 + 5.9 + 3.9+ 4.0+ ¯x = --5.1+-4.1+--5.1-+-5.5-+-5.2+-4.6+-5.7-+-4.6-+-4.6-+-5.6--= 5.00 20

The median is obtained by ranking the values in increasing order, and marking the two middle values in bold:

3.8, 4.0, 4.1, 4.4, 4.5, 4.6, 4.6, 4.6, 4.8, 5.0, 5.2, 5.2, 5.3, 5.4, 5.5, 5.6, 5.6, 5.7, 5.9, 6.2

Then the median is the average of these two values (median[x] = 5.1). The most frequently occurring pH value in the dataset is 4.6, but this is a rounded value from a continuous distribution. If instead we use a KDE to identify the mode, then this yields a value of 5.4:

Figure 3.1: KDE (top) and ECDF (bottom) of the pH data (mean = 5.0, median = 5.1, mode = 5.4). The dash-dot line on the bottom panel marks halfway mark of the ECDF. The intersection of this line with the ECDF marks the median. For this dataset, all three measures of location are closely spaced together in the densest sampling area.

Calculating the same three summary statistics for the Ca/Mg data:

Figure 3.2: KDE and ECDF of the Ca/Mg data (mean = 3.2, median = 2.1, mode = 0.45). The mean is strongly affected by the long ‘tail’ of large outliers. It is 7 times greater than the mode. Only 6 out of 20 catchments (30%) are larger than the mean of the distribution and only 1 (5%) is smaller than the mode.

Finally, repeating the exercise one more time for the vegetation and geyser eruption data:

Figure 3.3: KDE and ECDF of a) the vegetation data (mean = 0.51, median = 0.55, mode = 0.97); and b) the geyser eruption data (mean = 3.5, median = 4.0, mode = 4.4). Both of these distributions are ‘bimodal’, meaning that they have two ‘peaks’ in the KDE, corresponding to two steep segments in the ECDFs. The dotted lines mark the highest one of them and ignores the other one. The mean and median fall in between the two modes and are not representative of the data.

Comparing the three sets of examples leads to the following conclusions:

1.: the mean is a meaningful measure of location for unimodal and symmetric distribution;
2.: the mean is strongly affected by outliers whereas the median is not;
3.: therefore the median is a more robust estimator of location for asymmetric datasets than the mean;
4.: multimodal datasets cannot be adequately summarised with a single location parameter.

It is rare for all the observations in a dataset to be exactly the same. In most cases they are spread out over a range of values. The amount of spread can be defined in a number of ways, the most common of which are:

1.

Standard deviation. Given n measurements x_i (for 1 ≤ i ≤ n), the standard deviation is defined as:

┌ ----------------- ││ ∑n s[x] = ∘ --1-- (xi − ¯x)2 n − 1 i=1

(3.3)

where x is the arithmetic mean (Equation 3.1). The square of the standard deviation (i.e. s[x]²) is also known as the variance.

2.

Median Absolute Deviation (MAD). Whereas the standard deviation quantifies the dispersion around the mean, MAD quantifies the dispersion around the median:

MAD = median |xi − median (x )|

(3.4)

3.

Interquartile range (IQR). The ECDF can be used to define the quantiles of a sample distribution. For example, the 0.1 quantile (or 10^th percentile) of the pH data is 3.9, because 10% of the pH values are ≤3.9. Thus, the median is equivalent to the “0.5 quantile” or the “50 percentile”. The 25, 50 and 75 percentiles are also known as the quartiles. The IQR marks the difference between the third and the first quantile (i.e., the 25 and 75 percentile).

Calculating the standard deviation of the pH data (whose mean is x = 5.0):

i	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20

x_i	6.2	4.4	5.6	5.2	4.5	5.4	4.8	5.9	3.8	4.0	5.0	4.1	5.2	5.5	5.3	4.6	5.7	4.6	4.6	5.6
(x_i −x)	1.2	-.6	.6	.2	-.5	.4	-.2	.9	-1.2	-1.0	.0	-.9	.2	.5	.3	-.4	.7	-.4	-.4	.6
(x_i −x)²	1.44	.36	.36	.04	.25	.16	.04	.81	1.44	1.0	.0	.81	.04	.25	.09	.16	.49	.16	.16	.36

Taking the sum of the last row:

∑20 (xi − ¯x)2 = 8.42 i=1

from which we get:

∘ ------- s[x] = 8.42∕19 = 0.67

Sorting the pH values in increasing order and recalling that med(x) = 5.1:

i	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20

x_i	3.8	4.0	4.1	4.4	4.5	4.6	4.6	4.6	4.8	5.0	5.2	5.2	5.3	5.4	5.5	5.6	5.6	5.7	5.9	6.2
x_i − med(x)	-1.3	-1.1	-1.0	-0.7	-0.6	-0.5	-0.5	-0.5	-0.3	-0.1	0.1	0.1	0.2	0.3	0.4	0.5	0.5	0.6	0.8	1.1
\|x_i − med(x)\|	1.3	1.1	1.0	0.7	0.6	0.5	0.5	0.5	0.3	0.1	0.1	0.1	0.2	0.3	0.4	0.5	0.5	0.6	0.8	1.1
sorted	0.1	0.1	0.1	0.2	0.3	0.3	0.4	0.5	0.5	0.5	0.5	0.5	0.6	0.6	0.7	0.8	1.0	1.1	1.1	1.3

The MAD is given by the mean of the bold values on the final row of this table, yielding a value of MAD = 0.5.

The 25 and 75 percentiles are obtained by averaging the two pairs of bold faced numbers on the second row of the table. They are 4.55 and 5.55, respectively. Therefore, IQR = 5.55 - 4.55 = 1.00. Showing the same calculation on an ECDF:

Figure 3.4: ECDF of the pH data with indication of the three quantiles, namely the 25 percentile, median and 75 percentile. The Interquartile Range (IQR) is defined as the difference between the 75 and 25 percentiles. This is 1 pH unit in this example. The standard deviation and Median Absolute Deviation are s[x] = 0.67 and MAD = 0.5 pH units, respectively.

The skewness of a distribution is defined as:

n ---1---∑ 3 skew(x) = n ⋅s[x]3 (xi − ¯x) i=1

(3.5)

To assess the meaning of this new summary statistic, let us plot the pH and Ca/Mg datasets alongside the distribution of Covid-19 death rates (in deaths per 100,000 people) in the UK:

Figure 3.5: a) the frequency distributions of the pH data is symmetric and is characterised by a near zero (but ever so slightly negative) skewness; b) the Ca/Mg measurements are positively skewed, i.e. they heavily lean towards small values with a heavy ‘tail’ of higher values; c) finally, the distribution of Covid-19 death rates in the UK for the year 2020 is negatively skewed: old people are much more likely to die of covid than young people.

A box-and-wisker plot is a compact way to jointly visualise the most important summary statistics in a dataset:

1.: a box is drawn from the first to the third quartile (i.e. from the 25 to the 75 percentile);
2.: the median is marked by a line in the middle of the box;
3.: two lines extend from the box towards the minimum and maximum value, ignoring outliers (as defined next);
4.: any points that fall more than 1.5 times the IQR below the first quartile, or more than 1.5 times the IQR above the third quartile are marked as outliers.

Figure 3.6: KDE (top) and box-and-whisker plot (bottom) of the clast size data. The width of the box marks the IQR. The whiskers extend to the minimum and maximum value, excluding a single outlier which, at ∼11cm, is more than 1.5 IQR larger than the third quartile (75 percentile). The median is offset towards the left hand side of the box, indicating the positive skewness of the dataset.

Probability is a numerical description of how likely it is for an event to occur or how likely it is for a proposition to be true. In the context of a random experiment, in which all outcomes are equally likely, the probability of an outcome A can be defined as:

the-number-of-ways-A-can-occur P (A) = the total number of outcomes

(4.1)

For example, the probability of tossing an unbiased coin and observing a head (H) is

{H } 1 --------= --= 0.5 {H} {T} 2

The probability of tossing the same coin three times and obtaining two heads and one tail (T) is:

P(H, H,T ) = -----------------{T-HH--}{HT-H-}{HHT---}------------------= 3- {HHH } {THH }{HT H }{HHT }{TT H }{THT }{HT T }{TT T} 8

(4.2)

Similarly, the probability of throwing two fair dice and obtaining a two (⚁ ) and a six (⚅ ) is:

( ) ---------------{⚁--⚅-}{⚅--⚁-}----------------- -2- -1- P ⚁, ⚅ = {⚀ ⚀ }{ ⚁ ⚀ }{⚂ ⚀ }{⚃ ⚀ }{⚄ ⚀ }{⚅ ⚀ } = 36 = 18 { }{ }{ }{ }{ }{ } {⚀ ⚁ }{ ⚁ ⚁ }{⚂ ⚁ }{⚃ ⚁ }{⚄ ⚁ }{⚅ ⚁ } ⚀ ⚂ ⚁ ⚂ ⚂ ⚂ ⚃ ⚂ ⚄ ⚂ ⚅ ⚂ {⚀ ⚃ }{ ⚁ ⚃ }{⚂ ⚃ }{⚃ ⚃ }{⚄ ⚃ }{⚅ ⚃ } {⚀ ⚄ }{ ⚁ ⚄ }{⚂ ⚄ }{⚃ ⚄ }{⚄ ⚄ }{⚅ ⚄ } {⚀ ⚅ }{ ⚁ ⚅ }{⚂ ⚅ }{⚃ ⚅ }{⚄ ⚅ }{⚅ ⚅ }

(4.3)

The multiplicative rule of probability states that the probability of two combined independent¹ experiments is given by the product of their respective probabilities. Thus, if one were to carry out a coin tossing and a dice throwing experiment, then the probability of obtaining two heads and one tail for the first experiment and throwing a two and a six in the second experiment is:

( ) 3-1- -3-- P H,H, T ∩ ⚁, ⚅ = 818 = 144 = 0.021

where ∩ stands for ‘and’ in the sense of “the intersection of two sets of outcomes”. Similarly, the likelihood of throwing an unbiased coin twice and obtaining two heads can be calculated as:

P (H ∩ H ) = 11-= 1- 22 4

The additive rule of probability states that the probability of observing either of two outcomes A and B is given by:

P(A ∪ B ) = P (A)+ P(B )− P (A ∩ B )

(4.4)

For example the probability of observing two heads for the first experiment or throwing a two and a six in the second experiment is:

( ) 3- -1- -3-- 59-- P H, H, T ∪ ⚁,⚅ = 8 + 18 − 144 = 144 = 0.410

For two mutually exclusive experiments, the third term in Equation 4.4 disappears. For example, the probability of obtaining two heads and one tail or throwing three heads is:

3- 1- 4- P (H, H,T ∪ H, H, H) = 8 + 8 = 8 = 0.5

A permutation is an ordered arrangement of objects. These objects can be selected in one of two ways:

1.

sampling with replacement. Consider an urn with n balls that are numbered 1 through n. Draw a ball from the urn and write down its number. There are n possible outcomes for this experiment. Then place the ball back in the urn, thoroughly mix the balls and draw a second one. Write down its number. There are n possible outcomes for the second experiment. Using the multiplicative rule of probability, there are (n × n) possible outcomes for the two numbers. Repeat until you have drawn k balls. Then the number of possible sequences of numbers is

k times ◜-----◞◟-----◝ k n× n× ...× n = n

(4.5)

2.

sampling without replacement. Consider the same urn as before and draw a first number. Like before, there are n possible outcomes. However this time we do not put the ball back into the urn. With the first ball removed, draw a second number. This time there are only (n − 1) possible outcomes. Thus, the total number of outcomes for the first two numbers is n × (n − 1). Continuing the experiment until you have drawn k balls (where k ≤ n) yields

---n!--- n × (n − 1)× (n − 2)× ...× (n − k + 1) = (n − k)!

(4.6)

possible sequences, where ‘!’ is the factorial operator.

Let us apply these two formulas to a classic statistical problem: “what is the probability that at least two students in a classroom of k celebrate their birthdays on the same day?”. The solution is as follows.

1.: There are (n =) 365 possible days on which the first person might celebrate their birthday.
2.: There are 365 possible days on which the second person might celebrate their birthday, but only 364 of these do not overlap with the first person’s birthday.
3.: There are another 365 possible days on which the third person might celebrate their birthday, but only 363 of these do not overlap with the birthdays of the first two people.
4.: For k people, there are 365^k possible sets of birthdays (sampling with replacement), but only 365 × 364 ×… × (n − k + 1) = 365!∕(365 − k)! of these sets do not overlap (sampling without replacement).
5.: Therefore, the probability that two people’s birthdays do not overlap is given by $-----365!----- P(no overlapping birthdays) = (365− k)!365k$
6.: And the probability that at least two people’s birthdays overlap is $-----365!----- P(> 1 overlapping birthdays ) = 1 − (365 − k)!365k$

If k = 23, then P(> 1 overlapping birthdays) = 0.507. In other words, there is a greater than 50% chance that at least two students will share the same birthday in a classroom of 23.

Having considered coins, dice and lottery balls, we will now (literally!) deal with a fourth archetypal source of statistical experiments, namely playing cards. Equation 4.6 showed that there are 52!∕49! = 132600 unique possible ways to select three cards from a deck. Suppose that we have drawn the following cards:

, ,

then there are 3! = 6 ways to order these cards:

{ } { } { } { } { } { }

Suppose that we don’t care in which order the objects (cards) appear. How many different unordered samples (hands) are possible?

(# ordered samples) = (# unordered samples) × (# ways to order the samples)

There are n!∕(n−k)! ways to select k objects from a collection of n, and there are k! ways to order these k objects. Therefore

(# unordered samples ) =-----(#-ordered-samples)---- = ---n!---- (# ways to order the samples) (n − k)!k!

The formula on the right hand side of this equation gives the number of combinations of k elements among a collection of n. This formula is also known as the binomial coefficient and is often written as (n)
k (pronounce “n choose k”):

(n ) n! = --------- k (n − k)!k!

(4.7)

Applying this formula to the card dealing example, the number of ways to draw three cards from a deck of 52 is

( ) 52 -52!- 52-×-51×--50 3 = 49!3! = 3× 2 = 22100

Revisiting the two examples at the start of this chapter, the number of ways to arrange two heads among three coins is

(3) 3! 6 = ----= --= 3 2 1!2! 2

which is the numerator of Equation 4.2; and the number of combinations of one ⚀ and one ⚅ is

( ) 2 = -2!-= 2 1 1!1!

which is the numerator of Equation 4.3.

So far we have assumed that all experiments (coin tosses, throws of a dice) were done independently, so that the outcome of one experiment did not affect that of the other. However this is not always the case in geology. Sometimes one event depends on another one. We can capture this phenomenon with the following definition:

P (A|B ) = “The conditional probability of A given B ”

(4.8)

Let P(A) be the probability that a sedimentary deposit contains ammonite fossils. And let P(B) be the proportion of our field area that is covered by sedimentary rocks of Bajocian age (170.3 – 168.3 Ma). Then P(A|B) is the probability that a given Bajocian deposit contains ammonite fossils. Conversely, P(B|A) is the probability that an ammonite fossil came from a Bajocian deposit.

The multiplication law states that:

P (A ∩ B) = P(A |B )P (B ) = P(B |A )P(A ) = P (B ∩ A)

(4.9)

Suppose that 70% of our field area is covered by Bajocian deposits (P(B) = 0.7), and that 20% of those Bajocian deposits contain ammonite fossils (P(A|B) = 0.2). Then there is a 14% (= 0.7 × 0.2) chance that the field area contains Bajocian ammonites.

The law of total probability states that, given n mutually exclusive scenarios B_i (for 1 ≤ i ≤ n):

∑n P (A) = P (A|Bi)P(Bi) i=1

(4.10)

Consider a river whose catchment contains 70% Bajocian deposits (P(B₁) = 0.7) and 30% Bathonian² deposits (P(B₂) = 0.3). Recall that the Bajocian is 20% likely to contain ammonite fossils (P(A|B₁) = 0.2), and suppose that the Bathonian is 50% likely to contain such fossils (P(A|B₂) = 0.5). How likely is it that the river catchment contains ammonites? Using Equation 4.10:

P(A ) = P (A|B )P (B )+ P (A|B )P (B ) = 0.2 × 0.7+ 0.5× 0.3 = 0.29 1 1 2 2

Equation 4.9 can be rearranged to form Bayes’ Rule:

P (A|B)P (B) P (B |A) = ------------- P(A )

(4.11)

or, combining Equation 4.11 with Equation 4.10:

P(B |A ) =--P(A-|Bj-)P(Bj)-- j n∑ i=1P (A|Bi)P (Bi )

(4.12)

Suppose that we have found an ammonite fossil in the river bed. What is its likely age? Using Equation 4.12, the probability that the unknown fossil is Bajocian (B₁) is given by:

P(A |B1 )P(B1) 0.2 × 0.7 P(B1 |A ) = ------------------------------= -------------------= 0.48 P(A |B1 )P (B1) + P(A |B2 )P(B2 ) 0.2 × 0.7 + 0.5× 0.3

Thus there is a 48% chance that the fossil is Bajocian, and a 52% chance that it is Bathonian.

A Bernoulli variable takes on only two values: 0 or 1. For example:

1.: a coin may land on its head (1) or tail (0);
2.: a die may land on ⚅ (1) or not (0);
3.: a ‘wildcat’ exploration well may produce petroleum (1) or be dry (0).

Consider five gold diggers during the 1849 California gold rush, who have each purchased a claim in the Sierra Nevada foothills. Geological evidence suggests that, on average, two thirds of the claims in the area contain gold (1), and the remaining third do not (0). The probability that none of the five prospectors find gold is

P(0 × gold ) = P (00000) = (1∕3)5 = 0.0041

The chance that exactly one of the prospectors strikes gold is

P (1× gold) = P(10000) + P(01000) + P(00100) + P(00010) + P(00001)

where

P(10000) =	(2∕3)(1∕3)⁴ = 0.0082
P(01000) =	(1∕3)(2∕3)(1∕3)³ = 0.0082
P(00100) =	(1∕3)²(2∕3)(1∕3)² = 0.0082
P(00010) =	(1∕3)³(2∕3)(1∕3) = 0.0082
P(00001) =	(1∕3)⁴(2∕3) = 0.0082

so that

( ) P (1× gold) = 5 (2∕3)(1 ∕3)4 = 5 × 0.0082 = 0.041 1

in which we recognise the binomial coefficient (Equation 4.7). Similarly:

P(2 × gold)	= (2∕3)²(1∕3)³ = 10 × 0.016 = 0.16
P(3 × gold)	= (2∕3)³(1∕3)² = 10 × 0.033 = 0.33
P(4 × gold)	= (2∕3)⁴(1∕3) = 5 × 0.066 = 0.33
P(5 × gold)	= (2∕3)⁵(1∕3)⁰ = 1 × 0.13 = 0.13

These probabilities form a probability mass function (PMF) and can be visualised as a bar chart:

Figure 5.1: The probability mass function (PMF) for a binomial experiment with a 2/3 chance of success (and a 1/3 chance of failure) for five gold prospecting claims. The horizontal axis is labelled with the number of claims that produce gold. The vertical axis shows the probability of these respective outcomes.

The generic equation for the PMF of the binomial distribution is

(n ) k n− k P(k|n,p) = k p (1− p)

(5.1)

where p is the probability of success and k is the number of successes out of n trials. Equivalently, the results can also be shown as a cumulative distribution function (CDF):

F (x ) = P (X ≤ x )

(5.2)

Figure 5.2: The cumulative distribution function (CDF) of the binomial distribution. This is the running sum of Figure 5.1. The horizontal axis is labelled with the number of claims that produce gold. The vertical axis shows the cumulative probability of these respective outcomes. For example, the probability that two or fewer prospectors find gold is 21%.

The previous section assumed that the probability of success (p in Equation 5.1) is known. In the real world, this is rarely the case. In fact, p is usually the parameter whose value we want to determine based on some data. Consider the general case of k successes among n trials. Then we can estimate p by reformulating Equation 5.1 in terms of p instead of k:

(n) ℒ(p|n,k) = pk(1− p)n− k k

(5.3)

This is called the likelihood function. The only difference between the probability mass function (Equation 5.1) and the likelihood function (Equation 5.3) is that former calculates the probability of an outcome k given the parameters n and p, whereas the latter is used to estimate the parameter p given the data n and k. The most likely value of p given n and k can be found by maximising ℒ(p|n,k). This can be achieved by taking the derivative of Equation 5.3 with respect to p and setting it to zero:

∂ℒ (p|n,k)|| ---------|| = 0 ∂p ˆp

which gives

( ) ( ) n k−1 n−k n k n−k−1 k kˆp (1− ˆp) − k ˆp (n − k)(1− ˆp) = 0

The symbol indicates that is an estimate for the true parameter value p, which is unknown. Dividing by ( )
nk and rearranging:

k− 1 n−k k n−k−1 k ˆp (1 − ˆp) = ˆp (n− k)(1 − ˆp)

Dividing both sides by ^k(1 − )^n−k:

k(1 − ˆp) = pˆ(n − k)

which can be solved for :

k pˆ= n-

(5.4)

Let us apply Equation 5.4 to our gold prospecting example. Suppose that only two of the five claims produce gold. Then our best estimate for p given this result is

ˆp = 2-= 0.4 5

So based on this very small dataset, our best estimate for the abundance of gold in the Sierra Nevada foothills is 40%. This may be a trivial result, but it is nevertheless a useful one. The derivation of Equation 5.4 from Equation 5.3 follows a recipe that underpins much of mathematical statistics. It is called the method of maximum likelihood. Of course, the derivation of parameter estimates is not always as easy as it is for the binomial case.

Let us continue with our gold prospecting example. Given that only two of the five prospectors found gold, our best estimate for the abundance of gold-bearing claims in the prospecting area is = 2∕5 (40%). However the introductory paragraph to this chapter mentioned that geological evidence suggests that 2/3 (67%) of the claims should contain gold. Can the discrepancy between the predicted and the observed number of successes by attributed to bad luck, or does it mean that the geological estimates were wrong? To answer this question, we follow a sequence of six steps:

1.

Formulate two hypotheses:

H₀ (null hypothesis)

H_A (alternative hypothesis):

p = 2∕3

p < 2∕3

2.

Calculate the test statistic T, which in this case is the number of successes (k).

3.

Determine the null distribution of T under H₀. In our example this is the binomial distribution, as shown in Figures 5.1 and 5.2. Tabulating the PMF and CDF of success:

k	0	1	2	3	4	5

P(T = k)	0.0041	0.0411	0.1646	0.3292	0.3292	0.1317
P(T ≤ k)	0.0041	0.0453	0.2099	0.5391	0.8683	1.0000

The p-value¹ is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. For our one-sided test, the p-value is 0.2099, which corresponds to the probability of observing k ≤ 2 successful claims if p = 2∕3.

4.

Choose a significance level α. It is customary to choose α = 0.05.

5.

Mark all the outcomes that are incompatible with H₀. This rejection region is marked in bold in the following table:

k	0	1	2	3	4	5

P(T = k)	0.0041	0.0411	0.1646	0.3292	0.3292	0.1317
P(T ≤ k)	0.0041	0.0453	0.2099	0.5391	0.8683	1.0000

k = 0 and k = 1 are incompatible with H₀ because the probability of finding gold in k ≤ 1 claims is only 0.0453, which is less than α. Therefore our rejection region contains two values:

R = {0,1}

6.

Reach a decision. Evaluate whether the observed outcome for the test statistics falls inside the rejection region. If it does, then H₀ is rejected in favour of H_A. In our example,

k = 2 ∕∈ R

which means that we cannot reject H₀. Alternatively, and equivalently, we can reach the same conclusion by observing that the p-value is greater than α (i.e., 0.2099 > 0.05). Note that failure to reject the null hypothesis does not mean that said hypothesis has been accepted!

Displaying the rejection region graphically:

Figure 5.3: a) PMF and b) CDF of a binomial null distribution with p = 2∕3 and n = 5. The rejection region is marked in black on a). The horizontal dotted line in b) shows the α = 0.05 cutoff mark. The horizontal dashed line in b) marks the p-value for k = 2, which is greater than α. Therefore, the null hypothesis cannot be rejected.

The above hypothesis test is called a one-sided hypothesis test, which refers to the fact that H_A specifies a direction (‘<’ or ‘>’). Alternatively, we can also formulate a two-sided hypothesis test:

1.

In this case H₀ and H_A are symmetric:

H₀ (null hypothesis)

H_A (alternative hypothesis):

p = 2∕3

p≠2∕3

2.

The test statistic remains the same as before.

3.

We add one line to our table of cumulative outcomes, in order to evaluate the high end of the scale as well as its low end:

k	0	1	2	3	4	5

P(T = k)	0.0041	0.0411	0.1646	0.3292	0.3292	0.1317
P(T ≤ k)	0.0041	0.0453	0.2099	0.5391	0.8683	1.0000
P(T ≥ k)	1.000	0.9959	0.9547	0.7901	0.4609	0.1317

4.

The significance level is kept the same, but is now evaluated twice at α∕2 to accommodate both tails of the binomial distribution.

5.

Mark all the outcomes that are incompatible with H₀, i.e. all the values k with P(T ≤ k) or P(T ≥ k) < α∕2. This rejection region is marked in bold in the following table:

k	0	1	2	3	4	5

P(T = k)	0.0041	0.0411	0.1646	0.3292	0.3292	0.1317
P(T ≤ k)	0.0041	0.0453	0.2099	0.5391	0.8683	1.0000
P(T ≥ k)	1.000	0.9959	0.9547	0.7901	0.4609	0.1317

which yields a smaller rejection region than before, because P(T ≤ 1) = 0.0453, which is greater than α∕2 = 0.025. The same is true for P(T ≥ k) for any k. Therefore:

R = {0}

6.

Again, we fail to reject H₀, which means that the observed k∕n = 2∕5 success rate does not rule out the possibility that the true value of p = 2∕3, and that the geologists were correct.

Displaying the two-sided hypothesis test graphically:

Figure 5.4: a) the same PMF and b) CDF as Figure 5.3. Horizontal dotted lines mark α∕2 = 0.025 and (1 − α∕2) = 0.975. Their intersections with the CDF are shown as vertical dotted lines and mark the outer edges of the rejection region. The dashed lines mark the observed value and probability, which fall outside the rejection region.

Again, H₀ cannot be rejected at an α = 0.05 significance level. So in this case the one-sided and two-sided hypothesis tests produce exactly the same result, as the observed value k = 2 is not in the rejection region for either test. However this is not always the case, as will be shown next.

Suppose that not five but fifteen gold prospectors had purchased a claim in the same area as before. Further suppose that six of these prospectors had struck gold. Then the maximum likelihood estimate for p is:

-6- ˆp = 15 = 0.4

which is the same as before. The one-sided hypothesis test (H₀ : p = 2∕3 vs. H_A : p < 2∕3) proceeds as before, but leads to a different table of probabilities:

k	0	1	2	3	4	5	6	7

P(T = k)	7.0×10⁻⁸	2.1×10⁻⁶	2.9×10⁻⁵	2.5×10⁻⁴	0.0015	0.0067	0.0223	0.0574
P(T ≤ k)	7.0×10⁻⁸	2.2×10⁻⁶	3.1×10⁻⁵	2.8×10⁻⁴	0.0018	0.0085	0.0308	0.0882
k	8	9	10	11	12	13	14	15

P(T = k)	0.1148	0.1786	0.2143	0.1948	0.1299	0.0599	0.0171	0.0023
P(T ≤ k)	0.2030	0.3816	0.5959	0.7908	0.9206	0.9806	0.9977	1.0000

The rejection region is shown in bold and consists of

R = {0,1,2,3,4,5,6}

(5.5)

which includes k = 6. Therefore H₀ is rejected.

Figure 5.5: a) PMF and b) CDF of a binomial distribution with p = 2∕3 and n = 15. The vertical dashed lines mark the observation (k = 6). They fall in the black area of the bar chart, and to the left of the dotted line in the CDF, which marks the edge of the rejection region. Therefore, H₀ fails the one-sided test.

For the two-sided hypothesis test (H₀ : p = 2∕3 vs. H_A : p≠2∕3):

k	0	1	2	3	4	5	6	7

P(T = k)	7.0×10⁻⁸	2.1×10⁻⁶	2.9×10⁻⁵	2.5×10⁻⁴	0.0015	0.0067	0.0223	0.0574
P(T ≤ k)	7.0×10⁻⁸	2.2×10⁻⁶	3.1×10⁻⁵	2.8×10⁻⁴	0.0018	0.0085	0.0308	0.0882
P(T ≥ k)	1.0000	1-7.0×10⁻⁸	1-2.2×10⁻⁶	1-3.1×10⁻⁵	1-2.8×10⁻⁴	0.9982	0.9915	0.9692
k	8	9	10	11	12	13	14	15

P(T = k)	0.1148	0.1786	0.2143	0.1948	0.1299	0.0599	0.0171	0.0023
P(T ≤ k)	0.2030	0.3816	0.5959	0.7908	0.9206	0.9806	0.9977	1.0000
P(T ≥ k)	0.9118	0.7970	0.6184	0.4041	0.2092	0.0794	0.0194	0.0023

The rejection region (which includes both tails of the distribution) is:

R = {0,1,2,3,4,5,14,15}

(5.6)

This region does not include k = 6. Therefore we cannot reject the two-sided null hypothesis that p = 2∕3 at α = 0.05.

Figure 5.6: a) PMF and b) CDF of the binomial distribution with p = 2∕3 and n = 15. The dotted lines mark the α∕2 = 0.025 and (1 − α∕2) = 0.975 levels and quantiles. The dashed lines mark the observed value (k = 6), which falls outside the rejection region. Therefore H₀ does not fail the two-sided test.

Let us increase our ‘sample size’ (number of prospectors) even more, from 15 to 30, and suppose once again that only 40% of these found gold even though the geological evidence suggested that this should be 67%. The lookup table of probabilities would be quite large, so we will just show the distributions graphically:

Figure 5.7: The PMF (a & b) and CDF (c & d) for a binomial distribution with p = 2∕3 and n = 30, illustrating a one-sided (a & c) and two-sided (b & d) test. The vertical dashed lines mark the observation k = 12, which falls inside the rejection regions marked in black (a & b), and outside the cutoff limits defined by the vertical dotted lines (c & d).

In summary, we have compared the same outcome of 40% successes with the null hypothesis p = 2∕3, using three different sample sizes (n):

1.: Both the one-sided and the two-sided hypothesis tests failed to reject the null hypothesis for n = 5 prospectors.
2.: Increasing the number of prospectors to n = 15 leads to a rejection of H₀ using the one-sided test but failure to reject H₀ using the two-sided test.
3.: Further increasing our sample size to n = 30 whilst still keeping the percentage of success the same leads to a firm rejection of H₀ using both the one-sided and the two-sided test.

In statistical terms, the increase in sample size has increased the ‘power’ of the test to reject the hypothesis test. A formal mathemathical definition of this concept will be given in Section 5.4.

There are four possible outcomes for a hypothesis test, which can be organised in a 2 × 2 table:

H₀ is ...	false	true

rejected	correct decision	Type-I error
not rejected	Type-II error	correct decision

To appreciate the difference between the two types of errors in this table, it may be useful to compare statistical hypothesis testing with a legal analogue. The jury in a court of justice are in a situation that is similar to that of a statistical hypothesis test. They are faced with a suspect who has either committed a crime or not, and they must decide whether to sentence this person or acquit them. In this case our ‘null hypothesis’ is that the accused is innocent. The jury then needs decide whether there is enough evidence to reject this hypothesis in favour of the alternative hypothesis, which is that the accused is guilty. Casting this process in a second 2 × 2 table:

the accused is ...	guilty	innocent

convicted	correct decision	Type-I error
acquitted	Type-II error	correct decision

A type-I error is committed when a true null hypothesis test is erroneously rejected. This is akin to putting an innocent person in prison. For our gold prospecting example, this means that we reject the expert opinion of the geologist (whose assessment indicated a 2/3 chance of finding gold) when this geologist is in fact correct.

A type-II error is committed when we fail to reject a false null hypothesis. This is akin to letting a guilty person get away with a crime for lack of evidence. In the geological example, this means that we conclude that the geological assessment is right despite it being wrong.

The probability of committing a type-I error is controlled by one parameter:

1.: The confidence level α
Using the customary value of α = 0.05, there is a 5% chance of committing a type-I error. So even if the null hypothesis is correct, then we would still expect to reject it once every 20 times. This may be acceptable in geological studies, but probably not in the legal system! The principle that guilt must be proven “beyond any reasonable doubt” is akin to choosing a very small significance level (α ≪ 0.05). However it is never possible to enforce α = 0 unless all suspects are aquitted, so it is inevitable that some innocent people are convicted.

The probability of committing a type-II error (β) depends on two things:

1.

The degree to which H₀ is false

In our geological example, 40% of the prospectors found gold in their claim, so there clearly was some gold present in the area. Suppose that the actual abundance of gold in the prospecting area was indeed 40% (p = 2∕5) instead of p = 2∕3. Then the expected distribution of outcomes would follow a binomial distribution with p = 2∕5. As shown in Section 5.2, the rejection region for the one-sided hypothesis test of H₀ : p = 2∕3 vs. H_A : p < 2∕3 is R = {0,1} for n = 5. If the actual value for p is 2/5, then the probability of observing a value for k that falls in this rejection region is P(k < 2|n = 5,p = 2∕5) = 0.34. This is known as the power of the statistical test. The probability of committing a type-II error is given by:

β = 1 − power = 0.66

(5.7)

Next, suppose that the true probability of finding gold is even lower, at p = 1∕5. Under this alternative distribution, the probability of finding gold in k ≤ 1 out of n = 5 claims (i.e., the power) increases to 74%. Therefore, the probability of committing a type-II error has dropped to only 26%.

Finally, consider an end member situation in which the prospecting area does not contain any gold at all (p = 0). Then the probability of finding gold is obviously zero, because F(x = 2) = 0. Under this trivial scenario, the power of the test is 100%, and the probability of committing a type-II error is zero.

Plotting these results graphically:

Figure 5.8: a) – c) PMFs of the binomial null distribution with p = 2∕3 (H₀, grey and white, flipped, bottom) and alternative distributions (H_A, black and white, upright, top) with a) p = 2∕5, b) p = 1∕5 and c) p = 0. Sample size is n = 5 for all cases. d) – f) CDFs for the null distribution (grey) and the alternative distribution (black). The horizontal dotted lines mark α = 0.05. Their intersections with the CDF of the null distribution are marked by vertical dotted lines. The areas to the left of these lines define the rejection region under a one-sided test and are marked in black in the PMFs of the alternative distributions (a – c). The dashed horizontal lines mark the intersection of the rejection boundary with the CDF of the alternative distribution. They mark the power of the statistical test (1 − β). Power increases as the alternative distribution drifts away from the null distribution.

2.

Sample size

The effect of sample size was already discussed in Section 5.3. Comparing the predicted outcomes for the null hypothesis H₀ : p = 2∕3 to those of the alternative hypothesis H_A : p = 2∕5 for sample sizes of n = 5, 15 and 30:

Figure 5.9: a) – c) PMFs of binomial distributions with p = 2∕3 (H₀, grey and white, flipped, bottom) and p = 2∕5 (H_A, black and white, upright, top), for sample sizes of a) n = 5, b) n = 15 and c) n = 30. d) – f) CDFs for the null distribution (grey) and the alternative distribution (black). The horizontal dotted lines mark α = 0.05, corresponding to a one-sided test. Their intersections with the CDF of the null distribution are marked by vertical dotted lines. The area to the left of these lines define the rejection region and are marked in black in the bar chart. The larger the sample size, the easier it is to reject H₀. The dashed horizontal lines mark the intersection of the rejection region with the CDF of the alternative distribution. These mark the power of the statistical test (1 −β). Power increases with sample size.

All hypotheses are wrong ... in some decimal place – John Tukey (paraphrased)

All models are wrong, but some are useful – George Box

Statistical tests provide a rigorous mathematical framework to assess the validity of a hypothesis. It is not difficult to see the appeal of this approach to scientists, including geologists. The scientific method is based on three simple steps:

1.: Formulate a hypothesis.
2.: Design an experiment to test said hypothesis.
3.: Carry out the experiment and check to see if it matches the prediction.

It is rarely possible to prove scientific hypotheses. We can only disprove them. New knowledge is gained when the results of an experiment do not match the expectations. For example:

1.: Hypothesis: Earth’s lower mantle is made of olivine.
2.: Test: Study the stability of olivine at lower mantle pressures (24-136 GPa).
3.: Result: Olivine is not stable at lower mantle pressures.

From this experiment we still don’t know what the lower mantle is made of. But at least we know that it is not olivine. Let us contrast this outcome with a second type of hypothesis:

1.: Hypothesis: Earth’s lower mantle is made of perovskite.
2.: Test: Study the stability of perovskite at lower mantle pressures.
3.: Result: Perovskite is stable at lower mantle pressures.

What have we learned from this experiment? Not much. We certainly did not prove that Earth’s lower mantle consists of perovskite. There are lots of other minerals that are stable at lower mantle pressures. The only thing that we can say is that the null hypothesis has survived to live another day. The scientific method is strikingly similar to the way in which a statistical hypothesis test is carried out. A null hypothesis, like a scientific hypothesis, cannot be proven. It can only be disproved. Rejection of a null hypothesis is the best outcome, because it is the only outcome that teaches us something new. Our understanding of the natural world continues to improve by falsifying current hypotheses using scientific experiments, which leads to revised hypotheses that are closer to the truth.

It may seem natural to use the statistical approach to test scientific hypotheses. However doing so is not without dangers. To explain these dangers, let us go back to the power analysis of Section 5.3. The power of our hypothesis test to reject H₀ : p = 2∕3 increases with sample size. A small sample may be sufficient to detect large deviations from the null hypothesis. Smaller deviations require larger sample sizes. But no matter how small the violation of the null hypothesis is, there always exists a sample size that is large enough to detect it.

Statistical tests are an effective way to evaluate mathematical hypotheses. They are less useful for scientific hypotheses. There is a profound difference between mathematical and scientific hypotheses. Whereas a mathematical hypothesis is either ‘right’ or ‘wrong’, scientific hypotheses are always ‘somewhat wrong’. Considering our gold prospecting example, it would be unreasonable to expect that p is exactly equal to 2/3, down to the 100^th significant digit. Estimating the proportion of gold in the area to be within 10% of the truth would already be a remarkable achievement. Yet given enough data, there will always come a point where the geological prediction is disproved. Given a large enough dataset, even a 1% deviation from the predicted value would yield an unacceptably small p-value:

Figure 5.10: The p-values of a binomial test comparing the null hypothesis H₀ : p = 2∕3 with the alternative hypothesis H_a : p < 2∕3, for four parameter estimates

, over a range of sample sizes. For each

, let the outcome k be the closest integer to n

, so that k∕n ≈

. The horizontal dotted line marks the 5% significance level. No matter how little the observed k∕n-ratio differs from 2/3 (≈ 0.67), this difference always becomes ‘significant’ given a large enough sample.

As another example, suppose that we have analysed the mineralogical composition of two samples of sand that were collected 10 cm apart on the same beach. Our null hypothesis is that the composition of the two samples is the same. Plausible though this hypothesis may seem, it will always be possible to reject it, provided that enough grains are analysed. We may need to classify a hundred, a thousand or even a million grains from each sample, but at some point a ‘statistically significant’ difference will be found. Given a large enough sample, even the tiniest hydraulic sorting effect becomes detectable.

In conclusion, formalised hypothesis tests are of limited use in science. There are just two exceptions in which they do serve a useful purpose:

1.: If a statistical test fails to reject the null hypothesis, then this indicates that the sample size is too small to find a meaningful effect. In this context the statistical test protects us against over-interpreting our data.
2.: We can calculate the sample size that would be required to detect a pre-specified deviation from the null hypothesis, for the given type-1 and type-2 thresholds (i.e., α and β). This approach is used in the pharmaceutical industry to test the efficacy of drugs. However it is seldom or never available to geologists.

The previous section showed that simple binary hypothesis tests are of limited use in geology. The question that is relevant to scientists is not so much whether a hypothesis is wrong, but rather how wrong it is. In the context of our gold prospecting example, there is little use in testing whether p is exactly equal to 2∕3. It is far more useful to actually estimate p and to quantify the statistical uncertainty associated with it.

Equation 5.4 showed that, given k successful claims among n total claims, the most likely estimate for p is k∕n. For example, if we observe k = 2 successful claims among n = 5 trials, then our best estimate for the abundance of gold is = 2∕5. However this does not rule out other values. Let us now explore all possible values for p that are compatible with the observed k = 2 successful claims:

1.: p=0? If p = 0, then the outcome that k = 2 would be impossible. So p = 0 can be ruled out.
2.: p=0.1? The probability of observing k ≥ 2 successful claims if p = 0.1 is given by: $5 ( ) P(k ≥ 2|p = 0.1,n = 5) = ∑ 5 (0.1)i(0.9)n− i = 0.081 i i=2$
which is greater than α∕2. Consequently, the observation k = 2 falls outside the rejection region of the two-sided hypothesis test, and the proposed parameter value p = 0.1 is deemed to be compatible with the observation.
3.: p=2/5? This is our maximum likelihood estimate for p. It is definitely possible that this is the true value.
4.: p=2/3? Figures 5.8.a) and d) show that there is a greater than 2.5% chance of observing 2 or fewer successful claims if the true value of p is 2∕3. So p = 2∕3 remains possible.
5.: p=0.9? The probability of observing k ≤ 2 successful claims if p = 0.9 is given by: $2 ( ) P (k ≤ 2|p = 0.9,n = 5) = ∑ 5 (0.9)i(0.1)n−i = 0.0086 i i=0$
which is less than the α∕2 cutoff, and so p = 0.9 is not compatible with the observation.
6.: p=1? If p = 1, then 100% of the claims should contain gold. This is incompatible with the observation that k = 2. Therefore p = 1 can be ruled out.

The set of all values of p that are compatible with the observed outcome k = 2 forms a confidence interval. Using an iterative process, it can be shown that the lower and upper limits of this interval are given by:

C.I.(p|k = 2,n = 5 ) = [0.053,0.85]

Figure 5.11: a) the PMFs and b) CDFs of the binomial parameter p given k = 2 successful claims (dashed lines) out of n = 5 (i.e.,

= 0.4), for the lower (p = 0.053, grey) and upper (p = 0.85, black) limits of a 95% confidence interval. Dotted horizontal lines mark the 0.025 and 0.975 confidence levels.

Repeating this procedure for a different result, for example k = 4, yields a different confidence interval, namely:

C.I.(p|k = 4,n = 5) = [0.284,0.995 ]

Figure 5.12: The same as Figure 5.11, but for k = 4 successful claims (dashed lines) out of n = 5 trials (i.e.,

= 0.8).

What happens if we increase the sample size from n = 5 to n = 30, and the number of successful claims from k = 2 to k = 12? Then the maximum likelihood estimate remains = 2∕5 as in our first example, but the 95% confidence interval narrows down to

C.I.(p|k = 12,n = 30) = [0.23,0.59]

Figure 5.13: The same as Figure 5.11, but for k = 12 successful claims (dashed lines) out of n = 30 trials (i.e.,

= 0.4).

To further explore the trend of decreasing confidence interval width with increasing sample size, let us evaluate the 95% confidence intervals for = k∕n estimates of 2/3 and 1/5, respectively, over a range of sample sizes between n = 3 and n = 300:

Figure 5.14: 95% confidence intervals for a)

= 2∕3 and b)

= 1∕5 for different sample sizes n. The horizontal dotted lines mark the maximum likelihood estimates.

The confidence intervals become progressively narrower with increasing sample size. This reflects a steady improvement of the precision of our estimate for p with increasing sample size. In other words, large datasets are ‘rewarded’ with better precision.

The confidence intervals of Figure 5.14 are asymmetric but become more symmetric around the estimate with increasing sample size. In fact, for large n (> 20, say) and p not near to 0 or 1, the 95% confidence interval can be approximated as

C.I.(p|k = nˆp,n) ≈ ˆp± 1.96s[ˆp]

(5.8)

where

∘ -------- ˆp(1-−-ˆp)- s[ˆp] = n

(5.9)

A derivation of and justification for this approximation will be provided in Section 8.4.

Example 1

A declustered earthquake catalog¹ of the western United States contains 543 events of magnitude 5.0 and greater that occurred between 1917 and 2016:

Figure 6.1: The number of US earthquakes of magnitude 5.0 or greater per year between 1917 and 2016, with aftershocks removed.

The number of earthquakes in each bin forms a new dataset of 100 numbers, which has the following summary statistics:

Figure 6.2: Histogram of the earthquake counts shown in Figure 6.1.
mean: 5.43
standard deviation: 2.50
variance: 6.24

Note how the mean and the variance of this dataset are similar.

Example 2

5000 grains of sand have been mounted in an uncovered thin section and imaged with a scanning electron microscope (SEM). The SEM has identified the locations of zircon (ZrSiO₄) crystals that are suitable for geochronological dating:

Figure 6.3: Point-counting results for zircon in sand. Black squares mark zircons and grey circles other minerals.

Counting the number of zircons in the graticule:

Figure 6.4: The number of zircons counted in each square of Figure 6.3.

and tallying the number of zircons per square in a histogram:

Figure 6.5: Histogram of the zircon counts shown in Figure 6.4.
mean: 3.50
standard deviation: 1.85
variance: 3.40

Like the earthquake example, also this zircon example is characterised by similar values for the mean and the variance. This turns out to be a characteristic property of the Poisson distribution.

The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a constant mean rate and are independent of the time since the last event. Examples of Poisson variables include the number of

1.: people killed by lightning per year;
2.: mutations in DNA per generation;
3.: radioactive disintegrations per unit time;
4.: mass extinctions per 100 million years.

The number of earthquakes including aftershocks and the number of floods per year are not Poisson variables, because they are clustered in time. The Poisson distribution predicts the probability of observing the number of ‘successes’ k given the long term average of successes λ:

k −λ P(k|λ) = λ-e--- k!

(6.1)

Thus the Poisson distribution is characterised by a single parameter, λ. Exploring the distribution for different values of this parameter:

Figure 6.6: PMF (top) and CDF (bottom) of the Poisson distribution for different values of the parameter λ, as marked by the dashed lines.

The Poisson distribution is positively skewed but becomes more symmetric with increasing λ. In this respect it is similar to the binomial distribution (Figure 5.8). In fact the Poisson and binomial distributions are closely related to each other. Recall that the binomial distribution depends on two parameters: n and p. It can be shown that the binomial distribution converges to the Poisson distribution with increasing n and decreasing p. In the limit of n →∞ and p → 0, the binomial distribution simplifies to a Poisson distribution with λ = np. The next table illustrates this empirically by evaluating the probability of observing k = 2 successes under a binomial distribution with np = 5 for different values of n and p:

n	10	20	50	100	200	500	1,000	10,000
p	0.5	0.25	0.1	0.05	0.025	0.01	0.005	0.0005
P(k = 2)	0.0439	0.0669	0.0779	0.0812	0.0827	0.0836	0.0839	0.0842

Table 6.1: Binomial probability of 2 successes after n trials for different values of p, where np = 5. In the limit of n →∞ and p → 0, the cumulative probability P(k = 2) converges to a value of 0.0842. This equals the probability of 2 successes under a Poisson distribution with λ = 5.

The Poisson distribution has one unknown parameter, λ. This parameter can be estimated using the method of maximum likelihood, just like the parameter p of the binomial distribution (Section 5.1). As before, the likelihood function is obtained by swapping the parameter (λ) and the data (k) in the PMF function (Equation 6.1):

k −λ ℒ(λ|k) = λ-e--- k!

(6.2)

And as before, we can estimate λ by maximising the likelihood function. Thus, we take the derivative of ℒ with respect to λ and set it to zero:

∂ℒ-= 0 ∂λ

(6.3)

Alternatively, we can also maximise the log-likelihood:

∑k ℒ ℒ(λ|k) = k ln [λ]− λ − i i=1

(6.4)

and set its derivative w.r.t. λ to zero:

∂ℒ-ℒ-= 0 ∂λ

(6.5)

Both approaches give exactly the same result because any parameter value ˆ
λ that maximises ℒ also maximises ℒℒ. Thus:

| ∂-ℒℒ-|| = k− 1 = 0 ∂ λ |λˆ ˆλ

(6.6)

which leads to

k- ˆλ = 1

(6.7)

and, hence

ˆ λ = k

(6.8)

In other words, the measurement itself equals the ‘most likely’ estimate for the parameter. However, this does not mean that all other values of λ are unlikely. In fact, other values of λ may also be compatible with k, and vice versa. The next section explores which values of λ are reconcilable with a given value of k.

Hypothesis testing for Poisson variables proceeds in exactly the same way as for binomial variables (Section 5.2). For example:

1.

Consider the following one-sided pair of hypotheses:

H₀ (null hypothesis)

H_A (alternative hypothesis):

λ = 3.5

λ > 3.5

2.

Like for the binomial case, the test statistic is the number of ‘successes’. Suppose that we have observed k = 9 successes.

3.

The null distribution of the test statistic is a Poisson distribution with λ = 3.5:

k	0	1	2	3	4	5	6	7	8	9	10

P(T = k)	0.030	0.106	0.185	0.216	0.189	0.132	0.077	0.038	0.017	0.007	0.002
P(T ≥ k)	1.000	0.970	0.864	0.679	0.463	0.275	0.142	0.065	0.027	0.010	0.003

4.

We will use the same significance level as always, i.e. α = 0.05.

5.

Marking the rejection region in bold:

k	0	1	2	3	4	5	6	7	8	9	10

P(T = k)	0.030	0.106	0.185	0.216	0.189	0.132	0.077	0.038	0.017	0.007	0.002
P(T ≥ k)	1.000	0.970	0.864	0.679	0.463	0.275	0.142	0.065	0.027	0.010	0.003

6.

The rejection region is R = {8,9,10,…,∞}, which includes our observation k = 9. Therefore, our null hypothesis is rejected.

Displaying the rejection region graphically:

Figure 6.7: a) PMF and b) CDF of a Poissonian null distribution with λ = 3.5. The rejection region is marked in black on a). The horizontal dotted line in b) shows the 1 − α = 0.95 mark. The horizontal dashed line in the CDF marks the cumulative probability for k = 9, which is greater than the 0.95 cutoff. Therefore the one-sided null hypothesis is rejected.

The observant reader may have noticed that the hypothesis test of Section 6.3 referred to the zircon counting example of Figures 6.3 – 6.4. The average number of observations per bin in this example was 3.5. Therefore, according to Section 6.2, the maximum likelihood estimate for λ is 3.5 as well. According to our hypothesis test, a value of k = 9 is incompatible with a parameter value of λ = 3.5. Yet the observant reader may also have noticed that a value of k = 9 appears in the dataset (Figure 6.4)!

Does this mean that our data do not follow a Poisson distribution?

The answer is no. The apparent contradiction between the point-counting data and the hypothesis test is a result of multiple hypothesis testing. To understand this problem, we need to go back to the multiplicative rule of page 32. The probability of incurring a type-I error is α. Therefore, the probability of not making a type-I error 1 − α = 0.95. But this is only true for one test. If we perform two tests, then the probability of twice avoiding a type-I error is (1 − α)² = 0.9025. If we do N tests, then the probability of not making a type-I error reduces to (1 − α)^N. Hence, the probability of making a type-I error increases to 1 − (1 − α)^N. Figure 6.4 contains 4 × 12 = 48 squares. Therefore, the likelihood of a type-I error is not α but 1 − (1 − α)⁴⁸ = 0.915.

In other words, there is a 91.5% chance of committing a type-I error when performing 48 simultaneous tests. One way to address this issue is to reduce the confidence level of the hypothesis test from α to α∕N, where N equals the number of tests. This is called a Bonferroni correction. In the case of our zircon example, the confidence level would be reduced from α = 0.05 to α = 0.05∕48 = 0.00104 (1 − α = 0.99896). It turns out that the 99.896 percentile of a Poisson distribution with parameter λ = 3.5 is 10. So the observed outcome of k = 9 zircons in the 48 square graticule is in fact not in contradiction with the null hypothesis, but falls within the expected range of values.

Multiple testing is a common problem in science, and a frequent source of spurious scientific ‘discoveries’. For example, consider a dataset of 50 chemical elements measured in 100 samples. Suppose that you test the degree of correlation between each of these elements and the gold content of the samples. Then it is inevitable that one of the elements will yield a ‘statistically significant’ result. Without a multi-comparison correction, this result will likely be spurious. In that case, repetition of the same experiment on 100 new samples would not show the same correlation. Poorly conducted experiments of this kind are called statistical fishing expeditions, data dredging or p-hacking. Sadly they are quite common in the geological literature, and it is good to keep a sceptical eye out for them.

The construction of confidence intervals for the Poisson parameter λ proceeds in pretty much the same way as it did for the binomial parameter p. Let us construct a 95% confidence interval for λ given the observation that k = 5 magnitude 5.0 or greater earthquakes occurred in the US in 2016.

The lower limit of a 95% confidence interval for the number of earthquakes per year is marked by the value of λ that is more than 2.5% likely to produce an observation of k = 5 or greater. This turns out to be λ = 1.62. The upper limit of the confidence interval is marked by the value of λ that is more than 97.5% likely to produce an observation of k = 5 or smaller. This value is λ = 11.7. Hence, the 95% confidence interval is [1.62,11.7]. Note that this interval includes the average of all the 100 preceding years.

Figure 6.8: 95% confidence interval for the Poisson parameter λ given a single observation of k = 5 events. The lower (p = 1.62, grey) and upper (p = 11.67) limits of the confidence interval are shown as a) PMFs and b) CDFs. Dotted horizontal lines mark the 0.025 and 0.975 confidence levels.

Repeating the exercise for all observations in Figure 6.1 yields the following set of 100 confidence intervals for λ:

Figure 6.9: 95% Poisson confidence intervals for all the years in the declustered earthquake database. The horizontal dotted line marks the average of all the years (

= 5.43). ‘outliers’ are marked in black.

Three years (1941, 1954 and 1984) stand out because their 95% confidence intervals do not overlap with the long term average value of 5.43. Does this mean that the earthquake statistics did not fit the Poisson distribution during those years? The answer to this question is no, for the same reasons as given in Section 6.4. When a large number of confidence intervals are drawn, it is inevitable that some of these do not include the true parameter value. In fact it would be suspicious if all the error bars overlapped with the long term average.

With a confidence level of α = 0.05, there should be a 5% chance of committing a type-I error. Therefore, we would expect 5% of the samples to be rejected, and 5% of the error bars to exclude the true parameter value. The observed number of rejected samples (3/100) is in line with those expectations.

The binomial (Chapter 5) and Poisson (Chapter 6) distributions are just two of countless possible distributions. Here are a few examples of other distributions that are relevant to Earth scientists:

The negative binomial distribution models the number of successes (or failures) in a sequence of Bernoulli trials before a specified number of failures (or successes) occurs. For example, it describes the number of dry holes x that are drilled before r petroleum discoveries are made given a probability of discovery p:

$(r + x − 1) x r P(x|r,p) = x (1 − p) p$ (7.1)
The multinomial distribution is an extension of the binomial distribution where more than two outcomes are possible. For example, it describes the point counts of multiple minerals in a thin section. Let p₁,p₂,…,p_m be the relative proportions of m minerals (where ∑ _i=1^mp_i = 1), and let k₁,k₂,…,k_m be their respective counts in the thin section (where ∑ _i=1^mk_i = n). Then:

$m P (k ,k ,...,k |p ,p ,...,p ) = --n!--∏ pki 1 2 m 1 2 m ∏m i i=1ki!i=1$ (7.2)

The binomial and Poisson distributions are univariate distributions that aim to describe one-dimensional datasets. However the multinomial distribution is an example of a multivariate probability distribution, which describes multi-dimensional datasets.
The uniform distribution is the simplest example of a continuous distribution. For any number x between the minimum a and maximum b:

${ b1−a if a ≤ x ≤ b f(x|a,b) = 0 otherwise$ (7.3)

x does not have to be an integer but is free to take any decimal value. Therefore, f(x|a,b) is not referred to as a probability mass function (PMF) but as a probability density function (PDF). Whereas PMFs are represented by the letter P, we use the letter f to represent PDFs. This is because the probability of observing any particular value x is actually zero. For continuous variables, calculating probabilities requires integration between two values. For example:

$∫ d P(c ≤ x ≤ d) = f(x|a,b)dx c$ (7.4)

The cumulative density function (CDF) of a continuous variable is also obtained by integration rather than summation. For the uniform distribution:

$( |{ 0∫ if x < a P(X ≤ x) = xa b−1adX = xb−−aa- if a ≤ x ≤ b |( 1 if x > b$ (7.5)

Earthquakes follow a uniform distribution across the day, because they are equally likely to occur at 3:27:05 in the morning as they are at 17:02:58 in the afternoon, say.

Figure 7.1: The time of day for all 543 magnitude 5.0 or greater earthquakes of Figure 6.1.

We will not discuss these, or most other distributions, in any detail. Instead, we will focus our attention on one distribution, the Gaussian distribution, which is so common that it is also known as the normal distribution, implying that all other distributions are ‘abnormal’.

Let us revisit the Old Faithful dataset of Figure 2.15.

Figure 7.2: The KDE and rug plot of 272 Old Faithful eruption durations is the marginal distribution of Figure 2.15. This distribution has two modes at 2 and 4.5 minutes.

The next three figures derive three new distributions by taking the sum of n randomly selected values from the geyser eruption durations:

Figure 7.3: Collect n = 2 randomly chosen events from the original dasaset of geyser eruptions with replacement and add their durations together. Repeat to create a new dataset of N = 500 values. The KDE of this distribution has not two but three modes at 4 (= 2 × 2), 6.5 (= 2 + 4.5), and 9 (= 2 × 4.5) minutes, respectively.

Figure 7.4: Collect n = 3 randomly chosen events from the geyser eruption dataset and add their durations together. Repeat N = 500 times to create a third dataset. The KDE of this distribution has four visible modes, including peaks at 6 (= 3 × 2), 8.5 (= 2 × 2 + 4.5), 11 (= 2 + 2 × 4.5) and 13.5 (= 3 × 4.5) minutes.

Figure 7.5: Taking N = 500 samples of n = 10 randomly selected eruptions and summing their durations produces a fourth dataset whose KDE has a single mode with symmetric tails towards lower and higher values.

Figure 7.5 has the characteristic bell shape of a Gaussian distribution, which is described by the following PDF:

1 [ (x − μ)2] f(x|μ,σ) = --√---exp − --2-σ2-- σ 2π

(7.6)

where μ is the mean and σ is the standard deviation. It can be mathematically proven that the sum of n randomly selected values converges to a Gaussian distribution, provided that n is large enough. This convergence is guaranteed regardless of the distribution of the original data. This mathematical law is called the Central Limit Theorem.

The Gaussian distribution is known as the normal distribution because it naturally arises from additive processes, which are very common in nature. It is easy to create normally distributed values in a laboratory environment. There even exists a machine that generates normally distributed numbers:

Figure 7.6: Galton’s bean machine is a mechanical device that simulates additive physical processes. It consists of a triangular arrangement of pegs above a linear array of containers. When a bead enters the machine from the top, it bounces off the pegs on its way down to the containers. The probability of bouncing to the left is the same as the probability of bouncing to the right. After n bounces, the bead lands in one of the containers, forming a bell shaped (binomial) distribution. With increasing n, this distribution converges to a Gaussian form.

Additive processes are very common in physics. For example, when a drop of ink disperses in a volume of water, the ink molecules spread by colliding with the water molecules. This Brownian motion creates a Gaussian distribution, in which most ink molecules remain near the original location (μ), with wide tails in other directions.

The binomial, Poisson, negative binomial, multinomial, uniform and univariate normal distributions are but a small selection from an infinite space of probability distributions. These particular distributions were given a specific name because they commonly occur in nature. However the majority of probability distributions do not fall into a specific parametric category. For example, the bivariate distribution of Old Faithful eruption gaps and durations (Figure 2.15) is not really captured by any of the aforementioned distributions. In fact, it is quite easy to invent one’s own distributions. Here are four examples of such creations in two-dimensional data space:

Figure 7.7: Four synthetic bivariate continuous distributions, defined by black areas and lines. White areas are excluded from the distributions.

Let us collect n = 100 random (X,Y ) samples from these four distributions and plot them as four scatter plots:

Figure 7.8: It is easy to recognise the probability distributions of Figure 7.7 in the scatter plots of n = 100 random points selected from them.

Next, we can calculate the sum of all the sampled points in each of the four panels in Figure 7.8. This gives rise to four new pairs of coordinates. Repeating this experiment N = 200 times and plotting the four resulting datasets as scatter plots:

Figure 7.9: Four scatter plots with N = 200 points, each of which represents the sum of a random sample of n = 100 points drawn from Figure 7.7.

Despite the completely different appearance of the four parent distributions (Figure 7.7) and samples (Figure 7.8), the distributions of their sums (Figure 7.9) all look very similar. They consist of an elliptical point cloud that is dense in the middle and thins out towards the edges. The density of the points per unit area is accurately described by a bivariate Gaussian distribution:

( ) [ ][ σ2 σx,y]− 1[x − μx] exp − (x− μx ) (y − μy) σ x σ2 y − μ /2 f(x,y|μ ,μ ,σ ,σ ,σ ) = ---------------------∘------x,y--y----------y------ x y x y x,y ||σ2 σ || 2π || x x,2y|| σx,y σy

(7.7)

This matrix expression is completely described by five parameters: the means μ_x and μ_y, the standard deviations σ_x and σ_y, and the covariance σ_x,y. One-dimensional projections of the data on the X- and Y-axis yield two univariate Gaussian distributions.

Figure 7.10: The main panel shows population 4 of Figure 7.9 as grey circles, and the best fitting bivariate Gaussian distribution as contours. The side panels show the marginal distributions of the X- and Y -variable, which are both univariate Gaussian. The means and standard deviations of the marginal distributions equal the means and the standard deviations of the bivariate distribution. The bivariate distribution has a fifth parameter, the covariance σ_x,y, which controls the angle at which the elliptical contours are rotated relative to the axes of the diagram. The significance of these parameters is further explored in Section 7.3.

The univariate normal distribution is completely controlled by two parameters:

1.: The mean μ controls the location of the distribution. Because the normal distribution is unimodal and symmetric, the mean also equals the median and the mode.
2.: The standard deviation σ quantifies the dispersion of the distribution.

Figure 7.11: PDFs of the univariate normal distribution for different values of μ and σ. μ controls the position and σ the width of the distribution. By definition, the area under the PDF always remains the same (i.e. ∫ _−∞^+∞f(x) dx = 1).

The interval from μ−σ to μ + σ covers 68.27% of the area under the PDF, and the interval from μ− 2σ to μ + 2σ covers 95.45%. Conversely 95% of the area under the normal PDF is contained within an interval of μ ± 1.96σ.

Figure 7.12: PDF (a) and CDF (b) of the normal distribution. The μ ± σ and μ ± 2σ intervals cover ∼68% and ∼95% of the distribution, respectively.

3.: The covariance σ_x,y controls the degree of correlation between two variables (x,y) in a bivariate normal distribution.

Figure 7.13: The standard deviations σ_x and σ_y, and covariance σ_x,y control the shape and dispersion of the bivariate normal distribution.

The 1σ contour around the mean of a bivariate normal distribution only covers 39% of the probability instead of 68% for the univariate normal distribution, and the 2σ interval covers 86% of the probability instead of the 95% of the univariate distribution.

μ and σ are unknown but can be estimated from the data. Just like the binomial parameter p (Section 5.1) and the Poisson parameter λ (Section 6.2), this can be done using the method of maximum likelihood. Given n data points {x₁,x₂,…,x_n}, and using the multiplication rule, we can formulate the normal likelihood function as

∏n ℒ (μ,σ|x1,x2,...,xn ) = f (xi|μ,σ) i=1

(7.8)

μ and σ can be estimated by maximising the likelihood or, equivalently, the log-likelihood:

∑n ℒ ℒ(μ,σ |x ,x ,...,x ) = ln[f(x |μ,σ )] 1 2 n i=1 i n 2 = ∑ − ln[σ ]− 1-ln[2π]− (xi-−-μ)-- i=1 2 2σ2

(7.9)

Taking the derivative of ℒℒ with respect to μ and setting it to zero:

∂ ℒℒ || ∑n xi − ˆμ -∂μ--|| = − --σ2--= 0 μˆ i=1 ∑n ⇒ nˆμ − xi = 0 i=1 1 ∑n ⇒ ˆμ = -- xi n i=1

(7.10)

which is the same as Equation 3.1. Using the same strategy to estimate σ:

|| ∑n 2 ∂ℒ-ℒ| = − 1-+ (xi-−-μ)-= 0 ∂σ |ˆσ i=1 ˆσ σˆ3 ∑n 2 ⇒ (xi −-μ-)-= n- i=1 ˆσ3 σˆ ┌│ -------------- │∘ 1-∑n 2 ⇒ σˆ= n (xi − μ) i=1

(7.11)

which is almost the same as the formula for the standard deviation that we saw in Section 3.2 (Equation 3.3):

┌│ -------n--------- s[x] = │∘ --1-- ∑ (x − ¯x)2 n − 1 i i=1

(7.12)

There are just two differences between Equations 3.3/7.12 and Equation 7.11:

1.: Equation 7.11 uses the population mean μ, whereas Equation 3.3 uses the sample mean x.
2.: Equation 7.11 divides the sum of the squared differences between the measurements and the mean by n, whereas Equation 3.3 divides it by (n − 1).

The two differences are related to each other. The subtraction of 1 from n is called the Bessel correction and accounts for the fact that by using an estimate of the mean (x), rather than the true value of the mean (μ), we introduce an additional source of uncertainty in the estimate of the standard deviation. This additional uncertainty is accounted for by subtracting one degree of freedom from the model fit.

Finally, for multivariate normal datasets, we can show that (proof omitted):

n ∑ 1- ˆσx,y = n (xi − μx)(yi − μy) i=1

(7.13)

or, if μ_x and μ_y are unknown and must be estimated from the data as well:

∑n s[x,y] = --1-- (xi − ¯x)(yi − ¯y) i=1n − 1

(7.14)

Suppose that the extinction of the dinosaurs has been dated at 65 Ma in one field location, and a meteorite impact has been dated at 64 Ma elsewhere. These two numbers are effectively meaningless in the absence of an estimate of precision. Taken at face value, the dates imply that the meteorite impact took place 1 million years after the mass extinction, which rules out a causal relationship between the two events. However, if the analytical uncertainty of the age estimates is significantly greater than 1 Myr, then such a causal relationship remains plausible. There are two aspects of analytical uncertainty that need to be considered:

accuracy is the closeness of a statistical estimate to its true (but unknown) value.
precision is the closeness of multiple measurements to each other.

Figure 8.1: Darts board illustration of accuracy (closeness to the centre) and precision (closeness of the measurements). The best case scenario combines high precision with high accuracy. The worst case scenario combines high precision with low accuracy. This is the worst possible situation because the high precision gives false confidence in the data.

Accuracy can be assessed by analysing reference materials (or ‘secondary standards’) whose true parameter values are known through independent means. This procedure involves little statistics and won’t be discussed further in this chapter. Quantifying precision is a more involved process that is also known as error propagation. This procedure will be discussed in some detail in the following sections.

Suppose that a physical quantity (z) is calculated as a function (g) of some measurements (x):

z = g(x)

(8.1)

and suppose that replicate measurements of x (x_i, for i = 1...n) follow a normal distribution with mean x and standard deviation s[x]. Then these values can be used to estimate s[z], the standard deviation of the calculated value z. If the function g is (approximately) linear in the vicinity of x, then small deviations (x − x_i) of the measured parameter x_i from the mean value x are proportional to small deviations (z − z_i) of the estimated quantity z from the mean value z = g(x).

Recall the definition of the sample standard deviation and variance (Equation 7.12):

2 --1--∑n 2 s[z] ≡ n− 1 (zi − ¯z) i=1

(8.2)

Let ∂z∕∂x be the slope of the function g with respect to the measurements x (evaluated at x)¹ then:

∂z (zi − ¯z) ≈---(xi − ¯x) ∂x

(8.3)

Figure 8.2: Error propagation of two linear functions, g₁ (black line) and g₂ (grey line). The input data {x_i} are normally distributed around some average value x. Error propagation estimates the dispersion of the inferred quantity z_i from the dispersion of the measurements x_i. Deviations (x_i−x) from the average x-value are reduced (for function g₁) or magnified (for function g₂) depending on the slope of the functions, resulting in deviations of the dependent variable that are smaller (z_i1 −z₁) or greater (z_i2 −z_i) than (x_i −x).

Plugging Equation 8.3 into 8.2, we obtain:

n [ ] 2 --1-- ∑ -- ∂z-2 s[z] ≈ n − 1 (xi − x )∂x [ ] i=[1 ] [ ] ∂z 2 1 ∑n --2 ∂z 2 2 = ∂x- n-−-1 (xi − x ) = ∂x- s[x] | | i=1 ||∂z-|| ⇒ s[z] = |∂x |s[x]

(8.4)

s[z] is the standard error of z, i.e. the estimated standard deviation of the inferred quantity z.

Equation 8.4 is the general equation for the propagation of uncertainty with one variable. Next, let us move on to multivariate problems. Suppose that the our estimated quantity (z) is calculated as a function (g) of two measurements (x and y):

z = g(x,y)

(8.5)

and further suppose that x and y follow a bivariate normal distribution (Equation 7.7), then μ_x, μ_y, σ_x, σ_y and σ_x,y can all be estimated (as x,ȳ,s[x],s[y] and s[x,y]) from the input data ({x_i,y_i}, for i = 1...n). These values can then be used to infer s[z], the standard error of z, in exactly the same way as for the one dimensional case.

Differentiating g with respect to x and y:

z − z-≈ (x − x)-∂z + (y − y)∂z- i i ∂x i ∂y

(8.6)

Plugging Equation 8.6 into Equation 8.2:

1 ∑n [ ∂z ∂z ]2 s[z]2 ≈ ----- (xi − x)---+ (yi − y)-- n− 1 i=1 ∂x ∂y

(8.7)

After some rearranging (similar the derivation of Equation 8.4), this leads to:

( ) ( ) 2 2 ∂z- 2 2 ∂z- 2 -∂z∂z- s[z] ≈ s[x] ∂x + s[y] ∂y + 2 s[x, y]∂x ∂y

(8.8)

This is the general equation for the propagation of uncertainty with two variables, which can also be written in a matrix form:

[ ][ 2 ][ ∂z] s[z]2 ≈ ∂z ∂z s[x ] s[x,y] ∂x ∂x ∂y s[x,y] s[y]2 ∂∂zy

(8.9)

where the innermost matrix is known as the variance-covariance matrix and the outermost matrix (and its transpose) as the Jacobian matrix. The advantage of the matrix formulation is that it can easily be scaled up to three or more dimensions.

Figure 8.3: Error propagation of a bivariate function z = g(x,y). The uncertainty in z is approximately proportional to the slope of the surface g w.r.t. the measurements x and y. If the curvature of the 3D surface is minor relative to the uncertainties, then said surface can be locally approximated by a plane. Using such a linear approximation, the size of the error ellipses (s[x], s[y]) around the mean measurements (black dots) linearly scales with corresponding deviations in the inferred quantity (s[z]).

Equation 8.8/8.9 is the generic error propagation formula. This section will apply this formula to some common mathematical functions. It will use the following notation:

a,b,c,… are constants, i.e. values that are known without uncertainty (s[a] = s[b] = s[c] = … = 0);
x and y are measurements whose uncertainties (s[x] and s[y]) were estimated from replicates;
z = g(x,y) is the estimated quantity.

For example, z = x∕2 + y³ can be written as z = ax + y^b, where a = 1∕2 and b = 3.

1.

addition:

z = a + bx+ cy

Taking the derivatives of z with respect to x and y:

∂z ∂z ---= b and ---= c ∂x ∂y

Plugging these into Equation 8.8:

s[z]2 = b2s[x]2 + c2s[y]2 + 2bc s[x,y]

(8.10)

If x and y are uncorrelated (i.e., s[x,y] = 0), then the variance of the sum equals the sum of the variances.

2.

subtraction:

z = ax− by

The partial derivatives of z with respect to x and y are

∂z- ∂z- ∂x = a and ∂y = − b

Plugging these into Equation 8.8:

s[z]2 = a2s[x]2 + b2s[y]2 − 2ab s[x,y]

(8.11)

Note that, if x and y are uncorrelated, then Equations 8.10 and 8.11 are identical.

3.

multiplication:

z = axy

The partial derivatives are

∂z-= ay and ∂z-= ax ∂x ∂y

Plugging these into Equation 8.8:

s[z]2 = (ay)2s[x]2 + (ax)2s[y]2 + 2(ay)(ax ) s[x,y]

Dividing both sides of this equation by z² = (axy)²:

(s[z])2 ( ay )2 ( ax )2 ( ay ) ( ax ) ---- = ---s[x] + ---s[y] + 2 ---- ---- s[x,y] z axy axy axy axy

which simplifies to:

( ) ( ) ( ) s[z] 2 s[x] 2 s[y]- 2 s[x,y]- z = x + y + 2 xy

(8.12)

(s[x]∕x) and (s[y]∕y) represent the relative standard deviations of x and y. These are also known as the coefficients of variation (CoV). If x and y are uncorrelated, then the squared CoVs of a product equals the sum of the squared CoVs.

4.

division:

x z = ay-

The partial derivatives are

∂z a ∂z ax ∂x- = y-and ∂y- = − y2-

Plugging these into Equation 8.8:

2 ( a)2 2 ( ax)2 2 ( a) ( ax) s[z] = y- s[x] + − y2- s[y] + 2 y- − y2- s[x,y]

Dividing both sides of this equation by z² = (ax∕y) ²:

(s[z])2 ( a y )2 ( ax y )2 ( a y ) ( ax y) ---- = ---- s[x]2 + − -2---- s[y]2 + 2 ---- − -2---- s[x,y] z yax y ax yax y ax

which simplifies to:

( ) ( ) ( ) s[z] 2 s[x] 2 s[y]- 2 s[x,y]- z = x + y − 2 xy

(8.13)

If x and y are uncorrelated, then the uncertainty of the quotient (Equation 8.13) equals the uncertainty of the product (Equation 8.12).

5.

exponentiation:

bx z = ae

The partial derivative of z w.r.t. x is

∂z- bx ∂x = abe

Plugging this into Equation 8.8:

( ) s[z]2 = abebx 2 s[x]2

Dividing both sides by z² = ( bx)
ae ²:

( s[z])2 ( abebx )2 ---- = --bx-s[x ] z ae

which simplifies to

( ) s[z]- 2 2 2 z = b s[x]

(8.14)

6.

logarithms:

z = aln[bx]

The partial derivative of z w.r.t. x is

∂z- a- ∂x = x

Plugging this into Equation 8.8:

( s[x])2 s[z]2 = a2 ---- x

(8.15)

7.

power:

z = axb

The partial derivative of z w.r.t. x is

∂z-= abxb−1 ∂x

Plugging this into Equation 8.8:

2 ( b−1)2 2 s[z] = abx s[x]

Dividing both sides by z² = ( )
axb ²:

( s[z])2 ( abxb−1 )2 -z-- = --axb--s[x]

which simplifies to

( )2 ( )2 s[z]- = b2 s[x] z x

(8.16)

8.

other:

Error propagation for more complicated functions can either be derived from Equation 8.8 directly, or can be done with Equations 8.10–8.16 using the chain rule. For example, consider the following equation:

d = d + v t + gt2 0 0

(8.17)

which describes the distance d travelled by an object as a function of time t, where d₀ is the position at t = 0, v₀ is the velocity at t = 0, and g is the acceleration. Although Equation 8.17 does not directly fit into any of the formulations that we have derived thus far, it is easy to define two new functions that do. Let

x ≡ d0 + v0t

(8.18)

and

2 y ≡ gt

(8.19)

then Equation 8.18 matches with the formula for addition (Equation 8.10):

z = a + bx+ cy

where a ≡ d₀, b ≡ v₀, x ≡ t, c ≡ 0 and y is undefined. Then uncertainty propagation of Equation 8.18 using Equation 8.10 gives:

s[x]2 = (v0s[t])2

(8.20)

Similarly, Equation 8.19 matches with the formula for powering (Equation 8.16):

b z = ax

where a ≡ g, b ≡ 2 and x ≡ t. Applying Equation 8.16 to Equation 8.19 yields:

( ) ( ) ( ) s[y]- 2 2 s[t] 2 2 2s[t] 2 2 y = 2 t ⇒ s[y] = 2gt t = (2gts[t])

(8.21)

Combining Equations 8.18 and 8.19 turns Equation 8.17 into a simple sum:

d = x+ y

whose uncertainty can be propagated with Equation 8.10:

s[d]2 = s[x]2 + s[y]2

Substituting Equation 8.20 for s[x]² and Equation 8.21 for s[y]²:

s[d]2 = (v0s[t])2 + (2gts[t])2

which leads to the following expression for the uncertainty of d:

∘ ------------------- ∘ ---------- 2 2 2 2 s[d] = (v0s[t]) + (2gts[t]) = s[t] v0 + (2gt)

(8.22)

To illustrate the use of this formula, suppose that d₀ = 0 m, v₀ = 10 m/s and g = 9.81 m/s². Further suppose that we measure the time t with a watch that has a 1/10^th of a second precision² (s[t] = 0.1 s). Then we can predict how far the object will have travelled after 5 seconds:

m m 2 d = 0 m + 10-s × 5 s + 9.81 s2 × (5 s) = 295.25 m

(8.23)

Using Equation 8.22, the uncertainty³ of d is given by:

∘ ----------------------------- ( m-)2 ( m- )2 s[d] = 0.1 s× 10 s + 2× 9.81 s2 × 5 s = 9.861 m

(8.24)

Thus the predicted displacement after 5 seconds can be reported as 295.250 ± 9.861 m, or as 295.2 ± 9.9 m if we round the estimate to two significant digits. Note how Equations 8.23 and 8.24 specifies the units of all the variables. Checking that these units are balanced is good practice that avoids arithmetic errors.

As defined in Section 8.1, the standard error is the estimated standard deviation of some derived quantity obtained by error propagation. The mean of set of numbers is an example of such a derived quantity, and its estimated uncertainty is called the standard error of the mean. Let {x₁,x₂,…x_n} be n measurements of some quantity x, and let x be its mean (Equation 3.1):

∑n xi 1 ∑n ¯x = n- = n- xi i=1 i=1

Applying the error propagation formula for a sum (Equation 8.10):

n ( )2 ( )2 n 2 ∑ s[xi] 1- ∑ 2 s[¯x] = n = n s[xi] i=1 i=1

If all the x_is were drawn from the same normal distribution with standard deviation s[x], then

( ) ( ) 2 1- 2 ∑n 2 1- 2 2 s[¯x] = n s[x] = n n s[x] i=1

which simplifies to

s[¯x] = s√[x]- n

(8.25)

The standard error of the mean monotonically decreases with the square root of sample size. In other words, we can arbitrarily increase the precision of our analytical data by acquiring more data. However, it is important to note that the same is generally not the case for the accuracy of those data (Figure 8.1). To illustrate the effect of the square root rule, consider the statistics of human height as an example. The distribution of the heights of adult people is approximately normal with a mean of 165 cm and a standard deviation of 10 cm. There about 5 billion adult humans on the planet. Averaging their heights should produce a value of 165 cm with a standard error of 10∕ √5-×-109 = 1.4 × 10⁻⁴ cm. So even though there is a lot of dispersion among the heights of humans, the standard error of the mean is only 1.5 microns.

We used the method of maximum likelihood to estimate the parameters of the binomial (Section 5.1), Poisson (Section 6.2) and normal (Section 7.4) distributions. An extension of the same method can be used to estimate the standard errors of the parameters without any other information. Let ẑ be the maximum likelihood estimate of some parameter z. We can approximate the log-likelihood with a second order Taylor series in the vicinity of ẑ:

| | ∂ℒ-ℒ-|| 1-∂2ℒ-ℒ|| 2 ℒ ℒ(z) ≈ ℒℒ (ˆz) + ∂z |ˆz(z − ˆz)+ 2 ∂z2 |ˆz(z − ˆz)

By definition, ∂ℒℒ∕∂z = 0 at ẑ. Therefore, the likelihood (ℒ(z) = exp[ℒℒ(z)]) is proportional to:

[1 ∂2ℒℒ || ] ℒ(z) ∝ exp -- ---2-||(z − ˆz)2 2 ∂z zˆ

which can also be written as:

⌊ ⌋ 1 (z − ˆz)2 ℒ (z) ∝ exp|⌈ −------1---|⌉ 2 − ∂2ℒℒ||| ∂z2 ˆz

This equation fits the functional form of the normal distribution (Equation 7.6):

[ 2 ] ℒ(z) ∝ exp − 1(z-−-ˆz)- 2 σ[z ]2

which leads to

1 ˆσ [z]2 = ---2--|-- − ∂∂ℒzℒ2|| ˆz

(8.26)

− ∂2ℒℒ||
-∂z2-| _ẑ is known as the Fisher Information. Equation 8.26 can be generalised to multiple dimensions:

ˆΣ = − ℋ −1

(8.27)

where is the estimated covariance matrix and (ℋ)⁻¹ is the inverse of the (‘Hessian’) matrix of second derivatives of the log-likelihood function with respect to the parameters.

To illustrate the usefulness of Equation 8.26, let us apply it to the Poisson distribution. Recalling the log-likelihood function (Equation 6.4) and denoting the maximum likelihood estimate of the parameter by :

ˆ ˆ ˆ ∑k ℒℒ(λ|k) = klog[λ]− λ − i i=1

Taking the second derivative of ℒℒ with respect to λ:

2 || ∂-ℒℒ-| = − k-- ∂ λ2 |ˆλ ˆλ2

(8.28)

Plugging Equation 8.28 into 8.26:

2 ˆλ2- ˆσ[λ] = k

Recalling that = k (Equation 6.8), we get:

2 ˆ ˆσ[λ] = λ

(8.29)

Thus we have proved that the variance of a Poisson variable equals its mean, which was already shown empirically in Chapter 6.

Using the same rationale to estimate the standard error of a binomial variable, we take the logarithm of Equation 5.3:

( ) ℒ ℒ(p|n, k) = ln n + kln(p)+ (n − k)ln(1− p) k

the second derivative of which is:

∂2ℒ-ℒ- k- -n-−-k-- ∂p2 = − p2 − (1− p)2

At the maximum likelihood estimate = n∕k (Equation 5.4), this becomes:

∂2ℒ ℒ|| nˆp n− n ˆp n ---2-|| = − -2-− ------2-= − -------- ∂p ˆp ˆp (1− pˆ) ˆp(1− ˆp)

so that the variance of p is:

ˆσ[p]2 = ˆp(1−--ˆp) n

which proves Equation 5.9.

The previous chapters have introduced a plethora of parametric distributions that allow us to test hypotheses and assess the precision of experimental results. However, these inferences are only as strong as the assumptions on which they are based. For example, Chapter 5 used a binomial distribution to assess the occurrence of gold in a prospecting area, assuming that the gold was randomly distributed across all the claims; and Chapter 6 used a Poisson distribution to model earthquakes, assuming that the earthquake catalog was free of clusters and that all aftershocks had been removed from it. This chapter will introduce some strategies to test these assumptions, both graphically and numerically.

As the name suggests, a quantile-quantile or Q-Q plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other. Q-Q plots set out the quantiles of a sample against those of a theoretical distribution, or against the quantiles of another sample. For example, comparing the Old Faithful eruption duration dataset to a normal distribution:

Figure 9.1: Construction of a Q-Q plot. a) is the ECDF of the Old Faithful eruption durations, rotated 90 degrees counterclockwise; b) is the actual Q-Q plot comparing the eruption durations with a normal distribution; c) is the CDF of a standard normal distribution. If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on the line y = x. Otherwise they will not. In this example the distribution of eruption durations clearly does not follow a normal distribution.

We can use the Q-Q plot to illustrate the effect of the Central Limit Theorem (Figure 7.2):

Figure 9.2: Q-Q plots of a) the original Old Faithful eruption durations; and 500 sums of b) n = 2; c) n = 10; and d) n = 10 randomly selected eruption durations. Increasing n brings the Q-Q plot ever closer to the 1:1 line, visually confirming the emergence of a normal distribution. The sample distribution for n = 10 only deviates from the theoretical distribution at the most extreme quantiles. This indicates that the sample distribution has heavier tails than the normal distribution. This phenomenon will be discussed further in the next section. Increasing n further would remove this effect near the tails and bring the sample distribution even closer to the normal distribution.

Q-Q plots cannot only be used to compare sample distributions with theoretically predicted parametric distributions, but also to compare one sample with another. For example:

Figure 9.3: Q-Q plot comparing datasets x₁ = sum(X₁) and x₂ = sum(X₂) of Figure 7.9. Even though the two datasets have markedly different means (x₁ = 50.0 and x₂ = 59.9) and slightly different standard deviations (s[x₁] = 3.3 and s[x₂] = 3.1), the quantiles of the two datasets plot along a straight line. This means that their distributions are identical in shape.

The Q-Q plot in Figure 9.3 compared two samples that were normally distributed with different means. However, it may not be clear if the difference between the means is statistically significant or not. Before we address this problem, let us first look at a related but slightly simpler problem. Consider the density of 5 gold coins as an example:

coin #	1	2	3	4	5

density (g/cm³)	19.09	19.17	19.31	19.07	19.18

The density of pure gold is 19.30 g/cm³. We might ask ourselves the question if the five coins are made of pure gold, or if they consist of an alloy that mixes gold with a less dense metal? To answer this question, we assume that the sample mean x follows a normal distribution with mean μ and standard deviation σ∕ √n- (following the same derivation as Equation 8.25). Thus, the parameter

z = -¯x−√μ-- σ ∕ n

(9.1)

follows a standard normal distribution whose mean is zero and whose standard deviation is one. As we have seen in Section 7.4, σ is unknown but can be estimated by the sample standard deviation s[x]. We can then replace Equation 9.1 with a new parameter

x¯− μ t =-----√-- s[x]∕ n

(9.2)

However, t does not follow a normal distribution but a Student t-distribution with (n − 1) degrees of freedom where the (n− 1) plays a similar role as the Bessel correction of Section 7.4. It accounts for the ‘double use’ of the data to estimate both the mean and the standard deviation of the data. The t-distribution forms the basis of a statistical test that follows the same sequence of steps as used in Sections 5.2 and 6.3.

1.

Formulate two hypotheses:

H₀ (null hypothesis)

H_A (alternative hypothesis):

μ = 19.30

μ < 19.30

2.

Calculate the following test statistic:

-¯x-−-μ∘- t = s[x]∕√n

(9.3)

where x = 19.164, μ_∘ = 19.30, s[x] = 0.0948, and n = 5 so that t = −3.2091.

3.

Under H₀, Equation 9.3 follows a Student t-distribution with 4 degrees of freedom. Tabulating some key quantiles of this distribution:

t	-3.75	-3.2091	-2.78	-2.13	-1.53	-0.74	0	0.74	1.53	2.13	2.78	3.75

P(T ≤ t)	0.01	0.0163	0.025	0.05	0.1	0.25	0.5	0.75	0.9	0.95	0.975	0.99

where the observed value is marked in italics.

4.

We will use the usual confidence level α = 0.05.

5.

Marking the rejection region (P(T < t) < α) in bold:

t	-3.75	-3.2091	-2.78	-2.13	-1.53	-0.74	0	0.74	1.53	2.13	2.78	3.75

P(T ≤ t)	0.01	0.0163	0.025	0.05	0.1	0.25	0.5	0.75	0.9	0.95	0.975	0.99

The one-sided rejection region consists of all t < −2.10.

6.

Because the observed value of t = −3.2091 falls inside the rejection region, the null hypothesis is rejected.

We therefore conclude that the coins are not made of pure gold. Here is a graphical representation of this test:

Figure 9.4: a) PDF and b) CDF of a t-distribution with 4 degrees of freedom. The one-sided rejection region (for α = 0.05) is marked in black. The vertical dashed line marks the observed value of t = −3.2091. This line plots in the rejection region, leading to the conclusion that μ < 19.30 g/cm³. The horizontal dashed line marks the p-value (0.0163 < α).

The comparison between the mean of a sample (of coin densities) and a particular value (19.30 g/cm³) is called a one sample t-test. If we want to compare the mean densities of two samples, then this requires a two sample t-test. For example, consider the following two collections of coins:

coin #	1	2	3	4	5

density (1^st collection)	19.09	19.17	19.31	19.07	19.18
density (2^nd collection)	19.30	19.33	19.15	19.32

The average densities of collection 1 and 2 are 19.164 g/cm³ and 19.275 g/cm³, respectively. If we assume that the two collections have the same variance, then we can test whether the difference between the two means is significant or not.

1.

Formulate two hypotheses:

H₀ (null hypothesis)

H_A (alternative hypothesis):

μ₁ = μ₂

μ₁≠μ₂

2.

Calculate the following test statistic:

¯x − ¯x t =--∘-1---2--- sp 1n1 + n12

(9.4)

where x₁ = 19.164, n₁ = 5, x₂ = 19.275, n₂ = 4, and s_p is the pooled standard deviation:

∘ ----------------------------- s = (n1 −-1)s[x1]2-+-(n2 −-1)s[x2]2 p n1 + n2 − 2

(9.5)

in which s[x₁] = 0.095 and s[x₂] = 0.084 are the standard deviations of the first and second coin collection, respectively. Plugging the data into equations 9.4 and 9.5 yields t = −1.857.

3.

Under H₀, Equation 9.4 follows a Student t-distribution with (n₁ + n₂ − 2) degrees of freedom¹ . Tabulating some key quantiles of the t-distribution with 7 degrees of freedom:

t	-3.00	-2.36	-1.89	-1.857	-1.41	-0.71	0.00	0.71	1.41	1.89	2.36	3.00

P(T ≤ t)	0.01	0.025	0.05	0.053	0.1	0.25	0.5	0.75	0.9	0.95	0.975	0.99
P(T ≥ t)	0.99	0.975	0.95	0.947	0.9	0.75	0.5	0.25	0.1	0.05	0.025	0.01

where the observed value is marked in italics.

4.

We will use the conventional confidence level α = 0.05. But because we are carrying out a two-sided test, we will have two cutoff regions marked by α∕2 and (1 − α∕2), respectively.

5.

Marking the rejection regions in bold:

t	-3.00	-2.36	-1.89	-1.857	-1.41	-0.71	0.00	0.71	1.41	1.89	2.36	3.00

P(T ≤ t)	0.01	0.025	0.05	0.053	0.1	0.25	0.5	0.75	0.9	0.95	0.975	0.99
P(T ≥ t)	0.99	0.975	0.95	0.947	0.9	0.75	0.5	0.25	0.1	0.05	0.025	0.01

The two-sided rejection region consists of all t < −2.40 and all t > 2.40.

6.

Because the observed value of t = −1.857 falls outside the rejection region, we cannot reject the null hypothesis.

Figure 9.5: a) PDF and b) CDF of a t-distribution with 7 degrees of freedom. The two-sided rejection region (for α = 0.05) is marked in black. The vertical dashed line marks the observed value of t = −1.857 and plots outside the rejection region. Therefore the test does not allow us to conclude that μ₁≠μ₂.

Section 9.2 introduced a powerful way to test whether the mean of a sample was equal to a particular value, or to the mean of another sample. However, Section 5.5 showed that formalised tests such as the t-test have limited practical value. Provided that the sample is large enough, its mean will nearly always be ‘significantly’ different than that of another sample. Suppose that we were to average the densities of not five but five millions coins, then it would be extremely unlikely for the mean to be exactly 19.30 g/cm³. With a sample size of five million, the power of the t-test would be such that even trace amounts of a lighter contaminant would have a detectable effect on the density.

Instead of asking ourselves whether the coins have exactly the same density as pure gold, it is more useful to know what the mean density actually is, and to construct a confidence interval for it. We could then use this information to learn something about the composition of the coins. To construct a confidence interval for the mean, we follow a similar procedure as laid out in Sections 5.6 and 6.5. Let us use the first set of coins as an example, and recall that the one sample t-statistic is defined as (Equation 9.3):

-x¯−-μ√-- t = s[x]∕ n

By definition, the 95% confidence interval is the collection of all those values of μ for which

tdf,α∕2 ≤ t ≤ tdf,1−α∕2

where t_df,α∕2 and t_{df,1−α∕2} are the α∕2 and (1 − α∕2) quantiles of a t-distribution with df degrees of freedom, respectively. Hence:

tdf,α∕2 ≤--¯x−-μ√---≤ tdf,1− α∕2 s[x]∕ n

Rearranging:

¯x− t s√[x]≥ μ ≥ ¯x− t s√[x] df,α∕2 n df,1−α∕2 n

Because the t-distribution is symmetric around zero, we can also write:

tdf,1−α∕2 = − tdf,α∕2

Hence

s[x] s[x ] ¯x + tdf,α∕2√n--≤ μ ≤ ¯x− tdf,α∕2√n--

{ s[x]} μ ∈ ¯x± tdf,α∕2√--- n

(9.6)

For the gold coin example of Section 9.2, x = 19.164, s[x] = 0.0948, df = 4 and t_4,0.025 = −2.776. Hence, the 95% confidence interval for μ is 19.16 ± 0.12 g/cm³. Note that this interval does not overlap with the density of pure gold (19.30 g/cm³), confirming again that the coins are not made of pure gold. However, the upper limit of the 95% confidence interval is 19.28 g/cm³, which is not far off the 19.30 g/cm³ value. Therefore, it is possible that the amount of light contaminant is minor.

With increasing sample size, the t-distribution converges towards the normal distribution:

Figure 9.6: a) PDFs and b) CDFs of the t-distribution for three different degrees of freedom (df). For small sample sizes (low df), the t-distribution has long tails towards low and high values. With increasing sample size, the tails become shorter and the t-distribution sharper. When df > 30, the t-distribution is almost indistinguishable from a standard normal distribution with μ = 0 and σ = 1.

Evaluating t_df,0.975 for different values of df:

df	1	2	3	4	5	6	7	8	9	10	30	100	1000
t_df,0.975	12.710	4.303	3.182	2.776	2.571	2.447	2.365	2.306	2.262	2.228	2.042	1.984	1.962

For large sample sizes, the 95% percentile of the t-distribution is the same as the 95% percentile of the normal distribution (= 1.962, see Section 7.3). In this case Equation 9.6 simplifies to approximately

μ ∈ {¯x ± 2s[x¯]}

where s[x] is the standard error of the mean (Equation 8.25).

Figure 9.7: The grey and black lines mark the PDFs (a) and CDFs (b) of two normal distributions whose means (vertical dotted lines) are offset by 2 standard errors from the sample average (vertical dashed line). They mark a ∼95% confidence interval for μ. However this simple procedure only works if sample size is large enough for the Central Limit Theorem to apply.

It is important to bear in mind that this procedure only works for sufficiently large samples sizes. For samples sizes of n < 30, the ‘2-sigma’ interval must be replaced with a ‘studentised’ confidence interval (i.e., Equation 9.6).

Comparing the means of two datasets is one way to assess their (dis)similarity. But as we have seen in Chapter 2 and Figure 2.1, summary statistics like the mean do not always capture the data adequately. An alternative approach is to compare the shape of a sample distribution with a theoretical distribution, or with another sample distribution, using a χ² (chi-square) test.

To illustrate this method, let us go back to example 1 of Chapter 6. Figure 6.2 tallied the number of magnitude ≥ 5.0 earthquakes per year from 1917 to 2016. This histogram represents 100 years, with values ranging from 1 to 12 events per year. Based on the similarity of the mean (5.43) and the variance (6.25), Chapter 6 proceeded under the assumption that the data followed a Poisson distribution. The χ²-test allows us to test this assumption more rigorously.

We begin by counting the number of events in each bin of Figure 6.2:

earthquakes/year	0	1	2	3	4	5	6	7	8	9	10	11	12
number of years	0	3	8	13	17	13	14	13	5	8	3	1	2

Next, we calculate the expected number of events per bin using the probability mass function of the Poisson distribution with λ = 5.43 (Equation 6.1):

earthquakes/year (k)	0	1	2	3	4	5	6	7	8	9	10	11	12
N × P(k\|λ = 5.43)	0.438	2.38	6.46	11.7	15.9	17.2	15.6	12.1	8.22	4.96	2.69	1.33	0.601

where N = 100 years. In order for the χ²-test to work, the bins should contain at least 5 items. We can fulfil this requirement by pooling the smallest bins.

earthquakes/year	≤ 2	3	4	5	6	7	8	≥ 9
observed counts	11	13	17	13	14	13	5	14
predicted counts	9.28	11.7	15.9	17.2	15.6	12.1	8.22	9.98

We can then compute the χ²-statistic:

n 2 χ2 = ∑ (Oi-−-Ei)-- Ei i=1

(9.7)

where O_i stands for the observed and E_i for the expected number of counts in the i^th out of n = 8 bins.

Finally we compare the value of χ² with a chi-square distribution with n − 2 degrees of freedom, where one degree of freedom has been removed because we forced ∑ _i=1ⁿE_i = ∑ _i=1ⁿO_i, and another degree of freedom was subtracted because we used the data to estimate λ when predicting the expected counts.

Applying the χ²-test to the earthquake data:

1.

Formulate two hypotheses:

H₀ (null hypothesis): the earthquake data follow a Poisson distribution

H_A (alternative hypothesis): the earthquake data do not follow a Poisson distribution

2.

Calculate the χ²-statistic using Equation 9.7:

2 (11 − 9.28 )2 (13 − 11.7)2 (17 − 15.9)2 (13− 17.2)2 χ =----9.28--- + ----11.7--- + ----15.9---- + ----17.2----+ 2 2 2 2 (14-−-15.6) + (13-−-12.1)- + (5−--8.22)- + (14−-9.98)- = 4.70 15.6 12.1 8.22 9.98

(9.8)

3.

Under H₀, Equation 9.7 follows a χ²-distribution 6 degrees of freedom. Tabulating some key quantiles for this distribution:

χ²	0.872	1.24	1.64	2.20	3.45	4.70	5.35	7.84	10.6	12.6	14.4	16.8

P(X ≤ χ²)	0.01	0.025	0.05	0.1	0.25	0.417	0.5	0.75	0.9	0.95	0.975	0.99
P(X ≥ χ²)	0.99	0.975	0.95	0.9	0.75	0.583	0.5	0.25	0.1	0.05	0.025	0.01

where the observed value is marked in italics.

4.

We will use an α = 0.05 confidence level.

5.

We are only interested in the upper tail of the null distribution, because it represents sample distributions that are very dissimilar from the theoretical distribution. The lower tail of the distribution groups samples that are very similar to the predicted distribution and therefore compatible with the null hypothesis (see Section 9.6 for further discussion). Marking the rejection region in bold:

χ²	0.872	1.24	1.64	2.20	3.45	4.70	5.35	7.84	10.6	12.6	14.4	16.8

P(X ≥ χ²)	0.99	0.975	0.95	0.9	0.75	0.743	0.5	0.25	0.1	0.05	0.025	0.01

The one-sided rejection region consists of all χ² > 12.6.

6.

Because the observed value of χ² = 4.70 falls outside the rejection region, we cannot reject the null hypothesis.

Figure 9.8: a) PDF and b) CDF of the χ²-distribution with 6 degrees of freedom. The rejection region is marked in black. The observed value for the earthquake data (χ² = 4.70) is shown as a vertical dashed line. It plots outside the rejection region, indicating that the histogram of the data falls within the expected range of the hypothesised (Poisson) distribution.

Recall that the t-test of Section 9.2 could either be used to compare the mean of a single sample with a particular value, or to compare the means of two samples. In a similar vein, the χ²-test can be used to compare either one sample to a theoretical distribution, or to compare two samples with each other. For example, let us compare two sets of petrographic observations:

	quartz	plagioclase	alkali feldspar	lithics

sample A	10	5	6	20
sample B	25	12	10	35

Table 9.1: Manual point counting data for two samples of sand.

This is a 2 × 4 contingency table. Note that the two samples contain a different number of clasts.

lithology	quartz	plagioclase	alkali feldspar	lithics	row sum

sample A	10	5	6	20	41
sample B	25	12	10	35	82

column sum	35	17	16	55	123

If the two samples have the same underlying composition, then the expected counts of each cell in the contingency table should be:

(sum of row i) × (sum of column j) expected counts of bin (i,j) =-------(sum-of-all the-cells)------

For example, the expected number of quartz grains in sample A would be

41×-35-= 11.7 123

Applying this formula to the whole table, the expected counts are

lithology	quartz	plagioclase	alkali feldspar	lithics

sample A	11.7	5.67	5.33	18.3
sample B	23.3	11.30	10.70	36.7

Table 9.2: Expected grain counts for two sedimentary samples.

All the cells in this table exceed 5. Therefore, the observed (table 9.1) and expected (table 9.2) counts can be plugged into Equation 9.7 to calculate a χ²-value. The null distribution of this statistic is χ² with (n_r − 1) × (n_c − 1) degrees of freedom, where n_r is the number of rows and n_c is the number of columns. The χ²-test then proceeds in the same way as the one-sample case of Section 9.4:

1.

Formulate two hypotheses:

H₀ (null hypothesis): samples A and B have the same composition

H_A (alternative hypothesis): samples A and B do not have the same composition

2.

Calculate the χ²-statistic with Equation 9.7, using table 9.1 for the observed and table 9.2 for the predicted values:

2 2 2 2 χ2 = (10−-11.7)-+ (5−-5.67)-+ (6-−-5.33)-+ (20-−-18.3)+ 11.7 5.67 5.33 18.3 (25−-23.3)2 (12−-11.30)2 (10-−-10.70)2 (35-−-36.7)2 23.3 + 11.30 + 10.70 + 36.7 = 0.86

(9.9)

3.

Under H₀, Equation 9.7 follows a χ²-distribution with (2 − 1) × (4 − 1) = 3 degrees of freedom. Tabulating some key quantiles for this distribution:

χ²	0.115	0.216	0.352	0.584	0.86	1.21	2.37	4.11	6.25	7.81	9.35	11.3

P(X ≤ χ²)	0.01	0.025	0.05	0.1	0.157	0.25	0.5	0.75	0.9	0.95	0.975	0.99
P(X ≥ χ²)	0.99	0.975	0.95	0.9	0.843	0.75	0.5	0.25	0.1	0.05	0.025	0.01

where the observed value is marked in italics.

4.

The confidence level α = 0.05.

5.

Marking the rejection region in bold:

χ²	0.115	0.216	0.352	0.584	0.86	1.21	2.37	4.11	6.25	7.81	9.35	11.3

P(X ≥ χ²)	0.99	0.975	0.95	0.9	0.843	0.75	0.5	0.25	0.1	0.05	0.025	0.01

The one-sided rejection region consists of all χ² > 7.81.

6.

Because the observed value of χ² = 0.86 falls outside this region, we cannot reject the null hypothesis.

Figure 9.9: a) PDF and b) CDF of the χ²-distribution with 3 degrees of freedom. The rejection region is marked in black. The observed value (χ² = 0.86) is shown as a vertical dashed line. It plots outside the rejection region, indicating that the histogram of the data falls within the expected range of the hypothesised (Poisson) distribution.

The χ²-distribution only covers positive numbers, from 0 to ∞. Low χ²-values indicate a good fit of the data to the proposed distribution, and high χ²-values indicate a bad fit. This is why the χ²-tests of Section 9.4 were one-tailed tests: we want to identify the bad fits in order to reject the null hypothesis. However it would be wrong to completely ignore the good fits.

Section 5.5 made the case that, in general, the desired outcome of a statistical test is the rejection of the null hypothesis. However, in the context of distributional tests, our life is often easier if the null hypothesis is not rejected. For example, if the data pass a χ²-test for a Poisson distribution, then this allows us to model the data with a single number (λ). If the data fail the χ²-test, then we may have to abandon the simplicity of the Poisson distribution and use a more realistic but complex alternative.

The desire to see the data pass a hypothesis test leads some scientists to cherry pick data. This means that they selectively remove perceived ‘outliers’ from the data until the remaining values pass the null hypothesis. It is important to remember that, even if the null hypothesis is true, we should still expect 5% (if α = 0.05) of all samples fail the null hypothesis. That is, there is always a 5% chance of committing a Type-I error (Section 5.4).

If all samples in a study have p-values exceeding 0.05, then this should raise suspicion. For example, comparing a Poisson null distribution (a) with three samples (b–d):

Figure 9.10: a) predicted frequency distribution for the zircon count data of example 2 in Chapter 6, following a Poisson distribution with λ = 3.50 and n = 48; b) – d) three sample distributions with the χ² statistic and p-values for comparison with distribution a; e) χ²-distribution with 5 degrees of freedom. The χ²-test would flag sample d as being ‘significantly different’ from the predicted histogram a. Sample c also looks somewhat different from the predicted distribution a, but this difference falls within the expected range of random sampling variability of the Poisson distribution. Sample b is identical to the prediction a. This should raise suspicion. It is extremely unlikely for a sample to fit the prediction so well.

The first sample is identical to the predicted distribution (χ²-statistic = 0.0, p-value = 1.0). Such an extraordinarily ‘good’ result would be extremely unlikely to happen by chance.

Let us carry out a similar experiment to Section 9.5, but instead of counting a few sedimentary clasts by hand, we task a machine to classify ∼10,000 grains of sand by image recognition:

lithology	quartz	plagioclase	alkali feldspar	lithics

sample A	29544	14424	13706	47864
sample B	29454	14788	13948	47311

Table 9.3: Automated point counting data for two samples of sand.

At first glance, the two samples look very similar in composition. But let us carry out a two-sample χ²-test to be sure.

1.

Formulate two hypotheses:

H₀ (null hypothesis): samples A and B have identical compositions

H_A (alternative hypothesis): samples A and B have different compositions

2.

The expected number of counts for each cell of the contingency table is obtained using the procedure outlined in Section 9.5. Calculate the row and column sums:

lithology	quartz	plagioclase	alkali feldspar	lithics	row sum

sample A	29544	14424	13706	47864	105538
sample B	29454	14788	13948	47311	105501

column sum	58998	29212	27654	95175	211039

and combine them to produce the following predicted counts:

lithology	quartz	plagioclase	alkali feldspar	lithics

sample A	29504	14609	13829	47596
sample B	29494	14603	13825	47579

Table 9.4: Predicted point counts.

Plugging tables 9.3 and 9.4 into Equation 9.7 yields

2 χ = 10.0

3.

Under H₀, the test statistic follows a χ²-distribution 3 degrees of freedom. Tabulating some key quantiles for this distribution:

χ²	0.115	0.216	0.352	0.584	1.21	2.37	4.11	6.25	7.81	9.35	10.0	11.3

P(X ≤ χ²)	0.01	0.025	0.05	0.1	0.25	0.5	0.75	0.9	0.95	0.975	0.9814	0.99
P(X ≥ χ²)	0.99	0.975	0.95	0.9	0.75	0.5	0.25	0.1	0.05	0.025	0.0186	0.010

where the observed value is marked in italics.

4.

The confidence level α = 0.05.

5.

Marking the rejection region in bold:

χ²	0.115	0.216	0.352	0.584	1.21	2.37	4.11	6.25	7.81	9.35	10.0	11.3

P(X ≥ χ²)	0.99	0.975	0.95	0.9	0.75	0.5	0.25	0.1	0.05	0.025	0.0186	0.010

The one-sided rejection region consists of all χ² > 7.81.

6.

The observed value of χ² = 10.0 falls inside this region, and so we reject H₀.

So despite the close similarity of the point counts for samples A and B in table 9.3, the χ²-test convincingly rejects the null hypothesis that they were drawn from the same population. To understand what is going on, we need to go back to Section 5.5. This section explained that the power of a statistical test to evaluate a null hypothesis monotonically increases with sample size.

With more than 100,000 counts per sample, it is not surprising that the χ²-test is able to detect even the tiniest difference between samples A and B. In comparison, the two samples of Section 9.5 only contain 41 and 82 samples, respectively. Consequently, it is more difficult to detect a small difference in composition between them.

The power of a statistical test depends on two things:

1.: Sample size: the larger the sample size, the easier it is to reject a false null hypothesis.
2.: The degree to which the null hypothesis is false: the greater the difference between the underlying populations, the easier it is to recognise this difference in the samples.

The second aspect can be quantified as the effect size. In the case of the χ²-distribution, the effect size is defined as:

┌ ------------- ││ ∑n (oi − ei)2 w = ∘ --------- i=1 ei

(9.10)

where o_i ≡ O_i∕N and e_i ≡ E_i∕N with N ≡∑ _iⁿO_i = ∑ _iⁿE_i. Effect sizes can be small, medium or large:

effect size	small	medium	large

w	0.1	0.3	0.5

For the framework mineral counts,

	lithology	quartz	plagioclase	alkali feldspar	lithics

o =	sample A	0.1400	0.06835	0.06495	0.2268
	sample B	0.1396	0.07007	0.06609	0.2242

and

	lithology	quartz	plagioclase	alkali feldspar	lithics

e =	sample A	0.1398	0.06922	0.06553	0.2255
	sample B	0.1398	0.06920	0.06551	0.2255

Plugging these values into Equation 9.10 yields an effect size of w = 0.00688, which is very small indeed. With a smaller sample size, the difference between A and B would have gone unnoticed. The only reason why the χ²-test failed is the huge size of the two samples. The tiny effect size indicates that, although the difference between samples A and B may be statistically significant, it is not geologically significant.

The t-test and χ²-test make specific parametric assumptions about the data:

The t-test assumes that the population mean follows a normal distribution. This assumption may not be correct for small samples from multimodal distributions. For example, when averaging n = 2 or n = 3 values from the bimodal geyser data (Figures 7.3 and 7.4), the assumption of normality is clearly incorrect.
The χ²-test requires binning the data into a histogram. This makes it well suited for discrete distributions such as the binomial and Poisson distribution. However it is less well adapted to continuous distributions such as the normal distribution. Furthermore, each bin in the histogram requires a sufficient number of counts for the χ²-assumption to be valid. This requirement may not be fulfilled for small samples.

These limitations can be avoided with non-parametric tests, which offer greater flexibility than parametric tests whilst increasing robustness to outliers.

The Wilcoxon test (which is also known as the Mann-Whitney test) is a non-parametric alternative to the t-test. Consider two sets of numbers, representing two different samples:

sample	1	2	3	4	5

A	A₁	A₂	A₃	A₄	A₅
B	B₁	B₂	B₃	B₄

To calculate the test statistic, merge the two samples and rank them. For example, suppose that A₄ < A₁ < B₃ < A₂ < A₅ < B₁ < A₃ < B₄ < B₂. Then

rank	1	2	3	4	5	6	7	8	9

value	A₄	A₁	B₃	A₂	A₅	B₁	A₃	B₄	B₂

If the two samples follow the same distribution, then we would expect the values to be randomly shuffled and evenly distributed on either side of the median. However if the two samples follow different distributions, then their values will be unevenly distributed. The test statistic is given by the sum of the ranks of the smallest sample. In our case, sample A contains 5 and sample B 4 items. We therefore calculate the sum of the ranks of sample B:

W = 3 + 6+ 8+ 9 = 26

For sample sizes of n_A = 5 and n_B = 4, W takes on values between ∑ _i=1⁴i = 10 and ∑ _i=6⁹i = 30. The closer the W-value is to these extremes, the less likely it is that samples A and B were drawn from the same distribution. The hypothesis test is carried out by comparing W with a lookup table. To understand how this lookup table is constructed, let us consider the smallest possible outcome for W, which is W = 10. This outcome corresponds to the following arrangements:

arrangement 1:	B₁	B₂	B₃	B₄	A₁	A₂	A₃	A₄	A₅
arrangement 2:	B₂	B₁	B₃	B₄	A₁	A₂	A₃	A₄	A₅
arrangement 2:	B₁	B₂	B₃	B₄	A₂	A₁	A₃	A₄	A₅
etc.

The total number of arrangements that result in W = 10 is 4!5! (2880 possibilities). To compute the probability of W = 10 under the null hypothesis, we must divide this value by the total number of permutations of the 9 values, which is 9! (see Section 4.1). Therefore:

4!5!- P(W = 10|nA = 5,nB = 4) = 9! = 0.00794

The probability of other outcomes is computed in a similar fashion.

Let us illustrate the Wilcoxon rank-sum test with the two-sample example of table 9.2:

coin #	1	2	3	4	5

density (1^st collection)	19.09	19.17	19.31	19.07	19.18
density (2^nd collection)	19.30	19.33	19.15	19.32

Table 9.5: The same data as table 9.2 but with the second sample marked in bold for future use.

1.

Formulate two hypotheses:

H₀ (null hypothesis)

H_A (alternative hypothesis):

median(sample 1) = median(sample 2)

median(sample 1) ≠ median(sample 2)

2.

To calculate the test statistic, merge the two samples and rank them:

rank	1	2	3	4	5	6	7	8	9

density	19.07	19.09	19.15	19.17	19.18	19.30	19.31	19.32	19.33

where tied values have been given an average rank. The test statistic is then simply the sum of the ranks for the smallest sample:

W =3 + 6+ 8 + 9 = 26

which (intentionally) is the same value as the earlier generic example.

3.

Under H₀, the test statistic follows a Wilcoxon rank-sum distribution for n₁ = 5 and n₂ = 4. Tabulating some key values for this distribution:

W	10	11	12	13	19	20	26	27	28	29	30

P(w ≤ W)	0.008	0.016	0.032	0.056	0.45	0.55	0.944	0.97	0.98	0.99	1.0
P(w ≥ W)	1.0	0.99	0.98	0.97	0.63	0.55	0.095	0.056	0.032	0.016	0.008

where the observed value is marked in italics.

4.

The confidence level α = 0.05.

5.

Marking the two-sided rejection region in bold:

W	10	11	12	13	19	20	26	27	28	29	30

P(w ≤ W)	0.008	0.016	0.032	0.056	0.45	0.55	0.944	0.97	0.98	0.99	1.0
P(w ≥ W)	1.0	0.99	0.98	0.97	0.63	0.55	0.095	0.056	0.032	0.016	0.008

The rejection region consists of W = {10,11,29,30}.

6.

The observed value of W(= 26) falls outside the rejection region, and so we cannot reject H₀.

Figure 9.11: a) PMF and b) CDF of the Wilcoxon test statistic for comparison of two samples containing 5 and 4 items, respectively. The two-sided rejection region is marked in black. The observed value (W = 26) is shown as a vertical dashed line. It plots outside the rejection region, leaving the possibility open that the two samples might have come from the same distribution.

The Kolmogorov-Smirnov test is a non-parametric alternative to the χ²-test for continuous variables that does not require binning. Given two sets of numbers:

x = {x1,x2,...,xn} and y = {y1,y2,...,ym }

the Kolmogorov-Smirnov statistic is defined as the maximum vertical distance between the ECDFs (Section 2.6) of the two samples:

D = mazx |Fx(z) − Fy(z)|

(9.11)

where F_x and F_y are the ECDFs of x and y, respectively. D takes on values from 0 (two identical distributions) and 1 (no overlap between the two distributions).

To illustrate the Kolmogorov-Smirnov method, consider two sand samples from China: one sample from the Yellow River, and one sample from a sand dune in the nearby Mu Us desert. We have separated the mineral zircon from the sand and analysed the crystallisation age of > 100 randomly selected zircon grains with the U–Pb method. The frequency distribution of the zircon dates serves as a characteristic ‘fingerprint’ that can be used to compare the different samples.

Figure 9.12: The two-sample Kolmogorov-Smirnov statistic is the maximum vertical distance between two ECDFs. This example compares the cumulative distributions of 121 detrital zircon U–Pb ages from the Yellow River with 116 detrital zircon U–Pb ages from a sand dune in the adjacent Mu Us desert. The KS-distance is 0.3006.

The hypothesis test then proceeds as always:

1.

Formulate two hypotheses:

H₀ (null hypothesis): samples 1 and 2 were drawn from the same distribution (D = 0).

H_A (alternative hypothesis): samples 1 and 2 were drawn from different distributions (D > 0).

2.

The test statistic is D = 0.30 (see Figure 9.12).

3.

Under H₀, the Kolmogorov-Smirnov statistic can be compared to a similar look-up table as the one that was used for the Wilcoxon test. Tabulating some key quantiles:

D	0.061	0.061	0.070	0.078	0.087	0.104	0.130	0.157	0.174	0.191	0.209	0.301

P(d ≤ D)	0.01	0.025	0.05	0.1	0.25	0.5	0.75	0.9	0.95	0.975	0.99	0.99996
P(d ≥ D)	0.99	0.975	0.95	0.9	0.75	0.5	0.25	0.1	0.05	0.025	0.010	0.000045

where the observed value is marked in italics.

4.

The confidence level α = 0.05.

5.

Just as for the χ²-test, we are only interested in the upper tail of the null distribution, because only high values of the test statistic are incompatible with H₀. Marking the one-sided rejection region in bold:

D	0.061	0.061	0.070	0.078	0.087	0.104	0.130	0.157	0.174	0.191	0.209	0.301

P(d ≥ D)	0.99	0.975	0.95	0.9	0.75	0.5	0.25	0.1	0.05	0.025	0.010	0.000045

The rejection region consists of all D > 0.174.

6.

The observed value of D = 0.301 falls inside this region, so H₀ is rejected.

Figure 9.13: a) PMF and b) CDF of the Kolmogorov-Smirnov statistic for comparison of two samples containing 116 and 121 items, respectively. The vertical dashed lines mark the test statistic for the two sand samples of Figure 9.12. The rejection region is marked in black on the PMF and groups all values of the test statistic that exceed the 95 percentile (D = 0.174). H₀ is rejected.

The Kolmogorov-Smirnov test can not only be used to compare two samples with each other, but also to compare one sample with a theoretical distribution. Here, for the sake of illustration, we compare the dune sample with a normal distribution that has the same mean and standard deviation:

Figure 9.14: Kolmogorov-Smirnov statistic for the comparison of a sample with a normal distribution that has the same mean and standard deviation. In this case D = 0.27, which can be compared with a lookup table for a one sample Kolmogorov-Smirnov test. The outcome of this test (which is not elaborated in these notes) is a rejection of the null hypothesis.

⁸⁷Rb is radioactive and decays to ⁸⁷Sr with a decay constant of λ = 1.42 × 10⁻⁵ Myr^-1. ⁸⁶Sr is a second isotope of Sr that does not have a radioactive parent. Together these three nuclides form the basis of a geochronometer:

[ 87Sr ] [87Sr ] [87Rb ]( λt ) 86Sr-- = 86Sr- + -86Sr- e − 1 ∘

(10.1)

where t is the age of the system, in millions of years, and [⁸⁷Sr∕⁸⁶Sr]_∘ is the non-radiogenic ⁸⁷Sr/⁸⁶Sr ratio, i.e. the ratio that was initially present in the sample at t = 0. When applied to multiple measurements, Equation 10.1 fits the generic formula for a straight line:

yi = β0 + β1 xi

(10.2)

where x_i and y_i are the ⁸⁷Rb/⁸⁶Sr- and ⁸⁷Sr/⁸⁶Sr-ratios of the i^th aliquot (out of n), respectively. x = {x₁,…,x_n} is also known as the independent variable and y = {y₁,…,y_n} as the dependent variable. The following table shows an example of eight Rb–Sr compositions from the same rock:

i	1	2	3	4	5	6	7	8

[⁸⁷Rb∕⁸⁶Sr] = x_i	2.90	7.14	9.10	3.41	1.91	7.15	5.92	8.28
[⁸⁷Sr∕⁸⁶Sr] = y_i	0.745	0.803	0.823	0.737	0.720	0.793	0.789	0.807

Table 10.1: Rb–Sr composition of eight aliquots (1 ≤ i ≤ 8) of the same sample.

Visualising the data on a scatter plot:

Figure 10.1: Isochron plot for the Rb–Sr data. The scatter plot of eight ⁸⁷Rb/⁸⁶Sr- and ⁸⁷Sr/⁸⁶Sr-ratios forms an array of points along a line whose intercept marks the initial ⁸⁷Sr/⁸⁶Sr-composition, and whose slope is a function of the age (t = ln[1 + β₁]∕λ). The linear trend is not perfect due to analytical uncertainty, which has dispersed the data.

The dispersion of the data around the straight line can be captured by a slightly modified version of Equation 10.2:

yi = β0 + β1 xi + 𝜖i

(10.3)

where 𝜖_i is the residual noise around the best fit line which, in this situation, is attributed to the dependent variable only.

The linear trend seems quite strong in the Rb–Sr example, but this may not always be the case. The correlation coefficient is a parameter that can be used to quantify the strength and test the significance of an apparent linear trend.

Recall the definition of standard deviation (σ_x,σ_y) and covariance (σ_x,y) from Section 7.3. Then Pearson’s correlation coefficient is defined as:

ρ = σx,y- σxσy

(10.4)

Section 7.4 showed that σ_x,σ_y and σ_x,y are unknown but can be estimated from the data using Equations 7.12 and 7.14:

┌ ----------------- │ ∑n s[x] = │∘ --1--(xi − ¯x)2 i=1 n− 1 n ∑ --1-- s[x,y] = n − 1 (xi − ¯x)(yi − ¯y) i=1

where x and ȳ are the arithmetic means of x and y, respectively. Then Pearson’s correlation coefficient for samples is defined as:

r = s[x,y]- s[x ]s[y]

(10.5)

Both ρ and r take on values between −1 and +1. It is also common for the degree of correlation to be quantified as r², which produces values between 0 and 1. r² is also known as the coefficient of determination.

Figure 10.2: Four synthetic bivariate normal datasets exhibiting different degrees of correlation between the x- and y-variable. Panel a) displays no correlation, b) a weak positive correlation, c) a strong positive correlation, and d) a strong negative correlation.

The ‘weak’ and ‘strong’ qualifiers in Figure 10.2 are subjective assessments of the linear trend. A more objective evaluation is possible by the fact that

----- r√ n − 2 t = √------2 1 − r

(10.6)

follows a t-distribution with n − 2 degrees of freedom. Thus we can formally test whether the apparent correlation between the ⁸⁷Sr/⁸⁶Sr- and ⁸⁷Rb/⁸⁶Sr-ratios in Figure 10.1 is statistically significant. Plugging the data of Table 10.1 into Equation 10.5 yields a correlation coefficient of r = 0.985. This seems high enough but let’s subject it to a hypothesis test anyway:

1.

Formulate two hypotheses:

H₀ (null hypothesis)

H_A (alternative hypothesis):

ρ = 0

ρ≠0

2.

Plugging r = 0.985 into Equation 10.6 yields a test statistic of

0.985√8--−-2 t = √-----------= 13.98 1 − 0.9852

3.

Tabulating some key quantiles for t under H₀:

t	-3.14	-2.45	-1.94	-1.44	-0.72	0	0.72	1.44	1.94	2.45	3.14	13.98
P(t < T)	0.01	0.025	0.05	0.1	0.25	0.5	0.75	0.9	0.95	0.975	0.99	0.9999958
P(t > T)	0.99	0.975	0.95	0.9	0.75	0.5	0.25	0.1	0.05	0.025	0.010	0.0000042

where the observed value is marked in italics.

4.

We will use a confidence level α = 0.05.

5.

Marking the two-sided rejection region in bold:

t	-3.14	-2.45	-1.94	-1.44	-0.72	0	0.72	1.44	1.94	2.44	3.14	13.98
P(t < T)	0.01	0.025	0.05	0.1	0.25	0.5	0.75	0.9	0.95	0.975	0.99	0.9999958
P(t > T)	0.99	0.975	0.95	0.9	0.75	0.5	0.25	0.1	0.05	0.025	0.010	0.0000042

The rejection region consists of all |t| > 2.40.

6.

The observed value of t = 13.98 clearly falls inside the rejection region, so H₀ is rejected.

Figure 10.3: a) PDF and b) CDF of a t-distribution with 6 degrees of freedom. The two-sided rejection region (for α = 0.05) is marked in black. The vertical dashed line marks the observed value of t = 13.98 and plots inside the rejection region. Therefore the test rejects the null hypothesis that ρ = 0, leading to the conclusion that the data are significantly correlated.

Now that we have convinced ourselves that the correlation is significant, we can try to fit a line through the data. In order to find the best possible fit, we first need to define what we need with ‘best’. There are many ways to do this, but the most common of these is the method of least squares.

As the name suggests, the least squares criterion quantifies the misfit of the line through the data as the sum of the squared residuals:

∑n ∑n ss ≡ 𝜖2i = (yi − β0 − β1xi)2 i=1 i=1

(10.7)

Let us take a wild guess and assume that β₀ = 0.6 and β₁ = 0.03:

i	1	2	3	4	5	6	7	8

x_i	2.90	7.14	9.10	3.41	1.91	7.15	5.92	8.28
y_i	0.745	0.803	0.823	0.737	0.720	0.793	0.789	0.807

β₀ + β₁x_i	0.687	0.8142	0.873	0.7023	0.6573	0.8145	0.7776	0.8484
𝜖_i	0.058	-0.0112	-0.05	0.0347	0.0627	-0.0215	0.0114	-0.0414

Table 10.2: The same data as Table 10.1. y_i are the observed and β₀ + β₁x_i the fitted values of the dependent variable. The residuals 𝜖_i are the differences between these two sets of numbers.

Plugging the 𝜖_i-values into Equation 10.7 yields a sum of squares ss = 0.013. Evaluating some other values:

Figure 10.4: Three guesses for the intercept (β₀) and slope (β₁) of the Rb–Sr isochron data of Figure 10.1. Dashed lines mark the residuals. The lines in panels a) and b) are too steep and too shallow, respectively. Consequently, the sums of their squared residuals (ss) are relatively large. Panel c) shows a better fit and a smaller sum of squares.

Instead of minimising ss by iterating over all possible values of β₀ and β₁, we can also find the optimal solution directly by taking the partial derivatives of Equation 10.7 w.r.t. β₀ and β₁ and setting them to zero:

{ ∑ ∂∂sβs0 = 2 (yi − β0 − β1xi) = 0 ∂ss = 2∑ (yi − β0 − β1xi)xi = 0 ∂β1

(10.8)

By solving this system of equations, it can be shown that:

( (∑ ∑ ∑ ∑ ) ( ∑ ∑ 2) { βˆ0 = x2i yi − xi xiyi( ∕ n x2i − ( x)i) ( βˆ = (n ∑ x y − ∑ x ∑ y )∕ n ∑ x2 − (∑ x )2 1 i i i i i i

(10.9)

For the Rb–Sr data:

i	1	2	3	4	5	6	7	8	∑ _i=1⁸

x_i	2.90	7.14	9.10	3.41	1.91	7.15	5.92	8.28	45.81
y_i	0.745	0.803	0.823	0.737	0.720	0.793	0.789	0.807	6.217

x_i²	8.41	50.98	82.81	11.63	3.648	51.12	35.05	68.56	312.2
x_iy_i	2.160	5.733	7.489	2.513	1.375	5.670	4.671	6.682	36.29

so that

{ ( ) βˆ0 = (312.2× 6.217 − 45.81 × 36.29)∕ 8 × 312.2− 45.812 = 0.698 βˆ = (8 × 36.29 − 45.81× 6.217)∕ (8× 312.2− 45.812) = 0.0138 1

Least squares regression is one way to determine the fit parameters of a straight line. The method of maximum likelihood (Sections 5.1, 6.2, 7.4 and 8.4) is another. This section will show that these two approaches produce exactly the same results. Recall the general formulation of the linear regression problem (Equation 10.3):

yi = β0 + β1 xi + 𝜖i

Let us assume that the residuals (𝜖_i) are normally distributed with zero mean:

[ ] --1--- -𝜖2i- f(𝜖i|β0,β1,σ) = σ√2-π-exp − 2σ2 , where 𝜖i = β0 + β1xi − yi

(10.10)

In that case, we can estimate β₀ and β₁ by maximising the (log-)likelihood, in exactly the same fashion as the maximum likelihood estimation algorithm of Section 7.4 and Equation 7.8

n [ ] ∏ -√1--- -𝜖2i- ℒ(β0,β1,σ|{x1,y1},...,{xn,yn}) = σ 2π exp − 2σ2 i=1

(10.11)

Minimising Equation 10.7 is equivalent to maximising Equation 10.11:

[∏n ] [∑n ] [ ( ) ∑n ( 2 ) ] max ℒ =max ln ℒ = max ln -√1--- − 𝜖i-- β0,β1 i=1 β0,β1 i=1 β0,β1 σ 2π i=1 2σ2 [ ∑n ] [ ∑n ] =max − 𝜖2 = min 𝜖2 = min (ss) β0,β1 i=1 i β0,β1 i=1 i β0,β1

(10.12)

This means that the least squares and maximum likelihood methods produce exactly the same result, and that linear regression works best when the residuals are normally distributed. Error propagation of the estimated regression coefficients ( ˆ
β ₀, ˆ
β ₁) proceeds as in Section 8.4:

⌊ ⌋ s[ˆβ0]2 s[ˆβ0, ˆβ1] Σ ˆβ = ⌈s[βˆ0,βˆ1 ] s[ˆβ1]2 ⌉ = − ℋ −1 [ 2 2 ]− 1 ∂∂ℒβℒ2- ∂∂β0ℒ∂ℒβ1- = − -∂2ℒ0ℒ-- ∂2ℒℒ ∂β0∂β1 ∂β21 ( 1 [ n ∑n x] )−1 = − --2 ∑n ∑ni=1 i2 σ i=1xi[∑ i=1x i ] -----σ2------- ni=1 x2i∕n − ¯x = ∑ni=1(x2i − ¯x2) − ¯x 1

(10.13)

Equation 10.13 uses the standard deviation of the data around the linear fit (σ). This parameter is generally unknown and must be estimated from the data:

-------------------- ┌│ n 2 ˆσ = │∘ ∑ (ˆβ0-+-ˆβ1xi −-yi) i=1 n− 2

(10.14)

Equation 10.14 looks similar to the usual definition of the standard deviation (Equation 7.12) except that 2 degrees of freedom have been subtracted instead of 1, because two parameters were estimated ( ₀ and ₁) instead of one (x).

Equation 10.13 can be used to construct a confidence interval for the regression coefficients. The procedure for doing this is analogous to the construction of confidence intervals around the mean (Equation 9.6):

{ } { } β0 ∈ βˆ0 ± tdf,α∕2 s[ˆβ0] and β1 ∈ ˆβ1 ± tdf,α∕2 s[ˆβ1]

(10.15)

where df = n− 2. We can then construct a confidence envelope around the best fit line by observing that y = β₀ + β₁x matches the equation for a sum, and applying the corresponding error propagation formula (Equation 8.10):

{ } ∘ --------------------------- y ∈ ˆβ0 + βˆ1x ± tdf,α∕2 s[ˆβ0]2 + s[ˆβ1]2x2 + 2s[ˆβ0, ˆβ1]x

(10.16)

Applying Equation 10.16 to the Rb–Sr data and evaluating y for the full range of x-values:

Figure 10.5: The grey area represents a 95% confidence envelope for the least squares regression of the Rb–Sr data. The curvature of this envelope reflects the increased uncertainty that is caused by extrapolating the data. The width of the confidence envelope is the smallest near the average of the measurements (x, ȳ) and increases indefinitely beyond that.

Recall that the standard error of the mean decreases with increasing sample size (Section 8.3). Exactly the same phenomenon applies to the standard errors of the regression parameters (Equation 10.13) and, hence, to the width of the confidence envelope. In order to assess the likely range of future outcomes, the confidence envelopes produced by Equation 10.16 need to be enlarged to produce a prediction interval:

{ ∘ --------------------------------} ˆ ˆ 2 ˆ 2 ˆ 2 2 ˆ ˆ y ∈ β0 + β1x ± tdf,α∕2 ˆσ + s[β0 ]+ s[β1] x + 2s[β0,β1]x

(10.17)

where is given by Equation 10.14. Comparing confidence intervals and prediction intervals for different sample sizes of three synthetic datasets:

Figure 10.6: Confidence envelopes (grey) and prediction intervals (dashed lines) for three synthetic datasets of increasing size. Whereas the width of the confidence interval approaches zero for large datasets, the width of the prediction interval only decreases slightly.

Three mistakes are commonly made in regression analysis:

1.

p-hacking: This problem was already discussed in section 6.4 but bears repeating in the context of linear regression. When presented with a multivariate dataset (e.g. the concentration of n chemical species in m samples), it is common practice in exploratory data analysis to plot all pairs of variables against each other in an n × m grid of scatter plots. When doing this it is inevitable that some of these pairs exhibit a statistically significant correlation.

Figure 10.7: 18 scatter plots of bivariate random uniform data, with indication of the coefficient of determination (r²) and the p-value for correlation. The one ‘significant’ result (p-value = 0.00063) is a Type-I error.

2.

outliers: High values for the correlation coefficient (r) and coefficient of determination (r²) are often taken as evidence for correlation. However, these statistics are sensitive to extreme values. A single outlier can have a disproportionally large effect, suggesting a significant correlation where there is none.

Figure 10.8: A synthetic dataset of ten clustered points around {x = 1,y = 1} plus one outlier at {x = 5,y = 5}. The cluster consists of the same values as the first panel of Figure 10.7. But whereas the latter dataset was characterised by a coefficient of determination of almost zero (r² = 0.0014), the new dataset has a coefficient of determination that is close to one (r² = 0.9). Such is the disproportionate effect of the additional data point.

3.

Spurious correlation: Let x, y and z be three independent random datasets of 50 numbers. Then the bivariate scatter plots of y vs. x, z vs. x and z vs. y do not exhibit any discernable correlation. However, plotting y∕z vs. x∕z, or z vs. x∕z produces strong correlations:

Figure 10.9: a–c) Three random datasets of 50 points each, drawn from normal distributions with means μ_x = μ_y = μ_z = 100 and standard deviations σ_x = σ_y = 1 and σ_z = 10, respectively. The three datasets are independent, so ρ_x,y = ρ_x,z = ρ_y,z = 0. However, the ratio of x∕z is strongly correlated with d) y∕z, and with e) z. This correlation is entirely spurious and has no scientific value.

The data clouds in panels d) and e) of Figure 10.9 are strongly correlated even though x, y, z are uncorrelated. This correlation arises because z appears in both axes. This phenomenon was first described by Karl Pearson¹ of the eponymous correlation coefficient (r, Equation 10.5), who also showed that it is possible to predict the expected correlation coefficient (or ‘null correlation’) associated with the spurious effect. In the general case of four samples w, x, y and z, the ratio correlation of w∕y and x∕z is given by:

[-σw-][σx] [σw-][σz] [ σx][σy] [σy][σz] --ρw,x-μw---μx--−-ρw,z-μw---μz--−-ρx,y--μx--μy--+-ρy,z--μy--μz---- ρwy,xz ≈ ∘ [---]2--[---]2-------[---][--]∘ [---]2--[--]2-------[---][--]- μσww- + μσyy − 2ρw,y σμww- σμyy σxμx + σμzz − 2ρx,z σμxx σμzz

(10.18)

Substituting z for y and y for w in Equation 10.18, and considering that ρ_z,z = 1, we can show that:

[ ] [ ] [ ][ ] [ ][ ] [ ] ρy,x σy σx − ρy,z σy σz − ρx,z σx- σz + σz 2 ρ y,x ≈ ∘---------μy---μx--------μy---μz∘---------μx--μz-----μz--------- zz [σy]2 [ σz]2 [σy][ σz] [σx]2 [ σz]2 [σx][ σz] μy + μz − 2ρy,z μy μz μx + μz − 2ρx,z μx μz

(10.19)

Similarly, substituting w for z, and setting y = 1 and σ_y = 0 in Equation 10.18:

[ ] [ ] [ ] ρ σz σx − σz 2 ρ x ≈ ----∘---z,x-μz---μx-----μz---------- z,z [σ ] [σ ]2 [σ ]2 [ σ ][σ ] μzz μxx + μzz − 2ρx,z μxx μzz

(10.20)

For the example of Figure 10.9, where ρ_x,y = ρ_x,z = ρ_y,z = 0, the expected null correlation for x∕z and y∕z is obtained from Equation 10.19:

[σz]2 [ ] --------------μz-------------- -----11000-2----- ρ yz,xz ≈ ∘ [--]2--[---]2∘ [--]2--[--]2-= [-1]2 [ 10]2 = 0.990 σμy + σzμ- σμx + σμz 100 + 100 y z x z

(10.21)

which is in excellent agreement with the observed correlation coefficient of Figure 10.9d. Similarly, the null correlation for x∕z and z is obtained from Equation 10.20:

[ ]2 ------−--11000--------- ρz,xz ≈ [ 10]∘ [ 1-]2--[ 10]2-= − 0.995 100 100 + 100

which explains the strong negative correlation in Figure 10.9e. It is important to be aware of the ratio correlation phenomenon because scatter plots of ratios are commonplace in geology. Two examples are the ⁸⁷Rb/⁸⁶Sr – ⁸⁷Sr/⁸⁶Sr isochron plot from the beginning of this chapter, and the Zr–Zr/Y tectonic discrimination diagram of igneous geochemistry.

The least squares regression algorithm of Sections 10.2 and 10.3 assumed that the residuals (𝜖_i in Equation 10.3) followed a normal distribution with zero mean and equal standard deviation. Using statistical jargon, we assumed that the data are homoscedastic. However, real datasets are often heteroscedastic, i.e. their standard deviation varies between aliquots.

Consider, for example, the case of (Rb–Sr) isochron regression. It is based on ⁸⁷Rb/⁸⁶Sr and ⁸⁷Sr/⁸⁶Sr isotope measurements that are obtained by mass spectrometry. The uncertainties of these measurements are known and may vary significantly between aliquots. Due to the spurious correlation issue of Section 10.4.3, these uncertainties may also be correlated. To illustrate the effect of these complications, consider a simple three-point example:

i	X	Y	x	s[x]	y	s[y]	s[x,y]

1	10	20	10.5	1	20.5	1	0.9
2	20	30	19.5	1	29.9	1	0.9
3	30	40	25.1	3	45.2	5	-13.5

Table 10.3: Synthetic three-aliquot dataset. X and Y the true values; x and y are three measurements, s[x] and s[y] their respective standard errors, and s[x,y] their covariance. The uncertainties differ between the three samples, which are therefore heteroscedastic.

The uncertainties can be visualised as error ellipses (see Figures 7.10 and 7.13):

Figure 10.10: Synthetic data of Table 10.3. The white squares are the true population values (X and Y ) of three aliquots. The black squares are three measurements (x and y). The ellipses represent 95% confidence regions for bivariate normal distributions with means x and y, and (co)variances s[x]², s[y]² and s[x,y]. The true values fall on a line with intercept β₀ = 10 and slope β₁ = 1 (dotted line). The unweighted least squares fit (dashed line) has an intercept of β₀ = 1.9 and slope β₁ = 1.63. This poor result is entirely due to the third data point, whose disproportionally large uncertainties are not properly accounted for by the ordinary least squares regression algorithm.

To account for the unequal uncertainties of the three aliquots, we need to replace the likelihood function of the unweighted regression algorithm (Equation 10.11):

n∏ 1 [ (y − β − β x)2] ℒ(β0,β1,σ|x1,y1,...,xn,yn) = -√----exp − --i---0-2--1-i-- i=1 σ 2π 2σ

with a different function that is based on the bivariate normal distribution (Equation 7.7):

ℒw(β0,β1,x1,...,xn|x1,y1,...,xn,yn) = n ( ⌊ x − x ⌋ T ⌊ x − x ⌋) ∏ ---1---- | 1-⌈ i i ⌉ −1⌈ i i ⌉| 2π∘ |Σ-| exp (− 2 yi − β0 − β1xi Σ i yi − β0 − β1xi ) i=1 i

(10.22)

where x₁,…,x_n are the fitted values for the independent variable x. This is the most likely value for X given the measurement x and the analytical uncertainties. The x-values must be estimated from the data along with the intercept (β₀) and slope (β₁) of the regression model. Equation 10.22 can be maximised numerically.

Figure 10.11: a) the same as Figure 10.10, but with the weighted regression result added as a solid line. Its intercept is β₀ = 9.4 and its slope is β₁ = 1.05. These values are much closer to the true values (dotted line) than the ordinary least squares solution (dashed line) is. b) Zooming into aliquot 3 shows the true value of the independent variable (X₃), its measured value (x₃), and the fitted value x₃.

The United States Geological Survey (USGS) hosts a catalog¹ of global earthquakes. Here is a histogram of the 20,000 most recent earthquakes with magnitude 4.5 or greater, which was downloaded when this text was written.

Figure 11.1: Histogram for 20,000 recent earthquakes of magnitude ≥ 4.5 from the USGS earthquake catalog.

This distribution is positively skewed (section 3.3), and therefore has a similar appearance as the clast size distribution of Figure 2.11. Recall that the skewness of the Ca/Mg distribution caused the mean and the median to be very different (Figure 3.2). This problem was solved by a simple logarithmic transformation (Figure 2.12). After taking logs, the skewed distribution of clast sizes became symmetric (section 2.4). In fact, we could apply a χ²- or Kolmogorov-Smirnov test to show that the log of the clast sizes follows a normal distribution. This type of skewed distribution, which becomes normal after taking logarithms, is called a lognormal distribution. Let’s see if this procedure also works for the earthquake data:

Figure 11.2: The histogram of the logarithm of the 20,000 earthquakes in Figure 11.1 is still positively skewed.

The logarithmic transformation has not reduced the skewness of the data. This means that the earthquake magnitudes do not follow a lognormal distribution. And it also means that logarithms cannot fix the discrepancy between the mean and the median. It turns out that the distribution of earthquake magnitudes does not have a well-defined mean or median. This phenomenon is very common in geology.

In our attempt to remove the skewness of the earthquake magnitude data, we ignored the fact that earthquake magnitudes are already a logarithmic quantity, which can take negative values. In fact, we will see that most seismic events have negative magnitudes! So instead of taking the logarithm of the earthquake magnitudes, let us take the logarithm of the frequencies:

Figure 11.3: Bivariate scatter plot of y = log ₁₀[N∕N_∘] against earthquake magnitude, where N is the number of earthquakes exceeding a given magnitude and N_∘ is the total number of registered earthquakes, which is 20,000 for the dataset of Figure 11.1.

The transformed data plot on a straight line of the form

log10[N ∕N∘] = a+ b magnitude

(11.1)

where a = 5.09 and b = −1.15. Equation 11.1 is called the Gutenberg-Richter Law and is a mainstay of seismology. It is tremendously useful for predictive purposes. Given a record of small seismic events, the Gutenberg-Richter Law allows accurate predictions of seismic hazards posed by larger events. For example, given that the 20,000 earthquakes of Figure 11.1 span a continuous period of 1031 days, we can use Equation 11.1 to predict the likelihood that an earthquake of magnitude 9.0 or greater will happen within the next year:

1.: The expected number of earthquakes per year is $20000-earthquakes × 365 days = 7080 earthquakes 1031 days$
2.: Plugging magnitude 9.0 into Equation 11.1: $log [N ∕N ] = 5.09− 1.15 × 9.0 = − 5.26 10 ∘$
3.: Rearranging for N: $N = 7080 × 10−5.26 = 0.039$

In other words, there is a 3.9% chance that at least one magnitude earthquake 9.0 or greater earthquake happens per year. This is equivalent to 1 such event occurring per 25 years, or 4 events occurring per century. If we look at the last magnitude ≥ 9.0 earthquakes of the past century:

location	Japan	Sumatra	Alaska	Chile	Kamchatka
year	2011	2004	1964	1960	1952
magnitude	9.1	9.2	9.2	9.5	9.0

then that amounts to 5 events. This seems to confirm that the short term earthquake record can indeed be used to make long term predictions.

Power-law relationships are found in many other geoscience fields. For example, in hydrology:

Figure 11.4: A map of central Finland (axis labels mark latitude and longitude), with water marked in black and land marked in white. Finland is also known as “the land of a thousand lakes”. But in reality there are far more than 1000 lakes in Finland. The small area shown in this figure already contains 2327 of them. Most of these lakes are small, but there are also a few big ones that cover areas of more than 1000 km².

The size-frequency relationship of Finnish lakes looks very similar to the Gutenberg-Richter Law of earthquakes:

Figure 11.5: Plotting the number of lakes exceeding a certain size against that size on a log-log scale yields a linear array of points similar to the Gutenberg-Richter Law of Figure 11.3. Extrapolating this trend towards the left would reveal that there are millions of puddles in Finland.

Other examples of similar power law relationships in the Earth Sciences include the size-frequency distributions of faults and joints, clasts in glacial till, oil fields and ore deposits, rivers and their tributaries, mountains and floods, to name just a few.

In a famous paper² , Benoit Mandelbrot showed that it is impossible to unequivocally pin down the circumference of Britain. The answer depends on the length of the measuring rod:

Figure 11.6: Five attempts to measure the length of Britain’s coastline. Different results are obtained depending on the length of the measuring rod used for the measurements. The shorter the yardstick (shown as error bars), the longer the estimate.

However this seemingly complex phenomenon can be fully captured by a simple power law equation. Plotting the length of the coastline against the size of the measuring rod on a log-log scale:

Figure 11.7: Setting out the length of the British coastline against the length of the measuring rod on a log-log scale produces a linear trend with a slope of −0.253. This line can be extrapolated to lower values to estimate the length that would be measured with even smaller measuring rods. For example, if we were to measure the British coast with a 30 cm long ruler, then this would produce a result of exp(9.18 − 0.253ln[3 × 10⁻⁴]) = 42,060 km!

An alternative (and equivalent) way to plot the data is to divide the measured length of the coastline by the length of the measuring rod to obtain the number of linear segments that approximate the British coast. Plotting this number against the length of the measuring rod on a log-log plot also produces a straight line:

Figure 11.8: The same data as Figure 11.7 but plotting the number of polygonal segments on the y-axis instead of the length of those segments. Note how the slope of the best fit line equals the slope of Figure 11.7 minus one.

Box counting is another way to obtain this result. Instead of approximating the British coast with a set of line segments, this method covers the coast with a set of boxes. Varying the box size and counting the number of boxes needed to cover the entire coast:

Figure 11.9: Box counting of the British coast using (from left to right) a 16 × 16, 32 × 32, 64×64, and 128×128 grid. Black squares overlap with the coastline, white squares do not. The legends in the upper right corner of each subpanel specify the number of black squares relative to the total number of squares.

Instead of plotting the number of line segments against their size, we plot the number of boxes against their size (width). This produces a linear trend that has a different intercept than Figure 11.8, but a similar slope:

Figure 11.10: Plotting the number of boxes against their size (i.e. width in pixels) on a log-log diagram yields a linear array with a slope of −1.26. This is similar to the value obtained by the polygonal line segment method of Figure 11.8.

The box counting method is more flexible, and therefore more widely used, than the line segment method. For example, applying the same technique to a river network:

Figure 11.11: Box counting of the river network on the island of Corsica (France). Black boxes overlap with rivers, white boxes do not. Legends are as in Figure 11.9.

and visualising the results on a log-log plot:

Figure 11.12: Log-log plot of the box counting results of Figure 11.11. The power law appears to be a law of nature.

This section will introduce some simple geometric patterns that match the statistical properties of geological patterns such as the coastlines, lakes and river networks of section 11.2. The first of these synthetic patterns is the Koch curve, which is created by a simple recursive algorithm.

Figure 11.13: A 1^st order Koch curve is constructed by (1) dividing a straight line segment into three segments of equal length; (2) drawing an equilateral triangle that has the middle segment from step 1 as its base and points outward; and (3) removing the line segment that is the base of the triangle from step 2.

Figure 11.14: A 2^nd order Koch curve is derived from a 1^st order Koch curve by replacing each straight line segment in the 1^st order curve with a scaled down version of that 1^st order curve.

Figure 11.15: A 3^rd order Koch curve is derived from a 2^nd order Koch curve by replacing each straight line segment in the 2^nd order curve with a scaled down version of the 1^st order Koch curve.

Figure 11.16: This is a 6^th order Koch curve, which is generated by replacing each straight line segment in a 5^th order curve with a scaled down version of the 1^st order curve.

This procedure can be repeated ad infinitum, producing an intricate curve that is endlessly detailed. It is similar in many ways to the British coast line, which also produces ever more detail as we zoom into the map. Measuring the length of a Koch curve presents the same difficulty as measuring the length of the British coastline: the answer depends on the length of the measuring rod. Applying the box counting method to the Koch curve:

Figure 11.17: Box counting of the 6^th order Koch curve. The larger the boxes, the fewer of them are needed to cover the entire curve. The recursive order of the Koch curve can be increased indefinitely, and so does the number of small boxes needed to cover them.

Plotting the results on a frequency-magnitude log-log plot:

Figure 11.18: Log-log plot setting out the number of boxes needed to cover the Koch curve of Figure 11.16 against the size of those boxes. The best fitting line has a slope of −1.26, which is similar to the slope of the box-counting results for the British coastline (Figure 11.10). Box sizes are expressed in arbitrary units (pixel widths).

The similarity of Figures 11.10 and 11.18 suggest that the Koch curve serves as an artificial coastline. The Koch curve is just one of many artificial fractals. Here is another one:

Figure 11.19: The Sierpinski carpet is generated using a recursive algorithm that is built on a grid of eight black squares surrounding a white square. Each level of recursion replaces each black square with the same pattern. From left to right, this figure shows the first four levels of recursion for this algorithm. The end result is an arrangement of small and large holes that shares many characteristics with the size distribution of Finnish lakes shown in Figure 11.4.

One thing that the Koch curve and Sierpinski carpet have in common is their self-similarity. Their level of complexity remains the same regardless of scale. Whether one zooms in or out of the picture, the complexity remains the same.

Because it consists of boxes, the Sierpinski carpet is ideally suited for box counting. We can either cover the black areas with boxes, or do the same with the white areas. Here is the resulting log-log plot:

Figure 11.20: Log-log plot of the box counting results for the Sierpinski carpet. Note how the slope of the best fitting line (−1.89) is steeper than that of the Koch curve (Figure 11.18). It is similar to the slope that would be obtained by box-counting the Finnish lakes of Figure 11.4, which is similar to x − 1 where x is the slope of the size-frequency plot of the Finnish lakes (Figure 11.5).

Mathematically, the linear fit of the log-log plots can written as

log[number of boxes] = C − D × log [size of the boxes]

(11.2)

The slope magnitude of the line (D, which is independent of the logarithmic base) is also known as the fractal dimension of the pattern. It is called a dimension because it has all the characteristics of the spatial dimensions, which can be used to chacterise a point (D = 0), a line (D = 1), a plane (D = 2) or a cube (D = 3). But whereas these traditional notions of dimension are tied to integer numbers, the dimensionality of fractals is quantified by non-integer fractions. Formalising the box-counting procedure leads to an alternative definition of the fractal dimension:

log-N(𝜖) D = − li𝜖→m0 log𝜖

(11.3)

where 𝜖 is the normalised³ size of the box and N(𝜖) is the number of boxes needed to cover the feature of interest. Equation 11.3 can be used to obtain exact analytical expressions for deterministic fractals. Thus, it can be shown that the fractal dimension of a Koch curve is D = log ₃(4) = 1.262. This is a number between D = 1 (a line) and D = 2 (a plane). It reflects the intricate curvature of the Koch curve, which partly ‘fills’ the 2-dimensional plane. The Sierpinski carpet has a fractal dimension of D = log ₃(8) = 1.893. This number, too, falls between the dimensionalities of a line and a plane. But it is more similar to a plane than it is to a line.

Other shapes exist that have fractal dimensions between 0 < D < 1 or between 2 < D < 3. Consider, for example, the Cantor set:

Figure 11.21: The Cantor set is generated using a recursive algorithm that is built on a line segment whose middle third is removed. Each level of recursion replaces each black line by the same pattern. From top to bottom, this figure shows the first five levels of recursion for this algorithm.

Plotting the size distribution of the Cantor set on a log-log scale:

Figure 11.22: The y-axis shows the number of linear segments in the Cantor set that exceed the length shown on the x-axis. The values form a power law with a fractal dimension of D = ln[2]∕ln[3] = 0.631. With a fractal dimension between zero and one, the Cantor set falls somewhere between a point and a line.

Consider the following experimental setup:

Figure 11.23: A pendulum is swinging above three magnets. The force (F_m) exerted on the pendulum scales with the square of its bob’s distance (|d|) to the magnets (F_m(i) ∝ 1∕|d(i)|², where 1 ≤ i ≤ 3 marks each of the magnets). The pendulum slows down due to friction (F_f(i) ∝ v where v is the velocity of the bob) and eventually comes to a standstill above one of the magnets. On this figure it has done so above the first magnet.

Despite the simplicity of this setup, it can lead to some complex behaviour.

Figure 11.24: This figure shows the three magnet configuration of Figure 11.23 in map view. The black line marks the trajectory of the pendulum after it was pushed southward from a position towards the northwest of the three magnets. After describing a circular motion, the bob of the pendulum accelerates towards the second magnet, gets deflected by it, and slows down. It then heads towards the third magnet and the first magnet before returning to the second magnet and coming to a standstill there.

Moving the initial position of the bob slightly to the south of the previous position results in a different outcome:

Figure 11.25: The experimental setup shown in this figure is nearly identical to that of Figure 11.24. The only difference is a slight offset of the initial position. The first stage of the resulting trajectory is nearly identical to that of Figure 11.24. The bob makes a circular motion towards the second magnet and decelerates. But after passing the second magnet, its course diverges from the first experiment. It moves towards the first magnet, to the third magnet and then back to the first magnet before coming to a standstill. Thus, the slight difference in initial position has produced a completely different end result.

We can repeat the experiment for any other initial position. Evaluating the outcomes along a 512 × 512 grid of initial values:

Figure 11.26: This intricate picture paints the initial positions of the magnetic pendulum experiment according to its outcomes. White, grey and black pixels in this 512 × 512 image mark initial positions that resulted in a final position at the first, second and third magnet, respectively. The resulting pattern is simple in the immediate vicinity of the magnets, but complex further away. It has all the characteristics of a fractal, exhibiting the same level of complexity regardless of scale. The pattern is determinisitic in the sense that the same grid of initial conditions produces exactly the same pattern. But it is chaotic because even tiny changes in the initial position or velocity may produce completely different patterns.

The strong sensitivity of the outcome on the initial conditions of the magnetic pendulum experiment is the hallmark of deterministic chaos. It is the phenomenon whereby “the present determines the future, but the approximate present does not approximately determine the future”.

The magnetic pendulum is just one example of a simple system of coupled equations that produces complex outcomes. It is a geologically relevant example, because the gravitational interaction between the planets and moons in our solar system produces similar chaotic behaviour. The interplay between the multitude of gravitational fields in our solar system is responsible for the ejection of meteors from the asteroid belt, which have been linked to some of the largest mass extinctions in the history of life. Gravitational interactions also destabilise the orbital parameters of planets such as Mars. The obliquity of Mars’ spin axis is chaotic. Its evolution can be predicted thousands of years into the future, but becomes unpredictable over million year timescales. Rapid changes in the obliquity of Mars have caused its polar ice caps to shift over time.

Chaos theory originates from the work of atmospheric scientist Edward Lorenz. Lorenz formulated a simplified mathematical model for atmospheric convection, based on three nonlinear equations. Like the three magnets of the pendulum example, the interactions between the three Lorenz equations produce outcomes that were extremely sensitive to the initial conditions. Lorenz called this the butterfly effect.

In the magnetic pendulum example there were just three outcomes, but in the real world there are countless numbers of them. The butterfly effect raises the theoretical possibility that the flap of a butterfly’s wings in Brazil may change the initial conditions of the global atmosphere and thereby cause a tornado in Texas. Of course the vast majority of butterfly wingflaps won’t have this outcome, but some of them may. The outcome is impossible to predict far in advance. This phenomenon limits the ability of meteorologists to forecast the weather more than 10 or so days in advance.

The simple plots of chapter 2 are useful for visualising simple datasets of one or two dimensions. However many Earth Science datasets span multiple dimensions. For example, a geochemist may have measured the concentration of 20 elements in 50 samples; or a palaeoecologist may have counted the relative abundances of 15 species at 30 sites. This chapter will introduce some tools that can help us see some structure in such ‘big’ datasets without any prior knowledge of clusters, groupings or trends. This is called unsupervised learning, as opposed to the supervised learning algorithms that will be introduced in Chapter 13.

Principal Component Analysis (PCA) is an exploratory data analysis method that takes a high dimensional dataset as input and produces a lower (typically two-) dimensional ‘projection’ as output. PCA is closely related to Multidimensional Scaling (MDS), which is introduced in Section 12.2. To explain the mathematical mechanism behind PCA, let us begin with a simple toy example. Consider the following bivariate (a and b) dataset of three (1, 2 and 3) samples:

a b ⌊ ⌋ 1 0 9 X = 2⌈ 17 0⌉ 3 19 6

(12.1)

Displaying these three points on a scatter diagram reveals that two of the three samples plot close together while the third one plots further away:

Figure 12.1: Three bivariate data points that can be visualised on a two-dimensional scatter plot. There is no need to use unsupervised learning in this case. But we will use this simple dataset as a toy example to understand how Principal Component Analysis works.

Imagine that you live in a one-dimensional world and cannot see the spatial distribution of the three points represented by X. Then PCA allows us to visualise the two-dimensional data as a one-dimensional array of numbers. PCA involves the following steps

1.

Decompose X into a mean vector M and a centred data matrix Y :

⌊ ⌋ ⌊ ⌋ 1 [ ] − 12 4 X = 13,1M + Y = ⌈ 1 ⌉ 12 5 + ⌈ 5 − 5⌉ 1 7 1

(12.2)

2.

Compute the covariance matrix Σ of X:

[ ] 109 − 33 Σ = − 33 21

(12.3)

3.

Subject Σ to an eigendecomposition¹ , i.e. rewrite it as the product of three matrices:

[ ] [ ][ ] −1 0.95 0.32 120 0 0.95 − 0.32 Σ = Q Λ Q = − 0.32 0.95 0 10 0.32 0.95

(12.4)

where the diagonal elements of Λ are called the eigenvalues, and the columns of Q are the eigenvectors, which are also referred to as the loadings in the context of PCA.

4.

Define the principal components (P) or scores as:

⌊ ⌋ − 12.6 0 P = Y Q = ⌈ 6.3 − 3.2 ⌉ 6.3 3.2

(12.5)

The first column of Q defines the direction of maximum variance of the bivariate dataset X (direction of the line labelled PC1 in Figure 12.2). It separates sample 1 from samples 2 and 3. Projecting the data onto this direction yields a new set of numbers, corresponding to the first column of P (PC1 in Figure 12.3). The variance of these new coordinates is 120, i.e. the first eigenvalue of Λ. The second column of Q defines a direction that is orthogonal (perpendicular) to that defined by the first column (direction of the line labelled PC2 in Figure 12.2). This direction accounts for the remaining variance of the dataset, separating samples 2 and 3. Projecting the three samples on the second principal direction yields a second set of principal components, corresponding to the second column of P (PC2 in Figure 12.2). The variance of these numbers is 10, i.e. the second eigenvalue of Λ.

Figure 12.2: PCA decomposition of Figure 12.1. The data X are shown as numbers, M as a square, and 1_2,1M ± Q √ --
Λ

as a cross. The first principal direction (running from the upper left to the lower right) has been stretched by a factor of ------
∘ Λ2 ∕Λ1

= 3.5 w.r.t the second principal direction, which runs perpendicular to it.

The first column of P yields the desired one-dimensional simplification of the data:

Figure 12.3: Projection of the three data points on the two principal directions yields two principal components (P in Equation 12.5), representing one-dimensional representations of the two-dimensional data.

Using the eigenvalues Λ in Equation 12.4, it can be shown that the first principal component (PC1) captures 120∕(120 + 10) = 92% of the variance in the dataset, and the second principal component (PC2) accounts for the remaining 10∕(120 + 10) = 8% of the variance. So relatively little information is lost by discarding PC2. Then PC1 tells us that, to a first approximation, sample 1 is very different from samples 2 and 3, which are identical to each other. PC2 captures the remaining information, which shows that samples 2 and 3 are, in fact, not exactly identical. In the second principal component, sample 1 falls in between samples 2 and 3.

The scores P and loadings Q contain information about the samples (1, 2 and 3) and the variables (a and b), respectively. We can graphically combine all this information in a biplot:

Figure 12.4: A biplot of both principal components along with the loadings of the two variables shown as arrows. The first principal component shows that the most important difference is that between samples 2 and 3 (which are rich in a) and sample 1 (which is poor in a). The second principal component captures the remaining variance, with samples 1 and 3 being richer in b than sample 2. The bottom-left and top-right axes represent the scores and loadings, respectively, after shrinking the former and expanding the latter by a factor of √ ---
nΛ

, where n = 3 is the number of samples.

Although the two-dimensional example is useful for illustrative purposes, the true value of PCA obviously lies in higher dimensional situations. As a second example, let us consider one of R’s built-in datasets:

	Murder	Assault	Rape	UrbanPop

Alabama	13.2	236	21.2	58
Alaska	10.0	263	44.5	48
Arizona	8.1	294	31.0	80
Arkansas	8.8	190	19.5	50
California	9.0	276	40.6	91
Colorado	7.9	204	38.7	78

Wisconsin	2.6	53	10.8	66
Wyoming	6.8	161	15.6	60

Table 12.1: USArrests is a dataset that is built into the R programming environment. It contains crime statistics (in arrests per 100,000 residents) for murder, assault and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas. Thus, USArrests is a four-column table that cannot readily be visualised on a two-dimensional surface.

Applying PCA to Table 12.1 yields four principal components, the first two of which represent 62% and 25% of the total variance, respectively. Visualising the PCA results as a biplot:

Figure 12.5: PCA biplot of American crime statistics. The grey labels mark the different states, whereas the black vectors mark the crimes and the percentage of the population that lives in urban areas. States that have a lot of crime plot on the left hand side of the diagram, states with little crime plot towards the right. Heavily urbanised states plot at the top of the diagram, rural states plot near the bottom.

States that plot close together (such as West Virginia & Vermont, or Arizona & New York, etc.) have similar crime and urbanisation statistics. States that plot at opposite ends of the diagram (such as North Dakota & Florida, or Mississipi & California) have contrasting crime and/or urbanisation statistics. The vectors for murder, assault and rape all point in approximately the same direction, towards the left hand side of the diagram. This tells us that the crimes are all correlated with each other. So states that have a lot of assaults (such as Florida), also have a lot of rape and murder. States that plot on the right hand side of the diagram (such as North Dakota), have low crime statistics in all categories. The vector with the urban population (UrbanPop) is perpendicular to the crime vectors. This tells us that crime and degree of urbanisation are not correlated in the United States.

Multidimensional Scaling (MDS) is a multivariate ordination technique that is similar in many ways to PCA. MDS aims to extract two (or higher) dimensional ‘maps’ from tables of pairwise distances between objects. Let us illustrate the method with the synthetic dataset of Equation 12.1. Using the Euclidean distance (d[i,j], where 1 ≤{i,j}≤ 3):

∘ ------------------------- d[i,j] = (a[i]− a[j])2 + (b[i]− b[j])2

(12.6)

we can populate a 3 × 3 table of distances:

⌊ 1 2 3 ⌋ 1 0 19.2 19.2 d = 2⌈19.2 0 6.3 ⌉ 3 19.2 6.3 0

(12.7)

Table 12.7 has the following properties:

1.: symmetry: d[i,j] = d[j,i]; for example, the distance from sample 1 to 3 equals the distance from sample 3 to 1.
2.: non-negativity: d[i,j] ≥ 0 and d[i,j] = 0 if i = j; for example, the distance between sample 2 and itself is zero.
3.: triangle inequality: d[i,j] + d[j,k] ≥ d[i,k]; for example, the sum of the distance from sample 2 to 1 and the distance from sample 2 to 3 is 19.2 + 6.3 = 25.5 km, which is greater than the distance from sample 1 to 3 (19.2).

Given a table of this form, MDS reconstructs the original set of plot coordinates:

x y ⌊ ⌋ 1 12.6 0 m = 2⌈ − 6.3 − 3.2⌉ 3 − 6.3 3.2

(12.8)

Note that Equation 12.8 is identical to Equation 12.5 apart from the sign of the x-column. So in this example MDS is essentially identical to PCA. The only difference is that MDS takes a table of distances as input, whereas PCA uses the raw data. This makes MDS more flexible than PCA. As a second (non-trivial) example, consider the following table of pairwise distances between European cities:

	Athens	Barcelona	Brussels	…	Rome	Stockholm	Vienna

Athens	0	3313	2963	…	817	3927	1991
Barcelona	3313	0	1326	…	1460	2868	1802
Brussels	2963	1318	0	…	1511	1616	1175

Rome	817	1460	1511	…	0	2707	1209
Stockholm	3927	2868	1616	…	2707	0	2105
Vienna	1991	1802	1175	…	1209	2105	0

Table 12.2: Table of road distances (in km) between European cities. The full dataset comprises 21 cities.

Plugging this table into an MDS algorithm produces the following output:

Figure 12.6: MDS configuration of the European city distance data (Table 12.2). Cities (such as Lyon and Geneva) that are close together in the real world plot close together on the MDS configuration. And cities (such as Stockholm and Athens) that are far apart in the real world plot on opposite ends of the MDS configuration. Whilst the MDS configuration preserves the distances, it does not preserve the orientation of the cities. In this figure, the y-axis has been flipped, and the city locations are rotated ∼15^∘ in a clockwise sense compared to the real map of Europe.

We can measure the distances between the cities on the MDS map (in cm, inches or any other unit) and plot them against the input distances (in km) from Table 12.2. This produces a so-called Shepard plot:

Figure 12.7: The Shepard plot of the European city distances shows a good agreement between the true input distances from Table 12.2 (x-axis) and the fitted distances measured on the MDS configuration of Figure 12.6 (y-axis). There are 21 cities in the dataset, resulting in 21 × 20∕2 = 210 pairwise distances. Hence there are 210 data points on this scatter plot. Most of the scatter of the data around the best fit line is caused by the fact that the input data are road distances, which do not perfectly agree with the straight line map distances.

The scatter of the fitted data relative to the true distances can be quantified as the Stress:

∘ ∑-----∑------------------------ ---ni=1---nj=i+1-(f-(d-[i,j])-−-δ[i,j])2 S = ∑n ∑n δ[i,j]2 i=1 j=i+1

(12.9)

where d[i,j] is the input distance between objects i and j (e.g., the Euclidean distance of Equation 12.7), δ[i,j] is the fitted distance measured on the MDS configuration, and f is a monotonic transformation that essentially maps d[i,j] to the same scale as δ[i,j]. The European city distance dataset is characterised by a Stress values of 7.5%, which corresponds to a ‘good’ fit:

fit	poor	fair	good	excellent	perfect
S	0.2	0.1	0.05	0.025	0

Table 12.3: Rule of thumb for interpreting the goodness of fit of an MDS configuration.

So far we have only discussed the graphical output of MDS, but we have not yet explained how this output is produced. It turns out that there are several ways to do so. In its simplest form (classical MDS), the method consists of a simple sequence of matrix operations that are similar to the PCA algorithm outlined in Equations 12.2 – 12.5. An alternative and more widely used approach (nonmetric MDS) uses an iterative gradient search algorithm to minimise Equation 12.9.

Nonmetric MDS is more flexible than classical MDS because it accommodates unconventional ‘dissimilarity’ measures that do not necessarily have to behave like conventional distances. So instead of physical distances expressed in kilometres or miles, we can also use MDS to interpret differences in chemical concentration, density, degree of correlation, and many other numerical quantities. Consider, for example, the detrital zircon U–Pb geochronology data of Section 9.8. Figure 9.12 showed that we can express the ‘dissimilarity’ between two U–Pb age spectra using the Kolmogorov-Smirnov (K-S) statistic. Repeating this exercise for a collection of 13 samples yields a 13 × 13 matrix of K-S values:

|-| |--| 1 2 3 4 -5- 6 7 8 9 10 L T -Y-| ⌊ ⌋ 1 | 0 14 33 27 18 14 15 22 48 32 42 37 40 | 2 || 14 0 36 33 16 14 15 24 46 32 47 42 43 || 3 || 33 36 0 19 24 44 47 55 17 10 13 12 8 || 4-|| 27 33 19 0 20 38 41 48 28 14 21 17 |16|| -5-|| 18 16 24 20 0 22 24 33 31 20 33 28 -30-|| 6 || 14 14 44 38 22 0 14 24 52 41 52 48 49 || d = 7 || 15 15 47 41 24 14 0 16 51 43 54 49 52 || 8 || 22 24 55 48 33 24 16 0 61 53 63 59 62 || || || 9 | 48 46 17 28 31 52 51 61 0 20 22 18 16 | 10 || 32 32 10 14 20 41 43 53 20 0 17 15 13 || L || 42 47 13 21 33 52 54 63 22 17 0 10 11 || T--⌈ 37 42 12 17 |28-| 48 49 59 18 15 10 0 7 ⌉ -Y-| 40 43 8 16 -30-| 49 52 62 16 13 11 7 0

(12.10)

where the K-S values have been multiplied with 100 to remove the decimal points. Square boxes mark the two samples shown in Figure 9.12. Equation 12.10 is a symmetric matrix containing positive values and a zero diagonal. It fulfils the requirements for MDS analysis:

Figure 12.8: MDS configuration of the detrital zircon U–Pb data. Samples that have similar age distributions (such as ‘Y’ and ‘T’) are characterised by low K-S statistics (e.g., d[Y,T] = 0.07) and plot close together. Samples that have greatly differing age distributions (such as ‘L’ and ‘8’) are characterised by high K-S statistics (e.g., d[L,8] = 0.63) and plot far apart on the MDS map.

K-means clustering is an unsupervised learning algorithm that tries to group data based on their similarity. As the name suggests, the method requires that we pre-specify the number of clusters (k) to be found in the dataset. We will introduce this algorithm using a simple 2-dimensional dataset:

Figure 12.9: A two-dimensional dataset to illustrate the k-means clustering algorithm. There are 150 data points. In this first exercise we will try to classify them into three groups.

The k-means algorithm then proceeds as follows:

1.: Randomly select three data points from the dataset and designate them as the centroids of three clusters:

Figure 12.10: Data points 48, 100 and 130 were randomly selected from the dataset and assigned as the centroids of clusters 1 (circle), 2 (triangle) and 3 (cross).
2.: Reassign each data point to the cluster whose centroid is closest to it:

Figure 12.11: Replace each of the question marks in Figure 12.10 with the symbol (circle, triangle or cross) that is closest to it, using the Euclidean distance of Equation 12.6.
3.: Calculate a new centroid for each cluster:

Figure 12.12: The grey symbols are the same as the black symbols in Figure 12.11. The black symbols are the average {x,y}-positions of all the samples within each cluster. These are different than the previous values shown in Figure 12.10. The new values form the centroids of the clusters that will be used in the next iteration of the k-means algorithm.
4.: Repeat steps 2 and 3 until convergence is achieved:

Figure 12.13: Final classification of the data. Each of the 150 data points has been assigned to one of the three clusters whose centroids are marked as large and bold symbols.

The k-means algorithm can easily be generalised from two to more dimensions because the Euclidean distance (Equation 12.6) can easily be generalised to any number of dimensions.

The k-means algorithm is an unsupervised learning algorithm. This means that it is meant to be applied to data for which we do not know the correct classification. However, to get an idea of the success rate of the algorithm, it is useful to apply it to a dataset for which we do know the correct answer. One dataset that is particularly useful for this purpose was first introduced to statistics by R.A. Fisher. The dataset contains the measurements in centimetres of the sepal length and width, plus the petal length and width of 50 flowers from 3 species of iris:

Figure 12.14: Two-dimensional marginal distributions of R.A. Fisher’s iris dataset, which comprises four variables measured in 150 different flowers belonging to three different species: setosa (circles), versicolor (triangles) and virginica (crosses). The two-dimensional dataset of Figures 12.9–12.13 was derived from panel (4,2), which sets out petal width against sepal width. The classification shown in Figure 12.13 does a decent job at classifying the 150 flowers into three groups but the classification is not perfect. For example, the flower in the lower left corner of panel (4,2) belongs to setosa but was incorrectly classified as versicolor in Figure 12.13.

We already know which iris species each of the 150 flowers belongs to. For the sake of illustration, we can ignore this information and apply the k-means algorithm to the four dimensional measurements. Afterwards, we can compare the resulting classification to the true species and visualise them on a 3 × 3 contingency table:

cluster	setosa	versicolor	virginica

1	50	0	0
2	0	48	14
3	0	2	36

Table 12.4: Classification results of the k-means algorithm applied to Fisher’s iris data.

Table 12.4 shows that all flowers of the setosa species were collected in the same group (cluster 1). This is not suprising when one considers that, in Figure 12.14, the circle symbols form a distinct group in all the panels. 48 out of 50 versicolor flowers were classified into the second cluster, with the remaining two flowers being misclassified into the third cluster. Finally, 36 out of 50 virginica flowers were classified into the third cluster, whilst 14 ended up in cluster 2. The difficulty in separating the versicolor and virginica flowers is caused by the overlap between their four dimensional data clouds of measurements.

The k-means clustering algorithm of Section 12.3 requires that we pre-specify the number of groups. It is not always obvious how to choose this number, although exercise 3 of Chapter 18.12 explores a method to help pick an optimal number of clusters. Hierarchical clustering is an alternative approach that builds a hierarchy from the bottom-up, and does not require the number of groups to be specified beforehand. Let us again use a simple 2-dimensional example to introduce the method:

x y ⌊ ⌋ 1 6 90 2|| 91 43|| X = 3|| 17 95|| 4⌈ 20 70⌉ 5 80 6

Figure 12.15: A simple bivariate dataset that will be used to illustrate the hierarchical clustering algorithm:

The algorithm works as follows:

1.

Put each data point in its own cluster and calculate the distances between them with Equation 12.6.

1 2 |-3--| 4 5 1⌊ 0 97.1 -12.1- 24.4 112 ⌋ 2| 97.1 0 90.4 76.0 38.6| d = 3|||12.1| 90.4 0 25.2 109 || |⌈------ |⌉ 4 24.4 76.0 25.2 0 87.7 5 112 38.6 109 87.7 0

2.

Identify the closest two clusters (corresponding to the boxed numbers in step 1) and join them together.

Figure 12.16: First step of the hierarchical clustering process. The closest two samples (1 and 3) have been grouped together into a new cluster (grey line, left). The results can also be visualised as a tree or dendrogram (right).

3.

Calculate the distances between the remaining four clusters:

⌊ 13 2 |-4--| 5 ⌋ 13 0 97.1 25.2-- 112 2 || 97.1-| 0 76.0 38.6 || d = 4 ⌈ 25.2-- 76.0 0 87.7 ⌉ 5 112 38.6 87.7 0

where the distance between cluster 13 (which groups samples 1 and 3) and the other points is calculated as

d[13,i] = max (d[1,i],d[3,i]) for i ∈ {2,4,5}

4.

Identify the closest two clusters (corresponding to the boxed numbers in step 3) and join them together.

Figure 12.17: The second step of the hierarchical clustering process shown as a scatter plot (left) and a dendrogram (right). The first order cluster is nested inside the second order one.

5.

Calculate the distance between the remaining three clusters:

⌊ 134 2 5 ⌋ 134 0 97.1 |112-| d = 2 ⌈ 97.1 |-0--| -38.6-⌉ 5 112 |38.6 | 0 ------

6.

Identify the two closest clusters (corresponding to the boxed numbers in step 5) and join them together:

Figure 12.18: The third step of the hierarchical clustering process shown as a scatter plot (left) and a dendrogram (right). The third cluster does not share any elements with the first two clusters.

7.

The final iteration yields the following tree:

Figure 12.19: Final results of the hierarchical cluster analysis. The tree consists of four nested clusters. The y-axis has units of distance: the longer the branch, the greater the difference between the corresponding clusters.

Applying the same algorithm to Fisher’s iris datasets produces a tree with 150 ‘leaves’, each corresponding to a single flower:

Figure 12.20: Hierarchical clustering tree of R.A. Fisher’s iris data. Labels have been omitted to reduce clutter.

Recall that the ‘height’ of the tree corresponds to the maximum possible distance between points belonging to two different clusters. The height changes rapidly between one and three clusters, indicating that these correspond to the most significant bifurcations. So let us ‘cut down’ the tree at this level:

Figure 12.21: Cutting the tree at height=3.6 (dashed line) produces a simple tree with three branches, which can be used to classify the iris flowers into three groups.

Given that we know the species of all 150 iris flowers in the dataset, we can assess the performance using a contingency table, just like Table 12.4:

cluster	setosa	versicolor	virginica

1	50	0	0
2	0	23	49
3	0	27	1

Table 12.5: Classification results of the hierarchical clustering algorithm applied to Fisher’s iris data.

The algorithm has done a good job at classifying setosa and virginica but struggles with versicolor.

Consider a collection of fresh basalt samples originating from different tectonic settings (e.g., mid ocean ridges, ocean islands and island arcs). Suppose that we have analysed the chemical composition of these samples (i.e., the concentration of SiO₂, Al₂O₃, etc). Further suppose that we have also analysed a basaltic rock from an ophiolite of unknown origin. Can we recover the original tectonic setting of the ophiolite?

This is one example of a supervised learning problem. Whereas the unsupervised learning algorithms of chapter 12 do not require prior knowledge about the data, the supervised learning algorithms of this chapter use training data to classify samples into pre-defined categories. Once the classification is complete, the same decision rules can then be used to assign a new sample of unknown affinity to one of these categories.

This chapter will introduce two supervised learning techniques. Discriminant analysis (Section 13.1) is a method that is similar in some ways to PCA (Section 12.1), whereas decision trees (Section 13.2) are similar in some ways to the hierarchical clustering algorithm of Section 12.4.

Consider a dataset X containing a large number of N-dimensional data, which belong to one of K classes. We are trying to decide which of these classes an unknown sample x belongs to. This question is answered by Bayes’ Rule (Section 4.3): the decision d is the class G (1 ≤ G ≤ K) that has the highest posterior probability given the data x:

d = k=m1a,x...,KP (G = k|X = x)

(13.1)

This posterior probability can be calculated using Bayes’ Theorem (Equation 4.11):

P (G |X ) ∝ P (X|G )P (G )

(13.2)

where P(X|G) is the likelihood of the data in a given class and P(G) the prior probability of the class, which we will assume to be uniform; i.e., P(G = 1) = P(G = 2) = … = P(G = K) = 1∕K. Therefore, plugging Equation 13.2 into Equation 13.1 reduces Bayes’ Rule to a comparison of likelihoods. We now make the simplifying assumption of multivariate normality:

[ 1 T −1 ] P(X = x|G = k) = exp-−-2(x∘−-μk)-Σ-k-(x−-μk-)- (2π)N|Σk|

(13.3)

where μ_k and Σ_k are the mean and covariance of the k^th class and (x − μ_k)^T indicates the transpose of the matrix (x − μ_k). Using Equation 13.3 and taking logarithms, Equation 13.1 becomes:

[ 1 1 ] d = max − -ln|Σk |− -(x − μk)TΣ −k1(x− μk ) k=1,...,K 2 2

(13.4)

Equation 13.4 forms the basis for quadratic discriminant analysis (QDA). Let us illustrate this procedure with a simple bivariate dataset comprising three classes:

Figure 13.1: Quadratic discriminant analysis of synthetic data: a) the bivariate training data comprise three classes that are marked by crosses, circles and triangles, respectively; b) fitting three bivariate normal distributions to the data produces c) curved decision boundaries between the three distributions.

If we have a new sample of unknown affinity, then we can simply plot its (x,y)-composition on Figure 13.1.c) and classify it in one of the three groups depending on the decision boundaries.

Usually, μ_k and Σ_k are not known, and must be estimated from the training data. If we make the additional assumption that all the classes share the same covariance structure (i.e., Σ_k = Σ for all k), then Equation 13.1 simplifies to:

[ ] T −1 1- T −1 d = km=a1x,...,K x Σ μk − 2 μkΣ μk

(13.5)

This is the basis of linear discriminant analysis (LDA), which has some desirable properties. Because Equation 13.5 is linear in x, the decision boundaries between the different classes are straight lines. Figure 13.2 illustrates this phenomenon with a second bivariate dataset:

Figure 13.2: Linear discriminant analysis of a second synthetic dataset: a) the bivariate training data comprise the same three classes as Figure 13.1; b) fitting three bivariate normal distributions to the data with the same covariance matrix c) produces linear decision boundaries between the three distributions.

LDA can also be used to reduce the dimensionality of datasets in a similar way to PCA. Recall that, in Section 12.1, we used a simple toy example to show how PCA can be used to project a two-dimensional dataset onto a one dimensional line. We can use the same approach to illustrate how LDA can achieve the same effect with a different aim. Consider a mixed dataset of two bivariate samples, X and Y :

a b a b 1⌊3 0 ⌋ 1⌊60 19⌋ ⌈ ⌉ ⌈ ⌉ X = 2 0 2 and Y = 2 69 36 3 9 19 3 66 38

(13.6)

Calculate the means of X and Y

[a b] [ a b ] ¯X = 4 7 and ¯Y = 65 31

(13.7)

and the within-group covariance matrix:

Σ = ΣX1--+-ΣX2-+-...+-ΣY-3 w 6− 2

(13.8)

where

Σ_X1	= ,	(13.9)
Σ_X2	= ,…	(13.10)
Σ_{Y 3}	= .	(13.11)

so that:

[ 84 174] Σw = 174 436

(13.12)

The linear discriminant function is obtained by minimising the within-group (co)variance relative to the difference between the group means:

a b LD = Σ− 1(¯Y − X¯) = [14.1 − 5.42] w

(13.13)

R’s lda() function (Section 17.13.1) normalises the linear discriminant function as follows:

a b ′ -----LD1------- [ ] LD = ∘ ----T--------= 0.52 − 0.20 LD1 Σw LD1

(13.14)

The projected coordinates for the two samples then become:

⌊ ⌋ ( ⌊ ⌋ ) (⌊ ⌋ ) ⌊ ⌋ X1 3 0 − 12.65 || X2 || || || 0 || || |||| 2 || || || − 14.61 || || X3 || → 0.52|| || 9 || − ¯a|| − 0.2 |||| 19 || − ¯b|| = || − 13.32 || || Y1 || || || 60 || || |||| 19 || || || 13.32 || ⌈ Y2 ⌉ ( ⌈ 69 ⌉ ) (⌈ 36 ⌉ ) ⌈ 14.61 ⌉ Y3 66 38 12.65

(13.15)

where ā(= 34.5) and b(= 19.0) are the column means of the pooled data. The outcome of these calculations are shown graphically in Figure 13.3, alongside the results of a PCA analysis for the same dataset.

Figure 13.3: (i) The bivariate toy dataset of Equation 13.6 with indication of the linear dicriminant function (LD) and the first (PC1) and second (PC2) principal components. LDA uses the groupings (X and Y ), PCA ignores them. (ii) Projecting the data onto the directions defined by LD, PC1 and PC2 produces three one-dimensional views of the toy data. LD is the direction that maximises the between-group variance (i.e. the variance of the group means) relative to the within-group variance (i.e., the pooled variance of the groups about their means). In contrast, PC1 maximises the overal variance.

Applying LDA to the four-dimensional iris dataset produces two linear discriminant functions (K − 1, where K = 3 is the number of classes/species). Like the first two principal components of Figure 12.5 and the first two MDS dimensions of Figure 12.6, also the two linear discriminant functions of Figure 13.4 achieve a dimension reduction. But whereas PCA and MDS aim to visualise the total variability of the dataset, LDA aims to highlight the pre-defined clusters within the dataset:

Figure 13.4: The two linear discriminants of the four-dimensional iris data represent a two-dimensional projection of these data that maximises the differences between the three different species of flowers. They are defined as
LD1 = 0.83 × (sepal length - 5.84) + 1.53 × (sepal width - 3.06) - 2.20 × (petal length - 3.76) - 2.81 × (petal width - 1.20); and
LD2 = 0.024 × (sepal length - 5.84) + 2.16 × (sepal width - 3.06) - 0.93 × (petal length - 3.76) + 2.84 × (petal width - 1.20).

Suppose that we have a new flower with a sepal length of 6.0 cm, a sepal width of 3.0 cm, a petal length of 5.0 cm and a petal width of 1.5 cm. What species does the flower belong to?

Figure 13.5: Using the two discriminant functions of Figure 13.4:
LD1 = 0.83 × (6.0 − 5.84) + 1.53 × (3.0 − 3.06) − 2.20 × (5.0 − 3.76) − 2.81 × (1.5 − 1.20) = −3.53; and
LD2 = 0.024×(6.0−5.85)+2.16×(3.0−3.06)−0.93×(5.0− 3.76) + 2.84 × (1.5 − 1.20) = −0.42.
Plotting these two coordinates on the discrimination diagram of Figure 13.4 suggests that the new flower belongs to the versicolor species.

A more precise classification can be obtained by plugging the measurements of the new flower into Equation 13.2 and calculating the posterior probabilities for the three species. This produces the following result:

	virginica	versicolor	setosa

P(G\|X)	0.19	0.81	3.2 × 10⁻²⁷

In conclusion, there is an 81% chance that the new flower is versicolor and a 19% chance that it is virginica.

Discriminant analysis is a parametric learning algorithm, which assumes that the different classes in a dataset are grouped in multivariate normal clusters (Figures 13.1.b and 13.2.b). Here is an example of a dataset for which this assumption is invalid:

Figure 13.6: Synthetic dataset of bivariate measurements that belong to two classes. The black circles are split into two data clouds. The white circles form a ‘c’-shape surrounding one of the modes of the black population. Discriminant analysis is unable to deal with this situation.

Decision trees are a nonparametric learning algorithm that is better able to handle complex multimodal datasets like this. It uses recursive binary partitions to describe the data, with a piecewise constant function. The first step of the algorithm exhaustively searches all the possible split points s (−∞ < s < ∞) and variables (x and y in our example) to minimise the impurity Q:

Q = p∘(1− p∘) + p∙(1 − p∙)

(13.16)

where p_∘ and p_∙ are the proportions of class 1 and class 2 objects in one half of the partition. Equation 13.16 is also known as the Gini index of diversity. Figure 13.7 evaluates a few possible split points using this criterion. The optimal solution is found by exhaustive searching:

Figure 13.7: Evaluation of three candidate split points using the Gini index Q. The best first order split corresponds to x = −4.902 (panel c).

The left hand side of the first partition (x < −4.902, Figure 13.7.c) is pure (100% ×∙). So the second step only needs to evaluate the right hand side. One major advantage of using recursive partitions is that they can be visualised as dendrograms, just like the hierarchical clusters of Section 12.4:

Figure 13.8: Second order partition of the right hand side of Figure 13.7.c). The results can be visualised as a dendrogram or tree (right). The ‘leaves’ of this tree are annotated as n_∘∕n_∙ where n_∘ is the number of training data of class ∘ and n_∙ is the number of training data of class ∙.

The recursive partitioning process can be repeated until all the nodes of the decision tree are ‘pure’, i.e. they contain only ∘ or only ∙:

Figure 13.9: A recursive binary partition tree that perfectly classifies all the samples in Figure 13.6. The annotations of the dendrogram have been removed to reduce clutter.

The maximum sized tree thus obtained contains 35 terminal nodes and perfectly describes the training data. In other words, it has zero bias. However, for the purpose of prediction, this tree is not optimal, because it overfits the training data, causing high variance. One way to estimate the misclassification rate is to use a separate set of test data whose affinities (∘ or ∙) are known, but which were not used to train the decision tree. Another method is cross validation. This method works as follows:

1.: divide the training data into ten equal parts;
2.: remove one of the parts, and use the remaining nine to create a decision tree;
3.: put the removed part in the tree and count the number of misclassified samples;
4.: repeat steps 2 and 3, but this time removing the second fraction;
5.: repeat until all ten parts have been assessed.

The tree with optimal predictive power is smaller than the largest possible tree, and can be found by ‘pruning’ it. Define the “cost-complexity criterion” of a tree T as

|T | ∑ cpα(T ) = NmQm + α|T| m=1

(13.17)

where |T| is the number of terminal nodes in the tree T, N_m the number of observations in the m^th terminal node; Q_m the impurity of the m^th node, defined by Equation 13.16; and α a tuning parameter. For a given α ≥ 0, it is possible to find the subtree T_α ⊂ T₀ that minimises cp_α(T) over all possible subtrees of the largest possible tree T₀:

T α = mTi⊂nT cpα(T) 0

(13.18)

Repeating this procedure for a range of values 0 ≤ α < ∞ produces a finite nested sequence of trees {T₀,T_α₁,...,T_{α_max}}. Except for T₀, these trees no longer have only pure end-nodes. Impure end-nodes are assigned to the class that dominates in them. We then choose the value α^∗ that minimises the cross validation error. For the bivariate example dataset:

Figure 13.10: A plot of cross validated (CV) prediction error versus the number of nodes in the collection of nested subtrees shows a minimum at five splits. The CV error of small trees is caused by bias, whereas the CV error of large tree is a result of variance. There typically exist several trees with CV errors close to the minimum. Therefore, a ‘1-SE rule’ is used, i.e., choosing the smallest tree whose misclassification rate does not exceed the minimum CV error plus one standard error of the smallest CV error (dotted line).

For the bivariate example, this procedure generates a tree with four splits and five terminal nodes:

Figure 13.11: The optimal tree for the bivariate test data. This tree misclassifies 19 of the 300 samples in the training data (6.3%). The 10-fold cross validation error is 18%.

Like the discriminant analysis of Section 13.1, also decision trees can be applied to, and are most useful for, datasets that comprise more than two dimensions. Applying the method to Fisher’s iris data:

Figure 13.12: The optimal tree for Fisher’s iris data. There are four variables (instead of two for the previous example) and three classes (instead of two for the previous example). The optimal tree misclassifies 6 of the 150 flowers in the training data (4%). The 10-fold cross validation error is 6%.

Suppose that we have a new flower with a sepal length of 6.0 cm, a sepal width of 3.0 cm, a petal length of 5.0 cm and a petal width of 1.5 cm. To which species does the flower belong? The petal length of the new flower is greater than the 2.45 cm cutoff of the first split and its petal width is less than the 1.75 cm cutoff of the second split. Therefore the flower ends up in second terminal node of the tree, suggesting that it belongs to species versicolor. This is the same conclusion as the LDA classification shown in Figure 13.5.

The normal distribution (chapter 7) plays a key role in linear regression (chapter 10), PCA (Section 12.1), discriminant analysis (Section 13.1) and many other standard statistical techniques. Although Gaussian distributions are common in nature, it is dangerous to assume normality for all datasets. In fact, more often than not, the normality assumption is invalid for geological data. Ignoring this non-normality can lead to counter-intuitive and plainly wrong results.

To illustrate the dangers of blindly assuming normality, let us consider the simple case of positive ratio data, which are quite common in the Earth Sciences. Take, for example, the ratio of compressional and shear-wave velocity (V _p∕V _s) in seismology, which can be used to infer lithology. Or consider the ratio of ⁸⁷Sr to ⁸⁷Rb in geochronology (chapter 10). We have already seen that ratios can lead to spurious correlations in regression analysis (section 10.4). This section will show that even basic summary statistics such as the mean and standard deviation can produce counter-intuitive results when applied to ratios of Jeffreys quantities. Consider the following two datasets of ten random numbers between 0 and 1:

	1	2	3	4	5	6	7	8	9	10

A	0.27	0.37	0.57	0.91	0.20	0.90	0.94	0.66	0.63	0.062
B	0.21	0.18	0.69	0.38	0.77	0.50	0.72	0.99	0.38	0.780

Forming two sets of ratios by dividing A by B and B by A:

	1	2	3	4	5	6	7	8	9	10

A∕B	1.30	2.10	0.83	2.40	0.26	1.80	1.30	0.67	1.70	0.079
B∕A	0.78	0.47	1.20	0.42	3.80	0.55	0.76	1.50	0.60	13.0

Simple arithmetic dictates that the reciprocal of A∕B equals B∕A:

	1	2	3	4	5	6	7	8	9	10

A∕B	1.30	2.10	0.83	2.40	0.26	1.80	1.30	0.67	1.70	0.079
B∕A	0.78	0.47	1.20	0.42	3.80	0.55	0.76	1.50	0.60	13.0
1∕(A∕B)	0.78	0.47	1.20	0.42	3.80	0.55	0.76	1.50	0.60	13.0

However when we calculate the mean of the ratios:

	1	2	3	4	5	6	7	8	9	10	mean

A∕B	1.30	2.10	0.83	2.40	0.26	1.80	1.30	0.67	1.70	0.079	1.20
B∕A	0.78	0.47	1.20	0.42	3.80	0.55	0.76	1.50	0.60	13.0	2.30

then we find that the mean of A∕B does not equal the reciprocal of the means of B∕A!

1 1 ----- -----= ---- = 0.81 ⁄= 2.30 = B∕A A∕B 1.20 -1--- -1-- ----- and B∕A = 2.30 = 0.44 ⁄= 1.20 = A∕B

The solution to the ratio averaging conundrum is to take logarithms:

	1	2	3	4	5	6	7	8	9	10

ln[A∕B]	0.25	0.75	-0.18	0.86	-1.30	0.59	0.27	-0.41	0.50	-2.50
ln[B∕A]	-0.25	-0.75	0.18	-0.86	1.30	-0.59	-0.27	0.41	-0.50	2.50

Taking the average of the logarithms and exponentiating produces a geometric mean:

	mean	exp[mean]

ln[A∕B]	-0.12	0.88
ln[B∕A]	0.12	1.13

We then find that:

1 1 ------- = ---- = 1.13 = g (B ∕A ) g(A ∕B) 0.88 and ---1--- = --1- = 0.88 = g (A ∕B ) g(B ∕A) 1.13

where g(∗) stands for the “geometric mean of ∗”. This is an altogether more satisfying result than the arithmetic mean.

Like the ratios of the previous section, the chemical compositions of rocks and minerals are also expressed as strictly positive numbers. They, however, do not span the entire range of positive values, but are restricted to a narrow subset of that space, ranging from 0 to 1 (if fractions are used) or from 0 to 100 (using percentage notation). Compositions are further restricted by a constant sum constraint:

n ∑ Ci = 1 i=1

for an n-component system. Consider, for example, a three-component system {x,y,z}, where x + y + z = 1. Such compositions can be plotted on ternary diagrams, which are very popular in geology. Well known examples are the Q–F–L diagram of sedimentary petrography, the A–CN–K diagram in weathering studies, and the A–F–M, Q–A–P and Q–P–F diagrams of igneous petrology. Treating the ternary data space as a regular Euclidean space with Gaussian statistics leads to wrong results, as illustrated by the following example:

Figure 14.1: The A–CN–K diagram is a widely used graphical device in chemical weathering studies. It is a ternary diagram of Al₂O₃, CaO + Na₂O and K₂O, where the CaO refers to the silicate component of the sediment only (ignoring carbonates). The composition on the diagram results from the competing effects of initial starting composition and chemical weathering. With increasing weathering intensity, A–CN–K compositions get pulled towards the Al₂O₃ apex of the ternary diagram. This figure shows a synthetic dataset of 20 A–CN–K measurements that have been affected by variable weathering intensities.

To obtain an average composition for the 20 samples, our natural instinct is to calculate the arithmetic mean of their A–CN–K values:

------ ∑20 Al2O3 = (Al2O3)i∕20 = 0.763 i=1 ------------ ∑20 CaO + Na2O = (CaO2 + Na2O )i∕20 = 0.141 i=1 ----- ∑20 K2O = (K2O )i∕20 = 0.096 i=1

Plotting this result on the ternary diagram reveals that it is physically meaningless:

Figure 14.2: The black square represents the average A–CN–K composition of the 20 samples of Figure 14.1. It is obtained by taking the arithmetic mean of all the Al₂O₃, (CaO + Na₂O) and K₂O concentrations using Equation 3.1, and plotting the resulting 3-element vector as a new sample. This average composition plots outside the sample cloud, which is a meaningless result not unlike the vegetation example of Figure 3.3.a.

To quantify the dispersion of the data, we might be tempted to calculate the standard deviation of the A–CN–K values:

┌ ---------------------- ││ 2∑0 [(Al O ) − 0.763]2 s[Al2O3 ] = ∘ ----2-3-i---------= 0.0975 i=1 19 ┌│ ----------------------------- │∘ 2∑0 [(CaO2-+-Na2O--)i −-0.141]2 s[CaO + Na2O ] = 19 = 0.142 ┌ --------i=1--------- ││ ∑20 2 s[K2O ] = ∘ [(K2O-)i −-0.096]-= 0.0926 i=1 19

If the assumption of normality holds, then we would expect ∼95% of the data to fall into a ‘2-sigma’ interval around the mean (Section 7.3):

Al2O3 : 0.763 ± 0.195 CaO + Na2O : 0.141± 0.284 K2O : 0.096± 0.185

Note how the lower limits of the confidence regions for (CaO + Na₂O) and K₂O have physically impossible negative values. Visualising these limits as a 2-sigma ‘confidence polygon’:

Figure 14.3: The polygon represents a ‘2-sigma confidence region’ for the arithmetic mean of Figure 14.2. It is obtained by 1) calculating the standard deviations of the Al₂O₃, (CaO + Na₂O) and K₂O concentrations using Equation 3.3; 2) multiplying these values by two; and 3) subtracting or adding these values to the arithmetic mean of Figure 14.2. The resulting ‘2-sigma’ bounds plot outside the ternary diagram, in physically impossible negative data space.

The synthetic example shows that even the simplest summary statistics do not work as expected when applied to compositional data. The same is true for more advanced statistical operations that are based on the normal distribution. This includes linear regression (Section 10.3), principal component analysis (Section 12.1) and discriminant analysis (Section 13.1). These problems have long been known to geologists, but a comprehensive solution was not found until the 1980s, by Scottish statistician John Aitchison.

The solution to the compositional data conundrum is closely related to the solution of the ratio averaging problem discussed in Section 14.1. The trick is to map the n-dimensional composition to an (n − 1)-dimensional Euclidean space by means of an additive logratio transformation (alr). For example, in the ternary case, we can map the compositional variables x, y and z to two transformed variables u and v:

[ ] [ ] x- y- u = ln z , v = ln z

(14.1)

After performing the statistical analysis of interest (e.g., calculating the mean or constructing a confidence region) on the transformed data, the results can be mapped back to compositional space with the inverse logratio transformation. For the ternary case:

eu ev 1 x = -u---v----, y =-u----v----and z = -u---v----- e + e + 1 e + e + 1 e + e + 1

(14.2)

Applying this method to the A–CN–K dataset:

Figure 14.4: The additive logratio transformation (Equation 14.1) maps data from an n-dimensional compositional space to an (n−1)-dimensional Euclidean space. For the A–CN–K data, it maps the data from a ternary diagram (n = 3) to a bivariate (n−1 = 2) dataspace using Equation 14.1. In this transformed space, it is safe to calculate arithmetic means (square) and confidence regions (ellipse). After completion of these calculations, the result can be mapped back to the ternary diagram using the inverse logratio transformation (Equation 14.2).

For the A–CN–K example, the transformed logratio mean plots inside the heart of the data cloud. The 95% confidence ellipse of the logratios maps to a curved confidence region that entirely stays within the ternary diagram and closely hugs the data. These results are a lot more meaningful than the arithmetic mean and 2-sigma confidence polygons of Figures 14.2 and 14.3.

The logratio transformation makes intuitive sense. The very fact that it is possible to plot a ternary diagram on a two-dimensional sheet of paper already tells us that it really displays only two and not three dimensions worth of information. The bivariate logratio variables more faithfully represent the true information content of the compositional data. The logratio transformation can easily be generalised from three to more variables. For example, four-component compositional datasets are constrained within a three dimensional triangle or tetrahaedron. The general mathematical term for a constrained dataspace like a ternary diagram or tetrahaedron is the simplex. The additive logratio transformation maps data from the simplex to an ordinary Euclidean data space.

Table 14.1 shows the concentration of 10 major oxides in 16 samples of dune sand from the Namib sand sea:

	SiO₂	Al₂O₃	Fe₂O₃	MgO	CaO	Na₂O	K₂O	TiO₂	P₂O₅	MnO

N1	82.54	6.14	3.18	1.65	2.24	1.16	1.35	0.44	0.11	0.06
N2	83.60	6.42	2.55	1.25	1.83	1.21	1.48	0.36	0.08	0.04

N13	79.96	6.41	3.19	1.31	2.10	1.09	1.62	0.51	0.06	0.05
N14	73.62	9.96	4.07	2.01	3.45	1.86	2.29	0.44	0.10	0.07
T8	85.70	5.89	1.82	0.81	1.44	1.12	1.67	0.23	0.07	0.03
T13	82.54	6.02	2.30	1.42	3.07	1.19	1.46	0.34	0.11	0.04

Table 14.1: The major element composition (in weight percent) of 16 samples of Namib dune sand.

It is not possible to plot this multivariate dataset on a single two-dimensional diagram. Section 12.1 introduced principal component analysis as an effective way to address this issue by projecting the data onto a two-dimensional subspace. Unfortunately we cannot directly apply PCA to compositional data. For example, the first step in the analysis involves the calculation of an arithmetic mean (Equation 12.2). However, Section 14.2 showed that the arithmetic mean of compositional data can produce nonsensical results. This Section will show that this problem can be solved using logratio transformations.

Before we apply PCA to the data of Table 14.1, let us first illustrate the logratio solution using a simple toy example, which is similar to the toy example that was used in Section 12.1. Consider the following trivariate (a, b and c) dataset of three (1, 2 and 3) compositions that are constrained to a constant sum (a_i + b_i + c_i = 100% for 1 ≤ i ≤ 3):

a b c 1⌊ 0.03 99.88 0.09 ⌋ ⌈ ⌉ X = 2 69.45 25.55 5.00 3 72.44 26.64 0.92

(14.3)

Plot the data on a ternary diagram:

Figure 14.5: Ternary diagram of the synthetic toy example of Equation 14.3. Component c has been multiplied by a factor of three to avoid an unsightly overlap between the plot symbols of samples 2 and 3, whose compositions are very similar.

Although the alr-transformation solves the closure problem, it is not suitable for PCA because the axes of alr-coordinate spaces are not orthogonal to each other:

Figure 14.6: There are six different ways to form logratios in the ternary data space. These six logratios define three coordinate axes that intersect each other at 60^∘ angles. As a consequence, distances in alr-space depend on the choice of logratios.

For example, calculate the distance between the alr-coordinates of samples 2 and 3 of Equation 14.3, first using b as a common denominator (d[2,3]_b), and then using c as a common denominator (d[2,3]_c):

∘ (--[-----]-----[-----])----(---[-----]----[------])-- 69.45 72.44 2 5.00- 0.92- 2 d[2,3]b = ln 25.55 − ln 26.64 + ln 25.55 − ln 26.64 = 1.74 ∘ ----------------------------------------------------- ( [69.45] [72.44])2 ( [25.55] [ 26.64 ])2 ⁄= ln -5.00 − ln -0.92- + ln 5.00- − ln 0.92- = 2.46 = d[2,3]c

The fact that distances are not uniquely defined in alr-space spells trouble for PCA. Recall the equivalence of PCA and classical MDS, which was discussed in Section 12.2. MDS is based on distance matrices, and so if dissimilarities are not well defined, then neither are the MDS configuration and, hence, the principal components. This issue can be solved by using a different type of logratio transformation, the centred logratio transformation (clr):

[ x ] [ y ] [ z ] u = ln -------- , v = ln-------- and w = ln -------- g(x,y,z) g(x,y,z) g(x,y,z)

(14.4)

where g(x,y,z) is the geometric mean of the composition:

[ ] g(x,y,z) = exp ln[x]+-ln[y]+-ln[z]- 3

The inverse clr-transformation is defined as follows:

eu ev ew x = eu +-ev-+-ew, y = eu +-ev-+-ew-and z = eu-+-ev +-ew

(14.5)

Unlike alr-coordinate axes, clr-coordinate axes are orthogonal and unequivocally define a set of well behaved distances. Applying the clr-transformation to the data of Equation 14.3 yields a new trivariate dataset:

ln(a∕g) ln (b∕g) ln(c∕g) ⌊1 − 3 5 − 2 ⌋ X = ⌈2 1.21 0.21 − 1.42 ⌉ c 3 1.79 0.79 − 2.58

(14.6)

where g stands for the geometric mean of each row. Note that each of the rows of X_c adds up to zero. Thanks to the symmetry of the clr-coordinates, whose three-dimensional axes are perpendicular to each other, the distances between the rows (which are also known as Aitchison distances) are well defined. Subjecting Equation 14.6 to the same matrix decompositions as Equations 12.2–12.5 yields:

X_c
= 1_3,1 M + P Q⁻¹

= ⌊ 1 ⌋
⌈ ⌉
1
1

⌊ − 4.24 0 0 ⌋
⌈ ⌉
2.12 − 7.1 0
2.12 7.1 0

⌊ 0.71 − 0.71 0 ⌋
⌈ ⌉
0.41 0.41 − 0.82
0.58 0.58 0.58

(14.7)

Note that, even though this yields three principal components instead two, the scores of the third component in matrix P are all zero. Therefore, all the information is contained in the first two components¹ .

Figure 14.7: Compositional biplot of the toy data using the clr transformation. The biplot tells us that sample 1 is rich in component b, whereas samples 2 and 3 are rich in component a. The difference between samples 2 and 3 is due to a small difference in component c. Compare with the ternary diagram (Figure 14.5) to verify these conclusions.

Applying the same procedure to the Namib dataset of Table 14.1:

Figure 14.8: Compositional biplot of the major element concentration data of Table 14.1, with the samples shown in grey and the components in black. Samples that plot close together (such as N1 and N14) are compositionally similar, whereas samples that plot far apart (such as N8 and T8) are compositionally different. One should be careful not to interpret the individual rays of compositional biplots. For example, one cannot safely conclude that N8 is enriched in MnO relative to T8. However, it is safe to say that N8 has a higher MnO/Na₂O-ratio than T8.

With regard to the rays, the key information is contained in the links between them. For example, the link between MgO and CaO on Figure 14.8 is short, which suggests that these two oxides are proportional and their (log)ratios exhibit little scatter. In contrast, the link between K₂O and MgO is long, so their logratios are more dispersed. The link between TiO₂ and CaO is collinear with that between MnO and MgO. This indicates that the TiO₂ – CaO – MnO – MgO subcomposition forms a one dimensional pattern. The link between K₂O and CaO is perpendicular to that between TiO₂ and P₂O₅, suggesting that the corresponding subcompositions are independent:

Figure 14.9: Subcompositional analysis of Figure 14.8. a) Box plots of the CaO/MgO-logratio (short link) and the K₂O/MgO-logratio (long link); b) The collinearity of the links between TiO₂, CaO, MnO and MgO indicates strong correlation of the samples in ln[MnO/MgO] – ln[TiO₂/CaO] space; c) The link between P₂O₅ and TiO₂ is orthogonal to that between between K₂O and CaO, reflecting independence of the corresponding logratios.

Although the alr and clr transformations act in slightly different ways, they both achieve the same effect, namely to ‘liberate’ the compositional data from the confines of the simplex to a Euclidean space in which values are free to take any value from −∞ to +∞. Logratio transformations allow compositional data to be analysed by a host of standard statistical techniques, including not only PCA, but also clustering and discriminant analysis.

Section 14.3 showed how unsupervised learning techniques such as PCA can be safely applied to compositional data after performing a logratio transformation. Exactly the same procedure can be used for supervised learning techniques such as LDA. We will illustrate this with an example from igneous petrology:

Figure 14.10: A dataset of igneous rock compositions from Iceland and the Cascade Mountains on a ternary A–F–M diagram (where A = Na₂O+K₂O, F = Fe₂O₃+FeO and M = MgO). The white dots define a ‘Fenner’ trend marking the tholeiitic suite of igneous rocks from Iceland. The black dots define a ‘Bowen’ trend, marking the calc-alkaline suite of rocks from the Cascades.

Plotting the same data in logratio space:

Figure 14.11: The additive logratio transformation liberates the A–F–M data from the confines of the ternary diagram and maps them to a Euclidean dataspace in which logratios are free to take any value from −∞ to +∞. In this space, the ln(M/F) vs. ln(A/F) values of the tholeiitic and calc-alkali rock suites are clustered into two distinct clouds of roughly equal size that have all the hallmarks of bivariate normal distributions with a shared covariance matrix. Thus the transformed data seems well suited for linear discriminant analysis.

Applying LDA to the A–F–M data and visualising the results in both logratio space and the ternary diagram:

Figure 14.12: LDA of the A–F–M data shown in a) logratio space and b) a ternary diagram.

The linear boundary between the tholeiitic and calc-alkaline fields of the A–F–M logratio diagram is not only an effective discriminator between these two fields, but it also encodes some real geochemical information about the petrogenesis of igneous rocks. Section 14.5 will show how fundamental compositional processes can lead to simple laws in logratio space.

Consider a magma containing A mass units of Na₂O+K₂O, F mass units of FeO+Fe₂O₃, and M mass units of MgO. Suppose that, as the magma cools, it loses components A, F and M at rates that are proportional to the amounts of A, F and M present in the magma:

∂A- ∂F- ∂M-- ∂t = − λAA, ∂t = − λF F, and ∂t = − λM M

(14.8)

where t is time and λ_x is a decay constant (for x ∈{A,F,M}). The same mathematical formulation can be used to describe the settling of sediment from a suspension, or the decay of radioactive isotopes. The solution to Equation 14.8 is a set of exponential functions:

A = A ∘exp(− λAt), F = F ∘exp(− λFt), and M = M ∘exp (− λM t)

(14.9)

where A_∘, F_∘ and M_∘ are the initial values of A, F and M in the primitive magma. Different values of λ_A, λ_F and λ_M give rise to different trajectories on the AFM diagram. Combining the three compositional variables A, F and M into two logratio variables ln(A∕F) and ln(M∕F) recasts the exponential functions of Equation 14.9 into two linear functions:

ln(A∕F) =	ln(A_∘∕F_∘) + (λ_F − λ_A)t	(14.10)
ln(M∕F) =	ln(M_∘∕F_∘) + (λ_F − λ_M)t	(14.11)

which can be combined as follows:

ln(A ∕F) =C ln(M ∕F) + C 1 2 where C1 = λF-−-λA-- λF − λM and C2 = ln(A ∘∕F∘)− C1 ln(M ∘∕F∘)

(14.12)

Thus, the curved trajectories on the AFM diagram become straight lines in logratio space. This is exactly the behaviour that was observed in Figure 14.12. Evaluating the compositional evolution of three different A–F–M systems:

produces the following output:

Figure 14.13: Exponential decay processes of compositional data become linear trends in logratio space. Initial composition i forms the starting point of two trajectories, marked by dashed and solid lines, respectively. These trends are characterised by different decay parameters.

The dashed and solid lines in Figure 14.13 mimic the Fenner and Bowen trends of the tholeiitic and calc-alkaline magma series, respectively. This makes geological sense because it is not hard to imagine how λ_F could depend on the oxygen fugacity in the magma, which controls the valence state of the Fe-ions and, hence, the minerals that they form.

Strike and dip, azimuth and elevation, latitude and longitude, … directional data are ubiquitous in the Earth Sciences. Just like the compositional data discussed in chapter 14, the statistical analysis of these directional data is fraught with dangers.

Consider the following dataset of directions (in degrees) of thirty glacial striations from Madagascar:

44	51	79	65	27	31	4	355	22	352	287	7	287	339	0
276	342	355	334	296	7	17	351	349	37	339	40	324	325	334

These values span a range from 0^∘ and 360^∘, where 0^∘ = 360^∘. Plotting the data on a circle:

Figure 15.1: The directions of 30 glacial striations. The values are roughly centred around zero but exhibit significant scatter, from the northwest (276^∘) to the northeast (79^∘).

Computing the arithmetic mean of these angles yields a value of 189.2^∘. This is a nonsensical result:

Figure 15.2: The same glacial striation data of Figure 15.1, with their arithmetic mean shown as an arrow. Even though the individual striations are all pointing north, their arithmetic mean is pointing in exactly the opposite direction.

The problem is that directional data are ‘wrapped around’ a circle. The difference between 1^∘ and 359^∘ is not 358^∘ but 2^∘. So the usual rules of arithmetic do not apply to angular measurements. Better results for the difference between two angles 𝜃 and ϕ are obtained using the following trigonometric relationship:

d[𝜃,ϕ ] = arcsin (sin[𝜃]cos[ϕ]− cos[𝜃]sin[ϕ])

(15.1)

Similarly, the mean of 1^∘ and 359^∘ should not be 180^∘ but 0^∘. A more sensible definition of the ‘mean direction’ is again obtained using trigonometry, by taking the vector sum of all the component directions. For example, if 𝜃 = {𝜃₁,…,𝜃_i,…𝜃_n} are n angles, then the resultant direction is obtained by summing the horizontal and vertical components of unit vectors pointing in these directions:

(∑n ∑n ) 𝜃 = arctan2 sin[𝜃i], cos[𝜃i] i=1 i=1

(15.2)

where

{arctan (y∕x) if x > 0 and arctan2 (y, x) = arctan(y∕x)± π if x < 0

(15.3)

Applying this formula to the glacial striation data:

Figure 15.3: The vector sum of the 30 glacial striation measurements points in a direction of 357.6^∘, which is marked by the arrow and does a much better job at capturing the ‘central’ direction than the arithmetic mean of Figure 15.2 does.

Dividing the length of the vector sum by the number of measurements yields a new summary statistic, R, which is shown as a thick black arrow in Figure 15.4.

┌│ (------------)2---(------------)2- │∘ ∑n ∑n R¯= sin[𝜃i]∕n + cos[𝜃i]∕n i=1 i=1

(15.4)

Figure 15.4: An average direction (dashed line) is obtained by taking the vector sum of four angular measurements (thin black arrows). Scaling by the number of measurements (thick black arrow) yields a measure of concentration (0 ≤ R ≤ 1), which increases in length with decreasing angular dispersion and vice versa.

This arrow shortens with increasing scatter and so R is not a dispersion parameter like the standard deviation, but a concentration parameter. To create a dispersion parameter, we can use R to define the circular standard deviation:

∘ -------- sc = ln(1∕R¯2 )

(15.5)

In the extreme case where all the measurements point in the same direction, R = 1 and s_c = 0. If the data are evenly distributed around the circle, then R = 0 and s_c = ∞.

Figure 15.5: When the directional measurements are greatly dispersed, the normalised vector sum R is short (R = 0.125 in this example), and the circular standard deviation is large (s_c = 4.16).

The normal distribution is inappropriate for directional data, for the same reason why it was inappropriate for compositional data: its tails go from −∞ to +∞ and do not fit within the constrained data space of the angles. The von Mises distribution does not suffer from this problem:

f (𝜃|μ,κ) ∝ exp[κcos(𝜃 − μ )]

(15.6)

where μ is the location parameter (the mean angle) and κ is the concentration parameter. As the name suggests, large κ-values correspond to narrow distributions, and small κ-values to wide ones.

Figure 15.6: Four examples of the von Mises distribution with different values for the mean μ and the concentration parameter κ, wrapped around a (grey) circle. The probability density is proportional to the distance from the circle to the black line.

The parameter κ is not easy to determine directly, but can be estimated indirectly via the concentration parameter R and the following approximation:

(p + 1− R¯2 ) ˆκ = R¯-------¯2--- 1 − R

(15.7)

where p marks the number of parameters, namely p = 1 for circular data and p = 2 for spherical data (Section 15.4).

Figure 15.7: R and κ are two concentration parameters for circular data. R is easier to calculate than κ (using Equation 15.4), and can be converted to κ in order to fit a von Mises distribution to the data.

Applying this routine to the glacial striation measurements of Section 15.1 yields a value of R of 0.78, which corresponds to a κ-value of 2.8.

Figure 15.8: Density plot and rug plot for the glacial striation data. The bell shaped density curve represents a von Mises distribution with concentration parameter κ = 2.8 and mean μ = 357.6^∘. In contrast with Figure 15.6, in which the von Mises distribution was wrapped around a circle, here it is stretched out along a linear interval from −π to +π.

The principles of directional statistics can be generalised from the 1-dimensional circular line in 2-dimensional space to a 2-dimensional spherical surface in 3-dimensional space. The coordinates of data in this space can be expressed as latitude (l) and longitude (L):

( |{ x = cos[l]sin[L] y = sin[l] |( z = − cos[l]cos[L ]

(15.8)

as strike (S) and dip (D):

( |{x = − cos[D ]sin[S] y = cos[D ]cos[S] |( z = sin[D]

(15.9)

or as azimuth (A) and dip (D):

( |{ x = cos[D]cos[A] y = cos[D]sin[A ] |( z = sin[D ]

(15.10)

where the x-axis points towards the north, the y-axis towards the east, and the z-axis points downwards. Spherical data can be visualised on a 2-dimensional surface by stereographic map projection, using either a Wulff equal angle or a Schmidt equal area stereonet:

Figure 15.9: a) Wulff equal angle and b) Schmidt equal area stereonets, with circles of 10^∘ radius drawn at azimuths of 0^∘,90^∘,180^∘ and 270^∘, and dips of 10^∘ and 90^∘. As the names suggest, the Wulff equal angle net preserves the shape of the circles, and the Schmidt equal area net their size.

The Wulff net is used in structural geology and the Schmidt net in crystallography. Given the Cartesian coordinates {x,y,z}, the stereographic projection proceeds as follows:

{ X = tan(ϕ) sin(𝜃) Y = tan(ϕ) cos(𝜃)

(15.11)

for the Wulff net, and

{ √ -- X = √ 2sin(ϕ)sin (𝜃) Y = 2sin(ϕ)cos(𝜃)

(15.12)

for the Schmidt net, where

({ ( ∘ -2----2--------2- ) ϕ = arccos x + y + (z − 1) ∕2 ( 𝜃 = arctan2 (x,y)

(15.13)

Stereonets can be used to visualise

1.

geographical data:

Figure 15.10: a) Wulff and b) Schmidt projection of the African continent. The Wulff net shows Africa in its right shape, and the Schmidt net shows it at the right size. No two dimensional projection can achieve both goals at once.

2.

palaeomagnetic data, such as the following dataset of 10 declination (= azimuth) and inclination (= dip) measurements:

declination	47.9	46.3	44.7	50.9	56.4	42.6	44.9	41.5	47.9	39.6
inclination	28.6	20.1	15.6	18.1	17.5	28.7	12.2	24.5	20.6	15.0

Figure 15.11: The palaeomagnetic declination (=azimuth) and inclination (=dip) of 10 samples shown in a Schmidt equal area diagram.

3.

structural measurements, such as this set of 10 strikes and dips on a fault plane:

strike	226	220	223	222	233	227	234	229	227	224
dip	28.4	35.3	41.0	39.6	48.3	34.7	34.5	36.0	34.2	28.7

Figure 15.12: 10 fault plane measurements shown on a Wulff stereonet. The white circles mark the ‘pole’ of the planar measurements, i.e. a perpendicular line to the plane. The circular segments mark the intersection of the planes with bottom half of a sphere, shown in an equal angle projection.

The statistical analysis of spherical data is similar to that of the circular data in Sections 15.1 and 15.2. The von Mises – Fisher (vMF) distribution generalises the von-Mises distribution of Equation 15.6. In three dimensions, the density function for the vMF distribution is given by:

κ-exp[κ(xμx-+-yμy-+-zμz)] f({x,y,z}|{μx,μy,μy},κ ) = 2π(exp[κ]− exp[− κ ])

(15.14)

where μ = {μ_x,μ_y,μ_z} is a unit vector representing the mean of the distribution in Cartesian coordinates, and κ is the concentration parameter. μ and κ are unknown but can be estimated from the data using the vector sum. To average a collection of n spherical measurements:

{ n n n } {¯x,¯y,¯z} = ∑ xi-,∑ -yi ,∑ -zi nR¯ n ¯R nR¯ i=1 i=1 i=1

(15.15)

where {x_i,y_i,z_i} are the Cartesian coordinates computed using Equation 15.8, 15.9 or 15.10, and R is the concentration parameter:

┌│ (------)2---(------)2---(-----)2- 1 │∘ ∑n ∑n ∑n ¯R = n- xi + yi + zi i=1 i=1 i=1

(15.16)

which is related to the (approximate) value for κ by Equation 15.7 (with p = 2).

Then the average latitude and longitude are given by:

{ ¯L = arctan2(− ¯x,¯z) ¯l = arcsin (¯y)

(15.17)

the average strike and dip:

{ ¯ S = arctan2 (− ¯x,¯y) ¯D = arcsin(¯z)

(15.18)

and the average azimuth and dip:

{ ¯D = arctan2(¯y,¯x) ¯A = arcsin (¯z)

(15.19)

Applying this procedure to the palaeomagnetic and structural datasets:

Figure 15.13: a) the palaeomagnetic data of Figure 15.11 shown in grey, with their vector mean marked as a black square. b) the fault data of Figure 15.12 in grey with the vector mean as a black square and great circle, respectively.

It is important to make a distinction between directions such as ‘north’ or ‘east’ and orientations such as ‘north–south’ and ‘east-west’. The glacial striations of Section 15.1 are a clear example of directional data: on average, the striations point towards the north, suggesting a southerly ice source. In contrast, consider the following dataset of 20 pebble orientations on a beach:

32	33	34	37	40	41	42	43	45	53
210	212	214	217	220	221	222	223	226	230

Unlike the striations, there are not one but two ways to measure the pebble orientations. For example, an angle of 45^∘ degrees is equivalent to an angle of 225^∘. However, the vector mean (Equation 15.2) assumes that they are distinct angles. This can produce nonsensical results:

Figure 15.14: 20 pebbles in a sedimentary bed are aligned in a NE–SW direction. Half of the measurements were recorded as northeasterly and the other half as southwesterly. Both of angles are equivalent. The white circle marks the vector mean, which points in a southeasterly direction.

A better way to average orientational data is to minimise the moment of inertia. Suppose that the circle of Figure 15.14 is a ring, and that the tick marks have a physical weight attached to them. If we were to spin this ring around the y-axis, the it would wobble significantly. The wobble can be eliminated by changing the spin axis to the direction that minimises the moment of inertia of the spinning ring.

Figure 15.15: The diagonal line tilts at an angle of 40^∘. It marks the direction that minimises the moment of inertia of a circle with point masses attached to the angular measurements marked by the radial ticks. If the circle and the point masses were spun around this line, then they would do so without wobbling. The same is true for the perpendicular direction, which maximises the moment of inertia.

The moment of inertia can be minimised by calculating the Cartesian coordinates of the angular data, and calculating their covariance matrix:

⌊ n n ⌋ [ 2 ] 1 ∑ (xi − 0)2 1 ∑ (xi − 0)(yi − 0) Σ = sx sx,y = || ni=1 n i=1 || sy,x s2y ⌈ 1 n∑ (y − 0)(x − 0) 1 ∑n(y − 0)2 ⌉ ni=1 i i n i=1 i ⌊ n∑ ∑n ⌋ 1| x2i xiyi| ∑n [ ][ ] = --|⌈ in∑=1 i=∑1n |⌉ = xi xi yi n xiyi y2i i=1 yi i=1 i=1

(15.20)

For the pebble data:

[ ] Σ = 0.62 0.51 0.51 0.44

Subjecting their covariance matrix to an eigendecomposition (see Equation 12.4) yields two eigenvalues and eigenvectors:

[ ][ ][ ] Σ = QΛQ − 1 = − 0.77 0.64 0.989 0 − 0.77 − 0.64 − 0.64 − 0.77 0 0.011 0.64 − 0.77

The first eigenvector defines a direction of 𝜃 = atan(−0.64∕ − 0.77) = 0.70 = 40^∘ (see Figure 15.15). It accounts for 98.9% of the data variance, as indicated by the first eigenvalue.

The same procedure can be used to average orientational data on a sphere. For example, consider a structural dataset for a subhorizontal bed. Suppose that some measured surfaces dip gently to the north and others gently to the south:

strike	77.1	94.9	73.2	122.8	97.3	254.7	280.4	285.6	282.5	264.0
dip	5.6	1.9	10.1	7.8	3.2	0.0	8.5	2.3	3.4	6.7

Calculating the vector mean with Equation 15.15 yields a nonsensical (subvertical!) result:

Figure 15.16: Open circles and grey lines shows 20 strike and dip measurements of a subhorizontal bed. The vector mean of these measurements is shown as a filled square and black line. It is perpendicular to the bed, which is clearly wrong. The large open square marks the direction that minimises the moment of inertia of the poles. This is a much more sensible orientational average.

The moment of inertia can be minimised for spherical data by generalising Equation 15.20 from two to three dimensions and subjecting the resulting covariance matrix to an eigendecomposition.

“everything is related to everything else, but near things are more related than distant things.” – Waldo R. Tobler (1969)
Geostatistics is a branch of statistics that deals explicitly with data distributed in space and aims at explicitly modelling that spatial relationship between the data. Originally developed in the context of mining geology, the field has applications in petroleum geology, environmental science, oceanography, hydrology, pedology, ecology, epidemiology, and other fields. In this chapter we will use an environmental dataset to introduce some fundamental geostatistical concepts:

# x y Zn z = ln[Zn]
1 181072 333611 1022 6.93
2 181025 333558 1141 7.04
3 181165 333537 640 6.46

50 180199 331591 375 5.93

155 180627 330190 375 5.93

Table 16.1: Locations and zinc concentrations of 155 samples of topsoil, collected in a flood plain of the river Meuse, near the village of Stein (Netherlands). x and y are the Easting and Northing (as metres), respectively, in Rijksdriehoek (RDH, Netherlands topographical) map coordinates. The column labelled ‘Zn’ contains the zinc concentration (in ppm) from composite samples of approximately 15 × 15m areas.

Now consider a position {x = 179850,y = 331650} where no measurements was taken. The question that this chapter addresses is: can we use the surrounding data to estimate the Zn-concentration at this new location? Or, bearing in mind the idiosyncrasies of compositional data that were discussed in chapter 14: can we estimate the logarithm¹ of the Zr-concentration (z in Table 16.1) at this new location?

Figure 16.1: Location map of the Meuse data. The question mark is located at a position {x = 179850,y = 331650} where no measurement was taken. This chapter will use a geostatistical technique called kriging to estimate the (logarithm of) the zinc concentration at this location from the surrounding measurements.

16.1 The semivariogram

To assess the spatial correlation of the data, we begin by measuring the inter-sample distances between the 155 samples in the Meuse dataset:

1 2 3 4 5 6 7 8 9 10 … 155
1 0 71 120 260 370 470 260 250 380 470 … 3400
2 0 140 280 360 470 230 200 330 440 … 3400
3 0 140 250 360 170 220 320 380 … 3400
4 0 150 240 180 300 350 320 … 3400
5 0 110 150 280 270 180 … 3200
6 0 250 380 330 180 … 3200
7 0 140 170 210 … 3200
8 0 140 280 … 3200
9 0 180 … 3100
10 0 … 3000

155 0

where only the upper triangle is shown because the lower triangle contains exactly the same values. Next, we compute the differences between the (logged) Zn-concentrations of the samples, thus producing a second 155 × 155 matrix:

1 2 3 4 5 6 7 8 9 10 … 155
1 0
2 0.11 0
3 0.47 0.58 0
4 1.4 1.5 0.91 0
5 1.3 1.4 0.87 0.046 0
6 1.3 1.4 0.82 0.089 0.044 0
7 1.1 1.2 0.62 0.30 0.25 0.21 0
8 0.92 1.0 0.46 0.46 0.41 0.37 0.16 0
9 1.1 1.2 0.61 0.30 0.25 0.21 0.0029 0.16 0
10 1.7 1.8 1.3 0.34 0.39 0.43 0.64 0.80 0.64 0

155 1.0 1.1 0.53 0.38 0.33 0.29 0.080 0.079 0.078 0.72 … 0

where we show only the lower triangle because the upper triangle contains exactly the same values. Merging the two tables and selecting all the pairs of samples that are within a distance of 150 m from each other:

     1    2    3    4    5    6    7    8    9    10    …    155
1      0    71    120    260    370    470   260    250    380    470   …    3400
2      0.11   0    140   280    360    470   230    200    330    440   …    3400
3      0.47    0.58    0    140    250    360   170    220    320    380   …    3400
4      1.4    1.5    0.91   0    150    240   180    300    350    320   …    3400
5      1.3    1.4    0.87   0.046    0    110    150    280    270    180   …    3200
6      1.3    1.4    0.82   0.89    0.044   0    250    380    330    180   …    3200
7      1.1    1.2    0.62   0.30    0.25    0.21   0    140    170    210   …    3200
8      0.92   1.0    0.46   0.46    0.41    0.37   0.16    0    140    280   …    3200
9      1.1    1.2    0.61   0.30    0.25    0.21   0.0029   0.16    0    180   …    3100
10      1.7    1.8    1.3    0.34    0.39    0.43   0.64    0.80    0.64    0    …    3000

155     1.0    1.1    0.53   0.38    0.33    0.29   0.080    0.079   0.078   0.72   …    0


Table 16.2: Composite data table with the inter-sample distances shown above the diagonal and the log-contrasts of the corresponding zinc-concentrations shown below the diagonal. Boxed values mark sample pairs that are located within 150 metres of each other.

We can then use the boxed values below the diagonal to define the sample semivariance as:

$n∑(h) 2 γ[h] = (zi[k] −-zj[k])- k 2 n(h)$ (16.1)

where n(h) is the number of sample pairs that fall within a distance (or ‘lag’) h of each other, and {z_i[k],z_j[k]} are the ln(Zn) values for the k^th sample pair. For the data shown in Table 16.2, there are 166 sample pairs that fall within a distance of 150 m of each other (of which only 9 are shown due to space constraints), and the sample semivariance equals:

$0.112 +-0.472 +-0.582 +-... γ[0 < h ≤ 150] = 2× 166 = 0.10$

We can repeat this calculation for all sample pairs that are within a distance of 150 to 300 metres of each other:

     1    2    3    4    5    6    7    8    9    10    …    155
1      0    71    120   260    370    470   260    250    380    470   …    3400
2      0.11   0    140   280    360    470   230    200    330    440   …    3400
3      0.47   0.58   0    140    250    360   170    220    320    380   …    3400
4      1.4    1.5    0.91   0    150    240   180    300    350    320   …    3400
5      1.3    1.4    0.87    0.046   0    110   150    280    270    180    …    3200
6      1.3    1.4    0.82   0.089   0.044   0    250    380    330    180   …    3200
7      1.1    1.2    0.62   0.30    0.25    0.21   0    140    170    210   …    3200
8      0.92   1.0    0.46   0.46    0.41    0.37   0.16    0    140    280   …    3200
9      1.1    1.2    0.61   0.30    0.25    0.21   0.0029   0.16    0    180   …    3100
10      1.7    1.8    1.3    0.34    0.39    0.43   0.64    0.80    0.64    0    …    3000

155     1.0    1.1    0.53   0.38    0.33    0.29   0.080    0.079   0.078   0.72   …    0


Table 16.3: The same composite matrix of Table 16.2, with boxed values marking inter-sample distances of 150 to 300 metres (above the diagonal) and their corresponding log-contrast (below the diagonal).

There are 530 sample pairs that fall within a 150–300 m distance of each other, resulting in a semivariance of:

$1.42 + 1.12 + 0.922 + 1.52 + ... γ[150 < h ≤ 300] = ---------------------------- = 0.28 2× 530$

which is greater than γ[0 < h ≤ 150]. Continuing these calculations for progressively increasing values of h produces an empirical semivariogram:

Figure 16.2: Empirical semivariogram of the Meuse data. The semivariance is low at first and then gradually rises with increasing lag before levelling off to a plateau. Because the semivariance is inversely proportional to the degree of spatial correlation, the semivariogram tells us that there is relatively little variability among samples that are located close together, and more variability among samples that are far apart. In other words, the semivariogram formalises Tobler’s First Law of Geography, which was quoted at the beginning of this chapter.

16.2 Semivariogram models

Now that we have quantified the degree of spatial variability or, equivalently, the degree of spatial correlation, the next step in the kriging method is to parameterise the semivariogram. There are several equations to do this, such as the spherical model:

$( ( [ ]3) |{ γ[h] =cn + (cs − cn) 3h- − 1- h- if h ≤ cr 2cr 2 cr |( γ[h] =cs if h > cr$ (16.2)

the exponential model:

$[ ] γ[h] = cs + (cn − cs)exp −-h cr$ (16.3)

and the Gaussian model:

$[ 2] γ[h] = cs + (cn − cs)exp − h c2r$ (16.4)

Figure 16.3: Parametric semivariogram fits for the Meuse data, using the a) spherical, b) exponential, and c) Gaussian models.
Equations 16.3–16.2 contain three parameters:

1.
The range (c_r) is the distance within which samples are spatially correlated, and beyond which they are independent from each other.
2.
The sill (c_s) quantifies the long range background variability of the data.
3.
The nugget (c_n) is the excess variance at zero distance. In principle samples that are co-located should have the same value, but in practice this is not always the case and the semivariogram may have a non-zero intercept. This nugget effect may be caused by measurement errors, or by processes occurring at scales smaller than the sample.

Figure 16.4: Illustration of the range, sill and nugget of a spherical semivariogram model. The same three parameters also control the shape of the exponential and Gaussian models. In the case of the Meuse dataset, the nugget effect is likely caused by the fact that the zinc concentrations were averaged over a 15×15 metre area, and that the variability within this sampling area is not captured by the semivariogram.

16.3 Kriging interpolation

The kriging method is named after the South Afrian mining geologist Danie Gerhardus Krige, who laid the foundations for the method in his Master’s thesis. Given the best analytical fit to the semivariogram, the kriging algorithm interpolates the data between the measurements as follows:

$n z (x ,y ) = ∑ w z(x,y ) ∘ ∘ i i i i=1$ (16.5)

where, for the Meuse dataset, z(x,y) is the (logged) zinc concentration at some location {x,y}; {x_i,y_i} are the coordinates of the i^th site (out of n); and {x_∘,y_∘} is the location of the new site for which we want to estimate the zinc concentration. w_i is a weighting parameter that depends on the distance between locations {x_i,y_i} and {x_∘,y_∘} and on the spatial correlation at that distance, which is captured by the analytical fit to the semivariogram. The weights can be estimated with the following matrix expression:

$⌊ ⌋ ⌊ 0 γ[h1,2] ... γ[h1,n ] 1⌋− 1⌊ ⌋ w1 |γ [h2,1] 0 ... γ[h2,n ] 1| γ [h1,∘] ||w2 || || . . . . .|| ||γ [h2,∘]|| || .. || = || .. .. .. .. ..|| || .. || |⌈ . |⌉ ||γ[hn,1] γ[hn,2] ... 0 1|| |⌈ . |⌉ wn ⌈ 1 1 ... 1 0⌉ γ [hn,∘] λ 1$ (16.6)

where λ is a Lagrange multiplier that is used to ensure that the weights add up to one; and h_i,j is the distance between points i and j:

$∘ --------------------- hi,j = (xi − xj)2 + (yi − yj)2$ (16.7)

Fitting a spherical variogram to the Meuse dataset yields a nugget c_n = 0.035, a sill c_s = 0.64 and a range c_r = 892 metres. Fitting a kriging model to location {x_∘ = 179850,y_∘ = 331650} of Figure 16.1 then yields the following results:

$⌊ ⌋ ⌊ ⌋ ⌊ 0 0.11 0.15 ...0.64 ...0.64 1⌋ −1⌊ ⌋ w1 0.0022 | | 0.64 || w2 || || − 0.00030|| || 0.11 0 0.18 ...0.64 ...0.64 1|| || 0.64|| | w3 | | 0.00022 | | 0.150.18 0 ...0.64 ...0.64 1| | 0.64| || . || || . || || ... ... ... ... ... ... ... ...|| || . || || .. || = || .. || = || 0.640.64 0.64 ... 0 ...0.64 1|| || .. || || w50|| || − 0.028 || || . . . . . . . .|| || 0.37|| || .. || || .. || || .. .. .. .. .. .. .. ..|| || .. || | . | | . | || 0.640.64 0.64 ...0.64 ... 0 1|| | . | ⌈w155⌉ ⌈ 0.0019 ⌉ ⌈ 1 1 1 ... 1 ... 1 0⌉ ⌈ 0.64⌉ λ 0.0015 1$ (16.8)

in which we can recognise the sill (c_s = 0.64) for all the sample pairs that are separated by a distance of more than c_r = 892 metres. Given the 155 (logged) zinc concentration measurements of Table 16.1, the logged zinc concentration of location {x_∘ = 179850,y_∘ = 331650} is given by

$zx∘,y∘ = 0.0022×6.93 − 0.00030×7.04+0.00022 ×6.46+... − 0.028 ×5.93+...+0.0019 ×5.93 = 5.01$

which corresponds to a zinc concentration of exp[5.01] = 150 ppm. Repeating the same exercise for a grid of points spanning the entire sampling area:

Figure 16.5: The kriging interpolation results for the Meuse dataset shown as a filled contour plot, with the measurements of Figure 16.1 added as points of comparison. The extent of the contour plot is limited to the immediate vicinity of the measurements because kriging is an interpolation method and not an extrapolation method.

The uncertainty (variance) of the kriging estimate is given by

$2 ∑n s (x∘,y∘) = λ + γ [hi,∘]wi i=1$ (16.9)

where λ is the Lagrange multiplier of Equation 16.6. For location {x_∘ = 179850,y_∘ = 331650}:

$2 s (x∘,y∘) = 0.0015+ 0.64 × 0.0022 − 0.64× 0.00030 + ...= 0.22$

Using the error propagation formula for the natural logarithm (Equation 8.15), this tells us that the relative uncertainty (i.e. the coefficient of variation) of the new measurement is s[z]∕z = = 47%. Repeating this exercise for the entire grid of {x,y}-values:

Figure 16.6: Contour plot of the relative uncertainty (coefficient of variation) of the kriging interpolation estimates of Figure 16.5. Note how the uncertainties are the greatest near the edges of the dataset, confirming again that kriging is not an extrapolation but an interpolation method.

The kriging algorithm described in this chapter is known as ordinary kriging. Other, more sophisticated variants of the kriging method exist that relax some of the simplifying assumptions that are made by ordinary kriging. For example:

1.
Ordinary kriging assumes that the data are stationary, and have a constant mean and variance. Minor violations of this assumption can be corrected by detrending the data.
2.
The semivariogram models of Section 16.2 assumed that the spatial correlation was isotropic. This means that the semivariance was only a function of the lag and not of the direction. This assumption may not always be appropriate. For example, valleys in a fold-and-thrust belt may be preferentially oriented perpendicular to the principal stress direction. To accommodate such situations, anisotropic semivariogram models can be created that require additional parameters.
3.
Using Equation 16.5, the kriging interpolation for a new location involved the values of all n measurements. This is fine for the relatively small Meuse dataset (n = 155), but for much larger datasets, the matrix inversion of Equation 16.6 becomes computationally too demanding. This problem can be solved by interpolationg using only measurements that are reasonably close to the new sample.

Chapter 17
An introduction to R

R is an increasingly popular programming language that is available free of charge on any operating system at https://r-project.org. A number of different graphical user interfaces (GUIs) are available for R, the most popular of which are RGui, RStudio, RCommander and Tinn-R. For this tutorial, however, the simple command line console suffices. This chapter introduces all the functions that are needed to solve the exercises in Chapter 18. A more complete listing of base R functions is provided at https://cran.r-project.org/doc/contrib/Short-refcard.pdf.

17.1 The basics
1.
We will start this tutorial from the R command prompt, which is the window that begins with a > symbol. First, do some arithmetic:
> 1 + 1 
[1] 2
R prints the result to the command prompt (2 in this case). Here are some other arithmetic operations:
> sqrt(2) 
[1] 1.414214 
> exp(log(10)) 
[1] 10 
> 13%%5 
[1] 3
2.
An arrow operator is used to assign a value to a variable. Note that the arrow can point both ways:
> foo <- 2 
> 4 -> bar 
> foo <- foo*bar 
> foo 
[1] 8
3.
Create a vector of numbers:
> myvec <- c(2,4,6,8) 
> myvec*2 
[1]  4  8 12 16
Query the third value of the vector:
> myvec[3] 
[1] 6
Change the third value of the vector:
> myvec[3] <- 100
Change the second and the third value of the vector:
> myvec[c(2,3)] <- c(100,101)
Create a sequence of numbers:
> seq(from=1,to=10,by=1) 
 [1]  1  2  3  4  5  6  7  8  9 10
Equivalently (output omitted for brevity):
> seq(1,10,1) 
> seq(1,10) 
> seq(to=10,by=1,from=1) 
> seq(to=10) 
> 1:10
Create a 10-element vector of twos:
> rep(2,10) 
 [1] 2 2 2 2 2 2 2 2 2 2
4.
Create a 2 × 4 matrix of ones:
> mymat <- matrix(1,nrow=2,ncol=4)
Change the third value in the first row of mymat to 3:
> mymat[1,3] <- 3
Change the entire second column of mymat to 2:
> mymat[,2] <- 2
Remove the first column from mymat:
> mymat[,-1] 
     [,1] [,2] [,3] 
[1,]    2    3    1 
[2,]    2    1    1
Give names to the rows:
> rownames(mymat) <- c('first','second')
Use the names:
> mymat['first',] 
[1] 1 2 3 1
The transpose of mymat:
> t(mymat) 
     first second 
[1,]    1    1 
[2,]    2    2 
[3,]    3    1 
[4,]    1    1
Element-wise multiplication (*) vs. matrix multiplication (%*%):
> mymat*mymat 
       [,1] [,2] [,3] [,4] 
first     1    4    9    1 
second    1    4    1    1 
> p <- mymat %*% t(mymat) 
> p 
       first second 
first     15      9 
second     9      7
The inverse and determinant of a square matrix:
> invp <- solve(p) 
> det(invp %*% p) 
[1] 1
5.
Lists are used to store more complex data objects:
> mylist <- list(v=myvec,m=mymat,nine=9) 
> mylist$v 
[1] 2 4 6 8
or, equivalently:
> mylist[[1]] 
> mylist[['v']]
Data frames are list-like tables:
> myframe <- data.frame(period=c('Cz','Mz','Pz','PC'), 
+                       SrSr=c(0.708,0.707,0.709,0.708), 
+                       fossils=c(TRUE,TRUE,TRUE,FALSE)) 
> myframe 
  period  SrSr fossils 
1     Cz 0.708    TRUE 
2     Mz 0.707    TRUE 
3     Pz 0.709    TRUE 
4     PC 0.708   FALSE
You can access the items in myframe either like a list or like a matrix:
> myframe$period == myframe[,'period'] 
[1] TRUE TRUE TRUE TRUE
6.
Save data to a text (.csv) file:
> write.csv(myframe,file='timescale.csv',row.names=FALSE)
Read data from a .csv file:
> myframe2 <- read.csv(file='timescale.csv',header=TRUE)
Type myframe2 at the command prompt to verify that the contents of this new variable match those of myframe.
7.
Plot the first against the second row of mymat:
> plot(x=mymat[1,],y=mymat[2,])
Draw lines between the points shown on the existing plot:
> lines(mymat[1,],mymat[2,])
Create a new plot with red lines but no points and a 1:1 aspect ratio for the X- and Y-axis:
> plot(mymat[1,],mymat[2,],type='l',col='red',asp=1)
Save the currently active plot as a vector-editable .pdf file:
> dev.copy2pdf(file="trigonometry.pdf")
8.
If you want to learn more about a function, type ‘help’ or ‘?’:
> help(c) 
> ?plot
9.
You can also define your own functions:
> cube <- function(n){ 
+     return(n^3) 
+ }
Using the newly created function:
> cube(2) 
[1] 8 
> result <- cube(3)
10.
Collect the following commands in a file called ‘myscript.R’. Note that the following text box does not contain any ‘>’-symbols because it is not entered at the command prompt but in a separate text editor:
1# the 'print' function is needed to show intermediate 
2# results when running commands from a .R file 
3print(pi)
You can run this code by going back to the command prompt (hence the ‘>’ in the next box) and typing:
> source("myscript.R") 
[1] 3.141593
Note that everything that follows the ‘#’-symbol was ignored by R.
11.
Conditional statements. Replace the contents of myscript.R with:
1toss <- function(){ 
2    r <- runif(1) # create a random number between 0 and 1 
3    if (r<0.5){ 
4        print("head") 
5    } else { 
6        print("tail") 
7    } 
8}
Save and run at the command prompt:
> source('myscript.R') 
> toss() 
[1] "head"
(you might, of course, get "tail" when you run this)
12.
Loops. Add the following function to myscript.R:
1fibonnaci <- function(n=5){ # 5 is the default value 
2    if (n < 3) { stop('n must be at least 3') } 
3    # seed the output vector with 0 and 1: 
4    s <- c(0,1) 
5    # loop through all numbers from 3 to n: 
6    for (i in 3:n){ 
7        s[i] <- s[i-1] + s[i-2] 
8    } 
9    return(s) 
10}
Save and run at the command prompt to calculate the first n numbers in the Fibonnaci series:
> source('myscript.R') 
> fibonnaci() 
 [1] 0 1 1 2 3 
> fibonnaci(10) 
 [1]  0  1  1  2  3  5  8 13 21 34
13.
Arguably the most powerful feature of R is the availability of thousands of community generated packages that provide additional functionality. One of these packages is called geostats and was specifically created to accompany these notes. To install this package:
> install.packages('geostats')
Once installed, the package can be loaded into memory by entering:
1library(geostats)
Let’s use geostats to produce a Sierpinski carpet, as will be discussed in Chapter 11:
2s <- sierpinski() 
3image(s,axes=FALSE)
Type ?sierpinski for further details. To view the source code of this function, just type sierpinski at the command prompt:
> sierpinski 
function (n = 5){ 
    X <- matrix(1, 1, 1) 
    for (i in 1:n) { 
        Y <- cbind(X, X, X) 
        Z <- cbind(X, 0 * X, X) 
        X <- rbind(Y, Z, Y) 
    } 
    return(X) 
}
17.2 Plotting data

In the remainder of this text, we will assume that the geostats package has been loaded into memory:
library(geostats)
1.
The Anscombe quartet of Table 2.1 and Figure 2.1 is built into R. You can have a look at it by typing anscombe at the command prompt. We can then create Figure 2.1:
1op <- par(mfrow=c(1,4)) # op = old parameter values 
2plot(anscombe$x1,anscombe$y1) 
3plot(anscombe$x2,anscombe$y2) 
4plot(anscombe$x3,anscombe$y3) 
5plot(anscombe$x4,anscombe$y4) 
6par(op)                 # restore old parameter values
where par(mfrow=c(1,4)) creates a 1 × 4 grid of plot panels. Note that we can also write this more generically:
1np <- 4 # np = 'number of panels' 
2op <- par(mfrow=c(1,np)) 
3for (i in 1:np){ 
4  plot(anscombe[,i],anscombe[,i+np]) 
5} 
6par(op)
Or, adding a few options to make the plot look exactly like Figure 2.1:
3titles <- c('I','II','III','IV') 
4for (i in 1:np){ 
5  plot(anscombe[,i],anscombe[,i+np], 
6       xlab='x',ylab='y',pch=19,main=titles[i]) 
7} 
8par(op)
2.
The geostats package includes a number of datasets, which are automatically attached to the workspace when the geostats package is loaded into memory. These datasets can also be accessed outside the package using the data() function. For example:
data(catchments,package='geostats')
Entering ‘catchments’ at the console reveals the contents of a data frame containing the river catchment data of Table 2.2.
3.
Using the catchments data to create a bar chart of lithologies (Figure 2.2.a):
1counts <- table(catchments$lithology) 
2barplot(counts)
4.
Plotting the pH data of Figure 2.3.b on a histogram and rug plot:
1pH <- catchments$pH 
2hist(pH) 
3rug(pH)
Changing the number of bins:
2op <- par(mfrow=c(1,2)) 
3hist(pH,breaks=5) 
4hist(pH,breaks=10) 
5par(op)
Specifying the position of the bins:
3hist(pH,breaks=seq(from=3,to=7,by=0.5)) 
4hist(pH,breaks=seq(from=3.25,to=6.75,by=0.5))
5.
A kernel density estimate (KDE) and rug plot of the pH data (Section 2.1 and Figure 2.9):
2dens <- density(pH) 
3plot(dens) 
4rug(pH)
A KDE of the log-transformed catchment Ca/Mg data (Figure 2.12):
1lc <- log(catchments$CaMg) 
2dens <- density(lc) 
3plot(dens)
The log-transformed KDE is not as easy to relate to as the raw Ca/Mg values. geostats implements a function called exp() to map the log-transformed KDE back to a linear scale:
3plot(exp(dens)) 
4rug(catchments$CaMg)
Subjecting the vegetation data to a logistic transformation before plotting as a KDE:
1veg <- catchments$vegetation/100 
2lv <- logit(veg) 
3ld <- density(lv) 
4plot(ld)
where the logit() function is provided by the geostats package. To map the density estimate from the logistic scale (−∞,+∞) back to the normal vegetation scale (0,1):
3d <- logit(ld,inverse=TRUE) 
4plot(d)
Note that we are using the logit() twice using different inputs. In programming jargon, the function has been overloaded. To inspect the R-code of the two implementations, just type geostats:::logit.default and geostats:::logit.density at the command prompt.
6.
The Old Faithful geyser data of section 2.5 are included with R. Plotting the eruption durations and waiting times proceeds in exactly the same way as point 5 above:
1x <- faithful[,'waiting'] 
2y <- faithful[,'eruptions'] 
3op <- par(mfrow=c(2,1)) 
4plot(density(x),xlab='minutes',main='waiting time') 
5rug(x) 
6plot(density(y),xlab='minutes',main='eruption duration') 
7rug(y) 
8par(op)
where we have nested the density and plot functions for the sake of brevity. Two-dimensional KDEs are not part of base R. To access this functionality, we must first load the important MASS (‘Mathematical and Applied Statistics with S¹ ’) package.
3library(MASS) 
4kde2 <- kde2d(x,y) 
5contour(kde2) 
6points(x,y)
7.
Calculate the empirical cumulative distribution function of the pH data:
1cdf <- ecdf(pH) 
2plot(cdf)
Adding some optional arguments to produce an output that is more similar to Figure 2.16a:
2plot(cdf,verticals=TRUE,pch=NA)
where pch=NA removes the plot characters, and verticals=TRUE is self explanatory.
The ecdf function produces another function that can be evaluated at any value. For example, if we want to evaluate the fraction of pH values that are less than 4.5:
> cdf(4.5) 
[1] 0.25
which means that there are 25% such values.
17.3 Summary Statistics
1.
Calculating summary statistics in R is straightforward:
> pH <- catchments$pH # to save some typing 
> mean(pH) 
[1] 5 
> sd(pH)              # standard deviation 
[1] 0.6657011 
> median(pH) 
[1] 5.1 
> mad(pH,constant=1)  # median absolute deviation 
[1] 0.5 
> IQR(pH,type=1)      # interquartile range 
[1] 1
2.
R does not come with a function to calculate the mode. However, the geostats package comes to the rescue by providing the Mode function² . For categorical data, the mode is the most frequently occurring value. Using the stratigraphic age data of Figure 2.2.b as an example:
> Mode(catchments$age,categorical=TRUE) 
[1] "Mesozoic"
For continuous variables, in which there are no duplicate values, we use a KDE to determine the mode:
> Mode(catchments$pH,categorical=FALSE) 
[1] 5.372393
Skewness is not implemented in base R either. But it is easy to write a function for it, based on Equation 3.5:
1skew <- function(x){ 
2  mean((x-mean(x))^3)/sd(x)^3 
3}
which can be used as follows:
> skew(catchments$CaMg) 
[1] 1.212305
3.
A box plot for the Ca/Mg data:
1boxplot(catchments$CaMg)
17.4 Probability
1.
The factorial operator (!) in Chapter 4 is implemented as factorial(x). For example:
> factorial(1) 
[1] 1 
> factorial(10) 
[1] 3628800 
> factorial(100) 
[1] 9.332622e+157 
> factorial(1000) 
[1] Inf
factorial(x) fails to calculate !1000. For large numbers like this, it is better to use lfactorial(x), which returns the natural logarithm of the factorial:
> lfactorial(1000) 
[1] 5912.128
which means that 1000! = e^5912.128.
2.
Similarly, the combinations of small numbers can be calculated with choose(n,k):
> choose(n=10,k=2) 
[1] 45 
> choose(n=10000,k=2000) 
[1] Inf
and for large numbers with lchoose(n,k):
> lchoose(n=10000,k=2000) 
[1] 4999.416
which means that there are e^4999.416 ways to choose 2,000 items from a collection of 10,000.
17.5 The binomial distribution
1.
Flip 10 coins and count the number of heads:
> rbinom(n=1,size=10,prob=0.5)
Repeat 50 times and plot the outcomes as a histogram:
1d <- rbinom(n=50,size=10,prob=0.5) 
2hist(d)
2.
Calculate the probability of observing 4 heads out of 10 throws:
> dbinom(x=4,size=10,prob=0.5) 
[1] 0.2050781
Plot the probability mass function (PMF) of the binomial distribution (Equation 5.1) with n = 10 and p = 0.5:
1k <- 0:10 
2pmf <- dbinom(x=k,size=10,prob=0.5) 
3barplot(height=pmf,names.arg=k)
where names.arg specifies the labels of the bar plot.
3.
The probability of observing 4 or fewer heads out of 10 throws:
> pbinom(q=4,size=10,prob=0.5) 
[1] 0.3769531
Plot the cumulative distribution function (CDF) of the binomial distribution (Equation 5.2) with n = 10 and p = 0.5:
1cdf <- pbinom(q=0:10,size=10,prob=0.5) 
2plot(cdf,type='s')
4.
Calculate the quantiles of the binomial distribution. For example, assume that there is a p = 2∕3 chance of finding gold in claim, and suppose that there are n = 15 claims. Then there is a 95% chance that the number of successful claims is less than
> qbinom(p=0.95,size=15,prob=2/3) 
[1] 13
where the argument p must not be confused with the parameter p in Equation 5.1. The latter parameter is referred to as prob in R’s binomial functions.
Conversely, if p = 2∕3, then there is a 95% chance that the number of successful gold discoveries among 15 claims is greater than or equal to
> qbinom(p=0.95,size=15,prob=2/3,lower.tail=FALSE) 
[1] 7
or, equivalently:
> qbinom(p=0.05,size=15,prob=2/3) 
[1] 7
Thus the rejection region for the one-sided null hypothesis H₀ : p = 2∕3 vs. the alternative hypothesis H_A : p > 2∕3 is R = {0,…,6} (Equation 5.5).
5.
The boundaries of the rejection region for the two-sided null hypothesis H₀ : p = 2∕3 vs. the alternative hypothesis H_A : p≠2∕3 are given by
> qbinom(p=c(0.025,0.975),size=15,prob=2/3) 
[1] 6 13
Hence R = {0,…,5,14,15} (Equation 5.6).
6.
Based on the 1-sided rejection region calculated under point 4, a success rate of 6 gold discoveries out of 15 claims is incompatible with the null hypothesis H₀ : p = 2∕3 vs. the one-sided alternative hypothesis H_A : p < 2∕3. This is because 6 ∈ R = {0,…,6}. Here is another way to obtain the same result:
> binom.test(x=6,n=15,p=2/3,alternative='less',conf.level = 0.95) 
 
        Exact binomial test 
 
data:  6 and 15 
number of successes = 6, number of trials = 15, p-value = 0.03083 
alternative hypothesis: true probability of success is less than 0.666667 
95 percent confidence interval: 
 0.0000000 0.6404348 
sample estimates: 
probability of success 
                   0.4
This result shows a p-value of 0.03083 < 0.05, leading to the rejection of H₀. The 95% confidence interval for p spans the range from 0 to 0.6404348, which does not include the hypothesised value of p = 2∕3.
7.
Based on the 2-sided rejection region calculated under point 5, a success rate of 6 gold discoveries out of 15 claims is compatible with the null hypothesis H₀ : p = 2∕3 vs. the one-sided alternative hypothesis H_A : p≠2∕3. This is because 6R = {0,…,5,14,15}. Hence we cannot reject H₀ in this case. Here is another way to obtain the same result:
> h <- binom.test(x=6,n=15,p=2/3,alternative='two.sided',conf.level=0.95) 
> h$p.value 
[1] 0.05023902 
> h$conf.int 
[1] 0.1633643 0.6771302 
attr(,"conf.level") 
[1] 0.95
where we have stored the output of binom.test() in a variable h. H₀ cannot be rejected because the p-value is 0.05024 > 0.05, and the confidence interval ranges from 0.1633643 to 0.6771302, which includes p = 2∕3.
17.6 The Poisson distribution
1.
Create a histogram of 100 random values from a Poisson distribution with parameter λ = 3.5:
1d <- rpois(n=100,lambda=3.5) 
2hist(d)
2.
Calculate the probability of observing 0 successes if λ = 3.5:
> dpois(x=0,lambda=3.5) 
[1] 0.03019738
Plot the probability mass function (PMF, evaluated up to 15 successes) of the Poisson distribution (Equation 6.1) for λ = 3.5.
1k <- 0:15 
2pmf <- dpois(x=k,lambda=3.5) 
3barplot(height=pmf,names.arg=k)
3.
The probability of observing 9 or fewer successes if λ = 3.5:
> ppois(q=9, lambda=3.5) 
[1] 0.9966851
Plot the CDF of the Poisson distribution with λ = 3.5:
1cdf <- ppois(q=k,lambda=3.5) 
2plot(cdf,type='s')
4.
The rejection region (α = 0.05) for a one-sided hypothesis H₀ : λ = 3.5 vs. H_A : λ > 3.5 (Section 6.3):
> qpois(p=0.95,lambda=3.5) 
[1] 7
Hence R = {8,…,∞} (see Section 6.3.6).
5.
The two-sided rejection region:
> qpois(p=c(0.025,0.975),lambda=3.5) 
[1] 0 8
Hence R = {9,…,∞}.
6.
Based on the 1-sided rejection region calculated under point 4, a success rate of 9 zircons per grid is incompatible with the null hypothesis H₀ : λ = 3.5 vs. the one-sided alternative hypothesis H_A : λ > 3.5. This is because 9 ∈ R = {9,…,∞}. Here is another way to obtain the same result:
> h <- poisson.test(x=9,r=3.5,alternative='greater',conf.level=0.95) 
> h$p.value 
[1] 0.009873658
The p-value is 0.009873658 < 0.05, leading to a rejection of H₀. The 95% confidence interval is given by:
> h$conf.int 
[1] 4.695228      Inf
(λ = 3.5)[4.695228,∞). Hence, H₀ is rejected.
7.
Based on the 2-sided rejection region calculated under point 5, a success rate of 9 is incompatible with the null hypothesis H₀ : λ = 3.5 vs. the one-sided alternative hypothesis H_A : λ≠3.5. This is because 9 ∈ R = {9,…,∞}. This again leads to rejection of H₀ in favour of H_A. We can obtain the same result with poisson.test:
> poisson.test(x=9,r=3.5,alternative='two.sided',conf.level=0.95) 
 
        Exact Poisson test 
 
data:  9 time base: 1 
number of events = 9, time base = 1, p-value = 0.009874 
alternative hypothesis: true event rate is not equal to 3.5 
95 percent confidence interval: 
  4.115373 17.084803 
sample estimates: 
event rate 
         9
17.7 The normal distribution
1.
Generate 100 random numbers from a normal distribution with mean μ = 50 and standard deviation σ = 5, and plot as a histogram:
1d <- rnorm(n=100,mean=50,sd=5) 
2hist(d)
2.
Generate 200 random pairs of numbers ({x_i,y_i} for 1 ≤ i ≤ 200) from a bivariate normal distribution with mean {μ_x = 10,μ_y = 20} and covariance matrix
$[ σ2x = 2 σx,y = − 3 ] σx,y = σ = − 3 σ2 = 6 x,y y$
1library(MASS) 
2m <- c(10,20) 
3s <- matrix(data=c(2,-3,-3,6),nrow=2,ncol=2) 
4xy <- mvrnorm(n=200,mu=m,Sigma=s) 
5plot(xy)
3.
Plot the PDF and CDF of a normal distribution with mean μ = 50 and standard deviation σ = 5:
1op <- par(mfrow=c(1,2)) 
2m <- 50 
3s <- 5 
4x <- seq(from=25,to=75,length.out=100) 
5f <- dnorm(x=x,mean=m,sd=s) 
6plot(x=x,y=f,type='l',main='PDF') 
7P <- pnorm(q=x,mean=m,sd=s) 
8plot(x=x,y=P,type='l',ylab='P(X<x)',main='CDF') 
9par(op)
17.8 Error propagation

Propagating analytical uncertainties using the procedures of Section 8.2 is a manual process that does not require R. However, R does fulfil a useful purpose for the Fisher Information approach of Section 8.4. In Section 8.4, we manually showed that s[] = . This Section will show how R can do the same thing numerically.
1.
Recall the log-likelihood function for the Poisson distribution (Equation 6.4):
$∑k ℒ ℒ(λ|k) = k ln [λ]− λ − i i=1$

Implementing this in R:
1LL <- function(lambda,k){ 
2  k * log(lambda) - lambda - sum(1:k) 
3}
2.
Evaluating LL for different values of lambda assuming that k=4.
4N <- 100 
5lam <- seq(from=0,to=20,length.out=N) 
6loglik <- rep(0,N) 
7for (i in 1:N){ 
8  loglik[i] <- LL(lambda=lam[i],k=4) 
9} 
10plot(lam,loglik,type='l',xlab=expression(lambda),ylab='LL')
which produces the following output:

Figure 17.1: Log-likelihood function for the Poisson distribution, evaluated at different values for the parameter λ, given an observation of k = 4 successes. The function reaches a maximum value at = 4.
3.
We can find the maximum likelihood estimate for λ using R’s general purpose optim function³ , using par=1 as an initial guess for λ:
11o <-  optim(par=1,f=LL,k=4,control=list(fnscale=-1),method='BFGS')
The control parameter fnscale=-1 ensures that optim finds the maximum value of LL rather than its minimum (which is the default). And BFGS is one of several minimisation algorithms that is well suited for one-dimensional optimisation problems. See ?optim for further details. The maximum likelihood estimate is then obtained by:
> o$par 
[1] 4.000001
which yields a value of = 4, ignoring a small numerical error.
4.
To estimate the uncertainty of using the Fisher Information approach of Equation 8.26 requires just a small change to our code:
11o <-  optim(par=1,f=LL,k=4,control=list(fnscale=-1), 
12            method='BFGS',hessian=TRUE)
Equation 8.26 then becomes:
> -1/o$hessian 
         [,1] 
[1,] 4.000001
So s[]² = 4, which is the same result as we derived by hand in Equation 8.29.
17.9 Comparing distributions
1.
Create a Q-Q plot comparing the Old Faithful eruption durations with the eruption waiting times:
1qqplot(faithful[,'eruptions'],faithful[,'waiting'])
Create a Q-Q plot comparing the eruption durations with a normal distribution (Figure 9.2.a):
1dat <- faithful[,'eruptions'] 
2qqnorm(dat) 
3qqline(dat)
2.
Perform the one-sided, one-sample t-test of Section 9.2 (page 109):
1gold1 <- c(19.09,19.17,19.31,19.07,19.18) 
2h <- t.test(x=gold1,mu=19.30,alternative='less')
The p-value of this test is
> h$p.value 
[1] 0.01630814
which is less than 0.05, leading to a rejection of H₀ : μ = 19.30. Equivalently, the 95% confidence interval is
> h$conf.int 
[1]     -Inf 19.25435
μ = 19.30(−∞,19.25435], which again leads to a rejected null hypothesis.
3.
Comparing two sets of coins with a two-sided, two-sample test:
2gold2 <- c(19.30,19.33,19.15,19.32) 
3h <- t.test(x=gold1,y=gold2,var.equal=TRUE)
Inspecting the p-value and two-sided 95% confidence interval:
> h$p.value 
[1] 0.1099177 
> h$conf.int 
[1] -0.25441458  0.03241458
We cannot reject H₀ : μ₁ = μ₂ because 0.1099177 > 0.05, and because {−0.25441458 ≤ 0 ≤ 0.03241458}.
4.
To carry out the χ²-test of Section 9.5, we simply supply a contingency table of petrographic observations to the chisq.test() function:
1petrography <- rbind(sampleA=c(Q=10,P=5,A=6,L=20), 
2                     sampleB=c(Q=25,P=12,A=10,L=35)) 
3h <- chisq.test(x=petrography)
where rbind binds the two named rows together into one matrix. Querying the results of the test at the console:
> h 
        Pearson's Chi-squared test 
 
data:  petrography 
X-squared = 0.82706, df = 3, p-value = 0.843
0.843 > 0.05, hence H₀ is not rejected. Note that the χ²-statistic is not exactly the same as value that was calculated by hand in Section 9.5. This is due to rounding errors of the expected counts (E_i in Equation 9.7). Here are the exact values:
> h$expected 
               Q         P         A        L 
sampleA 11.66667  5.666667  5.333333 18.33333 
sampleB 23.33333 11.333333 10.666667 36.66667
5.
Comparing the two collections of gold coins from item 3 above but this time using a Wilcoxon test instead of a t-test:
1gold1 <- c(19.09,19.17,19.31,19.07,19.18) 
2gold2 <- c(19.30,19.33,19.15,19.32) 
3h <- wilcox.test(x=gold1,y=gold2)
Querying the p-value:
> h$p.value 
[1] 0.1904762
We cannot reject H₀ because 0.1904762 > 0.05.
6.
Extract two detrital zircon U–Pb age distributions for the Mu Us desert and Yellow River from geostats’ built-in DZ dataset, and compare them with the Kolmogorov-Smirnov test:
1river <- DZ[['Y']] 
2dune <- DZ[['5']] 
3h <- ks.test(x=river,y=dune)
Querying the result:
> h$p.value 
[1] 4.500822e-05
4.500822 × 10⁻⁵ ≪ 0.05 and hence H₀ is clearly rejected.
17.10 Regression
1.
geostats contains a Rb–Sr dataset called rbsr. Plot the columns RbSr and SrSr of this dataset as a scatter plot:
1plot(x=rbsr$RbSr,y=rbsr$SrSr)
Calculate the correlation coefficient of RbSr and SrSr:
2r <- cor(x=rbsr$RbSr,y=rbsr$SrSr)
which yields:
> r 
[1] 0.9847415 
> r^2 
[1] 0.9697158
2.
Compute the t-statistic for the correlation coefficient using Equation 10.6:
3n <- nrow(rbsr) 
4tstat <- r*sqrt(n-2)/sqrt(1-r^2)
Query the value at the console and compare it to the rejection region for a two-sided t-test:
> tstat 
[1] 13.86085 
> qt(c(0.025,0.975),df=n-2) 
[1] -2.446912  2.446912
The observed value for the t-statistic falls in the rejection region (13.86085 > 2.446912). Thus H₀ is rejected and the linear trend is real. Alternatively, we can also perform the correlation test using R’s cor.test() function:
> h <- cor.test(x=rbsr$RbSr,y=rbsr$SrSr) 
> h$p.value 
[1] 8.779983e-06
which is less than α = 0.05, again leading to the rejection of H₀.
3.
Having demonstrated the statistical significance of the correlation, we can now fit a line to the data using the method of least squares:
2fit <- lm(SrSr ~ RbSr, data=rbsr)
which uses R’s formula notation (Y ~ X where X is the independent variable and Y is the dependent variable)⁴ .
The output of lm() can be queried like a list to retrieve the slope and intercept:
> fit$coefficients 
(Intercept)        RbSr 
 0.69742660  0.01391808
The best fit line is given by [⁸⁷Sr/⁸⁶Sr] = 0.696 + 0.014 [⁸⁷Rb/⁸⁶Sr] (rounded to two significant digits).
4.
Predict the ⁸⁷Sr/⁸⁶Sr-value for ⁸⁷Rb/⁸⁶Sr = 10, say:
> fit$coefficients[1] + fit$coefficients[2]*10 
(Intercept) 
  0.8366074
Equivalently:
> predict(fit, newdata=data.frame(RbSr=10)) 
        1 
0.8366074
5.
Add the best fit line to the existing scatter plot:
3abline(fit)
Construct a 95% confidence envelope for the linear fit:
4x <- seq(from=min(rbsr$RbSr),to=max(rbsr$RbSr),length.out=20) 
5pred <- predict(fit,newdata=data.frame(RbSr=x), 
6                interval="confidence",level=0.95) 
7matlines(x,pred,lty=1,col='black')
where matlines simultaneously plots multiple lines. Adding a prediction interval to the existing plot:
8pred <- predict(fit,newdata=data.frame(RbSr=x), 
9                interval="prediction",level=0.95) 
10matlines(x,pred,lty=2,col='black')
6.
The rbsr file contains five columns, specifying the ratios as well as their uncertainties and error correlations:
> rbsr 
  RbSr errRbSr  SrSr errSrSr   rho 
1 2.90  0.0944 0.745 0.00702 0.586 
2 7.14  0.0970 0.803 0.00625 0.470 
3 9.10  0.1040 0.823 0.00740 0.476 
4 3.41  0.1040 0.737 0.00697 0.468 
5 1.91  0.0967 0.720 0.00676 0.486 
6 7.15  0.1110 0.793 0.00749 0.507 
7 5.92  0.0948 0.789 0.00632 0.561 
8 8.28  0.1070 0.807 0.00678 0.460
Plotting these data as error ellipses and implementing the weighted least squares regression algorithm of Section 10.5 is not trivial in base R. Fortunately, the geostats package comes to the rescue:
1yfit <- york(dat=rbsr)
The name of the weighted least squares regression refers to geophysicist Derek York, who developed an early version of the algorithm⁵ .
17.11 Fractals and chaos
1.
Calculate and plot the size-frequency distribution of Finnish lakes (Figure 11.5):
1sf <- sizefrequency(Finland$area) 
2plot(frequency~size,data=sf,log='xy') 
3fit <- lm(log(frequency)~log(size),data=sf) 
4lines(x=sf$size,y=exp(predict(fit)))
where the sizefrequency() function is provided by geostats.
2.
Create a Gutenberg-Richter plot for the recent earthquake data (Figure 11.3):
1gutenberg(earthquakes$mag)
where gutenberg() is a geostats function that is similar to the Finnish lake code of step 1. You can check out the implementation details by typing ‘gutenberg’ at the command prompt.
3.
Create a Koch snowflake, i.e. a triangle of three Koch curves (Figures 11.13-11.16):
1k <- koch(n=5)
Compute the fractal dimension of the curve:
2fit <- fractaldim(k)
which yield a slope of -1.2 and, hence, a fractal dimension of 1.2.
17.12 Unsupervised learning
1.
The geostats package includes a function called PCA2D() that can be used to reproduce Figures 12.2–12.4:
1X <- cbind(a=c(0,17,19),b=c(9,0,6)) 
2PCA2D(X)
2.
Verifying the equivalence of PCA and classical MDS:
3d <- dist(X)           # create a Euclidean distance matrix 
4conf <- cmdscale(d)    # classical MDS 
5plot(conf,type='n')    # create an empty plot 
6text(conf,labels=1:3)  # add text labels to the empty plot
This script produces the same output as the first panel of PCA2D.
3.
R contains not one but two built-in PCA functions: prcomp and princomp. Both produce essentially the same output but use different algebraic algorithms⁶ . Applying prcomp to the US arrests data to produce the biplot shown in Figure 12.5:
1pc <- prcomp(USArrests,scale=TRUE) 
2biplot(pc)
4.
Reproducing the MDS analysis of European road distances of Figure 12.6:
1conf <- cmdscale(eurodist) 
2plot(conf,type='n',asp=1) 
3text(conf,labels=labels(eurodist))
Repeating the same exercise using nonmetric MDS:
1library(MASS) 
2mds <- isoMDS(eurodist) 
3conf <- mds$points 
4plot(conf,type='n',asp=1) 
5text(conf,labels=labels(eurodist))
Flip the y-axis to make the MDS configuration look more like the map of Europe:
4ylim <- rev(range(conf[,2])) # reverse the minimum and maximum values 
5plot(conf,type='n',asp=1,ylim=ylim) # change the y-axis limits
Assess the goodness of fit on a Shepard plot:
4sh <- Shepard(d=eurodist,x=conf) 
5stress <- signif(mds$stress,2) 
6plot(sh,main=paste0('stress=',stress))
where signif(x,2) rounds x to 2 significant digits.
5.
Visualise the iris data in a 4 × 4 grid of scatter plots (Figure 12.14):
1measurements <- iris[,-5] 
2species <- iris[,5] 
3plot(measurements,pch=as.numeric(species))
where as.numeric converts the species to numbers, which are subsequently used as colours. Classify the data into three groups using the k-means algorithm:
4fit <- kmeans(measurements,centers=3)
Compare the classes to the known species of the flowers:
> table(fit$cluster,species) 
    setosa versicolor virginica 
  1      0          2        36 
  2      0         48        14 
  3     50          0         0
6.
Hierarchical clustering of the iris data:
1tree <- hclust(dist(measurements)) 
2plot(tree)
Cutting the tree down to three branches:
1treecut <- cutree(tree,k=3) 
2table(treecut,species)
17.13 Supervised learning
1.
Discriminant analysis is implemented in the MASS package:
1library(MASS) 
2ld <- lda(Species ~ ., data=iris)
Predict the species of a new flower with a sepal length of 6.0 cm, a sepal width of 3.0 cm, a petal length of 5.0 cm and a petal width of 1.5 cm:
3newflower <- data.frame(Sepal.Length=6.0,Sepal.Width=3.0, 
4                        Petal.Length=5.0,Petal.Width=1.5) 
5pred <- predict(ld,newdata=newflower)
Query the posterior likelihoods of the LDA classification:
> pred$posterior 
        setosa versicolor virginica 
1 3.207192e-27  0.8098087 0.1901913
which means that there is an 81% chance that the new flower is versicolor and a 19% chance that it is virginica.
Hence the predicted species of the new flower is:
> pred$class 
[1] versicolor
2.
To apply quadratic discriminant analysis, we simply replace lda by qda in the previous code:
2qd <- qda(Species ~ ., data=iris)
which produces the following outcome:
> predict(qd,newdata=newflower)$posterior 
         setosa versicolor virginica 
1 2.147589e-103  0.7583223 0.2416777
Thus, according to QDA, there is a 76% probability that the new flower is versicolor, and a 24% chance that it is virginica.
3.
Decision trees are implemented in the rpart package:
1library(rpart) 
2tree <- rpart(Species ~ ., data=iris, method="class") 
3plot(tree) 
4text(tree)
To add the misclassification rates to the tree:
4text(tree, use.n=TRUE)
Using the newflower data frame that we created in step 1, the class probabilities are:
> predict(object=tree,newdata=newflower) 
  setosa versicolor  virginica 
1      0  0.9074074 0.09259259
which leads to the following classification:
> predict(object=tree,newdata=newflower,type='class') 
         1 
versicolor
17.14 Compositional data
1.
Transform the A–CN–K data of Section 14.2 from the ternary simplex to bivariate logratio space using the alr transformation:
1uv <- alr(ACNK) 
2plot(uv)
where the alr() function is provided by the geostats package (type alr at the command prompt to see the code). Computing the mean and covariance matrix of the logratio data:
3mu <- colMeans(uv) 
4covmat <- cov(uv)
Add the mean to the logratio plot as a black square and the confidence ellipse as a polygon:
5points(x=mu[1],y=mu[2],pch=22,bg='black') 
6ell <- ellipse(mu,covmat) 
7polygon(ell)
2.
Plot the A–CN–K data on a ternary diagram:
1ternary(ACNK,labels=c(expression('Al'[2]*'O'[3]), 
2                      expression('CaO+Na'[2]*'O'), 
3                      expression('K'[2]*'O')))
where the expression function allows R to use subscripts and special characters in text labels. Mapping the logratio mean and error ellipse back to the ternary diagram:
4ternary(alr(mu,inverse=TRUE),add=TRUE,type='p',pch=22,bg='black') 
5ternary(alr(ell,inverse=TRUE),add=TRUE,type='l')
3.
Apply PCA to the major element compositions of Table 14.1:
1comp <- clr(major) 
2pc <- prcomp(comp) 
3biplot(pc)
4.
Apply LDA to the AFM data of Figure 14.10:
1library(MASS) 
2affinity <- FAM[,1] 
3comp <- alr(FAM[,-1]) 
4ld <- lda(x=comp,grouping=affinity)
where the lda function does not use formula notation (Section 17.13) but an alternative format (see ?lda for further details). Classify a new rock with 1 wt% FeO, 8 wt% Na₂O+K₂O, and 0.1 wt% MgO:
5newrock <- data.frame(F=1,A=8,M=0.1) 
6newcomp <- alr(newrock) 
7pr <- predict(object=ld,newdata=newcomp)
This produces:
> pr$posterior 
            ca          th 
[1,] 0.9931941 0.006805909
which suggests that the new sample is a calc-alkaline basalt.
17.15 Directional data

Base R includes little or no functionality for directional data analysis. So this tutorial will heavily depend on the geostats package.
1.
Plot the glacial striations data on a circle:
1circle.plot(striations,degrees=TRUE)
where the degrees option accounts for the fact that the striations data are expressed in degrees. Change to degrees=FALSE for radians. Add the arithmetic mean and vector mean to the plot as a white and black square, respectively:
2circle.points(mean(striations),pch=22,bg='white',degrees=TRUE) 
3md <- meanangle(striations,degrees=TRUE) 
4circle.points(md,pch=21,bg='black',degrees=TRUE)
2.
Compute the concentration parameter R:
1R <- Rbar(striations,degrees=TRUE)
Converting to the von Mises parameter κ using Equation 15.7:
2k <- Rbar2kappa(R)
Evaluating the von Mises distribution over the 0 – 360^∘ interval:
3a <- seq(from=0,to=360,length.out=100) 
4d <- vonMises(a,mu=md,kappa=k,degrees=TRUE) 
5plot(a,d,type='l') 
6rug(striations)
3.
To plot azimuths and dips on a Wulff stereonet:
1stereonet(trd=palaeomag$decl,plg=palaeomag$incl, 
2          option=1,degrees=TRUE,show.grid=FALSE)
For strikes and dips, use the option=2 argument:
1stereonet(trd=fault$strike,plg=fault$dip, 
2          option=2,degrees=TRUE,show.grid=FALSE)
To plot geographical coordinates (e.g, 30^∘ longitude and -20^∘ latitude) on a Schmidt stereonet:
1stereonet(trd=30,plg=-20,degrees=TRUE, 
2          option=3,show.grid=TRUE,wulff=FALSE)
To add five cones of 10 degree radius to the existing plot (as in Figure 15.9.b):
3stereonet(trd=c(0,0,90,180,270),plg=c(90,10,10,10,10), 
4          coneAngle=rep(10,5),option=4,degrees=TRUE,add=TRUE,wulff=FALSE)
4.
To compute the average azimuth and dip and add it to a Wulff net:
1A <- palaeomag$decl 
2D <- palaeomag$incl 
3meanpalaeomag <- meanangle(trd=A,plg=D,option=1,degrees=TRUE) 
4stereonet(trd=A,plg=D,degrees=TRUE,show.grid=FALSE,pch=16,col='grey') 
5stereonet(trd=meanpalaeomag[1],plg=meanpalaeomag[2], 
6          option=1,degrees=TRUE,add=TRUE,pch=16,col='black')
To average strike and dip measurements:
1S <- fault$strike 
2D <- fault$dip 
3meanfault <- meanangle(trd=S,plg=D,option=2,degrees=TRUE) 
4stereonet(trd=S,plg=D,option=2,degrees=TRUE, 
5          show.grid=FALSE,pch=16,col='grey') 
6stereonet(trd=meanfault[1],plg=meanfault[2], 
7          option=2,degrees=TRUE,add=TRUE,pch=16,col='black')
5.
To compute the average orientation of the pebble measurements:
1circle.plot(pebbles,degrees=TRUE) 
2m <- meanangle(pebbles,degrees=TRUE,orientation=TRUE) 
3circle.points(m,degrees=TRUE,pch=16,cex=2)
17.16 Spatial data

geostats implements a simple version of ordinary kriging. A far more complete set of geostatistical functions can be found in the gstat package.
1.
Load the Meuse dataset, define three shorthand variables, and plot them on a map:
1X <- meuse$x  # Easting 
2Y <- meuse$y  # Northing 
3Z <- log(meuse$zinc) 
4colourplot(X=X,Y=Y,Z=Z,key.title=title('ln[Zn]'))
2.
Plot the empirical semivariogram and fit a spherical semivariogram model to it:
4svm <- semivariogram(x=X,y=Y,z=Z)
Query the fit parameters:
> svm$snr 
        sill       nugget        range 
  0.63756843   0.03512741 892.01855323
To fit an exponential semivariogram model:
4svm_exp <- semivariogram(x=X,y=Y,z=Z,model='exponential')
The third option for the model argument is 'gaussian'.
3.
Estimating the ln[Zn]-value for the new location {x_∘ = 179850,y_∘ = 331650}:
> kriging(x=X,y=Y,z=Z,svm=svm,xi=179850,yi=331650) 
[1] 5.011715
and its variance:
> kriging(x=X,y=Y,z=Z,svm=svm,xi=179850,yi=331650,err=TRUE) 
[1] 0.2215261
4.
Evaluate the ln[Zn]-values for a grid of locations and plot as a contour map:
5xi <- seq(from=min(X),to=max(X),length.out=50) 
6yi <- seq(from=min(Y),to=max(Y),length.out=50) 
7zi <- kriging(x=X,y=Y,z=Z,svm=svm,xi=xi,yi=yi,grid=TRUE) 
8colourplot(x=xi,y=yi,z=zi,key.title=title('ln[Zn]'))
And to visualise the uncertainties of the kriging estimates, replace the last two lines with:
7zi <- kriging(x=X,y=Y,z=Z,svm=svm,xi=xi,yi=yi,grid=TRUE,err=TRUE) 
8colourplot(x=xi,y=yi,z=sqrt(zi),key.title=title('s[Zn]/Zn'))
Chapter 18
Exercises

18.1 The basics

1.
Plot the sine and cosine functions from 0 to 2π.
2.
Write a function to plot an ellipse: ${ x = a cos(α )cos(β)− bsin(α)sin(β) where 0 ≤ β ≤ 2π y = a sin(α) cos(β) + bcos(α)sin(β)$
and α is the rotation angle of the ellipse.
3.
Write a function that takes two numbers as input and tells the user whether these are multiples of each other or not.
4.
Write a function to print the following number triangle:
1
2 2
3 3 3
4 4 4 4
5 5 5 5 5

down to any value n

18.2 Plotting data

1.
Using geostats’ countQuakes function (section 17.1.13), plot the declustered earthquakes of magnitude 4.5 and greater from 2000 to 2016 as a bar plot, and as a histogram.
2.
Generate two samples (A and B, say) of 100 random numbers between 0 and 1; calculate the ratios A∕B and B∕A; and create a 2×2 figure with KDEs and rug plots of A∕B, B∕A, ln(A∕B) and ln(B∕A).
3.
Create a bivariate (x, y) dataset of 1000 random uniform numbers where −1 ≤ x ≤ +1 and 2 ≤ y ≤ 22. Construct a 2-dimensional KDE for these data.
4.
Plot the ECDFs of the x and y values of the previous exercise. What fraction of the x-values is less than 0? What fraction of the y-values is less than 7? And less than 17?

18.3 Summary statistics

1.
Calculate the means and variances of the Anscombe quartet, and store them in a 2 × 8 matrix.
2.
Generate n = 10 random numbers between 0 and 1. Calculate their mean. Repeat 100 times and store the mean values in a 100-element vector. Compute the mean and standard deviation of this vector. Repeat for n = 100,1000 and 10000.
3.
Generate 1000 random numbers between 0 and 200. Count the number of values that are less than 1. Repeat 500 times to fill a vector of counts. Compute the mean and variance of this vector.
4.
Generate two samples (A and B) of 100 random numbers between 0 and 1, and calculate their logratio ln(A∕B). Repeat 10 times and visualise the results as a box plot.

18.4 Probability

1.
The International Geo Sample Number (IGSN) is an alphanumeric code that is used to identify geological rock specimens in the scientific literature. It consists of one to five letters to identify the owner of the sample, followed by four characters (letters or numbers) to identify the sample itself. Examples are PVERM1234 and UCL001B. How many samples can each owner register? How many possible IGSNs are there in total?
2.
20 students are taking part in a mapping exercise. How many ways are there to divide them into 4 distinct groups of 5?
3.
A thoroughly mixed conglomerate contains 30% andesite clasts, 20% basalt clasts, and 50% carbonate clasts. How many randomly selected clasts do we need to pick to be 95% certain that we have collected at least one clast of each lithology?
4.
95% of iron ore deposits are characterised by magnetic anomalies, and so are 98% of chromium deposits and 1% of other rocks. 0.1% of all rocks contain iron ore, and 0.05% of rocks contain chromium ore. Suppose that we have found a magnetic anomaly. What is the probability that this is caused by an ore deposit?

18.5 The binomial distribution

1.
In palynology, the ratio of arboreal to non-arboreal pollen is widely used as an index of landscape openness. This ratio is estimated by counting a representative number of randomly selected pollen from a soil sample. Suppose that we have counted 20 pollen and counted the number of arboreal or non-arboreal species among them. Further suppose that the true arboreal/non-arboreal pollen ratio in the soil is 4. What is probability that the arboreal/non-arboreal ratio of the 20 counts is 9 or greater?
2.
We believe that 50% of all dinosaur fossils are female (and 50% are male). A bone bed contains 50 dinosaur fossils among which 32 are female (and 18 are male). Do we have enough evidence to prove that the proportion of males and females is different? Do our data support the notion that females are more common than males?
3.
Draw 50 random numbers from a binomial distribution with p = 0.2 and n = 100. For each of these values, perform a two-sided test against the null hypothesis that p = 0.2. Do all the values pass the test?
4.
Given A = 12 arboreal pollen and N = 8 non-arboreal pollen, compute a 95% confidence interval for the A∕N-ratio of the soil.

18.6 The Poisson distribution

1.
On average a magnitude ≥ 9 earthquake occurs once every 13.5 years. What is the probability that two such earthquakes will happen next year? What is the probability that none will happen in the next century?
2.
The fission track method is a geochronological technique that is based on the spontaneous fission of ²³⁸U in accessory minerals such as apatite. Spontaneous fission occurs with a probability of 8.46×10⁻¹⁷ per atom of ²³⁸U per year. This is known as the decay constant. Suppose that a 1 million year old apatite crystal contains 1 pmol of ²³⁸U (= 6.022×10¹¹ atoms). What is the probability that it contains 50 or fewer fission tracks? What is the probability that it contains 41 to 60 tracks?
3.
The Orange River in South Africa is famous for its alluvial diamonds. In order for mining operations to be profitable, placer deposits must contain at least 2 diamonds per tonne. A diamond mining company has sieved 10 tonnes of sand and found 12 diamonds. Assuming that the diamonds are randomly distributed in the sand, should the company cease operations? Or would it be premature to do so and should they acquire some more data first?
4.
At another site, a preliminary mining survey has yielded 30 diamonds in 10 tonnes. Construct a 95% confidence interval for the diamond yield in this deposit.

18.7 The normal distribution

1.
Generate 100 random numbers from a Poisson distribution with λ = 1.2 and take their sum. Repeat this 200 times and plot the results as a histogram. Calculate the mean and the standard deviation of this synthetic dataset. What percentage of the values falls within two standard deviations from the mean?
2.
The -symbol on packaging guarantees consumers that they receive at least the amount of product that they paid for. For example, 0.75 L means that at most 1.5% of the containers with this label are allowed to contain less that 0.75 L of fluid. Suppose that the amount of wine that is dispensed by a bottle filling machine follows a normal distribution with a standard deviation of 2 ml. On average, by how much do we have to overfill the bottles to ensure that our wine receives the -label? Work out the solution by hand using Table 20.1.
3.
The height of men follows a normal distribution with a mean of 175 cm and a standard deviation of 7.7 cm. The height of women follows a normal distribution with a mean of 160 cm and the same standard deviation. Using Table 20.1, determine the approximate percentage of men who are shorter than the tallest 10% of women.
4.
Using a normal approximation to the Poisson distribution, calculate the probability of k < 62 when λ = 64.
5.
Consider two bivariate normal distributions with mean vectors μ_X and μ_Y, and covariance matrices Σ_X and Σ_Y, respectively: $[ ] [ ] [ ] [ ] μ = 1 , Σ = 1 − 1 , μ = 3 and Σ = 2 2 X 2 X − 1 3 Y 4 Y 2 7$
Which outcome is most likely:
$[ ] [ ] X = 2 or Y = 5 ? 1 7$

18.8 Error propagation

In the following questions, the uncertainties are assumed to be independent (i.e. covariances are zero).

1.
The height difference between Ordnance Survey benchmark A and a second point B is 25 ± 0.25 m, and the height difference between B and a third point C is 50 ± 0.5 m. Calculate the height difference between A and C and propagate its uncertainty.
2.
Consider the following two rock specimens:

specimen mass (g) σ(mass) volume (cm³) σ(volume)
A 105 2 36 0.15
B 30 2.5 10 0.4

(a)
Compute the densities of rocks A and B.
(b)
Propagate their respective uncertainties.
(c)
Construct approximate 95% confidence intervals for the two densities.
(d)
Is there a significant difference in density between samples A and B?

3.
Sand may contain a variety of minerals with different densities. For example, zircon has a density of ρ_z = 4.85 ± 0.10 g/cm³ (1σ), whereas quartz has a density of ρ_q = 2.65 ± 0.05 g/cm³ (1σ). The settling velocity of sand grains in water depends on two factors: density and grain size. Dense grains settle more quickly than light ones, and large grains settle more quickly than small ones. When a balance is reached between these two factors, small zircons settle at the same rate as large quartz grains. This results in a size shift (SS)¹ between zircon and quartz in beach and river sands, which can be predicted from their respective densities: $( ) --1- ρz-−-1 SS = 0.69 ln ρq − 1$
Calculate SS and propagate its uncertainty.
4.
We measured the following ⁴⁰K and ⁴⁰Ar concentrations:

⁴⁰K (×10⁻¹⁰ mol/g) 2,093 2,105 2,055 2,099 2,030
⁴⁰Ar (×10⁻¹⁰ mol/g) 6.015 6.010 6.030 6.005 6.020 6.018

(a)
Calculate the mean ⁴⁰K and ⁴⁰Ar concentrations.
(b)
Calculate the standard error of these means.
(c)
Estimate the ⁴⁰Ar/⁴⁰K-ratio and its standard error.
(d)
The age equation for the ⁴⁰K-⁴⁰Ar method is as follows: $[ ( )] 1- λ-- 40Ar- t = λ ln 1+ λe 40K$
with λ = 5.543 × 10⁻⁴ Myr⁻¹ and λ∕λ_e = 9.3284. Calculate the age and its standard error.

18.9 Comparing distributions
1.
We have measured the contents of five randomly selected bottles of wine: $750.1,750.4,749.9,750.3,750.3$

(a)
Can we conclude, at a 95% confidence level, that the mean volume exceeds 750 ml?
(b)
Can we conclude that the mean volume is different than 750 ml?
(c)
Compute a 95% confidence interval for the mean volume of the wine bottles.

Work the solutions out by hand using Table 20.2 and check the results with R.
2.
geostats contains a dataset with foraminifera counts for two surface sediments (A and B) from the Atlantic Ocean:
> data(forams,package='geostats') 
> forams 
  uvula scitula quinqueloba pachyderma incompta glutinata bulloides 
A     9       1          13         15       16        10        10 
B    20      15          35         30       40        20        18
Is there sufficient evidence that the two samples represent different ecosystems, i.e. that the proportions of the different species in both samples is significantly different?
3.
Using Table 20.3, test if the following histogram is consistent with a uniform distribution:

count 20 28 16 12 24
bin 0-20 20-40 40-60 60-80 80-100

4.
A gemologist has ranked 15 diamonds in order of increasing quality. The diamonds belong to two different categories: synthetic (‘S’) and natural (‘N’):

rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
variety N S N N S S S N N N N S S S S

Using Table 20.4 or 20.5, is there a statistically significant difference in quality between the two diamond varieties at a 95% confidence level?
5.
Compare the following two samples, first with a t-test, then with a Wilcoxon test:

A 0 2 4 6 8
B 3 5 7 9 11 13 15

Which of the two tests is more powerful?
6.
The DZ dataset in the geostats package contains a list of 15 detrital zircon U–Pb age distributions. Calculate the Kolmogorov-Smirnov statistic for all possible pairs of samples. Which two samples are the most similar? Which are the least similar? Show the resulting two pairs as two Q-Q plots.
18.10 Regression

1.
trees is one of R’s built-in datasets. Have a look at it by typing plot(trees) at the command prompt. Now calculate the correlation coefficients of all possible combinations of variables. Which variables have the strongest correlation? Determine the corresponding slope and intercept.
2.
geostats contains a dataset called worldpop, which tallies the world’s population from 1750 until 2014. From 1960 onwards, the world population has been growing at a linear rate. Fit a straight line through these values and extrapolate it to today. Then compare your prediction with the value reported on https://www.worldometers.info.
3.
Let x, y and z be three independent normally distributed sets of n = 100 measurements with coefficients of variation σ_x∕μ_x = 0.05, σ_y∕μ_y = 0.01 and σ_z∕μ_z = 0.1, respectively. Form two new variables by taking the ratios X = x∕z and Y = y∕z. What is the expected null correlation for X and Y ? Compare with the actual value.
4.
Consider the following three samples:

i x s[x] y s[y] cov[x_i,y_i]
1 10 1 20 1 0.9
2 20 1 30 1 0.9
3 28 1 42 1 -0.9

Fit a straight line through the (x_i,y_i)-data, first ignoring the uncertainties (s[x_i], s[y_i], cov[x_i,y_i]), and then using error weighted regression.

18.11 Fractals and chaos

1.
Plot the size-frequency distribution of North American rivers using R’s built-in rivers dataset.
2.
Use the boxcount function in geostats to count how many 4 × 4 boxes are needed to cover all the fractures in geostats’ fractures dataset. How many 8×8 boxes are needed? Compute the fractal dimension of the fracture pattern.
3.
Using the last 50 years of the geostats’ declustered dataset, estimate the chance of observing at least one magnitude ≥6 earthquake in the western United States next year. Hint: use a combination of fractal and Poisson distributions.
4.
Use geostats’ pendulum function to repeat the pendulum experiment of Section 11.4. Try different initial positions, velocities, numbers of magnets etc. See ?pendulum for details.

18.12 Unsupervised learning

1.
Visualise R’s built-in iris dataset as a PCA biplot and interpret the results. Hint: remove the Species column from the iris data to get a matrix of four-dimensional data (Section 17.12.5).
2.
Analyse the same iris data by MDS. Check if the results are consistent with the output of the previous exercise.
3.
The kmeans function of Section 17.12.5 returns a 9-element list. This list includes an item called withinss, which contains the sum of squared distances of all the items in each cluster to their respective centres, and another item called tot.withinss, which contains the sum of the withinss values. Plot the value of tot.withinss against the number of clusters (k) for the iris dataset, for 1 ≤ k ≤ 10. The ‘elbow’ in this curve marks the optimal number of clusters.
4.
Use the ksdist function of the geostats package to create a distance matrix of Kolmogorov-Smirnov statistics for the DZ dataset. Then use the output of this function to build a hierarchical clustering tree for those detrital zircon U–Pb ages. Are the results consistent with the MDS configuration of Figure 12.8?

18.13 Supervised learning

1.
The geostats package includes a dataset called training, which contains the SiO₂, TiO₂, Al₂O₃, CaO, MgO, MnO, K₂O and Na₂O concentrations (in wt%) of 227 island arc basalts (IAB), 221 mid oceanic ridge basalts (MORB), and 198 ocean island basalts (OIB). Build an LDA model for the data. What is the misclassification rate of the data?
2.
The geostats package also contains a second dataset of IAB, MORB and OIB compositions called test. Classify these data using the LDA model obtained under question 1. What is the misclassification rate? Compare this to the misclassification rate of the training data.
3.
Build a decision tree for the training data. What is the misclassification rate of the optimal tree?
4.
Analyse the test data with the decision tree of question 3. How does its misclassification rate compare to that of the training data? And to the LDA analysis of questions 1 and 2?

18.14 Compositional data

1.
mtcars is one of R’s built-in datasets. It contains a table with the fuel consumption and 10 aspects of automobile design and performance for 32 automobiles features in a 1974 edition of ‘Motor Trend’ magazine.
(a)
Calculate the average fuel consumption of these 32 vehicles –which are listed in miles per gallon– using the arithmetic mean.
(b)
Convert the data to European units, namely litres per 100km. x miles/gallon = y litres/100km, where: $235.21 y = ------ x$
(c)
Calculate the arithmetic mean fuel consumption in litres/100km.
(d)
Convert the arithmetic mean number of mpg (from step 1a) to units of litres/100km. How does the resulting value compare with that obtained from step 1c?
(e)
Compute the geometric mean fuel consumption in mpg and litres/100km. Then convert the units of these mean values and compare with the result observed under step 1d.

2.
Load geostats’ test data into memory and plot the CaO–K₂O–Na₂O subcomposition on a ternary diagram. Add the arithmetic and logratio mean compositions to the plot. Experiment with the arguments to the ternary function to make the plot look as clear as possible.
3.
Perform compositional PCA on the test data and interpret the results on a biplot. Use the optional col argument of the biplot function to enhance the figure.
4.
Repeat the LDA exercise of Sections 18.13.1 and 2, but using a logratio transformation. How does the transformation affect the misclassification rates?

18.15 Directional data

1.
Consider the following two sets of compass bearings:

A 50 55 40 60 45
B 350 10 5 358

What is the difference of the vector means for A and B?
2.
The earthquakes dataset in the geostats package contains two columns with the longitude and latitude of 20,000 earthquakes. Extract the events that took place between longitudes −90^∘W and +90^∘E and plot them on a Schmidt stereonet.
3.
Consider the following two sets of fault measurements:

1^th fault
2^nd fault

strike 200 195 210 205 20 25 18 17 16
dip 70 68 71 67 10 12 9 14 11

Plot the two sets of faults on a Wulff net and calculate the difference of their mean strike and dip.
4.
Which of geostats’ built-in datasets is more tightly clustered: palaeomag or fault?

18.16 Spatial data

1.
The geostats package contains a dataset called hills, which tabulates 250 topographic measurements for a synthetic landscape comprised of three bivariate normal ‘hills’. Fit an exponential, Gaussian and spherical model to the semivariogram. Subjectively assess which model fits the empirical semivariogram best.
2.
Using the semivariogram model obtained under question 1, estimate the elevation at position {x = 0,y = 0}, and its standard error.
3.
Create a contour plot of the hills dataset.
4.
Using the Meuse data, create a contour map of the difference between kriging models using a spherical and exponential semivariogram model.

Chapter 19
Solutions

In programming there nearly always exist multiple ways to solve a problem. Your code does not have to be exactly the same as the solutions given in this chapter. Occasionally some of these solutions include R functions and optional arguments that have not yet been introduced before. These additional ‘bells and whistles’ are optional, and the code will still work without them. Several exercises depend on the geostats package, which we assume to have been loaded into memory:
library(geostats)
19.1 The basics
1.
Plotting the sine:
1theta <- seq(from=0,to=2*pi,length.out=50) 
2plot(theta,sin(theta),type='l')
Adding the cosine:
3lines(theta,cos(theta),lty=2)
where lty=2 creates a dashed line (see ?par).
2.
Define the function
1ellipse <- function(a=1,b=1,alpha=pi/4){ 
2  beta <- seq(from=0,to=2*pi,length.out=50) 
3  x <- a*cos(alpha)*cos(beta) - b*sin(alpha)*sin(beta) 
4  y <- a*sin(alpha)*cos(beta) + b*cos(alpha)*sin(beta) 
5  plot(x,y,type='l',asp=1) 
6}
Using the new function:
> ellipse() # produces a circle 
> ellipse(a=1,b=2)
3.
Here we need a conditional statement:
1multiples <- function(n,m){ 
2  remainder <- max(n,m) %% min(n,m) 
3  if (remainder == 0){ 
4    decision <- paste(n,'and',m,'are multiples') 
5  } else { 
6    decision <- paste(n,'and',m,'are not multiples') 
7  } 
8  print(decision) 
9}
where the paste function concatenates text strings (see ?paste for details).
4.
This exercise requires a for-loop:
1triangle <- function(n){ 
2  for (i in 1:n){ 
3    print(rep(i,i)) 
4  } 
5}
19.2 Plotting data
1.
The bars produced by barplot correspond to years, whereas those produced by hist correspond to the number of earthquakes per year. The following code shows both diagrams side by side:
1quakesperyear <- countQuakes(declustered,minmag=4.5,from=2000,to=2016) 
2op <- par(mfrow=c(1,2)) 
3barplot(quakesperyear) 
4hist(quakesperyear) 
5par(op)
2.
Write a plotting function to avoid duplicate code:
1rugdensity <- function(dat){ 
2  plot(density(dat)) 
3  rug(dat) 
4}
Then we can use this function to show that the (reciprocal) ratios of random numbers drawn from a symmetric (e.g. uniform) distribution follow a skewed distribution:
5A <- runif(100) 
6B <- runif(100) 
7op <- par(mfrow=c(2,2)) 
8rugdensity(A/B) 
9rugdensity(B/A) 
10rugdensity(log(A/B)) 
11rugdensity(log(B/A)) 
12par(op)
3.
Using the optional min and max arguments of the runif(…) function:
1library(MASS) 
2x <- runif(1000,min=-1,max=1) 
3y <- runif(1000,min=2,max=22) 
4contour(kde2d(x,y)) 
5points(x,y)
4.
Continuing the previous solution:
6op <- par(mfrow=c(1,2)) 
7cdfx <- ecdf(x) 
8plot(cdfx) 
9cdfy <- ecdf(y) 
10plot(cdfy) 
11par(op)
Querying the ECDFs:
> cdfx(0) 
[1] 0.492 
> cdfy(c(7,17)) 
[1] 0.248 0.748
You may get slightly different values due to the randomness of the runif(…) function. But they should hover around ∼ 0.5,∼ 0.25 and ∼ 0.75, respectively.
19.3 Summary statistics
1.
Carrying on from Section 17.2.1, and using a for-loop to avoid duplicate code:
1nc <- ncol(anscombe)                  # number of columns 
2mv <- matrix(0,nrow=2,ncol=nc)        # mv = 'mean-variance' 
3rownames(mv) <- c('mean','variance') 
4colnames(mv) <- colnames(anscombe) 
5for (i in 1:nc){                      # loop through the columns 
6  mv['mean',i] <- mean(anscombe[,i]) 
7  mv['variance',i] <- var(anscombe[,i]) 
8}
Querying the result and rounding to two significant digits:
> signif(mv,2) 
         x1 x2 x3 x4  y1  y2  y3  y4 
mean      9  9  9  9 7.5 7.5 7.5 7.5 
variance 11 11 11 11 4.1 4.1 4.1 4.1
2.
First define a function to generate n random numbers and average them:
1meanofmeans <- function(n=10,N=100){ 
2  m <- rep(0,N)            # initialise 
3  for (i in 1:N){ 
4    m[i] <- mean(runif(n)) # populate 
5  } 
6  c(mean(m),sd(m))         # return the results 
7}
An equivalent but faster solution would be to replace the for-loop with:
3  R <- runif(n*N) 
4  M <- matrix(R,nrow=n,ncol=N) 
5  m <- colMeans(M)
Now apply the meanofmeans function to n = 10,100,1000 and 10000, and store the results in a matrix:
8n <- c(10,100,1000,10000)       # Initialise 
9mom <- matrix(0,length(n),2)    # the 
10colnames(mom) <- c('mean','sd') # results 
11rownames(mom) <- n              # matrix. 
12for (i in 1:length(n)){         # Fill 
13  mom[i,] <- meanofmeans(n[i])  # the 
14}                               # matrix.
Check the result at the command prompt and verify that the mean converges to 0.5, and the standard deviation gets ever smaller with increasing n:
> signif(mom,4) 
        mean       sd 
10    0.4977 0.092320 
100   0.5033 0.025540 
1000  0.5003 0.009316 
10000 0.5002 0.002923
Again, results may vary slightly between runs. The sample size dependency of the mean of means is further investigated in Chapter 8.3.
3.
Filling the vector of 500 counts using a for-loop:
1N <- 500 
2counts <- rep(0,N) 
3for (i in 1:N){ 
4  r <- runif(1000,max=200) 
5  counts[i] <- sum(r < 1) 
6}
Alternatively, we can also solve the problem more quickly without a for-loop:
1n <- 1000 
2N <- 500 
3r <- runif(n*N,max=200) 
4m <- matrix(r,nrow=n,ncol=N) 
5counts <- colSums(m<1)
You should see that the mean approximately equals the variance:
> mean(counts) 
[1] 4.928 
> var(counts) 
[1] 5.01685
This phenomenon is further explored in Chapter 6.
4.
Taking the logratio of two matrices of random numbers:
1n <- 100 
2N <- 10 
3A <- matrix(runif(n*N),nrow=n,ncol=N) 
4B <- matrix(runif(n*N),nrow=n,ncol=N) 
5boxplot(log(A/B))
You should get ten broadly similar looking box plots.
19.4 Probability
1.
An IGSN consists of two parts. The first part consists of one to five letters. There 26 possible one-letter combinations, 26² possible two-letter combinations etc. Therefore, the total number of lab identifiers is ∑ _i=1⁵26ⁱ = 12,356,630. The second part consists of four letters (A–Z) or numbers (0–9). The total number of possibilities for these is 36⁴ = 1,679,616. Therefore, the total number of possible IGSNs is
> sum(26^{1:5}) * 36^4 
[1] 2.075439e+13
2.
There are choices for the first group, for the second, for the third and for the fourth. However this includes 4! duplicate groupings, because there are 4! ways to shuffle the four groups. Hence the total number of combinations is:
> i <- seq(from=5,to=20,by=5) 
> prod(choose(i,5))/factorial(4) 
[1] 488864376
3.
Let A, B and C stand for for andesite, basalt and carbonate, respectively, and let p_A, p_B and p_C be their respective probabilities in the population (i.e. p_A = 0.3, p_B = 0.2 and p_C = 0.5). The probability that none of n clasts are made of andesite is (1 − p_A)ⁿ. So we might think that the probability of missing at least one clast type is:
$p(no A,B or C )=?(1 − p )n + (1 − p )n + (1− p )n A B C$

However this would double-count the instances where two clast types are missing. For example, if a sample of n clasts does not contain any basalt or carbonate, then this means that it contains only andesite. The probability that this happens is p_Aⁿ. Similarly, the probabilility of missing both andesite and carbonate is p_Bⁿ, and the probability of missing both andesite and basalt is p_Cⁿ. Removing these double-counted events:

$n n n n n n p (no A, B or C) = (1− pA ) + (1− pB ) + (1 − pC) − pA − pB − pC$

Implementing this expression in R:
1ABC <- function(n,pA,pB,pC){ 
2  if (n<3){   # samples of fewer than 3 clasts 
3    p <- 0    # are definitely incomplete 
4  } else { 
5    p <- (1-pA)^n + (1-pB)^n + (1-pC)^n - pA^n - pB^n - pC^n 
6  } 
7  return(1-p) 
8}
Evaluating the ABC function at the command line for different values of n:
> ABC(n=10,pA=0.3,pB=0.2,pC=0.5) 
[1] 0.8634136 
> ABC(n=20,pA=0.3,pB=0.2,pC=0.5) 
[1] 0.9876719 
> ABC(n=14,pA=0.3,pB=0.2,pC=0.5) 
[1] 0.9491764
In conclusion, we need to collect at least 14 clasts to have 95% certainty that all lithologies are represented in our sample.
4.
Using Equation 4.12: $p(Fe|anomaly) = -----------------------p(anomaly-|Fe-)p(Fe)------------------------ p(anomaly |Fe)p(Fe)+ p(anomaly |Cr)p(Cr)+ p(anomaly |other)p(other) 0.95× 0.001 = (0.95-×-0.001)+-(0.98×-0.0005)+-(0.01×--0.9985)-= 0.083 = 8.3% p(Cr|anomaly) = -----------------------p(anomaly-|Cr-)p(Cr-)----------------------- p(anomaly |Fe)p(Fe)+ p(anomaly |Cr)p(Cr)+ p(anomaly |other)p(other) -----------------0.98-×-0.0005------------------ = (0.95 × 0.001)+ (0.98× 0.0005)+ (0.01× 0.9985) = 0.043 = 4.3%$
Hence the probability that the magnetic anomaly is caused by an ore deposit is 8.3 + 4.3 = 12.6%.
19.5 The binomial distribution
1.
Let p(A) and p(N) be the probability of counting an arboreal and non-arboreal pollen, respectively (where p(A) + p(N) = 1). Then:
$p(A-) -p-(A-)-- --4-- p(N ) = 4 = 1− p(A ) ⇒ p(A) = 1 + 4 = 0.8$

So if n = 20, then the most likely outcome is 20×0.8 = 16 arboreal and 20×(1 − 0.8) = 4 non-arboreal pollen. Using the same calculation, an arboreal/non-arboreal count ratio of 9 would correspond to a probability p(A) = 9∕10 = 0.9, which would lead to an outcome of 20 × 0.9 = 18 arboreal and 20 × (1 − 0.9) = 2 non-arboreal pollen. The probability of observing an outcome at least as extreme as this is given by:
> 1-pbinom(q=17,size=20,prob=0.8) 
[1] 0.2060847
or, equivalently:
> pbinom(q=17,size=20,prob=0.8,lower.tail=FALSE) 
> pbinom(q=2,size=20,prob=0.2)
In conclusion, there is a 20% chance of observing an A∕N count ratio of greater or equal than 9 when the actual population ratio is 4.
2.
To answer the first question, we formulate a two-sided hypothesis:
H₀ : p(female) = 0.5 = p(male)
H_A : p(female)≠0.5≠p(male)
Then the 2.5 and 97.5 percentiles of the binomial null distribution are given by:
> qbinom(p=c(0.025,0.975),size=50,prob=0.5) 
[1] 18 32
So the rejection region is R = {0,…,17,33,…,50}, which does not include the observed outcome of 32 female fossils. Equivalently, using the binom.test function:
> h <- binom.test(x=32,n=50,p=0.5,alternative="two.sided") 
> h$p.value 
[1] 0.06490865
0.06490865 > 0.05 and therefore the two-sided hypothesis is not rejected. To answer the second question, we carry out a one-sided hypothesis test:
H₀ : p(female) = 0.5 = p(male)
H_A : p(female) > 0.5 > p(male)
whose 95 percentile is given by:
> qbinom(p=0.95,size=50,prob=0.5) 
[1] 31
This leads to a rejection region of R = {32,…,50}, which includes the outcome of 32 female fossils. Hence the one sided hypothesis is rejected. Equivalently:
> h <- binom.test(x=32,n=50,p=0.5,alternative="greater") 
> h$p.value 
[1] 0.03245432
which is half of the one-sided p-value and less than 0.05. The one-sided hypothesis is more powerful than the two-sided version, leading to rejection of H₀. Thus the one-sided and two-sided hypothesis tests lead to different conclusions for the same dataset. It is important that the test is chosen before the experiment is completed. So you must decide on your scientific question before designing your study. In the dinosaur example, it would be bad practice to decide on a one-sided hypothesis test after seeing that female fossils are more abundant in the bone bed than male ones.
3.
This exercise is all about type-I errors. Using rbinom to generate the synthetic data:
1type1test <- function(p=0.2,n=50,N=100){ 
2  r <- rbinom(n=n,size=N,prob=p) 
3  pval <- rep(0,n) 
4  for (i in 1:n){ 
5    h <- binom.test(x=r[i],n=N,p=p) 
6    pval[i] <- h$p.value 
7  } 
8  sum(pval<0.05) 
9}
The function is to be used like this:
> type1test() 
[1] 2 
> type1test() 
[1] 3 
> type1test() 
[1] 0
You get a different result every time because rbinom generates a different set of numbers every time. The number of rejected null hypotheses should hover around 5%. Therefore we would expect that, on average, 2.5 out of every 50 tests should fail.
4.
The 95% confidence interval for p(A) is given by:
> binom.test(x=12,n=20)$conf.int 
[1] 0.3605426 0.8088099
Note that we are not really using binom.test to carry out a hypothesis test. We are only interested in the confidence interval. The lower value of the confidence interval for p(A) = 0.3605426 corresponds to an A∕N-ratio of

$p(A)- --p(A-)- -0.3605426--- p(N ) = 1 − p(A) = 1− 0.3605426 = 0.5638258$

A value of p(A) = 0.8088099 corresponds to an A∕N-ratio of

$p(A-) --p(A)-- --0.8088099--- p(N ) = 1− p(A ) = 1 − 0.8088099 = 4.230396$

Hence the 95% confidence interval for A∕N is [0.5638258,4.230396].
19.6 The Poisson distribution
1.
If, on average, one magnitude ≥ 9 event happens every 13.5 years, then this means that λ = 1∕13.5. The chance that exactly two such events happen per year is given by Equation 6.1:
$( ) [ ] -1-- λke−-λ exp-−-113.5-- P k = 2|λ = 13.5 = k! = 13.522! = 0.00255 (0.255% )$

In R:
> dpois(x=2,lambda=1/13.5) 
[1] 0.002547607
The average number of magnitude ≥ 9 earthquakes per century is λ = 100∕13.5. The chance that no such earthquake will happen during the next century is:

$( ) [ ] 100- -100 P k = 0|λ = 13.5 = exp −13.5 = 0.000607 (0.0607% )$
2.
The expected number of fission tracks is
$( 8.46 × 10−17) (1 × 106yr)× ------------ × (6.022× 1011atoms ) = 50.95 atoms ⋅yr$

The probability that the crystal contains 50 or fewer fission tracks is:
> L <- 1e6 * 8.46e-17 * 6.022e11 
> ppois(q=50,lambda=L) 
[1] 0.4843836
The probability that it contains 41 to 60 fission tracks is:
> ppois(q=60,lambda=L) - ppois(q=40,lambda=L) 
[1] 0.8393979
3.
This problem requires a one-sided hypothesis test:
H₀ : λ = 2
H_A : λ < 2
In R:
> h <- poisson.test(x=12,T=10,r=2,alternative='less') 
> h$p.value 
[1] 0.03901199
where T is the ‘time base’, corresponding to the number of years. H₀ is rejected on a 95% confidence level, and so it would be best to cease operations and abandon the mining site. Instead of using the number of diamonds per tonne to define λ, we can also use the number of diamonds per 10 tonnes:
H₀ : λ = 20
H_A : λ < 20
which allows us the calculate a rejection region:
> qpois(p=0.05,lambda=20) 
[1] 13
k = 12 ∈ R = {0,…,12}, leading to the rejection of H₀. Equivalently:
> poisson.test(x=12,T=1,r=20,alternative='less')$p.value 
[1] 0.03901199
which is the same result as before.
4.
To construct a 95% confidence interval, we again use poisson.test, but have no need to formulate a null hypothesis. This is similar to exercise 19.5.4:
> h <- poisson.test(x=30,T=10) 
> h$conf.int 
[1] 2.024087 4.282687
The profitability cutoff of 2 diamonds per tonne falls below this confidence interval. So we can conclude that the second site is profitable.
19.7 The normal distribution
1.
This is an illustration of the central limit theorem. Here is a solution that does not require a for loop:
1# np = the number of Poisson values per sum 
2# ns = the number of sums 
3CLT <- function(np=100,ns=200){ 
4  r <- rpois(n=np*ns,lambda=1.2)   # create all random numbers at once 
5  mat <- matrix(r,nrow=np,ncol=ns) # put the values in one big matrix 
6  sums <- colSums(mat)  # take the column sums (instead of a for loop) 
7  hist(sums)            # plot the histogram 
8  mu <- mean(sums)      # compute the mean 
9  sigma <- sd(sums)     # compute the standard deviation 
10  proximal <- sums>(mu-2*sigma) & sums<(mu+2*sigma) # logical vector 
11  return(sum(proximal)/ns)         # returns a value between 0 and 1 
12}
Running this script from the console should return a value of approximately 0.95:
> CLT() 
[1] 0.955
Results will vary slightly due to the randomness of rpois.
2.
Using Table 20.1, the 98.5-percentile of a standard normal distribution is 2.17. To reduce the tail that falls below 750 ml to less than 1.5%, we must raise the mean by 2.17 × 2 = 4.34ml, so that μ = 754.34 ml. Using R:
> qnorm(p=0.985,mu=750,sd=2) 
[1] 754.3402
3.
According to Table 20.1, the 90-percentile of the women’s distribution is 160+1.28×7.7 = 169.856 cm. This is (175 − 169.856)∕7.7 ≈ 0.67 standard deviations below the average height of the men. This corresponds to the 25-percentile of the men’s distribution. Using R:
> q90women <- qnorm(p=0.9,mean=160,sd=7.7) 
> pnorm(q=q90women,mean=175,sd=7.7) 
[1] 0.2525457
4.
The mean and variance of the Poisson distribution are both equal to λ. Provided that λ is reasonably big (and a value of 64 meets this requirement), we can approximate this distribution with a normal distribution with mean and variance λ. The expected value of this distribution is 64 and its standard deviation is 8. An outcome of k = 62 is 0.25 standard deviations below the mean. Using Table 20.1, the probability of observing a value of up to 0.25 standard deviations below the mean is 40%. The exact value (given by ppois(61,lambda=64)) is 38.5%.
5.
Here we need to evaluate the bivariate normal likelihood (equation 7.7). Unfortunately, this function is not part of base R so we need to implement it ourselves:
1dmvnorm <- function(vec,mu=rep(0,2),sigma=diag(2)){ 
2  num <- exp(0.5 * t(vec - mu) %*% solve(sigma) %*% (vec - mu)) 
3  den <- 2*pi*sqrt(det(sigma)) 
4  return(num/den) 
5}
Using these data to evaluate the bivariate normal densities of X and Y :
6muX <- c(1,2) 
7muY <- c(3,4) 
8SigmaX <- matrix(c(1,-1,-1,3),nrow=2,ncol=2) 
9SigmaY <- matrix(c(2,2,2,7),nrow=2,ncol=2) 
10X <- c(2,1) 
11Y <- c(5,7) 
12dX <- dmvnorm(X,mu=muX,sigma=SigmaX) 
13dY <- dmvnorm(Y,mu=muY,sigma=SigmaY)
which produces the following output:
> dX 
[1,] 0.1855463 
> dY 
[1,] 0.1511973
So outcome X is more likely than outcome Y .
19.8 Error propagation
1.
Let x be the height difference between points A and B (x = 25±0.25 m), and let y be the height difference between B and C (y = 50 ± 0.5 m). Then the height difference between A and C is simply z = x + y = 75 m. To propagate the uncertainty of this estimate, we use Equation 8.10 with a = 0 and b = c = 1: $√------ s[z]2 = s[x]2 + s[y]2 = 0.252 + 0.52 = 0.3125 ⇒ s[z] = 0.3125 = 0.559017$
Note that the uncertainty of both x and y is 1%, but the uncertainty of z is only 0.75%.
2.

(a)
Density is mass divided by volume, hence density(A) = 105∕36 = 2.92 g/cm³ and density(B) = 30∕10 = 3.00 g/cm³.
(b)
The uncertainty is calculated with the formula for a quotient (equation 8.13): $( )2 ( )2 ( )2 s[z]- = s[x] + s[y] z x y$
where z is density, x is mass and y is volume. Thus s[density(A)] = = 0.057 g/cm³ and s[density(B)] = = 0.277 g/cm³.
(c)
The 95% confidence intervals for the two densities are approximately 2× their standard deviations. So for sample A, the 95% confidence interval is 2.92±0.11 g/cm³, and for sample B it is 3.00±0.55 g/cm³.
(d)
The confidence interval for density(A) overlaps with density(B) and vice versa. Therefore, we cannot rule out the possibility that the true densities are equal.

3.
The formula for the size shift does not directly fit into any of the equations of Section 8.2. Therefore we have to apply the chain rule: $x = ρz − 1 = 3.85 and y = ρq − 1 = 1.65 x 3.85 w = --= ---- = 2.333 y 1.65 SS = aln(w ) =--1- ln(2.33) = 1.228 0.69$
Following the same steps for the error propagation:
$s[x] = s[ρz] = 0.10 and s[y] = s[ρq] = 0.05 (using Equation 8.11) ∘ (----)2---(----)2- ∘ (-----)2---(----)2- s[x]- s[y] 3.85 0.10- 0.05 s[w ] = w x + y = 1.65 3.85 + 1.65 = 0.093 (using Equation 8.13) s[SS ] = a s[w-]= -1--0.093-= 0.058 (using Equation 8.15) w 0.69 2.333$
Hence the size shift is SS = 1.228 ± 0.058, which can be used to select a set of sieves that separate zircon from quartz.
4.
(a)
Calculating the means:
1K40 <- c(2093,2105,2055,2099,2030) 
2Ar40 <- c(6.015,6.010,6.030,6.005,6.020,6.018) 
3meanK40 <- mean(K40) 
4meanAr40 <- mean(Ar40)
(b)
The standard errors of the means:
5nK40 <- length(K40) 
6nAr40 <- length(Ar40) 
7seK40 <- sd(K40)/sqrt(nK40) 
8seAr40 <- sd(Ar40)/sqrt(nAr40)
(c)
The ⁴⁰Ar/⁴⁰K-ratio and its standard error, propagated using Equation 8.13:
9Ar40K40 <- meanAr40/meanK40 
10seAr40K40 <- Ar40K40*sqrt((seAr40/meanAr40)^2+(seK40/meanK40)^2)
(d)
Calculating the K–Ar age:
11L <- 5.543e-4 # lambda 
12LLe <- 9.3284 # lambda/lambda_e 
13age <- log(1+LLe*Ar40K40)/L
Propagating its uncertainty requires breaking down the calculation into two steps:

$( ) -λ- 40Ar- z = 1+ λe 40K (which can be propagated using Equation 8.10) 1 and t = --ln(z) (which can be propagated using Equation 8.15) λ$

In R:
14z <- 1+LLe*Ar40K40 
15sez <- LLe*seAr40K40 # using equation 8.10 
16se_age <- sez/(L*z)  # using equation 8.15
Running this script in R yields
> age 
[1] 48.11483 
> se_age 
[1] 0.3331236
19.9 Comparing distributions
1.
(a)
Carrying out a one-sided t-test: $H0 : μ = 750.0$

$H : μ > 750.0 a$
Compute a one-sided t-statistic:
$¯x = 750.1-+-750.4-+-749.9+-750.3-+-750.3 = 750.2 5$

$∘ (−-0.1)2-+-0.22-+-(− 0.3)2 +-0.12 +-0.12 s[x ] = ----------------------------------- = 0.2 5 − 1$

$750.2−-750.0- t = 0.2∕√5- = 2.236$
t was drawn from a t-distribution with 4 degrees of freedom, with α = 0.05. Using Table 20.2, the rejection region: R = {t > 2.13}. 2.236 ∈ R, therefore, H₀ is rejected and we conclude that the mean volume exceeds 750 ml.
(b)
Switching to a two-sided t-test: $H0 : μ = 750.0$

$Ha : μ ⁄= 750.0$
The t-statistic is the same as before (t = 2.236). However, the two-sided rejection region is R = {t < −2.78,t > 2.78}. 2.236R, therefore, H₀ is not rejected and the mean volume may yet equal 750 ml.
(c)
Using Equation 9.6 and the t-statistic: $s[x] μ ∈ ¯x± tdf,α∕2√n--$

${ } μ ∈ 750.2± 2.780√.2- = {750.2 ± 0.25} 5$
This interval overlaps with μ = 750.0, reflecting the conclusion drawn from the two-sided hypothesis test of Exercise 18.9.1.1b.

Repeating the same exercise using R:
> wine <- c(750.1,750.4,749.9,750.3,750.3) 
> t.test(wine,mu=750,alternative='greater')$p.value 
[1] 0.04450467 
> t.test(wine,mu=750)$p.value 
[1] 0.08900934 
> t.test(wine)$conf.int 
[1] 749.9517 750.4483
Bonus note: it is also possible to construct a one-sided confidence interval:

$[ ) μ ∈ ¯x+ t s[√x],∞ df,α n$

$[ ) [ ) 0.2 μ ∈ 750.2− 2.13√5-,∞ = 750.01,∞$

which does not overlap with 750.0 and is, therefore, compatible with the outcome of the one-sided hypothesis test. In R:
> t.test(wine,alternative='greater')$conf.int 
[1] 750.0093      Inf
2.
This exercise is most concisely solved using R’s chisq.test function. First we amalgamate the data columns with the fewest counts. Then we apply a two-sample chi-square test to this new table:
1abundant <- forams[,c('quinqueloba','pachyderma','incompta', 
2                      'glutinata','bulloides')] 
3other <- rowSums(forams[,c('uvula','scitula')]) 
4dat <- cbind(abundant,other) 
5h <- chisq.test(dat)
where cbind binds two columns together into a matrix (you could also do this with the matrix function). Probing the outcome of these calculations:
> h$p.value 
[1] 0.7977184
Hence there is insufficient evidence to prove that the two samples reflect different ecosystems.
3.
First, we take the mean of the data:

$¯x = 20-+-28-+-16+-12-+-24 = 20 5$

This gives us the expected count for each bin. Next, we carry out a hypothesis test:
H₀: the bin counts follow a uniform distribution
H_a: the bin counts do not follow a uniform distribution
The test statistic is calculated using Equation 9.7 of the notes:

$2 (20-−-20)2- (28−-20)2- (16-−-20)2- (12-−-20)2 (24-−-20)2- χ = 20 + 20 + 20 + 20 + 20 = 8$

Under H₀, χ² follows a Chi-square distribution with 4 degrees of freedom. Using α = 0.05 and Table 20.3 the rejection region for χ² is

$R = {χ2 > 9.49}$

χ²R, therefore H₀ is not rejected and we cannot rule out the possibility that the data were drawn from a uniform distribution. Using R:
> chisq.test(x=c(20,28,16,12,24)) 
... 
X-squared = 8, df = 4, p-value = 0.09158
4.
We are comparing the following hypotheses:
H₀: the two distributions have the same median
H_a: the two distributions have different medians
The test statistic is the Wilcoxon rank-sum. The dataset contains n₁ = 7 natural and n₂ = 8 synthetic diamonds. The sum of the ranks for the natural diamonds is w = 1 + 3 + 4 + 8 + 9 + 10 + 11 = 46. This is a two-sided test, so we must evaluate the null distribution at α = 0.025. Using Table 20.4, the rejection region is R = {w < 38}. 46R, so we cannot reject the null hypothesis. There is no discernable quality difference between the two diamond varieties. Using R:
> diamonds <- c('N','S','N','N','S','S','S','N', 
                'N','N','N','S','S','S','S','S') 
> wilcox.test(x=which('N'==diamonds), 
              y=which('S'==diamonds)) 
... 
> W = 18, p-value = 0.1738
Note that the test statistic (W = 18) does not equal the manually calculated value (w = 46). This discrepancy is explained in the documentation:
> ?wilcox.test 
... 
The literature is not unanimous about the definitions of the 
Wilcoxon rank sum and Mann-Whitney tests.  The two most common 
definitions correspond to the sum of the ranks of the first sample 
with the minimum value (m(m+1)/2 for a first sample of size m) 
subtracted or not: R subtracts.  It seems Wilcoxon's original 
paper used the unadjusted sum of the ranks but subsequent tables 
subtracted the minimum.
For the diamond example, m = 7, so that 18 + m(m + 2)∕2 = 46, which equals the manually calculated rank-sum.
5.
Comparing the t-test and Wilcoxon test:
1A <- seq(from=0,to=8,by=2) 
2B <- seq(from=3,to=15,by=2) 
3ht <- t.test(A,B) 
4hw <- wilcox.test(A,B)
which produces the following outcomes:
> ht$p.value 
[1] 0.04325446 
> hw$p.value 
[1] 0.07323232
The null hypothesis is rejected by the t-test but not by the Wilcoxon test. This is representative of a general tendency for non-parametric hypothesis tests to be less powerful than their parametric counterparts.
6.
There are 13 samples in the DZ, whose Kolmogorov-Smirnov statistics can be evaluated in a double for loop.
1ns <- length(DZ)            # number of samples 
2maxKS <- 0                  # initialise the least similar K-S value 
3minKS <- 1                  # initialise the most similar K-S value 
4for (i in 1:(ns-1)){        # loop through the rows 
5  for (j in (i+1):ns){      # loop through the columns 
6    KS <- ks.test(DZ[[i]],DZ[[j]])$statistic 
7    if (KS < minKS){ 
8      minKS <- KS           # update the least similar K-S values 
9      mostsimilar <- c(i,j) 
10    } 
11    if (KS > maxKS){ 
12      maxKS <- KS           # update the most similar K-S values 
13      leastsimilar <- c(i,j) 
14    } 
15  } 
16}
Plotting the most similar and least similar pairs as Q-Q plots:
17op <- par(mfrow=c(1,2)          # set up a two panel plot 
18snames <- names(DZ)             # get the list of sample names 
19ms1 <- snames[mostsimilar[1]]   # first sample name in similar pair 
20ms2 <- snames[mostsimilar[2]]   # second sample name in similar pair 
21ls1 <- snames[leastsimilar[1]]  # first sample name in dissimilar pair 
22ls2 <- snames[leastsimilar[2]]  # second sample name in dissimilar pair 
23qqplot(DZ[[ms1]],DZ[[ms2]],xlab=ms1,ylab=ms2,main=minKS) 
24qqplot(DZ[[ls1]],DZ[[ls2]],xlab=ls1,ylab=ls2,main=maxKS) 
25par(op)
The most similar samples are T and Y, whose K-S distance is 0.07; the least similar samples are L and 8, whose K-S distance is 0.63.
19.10 Regression
1.
Inspecting the correlation matrix
1> cor(trees) 
2           Girth    Height    Volume 
3Girth  1.0000000 0.5192801 0.9671194 
4Height 0.5192801 1.0000000 0.5982497 
5Volume 0.9671194 0.5982497 1.0000000
shows that, among all three possible pairs of variables, Girth and Volume are most strongly correlated. Fitting a straight line through the data, and showing on a scatter plot:
1fit <- lm(Girth~Volume,data=trees) 
2plot(Girth~Volume,data=trees) 
3abline(fit)
Note how the plot function also accepts formula notation. This is equivalent to:
2plot(x=trees$Volume,y=trees$Girth,xlab='Volume',ylab='Girth')
The slope and intercept are:
> fit$coef 
(Intercept)      Volume 
  7.6778570   0.1846321
Hence Girth = 7.6778570 + 0.1846321 × Volume.
2.
Predicting the world population for the year 2020:
1recent <- worldpop[worldpop$year>=1960,] 
2fit <- lm(population~year,data=recent) 
3pred <- predict(fit,newdata=data.frame(year=2020),interval='prediction')
which gives the following result:
> pred 
       fit      lwr      upr 
1 7703.923 7579.205 7828.642
3.
Using the left hand side of Equation 10.21:
$[σz]2 --------------μz-------------- ----------0.12----------- ρX,Y ≈ ∘ [--]2--[--]2∘ [---]2--[--]2 = √0.012 +-0.12√0.052-+-0.12 = 0.89 σμyy + σμzz σμxx + σμzz$

Comparing with a numerical simulation:
1n <- 100 
2x <- rnorm(n,mean=100,sd=5) 
3y <- rnorm(n,mean=100,sd=1) 
4z <- rnorm(n,mean=100,sd=10) 
5X <- x/z 
6Y <- y/z
Then query the spurious correlation coefficient:
1> cor(X,Y) 
2[1] 0.891362
4.
Preparing the data:
1xy <- data.frame(x=c(10,20,28),sx=rep(1,3), 
2                 y=c(20,30,42),sy=rep(1,3), 
3                 cov=c(0.9,0.9,-0.9))
Fitting the two models:
4lmfit <- lm(y ~ x, data=xy) 
5yorkfit <- york(xy)
Querying the results:
> lmfit$coef 
(Intercept)           x 
   7.213115    1.213115 
> yorkfit$coef 
intercept     slope 
 9.300252  1.052391
19.11 Fractals and chaos
1.
The solution to this exercise is very similar to Section 17.11.1. Using the sizefrequency() function in the geostats package:
> sf <- sizefrequency(rivers) 
> plot(sf,log='xy')
Fitting a line through the data
> fit <- lm(log(frequency)~log(size),data=sf) 
> lines(sf$size,exp(predict(fit)))
The straight line fit is not great, indicating that short rivers have been under-counted in this dataset.
2.
Let us first have a look at the fractures dataset:
> image(fractures)
Counting the number of 4 × 4 and 8 × 8 boxes:
> boxcount(fractures,size=4) 
[1] 1456 
> boxcount(fractures,size=8) 
[1] 662
Using geostats’ fractaldim() function:
> fit <- fractaldim(fractures) 
> fit$coef[2] 
log(size) 
-1.541701
3.
Extract the last 50 years worth of data from the declustered database:
1ny <- 50                           # number of years 
2lastyear <- max(declustered$year)  # 2016 
3firstyear <- lastyear-ny           # 1967 
4recent <- (declustered$year>firstyear) & (declustered$year<=lastyear) 
5dat <- declustered$mag[recent]     # magnitudes of recent earthquakes
We use the Gutenberg-Richter law to estimate the number of magnitude ≥ 6 earthquakes:
6nq <- length(dat)           # the number of earthquakes 
7fit <- gutenberg(dat)       # size - (log)frequency distribution 
8lf6 <- predict(fit,newdata=data.frame(mag=6))
Giving a value for log ₁₀[N∕N_∘] of
> lf6 
-2.705345
Hence the expected number of magnitude ≥ 6 earthquakes per year is:
9f6 <- (10^lf6)*nq/ny
which is 0.92. The actual number of earthquakes will vary around this number, following a Poisson distribution (Chapter 6) with λ = 0.92. The probability of observing at least one such event next year is given by
> ppois(q=0,lambda=f6,lower.tail=FALSE) 
[1] 0.6014946
4.
Plotting 9 experiments in a 3 × 3 grid:
1op <- par(mfrow=c(3,3),mar=rep(0,4)) # initialise the plot 
2pendulum()                           # default settings 
3pendulum(startpos=c(-1.9,2))         # tiny change in starting position 
4pendulum(startpos=c(2,2))            # start in the upper right corner 
5pendulum(startvel=c(0.5,0))          # start with a horizontal push 
6pendulum(startvel=c(1,0))            # a stronger push 
7pendulum(startvel=c(0,-1))           # push down 
8pendulum(src=matrix(c(1,-1,0,0),ncol=2))             # two magnets 
9pendulum(src=matrix(c(1,-1,0,0,0,-1),ncol=2))        # three magnets 
10pendulum(src=matrix(c(1,1,-1,-1,1,-1,1,-1),ncol=2))  # four magnets 
11par(op)                              # restore the parameters
19.12 Unsupervised learning
1.
Removing the Species column from the iris dataset and feeding the unclassified data into R’s prcomp function:
1pc <- prcomp(iris[,-5]) 
2biplot(pc)
Which produces the following biplot:

Figure 19.1: PCA biplot for the iris data. This diagram indicates there are two groups of flowers that with distinct petal width and length measurements (PC1). Within these two groups, additional variability is caused by the sepal width and length (PC2). The vector loadings of the petal width and length point in the same direction, which means that these two variables are correlated with each other. The vector loading of the sepal width is perpendicular to that of the petal measurements, which means that the sepal and petal dimensions vary independently.
2.
Classical MDS analysis of the iris data:
1d <- dist(iris[,-5]) 
2mds <- cmdscale(d) 
3plot(mds,type='n') 
4text(mds)
which produces essentially the same output as Figure 19.1 but without the arrows.
3.
Repeating the code from Section 17.12.5 for different numbers of clusters:
1K <- 10            # maximum number of clusters to evaluate 
2ss <- rep(0,K)     # initialise the vector with the sums of squares 
3for (k in 1:K){    # loop through all the k values 
4  fit <- kmeans(iris[,-5],centers=k) # fit the k means 
5  ss[k] <- fit$tot.withinss          # extract the sum of squares 
6} 
7plot(x=1:K,y=ss,type='b')            # plot as both lines and points
Figure 19.2: Evaluating the within-cluster sum-of-squares (ss) of the k-means algorithm for different numbers of clusters. The ss-misfit drops off very quicky before making an ‘elbow’ at k = 3 clusters. Hence we cannot justify more than 3 clusters.
4.
Using the ksdist function as instructed:
1d <- ksdist(DZ) 
2tree <- hclust(d) 
3plot(tree)
This produces the following tree:

Figure 19.3: Hierarchical clustering tree for the detrital zircon data. The tree conveys the same information as the MDS configuration of Figure 12.8: there are two main clusters; T and Y are the two most similar samples; whilst L and 8 are the two most dissimilar samples.
19.13 Supervised learning
1.
Build an LDA model for the training data:
1library(MASS) 
2ld <- lda(affinity ~ ., data=training)
Predict the tectonic affinity of the training data:
3pr <- predict(ld)
which is equivalent to:
3pr <- predict(ld,newdata=training[,-1])
Count the number of misclassified training data:
> sum(pr$class != training$affinity) 
[1] 60
So 60 of the 646 training samples were misclassified (9%).
2.
Classifying geostats’ test data with the ld model developed in answer 1:
4pr.test <- predict(ld,newdata=test[,-1])
Count the number of misclassified test data:
> sum(pr.test$class != test$affinity) 
[1] 32
So 32 out of 147 of the test data were misclassified (21%). Unsurprisingly, this is higher than the misclassification rate of the training data. Inspecting the results in some more detail:
> table(pr.test$class,test$affinity) 
       IAB MORB OIB 
  IAB   48    2   1 
  MORB  12   16   8 
  OIB    4    5  51
This shows that the misclassification rates are similar for all three tectonic affinities.
3.
Repeating the code of exercise 1 but replacing MASS with rpart and lda with rpart:
1library(rpart) 
2tree <- rpart(affinity ~ ., data=training)
Predict the tectonic affinity of the training data:
3pr <- predict(tree,type='class')
Count the number of misclassified training samples:
> sum(pr != training$affinity) 
[1] 42
So 42 of the 646 training samples were misclassified.
4.
Repeating the code of exercise 2 but replacing MASS with rpart and lda with rpart.
4pr.test <- predict(tree,newdata=test[,-1],type='class')
Count the number of misclassified test data:
> sum(pr.test != test$affinity) 
[1] 28
So 28 out of 147 of the test data were misclassified (19%). Again, this is higher than the misclassification rate of the training data, but slightly lower than the misclassification rate of the test data by LDA.
19.14 Compositional data
1.
(a)
Calculate the arithmetic mean fuel consumption, in miles per gallon:
> avg.mpg <- mean(mtcars[,'mpg']) 
> avg.mpg 
[1] 20.09062
(b)
Convert the data to litres/100km:
> l100k <- 235/mtcars[,'mpg']
(c)
Average the litres/100km:
> avg.l100k <- mean(l100k) 
> avg.l100k 
[1] 12.7434
(d)
Convert the arithmetic mean number of miles per gallon (from step 1a) to units of litre/100km.
> avg.l100k.from.mpg <- 235/avg.mpg 
> avg.l100k.from.mpg 
[1] 11.697
which is not equal to the 12.7434 litres/100km obtained under step 1c. The difference is nearly 10%.
(e)
Compute the geometric mean fuel consumption in mpg and litres/100km, and convert the geometric mean mpg value to litres/100km:
> geomean.mpg <- exp(mean(log(mtcars[,'mpg']))) 
> geomean.l100k <- exp(mean(log(l100k))) 
> geomean.l100k.from.mpg <- 235/geomean.mpg 
> c(geomean.l100k,geomean.l100k.from.mpg) 
[1] 12.20775 12.20775
which is an altogether more sensible result than that obtained under step 1d.
2.
Using geostats’ test data, plot the CaO–K₂O–Na₂O subcomposition on a ternary diagram as filled grey circles:
1alkali <- test[,c('CaO','K2O','Na2O')] 
2ternary(alkali,pch=16,col='grey50')
Adding the arithmetic mean as a white filled circle:
3arithmetic.mean <- colMeans(alkali) 
4ternary(arithmetic.mean,add=TRUE,type='p',pch=21,bg='white')
Adding the logratio mean as a white filled square:
5logratio.mean <- colMeans(alr(alkali)) 
6compositional.mean <- alr(logratio.mean,inverse=TRUE) 
7ternary(compositional.mean,add=TRUE,type='p',pch=22,bg='white')
3.
PCA is an unsupervised learning technique. So we don’t need to know the tectonic affinities, which are stored in the first column of the training data. Hence we split the data into two parts: the affinities and the data itself. The data is used for the PCA, and the affinities are used as labels to help us interpret the biplot:
1aff <- test[,1]      # tectonic affinities 
2dat <- test[,-1]     # geochemical data 
3lrdat <- clr(dat) 
4pc <- prcomp(lrdat) 
5biplot(pc,col=c('grey50','black'),xlabs=aff)
Figure 19.4: PCA biplot of the oceanic basalt compositions. The data roughly fall into three clusters, corresponding to OIB, MORB and IAB. OIBs are relatively rich in K₂O and poor in CaO; MORBs are rich in MgO and TiO₂; and IABs are rich in MnO and poor in TiO₂. K₂O and CaO are anti-correlated, and so are TiO₂ and MnO. The variability in MnO/TiO₂ is independent of the variability in MgO/K₂O. The ‘outliers’ on the ternary diagram of exercise 2 are OIBs that are particularly poor in CaO.
4.
Compositional LDA of the training data:
1ld <- lda(x=alr(training[,-1]),grouping=training[,1]) 
2pr.training <- predict(ld)
Compositional LDA of the test data:
3pr.test <- predict(ld,newdata=alr(test[,-1]))
Compare the fitted results with the true affinities:
> sum(pr.training$class != training$affinity) 
[1] 48 
> sum(pr.test$class != test$affinity) 
[1] 19
In contrast with the previous LDA results of Section 19.13.1–2, the logratio transformation has reduced the number of misclassified samples from 60 to 48 for the training data, and from 32 to 19 for the test data.
19.15 Directional data
1.
Store the data into two variables:
1A <- c(50,55,40,60,45) 
2B <- c(350,10,5,358)
Then the difference of the vector means is given by
> meanangle(A,degrees=TRUE) - meanangle(B,degrees=TRUE) 
[1] 49.24496
(for comparison, the difference of the arithmetic means is −131^∘)
2.
Load the earthquake data and extract the latitude and longitude values:
1lon <- earthquakes$lon 
2lat <- earthquakes$lat
Select the plottable longitudes and show on a Schmidt net:
3good <- (lon>-90 & lon<90) 
4stereonet(trd=lon[good],plg=lat[good],option=3,degrees=TRUE,wulff=FALSE)
3.
Store the data into two data frames:
1F1 <- data.frame(strike=c(200,195,210,205),dip=c(70,68,71,67)) 
2F2 <- data.frame(strike=c(20,25,18,17,16),dip=c(10,12,9,14,11))
Plot the two sets of fault measurements on a single Wulff diagram (in different colours to allow distinction):
3stereonet(trd=F1$strike,plg=F1$dip,option=2,degrees=TRUE) 
4stereonet(trd=F2$strike,plg=F2$dip,option=2,degrees=TRUE, 
5          add=TRUE,col='grey')
Calculate the mean strike and dip:
6mF1 <- meanangle(trd=F1$strike,plg=F1$dip,option=2,degrees=TRUE) 
7mF2 <- meanangle(trd=F2$strike,plg=F2$dip,option=2,degrees=TRUE)
Add these values to the existing stereoplot:
8stereonet(trd=mF1[1],plg=mF1[2],option=2,degrees=TRUE,add=TRUE,pch=15) 
9stereonet(trd=mF2[1],plg=mF2[2],option=2,degrees=TRUE,add=TRUE,pch=15, 
10          col='grey')
Querying the mean vectors at the console:
> mF1 
[1] -157.67178   69.08943 
> mF2 
[1] 19.19848 11.21677
The difference between the strikes and dips is -176.9^∘ and 57.9^∘, respectively.
4.
The same Rbar function that was used for the circular data in Section 18.15.2 can also be used to calculating the concentration parameter (R) of spherical data. For the palaeomag data:
> Rbar(trd=palaeomag$decl,plg=palaeomag$incl,option=1,degrees=TRUE) 
[1] 0.9927666
and for the fault data:
> Rbar(trd=fault$strike,plg=fault$dip,option=2,degrees=TRUE) 
[1] 0.9970092
Because the latter value is higher than the former, we conclude that the fault orientation data are more strongly concentrated than the palaeomagnetic data.
19.16 Spatial data
1.
Load the hills dataset and storing the three columns in separate variables for brevity:
1X <- hills$X 
2Y <- hills$Y 
3Z <- hills$Z
Fit the four semivariogram models:
4op <- par(mfrow=c(1,3)) 
5expmodel <- semivariogram(x=X,y=Y,z=Z,plot=TRUE,model='exponential') 
6normodel <- semivariogram(x=X,y=Y,z=Z,plot=TRUE,model='gaussian') 
7sphmodel <- semivariogram(x=X,y=Y,z=Z,plot=TRUE,model='spherical') 
8par(op)
Visual inspection of the resulting plots indicates that the Gaussian model fits the empirical semivariogram best, which is not surprising given that the hills dataset consists of a bivariate Gaussian mixture.

Figure 19.5: a) exponential; b) Gaussian; and c) spherical semivariogram fits to the hills data.
2.
Kriging interpolation of {x = 0,y = 0} using the normodel semivariogram model from step 1:
> kriging(x=X,y=Y,z=Z,xi=0,yi=0,svm=normodel) 
[1] 538.1133
and the standard error:
> v <- kriging(x=X,y=Y,z=Z,xi=0,yi=0,svm=normodel,err=TRUE) 
> sqrt(v) 
[1] 6.180623
Hence the elevation estimate at {x = 0,y = 0} is 538 ± 6 m.
3.
Following the numerical recipe of Section 17.16.4:
5xi <- seq(from=min(X),to=max(X),length.out=30) 
6yi <- seq(from=min(Y),to=max(Y),length.out=30) 
7zi <- kriging(x=X,y=Y,z=Z,xi=xi,yi=yi,svm=normodel,grid=TRUE) 
8colourplot(x=xi,y=yi,z=zi,colspec=grey.colors)
4.
Fitting spherical and exponential semivariogram models to the Meuse data:
1X <- meuse$x 
2Y <- meuse$y 
3Z <- log(meuse$zinc) 
4sphmodel <- semivariogram(x=X,y=Y,z=Z,plot=FALSE,model='spherical') 
5expmodel <- semivariogram(x=X,y=Y,z=Z,plot=FALSE,model='exponential')
Estimating the log[Zn] values along an (x,y)-grid:
6xi <- seq(from=min(X),to=max(X),length.out=50) 
7yi <- seq(from=min(Y),to=max(Y),length.out=50) 
8zsph <- kriging(x=X,y=Y,z=Z,xi=xi,yi=yi,grid=TRUE,svm=sphmodel) 
9zexp <- kriging(x=X,y=Y,z=Z,xi=xi,yi=yi,grid=TRUE,svm=expmodel) 
10colourplot(x=xi,y=yi,z=zsph,plot.title=title('spherical')) 
11colourplot(x=xi,y=yi,z=zexp,plot.title=title('exponential'))
The largest disagreements are found in sparsely sampled parts of the map:
10colourplot(X=X,Y=Y,x=xi,y=yi,z=exp(zsph-zexp),colspec=grey.colors, 
11           key.title=title(expression('Zn'[sph]*'/Zn'[exp]),xpd=NA))
which exponentiates the log-contrasts to yield the ratio of the Zn-concentations estimated using the spherical and exponential semivariogram models.

Figure 19.6: Log-contrast between two kriging interpolations for the Meuse data, using the spherical and exponential semivariogram models. The 155 measurements locations are marked by white circles. The smallest discrepancies (< 10%) are found in densely sampled areas. In more sparsely sampled parts of the contour plot, and especially near the edges, the disagreement between the spherical and exponential models increases to up to 40%.
Chapter 20
Tables

p 0.55 0.6 0.65 0.7 0.75 0.8 0.841 0.9 0.95 0.975 0.977 0.985 0.99
q 0.13 0.25 0.39 0.52 0.67 0.84 1.00 1.28 1.64 1.96 2.00 2.17 2.33

Table 20.1: Upper quantiles of the standard normal distribution (lower quantiles are the negative equivalents of these values for reasons of symmetry)

p

df 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 0.975 0.99 0.999
1 0.16 0.32 0.51 0.73 1.00 1.38 1.96 3.08 6.31 12.71 31.82 318.31
2 0.14 0.29 0.44 0.62 0.82 1.06 1.39 1.89 2.92 4.30 6.96 22.33
3 0.14 0.28 0.42 0.58 0.76 0.98 1.25 1.64 2.35 3.18 4.54 10.21
4 0.13 0.27 0.41 0.57 0.74 0.94 1.19 1.53 2.13 2.78 3.75 7.17
5 0.13 0.27 0.41 0.56 0.73 0.92 1.16 1.48 2.02 2.57 3.36 5.89
6 0.13 0.26 0.40 0.55 0.72 0.91 1.13 1.44 1.94 2.45 3.14 5.21
7 0.13 0.26 0.40 0.55 0.71 0.90 1.12 1.41 1.89 2.36 3.00 4.79
8 0.13 0.26 0.40 0.55 0.71 0.89 1.11 1.40 1.86 2.31 2.90 4.50
9 0.13 0.26 0.40 0.54 0.70 0.88 1.10 1.38 1.83 2.26 2.82 4.30
10 0.13 0.26 0.40 0.54 0.70 0.88 1.09 1.37 1.81 2.23 2.76 4.14
11 0.13 0.26 0.40 0.54 0.70 0.88 1.09 1.36 1.80 2.20 2.72 4.02
12 0.13 0.26 0.39 0.54 0.70 0.87 1.08 1.36 1.78 2.18 2.68 3.93
13 0.13 0.26 0.39 0.54 0.69 0.87 1.08 1.35 1.77 2.16 2.65 3.85
14 0.13 0.26 0.39 0.54 0.69 0.87 1.08 1.35 1.76 2.14 2.62 3.79
15 0.13 0.26 0.39 0.54 0.69 0.87 1.07 1.34 1.75 2.13 2.60 3.73
16 0.13 0.26 0.39 0.54 0.69 0.86 1.07 1.34 1.75 2.12 2.58 3.69
17 0.13 0.26 0.39 0.53 0.69 0.86 1.07 1.33 1.74 2.11 2.57 3.65
18 0.13 0.26 0.39 0.53 0.69 0.86 1.07 1.33 1.73 2.10 2.55 3.61
19 0.13 0.26 0.39 0.53 0.69 0.86 1.07 1.33 1.73 2.09 2.54 3.58
20 0.13 0.26 0.39 0.53 0.69 0.86 1.06 1.33 1.72 2.09 2.53 3.55

Table 20.2: Upper quantiles of the t-distribution with df degrees of freedom (lower quantiles are the negative equivalents of these values for reasons of symmetry)

p

df   0.025   0.05   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   0.95   0.975
1   0.00   0.00   0.02   0.06   0.15   0.27   0.45   0.71   1.07   1.64   2.71   3.84   5.02
2   0.05   0.10   0.21   0.45   0.71   1.02   1.39   1.83   2.41   3.22   4.61   5.99   7.38
3   0.22   0.35   0.58   1.01   1.42   1.87   2.37   2.95   3.66   4.64   6.25   7.81   9.35
4   0.48   0.71   1.06   1.65   2.19   2.75   3.36   4.04   4.88   5.99   7.78   9.49   11.14
5   0.83   1.15   1.61   2.34   3.00   3.66   4.35   5.13   6.06   7.29   9.24   11.07   12.83
6   1.24   1.64   2.20   3.07   3.83   4.57   5.35   6.21   7.23   8.56   10.64   12.59   14.45
7   1.69   2.17   2.83   3.82   4.67   5.49   6.35   7.28   8.38   9.80   12.02   14.07   16.01
8   2.18   2.73   3.49   4.59   5.53   6.42   7.34   8.35   9.52   11.03   13.36   15.51   17.53
9   2.70   3.33   4.17   5.38   6.39   7.36   8.34   9.41   10.66   12.24   14.68   16.92   19.02
10   3.25   3.94   4.87   6.18   7.27   8.30   9.34   10.47   11.78   13.44   15.99   18.31   20.48
11   3.82   4.57   5.58   6.99   8.15   9.24   10.34   11.53   12.90   14.63   17.28   19.68   21.92
12   4.40   5.23   6.30   7.81   9.03   10.18   11.34   12.58   14.01   15.81   18.55   21.03   23.34
13   5.01   5.89   7.04   8.63   9.93   11.13   12.34   13.64   15.12   16.98   19.81   22.36   24.74
14   5.63   6.57   7.79   9.47   10.82   12.08   13.34   14.69   16.22   18.15   21.06   23.68   26.12
15   6.26   7.26   8.55   10.31   11.72   13.03   14.34   15.73   17.32   19.31   22.31   25.00   27.49
16   6.91   7.96   9.31   11.15   12.62   13.98   15.34   16.78   18.42   20.47   23.54   26.30   28.85
17   7.56   8.67   10.09   12.00   13.53   14.94   16.34   17.82   19.51   21.61   24.77   27.59   30.19
18   8.23   9.39   10.86   12.86   14.44   15.89   17.34   18.87   20.60   22.76   25.99   28.87   31.53
19   8.91   10.12   11.65   13.72   15.35   16.85   18.34   19.91   21.69   23.90   27.20   30.14   32.85
20   9.59   10.85   12.44   14.58   16.27   17.81   19.34   20.95   22.77   25.04   28.41   31.41   34.17


Table 20.3: Quantiles of the Chi-square distribution with df degrees of freedom

   1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   17   18   19   20
4            10
5         6   11   17
6         7   12   18   26
7         7   13   20   27   36
8      3   8   14   21   29   38   49
9      3   8   14   22   31   40   51   62
10      3   9   15   23   32   42   53   65   78
11      3   9   16   24   34   44   55   68   81   96
12      4   10   17   26   35   46   58   71   84   99   115
13      4   10   18   27   37   48   60   73   88   103   119   136
14      4   11   19   28   38   50   62   76   91   106   123   141   160
15      4   11   20   29   40   52   65   79   94   110   127   145   164   184
16      4   12   21   30   42   54   67   82   97   113   131   150   169   190   211
17      5   12   21   32   43   56   70   84   100   117   135   154   174   195   217   240
18      5   13   22   33   45   58   72   87   103   121   139   158   179   200   222   246   270
19      5   13   23   34   46   60   74   90   107   124   143   163   183   205   228   252   277   303
20      5   14   24   35   48   62   77   93   110   128   147   167   188   210   234   258   283   309   337


Table 20.4: Lower 2.5-percentiles (upper bounds of the rejection region for α = 0.025) of Wilcoxon rank-sum distributions for different sample sizes n₁ (column labels) and n₂ (row labels), where n₁ ≤ n₂.

   1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   17   18   19   20
4         6   11
5      3   7   12   19
6      3   8   13   20   28
7      3   8   14   21   29   39
8      4   9   15   23   31   41   51
9      4   9   16   24   33   43   54   66
10      4   10   17   26   35   45   56   69   82
11      4   11   18   27   37   47   59   72   86   100
12      5   11   19   28   38   49   62   75   89   104   120
13      5   12   20   30   40   52   64   78   92   108   125   142
14      5   13   21   31   42   54   67   81   96   112   129   147   166
15      6   13   22   33   44   56   69   84   99   116   133   152   171   192
16      6   14   24   34   46   58   72   87   103   120   138   156   176   197   219
17      6   15   25   35   47   61   75   90   106   123   142   161   182   203   225   249
18      7   15   26   37   49   63   77   93   110   127   146   166   187   208   231   255   280
19      7   16   27   38   51   65   80   96   113   131   150   171   192   214   237   262   287   313
20   1   7   17   28   40   53   67   83   99   117   135   155   175   197   220   243   268   294   320   348


Table 20.5: Lower 5-percentiles (upper bounds of the rejection region for α = 0.05) of Wilcoxon rank-sum distributions for sample sizes n₁ (column labels) and n₂ (row labels), where n₁ ≤ n₂.

44	51	79	65	27	31	4	355	22	352	287	7	287	339	0
276	342	355	334	296	7	17	351	349	37	339	40	324	325	334

#	x	y	Zn	z = ln[Zn]

1	181072	333611	1022	6.93
2	181025	333558	1141	7.04
3	181165	333537	640	6.46

50	180199	331591	375	5.93

155	180627	330190	375	5.93

	1	2	3	4	5	6	7	8	9	10	…	155

1	0	71	120	260	370	470	260	250	380	470	…	3400
2		0	140	280	360	470	230	200	330	440	…	3400
3			0	140	250	360	170	220	320	380	…	3400
4				0	150	240	180	300	350	320	…	3400
5					0	110	150	280	270	180	…	3200
6						0	250	380	330	180	…	3200
7							0	140	170	210	…	3200
8								0	140	280	…	3200
9									0	180	…	3100
10										0	…	3000

155												0

	1	2	3	4	5	6	7	8	9	10	…	155

1	0
2	0.11	0
3	0.47	0.58	0
4	1.4	1.5	0.91	0
5	1.3	1.4	0.87	0.046	0
6	1.3	1.4	0.82	0.089	0.044	0
7	1.1	1.2	0.62	0.30	0.25	0.21	0
8	0.92	1.0	0.46	0.46	0.41	0.37	0.16	0
9	1.1	1.2	0.61	0.30	0.25	0.21	0.0029	0.16	0
10	1.7	1.8	1.3	0.34	0.39	0.43	0.64	0.80	0.64	0

155	1.0	1.1	0.53	0.38	0.33	0.29	0.080	0.079	0.078	0.72	…	0

specimen	mass (g)	σ(mass)	volume (cm³)	σ(volume)

A	105	2	36	0.15
B	30	2.5	10	0.4

⁴⁰K (×10⁻¹⁰ mol/g)	2,093	2,105	2,055	2,099	2,030
⁴⁰Ar (×10⁻¹⁰ mol/g)	6.015	6.010	6.030	6.005	6.020	6.018


p	0.55	0.6	0.65	0.7	0.75	0.8	0.841	0.9	0.95	0.975	0.977	0.985	0.99

q	0.13	0.25	0.39	0.52	0.67	0.84	1.00	1.28	1.64	1.96	2.00	2.17	2.33


	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20

4				10
5			6	11	17
6			7	12	18	26
7			7	13	20	27	36
8		3	8	14	21	29	38	49
9		3	8	14	22	31	40	51	62
10		3	9	15	23	32	42	53	65	78
11		3	9	16	24	34	44	55	68	81	96
12		4	10	17	26	35	46	58	71	84	99	115
13		4	10	18	27	37	48	60	73	88	103	119	136
14		4	11	19	28	38	50	62	76	91	106	123	141	160
15		4	11	20	29	40	52	65	79	94	110	127	145	164	184
16		4	12	21	30	42	54	67	82	97	113	131	150	169	190	211
17		5	12	21	32	43	56	70	84	100	117	135	154	174	195	217	240
18		5	13	22	33	45	58	72	87	103	121	139	158	179	200	222	246	270
19		5	13	23	34	46	60	74	90	107	124	143	163	183	205	228	252	277	303
20		5	14	24	35	48	62	77	93	110	128	147	167	188	210	234	258	283	309	337


	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20

4			6	11
5		3	7	12	19
6		3	8	13	20	28
7		3	8	14	21	29	39
8		4	9	15	23	31	41	51
9		4	9	16	24	33	43	54	66
10		4	10	17	26	35	45	56	69	82
11		4	11	18	27	37	47	59	72	86	100
12		5	11	19	28	38	49	62	75	89	104	120
13		5	12	20	30	40	52	64	78	92	108	125	142
14		5	13	21	31	42	54	67	81	96	112	129	147	166
15		6	13	22	33	44	56	69	84	99	116	133	152	171	192
16		6	14	24	34	46	58	72	87	103	120	138	156	176	197	219
17		6	15	25	35	47	61	75	90	106	123	142	161	182	203	225	249
18		7	15	26	37	49	63	77	93	110	127	146	166	187	208	231	255	280
19		7	16	27	38	51	65	80	96	113	131	150	171	192	214	237	262	287	313
20	1	7	17	28	40	53	67	83	99	117	135	155	175	197	220	243	268	294	320	348

44	51	79	65	27	31	4	355	22	352	287	7	287	339	0
276	342	355	334	296	7	17	351	349	37	339	40	324	325	334


	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20

4				10
5			6	11	17
6			7	12	18	26
7			7	13	20	27	36
8		3	8	14	21	29	38	49
9		3	8	14	22	31	40	51	62
10		3	9	15	23	32	42	53	65	78
11		3	9	16	24	34	44	55	68	81	96
12		4	10	17	26	35	46	58	71	84	99	115
13		4	10	18	27	37	48	60	73	88	103	119	136
14		4	11	19	28	38	50	62	76	91	106	123	141	160
15		4	11	20	29	40	52	65	79	94	110	127	145	164	184
16		4	12	21	30	42	54	67	82	97	113	131	150	169	190	211
17		5	12	21	32	43	56	70	84	100	117	135	154	174	195	217	240
18		5	13	22	33	45	58	72	87	103	121	139	158	179	200	222	246	270
19		5	13	23	34	46	60	74	90	107	124	143	163	183	205	228	252	277	303
20		5	14	24	35	48	62	77	93	110	128	147	167	188	210	234	258	283	309	337


	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20

4			6	11
5		3	7	12	19
6		3	8	13	20	28
7		3	8	14	21	29	39
8		4	9	15	23	31	41	51
9		4	9	16	24	33	43	54	66
10		4	10	17	26	35	45	56	69	82
11		4	11	18	27	37	47	59	72	86	100
12		5	11	19	28	38	49	62	75	89	104	120
13		5	12	20	30	40	52	64	78	92	108	125	142
14		5	13	21	31	42	54	67	81	96	112	129	147	166
15		6	13	22	33	44	56	69	84	99	116	133	152	171	192
16		6	14	24	34	46	58	72	87	103	120	138	156	176	197	219
17		6	15	25	35	47	61	75	90	106	123	142	161	182	203	225	249
18		7	15	26	37	49	63	77	93	110	127	146	166	187	208	231	255	280
19		7	16	27	38	51	65	80	96	113	131	150	171	192	214	237	262	287	313
20	1	7	17	28	40	53	67	83	99	117	135	155	175	197	220	243	268	294	320	348

starting position

ln[A_∘∕F_∘]

ln[M_∘∕F_∘]

44	51	79	65	27	31	4	355	22	352	287	7	287	339	0
276	342	355	334	296	7	17	351	349	37	339	40	324	325	334


	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20

4				10
5			6	11	17
6			7	12	18	26
7			7	13	20	27	36
8		3	8	14	21	29	38	49
9		3	8	14	22	31	40	51	62
10		3	9	15	23	32	42	53	65	78
11		3	9	16	24	34	44	55	68	81	96
12		4	10	17	26	35	46	58	71	84	99	115
13		4	10	18	27	37	48	60	73	88	103	119	136
14		4	11	19	28	38	50	62	76	91	106	123	141	160
15		4	11	20	29	40	52	65	79	94	110	127	145	164	184
16		4	12	21	30	42	54	67	82	97	113	131	150	169	190	211
17		5	12	21	32	43	56	70	84	100	117	135	154	174	195	217	240
18		5	13	22	33	45	58	72	87	103	121	139	158	179	200	222	246	270
19		5	13	23	34	46	60	74	90	107	124	143	163	183	205	228	252	277	303
20		5	14	24	35	48	62	77	93	110	128	147	167	188	210	234	258	283	309	337


	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20

4			6	11
5		3	7	12	19
6		3	8	13	20	28
7		3	8	14	21	29	39
8		4	9	15	23	31	41	51
9		4	9	16	24	33	43	54	66
10		4	10	17	26	35	45	56	69	82
11		4	11	18	27	37	47	59	72	86	100
12		5	11	19	28	38	49	62	75	89	104	120
13		5	12	20	30	40	52	64	78	92	108	125	142
14		5	13	21	31	42	54	67	81	96	112	129	147	166
15		6	13	22	33	44	56	69	84	99	116	133	152	171	192
16		6	14	24	34	46	58	72	87	103	120	138	156	176	197	219
17		6	15	25	35	47	61	75	90	106	123	142	161	182	203	225	249
18		7	15	26	37	49	63	77	93	110	127	146	166	187	208	231	255	280
19		7	16	27	38	51	65	80	96	113	131	150	171	192	214	237	262	287	313
20	1	7	17	28	40	53	67	83	99	117	135	155	175	197	220	243	268	294	320	348