Category Archives: Statistics

R

February 28, 2025 Adrian

I've recently discovered how useful R is. Compared to Excel or Sheets it's a joy to use, especially if you're on Linux. Installing it is the tricky bit, as there are a lot of dependencies. I'm on Linux Mint (Ubuntu/Debian), so:

sudo apt-get install -y r-base-core libxml2-dev libcurl4-openssl-dev libssl-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libharfbuzz-dev libfribidi-dev

I know. There's a lot. But all of that will let you do the following and more...

First, get this public csv file:

worksheet1 Download

In the folder with your new csv file, enter R with:

R

Then from within this new R session, type the following to import your csv file:

work1 = read.csv("worksheet1.csv", header = TRUE)

Now we'd like to see a histogram of the "Wellbeing" column, so:

hist(work1$Wellbeing)

Statistics

Exams

June 11, 2020 Adrian

I had my statistics exam at the beginning of the week.

This exam was weirder than most. Because of the pandemic, the OU had decided to turn the usual three-hour sit down exam into an "end-of-module" assignment that could be done at home within a period of 24 hours.

As soon as I heard this news, I had mostly negative feelings. The exam at the end of a 9-month module is a chance to really show off what you've learned. In three hours you have to recall, at speed, a very large assortment of problem solving skills and take full advantage of nurtured intuition. Some people can just stroll into an exam and do well, but I need to work very hard to walk into that exam hall with any confidence. Preparing for these exams for me is like training for a marathon, or a mountain climb. It's exhausting.

I start attempting past papers under exam conditions on Saturdays and Sundays four to five weeks before exam day. Then the week before the exam I take a whole week off work to spend practically ten days straight doing past papers, marking them harshly, then reviewing them and revising further.

By the time I arrive at the exam in the exam hall in June, those three hours feel exactly like I'm running that marathon or climbing that mountain I've been training for.

Walking out of an exam feeling like you were prepared and knowing it's all over is an enormous feeling. The final punctuation of nine months' hard work.

Hearing that that wasn't happening this year was a let down. I'd be denied completing my marathon.

Though despite the fact that the "exam" was to be completed at home, I trained just the same. To the point where I felt I couldn't have been more prepared. I was comfortable and determined to complete the at-home exam in three hours regardless of how long I was given.

However.

On the day, I downloaded the exam pdf. I scrolled through it. And I realised that they had changed the distribution of the questions in the sections just enough that I was not prepared for it in the way that I was hoping. For the past seven years of past papers, you could guarantee certain topics would appear.

Not here.

For the past seven years, you could guarantee that within each topic, you'd be given a certain set of sub-topics.

Not here.

Immediately I was glad that I was not running my marathon. If I had been, I would have had to be stretchered away from the starting line by medics.

That day was a battle. Over the entire course of the exam (which took way way longer than three hours) I thought "how was I not prepared for this?". It shook me, and it was the only thing I could think about.

Here, in the third and final stage of this degree, I may have found that there is something fundamentally wrong with the way I learn.

Being kind to myself, this was generally a hard exam. It was statistics, which by its nature is non-intuitive (a lot of people find it so anyway). I did think that this module was aimed at students studying an actual degree in (just) statistics, and that I probably didn't have the background knowledge that other stats students did. And as my mathematician friend has pointed out, it's unlikely it was hard for just me. If an exam is hard, it's generally hard for everyone.

So where do I go from here? It's difficult isn't it. Amongst those 500 or so pages I learn from, should I pay attention and make notes on 'the fleeting comments on page 274 that I never got tested on once and seemed insignificant'? ......Regardless, it seems my revision technique as it stands isn't sufficient.

Assuming I will be in an actual physical exam hall for three hours in June 2021 for my complex analysis exam, I need a better revision strategy.

Critique, Statistics

Statistics Assignment 3

May 27, 2020 Adrian

So I've finally completed all my assignments! I've just had the very last one returned to me and again, although I did well, there are very obviously some areas for improvement:

Models for Populations

Very small thing here, but when asked to describe the shape of the age-specific death rate, it's important to describe it in terms of the rate (of change).

Genetics

Wow. My weakest area by a long way here... It'd be wise for me to avoid any exam questions on genetics... But for the moment, let's examine what I did wrong to try and understand the assignment questions better at least. (It's mainly an issue with conditional probabilities).

Proportions

When calculating the probabilities of genotype combinations of parents, if you're given the genotype of one parent you don't need to use it in the calculation! eg: Despite the proportion of a genotype in a population being 0.2, if you're given the genotype of one parent, then the chance of them being that genotype is 1.0, not 0.2! Making a mistake like this obviously has a knock-on effect on working out probabilities for the children's genetics, insofar as the probabilities of the children will be incorrect too.

But I compounded my issue with the children. It took me a while to review the next bit to work out where I went wrong, but here we go...

When working out the parent-child genetics, you start by working out two sets of probabilities:

The probability of the parents being certain combinations of genotype (easy in this specific case, as the probability of one parent is 1.0). The probability of the parent of unknown genotype follows from the Hardy-Weinberg law. We'll call the probability of all the mating types $P(E_{i})$ .
The offspring probabilities, which follow from Mendel's first law. Though we'll talk about them in terms of phenotype, so $P(\text{Hilary M})$ means the probability of Hilary being of phenotype M.

So the next question asked just that: What is the probability of Hilary being phenotype M (which was just one genotype "MM")?

This question I managed to get correct based on my initial incorrect probabilities of the parents, but it's important to explain it for the next question. So it turns out that:

$P(\text{Hilary M}) = \sum^{\text{3}}_{i=1} P(\text{Hilary M}| E_{i})P(E_{i})$

So you multiply each offspring probability, (the probability of Hilary being phenotype M given the mating type) with the associated mating type probability. And you sum them across all mating types. Easy.

But the next question was:

Calculate the probability that sisters Hilary and Jane both have phenotype M. This was the bit that I got completely wrong that took me a while to review. I ended up squaring the result I got from the last question. Very not correct. 🙁 From the above, we know we start with:

$P(\text{Hilary M and Jane M})$

$= \sum^{\text{3}}_{i=1} P(\text{Hilary M and Jane M}| E_{i})P(E_{i})$

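and, because the sisters' genotypes are independent of each other once the mating type $E_{i}$ is known, it turns out: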

$= \sum^{\text{3}}_{i=1} P(\text{Hilary M}| E_{i})P(\text{Jane M}| E_{i})P(E_{i})$

Which suddenly makes it all very very clear. I suppose this goes to show that when you come across something convoluted, it's worth taking extra time out to run through it in depth and make detailed notes on it. Doing so here would've paid off. I think the problem I have with genetics questions is that there are quite a number of ways in which these questions can be phrased.
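A minimal Python sketch of the two sums, using made-up numbers rather than the assignment's: the allele frequency `p` and the known parent's genotype (MN) are assumptions chosen only to show why squaring the single-child answer goes wrong.

```python
from fractions import Fraction

# Hypothetical numbers, not the assignment's: allele M has frequency p in the
# population and the known parent is assumed to be genotype MN.
p = Fraction(3, 5)
q = 1 - p

# Hardy-Weinberg proportions for the unknown parent give P(E_i) for the
# three mating types MN x MM, MN x MN and MN x NN.
mating_probs = [p * p, 2 * p * q, q * q]

# Mendel's first law: P(child is MM | E_i) for the same three mating types.
child_mm = [Fraction(1, 2), Fraction(1, 4), Fraction(0)]

# One child: sum over mating types of P(child MM | E_i) * P(E_i).
p_hilary = sum(c * e for c, e in zip(child_mm, mating_probs))

# Two sisters: they are independent only *given* the mating type, so the
# product goes inside the sum.
p_both = sum(c * c * e for c, e in zip(child_mm, mating_probs))

print(p_hilary)        # 3/10
print(p_both)          # 3/25
print(p_hilary ** 2)   # 9/100, the squaring mistake
```

With these assumed numbers the correct joint probability (3/25) is larger than the squared single-child probability (9/100), because sharing the same parents makes the sisters' phenotypes positively related.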

Writing Conditional Probabilities

Well this went really wrong. This is probably my weakest area, and is related to the above slip-ups in the questions with Hilary and Jane.

"Show that the proportion of male offspring for the second mating that you should expect to have plain wings (gene contains dominant allele A) is $\frac{3}{4}$ ."

Here, I wrote the definition incorrectly, but calculated the correct result. Kind of double-bad. 🙁 Here, I wrote:

P(male A)
(which is the joint probability of a male having the allele A)

When I should have written:
P(A | male)
(the conditional probability of offspring having the allele A given that they're male.)

The Hardy-Weinberg Law

A lengthier title to this subsection would be: "When to calculate the proportions of subsequent generations of a certain type using Hardy-Weinberg, and when to use your own table of probabilities".

As above, the table of probabilities includes the probabilities of the parents of certain types mating, and the probabilities of the associated offspring genotypes.

The question:

"One male and one female are chosen at random from all the offspring of the mating, and are themselves mated. What is the proportion of female offspring of the second mating to have a dominant allele?"

In this case, there were two genotypes which had a dominant allele, AA and Aa. But how do I parse this question? This question is asking about grandchildren of the initial parents! It's also asking about "proportion" which hints that I should be using Hardy-Weinberg proportions. Turns out not. It seems that you can only use the Hardy-Weinberg law when you're given the proportion of three genotypes of a starting generation.

So what are we left with?

P(AA | female) + P(Aa | female)

Which in this case is equivalent to:

$\sum^{\text{4}}_{i=1} P(\text{female AA}| E_{i})P(E_{i}) + \sum^{\text{4}}_{i=1} P(\text{female Aa}| E_{i})P(E_{i})$

Notice how this differs from the sum in the last section (the Hilary and Jane example), because there's no assumption made about them both having the same father.

The last related one that tripped me up was:

"What proportion of dominant-alleled females in this second mating would you expect to be AA?"

Again, I used the Hardy-Weinberg law to calculate this, when I should've been using conditional probability.

So it seems I needed to go through the process of parsing the question, and translating it into stats language: "What's the probability of offspring being genotype AA given that they're a female with a dominant allele?". The probability we require here is:

P(AA | dominant allele female)

Using the standard, straight-forward rule for conditional probability I learned in my first section back in September, this is equivalent to:

$\frac{P(AA \cap \text{dominant allele female})}{P(\text{dominant allele female})}$

What's the numerator here? The probability of being AA and a dominant-allele female? Well yeah, AA is dominant, we know that. So this is just the probability of being AA and female:

$\sum^{\text{4}}_{i=1} P(\text{female AA}| E_{i})P(E_{i})$

It's just one part of the previous question.

Then what's the denominator? The probability of being (proportion of) a dominant-allele female generally? So AA female and Aa female? Well that was the actual answer to the last question!
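The numerator-over-denominator step can be sketched in Python with a small table of mating types. The set-up here is hypothetical and is not the assignment's: it assumes a first mating of Aa x Aa, a gene that is not sex-linked, and sex independent of genotype.

```python
from fractions import Fraction
from itertools import product

# Hypothetical set-up (not the assignment's): the first mating is Aa x Aa,
# the gene is not sex-linked, and sex is independent of genotype.
first_gen = {"AA": Fraction(1, 4), "Aa": Fraction(1, 2), "aa": Fraction(1, 4)}

def allele_a_prob(genotype):
    """Probability that a parent of this genotype passes on allele A."""
    return Fraction(genotype.count("A"), 2)

# Table of mating types E_i (father, mother) with their probabilities, and
# the offspring genotype probabilities that follow from each.
offspring = {"AA": Fraction(0), "Aa": Fraction(0), "aa": Fraction(0)}
for (dad, p_dad), (mum, p_mum) in product(first_gen.items(), repeat=2):
    p_mating = p_dad * p_mum
    a_dad, a_mum = allele_a_prob(dad), allele_a_prob(mum)
    offspring["AA"] += p_mating * a_dad * a_mum
    offspring["Aa"] += p_mating * (a_dad * (1 - a_mum) + (1 - a_dad) * a_mum)
    offspring["aa"] += p_mating * (1 - a_dad) * (1 - a_mum)

p_female = Fraction(1, 2)
# Numerator: P(AA and female).
p_aa_and_female = offspring["AA"] * p_female
# Denominator: P(dominant allele and female), i.e. AA or Aa.
p_dominant_and_female = (offspring["AA"] + offspring["Aa"]) * p_female

print(p_aa_and_female / p_dominant_and_female)   # 1/3
```

Under these assumptions the factor for being female cancels, which is why the denominator is just the answer to the previous kind of question.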

So that's it. There's a lot of parsing that needs to be done generally:

Have I been given proportions? Use Hardy-Weinberg.
No proportions? Use a table of parents and offspring probabilities.
What am I given, what don't I have to calculate?
What are they asking me, is the probability conditional?
If it's conditional, I can separate it out but then I need to parse what each of these new probabilities mean.

Armed with this little checklist, I may have done a bit better in my genetics questions!

General Stuff

Range

If your answer is an equation in terms of x, always state the range of possible values of x:

$Q(x) =1-\frac{x^{2}}{100},\:\:\:\: 0\leq x < 10$

Variance

Annoying oversight here. When stating the variance of the lifetime of something was 42.92 months, I should've said it was 42.92 $\text{months}^{2}$ . Not often you think of months-squared, but here, it's relevant. Variance!

Log and Ln

Concentrate when typing one or the other into your calculator. There's a big difference, people... Thankfully I only slipped up once here.

And that's it! Now it's just revision time until my exam on the 8th of June. Of course, due to our new friend covid-19, I'll be taking my exam at home which will be a bit weird. Plenty to revise though, so I'll get started...

Statistics

Statistics in the Media

April 1, 2020 Adrian

Great to see even the smallest amount of proper stats reporting.

This blog post from The Guardian actually mentions a reduction in the reproduction number (from 2.6 to 0.62), the 'R0' in my equations below. Meaningful and super-useful.

Conversely, this kind of reporting from the BBC on the symptom tracker app needs to be thrown in the sea. Completely meaningless summary of a potentially very important survey. Rubbish.

Statistics

COVID-19 Stats

March 22, 2020 Adrian

Update to this. UK government have finally put this out. Finally some decent stats!

Also, here's the mobile version, but the other one is better.

Statistics

Covid-19 Coronavirus

March 17, 2020 Adrian

So this virus outbreak thing is interesting isn't it?

As a mathematician-in-training, what I'm finding more interesting is the lack of good statistics on what's happening. The UK government were initially publishing new infections within England and in the whole of the UK. In addition, they were publishing the total UK infected, and total UK deaths. They stopped reporting anything on March 5th.

Though as well as keeping track of government announcements, I've also been keeping record of daily updates that the World Health Organisation (WHO) have been making.

Here's a link to their European map.

Here's a link to their global map.

On a daily basis, I've been recording UK totals from their site. You can see them graphed below:

You can't see it, but the values for the first 9 days are just two people.

Though see how there's a dip on March 16th? How could there possibly be a dip in cases for one day by over 100 people? It seems even the pros can get their numbers wrong. Bad work, WHO. 🙁

Also, you'll notice from around March 5th to March 8th, it goes flat. These are days on which I didn't check the figures.

From these total cases per day, I've calculated new cases per day.

See that negative number of new cases on March 16th? Insert eyeroll emoji.

There's also a spike on March 9th, this is just total new cases from March 5th to March 9th. -catching up from when I didn't check on the totals on these days.

But as you can see, as predicted, this initial growth is basically exponential. Fairly typical for an epidemic apparently.

So the big question is... what next?! -and here's where my mathematics and statistics study comes in handy... I've recently finished a section on epidemics! What better time to apply some of my learning!

First off, it's worth mentioning that all the epidemic modelling I've learned about assumes homogeneous mixing. This is the kind of mixing that occurs in a family home. ie: there's more or less an equal chance of me having contact with one person as there is with anyone else. In real life this, of course, isn't true. Living in London, I'm less likely to be in contact with someone in Edinburgh than I am someone I travel with on the London Underground every day. Also important point: homogeneous mixing means no quarantining. So all of the below is essentially average worst-case.

So with the proviso that these results will probably (more than likely) be wildly inaccurate, let's get started! We need a few important numbers, some of which we've already got:

We need the starting number of infected (y0), in this case 2.
We need the starting number of non-infected (x0), that's the total population of the UK minus 2. I decided to estimate the current population at 67.44 million.
We need the epidemic number, ρ. This is calculated in the following way:

$\rho =\frac{n\gamma}{\beta}$

Where:

n is the total population (67.44 mil).
γ is the rate parameter of the exponential distribution that describes the recovery time of the virus, so the mean recovery time is 1/γ. (It's random).
β is the parameter in the Poisson distribution that describes the mean contact rate (per day). (Also random).

Which is fine, but how do we know what γ and β are? Well there's a number called R0 ("R-naught") that I've NOT been taught about that represents the contagiousness of a virus. Interestingly, I've found two completely opposed descriptions of this number:

Though we all know how reliable wikipedia is, and I've just found this Stanford paper which supports the towardsdatascience description:

$R_{0}=\frac{\beta}{\gamma}$

Therefore:

$\rho =\frac{n}{R_{0}}$

Hooray. Imperial College London seems to estimate the R0 of COVID-19 to be 2.4, so let's run with that.

$\rho =\frac{67440000}{2.4}=28,100,000$

Max Number Of Infectives At One Time

Once we have all this, we can work out the maximum number of infectives at any one time. ie: the peak of the infection in the population, ymax:

$y_{\text{max}} = y_{0} + x_{0} -\rho -\rho\:\ln\left(\frac{x_{0}}{\rho}\right)$

Now we can plug all the numbers in to find ymax! The log in this formula is the natural log (ln), not log base 10. So:

$y_{\text{max}} = 2 + 67439998 -28100000 -28100000\times\:\ln\left(\frac{67439998 }{28100000 }\right)$

Hence:

$y_{\text{max}} \approx 14,739,000$

Which is 21.9% of the population infected at one time! Ouch! At least you'll know that if we (in our theoretical UK) hit 14.7 million with no quarantining, we'd be at the peak of the outbreak.

Other sources state that COVID-19 actually has a range of R0 values, 1.4-3.8-ish, so the band of possible outcomes without quarantining is actually quite broad. But from this it's possible to work out a best-case/worst-case comparison:

An R0 of 1.4 would mean a max of 3.1 million (4.5% of the population), and an R0 of 3.8 would mean a max of 26.0 million (38.6% of the population) at one time.
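As a rough check, the peak formula can be sketched in Python. The function name `peak_infectives` is made up for illustration, and the logarithm is read as the natural logarithm (`math.log`), which is what the standard derivation of this formula uses; a base-10 logarithm would give noticeably larger figures.

```python
import math

def peak_infectives(n, r0, y0=2):
    """Peak number of simultaneous infectives under homogeneous mixing."""
    rho = n / r0       # epidemic parameter, rho = n / R0
    x0 = n - y0        # initial number of non-infected people
    # math.log is the natural logarithm (ln); math.log10 would be base 10.
    return y0 + x0 - rho - rho * math.log(x0 / rho)

n = 67_440_000         # estimated UK population used in the post
for r0 in (1.4, 2.4, 3.8):
    peak = peak_infectives(n, r0)
    print(f"R0 = {r0}: peak of about {peak / 1e6:.2f} million ({peak / n:.1%})")
```

Looping over the three R0 values gives the best-case, central and worst-case peaks in one go, all under the same no-quarantine assumption.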

Number Of People Not Affected

The following assumes that the whole debacle is over. Everyone that has caught the virus from it has now recovered. How many people were not affected?

This is found using the following iteration formula:

$x_{\infty,j+1}=x_{0}\:\text{exp}\:\left(\frac{x_{\infty,j}-\left(x_{0}+y_{0}\right)}{\rho}\right),\:\:\: j=0,1,2,\:\ldots$

Initially $x_{\infty,0}$ is zero, and you use your result $x_{\infty,j+1}$ to calculate $x_{\infty,j+2}$, and so on. This eventually settles down to the number of people not affected!

So using the power of spreadsheets, and not taking up the space here with columns and columns of numbers:

R0 of 1.4 would mean 32.98 million people are not affected.
R0 of 2.4 would mean 8.19 million people are not affected.
R0 of 3.8 would mean 1.66 million people are not affected.
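The spreadsheet iteration can also be sketched in a few lines of Python. The function name `never_infected` and the stopping tolerance of half a person are assumptions for illustration; the update step is the iteration formula above, started from zero.

```python
import math

def never_infected(n, r0, y0=2, tol=0.5, max_iter=10_000):
    """Iterate x_{j+1} = x0 * exp((x_j - (x0 + y0)) / rho), starting at 0."""
    rho = n / r0
    x0 = n - y0
    x = 0.0
    for _ in range(max_iter):
        x_next = x0 * math.exp((x - (x0 + y0)) / rho)
        if abs(x_next - x) < tol:   # successive values differ by < half a person
            return x_next
        x = x_next
    return x

n = 67_440_000
for r0 in (1.4, 2.4, 3.8):
    print(f"R0 = {r0}: about {never_infected(n, r0) / 1e6:.2f} million never infected")
```

The loop stops once successive values barely move, which is the same "settling down" you would watch for in a spreadsheet column.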

You can imagine that an increase in quarantining means a lower R0. Seems that could have a big effect.

#ImNotAStatisticianButItsStillFunLookingAtNumbers

Critique, Statistics

Statistics Assignment 2

March 17, 2020 Adrian

Two thirds of the way through my assignments!

Again, fairly happy with the mark I received for this, but there were some aspects of this assignment I found challenging, and some where I thought I might've done quite well on, but slipped up in some way.

Let's cover some areas here:

Finding A Real-World Process

For these questions I had to find a real-world process that could be modelled with the given mathematical objects/processes. Kind of the opposite of a mathematical modelling problem.

I found these tasks really difficult. What I found to be the worst aspect about getting this kind of question wrong is that it's not necessarily my understanding of the mathematical process that's flawed. I feel in each of these cases, I did my best to find a real world example, knowing that the example I gave, itself, was slightly flawed. So despite the fact that I can perfectly explain each mathematical process, I couldn't explain how each could be applied to a real world process so lost marks.

The two models were the Galton-Watson branching process, and the simple random walk (specifically, a particle executing a simple random walk on the line with two absorbing barriers).

The two typical examples that are referred to in my texts are genetics and mutations for the Galton-Watson branching process:

"A mutation is a spontaneous transformation of a particular gene into a different form, and this can occur by chance at any time... The mutant gene becomes the ancestor of a branching process, and geneticists are particularly interested in the probability that the mutation will eventually die out."

For the simple random walk, the example of the "gambler's ruin" was given. Imagine two people with £10 each, each of them betting on an event. If one of the two loses the bet, they give £1 to the other (the random walk on the line). If one of them runs out of money, then they lose (one of the "absorbing barriers" is hit).
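The gambler's ruin set-up is easy to simulate. This is only a sketch: the function name, the fixed seed and the number of games are arbitrary choices for illustration, and the bets are assumed to be fair (each player wins a round with probability 0.5).

```python
import random

def first_player_ruined(stake=10, total=20, p_win=0.5, rng=random):
    """Simulate one game; return True if the first player ends with nothing."""
    money = stake
    while 0 < money < total:        # 0 and total are the absorbing barriers
        money += 1 if rng.random() < p_win else -1
    return money == 0

rng = random.Random(1)              # fixed seed so the run is repeatable
games = 2000
ruin_rate = sum(first_player_ruined(rng=rng) for _ in range(games)) / games
print(ruin_rate)                    # close to 0.5 for a fair game
```

With equal stakes and fair bets, symmetry says each player is ruined half the time, so the simulated rate should hover around 0.5.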

In coming up with answers, I could've used Google, but that would've been cheating. However, now I've completed the assignment and received my grade, Google is my best friend in finding suitable answers here...

Seems you can use the Galton-Watson branching process to determine the extinction of a family name, and I found a good example of a random walk with absorbing barriers in this MIT paper, featuring a little flea called Stencil. It discusses the probability of him falling over the Cliff of Doom in front of him, or the Pit of Disaster behind him.
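For the family-name example, the probability of eventual extinction in a Galton-Watson process is the smallest non-negative solution of s = G(s), where G is the p.g.f. of the offspring distribution. A minimal Python sketch, with a made-up offspring distribution and a hypothetical function name:

```python
def extinction_probability(offspring_probs, tol=1e-12, max_iter=100_000):
    """Smallest root of s = G(s), found by iterating s -> G(s) from s = 0."""
    def pgf(s):
        # G(s) = sum over k of P(k offspring) * s^k
        return sum(p * s ** k for k, p in enumerate(offspring_probs))

    s = 0.0
    for _ in range(max_iter):
        s_next = pgf(s)
        if abs(s_next - s) < tol:
            return s_next
        s = s_next
    return s

# Hypothetical offspring distribution: 0, 1 or 2 sons carrying the name
# with probabilities 1/4, 1/4 and 1/2 (mean 1.25 per generation).
print(extinction_probability([0.25, 0.25, 0.5]))   # approaches 0.5
```

For this assumed distribution the equation s = 1/4 + s/4 + s²/2 has roots 0.5 and 1, and the iteration from zero settles on the smaller one.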

Concluding An Answer

A couple of my answers here and there were classed as being incomplete. Generalising each case:

1)
Upon finding that an answer resembles a certain construction (a probability distribution function, cumulative distribution function or generating function), as well as saying which distribution the function belongs to, you should also explicitly state the variables that appear in it. Even to anyone non-mathematical, it would be obvious to see that the variables in the general case are associated with the specific answer you arrived at. Though for assignments (and exams, presumably) this is not enough. If a general function has variables, explicitly state what each one is in your answer.

eg: the p.g.f. of the modified geometric distribution is

$\frac{a\:-\:bs}{c\:-\:ds}$

If your answer resembles this, say what a, b, c and d are.

In addition, if your answer is a probability (or set thereof), include a statement describing them.

Note that one whole mark can be deducted for an insufficient conclusion (apparently).

2)
Don't forget your definitions.

Specifically:

To calculate the variance of the position of a particle (along the random walk line) after n steps, you can just sum the variances of each step. However this only works because each individual step is independent of the last (one of the properties of the random walk). Due to the fact that I didn't mention this definition of the variance of a particle in a random walk, I lost half a mark. Not massive, but where you can mention a definition, mention it.
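A quick numerical check of the "sum the step variances" rule, as a Python sketch. It assumes an unrestricted walk (no barriers) starting at 0 with steps of +1 or -1; the values of `n` and `p` and the function name are made up for illustration.

```python
from math import comb

def position_variance(n, p):
    """Variance of position after n steps of +1 (prob p) or -1 (prob 1 - p)."""
    q = 1 - p
    # k steps to the right puts the particle at position 2k - n.
    pmf = {2 * k - n: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}
    mean = sum(x * pr for x, pr in pmf.items())
    return sum((x - mean) ** 2 * pr for x, pr in pmf.items())

n, p = 10, 0.6                          # hypothetical values
step_variance = 1 - (2 * p - 1) ** 2    # E(Z^2) - E(Z)^2 = 4pq for one step
print(position_variance(n, p), n * step_variance)
```

The two printed numbers agree (up to floating-point rounding) only because the steps are independent, which is the property that needs stating in the answer.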

Different Routes In A Markov Chain

I struggled with this, and although I arrived at the correct answer, the method I had used was entirely wrong (and also a little inelegant).

In this question, I covered all routes separately and so had a small handful of different probability calculations. Though when considering potential routes in a Markov chain, you can consider all routes simultaneously by taking advantage of something called an absolute probability (of the Markov chain being in a particular state at a particular time), given an initial distribution. (for my own reference this is covered in Book3, Subsection 11.2, p.87. And the handbook, p.23 item 17).
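The idea of handling every route at once can be sketched in Python: multiply the initial distribution (as a row vector) by the transition matrix once per step. The three-state chain below is entirely made up and the function names are hypothetical.

```python
def step(dist, transition):
    """One step of the chain: row vector times transition matrix."""
    size = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(size))
            for j in range(size)]

def absolute_probabilities(initial, transition, n):
    """Distribution over states after n steps, given the initial distribution."""
    dist = list(initial)
    for _ in range(n):
        dist = step(dist, transition)
    return dist

# Hypothetical 3-state chain; each row sums to 1.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
p0 = [1.0, 0.0, 0.0]                      # start in the first state

print(absolute_probabilities(p0, P, 2))   # [0.375, 0.5, 0.125]
```

Each entry of the result already sums over every route that ends in that state, so no route needs to be listed separately.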

Arbitrary Constants

Does it matter if an arbitrary constant is positive or negative? (my ref: Q6a). I previously thought not. In this instance my constant in an integral calculation absorbed the negative sign that was in front of it. After all, a negative general constant is still a general constant, right? Well I lost half a mark here because of the absorption, and it's not currently clear why. I've asked my tutor, and I'll update it on here once I hear back from him.

Critique, Statistics

Statistics Assignment 1

December 2, 2019 Adrian

Very happy with the high mark I achieved for this first assignment. Though as usual, there's a decent amount to improve. Let's start by looking at some of the more major things my tutor pointed out.

Variance

First thing is variance. How do you calculate it? Well it turns out there are a couple of ways. I just decided to use the most cumbersome way...

Calculating the mean $\mu$ (expectation $E(X)$ ) is easy. Multiply each number with its probability and sum them all:

$\mu = E(X) = \sum x\: p(x)$

The variance can then be calculated in one of two different ways:

$\sigma^{2}=E[(X-\mu)^{2}]=\sum (x-\mu)^{2}\: p(x)$
or
$\sigma^{2}=E(X^{2})-(E(X))^{2}=\left(\sum x^{2}\: p(x)\right)-\mu^{2}$

When they're written out like this, it's fairly obvious to see which method is more like the method used to calculate the mean and as such would be far less hassle. ( $x$ is an integer and $p(x)$ and $\mu$ can be reals/rationals).
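Both routes to the variance can be compared in a short Python sketch. The p.m.f. here is made up purely for illustration.

```python
from fractions import Fraction

# Hypothetical p.m.f.: values of x with their probabilities p(x).
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Mean: multiply each value by its probability and sum.
mu = sum(x * p for x, p in pmf.items())

# Route 1: the definition, E[(X - mu)^2].
var_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Route 2: the shortcut, E(X^2) - (E(X))^2.
var_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2

print(mu, var_definition, var_shortcut)   # 1 1/2 1/2
```

The shortcut reuses the same multiply-and-sum pattern as the mean, just with x squared, which is why it is usually less work by hand.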

Multivariate Poisson Process

This was the question in which I lost the most marks:

Customers arrive at a shoe shop according to a Poisson process with a rate of 20 per hour.
15% of customers buy men's shoes.
60% buy women's shoes.
25% buy children's shoes.
Calculate the probability that exactly eight customers arrive in half an hour, exactly three of whom wish to purchase children's shoes.

This is such a typical mistake for me to make in statistics. I'm sure I've made this kind of mistake before...

What I ended up doing was working out the probability of the number of customers being 8 using the Poisson distribution's probability function. This was fine.

Then I used the same probability function to find the probability of 3 people wanting to buy children's shoes and multiplied them together. Wrong. At this point I needed to find the conditional probability that of the 8 customers, 3 bought children's shoes. Hence, here I shouldn't have used the Poisson distribution, I should've used the Binomial distribution instead. ie: the number wanting children's shoes out of the 8 follows a Binomial distribution with n = 8 and p = 0.25, evaluated at 3.
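The corrected calculation can be sketched in Python using only the numbers in the question. The helper names are made up, and the cross-check relies on the standard result that splitting a Poisson process by customer type gives independent Poisson processes; it is not taken from the assignment's model answer.

```python
from math import comb, exp, factorial

def poisson_pmf(k, mean):
    """P(N = k) for a Poisson distribution with the given mean."""
    return exp(-mean) * mean ** k / factorial(k)

def binomial_pmf(k, n, p):
    """P(K = k) for a Binomial(n, p) distribution."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

rate_per_hour, hours = 20, 0.5
mean_arrivals = rate_per_hour * hours   # 10 customers expected in half an hour

# P(8 arrive) * P(3 of those 8 want children's shoes | 8 arrive)
answer = poisson_pmf(8, mean_arrivals) * binomial_pmf(3, 8, 0.25)

# Cross-check: 3 children's-shoe customers and 5 others, treated as two
# independent Poisson counts with means 2.5 and 7.5.
check = poisson_pmf(3, 0.25 * mean_arrivals) * poisson_pmf(5, 0.75 * mean_arrivals)

print(round(answer, 4), round(check, 4))
```

Both routes give the same probability, which is a handy way to confirm that the Binomial step is the right conditional piece.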

Reflecting back on the question, the correct answer seems slightly more obvious now. Especially given the "...exactly three of whom..." part of the question. I struggle to be mindful of stuff like this in the moment of answering a stats question. I suppose this part of the "translating English into maths" issue comes with more practice...

Index Of Dispersion

Again, my issue here was to not observe subtleties in the question. Given information about the associated distributions, I was initially meant to calculate the mean and variance of the total number of books bought in 9 hours. I managed to get this first part right, but the second part of the question asked me to calculate the index of dispersion for "this process". It turns out that "this process" refers to the process in the main question generally and not the process of books being bought within 9 hours. In this instance, ignoring the total number of books bought in 9 hours (kind of) simplifies the answer too.

Other Issues

In this first assignment, I lost half a mark here and there for incorrect arithmetic. (GASP!). Upon completing my draft submission, instead of just reading through it, I should sit down and verify all my working. It will take more time, but if it scrapes 2 marks back, it could be worth it.

Another issue that occurred more than once was a lack of units when talking about rates of things happening. So there's a requirement to state " $\lambda=20$ per hour" instead of just " $\lambda=20$ ".

Statistics

End of Book 1 - Review of Understanding

October 21, 2019 Adrian

A quick look at the mind map I produced for the whole of Book 1, very much a foundation for the rest of the materials on the module.

The complexity has increased necessarily, but I'd like to think the clutter has been reduced to a minimum. I was rearranging nodes as I was adding them in the hope of reducing clutter. You'll notice there are two white floating boxes not connected to anything externally. This was just to avoid clutter.

Some things to note about it that help me refer back to it:

It's split roughly into thirds, vertically. Each third is more or less a sub topic. More like themes, perhaps. Far left is foundational principles. Middle relates to basics of distributions. Far right is concepts surrounding the C.D.F. and P.D.F. (the definitions of which are in the white box in the middle of the far-right section).

Colouring helps a lot when needing to refer back to it, I found. Axioms are pink, properties are green, and definitions are blue. I found that in referring back to it, I needed some kind of differentiation between the definition of a certain distribution, and a normal definition. So all the yellow nodes you see are definitions of distributions.

In summary, to assist my understanding, I now have a graph that leads me from the most basic concept (like what an "event" is), to the definition of the standard normal distribution. I'll be referring back to this as I go!

Maths, Statistics

The Royal Statistical Society

October 21, 2019 Adrian

I've just joined the Royal Statistical Society!

https://www.rss.org.uk

