
COVID-19 variants: statistical concerns

This piece draws heavily upon a piece published at NakedCapitalism. Pretty much all the references regarding epidemiological explanations and “on the ground” observations are there so in the interests of brevity (and my own schedule at the moment) I’ll simply give that as the main reference. I’ll put a few notes in regarding other issues though.


I’ve written before that stated preference (SP) data analysed using logit/probit models – examples of limited dependent variable models, so-called because the outcome isn’t continuous like GDP or blood pressure – are very hard to interpret [a]. Technically they have an infinite number of solutions: the mean and the variance on the latent scale cannot be separately identified. It is incumbent upon the researcher either to collect a second dataset, totally independent in nature from the first (so we now have two equations to solve for the two unknowns – mean and variance), or to use experience and common sense to settle on the most likely explanation (or a small number of likely ones). This is technically true of revealed preference data (actual observed decisions) too [b], and Covid-19 might be an unfolding, horrific example of where we are pursuing the “wrong” interpretation of the observed outcomes.
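This identification problem fits in a few lines of code. A minimal sketch, assuming a logistic link and invented numbers: the choice probability depends only on the ratio of the coefficient to the latent scale, so a “small effect, low noise” model and a “big effect, high noise” model are observationally identical.

```python
import math

def choice_prob(beta, x, sigma):
    """P(y = 1 | x) for a latent-variable logit: y* = beta*x + sigma*noise."""
    return 1.0 / (1.0 + math.exp(-beta * x / sigma))

x = 1.5
p_small = choice_prob(beta=2.0, x=x, sigma=1.0)  # modest effect, unit variance
p_big = choice_prob(beta=4.0, x=x, sigma=2.0)    # double the effect AND the noise
print(p_small, p_big)  # identical: the data alone cannot tell these models apart
```

Because only beta/sigma is identified, software conventionally fixes the variance term at 1 – which is exactly the assumption at issue throughout this piece.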

Background: What’s happened in various “high vaccination” countries so far?

In short, rates of Covid-19 initially dropped through the floor, typically in line with vaccination coverage, then started bouncing back. However, the previously strong correlation with hospitalisation and death did not re-appear. This is consistent with the fact that these are not “sterilising vaccines” – you can still catch Covid-19; the vaccine is just (largely) stopping the infection from playing havoc with your body.

Sounds like a step forward? Actually, without widespread adjunct interventions (good mask usage etc) to stop the spread in the first place, this is potentially very very bad. We’ve already seen variants arise. The Delta variant is causing increasing havoc, whilst Lambda is becoming dominant in South America. The Pfizer vaccine – which thanks to media failures was often touted as “the bestest evah” – seems particularly ill-equipped to deal with Delta. NC is covering this very well.

The bio-chemists and colleagues can give good explanations of WHAT is happening pharmacologically and epidemiologically to produce these variants. But recall the archetypal drunk who lost his keys on the way back from the pub: he’s looking for them only under the lamp-post, whilst they’re actually on the dark part of the road. If you can’t or won’t look in the right place, of course you won’t find the solution. This is what many experts are doing, and it is why Delta etc. could keep happening – at an increasing pace – and is perhaps the real story: one with roots in statistics.

What’s the possible statistical issue here?

Consider how medical statisticians (amongst others) typically think about discrete (infected/non-infected, or live/die) outcomes. As in the SP case, the {0,1} outcome is incapable of giving you a separate mean – “the average number of times a human, or particular subgroup, would get bad Covid-19 for a given level of exposure” – and variance – “the consistency of getting it for that given level of exposure”. If 80% of Covid sufferers at a given exposure level needed hospital care but only 20% do when vaccinated, analysts tend to conclude that the mean has gone down.

Suppose the “extreme opposite interpretation” (equally consistent with the observed data) is true? Suppose it’s a variance effect? So the vaccine is not really – on average – bringing the theoretical average hospitalisation rate down, or not by much anyway. It is simply “pushing what was a high-peaked thin mountain into a fat, low-altitude hill” in the vaccination function relating underlying Covid-19 status to observable key outcomes. Far more people are in the tails, with an emphasis on the “hey, now Covid is no big deal for me” end [c]. The odds of hospitalisation following vaccination go way down. However, if you look at subgroups, you’ll (if you’re experienced) spot a tell-tale giveaway: the pattern of odds ratios across subgroups by vaccination status is VERY SIMILAR TO BEFORE – they have all just (for instance) halved. This is a trick I’ve used with SP data for decades, and it more often than not reveals that the intervention has a variance effect. Fewer people are going to hospital if vaccinated, but their average tendency to get a bad bout is actually unchanged by vaccination (particularly if we add the confounding factor of TIME – Covid is changing FAST).
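The tell-tale pattern can be illustrated with a toy calculation (all subgroup names and numbers are invented for illustration). If vaccination doubled the latent variance rather than shifting subgroup means, every subgroup’s estimated log-odds-ratio shrinks by the SAME factor:

```python
# Hypothetical "true" subgroup log-odds-ratios, identical in both groups
# because -- in this scenario -- vaccination changes no means at all.
true_log_or = {"age<40": 0.8, "age 40-65": 1.2, "age>65": 2.0}
sigma_unvax, sigma_vax = 1.0, 2.0  # assumed latent scales: variance doubles

# What a logit fit recovers is beta/sigma, not beta itself.
estimated_unvax = {g: b / sigma_unvax for g, b in true_log_or.items()}
estimated_vax = {g: b / sigma_vax for g, b in true_log_or.items()}

for g in true_log_or:
    ratio = estimated_vax[g] / estimated_unvax[g]
    print(g, ratio)  # 0.5 for every subgroup: the signature of a pure scale effect
```

A uniform proportional shrinkage across all subgroups is hard to explain with genuinely different mean effects per subgroup, but it falls straight out of a single variance change.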

This provides an ideal opportunity for the virus to quietly mutate, spread and, via natural selection – helped along by fewer people taking precautions – favour variants that are both “longer incubating” and then potentially “suddenly more lethal”.

So vaccines were a bad thing?

At this point in time I’ll say “NO” [d]. However, in conjunction with bad human behaviour and an inability to think through the statistics, they have led to a complacency that might lead to worse long-term outcomes. The moral of the story is one that sites like NC have been emphasising since the start and which certain official medical and statistical authorities really dropped the ball on right from the get-go.

The vaccines merely bought us time. Time we wasted. Now a long-ignored problem with the logit (or probit) function – the key tool we use to plug “discrete cases of disease” into a “function relating underlying Covid-19 to observed disease status” – might be our undoing. Far fewer people are going to hospital following vaccination (smaller mean effect in terms of lethality), but a MUCH larger number of people have become juicy petri dishes for the virus to play in (larger variance). We have concentrated way too much on the former. The statistics textbooks tend to stress that explanation.

Trouble is, too few people read the small print at the bottom warning them that their logit/probit estimates could just as easily arise from variances, not means. Assume you observe:

  • An answer of 8 in the non-vaccinated group. You assume mean prevalence = 8 and (inverse of) variance = 1, as the stats program always does: 8×1 = 8.
  • In the vaccinated group you see an answer of 4. Wow, the mean (prevalence, via the vaccination effect) has halved – because you ALWAYS ASSUME THE VARIANCE IS 1. So you “must” have got 4 via 4×1, because that is the only way to get 4!

Oops. How you SHOULD have “divvied up” the mean and inverse of variance was 8×1 in the non-vaccinated group and 8×(1/2) in the vaccinated group. The treatment effect is in fact unchanged. The inverse of the variance halved – in other words, the variance doubled. People less consistently got ill [e].
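The same arithmetic in code, using the illustrative numbers above. What the software reports is beta_hat = true_beta × lam, where lam is an inverse function of the latent variance and is silently fixed at 1:

```python
beta_hat_unvax = 8.0  # estimate observed in the non-vaccinated group
beta_hat_vax = 4.0    # estimate observed in the vaccinated group

# Reading 1 (the software default): lam = 1 in both groups, so the mean halved.
means_if_lam_fixed = (beta_hat_unvax / 1.0, beta_hat_vax / 1.0)

# Reading 2 (equally consistent with the data): the mean is unchanged;
# lam fell from 1 to 1/2, i.e. the latent variance doubled.
means_if_variance_doubled = (beta_hat_unvax / 1.0, beta_hat_vax / 0.5)

print(means_if_lam_fixed)          # (8.0, 4.0): "vaccination halved the mean"
print(means_if_variance_doubled)   # (8.0, 8.0): "the mean never changed"
```

Both readings reproduce the observed 8 and 4 exactly; nothing in the data itself arbitrates between them.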

For someone like me who used to deal primarily with stated preference data the worst thing that could happen was that I’d lose the client when model-based predictions went wrong (because I’d made the wrong “split” between means and variances).

The stakes here are much much bigger. This piece is the “statistical issue” – a potential big misinterpretation of Covid-19 data – which really worries people like me.


[a] See my blog; NC also reprinted one of my posts.

[b] This is how and why Daniel McFadden won the so-called Economics Nobel – he predicted the demand for the BART in California extraordinarily accurately, before it was even built. He had both stated and revealed preference data on transport usage.

[c] You can’t keep symmetry if you keep squashing the mountain down. The right tail hits 100%. So you “see” a lot more people in the left tail (doing “well”) as a result of vaccination. This leads to the mean effect – so vaccination is unlikely to be 100% variance related. There must be a certain degree of mean effect here. My point is that the “real mean effect of vaccination” is theoretically a lot less than we observe from the data.

[d] With the “Keynes get out-clause”.

[e] The actual logit and probit functions basically spit out a vector of “beta hats”, which are actually “true betas MULTIPLIED by an INVERSE function of the variance on the latent scale”. So when variances go up – which in SP data happens when, say, you get answers from people with lower literacy – the “beta hats” (and hence odds ratios) all DECREASE in absolute magnitude. In other words, confusingly for non-stats people, we (to make the equation look less intimidating) tend to define a function of the variance (lambda – not to be confused with the Covid variant – or mu) that is MULTIPLICATIVE with the “true beta”. Believe me, if you think this was a stupid simplification that would lead to confusion as people talk at cross-purposes, you are not alone.
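The attenuation described in [e] can be checked with a quick pure-Python simulation (no libraries; all parameter values invented). The latent mean effect beta is held FIXED while the latent noise sigma doubles, yet the estimated log-odds-ratio roughly halves:

```python
import math
import random

random.seed(0)

def empirical_log_or(beta, sigma, n=100_000):
    """Simulate y = 1 if beta*x + sigma*logistic_noise > 0, for x in {0, 1},
    then return the empirical log-odds-ratio between the two groups."""
    logit = lambda p: math.log(p / (1 - p))
    p = {}
    for x in (0, 1):
        hits = 0
        for _ in range(n):
            u = random.random()
            noise = math.log(u / (1 - u))  # standard logistic draw
            hits += beta * x + sigma * noise > 0
        p[x] = hits / n
    return logit(p[1]) - logit(p[0])

same_beta_low_noise = empirical_log_or(beta=1.0, sigma=1.0)
same_beta_high_noise = empirical_log_or(beta=1.0, sigma=2.0)
print(same_beta_low_noise, same_beta_high_noise)  # roughly 1.0 and 0.5
```

The true beta never moved; only the spread of the latent noise did. Yet any analyst reading the two estimates at face value would report that the “effect” halved.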