Category Archives: Economics




Well, I’ve finally got round to programming a model that:

  • Asks you just five best-worst scaling questions – you choose your “most agreed with principle” and “least agreed with principle” – it takes people 2–3 minutes to answer, tops.
  • Runs a best-worst scaling (BWS) exercise on just YOUR five answers.
  • Spits out three things:
    • A pie chart showing how likely each of the six main options (continued EU membership / Norway option / Switzerland option / Canadian option / Turkish option / World Trade Organisation option) is to best satisfy YOUR principles
    • A pie chart showing the predicted chances of you personally supporting each of the five principles
    • A pie chart showing the predicted chances of you personally rejecting each of the five principles
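For the curious, the guts of such an exercise can be sketched in a few lines. To be clear, this is NOT my actual model: the principle labels, the simple best-minus-worst scoring and the logit-style transform below are all illustrative stand-ins.

```python
from collections import Counter
from math import exp

# Illustrative principle labels only; NOT the actual survey wording
principles = ["single market", "free movement", "EU budget",
              "trade deals", "sovereignty"]

# Five hypothetical answers: (most agreed, least agreed) per question
answers = [("single market", "sovereignty"),
           ("single market", "EU budget"),
           ("free movement", "sovereignty"),
           ("trade deals", "EU budget"),
           ("single market", "trade deals")]

best = Counter(b for b, _ in answers)
worst = Counter(w for _, w in answers)

# Simple best-minus-worst score per principle, for this ONE respondent
scores = {p: best[p] - worst[p] for p in principles}

# Push the scores through a logit-style transform to get the
# probabilities that would feed a pie chart
denom = sum(exp(s) for s in scores.values())
probs = {p: exp(scores[p]) / denom for p in principles}

for p in sorted(probs, key=probs.get, reverse=True):
    print(f"{p}: {probs[p]:.2f}")
```

The point is that five questions are enough to score one individual: no averaging across people is needed before anything useful comes out.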





Thus, the first chart tells you, based on which of these five principles we could “get” under each of the six models of a new British-European relationship (one REMAIN, five BREXIT), the chances of getting “as much as we want” from each model.

This, like all CORRECT best-worst scaling, is an individual model, giving you PERSONALISED results, not “you averaged with others”.

We can, of course, average across people, slice and dice the results across sex/gender/political affiliation etc, to find out what model is most popular in certain groups. But the point is, my model doesn’t NEED to do that. All because just five BWS questions tell me everything I need to know about what you value.

Gold dust for all the campaigns – and the government, as it struggles to negotiate what type of new relationship would command majority support in the country.

I have deliberately answered the survey as a “hypothetical REMAINer” to show what they should have done – namely made the single European market something people understood and fought for, above other factors.

There are lots of scenarios – including what probably actually happened, in that people were in reality “sure” they disliked free movement of people and/or EU budget contributions but unsure about their SEM/FTA/CU support – which lead to a BREXIT outcome as the most likely to achieve their preferences. Your relative preferences for these determine which BREXIT model (hard/soft) is most likely to suit you.

Campaign managers/constituency parties/national party executives as well as Jo(e) Public would be very interested in this.


Best-worst capabilities endorsed

Wow. In this article Will Hutton interviews Amartya Sen. A crucial quote:

“…you have to take in, somehow, the unattractiveness of the last as well as the attractiveness of the first candidate.”


Wow, quantifying the worst as well as the best?

Which group has been at the forefront world-wide of doing this?

Yep, we’ve been way ahead of our time.

EU inequality

OK I’m breaking my self-imposed law within a few hours.

Ben says EU good






I usually have the utmost respect for Ben Goldacre and don’t want to get into trolling territory on Twitter, but this is a simplistic statement. The first statement is true. The second is highly debatable if you stratify by age.

It is well known (see Bill Mitchell amongst a wealth of others, many of whom could not be seen as “outsiders” but are well within the mainstream) that unemployment in southern EU countries is appalling amongst the young. 50% or so. People with PhDs living at home with parents and, if they’re lucky, doing some barista work. All courtesy of the banking rules that force them to “live within their means – like a household”. All a nonsense paradigm of course if you understand how money is created and destroyed. But the results are in and have been in for many years now. There is, of course, a strong affinity with the EU, given the benefits of the past. However, recent ECB policy means the young can’t afford a home, and get bare-bones healthcare.

effects or dummies redux

That old bugbear comes back… are effects codes really superior to dummy variables?


This note revisits the issue of the specification of categorical variables in choice models, in the context of ongoing discussions that one particular normalisation, namely effects coding, is superior to another, namely dummy coding. For an overview of the issue, the reader is referred to Hensher et al. (2015, see pp. 60–69) or Bech and Gyrd-Hansen (2005). We highlight the theoretical equivalence between the dummy and effects coding and show how parameter values from a model based on one normalisation can be transformed (after estimation) to those from a model with a different normalisation. We also highlight issues with the interpretation of effects coding, and put forward a more well-defined version of effects coding.
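The equivalence the abstract refers to is easy to see: under dummy coding the reference level’s coefficient is fixed at zero, under effects coding the coefficients sum to zero, and one set maps to the other by subtracting the mean. A sketch with made-up coefficient values:

```python
# Hypothetical dummy-coded estimates for a 3-level categorical attribute;
# under dummy coding the reference level ("low") is fixed at zero
dummy = {"low": 0.0, "medium": 0.6, "high": 1.5}

# Effects-coded equivalents: subtract the mean so the coefficients sum to zero
mean = sum(dummy.values()) / len(dummy)
effects = {k: v - mean for k, v in dummy.items()}

# The two normalisations are equivalent: utility DIFFERENCES are unchanged
assert abs(sum(effects.values())) < 1e-12
assert abs((dummy["high"] - dummy["low"]) -
           (effects["high"] - effects["low"])) < 1e-12
print(effects)
```

Neither coding adds information the other lacks; they are the same model written against a different zero, which is exactly why claims of superiority need careful scrutiny.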

That’s one of the joys and frustrations of DCEs; why you can never rest on your laurels and should really be acknowledging that it is a field in its own right; why you should have a DCE expert on your team for all important projects. Just when you thought something was right, its merits are questioned. Fun fun fun.

model disclosure

This post concerns a Twitter poll and discussion initiated by Chris Carswell (editor of Pharmacoeconomics and The Patient, tweeting as @PECjournal) on whether a statement should be added to a paper to the effect that the authors’ model, when requested, was not submitted for peer review.

I abstained, saying I think a statement should be made if it’s a “traditional” decision analytic/similar CEA/CUA but I personally don’t favour it for DCEs.

The two counter-arguments made were that:

  1. Proprietary models go against the spirit of transparency that is increasingly demanded, &
  2. My point that model selection for DCEs is part art applies equally to qualitative research, yet qualitative researchers still have to submit discussion guides/the full survey.

I do acknowledge both points, but my responses would be as follows:

(1) Proprietary software is routinely used to generate designs and (particularly) to analyse results of economic and other models: we’re getting into the nitty-gritty of the likelihood maximisation routine used (EM algorithm/other etc), starting value routines used internally by the stats program, etc. The ultimate black box is the stuff that does everything for the novice/inexperienced DCE researcher, mentioning no names 😉

Now, that doesn’t make things right, but it does mean that unless the researcher has the full code for everything from DCE design to model selection, or can reference it all for reviewers, I don’t think picking on just the DCE model selection issue is fair.

(2) I have no objections to submitting the design of the survey – when I was a reviewer, most fatal errors were made in the design, and I take the view that no DCE can be properly reviewed without reviewers having access to the design. (Another reason why authors might like to rethink using “adaptive conjoint” – are they going to provide the design administered to every respondent? Haha, thought not. And if they do, will reviewers check through such a model, which involves programming it in their own software? Haha, thought not.) I myself also provide details of the main and secondary analyses I conducted. These can all be reproduced by reviewers, if they want to. The difficulty – and I believe, from my (far more limited, I acknowledge) experience/observation of the analysis of qualitative data, that it’s the same there – is that value judgments are made: e.g. “have we really reached saturation?” etc. For the reviewer it comes down to “in my experience, do I agree with this?”

And, unfortunately, in my experience in academia, too few peers had sufficient experience – and I mean designing, analysing and interpreting DCEs across multiple fields – to possibly feel comfortable endorsing me when I say “I didn’t use the model dictated by the BIC criterion – or whatever statistical rule you may like – because it routinely gives too many latent classes and I used my experience to choose the best model”. Sorry, yes I sound arrogant, but when any one DCE has literally an infinite number of solutions – a point still ignored or misunderstood by most practitioners – then inevitably experience and gut feelings based on intimate knowledge of your sample, data and survey become paramount.

In short, model selection skills can’t be taught, they must be gained with experience.

And, you are fully entitled to say “well you would say that, you work in industry now”. To which I’d respond, yes, I do have an interest in saying that, but why are academic groups that routinely delay competitor groups’ papers, mis-reference things in order to skew publication metrics and funding likelihood etc not pulled up on their shenanigans? I got a google citation report just today to something – and seeing the authors I would have bet (before reading) 100 GBP with anyone on the planet that the paper of mine that was absolutely crucial to this new publication would not be the citation I got the report for. I would have won the bet, the citation was to something else of mine entirely. I just laugh at these things now, they don’t affect me or my business, but it’s rather sad that they still go on. Particularly in this case when it can contribute to more QALY valuation studies that can’t possibly give the right answer – how is that defensible on equity or efficiency grounds?

So, until basic rules of research – and we’re talking the stuff I was taught in my first PhD supervision like “get the primary source”, not even the more recent transparency stuff – are followed consistently by academics I’m afraid industry is entitled to retort “people in glass houses shouldn’t throw stones”.

no capability not death

Just a quick note following a twitter exchange I had regarding whether capabilities as valued by the ICEPOP team (the ICECAP-O was referenced in the original paper) are “QALY-like”.

Key team members never intended the ICECAP-O scores to be multiplied by life expectancy (in the way, say, an EQ-5D score is). Whilst we have recognised that people would like to do this, technically this is a fudge and comes down to definitions and the maths:

Death necessarily implies no capabilities, but no capabilities (the bottom ICECAP-O state) does not imply death. But more fundamentally, the estimated ICECAP scores are interval scaled, NOT ratio scaled (for reference, read the BWS book): we used a linear transformation to preserve the relative differences between states, but the anchoring at zero would not be accepted by a math psych person – they would say that defining the bottom to be the zero doesn’t make it so.

Since different individuals technically had different zeros for death (BWS, like any discrete choice method, yields estimates with an arbitrary zero), multiplying an averaged interval-scale score (our published tariff) by a ratio-scaled one (life expectancy) to compare across groups/interventions is wrong. If there is heterogeneity in where “death” sits on our latent capability scale (which we can’t/didn’t quantify – unlike the traditional QALY models estimated in the proper way), then comparisons across groups that don’t share the same “zero” give incorrect answers. We can compare “mean losses of capability from full capability”, which is why I personally (though I don’t speak for the wider team here) prefer the measure to be used as an alternative measure of deprivation, like the IMD in the UK or SEIFA in Australia.
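A toy example (all numbers invented) makes the interval-vs-ratio point concrete: a linear transformation preserves the relative differences between states, but anything involving the zero, and hence any ratio or any multiplication by life expectancy, depends entirely on where the arbitrary zero was put.

```python
# Invented latent capability estimates (the zero is arbitrary)
latent = {"full": 2.0, "mid": 1.0, "bottom": 0.5}

def rescale(values, zero_at):
    """Linear transform putting state 'zero_at' at 0 and 'full' at 1."""
    lo, hi = values[zero_at], values["full"]
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

# Anchoring the zero at the bottom state...
a = rescale(latent, "bottom")
# ...versus anchoring it at a (hypothetical) "death" point further down
b = rescale(dict(latent, death=-1.0), "death")

# Relative DIFFERENCES between states survive any linear transform
ratio_a = (a["full"] - a["mid"]) / (a["mid"] - a["bottom"])
ratio_b = (b["full"] - b["mid"]) / (b["mid"] - b["bottom"])
assert abs(ratio_a - ratio_b) < 1e-9

# ...but "ratios" of scores depend entirely on the arbitrary zero
print(a["mid"] / a["full"], b["mid"] / b["full"])
```

Two analysts with different (equally defensible) zeros would therefore report different “proportions of full capability” for the same person, which is exactly the problem with bolting life expectancy onto an interval scale.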

happiness redux

There is a piece on happiness up today. It is a guest post from VoxEU and unfortunately, though it tries to make valid points, it falls into the usual holes. The key one is that the data all appear to be Likert-based self-reported happiness scales, which in at least two major countries (the US and Australia) have been shown to be deeply misleading. In short, even within these two countries, there are cohort and/or longitudinal effects: the number you state your happiness/life satisfaction to be is heavily dependent upon age (particularly if you are older), independent of (after adjusting for) a huge number of other factors (health, wealth, social empowerment, independence, etc). Moreover this is not “just” the infamous “mid-life dip”: the differences between such measures and the more comprehensive well-being/quality-of-life ones are particularly stark in extreme old age and have big implications for retirement age, what resources are needed by the very old, etc.

To make comparisons across countries with different cultural backgrounds seems even more hazardous – Likert scales generally were pretty much discredited on such grounds by 2001:

Baumgartner H, Steenkamp J-BEM. Response styles in marketing research: a cross-national investigation. Journal of Marketing Research. 2001;38(2):143-56.

Steenkamp J-BEM, Baumgartner H. Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research. 1998;25(1):78-90.

Five year age bands showing mean levels (after rescaling) of self-rated happiness versus scored quality of life in Bristol, UK









The above shows that the ICECAP-O measure (based on the discrete choice methods of McFadden, coupled with the Capabilities Approach of Sen, both winners of the Economics “Nobel”) tracks happiness (after both are rescaled to a 0–1 scale) reasonably well until middle age. In old age people report suspiciously high life satisfaction/happiness scores even when they have a whole host of problems in their lives. We captured these in the ICECAP-O (collected from the same people who gave us life satisfaction scores), as well as their individual answers to a huge number of questions about these other factors in life. This has been found in the USA too:

US life satisfaction









In short, we don’t have a bloody clue what older people are doing when they answer these scales, but they sure aren’t doing the same thing as younger people.

I discussed further the contribution of trust toward a broad measure of well-being in a talk I gave years ago when in Sydney: in Australia it is basically the case that a lack of trust in those in the local community has a pretty huge (11%) detrimental effect on your quality of life in Sydney, but a much smaller, though still significant (5%), effect elsewhere in Australia.

I wish these Likert-based happiness surveys would cease. They really don’t help the field, when much better alternatives are already in routine use.

algorithms are bad mkayy

One of the blogs I follow is Math Babe and she has just published a book on (amongst other things) the problems with big data (which I intend to buy and read as soon as I get the time). The Guardian reprinted some of it, which is great for bringing this to a wider audience.

I left the following comment at her entry which mentions the Guardian article, but I think it might have disappeared into moderation purgatory, as my first attempt to post was from an account I don’t use (as opposed to this .org one). Anyway, the gist of what I said was that she is entirely right to lambast the use of automatic rules and algorithms to analyse (for instance) personality data used in recruitment. However, smart companies (1) don’t use psychometric data, they use Best-Worst Scaling, which cannot be “gamed”, and (2) use human input to interpret the results. Anyway here’s my comment to her blog post… EDIT – the comment appeared, hooray!


Hi. Nice article and I intend to get and read the book when things calm down a little in my work life. I just have two comments, one that is entirely in line with what you have said, and one which is a mild critique of your understanding of the personality questionnaires now being used by certain companies.

First, I agree entirely that the “decision rule” to cut down the number of “viable” candidates based on various metrics should not be automated. Awful practice.

Second, and where I would disagree with you, is in the merits of the “discrete choice” based personality statements (so you *have* to agree with one of several not very nice traits). This is not, in fact, psychometrics. It is an application of Thurstone’s *other* big contribution to applied statistics, random utility theory, which is most definitely a theory of the individual subject (unlike psychometrics which uses between-subject differences to make inferences).

I think you may be unaware that if an appropriate statistical design is used to present these (typically best-worst scaling) personality-trait data, then the researcher obtains ratio-scaled (probabilistic) inferences which must, by definition, be comparable across people and allow you to separate people on *relative* degrees of (say) the big 5. Thus why they can’t be gamed, and why I know of a bank that sailed through the global financial crisis by using these techniques to ensure a robust spread of individuals with differing relative strengths.

If two people genuinely are the same on two less-attractive personality traits then the results will show their relative frequencies of choice to be equal, and those traits will have also competed against other traits elsewhere in the survey (and probably appear “low down” on the latent scale). So there’s nothing intrinsically “wrong” with a personality survey using these methods (see work by Lee, Soutar and Louviere, who operationalised it for Schwartz’s values survey) – indeed there is lots to commend it over the frankly awful psychometric paradigm of old.

I would simply refer back to my first point (where we agree) and say that the interpretation of the data is an art, not a science, and why people like me get work in interpreting these data. Incidentally and on that subject, I can relate to the own-textbook-buzz, mine came out last year. Smart companies already know how to collect the right data, they just realise they can’t put the results through an algorithm.
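The trait-competition point in the comment above can be illustrated with a tiny random-utility simulation (trait names and utility values invented): two traits with identical latent utility end up with near-equal frequencies of being picked as “worst”, while genuinely worse traits separate out on the latent scale.

```python
import math
import random

random.seed(1)

# Invented latent utilities for four traits (two deliberately tied)
utility = {"blunt": -1.0, "impatient": -1.0, "dishonest": -3.0, "lazy": -2.0}

def gumbel():
    # Standard Gumbel draw via the inverse transform
    return -math.log(-math.log(random.random()))

def pick_worst(traits):
    # Random utility theory: each trait gets an independent noise draw;
    # the lowest noisy utility is chosen as "worst"
    return min(traits, key=lambda t: utility[t] + gumbel())

n = 20000
counts = {t: 0 for t in utility}
for _ in range(n):
    counts[pick_worst(list(utility))] += 1
shares = {t: c / n for t, c in counts.items()}

# Tied traits come out with near-equal "worst" frequencies;
# genuinely worse traits are picked as "worst" far more often
print(shares)
```

That is the sense in which the method cannot be gamed: relative choice frequencies recover the latent ordering whether or not the respondent likes admitting to any given trait.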

adaptive conjoint

I was interviewed for a podcast for MR Realities by Kevin Gray and Dave McCaughan a week or so ago. It went well (bar a technical glitch causing a brief outage in the VOIP call at one point) and apparently the podcast is doing very well compared to others.

One topic raised was adaptive conjoint analysis (ACA). This method seeks to “tweak” the choice sets presented to a respondent based on his/her initial few answers, and thus (the theory goes) “home in” on the trade-offs that matter most to him/her more quickly and efficiently. The trouble is, I don’t like it and don’t think it can work – and the last time I spoke to world design expert Professor John Rose about it, he felt similarly (though our solutions are not identical). There are three reasons I dislike it.

  1. Heckman shared the 2000 Nobel prize with McFadden: sampling on the basis of the dependent variable – the respondent’s observed choices – is perilous and often gives biased results – the long-recognised endogeneity issue.
  2. The second reason is probably more accessible to the average practitioner: suppose the respondent just hasn’t got the hang of the task in the first few questions and unintentionally misleads you about what matters – you may end up asking a load of questions about the “wrong” features.
    You may ask what evidence there is that this is happening. Well, my last major paper as an academic showed that even the typically smallest “standard” design that gives you individual-level estimates of all the main feature effects (the Orthogonal Main Effects Plan, or OMEP) can lead you up the garden path (if, as we found, people use heuristics because the task is difficult), so I simply, genuinely don’t understand how asking a smaller number of questions allows me to make robust inferences.
  3. But it gets worse: the third reason I don’t like adaptive designs is that if a friend and I seem to have different preferences according to the model, I don’t know whether we genuinely differ or whether answering different question designs caused the result (estimates are confounded with the design). And the other key finding of the paper I just mentioned confirmed a body of evidence showing that people do interact with the design – so you can get a different picture of what I value depending on what kind of design you gave me. Which is very worrying. So I just don’t understand the logic of adaptive conjoint, and I follow Warren Buffett’s mantra – if I don’t understand the product, I don’t sell it to my clients.
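Reason 1 can be made concrete with a toy simulation (all numbers invented). Once answers are correlated within a person, conditioning the sample or the later questions on earlier observed choices shifts the estimates – the classic selection-on-the-dependent-variable problem.

```python
import random

random.seed(42)

# Each invented respondent has their own latent probability of preferring A
pop = [random.betavariate(2, 2) for _ in range(50000)]  # population mean 0.5

# Estimate A's share from one choice per respondent, full sample: unbiased
n = len(pop)
full = sum(1 for p in pop if random.random() < p) / n

# "Adaptive" filter: keep only respondents whose FIRST answer was A,
# then estimate A's share from their SECOND answer
kept = [p for p in pop if random.random() < p]
selected = sum(1 for p in kept if random.random() < p) / len(kept)

print(round(full, 3), round(selected, 3))  # the filtered estimate is inflated
```

The filtered estimate overstates A’s true population share because selecting on an earlier choice selects on the latent preference itself; adaptive designs do something analogous every time they let observed answers steer which questions get asked.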

John Rose and Michiel Bliemer wrote a paper for a conference way back in 2009 debunking the “change the design to fit the individual” idea. Their solution was novel: the design doesn’t vary by individual (so no confounding issue) but it does change for everyone after a set number of questions. It’s a type of Bayesian efficient design, but it requires some heavy lifting to be done during the survey itself that most people would not be able to do.
Though I think it’s a novel solution, I personally would only do this to the extent that everyone has (for instance) first done a design (e.g. at least the OMEP) that elicits individual-level estimates; then, after segmentation, you could administer a second complete survey based on those results. Indeed that would solve an issue that has long bugged me – how do you know what priors to use for an individual if you don’t already have DCE results for that individual (since heterogeneity nearly always exists)? But I also have a big dose of scepticism about very efficient designs anyway, given the paper I referenced, and that is a different can of worms I opened 🙂




older aussie well-being

A tweet just alerted me to this report on the well-being of older Australians. I haven’t had time to read it in detail, but a quick skim seems to indicate it did all the “usual” things of checking correlations, doing PCA, etc., and then “The final index was then calculated by averaging the five domain index_log”.


I cannot help but feel a little frustrated. I gave a talk on this subject 6 years ago in Sydney when working at UTS. Many of my top publications from my time in academia concerned the development and Australian usage of the ICECAP-O instrument (see chapter 12) as a measure of the well-being of (primarily but not just) older people. The advantages it has over the research in the report I’ve just read are the following:

  1. ICECAP-O doesn’t use variables that must be collected from official sources or be part of (say) the SEIFA. The five variables came from extensive qualitative work that established what it is about (say) housing uncertainty that really contributes to/takes away from well-being. We wanted the underlying key conceptual attributes of well-being. So whilst health (for instance) is valued, it is what it gives you in terms of independence, security, enjoyment of activities that really matters.
  2. ICECAP-O is an individual-level, one-A4-page questionnaire. Four response categories per question means 4^5=1024 distinct “states” you could be in, each with its own percentage score. So you can slice and dice the data in far more flexible, disaggregated ways than what’s out there so far.
  3. The five domains are NOT simply averaged, nor are the response categories across domains equally valued – e.g. the 2nd-to-top level of “love and friendship” is more highly valued, on average, than the top level of ANY of the other four domains. They don’t all matter equally to older people. There is even a fair degree of heterogeneity AMONG older people as to the relative importance of these, heavily driven by factors such as marital status and gender. We used choice models to find this out and the findings are based on robust, well-tested theoretical models.
  4. You can compare with (say) the SEIFA – we did this with the UK Index of Multiple Deprivation in one of my papers looking at a British city – and get far better insights. So, for instance, measures like the IMD/SEIFA can be misleading when they fail to capture measures of social capital or connectedness.
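Points 2 and 3 can be sketched in a few lines. The tariff values below are INVENTED for illustration (they are not the published ICECAP-O tariff), though the domain names follow the instrument: four levels across five domains gives 1024 scoreable states, and nothing forces the domains to be equally weighted.

```python
from itertools import product

# INVENTED tariff values, NOT the published ICECAP-O tariff; domain names
# follow the instrument, with level 3 = full capability on that domain
tariff = {
    "attachment": [0.00, 0.10, 0.21, 0.27],  # "love and friendship"
    "security":   [0.00, 0.06, 0.12, 0.18],
    "role":       [0.00, 0.06, 0.12, 0.18],
    "enjoyment":  [0.00, 0.06, 0.12, 0.18],
    "control":    [0.00, 0.06, 0.13, 0.19],
}

def score(state):
    """state maps each domain to a level 0-3; returns the 0-1 tariff score."""
    return sum(tariff[d][lvl] for d, lvl in state.items())

# Four levels across five domains: 4**5 = 1024 distinct states
n_states = len(list(product(range(4), repeat=5)))

# Unequal weighting: here the 2nd-to-top level of attachment (0.21) beats
# the TOP level of every other domain (max 0.19)
second_top_attachment = tariff["attachment"][2]
best_other_top = max(tariff[d][3] for d in tariff if d != "attachment")
print(n_states, second_top_attachment > best_other_top)
```

A simple five-way average cannot reproduce a structure like this, which is exactly why choice-model-derived tariffs beat “average the domains” indices.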

It’s a shame when disciplines don’t talk to one another. Things could move forward a lot more quickly. And as long as we use SEIFA/IMD type measures in policy, we’re going to be directing resources to the wrong people.