Tag Archives: dce

semi-retiring from blogging

Unfortunately I shall be semi-retiring from blogging.

When I say “semi”, I mean that general discussions on my personal website and comments on my personal twitter account will become few and far between. I shall continue to make comments/blogs on my work account.

There are several, in some cases related, reasons:

(1) Standards of practice in DCEs are not improving in health. It’s profoundly depressing when you read a blog entry/article/op-ed that has you nodding fiercely – as just happened – and then you get to the central defence of the paper. And it involves a discrete choice experiment that has not followed proper practice and stands a non-trivial chance of being totally wrong.

(2) Standards of literature review are appalling and getting worse by the year. When I did my PhD you wouldn’t dream of submitting a paper that didn’t show awareness of the literature – particularly if key aspects of your design have been heavily criticised by others.

(3) I get the distinct impression “political arguments” are trumping “data”. This partly follows on from (2): it’s well-known and established why quota sampling is important in DCEs yet “population representative sampling” continues to be used as an “advantage” (ha!) of DCEs done in the field of QALY Decision-making.

If this makes no sense to you then can I respectfully suggest you need to go do some reading?

If you don’t know the finding (from the mid 1980s) that heteroscedasticity on the latent scale is a significant problem in terms of bias, and how it matters in QALY studies, then it makes me think you have a rather large hole in your statistical knowledge and worries me immensely.
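For anyone who wants the algebra behind that point, here is a minimal sketch under the standard logit set-up. The symbols (a respondent-specific scale factor written as lambda, a common preference vector written as beta) are notation assumed for illustration, not taken from any particular paper.

```latex
U_{ij} = \beta' x_{ij} + \varepsilon_{ij},
\qquad
\operatorname{Var}(\varepsilon_{ij}) = \frac{\pi^2}{6\lambda_i^2}
\quad\Longrightarrow\quad
P_i(j \mid C) = \frac{\exp(\lambda_i \beta' x_{ij})}{\sum_{k \in C} \exp(\lambda_i \beta' x_{ik})}
```

Only the product of scale and preference, \lambda_i \beta, is recoverable from the choices. If the error variance differs across respondents (heteroscedasticity on the latent scale) and the data are pooled as if it did not, the estimates mix up "how consistent people are" with "what people value".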

I won’t name names, in the interests of discretion, but I’m tired of making this point year in year out, with no result (with the honourable exception of the EuroQoL Foundation who funded a group I am part of to look at this)….and I showed it empirically in the BWS book. Please read the health chapters to understand this. I’m open to questions by email if you don’t understand the logic.

(4) I spent a lot of my own money showing how attitudes are related to preferences in terms of politics…..which got me zilch…..the media are lemmings….they’d rather all jump off the cliff together than report something different (and based on stronger assumptions) and risk being “the one who was wrong”. Again, lack of statistical training, noted already by people like Ben Goldacre.

So I’m afraid I’m a little tired of all this. I have a business to run. Parents to do a lot of stuff for.

I’m still here on email – ask me if you’re puzzled. I’m not trying to be obstructive here. But I need to concentrate on putting food on the table.

All the best,


BREXIT survey stuff on work account

Just a reminder that the results of my Best-Worst Scaling survey, which showed what would happen if we could know the (LEAVE/REMAIN) view of every eligible voter in the UK, are on my work account.

Most follow-up – regional variation, recommendations as to which types of BREXIT are preferred by whom, how 8% of that 28% who never turned out to vote could have held the key to everything – will be on that account too.

Some interesting observations from the raw data – and remember we can look at an individual’s responses here, because BWS gave us 10 data points to estimate 5 parameters:

  • The East Midlands, although heavily LEAVE, skews quite heavily toward a different type of BREXIT to other LEAVE regions.
  • The strong preference for free trade is simply not there….it has shifted – VERY heavily – toward the free movement of people throughout Europe. This “strong positive liking of immigration” is visible nowhere else. The non-English countries/principalities (Wales, Northern Ireland and Scotland) have a broadly neutral view on immigration. The non East-Midlands part of England strongly dislikes it.
  • East Midlanders also have a strong antipathy toward several key aspects of the EU – in fact the pattern of their dislikes looks remarkably consistent with a “Swiss form of BREXIT” – one of the so-called “soft” BREXIT options.
  • They also are the region which loathes the EU budget contribution the most.
  • Their results form a remarkably realistic view, compared to some other segments of British society: they (we – I am a Nottinghamian) seem quite happy to sacrifice elements of the single market and the customs union, plus we’ll adopt a constructive view on immigration with our European neighbours if it means we “get some money back”. We’ll also compromise on free trade quite happily.

So what gives? Has everyone round here had some secret training in Ricardo’s work, thus recognising when free trade is not welfare-enhancing?

algorithms are bad mkayy

One of the blogs I follow is Math Babe and she has just published a book on (amongst other things) the problems with big data (which I intend to buy and read as soon as I get the time). The Guardian reprinted some of it, which is great for bringing this to a wider audience.

I left the following comment at her entry which mentions the Guardian article, but I think it might have disappeared into moderation purgatory as my first attempt to post “from an account” was the WordPress.com one which I don’t use (as opposed to this .org). Anyway the gist of what I said was that she is entirely right to lambast the use of automatic rules and algorithms to analyse (for instance) personality data used in recruitment. However, smart companies (1) don’t use psychometric data, they use Best-Worst Scaling which cannot be “gamed” and (2) use human input to interpret the results. Anyway here’s my comment to her blog post…EDIT – the comment appeared, hooray!


Hi. Nice article and I intend to get and read the book when things calm down a little in my work life. I just have two comments, one that is entirely in line with what you have said, and one which is a mild critique of your understanding of the personality questionnaires now being used by certain companies.

First, I agree entirely that the “decision rule” to cut down the number of “viable” candidates based on various metrics should not be automated. Awful practice.

Second, and where I would disagree with you, is in the merits of the “discrete choice” based personality statements (so you *have* to agree with one of several not very nice traits). This is not, in fact, psychometrics. It is an application of Thurstone’s *other* big contribution to applied statistics, random utility theory, which is most definitely a theory of the individual subject (unlike psychometrics which uses between-subject differences to make inferences).

I think you may be unaware that if an appropriate statistical design is used to present these (typically best-worst scaling) personality-trait data then the researcher obtains ratio scaled (probabilistic) inferences which must, by definition, be comparable across people and allow you to separate people on *relative* degrees of (say) the big 5. This is why they can’t be gamed, and why I know of a bank that sailed through the global financial crisis by using these techniques to ensure a robust spread of individuals with differing relative strengths.

If two people genuinely are the same on two less-attractive personality traits then the results will show their relative frequencies of choice to be equal, and those traits will have also competed against other traits elsewhere in the survey (and probably appear “low down” on the latent scale). So there’s nothing intrinsically “wrong” with a personality survey using these methods (see work by Lee, Soutar and Louviere who operationalised it for Schwartz’s values survey) – indeed there is lots to commend it over the frankly awful psychometric paradigm of old.

I would simply refer back to my first point (where we agree) and say that the interpretation of the data is an art, not a science, which is why people like me get work in interpreting these data. Incidentally and on that subject, I can relate to the own-textbook-buzz, mine came out last year. Smart companies already know how to collect the right data, they just realise they can’t put the results through an algorithm.

adaptive conjoint

I was interviewed for a podcast for MR Realities by Kevin Gray and Dave McCaughan a week or so ago. It went well (bar a technical glitch causing a brief outage in the VOIP call at one point) and apparently the podcast is doing very well compared to others.

One topic raised was adaptive conjoint analysis (ACA). This method seeks to “tweak” the choice sets presented to a respondent based on his/her initial few answers, and thus (the theory goes), “home in” on the trade-offs that matter most to him/her more quickly and efficiently. The trouble is, I don’t like it and don’t think it can work – and the last time I spoke to world design expert Professor John Rose about it, he felt similarly (though our solutions are not identical). There are three reasons I dislike it.

  1. Heckman shared the 2000 Nobel prize with McFadden: sampling on the basis of the dependent variable – the respondent’s observed choices – is perilous and often gives biased results – the long-recognised endogeneity issue.
  2. The second reason is probably more accessible to the average practitioner: suppose the respondent just hasn’t got the hang of the task in the first few questions and unintentionally misleads you about what matters – you may end up asking a load of questions about the “wrong” features.
    You may ask what evidence there is that this is happening. Well my last major paper as an academic showed that even doing the typically smallest “standard” design to give you individual-level estimates of all the main feature effects (the Orthogonal Main Effects Plan, or OMEP) can lead you up the garden path (if, as we found, people used heuristics because the task was difficult) so I simply, genuinely don’t understand how asking a smaller number of questions allows you to make robust inferences.
  3. But it gets worse: the 3rd reason I don’t like adaptive designs is that if a friend and I seem to have different preferences from the model, I don’t know if we genuinely differ or whether it was that we answered different question designs that caused the result (estimates are confounded with design). And the other key finding of the paper I just mentioned confirmed a body of evidence showing that people do interact with the design – so you can get a different picture of what I value depending on what kind of design you gave me. Which is very worrying. So I just don’t understand the logic of adaptive conjoint and I follow Warren Buffett’s mantra – if I don’t understand the product I don’t sell it to my clients.

John Rose and Michiel Bliemer wrote a paper for a conference way back in 2009 debunking the “change the design to fit the individual” idea. Their solution was novel: the design doesn’t vary by individual (so no confounding issue) but it does change for everyone after a set number of questions. It’s a type of Bayesian efficient design, but requiring some heavy lifting to be done during the survey itself that most people would not be able to do.
Though I think it’s a novel solution, I personally would only do this to the extent that everyone has (for instance) done a design (e.g. at least the OMEP) that elicits individual level estimates, then after segmentation you could administer a second complete survey based on those results: indeed that would solve an issue that has long bugged me – how do you know what priors to use for an individual if you don’t already have DCE results for that individual (since heterogeneity nearly always exists)? But I also have a big dose of scepticism of very efficient designs anyway given the paper I referenced, and that is a different can of worms I opened 🙂




what can’t be legitimately separated from how

I and co-authors have a chapter in the just published book “Care at the end of life” edited by Jeff Round. I haven’t had a chance to read most of it yet, but from what I’ve seen so far it’s great.

Chris Sampson has a good chapter on the objects we value when examining the end-of-life trajectory. It’s nicely written and parts of it tie in with my series on “where next for discrete choice valuation”, parts one, (which he cites), two, three, but particularly (and too late for the book), four.

The issue concerns a better separation of what we are valuing from how we value it. I came at it from a slightly different angle from Chris, though I sense we’re trying to get people to address the same question. It’s of increasing importance now the ICECAP instruments are becoming more mainstream. I’m often thought of as “the valuation guy” – yet how we valued Capabilities is intimately tied up with how the measures might (or might not) be used, as well as the concepts behind them. When I became aware that the method we used – Case 2 BWS – would not necessarily have given us the same as those from discrete choice experiments, part of me worried…..briefly. But in truth, I honestly think our method is more in tune with the spirit of Sen’s ideas. (Not to mention the fact we seem to be getting similar estimates, though I explain why this is probably so in this instance previously).

I have said quite a bit already in the blogs, but it’s nice to see others also coming at this issue from other directions. Anybody working on developing the Capabilities Approach must remain in close contact with those who are working on valuation methods.

Where next for discrete choice health valuation – part four

Why Case 2 Best-Worst Scaling is NOT the same as a traditional choice experiment

This post is prompted by both a funding submission and a paper. My intention is to explain why Case 2 BWS (the profile case – asking respondents to choose the best and worst attribute levels in a SINGLE profile/state) does NOT necessarily give you the same answers as a discrete choice experiment (asking you to choose ONLY between WHOLE profiles/states). (This might not be clear from the book – CUP 2015).

So, an example. In a case 2 BWS question concerning health you might choose “extreme pain” as the worst attribute level and “no anxiety/depression” as the best attribute level. You considered a SINGLE health state.

In a DCE (or Case 3 BWS) you might have to choose between THAT state and another that had “no pain” and “extreme anxiety/depression” as two of the attribute levels.  All other attribute levels remain the same.

Now common sense suggests that if you preferred no depression in the Case 2 task that you would also choose the state with no depression in the DCE task. Unfortunately common sense might be wrong.


Well it comes down to trade-offs – as economics usually does. Case 2 does the following. It essentially puts ALL attribute levels on an interval or ratio scale – a rating scale. BUT it does it PROPERLY, unlike a traditional rating scale. The positions have known, mathematical properties (choose “this” over “that” x% of the time).
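To make the “choose this over that x% of the time” property concrete, here is a minimal sketch. It assumes a simple binary logit rule links the scale positions to choice frequencies; the level names and numbers are made up for illustration and are not estimates from any study.

```python
import math

def choice_prob(u_this: float, u_that: float) -> float:
    """Probability of picking 'this' over 'that' under a binary logit rule (assumed)."""
    return math.exp(u_this) / (math.exp(u_this) + math.exp(u_that))

# Hypothetical positions of three attribute levels on the latent scale (made-up numbers).
levels = {
    "no anxiety/depression": 1.2,
    "some pain": 0.0,
    "extreme pain": -1.5,
}

# How often would "no anxiety/depression" be picked over "extreme pain"?
p = choice_prob(levels["no anxiety/depression"], levels["extreme pain"])
print(f"picked {p:.1%} of the time")
```

The point of the sketch is only that a gap between two positions translates into a definite choice frequency, which is what an ordinary rating scale cannot promise.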

DCEs (or Case 3 BWS) don’t do exactly that. They estimate “with what probability of impairment x would you trade to improve impairment y”. Thus they naturally consider choices BETWEEN whole health states, forcing you to think “that’s bad, really bad, but how bad would this other aspect of life have to be to make me indifferent between the two”. And the answer might be different.

Now Tony Marley has shown that under mild conditions the estimates obtained from the two should be linearly related. But there is one stronger, underlying caveat – that CONTEXT has not changed.

What is context?

Well I shall use a variation on the example from our 2008 JMP paper. Suppose I want to fly from London to Miami. I might be doing it for holiday or work. Now mathematical psychologists (choice modellers) would assume the utility associated with zero/one/two/three stops is fixed. Suppose the utilities are mean centred (so zero is probably positive, one might be, whilst two or three are probably negative – who likes American airports?). The attribute “importance weight” is a multiplicative weight applied to these four attribute LEVEL utilities, depending on the context. So, for business it is probably BIG: you don’t want to (can’t) hang around. For a holiday it may be smaller (you can accept less direct routes, particularly if they are cheaper). In any case the weight “stretches” the level utilities away from zero (business) or “squashes” them towards zero (holiday). It’s a flexible model and tells us a lot, potentially, about how context affects the benefits we accrue. However, there’s a problem. We can’t estimate the attribute importance weights from a single dataset – in the same way that we can’t estimate the variance scale parameter from a single dataset.

So what do we do?

40 years of research, culminating in Marley, Flynn & Louviere (JMP 2008) established that for discrete choice data we have no choice: we MUST estimate the same attribute levels TWICE or more, VARYING THE CONTEXT. Two datapoints for two unknowns – ring any bells? 😉 So do a DCE where you are always having to consider flying for business, then do a SECOND DCE where you are always having to consider flying for a holiday. You then “net out” the common level utilities and you have the (context dependent) attribute importance weights.
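The “netting out” step can be sketched with a toy version of the flight example. Everything here is hypothetical: the level utilities, the two weights, and the helper `relative_weight` are made up for illustration, and the sketch assumes each DCE returns exactly weight times level utility with no noise.

```python
# Hypothetical mean-centred level utilities for 0, 1, 2, 3 stops (made-up numbers).
level_utils = [1.5, 0.5, -0.5, -1.5]

# Hypothetical importance weights: business "stretches", holiday "squashes".
weights = {"business": 2.0, "holiday": 0.5}

# What each DCE recovers on its own: only the product weight * level utility.
estimates = {ctx: [w * u for u in level_utils] for ctx, w in weights.items()}

# One dataset cannot separate the two: halving the weight and doubling the
# utilities reproduces the business estimates exactly.
same_fit = [1.0 * (2.0 * u) for u in level_utils]
print(same_fit == estimates["business"])

def relative_weight(b_num, b_den):
    """Least-squares slope through the origin: b_num is roughly ratio * b_den."""
    return sum(x * y for x, y in zip(b_den, b_num)) / sum(x * x for x in b_den)

# With two contexts the common level utilities cancel, leaving the weight ratio.
ratio = relative_weight(estimates["business"], estimates["holiday"])
print(f"business weight relative to holiday: {ratio}")
```

In this toy set-up only the ratio of the two weights is recovered; pinning down each weight separately needs a normalisation (for example fixing the holiday weight at one).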

So how does this shed light on the Case 2/DCE issue?

Well we have two sets of estimates again: one from a Case 2 task asking us “how bad is each impairment” and another asking us “would you trade this for that”. We can, IF THE DESIGN IS THE SAME, do the same “netting out” to work out the person’s decision rule: “extreme pain is awful but I’d rather not be depressed instead, thank you very much”.

Now this is an example where the ordering of those two impairments changed as a result of context: depression became more important when the individual had to trade. That may or may not be possible or important – that’s an empirical matter. But if we are intending moving from a DCE (multi-state) framework to a Case 2 (single state) framework we should be checking this, where possible. Now for various ICECAP instruments we didn’t, mainly because:

(1) The sample groups were too vulnerable (unable) to do DCEs (older people, children, etc);

(2) We thought the “relative degrees of badness” idea behind Case 2 was more in tune with the ideas of Sen, who developed the Capability Approach (though he has never been keen on quantitative valuation at all, it does have to be said).

Checks, where possible – Flynn et al (JoCM), Potoglou et al (SSM) – also seem to confirm that the more general the outcome (e.g. well-being/health) the less likely that the two sets of estimates will deviate in relative magnitude (at least), which is nice.
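A check of this kind can be sketched as follows, assuming you already hold two sets of estimates for the same attribute levels from the same design. The numbers and the helper names (`ols_line`, `order_reversals`) are hypothetical and are not taken from the papers cited.

```python
from itertools import combinations

# Hypothetical estimates for the same levels from the two tasks (made-up numbers).
case2 = {"extreme pain": -2.0, "extreme depression": -1.6, "some pain": -0.4, "no problems": 1.0}
dce = {"extreme pain": -1.5, "extreme depression": -1.9, "some pain": -0.3, "no problems": 0.9}

def ols_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def order_reversals(a, b):
    """Pairs of levels ranked one way in a and the opposite way in b."""
    return [(i, j) for i, j in combinations(a, 2) if (a[i] - a[j]) * (b[i] - b[j]) < 0]

names = list(case2)
intercept, slope = ols_line([case2[k] for k in names], [dce[k] for k in names])
flips = order_reversals(case2, dce)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("order reversals:", flips)
```

A roughly linear fit with no reversals would be the reassuring case; a reversal like the pain/depression one in these made-up numbers is the kind of context effect worth looking at more closely.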

However, I hope I have convinced people that knowing “how bad this impairment is” and “how bad that impairment is” does not necessarily allow you to say how bad this impairment is relative to that impairment. It’s an empirical matter. Whether you care is a normative issue.

But one thing is for CERTAIN: DON’T CONFOUND DESIGN CHANGES WITH CONTEXT CHANGES – in other words keep the same design if you are going to compare the two, if you want to be certain any differences are really due to context and not some unknown mix of design and context. After all we already know that design affects estimates from both Case 2 and DCE tasks.

EDIT to add copyright note 8 April 2016:

Copyright Terry N Flynn 2016.

This, together with the accompanying blogs, will form a working paper to be submitted to SSRN. Please cite appropriately if referring to these issues in academic papers.