Why Case 2 Best-Worst Scaling is NOT the same as a traditional choice experiment
This post is prompted by both a funding submission and a paper. My intention is to explain why Case 2 BWS (the profile case – asking respondents to choose the best and worst attribute levels in a SINGLE profile/state) does NOT necessarily give you the same answers as a discrete choice experiment (asking you to choose ONLY between WHOLE profiles/states). (This might not be clear from the book – CUP 2015).
So, an example. In a case 2 BWS question concerning health you might choose “extreme pain” as the worst attribute level and “no anxiety/depression” as the best attribute level. You considered a SINGLE health state.
In a DCE (or Case 3 BWS) you might have to choose between THAT state and another that had “no pain” and “extreme anxiety/depression” as two of the attribute levels. All other attribute levels remain the same.
Now common sense suggests that if you preferred no depression in the Case 2 task that you would also choose the state with no depression in the DCE task. Unfortunately common sense might be wrong.
Well it comes down to trade-offs – as economics usually does. Case 2 does the following. It essentially puts ALL attribute levels on an interval or ratio scale – a rating scale. BUT it does it PROPERLY, unlike a traditional rating scale. The positions have known, mathematical properties (choose “this” over “that” x% of the time).
DCEs (or Case 3 BWS) don’t do exactly that. They estimate “with what probability of impairment x would you trade to improve impairment y”. Thus they naturally consider choices BETWEEN whole health states, forcing you to think “that’s bad, really bad, but how bad would this other aspect of life have to be to make me indifferent between the two”. And the answer might be different.
Now Tony Marley has shown that under mild conditions the estimates obtained from the two should be linearly related. But there is one stronger, underlying caveat – that CONTEXT has not changed.
What is context?
Well I shall use a variation on the example from our 2008 JMP paper. Suppose I want to fly from London to Miami. I might be doing it for holiday or work. Now mathematical psychologists (choice modellers) would assume the utility associated with zero/one/two/three stops is fixed. Suppose the utilities are mean centred (so zero is probably positive, one might be, whilst two or three are probably negative – who likes American airports?). The attribute “importance weight” is a multiplicative weight applied to these four attribute LEVEL utilities, depending on the context. So, for business it is probably BIG: you don’t want to (can’t) hang around. For a holiday it may be smaller (you can accept less direct routes, particularly if they are cheaper). In any case the weight “stretches” the level utilities away from zero (business) or “squashes” them towards zero (holiday). It’s a flexible model and tells us a lot, potentially, about how context affects the benefits we accrue. However, there’s a problem. We can’t estimate the attribute importance weights from a single dataset – in the same way that we can’t estimate the variance scale parameter from a single dataset.
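To make the “stretch/squash” idea concrete, here is a minimal sketch in Python. Every number is invented for illustration (none is an estimate from any study), the names `level_utils`, `scaled` and `choice_probs` are hypothetical, and the multinomial logit choice rule is an assumption added purely to turn utilities into probabilities.

```python
import math

# Hypothetical mean-centred level utilities for 0, 1, 2, 3 stops (made-up numbers).
level_utils = [1.5, 0.5, -0.5, -1.5]

def scaled(utils, weight):
    """Apply a context-specific importance weight to fixed level utilities."""
    return [weight * u for u in utils]

def choice_probs(utils):
    """Multinomial logit probabilities (an assumed choice rule)."""
    exps = [math.exp(u) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed weights: > 1 stretches utilities away from zero, < 1 squashes them.
business = scaled(level_utils, 2.0)
holiday = scaled(level_utils, 0.5)

# The identification problem: one dataset only reveals weight * utility.
# Doubling the utilities and halving the weight gives identical numbers.
alt_utils = [2 * u for u in level_utils]
same_as_business = scaled(alt_utils, 1.0)

print(business == same_as_business)
print(choice_probs(business)[0] > choice_probs(holiday)[0])
```

Both lines print `True`: the two (weight, utilities) pairs cannot be told apart from the business data alone, and the direct flight is chosen more often when the weight is larger.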
So what do we do?
40 years of research, culminating in Marley, Flynn & Louviere (JMP 2008) established that for discrete choice data we have no choice: we MUST estimate the same attribute levels TWICE or more, VARYING THE CONTEXT. Two datapoints for two unknowns – ring any bells? 😉 So do a DCE where you are always having to consider flying for business, then do a SECOND DCE where you are always having to consider flying for a holiday. You then “net out” the common level utilities and you have the (context dependent) attribute importance weights.
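The “netting out” step can be sketched as follows. This is a deliberately simplified, noise-free illustration and not the estimation procedure from the JMP paper: the estimates are invented, the second attribute (“price”) is added only for illustration, the helper `relative_weight` is hypothetical, and any difference in variance scale between the two datasets is ignored. Each estimate is assumed to be a context weight multiplied by a common level utility.

```python
# Hypothetical estimates from two DCEs with the SAME design (made-up numbers).
# Assumption: estimate = (context-specific attribute weight) * (common level utility).
est_business = {"stops": [3.0, 1.0, -1.0, -3.0], "price": [0.5, 0.0, -0.5]}
est_holiday = {"stops": [0.75, 0.25, -0.25, -0.75], "price": [1.0, 0.0, -1.0]}

def relative_weight(est_a, est_b):
    """Least-squares slope through the origin of est_a on est_b.

    Under the multiplicative model the common level utilities cancel,
    leaving weight_a / weight_b for that attribute.
    """
    num = sum(a * b for a, b in zip(est_a, est_b))
    den = sum(b * b for b in est_b)
    return num / den

# One ratio per attribute: business weight relative to holiday weight.
ratios = {attr: relative_weight(est_business[attr], est_holiday[attr])
          for attr in est_business}
print(ratios)
```

With these made-up numbers, the number of stops carries four times the weight for business that it does for a holiday, while price carries half. Only ratios of weights come out of this; pinning down absolute values needs a normalisation (for example, fixing the holiday weights at 1), which is an assumption of the sketch.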
So how does this shed light on the Case 2/DCE issue?
Well we have two sets of estimates again: one from a Case 2 task asking us “how bad is each impairment” and another asking us “would you trade this for that”. We can, IF THE DESIGN IS THE SAME, do the same “netting out” to work out the person’s decision rule: “extreme pain is awful but I’d rather not be depressed instead, thank you very much”.
Now this is an example where the ordering of those two impairments changed as a result of context: depression became more important when the individual had to trade. That may or may not be possible or important – that’s an empirical matter. But if we intend to move from a DCE (multi-state) framework to a Case 2 (single state) framework we should be checking this, where possible. Now for various ICECAP instruments we didn’t, mainly because:
(1) The sample groups were too vulnerable (unable) to do DCEs (older people, children, etc);
(2) We thought the “relative degrees of badness” idea behind Case 2 was more in tune with the ideas of Sen, who developed the Capability Approach (though he has never been keen on quantitative valuation at all, it does have to be said).
Checks, where possible (Flynn et al, JoCM; Potoglou et al, SSM), also seem to confirm that the more general the outcome (e.g. well-being/health), the less likely it is that the two sets of estimates will deviate in relative magnitude (at least), which is nice.
However, I hope I have convinced people that knowing “how bad this impairment is” and “how bad that impairment is” does not necessarily allow you to say how bad this impairment is relative to that impairment. It’s an empirical matter. Whether you care is a normative issue.
But one thing is for CERTAIN: DON’T CONFOUND DESIGN CHANGES WITH CONTEXT CHANGES. In other words, keep the same design if you are going to compare the two and want to be certain any differences are really due to context and not some unknown mix of design and context. After all, we already know that design affects estimates from both Case 2 and DCE tasks.
EDIT to add copyright note 8 April 2016:
Copyright Terry N Flynn 2016.
This, together with the accompanying blogs, will form a working paper to be submitted to SSRN. Please cite appropriately if referring to these issues in academic papers.