Tag Archives: valuation

BWS neither friend nor foe

This post replies to some requests I have had asking me to respond to a paper concluding that DCEs are better than BWS for health state valuation. To be honest I am loath to respond, for reasons that will become apparent.

First of all, let me clarify one thing that people might not appreciate – I most definitely do not want to “evangelise” for BWS and it is not the solution in quite a few circumstances. (See the papers coming out from the CHU-9D child health valuation study I was involved with for starters – BWS was effectively a waste of resources in the end….”best” choices were all we could use for the tariff.)

I only really pushed BWS strongly in my early days as a postdoc, when I wanted to make a name for myself. If you read my papers since 2007 (*all* of them) you’ll see the numerous caveats appear with increasing frequency. And that’s before we even get to the BWS book, where we devote an entire chapter to discussing unresolved issues, including the REAL weaknesses of and research areas for BWS (as opposed to the straw men I have been seeing in recent literature).

OK, now that’s out of the way, I will lay some other cards on the table, many of which are well known since I’ve not exactly been quiet about them. I had mental health issues associated with my exit from academia. I’m back on my feet now, doing private sector work for very appreciative clients, but that doesn’t mean I want to go back and fight old battles….battles which I erroneously thought we three book authors had “won” by passing muster with the top mathematical psychologists, economists and others in the world during peer review. When you publish a paper in the Journal of Mathematical Psychology (the JHE of that field) back in 2008 illustrating a key feature/potential weakness of a DCE (or specifically Case 2 BWS), you tend to expect that papers published in 2016 would not ignore this, would not do research showing zero awareness of the issue, and would not make fundamental errors as a result. After all, whilst we know clinical trials take a while to go from proposal to main publication, preference studies do NOT take 8+ years to go through this process: I co-ran a BWS study from conceptualisation to results presentation in six days when in Sydney. Go figure.

So that’s an example of my biggest frustration – the standards of literature review have often been appalling. Two or three of my papers (ironically including the JHE one, which includes a whopping error that I myself have repeatedly flagged up and which I corrected in my 2008 BMC paper) seem to get inserted as “the obligatory BWS reference to satisfy referees/editors” and in many cases bear no relation to the point being made by the authors. Alarm bells immediately ring when I read an abstract via a citation alert and see those were my references. But it keeps happening. Not good practice, folks.

In fact (and at a recent meeting someone with no connection to me said the same thing), in certain areas of patient outcomes research the industry reviews are considered far better than the academic ones – they have to be, or they get laughed out of court.

Anyway, I have been told that good practice eventually drives out bad. Sorry, if that’s true, the timescale was simply too long for me, which didn’t help my career in academia and raised my blood pressure.

Returning to the issue at hand. I’m not going to go through the paper in question, nor the several others that have appeared in the last couple of years purporting to show limitations of BWS. I have a company to run, caring obligations and I’ve written more than enough for anyone to join the dots here if they do a proper literature review. My final attempt to help out was an SSRN paper. But that’s it – without some give and take from the wider community, my most imaginative BWS work will be for clients who put food on the table and who pay – sometimes quite handsomely – for a method that when properly applied shows amazing predictive ability together with insights into how humans make decisions.

Now, of course, health state valuation is another kettle of fish – no revealed preference data, etc. However, Tony, Jordan and I discussed why “context” is key in 2008 (JMP); I expounded on this with reference to QALYs in my two 2010 single-authored papers, and published an (underpowered) comparison in the 2013 JoCM paper (which I first presented at the 2011 ICMC conference in Leeds, getting constructive criticism from the top choice modellers on Earth). So this issue is not particularly new.

It’s rather poor that nobody has actually used the right design to compare Case 2 BWS with DCEs for health state valuation… I ended up deciding “if you want something done properly you have to do it yourself” and I am very grateful to the EuroQol Foundation for funding such a study, which I am currently analysing with collaborators. I don’t really “have a dog in this fight”: if Case 2 proves useful then great, and if not then at least I will know exactly why not… and the reasons will have nothing to do with the “BWS is bad m’kayyyyy” papers published recently. (To be fair, I am sometimes limited in what I can access – no longer having an academic affiliation means full texts are sometimes unavailable – but when there’s NO mention of attribute importance in the abstract, NOR of why efficient designs are problematic for Case 2, my Bayesian estimate is a 99.99% probability that the paper is fundamentally flawed and couldn’t possibly rule BWS in or out as a viable competitor to a DCE.)

If you’d like to know more:

  • Read the book
  • Read all the articles – my google scholar profile is up to date
  • Get up to speed on the issues in discrete choice design theory – fast. Efficient designs are in many, many instances extremely good (and I’ve used them) but you need to know exactly why, in a Case 2 context, they are inappropriate.

If you still don’t understand, get your institution to contract me to run an exec education course. When I’m not working, I’m not earning, full stop.

I’m now far more pragmatic about the pros and cons of academia and really didn’t want to be the archetypal “I’m leaving social media now” whinger. And I’m not leaving. But I am re-prioritising things. Sorry if this sounds harsh/unhelpful – I didn’t want to write this post and hoped to quietly slip beneath the radar, popping up whenever something insightful based on one of BWS’s REAL disadvantages, or on Sen’s work etc., was mentioned. But people I respect have asked for guidance. So I am giving what I can in the 10 minutes of free time I have.

Just trying to end on a positive note – I gave a great exec education course recently. It was a pleasure to engage with people who asked questions that were pertinent to the limitations of BWS and who just wanted to use the right tool for the right job. That’s what I try to do and what we should all aim for. I take my hat off to them all.

Does age bring wisdom?

Today’s blog entry will discuss a philosophical issue that has implications for the choice of ICECAP instrument. These views are purely my own, not those of any other individual involved in the development of any of the ICECAP instruments. The issue concerns the difference in attributes (dimensions/domains) between ICECAP-O (technically designed only for older people) and ICECAP-A (designed for adults of any age).

Now, in many ways the issue is moot: there is considerable overlap in the underlying conceptual attributes and three of the five are pretty much the same. Whilst published work showed ICECAP-O working well amongst older people in a large survey conducted in Bristol, UK, unpublished follow-on work showed it working equally well among younger adults. Furthermore, a national online valuation exercise plus survey conducted in Australia (published in the BWS book) showed it working well there among adults of all ages too.

However, the other two attributes (doing things that make you feel valued, and concerns about the future, in ICECAP-O vs achievement & progress, and feeling settled & secure, in ICECAP-A) are arguably different. This might reflect the priorities of different generations: younger generations may feel a great need to achieve and progress – the idea of “moving forward” may be driving this (particularly since we have many more people working in that group). ICECAP-O, on the other hand, stresses the act of doing things that make you feel valued in life, which (to me) does not necessarily imply “moving forward” (though my personal career changes may have coloured my views!).

Likewise in ICECAP-A feeling settled and secure may reflect current younger generations’ feelings of instability in a world of zero-hour contracts etc. ICECAP-O asks instead about “concerns about the future”. Whilst it might be seen merely as the ICECAP-A question “flipped”, it is phrased with respect to the amount of concern overall, unlike ICECAP-A which is phrased with respect to how many areas of life – there’s a subtle difference there. To illustrate, I will simply pose a question. If you otherwise have a very good quality of life, can you still have a lot of concern about the future? I’d argue yes. Now let’s think about ICECAP-A. If you otherwise have a very good quality of life, can you feel settled and secure in only a few areas of life? Playing devil’s advocate, it could be argued that “this respondent has already said they’re doing well on the other four attributes of quality of life – they have a lot of capability to achieve the levels they want – so how can they feel unsettled in key attributes too?”

Ultimately these are empirical issues, requiring researchers to look at correlation matrices of actual answers. In the Australian survey, 5002 people were randomised to either ICECAP-O or ICECAP-A. The Spearman rank correlation coefficients of the respondents’ five tickbox answers were uniformly higher for any given pair of attributes in ICECAP-A compared to their ICECAP-O equivalent. However, a big caveat here is that the ICECAP-O arm was a properly done valuation exercise in which quota sampling – on the basis of respondents’ own ICECAP-O tickbox answers – was used. There were no previous ICECAP-A data on which to base quotas. Thus this is not a like-for-like comparison, and ICECAP-O therefore had an artificial advantage. Using the Bristol adult ICECAP-O data (to correct, somewhat, for this) caused four of the ten pairwise correlations to be smaller for ICECAP-A, but two of these were for attributes common to both instruments. Comparisons among groups from the same country, using the same sampling, are therefore required before firm conclusions can be drawn.
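For anyone wanting to run this kind of check themselves, a minimal sketch of the pairwise Spearman comparison – using fabricated tickbox answers and purely illustrative attribute labels, not the Australian data – might look like this:

```python
from itertools import combinations

def ranks(xs):
    # assign average ranks, handling ties
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman rho = Pearson correlation of the ranks
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# fabricated tickbox answers (levels 1-4), one column per attribute
answers = {
    "attachment": [4, 3, 4, 2, 3, 4],
    "security":   [3, 3, 4, 2, 2, 4],
    "role":       [4, 2, 3, 2, 3, 3],
    "enjoyment":  [4, 3, 4, 3, 3, 4],
    "control":    [3, 2, 4, 2, 3, 4],
}
for a, b in combinations(answers, 2):
    print(f"{a} vs {b}: {spearman(answers[a], answers[b]):.2f}")
```

Swapping in a second instrument’s answers and comparing the two sets of printed coefficients is then straightforward.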

Finally, it is worth considering the philosophy here and I’ll raise a final point. OK it seems that adding younger adults to the valuation sample has changed at least one and arguably two attributes. It raises the normative question of whether we should use these attributes in valuing their quality of life when they haven’t, by definition, lived a long life: perhaps age brings wisdom and it is the older people who “know what’s best for you”. Most people experience regret at some point and our “values” (defined both conceptually – the attributes themselves – and numerically – the tariff) can change with experience.

Of course, using a single ICECAP instrument – ICECAP-O, if one were persuaded of the above philosophical argument – would make things nicer and easier when it comes to “a single common denominator for valuation”. But if, like me, you are keen for greater investigation of (and possibly use of) individual valuation, could we justify using ICECAP-O scoring for a 30-year-old, which may downweight “doing things that make you feel valued” because that person is actually more interested in achieving things and progressing (forward?) in life? On the other hand, knowing the key conceptual attributes of ICECAP-A, maybe stressing in the intro for ICECAP-O that “doing things that make you feel valued” can easily encompass “achieving and progressing in life” is a practical solution?

Another empirical issue!

And so we come full circle to whether practical solutions, or stricter ones fitting some theory, are the way forward. As usual in health economics, normative issues galore.

Where next for discrete choice health valuation – part four

Why Case 2 Best-Worst Scaling is NOT the same as a traditional choice experiment

This post is prompted by both a funding submission and a paper. My intention is to explain why Case 2 BWS (the profile case – asking respondents to choose the best and worst attribute levels in a SINGLE profile/state) does NOT necessarily give you the same answers as a discrete choice experiment (asking you to choose ONLY between WHOLE profiles/states). (This might not be clear from the book – CUP 2015).

So, an example. In a Case 2 BWS question concerning health you might choose “extreme pain” as the worst attribute level and “no anxiety/depression” as the best attribute level. You considered a SINGLE health state.

In a DCE (or Case 3 BWS) you might have to choose between THAT state and another that had “no pain” and “extreme anxiety/depression” as two of the attribute levels.  All other attribute levels remain the same.

Now common sense suggests that if you preferred no depression in the Case 2 task, then you would also choose the state with no depression in the DCE task. Unfortunately, common sense might be wrong.


Why? Well, it comes down to trade-offs – as economics usually does. Case 2 essentially puts ALL attribute levels on an interval or ratio scale – a rating scale – BUT it does it PROPERLY, unlike a traditional rating scale: the positions have known, mathematical properties (you choose “this” over “that” x% of the time).
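To make the “known mathematical properties” point concrete: under a simple binary-logit (Bradley–Terry-style) choice rule – one common assumption, used here purely for illustration – an observed choice percentage maps directly to a distance between the two positions on the underlying scale:

```python
import math

def scale_separation(p_best):
    """Distance between two attribute levels on a logit scale, given the
    probability that one is chosen over the other (assumes a simple
    binary-logit / Bradley-Terry choice rule; illustrative only)."""
    return math.log(p_best / (1 - p_best))

# if "no anxiety/depression" were picked over "moderate pain" 73% of
# the time, the implied gap on the common scale would be:
print(round(scale_separation(0.73), 2))  # 0.99
```

A 50/50 split implies zero separation; the further the percentage is from 50%, the further apart the two levels sit on the scale.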

DCEs (or Case 3 BWS) don’t do exactly that. They estimate “with what probability of impairment x would you trade to improve impairment y”. Thus they naturally consider choices BETWEEN whole health states, forcing you to think “that’s bad, really bad, but how bad would this other aspect of life have to be to make me indifferent between the two”. And the answer might be different.

Now Tony Marley has shown that under mild conditions the estimates obtained from the two should be linearly related. But there is one stronger, underlying caveat – that CONTEXT has not changed.

What is context?

Well, I shall use a variation on the example from our 2008 JMP paper. Suppose I want to fly from London to Miami. I might be doing it for holiday or work. Now mathematical psychologists (choice modellers) would assume the utility associated with zero/one/two/three stops is fixed. Suppose the utilities are mean centred (so zero is probably positive, one might be, whilst two or three are probably negative – who likes American airports?). The attribute “importance weight” is a multiplicative weight applied to these four attribute LEVEL utilities, depending on the context. So, for business it is probably BIG: you don’t want to (can’t) hang around. For a holiday it may be smaller (you can accept less direct routes, particularly if they are cheaper). In any case the weight “stretches” the level utilities away from zero (business) or “squashes” them towards zero (holiday). It’s a flexible model and tells us a lot, potentially, about how context affects the benefits we accrue. However, there’s a problem: we can’t estimate the attribute importance weights from a single dataset – in the same way that we can’t estimate the variance scale parameter from a single dataset.

So what do we do?

40 years of research, culminating in Marley, Flynn & Louviere (JMP 2008) established that for discrete choice data we have no choice: we MUST estimate the same attribute levels TWICE or more, VARYING THE CONTEXT. Two datapoints for two unknowns – ring any bells? 😉 So do a DCE where you are always having to consider flying for business, then do a SECOND DCE where you are always having to consider flying for a holiday. You then “net out” the common level utilities and you have the (context dependent) attribute importance weights.
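A toy version of the “netting out” step, with fabricated estimates (all the numbers, and the resulting 0.5 weight, are invented purely to show the mechanics):

```python
# hypothetical mean-centred level utilities for "number of stops",
# estimated separately from a business-context DCE and a holiday-context DCE
business = {0: 1.8, 1: 0.6, 2: -0.9, 3: -1.5}
holiday  = {0: 0.9, 1: 0.3, 2: -0.45, 3: -0.75}

# under u_context(level) = w_context * v(level), dividing the two sets of
# estimates nets out the common level utilities v(level), leaving the
# relative attribute importance weight (business normalised to 1):
ratios = [holiday[k] / business[k] for k in business]
w_holiday = sum(ratios) / len(ratios)
print(w_holiday)  # 0.5 - stops matter half as much on holiday (made-up data)
```

In real data the ratios would not be identical across levels, which is exactly why two (or more) datasets varying the context are needed to separate weights from level utilities.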

So how does this shed light on the Case 2/DCE issue?

Well we have two sets of estimates again: one from a Case 2 task asking us “how bad is each impairment” and another asking us “would you trade this for that”. We can, IF THE DESIGN IS THE SAME, do the same “netting out” to work out the person’s decision rule: “extreme pain is awful but I’d rather not be depressed instead, thank you very much”.

Now this is an example where the ordering of those two impairments changed as a result of context: depression became more important when the individual had to trade. That may or may not be possible or important – that’s an empirical matter. But if we are intending to move from a DCE (multi-state) framework to a Case 2 (single-state) framework we should be checking this, where possible. Now for various ICECAP instruments we didn’t, mainly because:

(1) The sample groups were too vulnerable (unable) to do DCEs (older people, children, etc);

(2) We thought the “relative degrees of badness” idea behind Case 2 was more in tune with the ideas of Sen, who developed the Capability Approach (though he has never been keen on quantitative valuation at all, it does have to be said).

Checks, where possible (Flynn et al in JoCM; Potoglou et al in SSM), also seem to confirm that the more general the outcome (e.g. well-being/health), the less likely it is that the two sets of estimates will deviate in relative magnitude (at least), which is nice.

However, I hope I have convinced people that knowing “how bad this impairment is” and “how bad that impairment is” does not necessarily allow you to say how bad this impairment is relative to that impairment. It’s an empirical matter. Whether you care is a normative issue.

But one thing is for CERTAIN: DON’T CONFOUND DESIGN CHANGES WITH CONTEXT CHANGES. In other words, keep the same design if you are going to compare the two, if you want to be certain any differences are really due to context and not some unknown mix of design and context. After all, we already know that design affects estimates from both Case 2 and DCE tasks.

EDIT to add copyright note 8 April 2016:

Copyright Terry N Flynn 2016.

This, together with the accompanying blogs, will form a working paper to be submitted to SSRN. Please cite appropriately if referring to these issues in academic papers.


part 4 on where next for DCEs to come soon

Normal service is gradually being renewed after a bunch of stuff slowed me down.

Part four, to be written at the latest over Easter, concerns confusion over the valuation of health/well-being states using Case 2 Best-Worst Scaling compared with “more traditional DCEs”. Basically there are differences, conceptual and potentially empirical, that should be recognised. Depending on what you are valuing, and how much you want representativeness, the differences may not really matter; but if you find differences it does NOT necessarily mean at least one method is wrong. Which, unfortunately, is what one paper out there wrote. So I’ll try to put the record straight ASAP.

Where next for discrete choice health valuation – part one


My final academic obligations concerned two projects involving valuation of quality of life/heath states. Interestingly, they involved people at opposite ends of the age spectrum – children and people at the end of life. (Incidentally I am glad that the projects happened to be ending at the time I exited academia, so I didn’t leave the project teams in the lurch!)

These projects have thrown up lots of interesting issues, as did my “first generation” of valuation studies (the ICECAP-O and –A quality of life/well-being instruments). This blog entry will be the first of three to summarise these issues and lay out some ideas for how to go forward with future valuation studies and, in particular, construction of new descriptive systems for health or quality of life. In time they will be edited and combined to form a working paper to be hosted on SSRN. The issue to be addressed in this first blog concerns the descriptive system – its size and how it can/cannot be valued.

The size of the descriptive system became pertinent when we valued the CHU-9D instrument for child health. More specifically, an issue arose concerning the ability of some children to do Best-Worst Scaling tasks for the CHU-9D. The project found that we could only use the “best” (first best, actually) data for reporting. This is not secret: I, and other members of the project team, are reporting this at various conferences over the coming year. I may well be first, at the International Academy of Health Preference Research conference in St Louis, USA, in a few weeks. We knew from a pilot study that children exhibited much larger rates of inconsistency in their “worst” choices than their “best”: the plot of best vs worst frequencies had a bloody big part of the inverse relationship curve missing! (This was the first time I had seen this.)



Best vs worst frequencies of the type in the child health study

When you plot the best choice frequency against the worst choice frequency of each attribute level you should see an approximately inverse relationship. After all, an attractive attribute level should be chosen frequently as best and infrequently as worst; an unattractive attribute level should be chosen frequently as worst and infrequently as best. Yet in the child health study, the unattractive attribute levels (low levels of the 9 attributes), although showing the expected small “best” frequencies, did not show large “worst” frequencies: they were all clustered together around a low worst frequency. The kids seemed to choose pretty randomly when it came to the “worst” choices – particularly bad attribute levels were NOT chosen as worst any more often than moderately bad ones – which is why part of the “inverse relationship” curve was missing. It led us to make a big effort to get a lot of worst data (two rounds) and to make the task easy (by structuring it, with greying out of previously chosen options). Unfortunately, it didn’t really work.
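As a toy illustration of the diagnostic (fabricated responses, nothing to do with the actual CHU-9D data), tabulating best and worst frequencies per attribute level is enough to spot a missing part of the curve:

```python
from collections import Counter

# hypothetical Case 2 answers: (best level, worst level) chosen in each task
responses = [
    ("no pain", "extreme sadness"), ("no pain", "extreme pain"),
    ("no worries", "extreme pain"), ("no pain", "extreme sadness"),
    ("no worries", "some pain"),    ("no pain", "extreme pain"),
]
best = Counter(b for b, _ in responses)
worst = Counter(w for _, w in responses)
for level in sorted(set(best) | set(worst)):
    print(f"{level}: best={best[level]}, worst={worst[level]}")

# in a well-behaved sample, high best counts pair with low worst counts
# and vice versa; flat worst counts across clearly bad levels (as in the
# child data) are what break the inverse pattern
```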

I stress that despite my deliberately controversial title for the IAHPR conference, we CANNOT know if it was (1) the valuation method (BWS), (2) the descriptive system (CHU-9D) or (3) just plain respondent lack of knowledge that caused kids to be unable to decide what was worst about aspects of poor health.

(1) could be true if kids IN GENERAL don’t think about the bad things in life; (2) could be true if the number of attributes and levels was too large – the CHU-9D has 9 attributes, each with 5 levels, which is the largest instrument I have ever valued in a single exercise (I was involved in the ASCOT exercise which split the instrument in two); (3) could be true if kids can do “worst” tasks, but in general they just can’t comprehend poor health states (since kids from the general population are mostly highly unlikely to have experienced or even thought about them).

In the main study I hoped that “structured BWS” eliciting four of the nine ranks in a Case 2 BWS study would help the kids. More specifically:

(1) They answered best

(2) Their best option was then “greyed out” and they answered worst

(3) This was in turn greyed out and they answered next (second) best

(4) Which was in turn greyed out and they answered next (second) worst.

This in theory gave us four of the nine ranks (1, 2, 8, 9). It was particularly useful because it enabled us to test the (often blindly made) assumption that the rank-ordered logit model gives you utility function estimates that are “the same” no matter what ranking depth (top/bottom/etc.) you use data from. Unfortunately our data failed this test quite spectacularly – only the first best data really gave sensible answers. So the pilot results were correct: for some reason in this study, kids’ worst choices were duff. (Even their second best data were not very good.)
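For readers unfamiliar with how those four observed ranks feed a rank-ordered (“exploded”) logit, here is a minimal sketch of the data expansion – illustrative only, not the project’s actual estimation code:

```python
# each structured BWS answer is expanded into pseudo choice-set
# observations: best from all 9 levels, worst from the remaining 8,
# second best from the remaining 7, second worst from the remaining 6
def explode(items, best1, worst1, best2, worst2):
    obs, remaining = [], list(items)
    for chosen, kind in [(best1, "best"), (worst1, "worst"),
                         (best2, "best"), (worst2, "worst")]:
        obs.append({"choice": chosen, "set": list(remaining), "type": kind})
        remaining.remove(chosen)
    return obs

items = [f"level_{i}" for i in range(1, 10)]  # nine hypothetical levels
for o in explode(items, "level_1", "level_9", "level_2", "level_8"):
    print(o["type"], o["choice"], "from", len(o["set"]), "options")
```

The consistency test is then to fit the model separately to the “best” and “worst” pseudo-observations and compare the resulting estimates.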

Of course, as I mentioned, we don’t know the reason why this was the case, so we must proceed with caution before making controversial statements about how well BWS works among kids (ahem, cough, cough…)

But given the mooted idea to devise an “ICECAP for kids”, we should bear in mind the CHU-9D findings when constructing the descriptive system. I certainly don’t want to criticise the very comprehensive and well-conducted qualitative work done by Sheffield researchers to construct the CHU-9D. I merely pose some questions for future research to develop an “ICECAP for kids instrument” which may cause a tension between the needs of the descriptive system and the needs of the valuation exercise.

Would an ICECAP for kids really need 5^9=1953125 profiles (quality of life states) to describe child quality of life (as the CHU-9D did for health)?

My personal view is that too much of the health economics establishment may be thinking in terms of psychometrics, which (taking the SF-36 as the exemplar) typically concentrates on the number of items (questions/dimensions/attributes). A random utility theory based approach concentrates on the number of PROFILES (health/quality of life states). This causes the researcher to focus more on the combination of attributes and levels. When the system is multiplicative (as in a DCE), the number of “outcomes” becomes large VERY quickly.

Thus, some people are missing the point when they express concern at the small number of questions (five) in the ICECAP-O and -A. In fact there are 4^5 possible outcomes (states) – and moreover, of the 1024 possible ICECAP states, over 200 ICECAP-O ones are observed in a typical British city. That makes the instrument potentially EXTREMELY sensitive. So I would end with a plea: think about the number of profiles (states), not the number of attributes. Can attributes be combined? That, and the statistics/intuition behind it, will be the subject of the second blog in this series.
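The arithmetic behind that plea is worth making concrete (a trivial sketch):

```python
# number of distinct profiles (states) a descriptive system implies:
# levels raised to the power of the number of attributes
def n_profiles(levels, attributes):
    return levels ** attributes

chu9d = n_profiles(5, 9)   # CHU-9D: 9 attributes, 5 levels each
icecap = n_profiles(4, 5)  # ICECAP-O / ICECAP-A: 5 attributes, 4 levels each
print(chu9d)   # 1953125
print(icecap)  # 1024
```

Adding one attribute, or one level per attribute, multiplies the profile count rather than adding to it – which is why the system gets large VERY quickly.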


Copyright Terry N Flynn 2015.

This, together with the accompanying blogs, will form a working paper to be submitted to SSRN. Please cite appropriately if referring to these issues in academic papers.