Where next for discrete choice health valuation – part two

Part one of this series on the valuation of health (or quality of life) using discrete choice experiments (DCEs) and their variants concentrated on the tension between the size of the descriptive system and the needs of valuation. In particular, it summarised some disappointing findings in a study using best-worst scaling (BWS) to value the CHU-9D instrument (although I hasten to add that we did successfully elicit a child population tariff!). Now I wish to re-emphasise that no definitive conclusion can be drawn from this as to whether the method, the instrument, or the kids themselves caused the problems. But it does raise issues that should be borne in mind if any ICECAP instrument for child quality of life is produced.

A very timely event this week has allowed me to discuss (in more detail than I would otherwise have done) a second issue that future valuation exercises using discrete choices (DCEs/BWS/ranking) should consider: the design of the valuation exercise.

The timely event was the acceptance (by Pharmacoeconomics) of a paper my colleagues and I wrote on how varying levels of efficiency in a DCE might cause respondents to act differently. The paper is called “Are Efficient Designs Used In Discrete Choice Experiments Too Difficult For Some Respondents? A Case Study Eliciting Preferences for End-Of-Life Care” by T.N. Flynn, Marcel Bilger, Chetna Malhotra and Eric Finkelstein. Two DCE experts thought the paper revolutionary because it was a within-subject, not a between-subject, survey: all respondents answered TWO DCEs, differing in their level of statistical efficiency. Now, why did we do this, and what was the issue in the first place?

The background to this study is as follows. Several years ago, when I worked at CenSoC, UTS, Jordan Louviere called a bunch of us into his office. He had been looking at results from a variety of DCEs for some reason and was puzzled by some marked differences in the types of decision rule (utility function) elicited, depending on the study. Traditionally he was used to up to (approximately) 10% of respondents answering on the basis of a single attribute (lexicographically) – most typically “choose the alternative (profile) with the lowest cost”. Suddenly we were seeing rates of 30% or more. Why were such a substantial minority of respondents suddenly deciding they didn’t want to trade across attributes at all, but wanted to use only one? He realised that this increase in rates began around the time CenSoC had begun to use Street & Burgess designs. For those who don’t know, Street and Burgess were two highly respected statisticians/mathematicians working at UTS who had collaborated with Louviere from around the turn of the millennium in order to increase the efficiency of DCE designs. Higher efficiency means a lower required sample size – precision around utility parameter estimates is improved. It also offered Louviere the tantalising possibility of estimating individual-level utility functions, rather than the sample- or subgroup-level ones that DCEs could previously only manage. (Individual-level “utility” functions had been around in the “conjoint” literature for a while, but these relied on atheoretical methods like rating scales.)
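
As a rough illustration of that efficiency/sample-size trade-off (my own gloss, with invented numbers – the exact relationship depends on the design and the model), relative design efficiency is often used as an approximate inverse scaling factor for the required sample size:

```python
# Rough illustration only: required sample size scales (approximately) inversely
# with relative design efficiency, because statistical information per respondent
# scales with it. All numbers are invented.

n_at_full_efficiency = 300  # hypothetical N needed with a 100% efficient design

for relative_efficiency in (1.00, 0.70, 0.30):
    n_needed = n_at_full_efficiency / relative_efficiency
    print(f"relative efficiency {relative_efficiency:.0%}: roughly {n_needed:.0f} respondents")
# relative efficiency 100%: roughly 300 respondents
# relative efficiency 70%: roughly 429 respondents
# relative efficiency 30%: roughly 1000 respondents
```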

Street and Burgess had begun to provide CenSoC with designs whose efficiency was 100% (or close to it), rather than 30-70%. We loved them and used them practically all the time. In parallel with this, John Rose at the Institute of Transport and Logistics Studies at Sydney University had begun utilising highly efficient designs – though of a different sort. However, what efficient designs have in common – and really what contributes heavily to their efficiency – is a lack of level overlap. This means that if the respondent is presented with a pair of options, each described by five attributes, few, and in many cases none, of those attributes will have the same level in both options. Thus, the respondent has to keep in mind the differences in ALL FIVE ATTRIBUTES at once when making a choice. Now, this might be cognitively difficult. Indeed John Rose, to his immense credit, made abundantly clear early on in a published paper that his designs, although STATISTICALLY EFFICIENT, might not be “COGNITIVELY EFFICIENT”, in that people might find them difficult (pushing up their error variance) or, even worse, use a simplifying heuristic (such as “choose the cheapest option”) in order to get through the DCE. (Shame on us CenSoCers for not reading that paper more closely.) Clearly in the latter case you are getting biased estimates – not only are your parameter estimates biased (in an unknown direction) but the functional form of the utility function for such respondents is wrong. Now John merely hypothesised this problem – he had no empirical data to test it – and recommended that people go and collect data. For many years they didn’t.
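
To make the level-overlap point concrete, here is a minimal sketch (the attributes and levels are invented for illustration and are not taken from any of the designs discussed) that counts how many attributes take the same level in both options of a pair:

```python
# Minimal sketch: counting level overlap in a pairwise choice task.
# The attributes and levels below are purely illustrative.

def level_overlap(option_a: dict, option_b: dict) -> int:
    """Number of attributes taking the same level in both options."""
    return sum(option_a[attr] == option_b[attr] for attr in option_a)

# A highly "efficient" pair: no attribute shares a level, so the respondent
# must weigh differences on all five attributes at once.
efficient_pair = (
    {"pain": "severe", "mobility": "limited", "mood": "low",  "fatigue": "high", "cost": 200},
    {"pain": "mild",   "mobility": "full",    "mood": "good", "fatigue": "low",  "cost": 50},
)

# A pair with some level overlap: two attributes are identical across options,
# so only three differences need to be considered (arguably cognitively easier).
overlapping_pair = (
    {"pain": "severe", "mobility": "limited", "mood": "low",  "fatigue": "high", "cost": 200},
    {"pain": "mild",   "mobility": "limited", "mood": "good", "fatigue": "high", "cost": 50},
)

print(level_overlap(*efficient_pair))    # 0 -> all five attributes differ
print(level_overlap(*overlapping_pair))  # 2 -> only three attributes differ
```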

Hence we went on our merry way, utilising S&B designs, until Louviere spotted the problem and the potential reason for it. The problem was that all the surveys he looked at used ONE DCE – so there was only ONE level of efficiency – meaning he had only between-survey data and couldn’t be certain it was the efficiency that was driving the changes in respondent decision rule: perhaps the surveys with these high rates of uni-attribute decision-making were done in areas where people GENUINELY chose on the basis of a single attribute?

I chatted to him and he realised I was designing a survey in which I had an opportunity to run a within-subject choice experiment. Specifically, if my Singapore-based collaborators agreed, I could administer TWO DCEs to all respondents. Now I am not going to tell all about exactly how we did this but, cutting to the chase, 60% (!) of respondents answered on the basis of a single attribute in the S&B design, yet one third of these (20% of the sample overall) then traded across attributes in a much less efficient design that exhibited some level overlap (making it – arguably – cognitively simpler). Finally, we had within-subject evidence that PEOPLE INTERACT WITH THE DESIGN. That, of course, has serious implications for generalisability, if it proves to be a common problem.
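
Just to illustrate the kind of check involved (a toy sketch with made-up data, not the analysis in the paper), a respondent can be flagged as a “lowest-cost chooser” in a given DCE if every option they picked was the cheapest on offer in its choice set; comparing that flag across the two DCEs answered by the same person is what the within-subject design buys you:

```python
# Illustrative sketch only: flagging respondents whose choices are consistent
# with a single-attribute ("choose the cheapest") rule. Not the paper's method.

def always_chose_cheapest(choice_sets, choices):
    """choice_sets: list of choice sets, each a list of options (dicts with a 'cost' key);
    choices: index of the option chosen in each set."""
    for options, chosen in zip(choice_sets, choices):
        cheapest = min(range(len(options)), key=lambda i: options[i]["cost"])
        if chosen != cheapest:
            return False
    return True

# Hypothetical data for two DCEs answered by the same respondent.
dce_efficient = [  # no level overlap
    [{"cost": 200, "pain": "severe"}, {"cost": 50, "pain": "mild"}],
    [{"cost": 80, "pain": "moderate"}, {"cost": 150, "pain": "none"}],
]
answers_efficient = [1, 0]   # always picked the cheaper option

dce_overlap = [    # some level overlap
    [{"cost": 200, "pain": "mild"}, {"cost": 50, "pain": "severe"}],
    [{"cost": 80, "pain": "none"}, {"cost": 150, "pain": "none"}],
]
answers_overlap = [0, 0]     # now trades cost against pain

print(always_chose_cheapest(dce_efficient, answers_efficient))  # True
print(always_chose_cheapest(dce_overlap, answers_overlap))      # False
```

Of course, a between-survey version of this check cannot separate a design-induced heuristic from a genuine cost-minimiser – which is exactly the confound described above, and exactly why the within-subject comparison matters.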

Why is this an issue for future valuation exercises? Well, I have seen presentations from researchers who used highly efficient designs in DCEs to get a tariff for health instruments. Essentially the choices on offer are time trade-off (TTO) types, where both quality of life and length of life differ. Now although the TTO is (probably) easier than the Standard Gamble, it is still a hard thing to get your head around if you have any cognitive impairment or are in a vulnerable group. So we probably don’t want to make things even more difficult than they already are.

This, of course, creates headaches for researchers: if we reduce efficiency to make the task easier then the required sample sizes will go up. We may have more limited ability to identify heterogeneity or estimate individual-level models. But, as usual, I believe we are in a world of second best, so compromises may have to be made. One compromise would be to cut out the length-of-life attribute from the DCE altogether and use a TTO to rescale the health values estimated from an easier task like BWS Case 2 – as I advocated a number of years ago as a “second best” valuation option. Not ideal, but it would do the job. Other ideas include helping respondents get familiar with the options on offer through the use of other choice tasks (again, Case 2), something we have done in a recent study. In any case, if the design issue proves common – and my gut feeling, given the complexity of decision-making in health, is that it will – we will need to be imaginative with our designs.
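
To show, purely schematically, what that “second best” rescaling might involve: suppose latent values from a BWS Case 2 model sit on an arbitrary scale and a TTO exercise provides the value of a single anchor state on the conventional full-health-equals-one scale. A linear mapping can then pin full health at 1 and the anchor at its TTO value (all numbers below are invented):

```python
# Schematic sketch of anchoring latent BWS values onto the QALY scale.
# All numbers are invented for illustration; this is not a recommended tariff.

# Latent values from (say) a BWS Case 2 model, on an arbitrary scale,
# with full health having the highest latent value.
latent = {"full_health": 2.0, "moderate_problems": 0.8, "severe_problems": -1.5}

# Suppose a TTO exercise valued "severe_problems" at 0.25 on the 0-1 QALY scale.
anchor_state, anchor_tto = "severe_problems", 0.25

# Linear rescaling: full health -> 1, anchor state -> its TTO value.
top = latent["full_health"]
scale = (1.0 - anchor_tto) / (top - latent[anchor_state])
qaly_values = {state: 1.0 - scale * (top - v) for state, v in latent.items()}

print(qaly_values)
# approximately {'full_health': 1.0, 'moderate_problems': 0.74, 'severe_problems': 0.25}
```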

A final issue that broadly comes under the design “umbrella” concerns sampling. One of the chapters in the BWS book utilised sampling criteria that deliberately avoided making the sample representative of the wider population. Why would we do that? Well, when you have a limited dependent variable model with discrete outcomes (rather than the continuous outcomes of the TTO/SG/VAS), characterising the heterogeneity correctly becomes absolutely crucial: if there is heteroscedasticity, the estimates won’t simply be inefficient, but BIASED. BIG problem. If, say, depressed people have different health preferences and choice consistency from non-depressed people, but you don’t have enough depressed people in your sample to spot this and you mix them in with the others in estimation, you have the WRONG POPULATION TARIFF.

So (and I have said this before in published papers), EVEN if you want a population tariff, to work within the traditional extra-welfarist paradigm, you still have to get the heterogeneity right – you must probably OVERSAMPLE those groups you suspect of having different preferences. Then, when you have estimated the tariffs for the different groups, you must RE-WEIGHT to reflect the population distribution. THAT is the way to get the true population tariff. Of course, if people in various different health states do not differ from the “average member of the population” in their health preferences, the problem goes away. The problem, as the chapter in the book shows, is that (at least for well-being) people with impairments DO have different preferences: those impaired on attribute x desire to improve on attribute x, whilst those impaired on attribute y switch to wanting attribute z to compensate. The picture is extremely complex. So you should be using what we call “quota sampling” – making sure you have enough people in various key impaired states to estimate models for those subgroups. So survey design is a lot more complicated when you ditch the TTO/SG/VAS.
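
The oversample-then-re-weight logic can be shown with a toy calculation (the preference values and shares below are made up; the point is the arithmetic, not the numbers):

```python
# Toy illustration of quota sampling followed by population re-weighting.
# Values and shares are invented.

# Value of a given health state estimated separately in each subgroup
# (subgroups were deliberately oversampled so each model is estimable).
subgroup_value = {"not_depressed": 0.60, "depressed": 0.35}

# Shares in the *sample* (oversampled) vs the *population* (what we weight by).
sample_share     = {"not_depressed": 0.50, "depressed": 0.50}
population_share = {"not_depressed": 0.90, "depressed": 0.10}

naive    = sum(subgroup_value[g] * sample_share[g]     for g in subgroup_value)
weighted = sum(subgroup_value[g] * population_share[g] for g in subgroup_value)

print(round(naive, 3))     # 0.475 -> reflects the deliberately distorted sample mix
print(round(weighted, 3))  # 0.575 -> population tariff after re-weighting
```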

I don’t mean to sound glass-half-empty regarding design. Leonie Burgess, when presented with the implications of her designs, was fascinated and saw it as an opportunity (to change and improve the models) rather than a problem. I see it this way too. Things will get interesting (again) in DCEs in the coming years as we find out what we need to do in the design field to ensure we get unbiased tariffs for use in decision-making.

Although the third and final blog in this series (to appear next week) may seem superficially similar to the first one – I will discuss the size of the descriptive system – I will write in more detail about the process of constructing the instrument, both qualitative and quantitative*, and offer recommendations that may help alleviate the tension mentioned in the first blog.

*Discussing in more detail some constructive comments I had from Professor David Parkin.

Copyright Terry N Flynn 2015.

This, together with the accompanying blogs, will form a working paper to be submitted to SSRN. Please cite appropriately if referring to these issues in academic papers.