Where next for discrete choice health valuation – part three

This is the last part of a series of three pieces on the valuation of health/well-being using discrete choice tasks. It should be borne in mind that this covers several stated preference (SP) methods, including discrete choice experiments (DCEs), best-worst scaling (BWS) and ranking studies. I discussed the tension between the needs of the descriptive system and valuation in part one, whilst part two dealt with issues in design.

In this blog entry I wish to return to the descriptive system of generic health/well-being (quality of life) instruments. In some ways it is an extension of the first part: I will describe the construction of the descriptive system. However, I will propose quantitative techniques for use in the construction of the descriptive system itself – something that to date has been left almost entirely to the qualitative members of the teams I have worked in.

I stress at the outset that I do not wish to downplay the importance of the work that qualitative colleagues do and have done for the ICECAP instruments. Garbage in, garbage out: you MUST have a good descriptive system if your tariff is going to be any good whatsoever. Indeed, I was a co-author on a paper that described in detail the advantages and disadvantages of various qualitative techniques for constructing instruments for use in DCEs. I merely wish to discuss how quantitative analyses/techniques – and in particular, DCEs – can aid in the interpretation of the qualitative data, and head off some problems we have encountered at the refereeing stage when validating the estimated (and published) tariff. This is actually an extension of something we have already done for the ICECAP supportive care measure (ICECAP-SCM) for end-of-life care: we used DCEs to help us decide how many levels (four or five) the measure should include. (In fact, the DCE helped us decide on four levels – a view that the qualitative team members had been leaning towards anyway, but were unsure whether their data could robustly justify.)

So how might DCEs and other quantitative techniques be used in other stages of qualitative work to construct the instrument itself? I have been forming the view, over several years of constructing ICECAP instruments, that DCEs could help test wording, whilst more conventional techniques like checking the correlation matrix could help with minimising the number of “wasted” profiles.

In terms of wording, I have had two thoughts: one that is probably quite novel to much of the health economics community and one which is not. The first concerns the use of Case 1 BWS (the “Object Case”) in checking potential statements (attributes/dimensions). Case 1 was the last of the three cases to be introduced to health care (but the first to be published at all, by Jordan Louviere and Adam Finn in 1992). It is a remarkably easy DCE to run. The choice items are not attribute levels as in Case 2, nor entire profiles as in Case 3, but typically statements, policy goals and the like, with no attribute-and-level structure to them. In this instance we could test which wording of a concept is most salient (and which is least salient) to respondents. Here is a simplistic example to illustrate how a choice task might look – the wider DCE, with a number of such choice tasks, is intended to be administered to an interviewee from the qualitative research:

Which of the following statements do you think most clearly describes the things you talked about related to “feeling safe” and which do you think describes it least clearly?

Most    Statement                                   Least
[ ]     I want to feel secure                       [ ]
[ ]     I want to feel protected from harm          [ ]
[ ]     I want to feel comfortable all the time     [ ]

Now, in addition to the adjectives used to define safe (which could itself be a statement), the researcher could vary the word “want” (“wish”) and other aspects of the sentence structure. There would probably need to be in the region of four to seven statements tested, and a separate Case 1 BWS DCE would be required for each concept. Thus, we are definitely talking about a follow-up interview which might be as long as the initial interview used to elicit the concepts themselves. However, it could be valuable in testing different words in terms of their salience to respondents, particularly if used in conjunction with think-aloud techniques and a verbal debrief.
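To make concrete how such a Case 1 BWS exercise might be scored, here is a minimal sketch (in Python) of the simple best-minus-worst counting approach for a single respondent. The statements and responses are hypothetical, and a real study would typically also fit a proper choice model:

```python
from collections import Counter

def best_minus_worst(responses, statements):
    """Tally simple best-minus-worst scores for a Case 1 BWS respondent.

    responses: list of (best, worst) statement pairs, one per choice set.
    statements: the full list of statements shown to the respondent.
    """
    best = Counter(b for b, _ in responses)
    worst = Counter(w for _, w in responses)
    # Score = times chosen best minus times chosen worst
    return {s: best[s] - worst[s] for s in statements}

statements = [
    "I want to feel secure",
    "I want to feel protected from harm",
    "I want to feel comfortable all the time",
]
# Hypothetical answers from three choice sets in one interview
responses = [
    ("I want to feel secure", "I want to feel comfortable all the time"),
    ("I want to feel secure", "I want to feel protected from harm"),
    ("I want to feel protected from harm", "I want to feel comfortable all the time"),
]
print(best_minus_worst(responses, statements))
```

Simple counts like these are often a good approximation to model-based importance scores, and they are what could be shown back to the interviewee immediately after the last choice set.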

You might wonder how the interviewer could know the full DCE results in order to conduct the debrief and obtain an understanding of why certain statements are not salient to the interviewee. Well, Case 1 BWS is nice because, if the DCE is administered on a laptop or tablet, there is a program that can bring up the importance scores instantly after the respondent finishes the last choice set. We used this in an end-of-life care pilot and the older people loved having their results; we were also able to show them how common their views were and how they compared to those of the wider population. It turns out they are not so different from the younger Facebook generation!

Another use of DCEs will no doubt be more familiar to this audience: testing a pilot of the actual DCE/BWS study. Again, real-time scores could be made available if a Case 2 study (as in all the ICECAP valuation studies to date) is used (though we haven’t yet programmed a Case 2 to do that). This could identify problems an individual has with the instrument once the draft attributes are actually put together to form a profile. Perhaps the pilot wording of the “safety” attribute, once in the context of a complete quality of life state, implies something else, or confuses the respondent. A program could be written to do the analysis iteratively: add that respondent’s data to an existing dataset and output the results, to see whether any odd patterns are emerging in the utility/capability scores. Again, think-aloud and debriefing would help the interviewer understand the reasons for any problems.
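The iterative pilot analysis described above could be sketched as follows. This is a toy illustration only: it assumes each respondent’s attribute-level scores have already been computed elsewhere, and the class and its names are hypothetical rather than an existing package. A real analysis would refit a choice model after each respondent rather than track running means:

```python
class RunningPilot:
    """Accumulate pilot respondents and recompute summary scores after each one."""

    def __init__(self):
        self.totals = {}   # attribute level -> summed score across respondents
        self.counts = {}   # attribute level -> number of respondents seen

    def add_respondent(self, scores):
        """scores: dict mapping attribute level -> that respondent's score.

        Returns the running mean per level, so the interviewer can eyeball
        any emerging odd patterns before the debrief.
        """
        for level, s in scores.items():
            self.totals[level] = self.totals.get(level, 0) + s
            self.counts[level] = self.counts.get(level, 0) + 1
        return {lvl: self.totals[lvl] / self.counts[lvl] for lvl in self.totals}

# Hypothetical usage during a pilot session
pilot = RunningPilot()
print(pilot.add_respondent({"feeling settled": 2, "enjoyment": -1}))
```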

Either or both of these two types of DCEs could be integrated into the second qualitative stage of a project – where the proposed concepts defining health or quality of life are actually put into an attribute structure. They may aid the project team later on, both in reducing the amount of piloting necessary for the full valuation exercise and at a stage where we have had problems: showing validity of the instrument. Referees have frequently been a little awkward over validity. Strictly speaking, we used a different paradigm (a qualitative one) to produce the items (concepts, and then attributes). We do end up publishing (construct) validity papers eventually, but the objection that we didn’t do this at the item development stage sometimes crops up anyway. I would propose doing some work that might count as “classical item development construct validity”: getting individuals to answer the draft instrument and calculating the correlation matrix. We might even do simpler work if the number of respondents is too small: observing whether their answers to two attributes seem to move in tandem. This might alert the team to a problem that otherwise would only become apparent when it is too late and the instrument has been valued: a high correlation between responses to two attributes.

A high correlation means that a lot of the information on health/quality of life can be ascertained from the answer to one attribute/dimension; the second adds little extra information. This essentially means we have “wasted” one attribute position – perhaps a rethink of the qualitative findings might have led to a different conceptualisation.
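As an illustration of the kind of check proposed above, here is a small sketch that computes the Pearson correlation between answers to two draft attributes and flags a suspiciously high value. The attribute names, the responses and the 0.8 threshold are all hypothetical:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length response vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical draft-instrument answers (levels 1-4) from six respondents
answers = {
    "security":   [1, 2, 2, 3, 4, 4],
    "protection": [1, 2, 3, 3, 4, 4],   # moves almost in tandem with security
    "enjoyment":  [4, 1, 3, 2, 1, 3],
}
r = pearson(answers["security"], answers["protection"])
if r > 0.8:  # arbitrary flag threshold for this sketch
    print(f"security/protection r = {r:.2f}: one attribute may be wasted")
```

With a full pilot dataset the same check would simply be run over every pair of attributes to produce the correlation matrix.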

Avoiding this problem potentially means that more profiles (health/QoL states) are observed in real life studies, and potentially increases the sensitivity of the instrument. Who knows, perhaps a specific health condition that otherwise doesn’t seem to be captured very well by the generic health instrument under consideration might have been captured with a different set of attributes? Poor vision has been claimed not to be captured well by generic instruments. Well, perhaps a rethink of the attributes at the generation stage might mean that the effects of poor vision on health are captured by some attribute that isn’t explicitly about vision at all. I don’t claim that such problems would be solved by what I have proposed above, but I think they are worthy of investigation in future work to construct new instruments. Of course, any increase in the number of profiles observed in practice may present its own problems, as David Parkin remarked in a comment on Part One – analysis can become more complicated when, for certain instruments (I’m thinking health here perhaps more than QoL), “a large proportion of real world observations are [already] covered by a small proportion of profiles”. So maybe we don’t need to go overboard on this problem. But for QoL I do think it is important if we are claiming that our finalised instrument is capable of covering all real world observations.

I hope it is clear that the ideas here present opportunities for qualitative researchers (more work, but more papers!) and not threats. I reiterate that I don’t wish to supplant the immensely important work on concept generation, attribute development and level testing that they do. I merely wanted to do some “think aloud” myself regarding how some early DCE and other quantitative work might help qualitative researchers. After all, as we reported in our paper comparing the qualitative methods, many qualitative researchers find the process of attribute generation for DCEs difficult: this process of “narrowing down the data” goes against everything they usually do (expanding the data and drawing upon its richness, often using case studies). Thankfully the team I have worked with for ICECAP-O, ICECAP-A, the CES and ICECAP-SCM has remained pretty constant, allowing the qualitative researchers to draw upon previous experience in performing this difficult task. Finally, I encourage qualitative researchers to, in turn, “encroach upon our turf” and interview the people with the most extreme views in the valuation exercise* – knowing why they have such extreme views about the relative importance of this versus that attribute will tell us even more about people’s views on health and QoL, and give us insights that may prove useful in the next generation of instrument development. Opportunities for all here!

*We can identify them! 🙂

Copyright Terry N Flynn 2015.

This, together with the accompanying blogs, will form a working paper to be submitted to SSRN. Please cite appropriately if referring to these issues in academic papers.