Category Archives: DCE support

design: a bit behind the ball

I just had a citation alert to this article on design efficiency in DCEs in health. Nowadays I skim citation alerts (at best) as they come so thick and fast (*polishes halo*). However, this one caught my eye being in a BMJ Open Access journal and being on design, a subject currently close to my heart.

The article wasn’t bad. I just wouldn’t say it was good either. Whilst the quantity of references was sufficient, a number of them were frankly irrelevant and should have been replaced by (much) more important ones. When none of the major textbooks that include chapters on design guidance (the ones by Rose/Hensher/Louviere etc) are mentioned, nor a key paper by Rose and Bliemer, you sigh.

Plus, I know I might be mis-remembering this (no longer having institutional access to check, and with the paper copies of key references packed away and inaccessible at the moment), but investigations of factors affecting design efficiency have surely been done already?

But it was the cognitive vs statistical efficiency issue that really got me to sign up in order to make the following comment (which seems, at present, to be in moderation purgatory, though Monday may change things).

Nice investigation but I’m afraid some key non-health references are missing which would have addressed/begun to address some issues you raised. Regarding design guidance, the two seminal textbooks are not referenced, together with Rose and Bliemer’s 2009 paper.

You also appear to have understated the seriousness of the problem if the quest for efficiency leads respondents to use heuristics: your results become BIASED (useless). You say “Using a statistically efficient design may result in a complex DCE, increasing the cognitive burden for respondents and reducing the validity of results. Simplifying designs can improve the consistency of participants’ choices which will help yield lower error variance, lower choice variability, lower choice uncertainty and lower variance heterogeneity” but these are the least of your worries if the functional form of the utility function depends on the design. To their credit, Rose and Bliemer pointed out this possibility back in 2009; it has already been observed in between-subject comparisons, and my co-authors and I published the first within-subject study in health, finding the problem was extremely severe:

Flynn TN, Bilger M, Malhotra C, Finkelstein EA. Are Efficient Designs Used In Discrete Choice Experiments Too Difficult For Some Respondents? A Case Study Eliciting Preferences for End-Of-Life Care. Pharmacoeconomics 2016;34(3):273-284.

The paper was submitted right around the time ours came out in the print version, but I know our e-version was around before then, not to mention the possibility of adding it at the review stage. Which actually leads me to worry about the refereeing process just as much as aspects of the original paper.

Best-Worst Scaling in Voting

My comment on one of the links posted in today’s “Water Cooler” posting at Naked Capitalism (cross-posted to my company blog too). The original link concerned a proposal in the US state of Maine to introduce ranked voting rather than the first-past-the-post (FPTP) system that is ubiquitous in the US and UK. The proposal sounds attractive to people, but…

“Ranking is a double-edged sword – not that I condone the current first past the post (FPTP) system endemic in the US and UK (it’s the worst of all worlds) – but people should first look at what oddballs have ended up in the Federal Senate in Australia. Plus that awful Pauline Hanson may be about to make a comeback there.

Ranking has proven very very difficult to properly axiomatize – i.e. in practice, there are a whole load of assumptions that must hold for the typical “elimination from the bottom” (or any other vote aggregation method) to properly reflect the strength of preference in the population. For instance:
(1) Not everybody ranks in the same way (top-bottom / bottom-top / top, bottom, then middle, or any other of a huge number of methods);
(2) An individual can give you different rankings depending on how you ask him/her to provide the answers (again: rank 1, 2, 3, …; or 9, 8, 7, …; or 1, 9, 2, 8, …);
(3) People have different degrees of certainty at different ranking depths – they are typically far less sure about their middle rankings than their top and bottom choices.

Unfortunately, where academic marketing, psychology and economics studies have been done properly, these kinds of problems have proven to be endemic… furthermore they often matter to the final outcome, which is worrying. It’s why gods of the field of math psych (from Luce and Marley in the 1960s onwards) were very very cautious in condoning ranking as a method.

Statement of conflict of interest: Marley and I are co-authors on the definitive textbook on an alternative method called best-worst scaling….it asks people for their most and least preferred options only. The math is much easier and I’d be very very interested to see what would have happened in both the Rep/Dem primaries if it had been used – generally you subtract the number of “least preferred” votes from the number of “most preferred” – so people like Clinton and Trump with high negatives get into trouble….”

What I didn’t say (since the work is technical) is that Tony Marley has done a lot of work in voting and has published at least one paper extolling BWS as a method of voting.



what can’t be legitimately separated from how

I and co-authors have a chapter in the just-published book “Care at the end of life” edited by Jeff Round. I haven’t had a chance to read most of it yet, but from what I’ve seen so far it’s great.

Chris Sampson has a good chapter on the objects we value when examining the end-of-life trajectory. It’s nicely written and parts of it tie in with my series on “where next for discrete choice valuation” – parts one (which he cites), two and three, but particularly (and too late for the book) part four.

The issue concerns a better separation of what we are valuing from how we value it. I came at it from a slightly different angle from Chris, though I sense we’re trying to get people to address the same question. It’s of increasing importance now the ICECAP instruments are becoming more mainstream. I’m often thought of as “the valuation guy” – yet how we valued Capabilities is intimately tied up with how the measures might (or might not) be used, as well as the concepts behind them. When I became aware that the method we used – Case 2 BWS – would not necessarily have given us the same estimates as those from discrete choice experiments, part of me worried… briefly. But in truth, I honestly think our method is more in tune with the spirit of Sen’s ideas. (Not to mention the fact that we seem to be getting similar estimates, though I have previously explained why this is probably so in this instance.)

I have said quite a bit already in the blogs, but it’s nice to see others also coming at this issue from other directions. Anybody working on developing the Capabilities Approach must remain in close contact with those who are working on valuation methods.

Where next for discrete choice health valuation – part four

Why Case 2 Best-Worst Scaling is NOT the same as a traditional choice experiment

This post is prompted by both a funding submission and a paper. My intention is to explain why Case 2 BWS (the profile case – asking respondents to choose the best and worst attribute levels in a SINGLE profile/state) does NOT necessarily give you the same answers as a discrete choice experiment (asking you to choose ONLY between WHOLE profiles/states). (This might not be clear from the book – CUP 2015).

So, an example. In a case 2 BWS question concerning health you might choose “extreme pain” as the worst attribute level and “no anxiety/depression” as the best attribute level. You considered a SINGLE health state.

In a DCE (or Case 3 BWS) you might have to choose between THAT state and another that had “no pain” and “extreme anxiety/depression” as two of the attribute levels.  All other attribute levels remain the same.

Now common sense suggests that if you preferred no depression in the Case 2 task you would also choose the state with no depression in the DCE task. Unfortunately, common sense might be wrong.


Well it comes down to trade-offs – as economics usually does. Case 2 does the following. It essentially puts ALL attribute levels on an interval or ratio scale – a rating scale. BUT it does it PROPERLY, unlike a traditional rating scale. The positions have known, mathematical properties (choose “this” over “that” x% of the time).
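To make "positions with known, mathematical properties" concrete, here is a minimal sketch with hypothetical numbers: under a logit-type choice model, if a level is chosen over a reference level with probability p, its distance from the reference on the latent scale is log(p / (1 - p)). The levels and probabilities below are illustrative, not from any real study.

```python
import math

# Hypothetical Case 2 data: proportion of times each attribute level was
# chosen over a common reference level ("choose this over that x% of the time").
p_chosen_over_ref = {
    "no anxiety/depression": 0.90,
    "moderate pain": 0.35,
    "extreme pain": 0.05,
}

# Logit transform: scale distance from the reference = log-odds of being chosen.
scale = {lvl: math.log(p / (1 - p)) for lvl, p in p_chosen_over_ref.items()}

for lvl, s in sorted(scale.items(), key=lambda kv: -kv[1]):
    print(f"{lvl:25s} {s:+.2f}")
```

The point is that the resulting numbers are a genuine interval scale: equal differences in score correspond to equal differences in log-odds of being chosen, unlike the arbitrary spacing of a traditional rating scale.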

DCEs (or Case 3 BWS) don’t do exactly that. They estimate the rate at which you would trade off impairment x against an improvement in impairment y. Thus they naturally consider choices BETWEEN whole health states, forcing you to think “that’s bad, really bad, but how bad would this other aspect of life have to be to make me indifferent between the two?”. And the answer might be different.

Now Tony Marley has shown that under mild conditions the estimates obtained from the two should be linearly related. But there is one stronger, underlying caveat – that CONTEXT has not changed.

What is context?

Well I shall use a variation on the example from our 2008 JMP paper. Suppose I want to fly from London to Miami. I might be doing it for a holiday or for work. Now mathematical psychologists (choice modellers) would assume the utility associated with zero/one/two/three stops is fixed. Suppose the utilities are mean-centred (so zero stops is probably positive, one might be, whilst two or three are probably negative – who likes American airports?). The attribute “importance weight” is a multiplicative weight applied to these four attribute LEVEL utilities, depending on the context. So, for business it is probably BIG: you don’t want to (can’t) hang around. For a holiday it may be smaller (you can accept less direct routes, particularly if they are cheaper). In any case the weight “stretches” the level utilities away from zero (business) or “squashes” them towards zero (holiday). It’s a flexible model and potentially tells us a lot about how context affects the benefits we accrue. However, there’s a problem. We can’t estimate the attribute importance weights from a single dataset – in the same way that we can’t estimate the variance scale parameter from a single dataset.
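The multiplicative-weight model above can be sketched in a few lines. The level utilities and weights below are made-up illustrative numbers, not estimates from the JMP paper.

```python
# Mean-centred level utilities for the "number of stops" attribute
# (hypothetical numbers for illustration only).
level_utils = {"0 stops": 1.2, "1 stop": 0.3, "2 stops": -0.5, "3 stops": -1.0}

def apply_context(utils, weight):
    # A weight > 1 "stretches" utilities away from zero (e.g. business);
    # a weight < 1 "squashes" them towards zero (e.g. holiday).
    return {lvl: weight * u for lvl, u in utils.items()}

business = apply_context(level_utils, 1.8)  # assumed business weight
holiday = apply_context(level_utils, 0.6)   # assumed holiday weight
```

Note that the ordering of the levels never changes under this model; only the spread of the utilities does, which is exactly why a single dataset cannot separate the weight from the level utilities.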

So what do we do?

40 years of research, culminating in Marley, Flynn & Louviere (JMP 2008) established that for discrete choice data we have no choice: we MUST estimate the same attribute levels TWICE or more, VARYING THE CONTEXT. Two datapoints for two unknowns – ring any bells? 😉 So do a DCE where you are always having to consider flying for business, then do a SECOND DCE where you are always having to consider flying for a holiday. You then “net out” the common level utilities and you have the (context dependent) attribute importance weights.
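The "netting out" step amounts to taking ratios: the same level utility estimated in two contexts differs only by the context weight, so dividing one estimate by the other identifies the relative weight. A minimal sketch with hypothetical coefficient estimates:

```python
# Estimated coefficients for the same attribute levels from two DCEs,
# one per context (hypothetical numbers). Each is (context weight x level
# utility), so neither factor is identified from one dataset alone.
business_est = {"0 stops": 2.16, "2 stops": -0.90}  # w_business * u(level)
holiday_est = {"0 stops": 0.72, "2 stops": -0.30}   # w_holiday * u(level)

# Netting out the common level utilities: the ratio for any level
# identifies the relative context weight w_business / w_holiday.
relative_weight = {
    lvl: business_est[lvl] / holiday_est[lvl] for lvl in business_est
}
# If the multiplicative model holds, the ratio is the same for every level
# (here 3): two datapoints for two unknowns.
```

Checking that the ratio is (approximately) constant across levels is itself a useful diagnostic of whether the multiplicative-weight model holds.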

So how does this shed light on the Case 2/DCE issue?

Well we have two sets of estimates again: one from a Case 2 task asking us “how bad is each impairment” and another asking us “would you trade this for that”. We can, IF THE DESIGN IS THE SAME, do the same “netting out” to work out the person’s decision rule: “extreme pain is awful but I’d rather not be depressed instead, thank you very much”.

Now this is an example where the ordering of those two impairments changed as a result of context: depression became more important when the individual had to trade. That may or may not be possible or important – that’s an empirical matter. But if we are intending to move from a DCE (multi-state) framework to a Case 2 (single state) framework we should be checking this, where possible. Now for various ICECAP instruments we didn’t, mainly because:

(1) The sample groups were too vulnerable (unable) to do DCEs (older people, children, etc);

(2) We thought the “relative degrees of badness” idea behind Case 2 was more in tune with the ideas of Sen, who developed the Capability Approach (though he has never been keen on quantitative valuation at all, it does have to be said).

Checks, where possible (Flynn et al., JoCM; Potoglou et al., SSM), also seem to confirm that the more general the outcome (e.g. well-being/health) the less likely it is that the two sets of estimates will deviate in relative magnitude (at least), which is nice.

However, I hope I have convinced people that knowing “how bad this impairment is” and “how bad that impairment is” does not necessarily allow you to say how bad this impairment is relative to that impairment. It’s an empirical matter. Whether you care is a normative issue.

But one thing is for CERTAIN: DON’T CONFOUND DESIGN CHANGES WITH CONTEXT CHANGES – in other words keep the same design if you are going to compare the two, if you want to be certain any differences are really due to context and not some unknown mix of design and context. After all, we already know that design affects estimates from both Case 2 and DCE tasks.

EDIT to add copyright note 8 April 2016:

Copyright Terry N Flynn 2016.

This, together with the accompanying blogs, will form a working paper to be submitted to SSRN. Please cite appropriately if referring to these issues in academic papers.


part 4 on where next for DCEs to come soon

Normal service is gradually being renewed after a bunch of stuff slowed me down.

Part four, to be written at the latest over Easter, concerns confusion over the valuation of health/well-being states using Case 2 Best-Worst Scaling compared with “more traditional DCEs”. Basically there are conceptual, and potentially empirical, differences that should be recognised. Depending on what you are valuing, and how much you want representativeness, the differences may not really matter, but if you find differences it does NOT necessarily mean at least one method is wrong. Which, unfortunately, is what one paper out there wrote. So I’ll try to put the record straight ASAP.

highly efficient DCEs can be bad paper published

Sorry, I should have posted this over a week ago – have had a cold and been working on a project.

But the paper on the perils of highly efficient designs in DCEs has been published! Wahay.

Personally I consider this paper to be the second best of my career (after the first BWS one in JHE)…ironic that I have left academia!

In terms of the content, some economists still don’t “get it” but that’s their problem really 😉

wiki and stuff

I have spent almost a day doing work on wikipedia articles.

I did some tidying up and editing of the BWS article, and I made substantive edits to the choice modelling one. In terms of the latter, I have tried not to fall foul of the NPOV crime – “neutral point of view”. I know that there are a bunch of diehards out there in favour of the term “conjoint analysis”. The guy who is perhaps the top environmental economist, one of the top three choice modellers/marketers, and I wrote an article explaining why it really is wrong to call what we do “conjoint analysis” – that is a particular academic technique/model, much as the maxdiff one is.

However, I do recognise that this is a battle we won’t win: too much industry is using those terms. Thus I acknowledged why “choice-based conjoint analysis” is used and attempted to give a full and frank justification for this. Of course I also gave our counter-argument, which relies on the academic case for DCEs and BWS!

Anyway I hope the two articles help newbies. I might edit the conjoint analysis one – I wouldn’t attempt to downplay the enormous contributions by people like Green etc but I would make clear that the move towards choice-based techniques should lead the reader to another page. I would hope that would not be poking a wasps’ nest!

In other news, the efficient design paper is close to online publication – I have amended the proofs – yay! We have also submitted the children’s health paper to a journal – we will see how that goes… it will be one of my last academic contributions to the field.


Finally I saw a piece by a former colleague in Australia which summarised a study to elicit Australians’ preferences for spending a given federal budget. It’s a shame this study was done – I fought tooth and nail with a former Director not to do it as it would embarrass the centre. Asking people what policy goal they would (1) most and (2) least like to receive spending (a Case 1 BWS study) was flawed for the following reasons:

(1) The Federal budget surplus/deficit is an ENDOGENOUS variable, NOT an exogenous one. The automatic stabilisers (unemployment and other safety net benefits) kick in when the economy goes down and the deficit increases naturally…or maybe the extra demand by people who would otherwise starve brings the economy UP. That’s the point – there is no “amount” the federal govt has to spend.

(2) The whole exercise is framed incorrectly. There is not a “pot of money” collected in taxes that the government has to spend. This confuses the household with the sovereign government. It is a nonsense to think that a sovereign government can “save” in a currency it creates with the press of a button (F5 F5 F5 F5). Plus think back to the beginning in a world without currency. Did the government tax to spend? No of course it didn’t. There was no money in circulation. It SPENT so money could enable trade and then the government could TAX in order to achieve its aims, ensure a demand for the currency etc.


A government spends by crediting accounts of the relevant beneficiaries – there is no money backing this. If the govt wants to issue bonds to “cover” the deficit it can do, but there is NO LINK with the deficit – indeed Australia provided the most recent example of this. Under the Howard government the government made a surplus, ergo it should have stopped issuing bonds (IOUs for overspending). It did this. What happened? The financial sector went apeshit and demanded it keep issuing them since they were running out of risk-free assets and assets upon which to price risk on other assets. It really was a wizard of Oz moment where the man behind the curtain was revealed.

So telling people “there is a budget of X million/billion dollars, what are your priorities?” is a misguided question. People will automatically form their own ideas about what is affordable with that budget of X, and indeed think as if there IS a finite budget. There is not. Now OF COURSE the govt can’t spend indefinitely – it’ll cause inflation when all unused factors of production become used. But we are nowhere near that point, as OECD/IMF figures show.

You should tell people “imagine there is no limit to spending; tell me your priorities”, with a list of probable spending for each. You may find that suddenly people choose things they assumed were unaffordable before.

So sorry I4C, you did a dud study. I did try to head this off before, but you have fallen for the fallacy of composition – this goes wayyyyy back to Keynes. You can’t use microeconomics to solve macroeconomic problems. It’s a different discipline and NOT one that we micro people should dabble in.


EDIT: I have been informed that the budget was not divvied up in the DCE, which is good, and potentially makes the study correct. I just hope that, in the preamble, it told respondents that the government budget is not constrained in any way, except when the entire economy is fully employed and there is no “slack in the system” – a situation we haven’t been in since the 1970s!

Where next for discrete choice health valuation – part three


This is the last part of a series of three pieces on the valuation of health/well-being using discrete choice tasks. It should be borne in mind that this covers several stated preference (SP) methods, including discrete choice experiments (DCEs), best-worst scaling (BWS) and ranking studies. I discussed the tension between the needs of the descriptive system and valuation in part one, whilst part two dealt with issues in design.

In this blog entry I wish to return to the descriptive system of generic health/well-being (quality of life) instruments. In some ways it is an extension of the first part: I will describe the construction of the descriptive system. However, I will propose quantitative techniques for use in the construction of the descriptive system itself – something that to date has been left almost entirely to the qualitative members of the teams I have worked in.

I stress at the outset that I do not wish to downplay the importance of the work that qualitative colleagues do and have done for the ICECAP instruments. Garbage in garbage out and you MUST have a good descriptive system if your tariff is going to be any good whatsoever. Indeed I was co-author on a paper that described in detail the advantages and disadvantages of various types of qualitative techniques in constructing instruments for use in DCEs. I merely wish to discuss how quantitative analyses/techniques – and in particular, DCEs – can aid in the interpretation of the qualitative data, and head off some problems we have encountered at the refereeing stage when validating the estimated (and published) tariff. This is actually an extension of something we have already done in the ICECAP supportive care measure (ICECAP-SCM) for end-of-life care: we used DCEs to help us decide on how many levels (four or five) the measure should include. (In fact, the DCE helped us decide on four levels – a view that the qualitative team members had been leaning towards anyway, but were unsure whether their data could robustly justify this.)

So how might DCEs and other quantitative techniques be used in other stages of qualitative work to construct the instrument itself? I have been forming the view, over several years of constructing ICECAP instruments, that DCEs could help test wording, whilst more conventional techniques like checking the correlation matrix could help with minimising the number of “wasted” profiles.

In terms of wording, I have had two thoughts, one that is probably quite novel to much of the health economics community and one which is not. The first concerns the use of Case 1 BWS (the “Object Case”) in checking potential statements (attributes/dimensions). Case 1 was the last of the three cases to be introduced to health care (but the first to get published at all, by Jordan Louviere and Adam Finn in 1992). It is a remarkably easy DCE to run. The choice items are not attribute levels as in Case 2, nor entire profiles as in Case 3, but typically statements or policy goals etc, with no attribute and level structure to them. In this instance we could test which wording of a concept is most salient (and which is least salient) to respondents. Here is a simplistic example to illustrate how a choice task might look – the wider DCE with a number of such choice tasks is intended to be administered to an interviewee from the qualitative research:

Which of the following statements do you think most clearly describes the things you talked about related to “feeling safe” and which do you think describes it least clearly?

Most | Statement | Least
[ ] | I want to feel secure | [ ]
[ ] | I want to feel protected from harm | [ ]
[ ] | I want to feel comfortable all the time | [ ]


Now, in addition to the adjectives used to define safe (which could itself be a statement), the researcher could vary the word “want” (“wish”) and other aspects of the sentence structure. There would probably need to be in the region of four to seven statements to be tested, since a separate Case 1 BWS DCE would be required for each concept. Thus, we are definitely talking about a follow-up interview which might be as long as the initial interview used to elicit the concepts themselves. However, it could be valuable in testing out different words in terms of their salience to respondents, particularly if used in conjunction with think-aloud techniques and a verbal debrief.

You might wonder how the interviewer could know the full DCE results in order to conduct the debrief and obtain an understanding of why certain statements are not salient to the interviewee. Well, Case 1 BWS is nice because if the DCE is administered on a laptop or tablet, there is a program that can bring up the importance scores instantly after the respondent finishes the last choice set. We have used this in an end-of-life care pilot and the older people loved having their results – we also were able to show them how common their views were and how they compared to the wider population – turns out they are not so different from the younger Facebook generation!
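The instant scoring for Case 1 BWS is simple enough to sketch: the usual score is just the number of times an item was picked as best minus the number of times it was picked as worst. The response data below are hypothetical.

```python
from collections import Counter

# Hypothetical responses: each tuple is the (best, worst) statement a
# respondent picked in one Case 1 BWS choice set.
responses = [
    ("secure", "comfortable"),
    ("secure", "protected"),
    ("protected", "comfortable"),
    ("secure", "comfortable"),
]

best = Counter(b for b, _ in responses)
worst = Counter(w for _, w in responses)

# Best-minus-worst score: times chosen best minus times chosen worst.
bw_scores = {s: best[s] - worst[s] for s in set(best) | set(worst)}
```

With this toy data "secure" scores 3 (best three times, never worst) and "comfortable" scores -3, which is the kind of summary that can be displayed to a respondent the moment they finish the last choice set.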

Another use of DCEs will no doubt be more familiar to this audience: testing a pilot of the actual DCE/BWS study. Again, real-time scores could be made available if a Case 2 study (as in all the ICECAP valuation studies to date) is used (though we haven’t programmed a Case 2 to do that yet). This could identify problems an individual has with the instrument, once the draft attributes are actually put together to form a profile. Perhaps the pilot attribute wording of the “safety” concept, once in the context of a complete quality of life state, implies something else, or confuses the respondent. A program could be written to iteratively do analysis – add that respondent’s data to an existing dataset and output the results, to see if there are any emerging weird patterns in the utility/capability scores. Again, doing think-aloud and debriefing would help the interviewer understand the reasons for any problems.

Either or both of these two types of DCEs could be integrated into the second qualitative stage of a project – where the proposed concepts defining health or quality of life are actually put into an attribute structure. They may aid the project team later on, both in reducing the amount of piloting necessary for the full valuation exercise but also at a stage where we have had problems: showing validity of the instrument. Referees have frequently been a little awkward over validity. Strictly speaking, we used a different paradigm (a qualitative one) to produce the items (concepts, and then attributes). We do end up publishing (construct) validity papers eventually anyway, but the objection that we didn’t do this at the item development stage sometimes crops up. I would propose doing some work that might count as “classical item development construct validity”: getting individuals to answer the draft instrument and calculating the correlation matrix. We might even do simpler work if numbers of respondents are too small: observing if their answers to two attributes seem to move in tandem. This might alert the team to a problem that otherwise would only become apparent when it is too late and the instrument has been valued: a high correlation between responses to two attributes.

A high correlation means that a lot of the information on health/quality of life can be ascertained from the answer to one attribute/dimension; the second adds little extra information. This essentially means we have “wasted” one attribute position – perhaps a rethink of the qualitative findings might have led to a different conceptualisation.
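The redundancy check described above is just a pairwise Pearson correlation over draft-instrument responses. A minimal sketch, with made-up attribute names and responses scored 1-4:

```python
import math

def pearson(x, y):
    # Plain Pearson correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical pilot responses: each list is the level (1-4) chosen by
# six respondents on a draft attribute.
responses = {
    "safety":       [4, 3, 2, 4, 1, 3],
    "independence": [4, 3, 2, 4, 1, 3],  # moves in tandem with safety
    "enjoyment":    [2, 4, 1, 3, 4, 2],
}

r = pearson(responses["safety"], responses["independence"])
if abs(r) > 0.9:
    print(f"safety vs independence: r = {r:.2f} – possible wasted attribute")
```

A near-perfect correlation like the one above is the signal that one of the two attributes adds little extra information, and that a rethink of the qualitative findings might be warranted before valuation.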

Avoiding this problem potentially means that more profiles (health/QoL states) are observed in real life studies and potentially increases the sensitivity of the instrument. Who knows, perhaps a particular specific health condition that otherwise doesn’t seem to be captured very well by the generic health instrument under consideration might have been captured with a different set of attributes? Poor vision has been claimed not to be captured well by generic instruments. Well, perhaps a rethink of the attributes at the generation stage might mean that the effects of poor vision on health might be captured by some attribute that isn’t explicitly about vision at all? I don’t claim that such problems would be solved by what I have proposed above, but I think they are worthy of investigation in future work to construct new instruments. Of course, any increase in the number of profiles observed in practice may present its own problems, as David Parkin remarked in a comment to Part One – analysis can become more complicated, when, for certain instruments (I’m thinking health here perhaps more than QoL) “a large proportion of real world observations are [already] covered by a small proportion of profiles”. So maybe we don’t need to go overboard on this problem. But for QoL I do think it is important if we are claiming that our finalised instrument is capable of covering all real world observations.

I hope it is clear that the ideas here present opportunities for qualitative researchers (more work, but more papers!) and not threats. I re-iterate that I don’t wish to supplant the immensely important work on concept generation, attribute development and level testing that they do. I merely wanted to do some “think aloud” myself regarding how some early DCE and other quantitative work might help qualitative researchers. After all, as we reported in our paper comparing the qualitative methods, many qualitative researchers find the process of attribute generation for DCEs difficult: this process of “narrowing down the data” goes against everything they usually do (expand the data and draw upon the richness, often using case studies). Thankfully the team I have worked with for ICECAP-O, ICECAP-A, the CES and ICECAP-SCM has remained pretty constant, allowing the qualitative researchers to draw upon previous experience in performing this difficult task. Finally, I encourage qualitative researchers to, in turn, “encroach upon our turf” and interview people with the most extreme views in the valuation exercise* – knowing why they have such extreme views about the relative importance of this versus that attribute will tell us even more about the views of humans about health and QoL and give us insights that may prove useful in the next generation of instrument development. Opportunities for all here!

*We can identify them! 🙂


Copyright Terry N Flynn 2015.

This, together with the accompanying blogs, will form a working paper to be submitted to SSRN. Please cite appropriately if referring to these issues in academic papers.


Where next for discrete choice health valuation – part two


Part one of this series on the valuation of health (or quality of life) using discrete choice experiments (DCEs) and their variants concentrated on the tension between the size of the descriptive system and the needs of valuation. In particular, it summarised some disappointing findings in a study using best-worst scaling (BWS) to value the CHU-9D instrument (although I hasten to add we did successfully elicit a child population tariff!) Now I wish to re-emphasise that no definitive conclusion can be drawn from this, specifically whether the method, the instrument, or the kids themselves caused the problems. But it does raise issues that should be borne in mind if any ICECAP instrument for child quality of life is produced.

A very timely event this week has allowed me to discuss (in more detail than I would have otherwise done) a second issue that future valuation exercises using discrete choices (DCEs/BWS/ranking) should consider. The issue is the design of the valuation exercise.

The timely event was the acceptance (by Pharmacoeconomics) of a paper I and colleagues wrote on how varying levels of efficiency in a DCE might cause respondents to act differently. The paper is called “Are Efficient Designs Used In Discrete Choice Experiments Too Difficult For Some Respondents? A Case Study Eliciting Preferences for End-Of-Life Care” by T.N. Flynn, Marcel Bilger, Chetna Malhotra and Eric Finkelstein. The paper (two DCE experts thought) is revolutionary because it was a within-subject, not between-subject survey: all respondents answered TWO DCEs, differing in their level of statistical efficiency. Now, why did we do this, and what was the issue in the first place?

The background to this study is as follows. Several years ago, when I worked at CenSoC, UTS, Jordan Louviere called a bunch of us into his office. He had been looking at results from a variety of DCEs and was puzzled by some marked differences in the types of decision rule (utility function) elicited, depending on the study. Traditionally he was used to at most (approximately) 10% of respondents answering on the basis of a single attribute (lexicographically) – most typically “choose the alternative (profile) with the lowest cost”. Suddenly we were seeing rates of 30% or more. Why were such a substantial minority of respondents suddenly deciding they didn’t want to trade across attributes at all, but wanted to use only one? He realised that this increase in rates began around the time CenSoC had begun to use Street & Burgess designs. For those who don’t know, Street and Burgess were two highly respected statisticians/mathematicians working at UTS who had collaborated with Louviere from around the turn of the millennium in order to increase the efficiency of DCE designs. Higher efficiency means a lower required sample size – precision around utility parameter estimates is improved. It also offered Louviere the tantalising possibility of estimating individual-level utility functions, rather than the sample- or subgroup-level ones that DCEs could previously manage. (Individual-level “utility” functions had been around in the “conjoint” literature for a while, but these relied on atheoretical methods like rating scales.)

Street and Burgess had begun to provide CenSoC with designs whose efficiency was 100% (or close to it), rather than 30–70%. We loved them and used them practically all the time. In parallel, John Rose at the Institute of Transport and Logistics Studies at Sydney University had begun utilising highly efficient designs – though of a different sort. However, what efficient designs have in common – and really what contributes heavily to their efficiency – is a lack of level overlap. This means that if the respondent is presented with a pair of options, each described by five attributes, few – and in many cases none – of those attributes will have the same level in both options. Thus, the respondent has to keep the differences in ALL FIVE ATTRIBUTES in mind at once when making a choice. Now, this might be cognitively difficult. Indeed John Rose, to his immense credit, made abundantly clear in an early published paper that his designs, although STATISTICALLY EFFICIENT, might not be “COGNITIVELY EFFICIENT”: people might find them difficult (pushing up their error variance) or, even worse, use a simplifying heuristic (such as “choose the cheapest option”) in order to get through the DCE. (Shame on us CenSoCers for not reading that paper more closely.) Clearly in the latter case you are getting biased estimates – not only are your parameter estimates biased (in an unknown direction), but the functional form of the utility function for such respondents is wrong. John merely hypothesised this problem – he had no empirical data to test it – and recommended that people go and collect data. For many years they didn’t.
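To make “level overlap” concrete, here is a minimal sketch (attribute names and levels are invented for illustration, not taken from any of the studies mentioned): it simply counts how many attributes take the same level in both options of a choice set. Zero overlap is typical of highly efficient designs; some overlap reduces the number of differences a respondent must weigh.

```python
def level_overlap(option_a, option_b):
    """Number of attributes taking the same level in both options."""
    return sum(1 for attr in option_a if option_a[attr] == option_b[attr])

# A zero-overlap pair, typical of highly efficient designs: the respondent
# must weigh differences on all five attributes at once.
efficient_pair = (
    {"cost": "low", "pain": "severe", "mobility": "full", "wait": "short", "support": "none"},
    {"cost": "high", "pain": "none", "mobility": "limited", "wait": "long", "support": "daily"},
)

# A pair with some overlap: two attributes are tied, so only three
# differences need to be compared (arguably cognitively simpler).
overlapping_pair = (
    {"cost": "low", "pain": "severe", "mobility": "full", "wait": "short", "support": "none"},
    {"cost": "low", "pain": "none", "mobility": "full", "wait": "long", "support": "daily"},
)

print(level_overlap(*efficient_pair))    # 0
print(level_overlap(*overlapping_pair))  # 2
```

Of course, real design efficiency is a property of the whole design (its information matrix), not of one choice set; this just shows the feature of individual choice sets that drives both the statistical gain and the cognitive burden.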

Hence we went on our merry way, utilising S&B designs, until Louviere spotted the problem and the potential reason for it. The problem was all the surveys he looked at utilised ONE DCE – so there is ONE level of efficiency – so he had only between-survey data and couldn’t be certain it was the efficiency that was driving the changes in respondent decision rule: perhaps the surveys with these high rates of uni-attribute decision-making were done in areas where people GENUINELY chose on the basis of a single attribute?

I chatted to him and we realised that a survey I was designing offered an opportunity to run a within-subject choice experiment. Specifically, if my Singapore-based collaborators agreed, I could administer TWO DCEs to all respondents. I am not going to tell all about how we did this exactly, but cutting to the chase: 60% (!) of respondents answered on the basis of a single attribute in a S&B design, yet one third of these (20% overall) then traded across attributes in a much less efficient design that exhibited some level overlap (making it – arguably – cognitively simpler). Finally, we had within-subject evidence that PEOPLE INTERACT WITH THE DESIGN. Which, of course, has serious implications for generalisability, if this proves to be a common problem.

Why is this an issue for future valuation exercises? Well, I have seen presentations from researchers who used highly efficient designs in DCEs to get a tariff for health instruments. Essentially the choices on offer are of the time trade-off (TTO) type, where both quality of life and length of life differ. Now although the TTO is (probably) easier than the Standard Gamble, it is still a hard thing to get your head around if you have any cognitive impairment or are in a vulnerable group. So we probably don’t want to make things even more difficult than they already are.

This, of course, creates headaches for researchers: if we reduce efficiency to make the task easier, then required sample sizes go up, and we may have less ability to identify heterogeneity or estimate individual-level models. But, as usual, I believe we are in a world of second best, so compromises may have to be made. One would be to cut the length-of-life attribute from the DCE altogether and use a TTO to rescale the health values estimated from an easier task like BWS Case 2 – as I advocated a number of years ago as a “second best” valuation option. Not ideal, but it would do the job. Other ideas include helping respondents become familiar with the options on offer through the use of other choice tasks (again, Case 2), which we have done in a recent study. In any case, if the design issue proves common – and my gut feeling, given the complexity of decision-making in health, is that it will – we will need to be imaginative with our designs.
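For readers unfamiliar with the “second best” rescaling idea just described, a hedged sketch of one way it can work (my reading, with invented numbers – not the procedure from any specific paper): value the health states with an easier BWS Case 2 task, then use a single TTO exercise to anchor that latent scale onto the conventional 0 (dead) to 1 (full health) scale.

```python
# BWS Case 2 values on a latent 0-1 scale (invented numbers):
# worst state = 0, full health = 1.
bws_value = {"full_health": 1.0, "state_A": 0.6, "worst_state": 0.0}

# TTO-elicited value of the worst state on the dead=0/full health=1 scale.
tto_worst = 0.20

def anchor(v, tto_anchor):
    """Linear rescale: BWS 1.0 -> 1.0, BWS 0.0 -> tto_anchor."""
    return tto_anchor + v * (1.0 - tto_anchor)

qaly_value = {state: anchor(v, tto_worst) for state, v in bws_value.items()}
print(round(qaly_value["state_A"], 2))  # 0.68
```

The point is that only ONE (or a few) TTO tasks are needed to pin down the scale; the relative values of all the states come from the cognitively easier BWS task.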

A final issue that broadly comes under the design “umbrella” concerns sampling. One of the chapters in the BWS book utilised sampling criteria that deliberately avoided making the sample representative of the wider population. Why would we do that? Well, when you have a limited dependent variable model with discrete outcomes (rather than the continuous outcomes of TTO/SG/VAS), characterising the heterogeneity correctly becomes absolutely crucial: if there is heteroscedasticity, the estimates won’t simply be inefficient, but BIASED. BIG problem.

If, say, depressed people have different health preferences and choice consistency to non-depressed people, but you don’t have enough depressed people in your sample to spot this and mix them in with the others in estimation, you have the WRONG POPULATION TARIFF. So (and I have said this before in published papers), EVEN if you want a population tariff, to work within the traditional extra-welfarist paradigm, you still have to get the heterogeneity right – you probably must OVERSAMPLE those groups you suspect of having different preferences. Then, when you have estimated the tariffs for the different groups, you must RE-WEIGHT to reflect the population distribution. THAT is the way to get the true population tariff.

Of course, if people in various different health states do not differ from the “average member of the population” in their health preferences, the problem goes away. The problem, as the chapter in the book shows, is that (at least for well-being) people with impairments DO have different preferences: those impaired on attribute x desire to improve on attribute x, whilst those impaired on attribute y switch to wanting attribute z to compensate. The picture is extremely complex. So you should be using what we call “quota sampling” – making sure you have enough people in various key impaired states to estimate models for those subgroups. So survey design is a lot more complicated when you ditch TTO/SG/VAS.
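The oversample-then-re-weight logic above is simple arithmetic, so a toy sketch may help (all numbers invented): estimate tariffs separately for the subgroups you quota-sampled, then average them using the POPULATION shares, not the sample shares.

```python
# Value of some health state estimated separately per subgroup, from a
# quota sample that deliberately oversampled depressed respondents so
# their model could be estimated reliably (invented numbers).
subgroup_tariff = {
    "depressed": 0.45,
    "not_depressed": 0.70,
}

# Known shares of each subgroup in the wider population.
population_share = {
    "depressed": 0.10,
    "not_depressed": 0.90,
}

# The population tariff is the population-weighted average of the
# subgroup tariffs, NOT the raw (quota) sample average.
population_tariff = sum(
    subgroup_tariff[g] * population_share[g] for g in subgroup_tariff
)
print(round(population_tariff, 3))  # 0.675
```

Pooling the quota sample naively would instead weight the depressed group by its (inflated) sample share – exactly the “wrong population tariff” the text warns about.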

I don’t mean to sound glass-half-empty regarding design. Leonie Burgess, when presented with the implications of her designs, was fascinated and saw it as an opportunity (to change and improve the models) rather than a problem. I see it this way too. Things will get interesting (again) in DCEs in the coming years as we find out what we need to do in the design field to ensure we get unbiased tariffs for use in decision-making.

Although the third and final blog in this series (to appear next week) may seem superficially similar to the first one – I will discuss the size of the descriptive system – I will write in more detail about the process of constructing the instrument, both qualitative and quantitative*, and offer recommendations that may help alleviate the tension mentioned in the first blog.

*Discussing in more detail some constructive comments I had from Professor David Parkin.
