
spring cleaning

New year = digital spring cleaning time! Ugh. No matter how future-proof you try to make your file structure, or how seamlessly you aim to work across PCs etc, it never takes long for reality to change and you realise you need to go through the rigmarole again.

When admin is done I’m back to the project looking at comparing Case 2 BWS estimates with DCE ones. I shall look with “fresh eyes” since I haven’t worked on it since before xmas. (Plus we need to get this rounded off so we can submit and get paid, hehe.)

Then it’s the (long-delayed) big marketing push for TF Choices LTD. I’ve had a good number of proposals and funded projects come my way so far but can’t rest on my laurels…time to make sure a load of marketers and others know what I can do for them, in addition to the academic community I was part of!

I can’t think of anything methodological I want to shout about today (phew, they think)….I’ll continue to post anything big or of key relevance but as there are only so many hours in the day and company stuff must come front and centre in 2017 it’s likely that my comments and posts will be related to things I’m doing at the time (like Case 2 vs DCEs) rather than detailed posts triggered by twitter or citation alerts I get.


BWS neither friend nor foe

This post replies to some requests I have had asking me to respond to a paper concluding that DCEs are better than BWS for health state valuation. To be honest I am loath to respond, for reasons that will become apparent.

First of all, let me clarify one thing that people might not appreciate – I most definitely do not want to “evangelise” for BWS and it is not the solution in quite a few circumstances. (See the papers coming out from the CHU-9D child health valuation study I was involved with for starters – BWS was effectively a waste of resources in the end… “best” choices were all we could use for the tariff.)

I only really pushed BWS strongly in my early days as a postdoc when I wanted to make a name for myself. If you read my papers since 2007 (*all* of them) you’ll see the numerous caveats appear with increasing frequency. And that’s before we even get to the BWS book, where we devote an entire chapter discussing unresolved issues including the REAL weaknesses and research areas for BWS (as opposed to straw men I have been seeing in recent literature).

OK now that’s out of the way, I will lay some other cards on the table, many of which are well-known since I’ve not exactly been quiet about them. I had mental health issues associated with my exit from academia. I’m back on my feet now doing private sector work for very appreciative clients, but that doesn’t mean I want to go back and fight old battles….battles which I erroneously thought us three book authors had “won” by passing muster with the top mathematical psychologists, economists and others in the world during peer review. When you publish a paper in the Journal of Mathematical Psychology (the JHE of that field) illustrating a key feature/potential weakness of a DCE (or specifically Case 2 BWS) back in 2008 you tend to expect that papers published in 2016 would not ignore this and would not do research that showed zero awareness of this issue and as a result made fundamental errors – after all, whilst we know clinical trials take a while to go from proposal to main publication, preference studies do NOT take 8+ years to go through this process. I co-ran a BWS study from conceptualisation to results presentation in 6 days when in Sydney. Go figure.

So that’s an example of my biggest frustration – the standards of literature review have often been appalling. Two or three of my papers (ironically including the JHE one, which includes a whopping error which I myself have repeatedly flagged up and which I corrected in my 2008 BMC paper) seem to get inserted as “the obligatory BWS reference to satisfy referees/editors” and in many cases bear no relation to the point being made by authors. Alarm bells immediately flash when I read an abstract via a citation alert and see those were my references. But it keeps happening. Not good practice, folks.

In fact (and at a recent meeting someone with no connection to me said the same thing) in certain areas of patient outcomes research the industry reviews are considered far better than academic ones – they have to be or get laughed out of court.

Anyway, I have been told that good practice eventually drives out bad. Sorry, if that’s true, the timescale was simply too long for me, which didn’t help my career in academia and raised my blood pressure.

Returning to the issue at hand. I’m not going to go through the paper in question, nor the several others that have appeared in the last couple of years purporting to show limitations of BWS. I have a company to run, caring obligations and I’ve written more than enough for anyone to join the dots here if they do a proper literature review. My final attempt to help out was an SSRN paper. But that’s it – without some give and take from the wider community, my most imaginative BWS work will be for clients who put food on the table and who pay – sometimes quite handsomely – for a method that when properly applied shows amazing predictive ability together with insights into how humans make decisions.

Now, of course, health state valuation is another kettle of fish – no revealed preference data etc. However, Tony, Jordan and I discussed why “context” is key in 2008 (JMP); I expounded on this with reference to QALYs in my two 2010 single authored papers, and published an (underpowered) comparison in the 2013 JoCM paper (which I first presented at the 2011 ICMC conference in Leeds, getting constructive criticism from the top choice modellers on Earth). So this issue is not particularly new.

It’s rather poor that nobody has actually used the right design to compare Case 2 BWS with DCEs for health state valuation…I ended up deciding “if you want something done properly you have to do it yourself” and I am very grateful to the EuroQoL Foundation for funding such a study, which I am currently analysing with collaborators. I don’t really “have a dog in this fight” and if Case 2 proves useful then great, and if not then at least I will know exactly why not…and the reasons will have nothing to do with the “BWS is bad m’kayyyyy” papers published recently. (To be fair, I am sometimes limited in what I can access, with no longer having an academic affiliation so full texts are sometimes unavailable, but when there’s NO mention of attribute importance in the abstract, NOR why efficient designs for Case 2 are problematic my Bayesian estimate is 99.99% probability the paper is fundamentally flawed and couldn’t possibly rule BWS in or out as a viable competitor to a DCE.)

If you’d like to know more:

  • Read the book
  • Read all the articles – my google scholar profile is up to date
  • Get up to speed on the issues in discrete choice design theory – fast. Efficient designs are in many many instances extremely good (and I’ve used them) but you need to know exactly why in a Case 2 context they are inappropriate.

If you still don’t understand, get your institution to contract me to run an exec education course. When I’m not working, I’m not earning, full stop.

I’m now far more pragmatic about the pros and cons of academia and really didn’t want to be the archetypal “I’m leaving social media now” whinger. And I’m not leaving. But I am re-prioritising things. Sorry if this sounds harsh/unhelpful – I didn’t want to write this post and hoped to quietly slip beneath the radar, popping up when I thought something insightful based on one of BWS’s REAL disadvantages or Sen’s work etc was mentioned. But people I respect have asked for guidance. So I am giving what I can, given the 10 minutes of free time I have.

Just trying to end on a positive note – I gave a great exec education course recently. It was a pleasure to engage with people who asked questions that were pertinent to the limitations of BWS and who just wanted to use the right tool for the right job. That’s what I try to do and what we should all aim for. I take my hat off to them all.

Encounter with a GPSI

I recently had a mole removed by a GP with a special interest (GPSI) in dermatology. It was an interesting experience, given that the first ever discrete choice experiment I conducted elicited patient preferences for exactly this type of doctor and specialty.

The study was piggy-backed onto an early (the first?) trial of GPSI care. That trial established equivalence of care with the traditional consultant-led secondary care model (for the large proportion of cases that are routine enough for GPSI care to be appropriate). The DCE, however, showed resistance to GPSI-type care among patients, on average. Now, this was unsurprising: we knew no better and quoted average preferences, which mean nothing usually in DCEs (since you are averaging apples and oranges). Subgroup analyses I did established which patient subgroups were open to GPSI-type care (and when), and those results were all very predictable.

It is the wording we were strongly encouraged to use for the attributes (such as the doctor description etc) that is the subject of this post, particularly in the light of my personal experience of such care “at the sharp end”. We did not use the actual job titles of the doctors: had we done so, we would have given the respondents the choices between “seeing a member of a consultant-led team, which may or may not be the consultant him/herself” versus “seeing a GP who has had (considerable?) special additional training in dermatology”, making it clear that (1) many people don’t see the consultant, contrary to what they believe, and (2) a GPSI is perfectly qualified to deal with their condition and if anything non-routine is found, they are instantly moved to the consultant-led team’s care.

Now, I know why the triallists didn’t like this: patients see “GP” and instantly form (often incorrect) opinions. That was brought home to me when I saw a doctor at the local hospital in Nottingham (actually a private treatment centre subcontracted by the NHS): he never revealed he was a GPSI until we started “talking shop” and suddenly his ID badge was held up in front of me with the exclamation “I was one of the first GPSIs in dermatology appointed!” My referral letter said I would see (consultant) Dr X or a member of his team. Hmmmm. Thankfully I had no preconceptions, and received top notch care – I would certainly see him again if I needed to. (Of course I looked up this GPSI subsequently and it turns out he specialised in surgery first before moving to General Practice to improve conditions for family life, so he was particularly well qualified.) But it did illustrate, albeit anecdotally, that what was really required was a DCE with “labels” (the actual doctor type) to capture the true patient preferences: that would focus minds on the need for a public education campaign to reduce the stigma associated with GPSIs. What we did, although not misleading in terms of describing the doctors, brushed the underlying problem under the carpet. (So we should have run a labelled DCE – we knew no better then but I am using my own experience to illustrate a serious problem here that continues unabated in health. That’s for another day, however.)

The other attribute I would, with the benefit of being an actual patient, change was location of care. The DCE heavily implied that non-hospital care would be a local general practice. Now, of course, if your general practice doesn’t have the facilities to do minor surgery then this may be grossly misleading. Indeed I had to travel further than the local hospital to get to the GPSI’s surgery for my mole removal. As it happens it didn’t matter: distance as the crow flies was not the important factor in my ability to get there. However, it immediately made me slightly annoyed at the guidance I as the DCE lead received when I did the study. The wording we used was, again, “technically correct” in that the choice was between a place of care that was convenient and local versus not, but I’m fairly sure a non-trivial number of our respondents could have made incorrect assumptions about these attribute levels. I know I did, and I ran the DCE!

It made me a bit (more) cynical about the motives of certain parts of academia: I’d already seen via twitter a much heralded result of a trial I know about that, shall we say, could have been improved upon immensely. Furthermore, I had pause for thought recently when I learnt that some members of industry consider academia-led literature reviews and so-called systematic reviews in certain areas of health to be not worth the paper they’re written on. (I can concur on that regarding recent reviews in my own field). In a time that has seen a huge amount of industry-bashing for selective release of information/publication it really does act as a reminder that some areas of academia need to take a good hard look at their own conduct. Plus, just to be fair, I do shout out about the amazing groups I have worked with or continue to work with. I just feel Ben Goldacre and Danny Dorling were bang on the money in their beliefs (informed by different evidence, which was particularly damning) that bad practice by academia and its associated institutions contributes to the general lack of confidence by the public in the “elites” and how “having your own facts”, whilst of course ludicrous, is a perfectly understandable public reaction to elites that no longer seem to uniformly put the public good first.

As usual I shall make the caveat that there are great groups I work with and this isn’t just “academia bashing”. I just offer constructive criticism based on my own experiences (and mistakes) and give examples of the kind of lack of transparency that cleverer people like Ben and Danny have highlighted as barriers to getting academia more support among the general populace.

effects or dummies redux

That old bugbear comes back….are effects codes really superior to dummy variables?


This note revisits the issue of the specification of categorical variables in choice models, in the context of ongoing discussions that one particular normalisation, namely effects coding, is superior to another, namely dummy coding. For an overview of the issue, the reader is referred to Hensher et al. (2015, see pp. 60–69) or Bech and Gyrd-Hansen (2005). We highlight the theoretical equivalence between the dummy and effects coding and show how parameter values from a model based on one normalisation can be transformed (after estimation) to those from a model with a different normalisation. We also highlight issues with the interpretation of effects coding, and put forward a more well-defined version of effects coding.
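To make the post-estimation transformation concrete, here is a minimal Python sketch for a single three-level attribute. The level names and coefficient values are hypothetical, and it assumes the usual conventions: dummy coding fixes the reference level at zero, while effects coding makes the level coefficients sum to zero. It is only an illustration of the equivalence, not the procedure from the note itself.

```python
def dummy_to_effects(dummy):
    """Convert dummy-coded coefficients (reference level = 0.0) to effects coding.

    Subtracting the mean across all levels makes the coefficients sum to zero.
    """
    mean = sum(dummy.values()) / len(dummy)
    return {level: coef - mean for level, coef in dummy.items()}


def effects_to_dummy(effects, reference):
    """Convert effects-coded coefficients back to dummy coding.

    Subtracting the reference level's coefficient puts that level back at zero.
    """
    base = effects[reference]
    return {level: coef - base for level, coef in effects.items()}


# Hypothetical estimates for one attribute; "low" is the reference level.
dummy = {"low": 0.0, "medium": 0.6, "high": 1.5}

effects = dummy_to_effects(dummy)
round_trip = effects_to_dummy(effects, reference="low")

for level in dummy:
    print(level, round(effects[level], 3), round(round_trip[level], 3))
```

The differences between levels are the same under both codings; only the zero point moves, which is why neither normalisation can fit the data better than the other.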

That’s one of the joys and frustrations of DCEs; why you can never rest on your laurels and should really be acknowledging that it is a field in its own right; why you should have a DCE expert on your team for all important projects. Just when you thought something was right, its merits are questioned. Fun fun fun.

first reference to discrete choice in health

Just a short update today.

Via Twitter I learned that Professor Philip Clarke (University of Melbourne) gave a great seminar at the Office of Health Economics. His topic was history of economic evaluation in health generally but there was a particular gem in there of interest to me.

It appears we are all wrong. The first time Thurstone’s method of paired comparisons was proposed as a possible way of valuing health states was in 1970! See page 1041 of: Fanshel S & Bush JW. A health-status index and its application to health-services outcomes. Operations Research, 18(6): 1021-1066.

We stand corrected, thank you.

PS Thurstone did pairs only because the multinomial model wasn’t available then, only probit (normal based) distributions, which don’t have a closed form for 3+ options. So if you want the general (non-health) first reference to the multinomial (conditional) logit, it’s McFadden’s article or, if you’d like the earlier non-economics one, go read and reference Luce and Marley’s books from the 1950s and 1960s. Plus if you want to reference DCEs and why they are better than looking at all pairs – i.e. the addition of experimental design to choice models – it’s Louviere and Hensher’s work in the early 1980s.
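For readers who have not seen what “closed form” buys you, here is a minimal Python sketch of the multinomial (conditional) logit choice probabilities. The utility values are hypothetical; the point is only that the probability of each option in a set of any size is a simple ratio of exponentials, with no integration needed.

```python
import math


def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities for one choice set.

    P(i) = exp(V_i) / sum_j exp(V_j). The maximum utility is subtracted
    first purely for numerical stability; it does not change the result.
    """
    top = max(utilities)
    exps = [math.exp(v - top) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical systematic utilities for a three-option choice set.
probs = mnl_probabilities([1.0, 0.5, -0.2])
print([round(p, 3) for p in probs])

# With only two options the same formula reduces to the binary logit.
pair = mnl_probabilities([1.0, 0.5])
```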

EDIT at 11:40 BST to correct OHE’s name.

stop talking about the “death state” in DCEs

I feel like a broken record here – sorry in advance for those who already knew this.

Another paper has:

(1) Talked about putting “Death” in as a state to anchor DCE estimates to get proper QALY values, (although thankfully they didn’t do it in their study, but even saying it is a possible solution is wrong)

(2) Not done a proper literature review. I, together with Tony Marley (who, together with Duncan Luce, axiomatized random utility theory independently of McFadden), debunked that in 2008, and in 2010 I gave the potential solutions in a paper in Pharmacoeconomics.

Can we move on please? From discussions I get the impression the EuroQoL Group understand this – plus they have funded a group of us to test one of my solutions. But there are other groups out there who aren’t up to speed.

For the Japanese group, I’ll just pose a question to a hypothetical scenario that, I hope, will make clear just why the “death state” thing is wrong.

Suppose you have a group of people who for whatever reason (perhaps religious) never pick “death” in preference to a health state.

QUESTION: What happens when you estimate a conditional logit model to get QALY weights?

If you counter with “there are always people who consider some states worse than death and then you can estimate the model”, I’d suggest you go read Thurstone, Luce & Marley, and then the Louviere/Hensher stuff. A DCE is, technically, a model of THE INDIVIDUAL. You should, in principle, be able to estimate a model for an individual (if you give them enough choices – of course in practice we typically can’t but you should be able to in theory if your model really is a DCE i.e. rooted in random utility theory).
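One way to see what the question is getting at is a deliberately stripped-down Python sketch. It assumes a hypothetical respondent who faces pairs of “death” versus a health state and never picks death, with the health state utility fixed at zero so that only the death parameter is free. The numbers are illustrative, not from any study.

```python
import math


def log_likelihood(v_death, n_choices):
    """Logit log-likelihood when death is never chosen in n_choices pairs.

    With the health state's utility fixed at 0, the probability of picking
    the health state is 1 / (1 + exp(v_death)), so each choice contributes
    -log(1 + exp(v_death)).
    """
    return -n_choices * math.log1p(math.exp(v_death))


# Try ever more negative values for the utility of death.
candidates = [0.0, -2.0, -5.0, -10.0, -20.0]
lls = [log_likelihood(v, n_choices=100) for v in candidates]

for v, ll in zip(candidates, lls):
    print(v, ll)
```

The log-likelihood keeps improving as the death parameter heads towards minus infinity and never reaches a maximum, so there is no finite estimate to anchor on for such a respondent.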

model disclosure

This post regards a twitter post with an interesting poll and discussion initiated by Chris Carswell (editor of Pharmacoeconomics and The Patient) and twitter handle @PECjournal on whether a statement should be added to a paper to the effect that the authors’ model, when requested, was not submitted for peer review.

I abstained, saying I think a statement should be made if it’s a “traditional” decision analytic/similar CEA/CUA but I personally don’t favour it for DCEs.

The two counter-arguments made were that:

  1. Proprietary models go against the spirit of transparency that is increasingly demanded, &
  2. My point that model selection for DCEs is part art mirrors the situation in qualitative research, but qualitative researchers still have to submit discussion guides/full survey.

I do acknowledge both points, but my responses would be as follows:

(1) Proprietary software is routinely used to generate designs and (particularly) to analyse results of economic and other models: we’re getting into the nitty-gritty of the likelihood maximisation routine used (EM algorithm/other etc), starting value routines used internally by the stats program, etc. The ultimate black box is the stuff that does everything for the novice/inexperienced DCE researcher, mentioning no names 😉

Now, that doesn’t make things right, but it does mean that unless the researcher has the full code for everything from DCE design to model selection, or can reference it all for reviewers, I don’t think picking on just the DCE model selection issue is fair.

(2) I have no objections to submitting the design of the survey – when I was a reviewer, most fatal errors were made in the design and I take the view that no DCE can be properly reviewed without access to the design by reviewers. (Another reason why authors might like to rethink if they are going to use “adaptive conjoint” – are they going to provide the design administered to every respondent? Haha, thought not, and if they do, will reviewers check through such a model, which involves programming it in their software? Haha, thought not.) I myself also provide details of the main and secondary analyses I conducted. These can all be reproduced by reviewers, if they want to. The difficulty – and I believe, from my (far more limited, I acknowledge) experience/observation of analysis of qualitative data that it’s the same there – is that value judgments are made: e.g. “have we really reached saturation?” etc. For the reviewer it comes down to “in my experience, do I agree with this?”

And, unfortunately, in my experience in academia, too few peers had sufficient experience – and I mean designing, analysing and interpreting DCEs across multiple fields – to possibly feel comfortable endorsing me when I say “I didn’t use the model dictated by the BIC criterion – or whatever statistical rule you may like – because it routinely gives too many latent classes and I used my experience to choose the best model”. Sorry, yes I sound arrogant, but when any one DCE has literally an infinite number of solutions – a point still ignored or misunderstood by most practitioners – then inevitably experience and gut feelings based on intimate knowledge of your sample, data and survey become paramount.

In short, model selection skills can’t be taught, they must be gained with experience.

And, you are fully entitled to say “well you would say that, you work in industry now”. To which I’d respond, yes, I do have an interest in saying that, but why are academic groups that routinely delay competitor groups’ papers, mis-reference things in order to skew publication metrics and funding likelihood etc not pulled up on their shenanigans? I got a google citation report just today to something – and seeing the authors I would have bet (before reading) 100 GBP with anyone on the planet that the paper of mine that was absolutely crucial to this new publication would not be the citation I got the report for. I would have won the bet, the citation was to something else of mine entirely. I just laugh at these things now, they don’t affect me or my business, but it’s rather sad that they still go on. Particularly in this case when it can contribute to more QALY valuation studies that can’t possibly give the right answer – how is that defensible on equity or efficiency grounds?

So, until basic rules of research – and we’re talking the stuff I was taught in my first PhD supervision like “get the primary source”, not even the more recent transparency stuff – are followed consistently by academics I’m afraid industry is entitled to retort “people in glass houses shouldn’t throw stones”.

happiness redux

There is a piece on happiness at NakedCapitalism.com up today. It is a guest post from VoxEU and unfortunately, though trying to make valid points, falls into the usual holes: the key one is that the data all appear to be Likert-based self-reported happiness scales, which in two major countries (at the very least) have been shown to be deeply misleading (US and Australia). In short, even within these two countries, there are cohort and/or longitudinal effects: the number you state your happiness/life satisfaction to be is heavily dependent upon age (particularly if you are older), independent of (after adjusting for) a huge number of other factors (health, wealth, social empowerment, independence, etc). Moreover this is not “just” the infamous “mid-life dip”: the differences between such measures, and the more comprehensive well-being/quality-of-life ones, are particularly stark in extreme old age and have big implications for retirement age, what resources are needed by the very old etc.

To make comparisons across countries with different cultural backgrounds seems even more hazardous – Likert scales generally were pretty much discredited on such grounds by 2001:

Baumgartner H, Steenkamp J-BEM. Response styles in marketing research: a cross-national investigation. Journal of Marketing Research. 2001;38(2):143-56.

Steenkamp J-BEM, Baumgartner H. Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research. 1998;25(1):78-90.

Figure: Five year age bands showing mean levels (after rescaling) of self-rated happiness versus scored quality of life in Bristol, UK

The above shows that the ICECAP-O measure (based on discrete choice based outcomes of McFadden, coupled with Capabilities Approach of Sen, both winners of the Economics “Nobel”) tracks happiness (after both were rescaled to be on a 0-1 scale) reasonably well until middle age. In old age people report suspiciously high life satisfaction/happiness scores even when they have a whole host of problems in their lives. We captured these in the ICECAP-O (collected from the same people who gave us life satisfaction scores), as well as their individual answers to a huge number of questions about these other factors in life. This has been found in the USA too:

Figure: US life satisfaction

In short, we don’t have a bloody clue what older people are doing when they answer these scales but they sure aren’t doing the same thing as younger people.

I discussed further the contribution of trust toward a broad measure of well-being in a talk I gave years ago when in Sydney: in Australia it is basically the case that a lack of trust of those in the local community has a pretty huge (11%) detrimental effect to your quality of life in Sydney but a much smaller, though still significant (5%) effect elsewhere in Australia.

I wish these Likert-based happiness surveys would cease. They really don’t help the field, when much better alternatives are already in routine use.

states worse than dead

No, this isn’t another moan by yours truly about how the valuation people deal (in)correctly with states worse than death in health economics valuation exercises (phew).

This tweet interested me. There are all sorts of things you could do with a discrete choice experiment (DCE) to measure the trade-offs such patients make. When at UTS, we did a DCE that did two things, one novel and one not so novel. The first was an attitudinal one that found there are three segments among Australian retired people (our sample was around 1100 total) when you got them to tell you what statements about life they related to most and least – Best-Worst Scaling. We did something never done before – feed back to them their own results after that survey that they could print off, bring to their doctor to discuss, use as the starting point for an end-of-life care plan etc: results of this form a chapter in the book referenced. Of course the doctors at the sharp end in ICUs had warned us that thanks to TV programmes the general public has much higher expectations about the success/acceptability of these dramatic interventions than is true in practice, but you could do the same survey with patients. In fact the bare bones of the survey are still live at the link and you can see how you compare with older Aussies.
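As a rough illustration of how “most and least” answers can be turned into a personal summary, here is a minimal Python sketch using simple best-minus-worst counts. The statement labels and responses are hypothetical, and this counting approach is only an assumption about one way to summarise such data; it is not a description of the analysis actually used in the survey.

```python
from collections import Counter


def best_minus_worst(responses, items):
    """Best-minus-worst count for each item across one person's choice sets.

    Each response records which item was picked as 'best' (most) and which
    as 'worst' (least) in a single choice set.
    """
    best = Counter(r["best"] for r in responses)
    worst = Counter(r["worst"] for r in responses)
    return {item: best[item] - worst[item] for item in items}


# Hypothetical statements and one respondent's answers to three choice sets.
items = ["A", "B", "C"]
responses = [
    {"best": "A", "worst": "C"},
    {"best": "A", "worst": "B"},
    {"best": "B", "worst": "C"},
]

scores = best_minus_worst(responses, items)
print(scores)
```

Scores like these can be computed per respondent, which is what makes individual feedback possible rather than only aggregated results.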

The second DCE was (by DCE standards) very very simple, but was done to get a handle on the trade-offs people would make regarding the kinds of interventions in the survey in this Twitter post and unfortunately won’t give you personalised results.

These types of DCEs should become routine. They can be done on touchscreen tablet PCs etc when the patient is waiting to see the doctor, they can give personalised results – not aggregated ones like in the bad old days. People like them, and like to know how they compare with others – the older generation love those surveys comparing them to others just as much as the younger “Facebook generations”. C’mon people, this survey is great and very very informative but we can move forward even further and do it today.

EQ-5D-5L thoughts

Well having a 24 hour sickness bug gave me some opportunities to sleep and think!

Obviously I have collaborators and am not in sole position to make the final call on whether we go ahead with the design Karin and I put together. But I think we may be almost ready to get programming!

I’m excited by this: it draws on various findings from previous projects I have been involved with: the DCE is not highly efficient but it serves its purpose, and, importantly, that of the Case 2 (Profile Case) BWS study. I think we might have had difficulties making this work for the original EQ-5D (3-level version), partly due to issues like the “states that make no sense” but the edited wording for the 5-Level version has helped enormously.

This project certainly won’t provide “the answer” as to whether BWS can or should be used for valuation. However, if it works, (1) I believe it’ll be a major step forward and (2) I hope the EuroQoL group funds follow-up work.

The general thinking is that I don’t think everyone out there can do a “single all-singing, all-dancing” valuation task; splitting it into two or three (I believe) will ultimately tell us more and give more flexibility. After all, lead-time TTOs are used for states worse than death so the precedent of more than one task is there. As I mentioned before, even if what we do “works”, there are inevitably issues the Group would have to discuss regarding the use of different valuation techniques etc, which I won’t pre-empt nor under-estimate.