Biomarker Design: Lessons from Bayes' Theorem

In my last article on "The Digital Biologist", I gave a brief and simple introduction to Bayes' Theorem, using cancer biomarkers as an example of one of the many ways in which the theorem can be applied to the evaluation of data and evidence in life science R&D. The power of the Bayesian approach was, I hope, evident in our analysis of the CA-125 biomarker for ovarian cancer. In this follow-up, I want to round out that discussion by looking in a little more detail at the practical, actionable insights that Bayesian analysis can bring to the design of biomarkers.

Those of us in the field of computational biology are often accused of generating models, simulations and algorithms that, while pretty or cool, are of little or no practical help with real-world research problems. The sting of this accusation comes at least in part from the fact that, all too often, it is actually true :-( Bayesian analysis, by contrast, can be a really useful and practical computational tool in life science R&D, as I hope this brief discussion will show. The CA-125 analysis from the first part of this discussion holds some valuable lessons for biomarker design.

Let's suppose that we are a company determined to develop a more reliable biomarker than CA-125 for the early detection of ovarian cancer. One direction we might pursue is to identify a biomarker that detects the disease in a higher proportion of actual sufferers, i.e. a biomarker with a better true positive rate. We saw in the previous article that CA-125 only detects the disease in about 50% of sufferers with stage I ovarian cancer and in about 80% of sufferers at stage II and beyond. One of the dilemmas faced by physicians working in oncology is that biomarkers like CA-125 can be poorly predictive of the disease in its early stages, when the prognosis and the options for treatment are better. It's disheartening for both the patient and the physician to be able to get a reliable diagnosis only when the disease has already progressed to the point at which there are fewer good options for treatment.

I have previously used an analogy from the behavioral sciences to describe this situation: "broken glass and blood on the streets are the 'markers' of a riot already in progress, but what you really need for successful intervention are the early signs of unrest in the crowd before any real damage is done".

So imagine that our hypothetical biomarker company has made a heavy R&D investment in identifying a biomarker with a better rate of true positives. If our new biomarker has a true positive rate of 95% (a substantial improvement on our previous value of about 80%) and the same false positive rate of roughly 4% as before, how much better off are we?

If we plug the numbers into our Bayesian model, the answer is "not much".

The chances of a patient actually having ovarian cancer given a positive test result with our new biomarker are still only about 1 in 4. In fact, even if we were to identify a biomarker with a 99% true positive rate, we could still only declare a roughly 1 in 4 chance of disease given a positive test result.
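For readers who want to check these figures, here is a minimal sketch of the calculation in Python. It assumes an incidence of exactly 1 in 72 (the figure quoted later in this article) and a false positive rate of exactly 4%; the function and variable names are my own, chosen for illustration. With these rounded inputs, the probability of disease given a positive test comes out at roughly 22%, 25% and 26% for true positive rates of 80%, 95% and 99%.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """Return P(disease | positive test) using Bayes' theorem."""
    # P(positive and diseased)
    true_positives = sensitivity * prevalence
    # P(positive and healthy)
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Assumed inputs: rounded values taken from the article's discussion.
PREVALENCE = 1 / 72
FALSE_POSITIVE_RATE = 0.04

for sensitivity in (0.80, 0.95, 0.99):
    ppv = positive_predictive_value(sensitivity, FALSE_POSITIVE_RATE, PREVALENCE)
    print(f"true positive rate {sensitivity:.0%}: "
          f"P(disease | positive) = {ppv:.1%}")
```

Pushing the true positive rate all the way from 80% to 99% moves the answer by only about four percentage points.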

What if instead of pursuing a better true positive hit rate, our company had invested in reducing the false positive test rate?

Without altering the true positive rate of about 80%, reducing the biomarker's false positive rate from about 4% to 2% increases the chance of the patient actually having the disease, given a positive test result, to better than 1 in 3. If our hypothetical company can get the false positive rate down to 1%, there is a better than even chance that a positively-testing patient actually has the disease. Getting the false positive rate down to 0.1% (approximately 40 times lower than the actual false positive rate for CA-125) means that the patient is very likely to have the disease given a positive test result, with a less than 1 in 10 chance that the positive result is a false one.
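One way to see why the false positive rate has so much leverage is the odds form of Bayes' theorem: the odds of disease after a positive test are the prior odds multiplied by the positive likelihood ratio (true positive rate divided by false positive rate). The sketch below is illustrative only; it assumes an incidence of exactly 1 in 72 and a true positive rate of exactly 80%, and the names are my own.

```python
def posterior_probability(prior_probability, likelihood_ratio):
    """Update a prior probability with a likelihood ratio (odds form of Bayes)."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    # Convert odds back to a probability.
    return posterior_odds / (1.0 + posterior_odds)


# Assumed inputs: rounded values taken from the article's discussion.
PRIOR = 1 / 72
SENSITIVITY = 0.80

for false_positive_rate in (0.04, 0.02, 0.01, 0.001):
    lr_positive = SENSITIVITY / false_positive_rate
    probability = posterior_probability(PRIOR, lr_positive)
    print(f"false positive rate {false_positive_rate:.1%}: "
          f"likelihood ratio {lr_positive:.0f}, "
          f"P(disease | positive) = {probability:.1%}")
```

Halving the false positive rate doubles the likelihood ratio, whereas raising the true positive rate from 80% to 99% multiplies it by less than 1.25.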

The Bayesian model clearly tells us that, in the case of ovarian cancer, our hypothetical company is much better off investing its R&D dollars in the pursuit of lower false positive rates rather than higher true positive rates. Even a 99% true positive rate barely shifts the probabilities associated with a positive test result, whereas getting the false positive rate down to 1% improves the probability of a true diagnosis from less than 1 in 4 to better than even. Even this scenario, however, is far from ideal.

If you look at the actual numbers in the model with regard to the populations of tested patients with and without the disease, there is another valuable lesson to be learned, and it is one that illuminates the reason why improving the true positive test rate while ignoring the false positive test rate is what my countrymen would refer to as "a hiding to nothing".

It is the overwhelmingly larger population of healthy patients, compared with those who have the disease, that is skewing the probability numbers against us, and the lower the incidence of the disease, the worse this problem will be.
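Counting heads makes the imbalance concrete. The sketch below assumes a hypothetical cohort of 72,000 tested women, an incidence of 1 in 72, a true positive rate of 80% and a false positive rate of 4%; the cohort size and all names are my own choices for illustration.

```python
def expected_test_counts(cohort_size, prevalence, sensitivity, false_positive_rate):
    """Return the expected number of people in each group of a tested cohort."""
    diseased = cohort_size * prevalence
    healthy = cohort_size - diseased
    return {
        "diseased": diseased,
        "healthy": healthy,
        "true_positives": diseased * sensitivity,
        "false_positives": healthy * false_positive_rate,
    }


# Hypothetical cohort; the rates are rounded values from the article.
counts = expected_test_counts(72_000, 1 / 72, 0.80, 0.04)
all_positives = counts["true_positives"] + counts["false_positives"]

for group, expected in counts.items():
    print(f"{group:>15}: {expected:,.0f}")
print(f"share of positives with disease: "
      f"{counts['true_positives'] / all_positives:.1%}")
```

Under these assumptions, about 800 true positives are swamped by about 2,840 false positives, simply because there are 71 healthy women for every one with the disease.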

If ovarian cancer had a higher incidence of, say, 1 in 10 women instead of 1 in 72 as is actually the case, a positive test result with CA-125 would correspond to an almost 70% probability of the patient actually having the disease. By contrast, if the incidence of ovarian cancer were 1 in 1000 women, a positive test result with CA-125 would correspond to less than 1 chance in 50 of the patient actually having the disease.
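The same Bayesian calculation can be swept across incidence values. This is a sketch under the assumption that CA-125 has a true positive rate of exactly 80% and a false positive rate of exactly 4%; the names are mine, for illustration.

```python
def probability_of_disease_given_positive(prevalence, sensitivity=0.80,
                                          false_positive_rate=0.04):
    """Return P(disease | positive test); defaults are assumed CA-125 values."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


for one_in in (10, 72, 1000):
    probability = probability_of_disease_given_positive(1 / one_in)
    print(f"incidence 1 in {one_in}: P(disease | positive) = {probability:.1%}")
```

The test itself is identical in all three cases; only the makeup of the tested population changes.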

The lower the incidence of the disease you want to diagnose, the correspondingly lower your false positive test rate needs to be.
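This rule of thumb can be made quantitative by rearranging Bayes' theorem to ask how low the false positive rate must be to reach a chosen probability of disease given a positive test. The sketch below assumes a true positive rate of 80% and a target of 50% ("better than even"); the function name and the target are my own illustrative choices, and the prevalence and target are assumed to lie strictly between 0 and 1.

```python
def max_false_positive_rate(prevalence, sensitivity, target_probability):
    """Largest false positive rate for which P(disease | positive) >= target.

    Derived by solving
        target = s * p / (s * p + f * (1 - p))
    for f, where s is the sensitivity and p the prevalence.
    """
    limit = (sensitivity * prevalence * (1.0 - target_probability)
             / (target_probability * (1.0 - prevalence)))
    # A rate cannot exceed 1; above that, any test meets the target.
    return min(1.0, limit)


SENSITIVITY = 0.80   # assumed true positive rate
TARGET = 0.50        # assumed target: an even chance of disease

for one_in in (10, 72, 1000):
    limit = max_false_positive_rate(1 / one_in, SENSITIVITY, TARGET)
    print(f"incidence 1 in {one_in}: false positive rate must be "
          f"at most {limit:.2%}")
```

For small incidences the allowed false positive rate shrinks almost in direct proportion to the incidence.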

Imagine, for example, the exigencies that a rare cancer like adrenocortical carcinoma, which only affects 1 or 2 people in a million, imposes on the development of any kind of diagnostic biomarker for that disease. For some rare diseases that have a genetic origin (Type II Glycogen Storage Disease, for example), there do exist definitive genetic tests that are essentially unequivocal insofar as they have a false positive rate that is effectively zero.

The Bayesian model presented here is an extremely simple but excellent example of the way in which models can provide intellectual frameworks with which data can be organized and reasoned about. It is this author's opinion that the pharmaceutical and biotechnology industries could actually benefit enormously from a shift in their current emphasis on data, with more attention being paid to the kind of models that have the potential to explain these data, to synthesize useful knowledge from them, and to drive effective decision making based upon the underlying science.

 © The Digital Biologist | All Rights Reserved