
Monday, 5 January 2015

Cancer risk - an analysis


My previous post discussed this paper, and its claim that two thirds of cancer types are largely unaffected by environmental or hereditary carcinogenic factors.  While I'm unimpressed by the paper, the idea behind it is interesting, so here's my analysis of its data.

The hypothesis is that "many genomic changes occur simply by chance during DNA replication rather than as a result of carcinogenic factors.  Since the endogenous mutation rate of all human cell types appears to be nearly identical, this concept predicts that there should be a strong, quantitative correlation between the lifetime number of divisions among a particular class of cells within each organ (stem cells) and the lifetime risk of cancer arising in that organ."

So let's suppose that each stem cell division gives rise to cancer with a small probability p.  Then if there are n lifetime divisions, the probability that none of them leads to cancer is (1-p)^n, so the lifetime risk of cancer, R, is 1 - (1-p)^n.  We can rearrange that to find an expression for p: ln(1-p) = ln(1-R)/n.  For very small p, ln(1-p) ≈ -p, so p = -ln(1-R)/n.  If we plot ln(1-R) against n we should expect to find that for all the organs where carcinogenic factors are absent the values fall on the same straight line through the origin.
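As a sanity check on that algebra, here's a small Python sketch that backs out p from R and n both exactly and with the small-p approximation.  The numbers are purely illustrative, not taken from the paper's dataset:

```python
import math

def implied_p_exact(R, n):
    """Solve R = 1 - (1 - p)**n exactly for the per-division probability p."""
    return 1 - (1 - R) ** (1 / n)

def implied_p_approx(R, n):
    """Small-p approximation: ln(1-p) ~ -p gives p = -ln(1-R)/n."""
    return -math.log(1 - R) / n

# Illustrative values only (not the paper's data):
R, n = 0.05, 1e10
print(implied_p_exact(R, n))
print(implied_p_approx(R, n))  # nearly identical when p is tiny
```

For realistic lifetime risks and billions of divisions, p is so small that the exact and approximate forms agree to many significant figures.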

However, the values of n range through several orders of magnitude, so we can't create this plot unless we're willing to make all the rare cancers invisibly close to the origin.  Instead, let's take logs again, giving log(p) = log(-ln(1-R)) - log(n).  So on a graph of log(-ln(1-R)) against log(n), all the cancers satisfying our hypothesis should fall on a straight line with slope one crossing the y axis at log(p).  (I've switched to base-10 logarithms for this step, to make the powers of ten easier to follow.)
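In code, the coordinates for that plot (again with made-up numbers) come out like this:

```python
import math

def loglog_point(R, n):
    """Return (log10(n), log10(-ln(1-R))).  Under the hypothesis the
    points lie on the line y = x + log10(p)."""
    return math.log10(n), math.log10(-math.log(1 - R))

x, y = loglog_point(0.05, 1e10)  # illustrative values only
print(10 ** (y - x))  # recovers p if the model holds
```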



Here's the graph, which looks not unlike the one in the paper.  The correlation between the x and y data series is 0.787, again not unlike in the paper.  But the slope of a line through the points is not unity, nor is there a subset of points at the bottom of the envelope of points for which the slope is unity.
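The point that a high correlation can coexist with a slope well below one is easy to demonstrate.  This sketch uses synthetic data, not the paper's values, generated with a true slope of 0.7:

```python
import math, random

random.seed(0)
# Synthetic log-log data for 31 "cancer types" with a true slope of 0.7:
log_n = [random.uniform(6, 12) for _ in range(31)]
log_y = [0.7 * x - 12 + random.gauss(0, 0.8) for x in log_n]

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

print(pearson(log_n, log_y))    # comfortably high correlation...
print(ols_slope(log_n, log_y))  # ...yet the fitted slope is well below one
```

A correlation around 0.8 tells you the points line up; it says nothing about whether the line has the slope the theory requires.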

(I've arbitrarily given FAP colorectal a cancer risk of one millionth less than one, because the method doesn't allow a risk of exactly one.  Its point could be moved vertically by choosing a different number.)

To explore further how well the data fit the model, I've backed out implied values of p for each cancer type.



Here's the problem.  If the data matched the theory, there would be a group of cancer types at the left end of the chart with similar implied probabilities.  It seems in particular that the risk of small-intestine adenocarcinoma is anomalously low.

[A commentator points out that there is a group of cancer types near the left end of the chart which do have similar implied probabilities (the same eight cancers lie roughly in a straight line in the scatter plot).  But the theory in the paper is that there's a background rate of cancer in any tissue type, depending only on the number of stem cell divisions, because "the endogenous mutation rate of all human cell types appears to be nearly identical".  This theory can't be casually modified to allow for a background rate of cancer in all tissue types except for in the small intestine.  (Oncologists are of course aware that small-bowel cancers are strangely rare.)]

Let's try an alternative theory: that for every tissue type, some fraction of stem cell divisions, call it α, are affected by environmental or hereditary influences in a way which gives them a probability, call it q, of causing cancer.  q is the same for all tissue types.  The remaining divisions carry negligible risk by comparison.  Somewhat arbitrarily, we'll assume α is one for the cancer with the highest implied probability in our previous analysis: that is, q is equal to the p implied for Gallbladder non-papillary adenocarcinoma.  We can now back out a value of α for each cancer.
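Backing out α is mechanical once q is fixed.  A sketch with invented (risk, divisions) pairs — these are not the paper's figures, just stand-ins to show the calculation:

```python
import math

def implied_p(R, n):
    """Per-division probability implied by R = 1 - (1-p)^n, small-p form."""
    return -math.log(1 - R) / n

# Invented (lifetime risk, lifetime divisions) pairs, NOT the paper's data:
tissues = {
    "highest-p tissue": (0.003, 8e7),
    "tissue B": (0.01, 1e10),
    "tissue C": (0.04, 1e11),
}
p = {t: implied_p(R, n) for t, (R, n) in tissues.items()}
q = max(p.values())               # alpha = 1 for the highest implied p
alpha = {t: pi / q for t, pi in p.items()}
print(alpha)
```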



Well, it's a simplistic theory, but it does have the advantage over our previous model that it fits the data.

It seems to me that picking out gallbladder cancer as high-alpha is a plus for this model, because that cancer has a peculiar geographic spread which can only be due to environmental or hereditary factors.

And I've been mischievous.  In this theory, despite the correlation in the input data between stem cell divisions and cancer risk, every cancer is caused by environmental or hereditary factors.

Saturday, 3 January 2015

Science by press release

Yesterday's Times has a front page story "Two thirds of cancer cases are the result of bad luck rather than poor lifestyle choices...". (paywall)

That doesn't match my preconceptions, so I looked for the story online.    The Independent and the Telegraph agree.  So does the Mail.  And the Express. And the Mirror.

Reuters' headline agrees, but its story suggests something a bit different - that two thirds of an arbitrary selection of cancer types occur mainly at random.

The BBC speaks unambiguously of "most cancer types" and so does The Guardian.

The press release which must have given rise to this story features the phrase "two thirds of adult cancer incidence across tissues can be explained primarily by 'bad luck'".  I can't make much sense of "cancer incidence across tissues", so I can't blame the journalists for stumbling over it likewise.  But the reporters who got the story right must have managed to scan down to the paragraph where the press release explains that the researchers "found that 22 cancer types could be largely explained by the “bad luck” factor of random DNA mutations during cell division. The other nine cancer types had incidences higher than predicted by "bad luck" and were presumably due to a combination of bad luck plus environmental or inherited factors."

I emphasize that "two thirds of cancer types" is not at all the same as "two thirds of cancer cases".  Two rare cancers apparently unrelated to environmental factors will count for far fewer cases than one common cancer in the other category.

So what of the paper behind the press release?  Here's the abstract, with a paywalled link to the whole paper.  Or you can 'preview' the paper for free here, to the extent your conscience permits.  Supplementary data and methodology descriptions are here.

The hypothesis behind the paper is that cancer is to a large extent caused by errors arising during stem cell division, at a rate which is independent of the tissue type involved.  The researchers therefore obtain estimates of the lifetime number of stem cell divisions in various tissue types, and plot that against lifetime cancer incidence, obtaining a significant-looking scatter plot (Figure 1 in the published paper).  So far so good.

But they've used a log-log plot, necessary to cover the orders-of-magnitude variation in the data.  Now, if you think, as the researchers apparently do, that cancer risk is proportional to the number of stem cell divisions, it follows that the slope of a log-log plot should be unity.  It isn't; by eye it's more like two thirds.  The researchers, busy calculating a linear correlation between the log values, seem not to have noticed this surprising result.  Instead they square the correlation to get an R² of 65%, which may (it's not clear) be the source of the "two-thirds of cancer types" claim.

If so, that claim is based on a total failure of comprehension of what correlation means.  Imagine a hypothetical world in which cancer occurs during stem cell division with some significant probability only if a given environmental factor is present, and that environmental factor is present equally in all tissue types.  In this world cancer incidence across tissue types is perfectly correlated with the number of stem cell divisions, but nevertheless all cancer is caused by the environmental factor.

It's simply impossible to say anything about the importance of environmental factors in a statistical analysis without including those factors as an input to the analysis.

However, the press release also features the paragraph I quoted about 22 out of 31 cancer types being largely explained by bad luck.  Perhaps that's what they mean by two thirds.  To get this number, they devised an Extra Risk Score - ERS for short.  Then they used AI methods to divide cancer types into two types based on the ERS values.  So what's the ERS?  The Supplement describes it as "the (negative value of the) area of the rectangle formed in the upper-left quadrant of Fig. 1 by the two coordinates (in logarithmic scale) of a data point as its sides." That is, it's the product of the {base-10 logarithm of stem cell divisions} and the {base-10 logarithm of lifetime cancer risk}.   (The cancer risk logarithm is negative (or zero) since lifetime risk is less than (or equal to) one.)
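For concreteness, the ERS as described — the product of the two base-10 logarithms — looks like this, with an illustrative data point that is not from the paper:

```python
import math

def ers(lifetime_risk, divisions):
    """Extra Risk Score as the Supplement describes it: the product of
    log10(lifetime stem cell divisions) and log10(lifetime cancer risk).
    Since risk <= 1, its log is <= 0, so the score is non-positive."""
    return math.log10(divisions) * math.log10(lifetime_risk)

# Illustrative point: a 1% lifetime risk and 1e10 divisions
# gives 10 * (-2), i.e. about -20.
print(ers(0.01, 1e10))
```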

Shorn of the detail, it's the product of two logarithms.  How does that make sense?  Multiplying two logarithms is bizarre; for all ordinary purposes you're supposed to add them.  For this analysis, a simple measure would seem to be the ratio of lifetime incidence to stem cell divisions, or you might prefer the log of that ratio, which would be the log of the incidence minus the log of the stem cell divisions.

(On further reflection, the number I'd use would be {log(1-incidence)/divisions}.  That doesn't give a defined answer for lifetime incidence of unity, but you can get a number by using an incidence of just less than unity.  Among the other cancer types, it picks out gallbladder cancer as having the highest environmental or hereditary risk, which is consistent with that cancer's unusual geographical variation of incidence.)
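A sketch of that preferred measure, with hypothetical numbers only:

```python
import math

def preferred_score(incidence, divisions):
    """The measure suggested here: log(1 - incidence) / divisions.
    An incidence of exactly one is undefined, so nudge it just below,
    as the post does for FAP colorectal."""
    if incidence >= 1:
        incidence = 1 - 1e-6
    return math.log(1 - incidence) / divisions

# Hypothetical comparison of two tissue types (not the paper's data):
# the same incidence from fewer divisions scores more negative,
# i.e. implies more risk per division.
print(preferred_score(0.01, 1e8))
print(preferred_score(0.01, 1e10))
```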

The Supplement attempts to justify multiplying the logarithms by explaining why dividing them wouldn't make sense.  Which is a bit like advocating playing football in ballet shoes because it would be foolish to wear stilettos.

Whatever ERS calculation you used, the clustering method would still divide the cancers into two groups, because that's what clustering methods do, but different calculations would put different cancer types in the high-ERS group.  If you want, as the senior author does, to draw conclusions from the composition of the high-ERS cluster, you need a sound justification for your ERS calculation.

__

To its credit, The Guardian has published a piece pointing out the correlation misunderstanding.  This piece is also highly unimpressed by the paper, and this review of it has mixed feelings.

Me, I suppose the underlying idea has some truth in it.  But the methodology is the worst I've ever seen in a prominently published paper.

Update: more commentary from Understanding Uncertainty and StatsGuy 

Update: Bradley J Fikes, author of this piece in the San Diego Union-Tribune, complains in comments here that the title of this post is ill-chosen.  He points out that he didn't write his story simply from the press release, but checked it with Johns Hopkins before it was published.  He's got a point about the title: more than half of what I say here is criticism of the paper not of the press release.



Tuesday, 22 November 2011

Median survival time from cancer diagnosis

The BBC has a story about median survival times from diagnosis for various cancers, and how they have changed in the last 40 years.  For some cancers there's been a big improvement, for others there isn't.  A spokeswoman for Cancer Research UK says that more research is urgently needed into cancers for which there has been little improvement.

The source is a research briefing paper by Macmillan Cancer Support.  Under the heading "Shocking Variation" the introduction says:
First the good news: overall median survival time for all cancer types 40 years ago was just one year, now it is predicted to be nearly six years. This improvement is testament to the improvements in surgery, diagnosis, radiotherapy, and new drugs. There have been particularly dramatic improvements in survival time for breast cancer, colon cancer and Non-Hodgkin’s Lymphoma – with many years added to median survival times.
But the good news is tempered by the woeful lack of improvement in other cancers. There has been almost no progress for cancers like lung and brain, where median survival times have risen by mere weeks. Shockingly pancreatic cancer median survival time has hardly risen at all. The NHS and cancer community must urgently look at why.
Apart from not being shocked, I don't disagree with that.  But there is something important left unsaid.  There are three ways to improve cancer survival time from diagnosis.
1) Better treatments
2) Earlier diagnosis, even if the treatment is ineffective
3) More effective treatment made possible by earlier diagnosis

Certainly treatments have got better for all cancers - medical science is a wonderful thing.  But there are few in which this has given us a really big increase in median survival time: Non-Hodgkin's Lymphoma is one.  I suspect that most of the improvement has been from much earlier diagnosis made possible by scanning technology invented in the early 70s, by endoscopy, and by testing for tumour markers such as PSA.  And it is hard to separate effects (2) and (3).

Screening programmes are likely to become widely deployed only if there is evidence that they decrease mortality: that suggests that treatment following early diagnosis reduces mortality even in asymptomatic patients, but it doesn't tell us by how much it increases median survival.  (There's a helpful discussion of how to evaluate screening programmes here.)

The Macmillan report notes that the prostate cancer figures should be treated with caution because of the increased "incidence" of low grade tumours following the introduction of PSA testing (they should say "diagnosis").  Similar caution should be applied to interpreting the data for all cancers.

Thursday, 10 November 2011

Is the NHS a world leader in cancer care?

The Guardian is pleased to report that "The prime minister and health secretary have criticised the NHS on cancer, but new figures suggest the service is a world leader".  This is based on a study published in the British Journal of Cancer, which says that "the NHS in England and Wales has helped achieve the biggest drop in cancer deaths and displayed the most efficient use of resources among 10 leading countries worldwide."

The Guardian doesn't link to the paper, and its report is a masterpiece of unclarity.  "While six countries saw falls of at least 20%, England and Wales – which in 1979-81 had the third highest rate with 4,156 deaths per million men – improved the most, achieving the fifth lowest rate among the 10 countries by 2004-06 with 2,869 deaths per million."  Which is to say that England and Wales improved from eighth to fifth out of ten countries.

The actual paper, by Colin Pritchard and Tamas Hickish, is here.  It finds that England and Wales has had the largest reduction in male cancer deaths out of the ten countries over the 25-year period, and that England and Wales has the highest ratio of improvement in cancer mortality to percentage of GDP spent on cancer care.

The relative improvement for women is much less impressive.  I've tried averaging to get a rate for the sexes combined (the calculation ought to be more elaborate than a crude average, but this is a useful estimate) and the result is that England and Wales has had the largest reduction by this measure too.

The data don't cover exactly the same dates in all ten countries, so I've calculated an annualized rate of improvement.  England and Wales is easily the best, at 1.2% per year (Germany is second at 0.99%).
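The annualization is a simple compound-rate calculation.  As a check, here it is applied to the male figures quoted in the Guardian's report (4,156 falling to 2,869 deaths per million, taking the 1979-81 to 2004-06 gap as 25 years); the male-only rate comes out a little higher than the combined 1.2% above:

```python
def annualized_improvement(start_rate, end_rate, years):
    """Constant annual fractional fall in mortality implied by the
    start and end rates over the period."""
    return 1 - (end_rate / start_rate) ** (1 / years)

# Male England and Wales figures: 4,156 -> 2,869 deaths per million
# between 1979-81 and 2004-06, taken as 25 years.
print(f"{annualized_improvement(4156, 2869, 25):.2%}")  # about 1.5% a year
```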

Are these good measures of performance?  Tim Worstall pungently comments that "What [the figures] show is that the NHS used to be shite at cancer and now it’s only middle ranking".  He's right that England and Wales ranked in the middle of the ten countries for cancer mortality at the end of the period studied.  It had almost exactly the same cancer mortality rate as the USA (the USA data are from one year earlier).  The lowest mortality was in Japan and the highest in The Netherlands.  I don't believe this tells us anything about the relative merits of various healthcare systems, as Worstall might like it to.  Regarding the figures relative to GDP he adds "...a system which spends less to cure less cancer is going to be more efficient in its use of money to cure cases of cancer. Because it’s only curing the easy cases."  That's a fair point in theory.  But if the USA with all its extra spending is curing lots of hard cases, the effects aren't showing in the mortality data.

The government has a strategy document which proclaims that "we aim to save an additional 5,000 lives every year by 2014/15".  Cancer Research UK tells us that in 2009 there were 156,090 deaths from cancer, so that would be an improvement of about 3.2%, or about 1.06% per year over three years.  It seems the government's aim is to slow down the rate of improvement.  (The Pritchard and Hickish analysis is for ages 15-74 in England and Wales, so the comparison is not exact.)

It's genuinely hard to compare one health service with another.  If you compare mortality data, you are looking at different populations with different lifestyles and different methods of compiling statistics - differences in healthcare may be a minor factor.  By looking at improvements in mortality, Pritchard and Hickish eliminate some of these effects but have the problem that the quality of healthcare at the start date may differ markedly between the various countries.  The government analysis prefers to refer to survival times from diagnosis: the problem with this is that it's as much a measure of how early you diagnose as of how effectively you treat.