Tuesday, 6 January 2015

Two-thirds of cancers - collected links

News sites getting the meaning of "two thirds of cases" wrong: Independent, Telegraph, Mail, Express, Mirror, Huffington Post
News site getting it wrong in the headline but right in the text without one having to scroll down: Reuters.
News sites getting it right: BBC, Guardian.

The press release.


The Science abstract, with paywalled link to the paper
Free preview of the paper
Supplement on the data and methodology


Long critical review of the paper and its reporting: David Gorski
Discussion of the reporting: Andrew Maynard, Science-Presse (in French)
Criticism of the interpretation of correlation: Guardian, statsguy, Antonio Rinaldi (in Italian), with his own model, me, with a toy model
Criticism of the correlation calculation: StatsChat
Criticism of the clustering methodology: Understanding Uncertainty (with discussion of the reporting), statsguy, me, with discussion of the methodology generally
Criticism of the message: Cancer Research UK (with discussion of the reporting and the paper)
Expressing doubts about the accuracy of the data: Paul Knoepfler

A few comments on the paper: Science

Support for the paper: Steven Novella
Support for the message: PZ Myers, expressing disdain for those reluctant to accept the role of random chance

Monday, 5 January 2015

Cancer risk - an analysis

My previous post discussed this paper, and its claim that two thirds of cancer types are largely unaffected by environmental or hereditary carcinogenic factors.  While I'm unimpressed by the paper, the idea behind it is interesting, so here's my analysis of its data.

The hypothesis is that "many genomic changes occur simply by chance during DNA replication rather than as a result of carcinogenic factors.  Since the endogenous mutation rate of all human cell types appears to be nearly identical, this concept predicts that there should be a strong, quantitative correlation between the lifetime number of divisions among a particular class of cells within each organ (stem cells) and the lifetime risk of cancer arising in that organ."

So let's suppose that each stem cell division gives rise to cancer with a small probability p.  Then if there are n lifetime divisions, the probability that none of them leads to cancer is (1-p)^n, so the lifetime risk of cancer, R, is 1 - (1-p)^n.  We can rearrange that to find an expression for p: ln(1-p) = ln(1-R)/n.  For very small p, ln(1-p) ≈ -p, so p ≈ -ln(1-R)/n.  If we plot -ln(1-R) against n we should expect to find that for all the organs where carcinogenic factors are absent the values fall on the same straight line through the origin, with slope p.
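For anyone who wants to check the algebra numerically, here's a minimal Python sketch of the rearrangement above.  The function names and the example inputs (a lifetime risk of 1% and ten billion divisions) are my own illustrative assumptions, not values from the paper.

```python
import math

def implied_p(lifetime_risk, divisions):
    """Exact per-division probability p solving R = 1 - (1 - p)**n."""
    # log1p and expm1 keep precision when R and p are very small.
    return -math.expm1(math.log1p(-lifetime_risk) / divisions)

def implied_p_approx(lifetime_risk, divisions):
    """Small-p approximation p ~ -ln(1 - R) / n."""
    return -math.log1p(-lifetime_risk) / divisions

# Hypothetical inputs, chosen only for illustration (not from the paper).
R = 0.01
n = 1e10

p_exact = implied_p(R, n)
p_approx = implied_p_approx(R, n)

# Round trip: plugging p back into R = 1 - (1 - p)**n should recover R.
R_back = -math.expm1(n * math.log1p(-p_exact))

print(p_exact, p_approx, R_back)
```

For inputs of this size the exact and approximate values of p differ only far down in the significant figures, which is why the small-p approximation is harmless here.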

However, the values of n range through several orders of magnitude, so we can't create this plot unless we're willing to make all the rare cancers invisibly close to the origin.  Instead, let's take logs again, giving log(p) = log(-ln(1-R)) - log(n).  So on a graph of log(-ln(1-R)) against log(n), all the cancers satisfying our hypothesis should fall on a straight line with slope one crossing the y axis at log(p).  (I've switched to base-10 logarithms for this step, to make the powers of ten easier to follow.)
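As a sanity check on the "slope one, intercept log(p)" claim, here's a short Python sketch that generates synthetic tissues sharing a single per-division probability and fits a straight line to the transformed values.  The value of p, the list of division counts and the helper fit_line are all assumptions of mine for illustration; none of them comes from the paper.

```python
import math

# Hypothetical shared per-division probability and division counts.
p = 1e-12
divisions = [1e6, 1e8, 1e10, 1e12]

xs = [math.log10(n) for n in divisions]
ys = []
for n in divisions:
    # -ln(1 - R) with R = 1 - (1 - p)**n simplifies to -n * ln(1 - p).
    neg_log_survival = -n * math.log1p(-p)
    ys.append(math.log10(neg_log_survival))

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(xs, ys)
print(slope, intercept, math.log10(p))
```

If the real data followed the hypothesis, a fit to the bottom edge of the cloud of points should behave the same way.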

Here's the graph, which looks not unlike the one in the paper.  The correlation between the x and y data series is 0.787, again not unlike in the paper.  But the slope of a line through the points is not unity, nor is there a subset of points at the bottom of the envelope of points for which the slope is unity.

(I've arbitrarily given FAP colorectal a cancer risk of one millionth less than one, because the method doesn't allow a risk of exactly one.  Its point could be moved vertically by choosing a different number.)

To explore further how well the data fit the model, I've backed out implied values of p for each cancer type.

Here's the problem.  If the data matched the theory, there would be a group of cancer types at the left end of the chart with similar implied probabilities.  It seems in particular that the risk of small-intestine adenocarcinoma is anomalously low.

[A commentator points out that there is a group of cancer types near the left end of the chart which do have similar implied probabilities (the same eight cancers lie roughly in a straight line in the scatter plot).  But the theory in the paper is that there's a background rate of cancer in any tissue type, depending only on the number of stem cell divisions, because "the endogenous mutation rate of all human cell types appears to be nearly identical".  This theory can't be casually modified to allow for a background rate of cancer in all tissue types except for in the small intestine.  (Oncologists are of course aware that small-bowel cancers are strangely rare.)]

Let's try an alternative theory: that for every tissue type, some fraction of stem cell divisions, call it α, is affected by environmental or hereditary influences in a way which gives those divisions a probability, call it q, of causing cancer.  q is the same for all tissue types.  The remaining divisions carry negligible risk by comparison.  Somewhat arbitrarily, we'll assume α is one for the cancer with the highest implied probability in our previous analysis: that is, q is equal to the p implied for Gallbladder non-papillary adenocarcinoma.  We can now back out a value of α for each cancer.
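Backing out α is a one-line calculation once the implied p values are known: since only a fraction α of divisions carries the risk q, the implied p for a tissue is roughly α times q.  Here's a minimal Python sketch; the three "cancers" and their numbers are invented for illustration and are not the paper's data.

```python
import math

# Hypothetical (lifetime risk, lifetime stem cell divisions) pairs.
# These are made-up numbers, NOT taken from the paper.
cancers = {
    "type A": (0.003, 1e8),
    "type B": (0.05, 1e12),
    "type C": (0.0007, 1e10),
}

def implied_p(risk, divisions):
    """Small-p approximation p ~ -ln(1 - R) / n."""
    return -math.log1p(-risk) / divisions

p_values = {name: implied_p(r, n) for name, (r, n) in cancers.items()}

# q is pinned to the largest implied p, so that cancer gets alpha = 1.
q = max(p_values.values())
alpha = {name: p / q for name, p in p_values.items()}

# Check that alpha and q reproduce each input risk: R = 1 - exp(-alpha * q * n).
reconstructed = {
    name: -math.expm1(-alpha[name] * q * n) for name, (r, n) in cancers.items()
}

for name in cancers:
    print(name, alpha[name], reconstructed[name])
```

By construction this model reproduces every input risk, which is the sense in which it "fits" whatever data you give it.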

Well, it's a simplistic theory, but it does have the advantage over our previous model that it fits the data.

It seems to me that picking out gallbladder cancer as high-alpha is a plus for this model, because that cancer has a peculiar geographic spread which can only be due to environmental or hereditary factors.

And I've been mischievous.  In this theory, despite the correlation in the input data between stem cell divisions and cancer risk, every cancer is caused by environmental or hereditary factors.

Saturday, 3 January 2015

Science by press release

Yesterday's Times has a front page story "Two thirds of cancer cases are the result of bad luck rather than poor lifestyle choices...". (paywall)

That doesn't match my preconceptions, so I looked for the story online.  The Independent and the Telegraph agree.  So does the Mail.  And the Express.  And the Mirror.

Reuters' headline agrees, but its story suggests something a bit different - that two thirds of an arbitrary selection of cancer types occur mainly at random.

The BBC speaks unambiguously of "most cancer types" and so does The Guardian.

The press release which must have given rise to this story features the phrase "two thirds of adult cancer incidence across tissues can be explained primarily by 'bad luck'".  I can't make much sense of "cancer incidence across tissues", so I can't blame the journalists for stumbling over it likewise.  But the reporters who got the story right must have managed to scan down to the paragraph where the press release explains that the researchers "found that 22 cancer types could be largely explained by the “bad luck” factor of random DNA mutations during cell division. The other nine cancer types had incidences higher than predicted by "bad luck" and were presumably due to a combination of bad luck plus environmental or inherited factors."

I emphasize that "two thirds of cancer types" is not at all the same as "two thirds of cancer cases".  Two rare cancers apparently unrelated to environmental factors will count for far fewer cases than one common cancer in the other category.
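To put numbers on the distinction, here's a tiny Python illustration.  The case counts are invented purely to make the arithmetic visible; they aren't real incidence figures.

```python
# Hypothetical annual case counts, invented purely for illustration.
# Each entry: (name, cases, classified as "bad luck"?)
cancer_types = [
    ("rare type 1", 1_000, True),
    ("rare type 2", 1_000, True),
    ("common type", 50_000, False),
]

luck_types = sum(1 for _, _, luck in cancer_types if luck)
share_of_types = luck_types / len(cancer_types)

total_cases = sum(cases for _, cases, _ in cancer_types)
luck_cases = sum(cases for _, cases, luck in cancer_types if luck)
share_of_cases = luck_cases / total_cases

print(f"share of types: {share_of_types:.1%}")
print(f"share of cases: {share_of_cases:.1%}")
```

With these made-up numbers, two thirds of the types are in the "bad luck" group but fewer than 4% of the cases are.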

So what of the paper behind the press release?  Here's the abstract, with a paywalled link to the whole paper.  Or you can 'preview' the paper for free here, to the extent your conscience permits.  Supplementary data and methodology descriptions are here.

The hypothesis behind the paper is that cancer is to a large extent caused by errors arising during stem cell division, at a rate which is independent of the tissue type involved.  The researchers therefore obtain estimates of the lifetime number of stem cell divisions in various tissue types, and plot that against lifetime cancer incidence, obtaining a significant-looking scatter plot (Figure 1 in the published paper).  So far so good.

But they've used a log-log plot, necessary to cover the orders-of-magnitude variation in the data.  Now, if you think, as the researchers apparently do, that cancer risk is proportional to number of stem cell divisions, it follows that the slope of a log-log plot should be unity.  It isn't; by eye it's more like two thirds.  The researchers, busy calculating a linear correlation between the log values, seem not to have noticed this surprising result.  Instead they square the correlation to get an R^2 of 65%, which may (it's not clear) be the source of the "two-thirds of cancer types" claim.

If so, that claim is based on a total failure of comprehension of what correlation means.  Imagine a hypothetical world in which cancer occurs during stem cell division with some significant probability only if a given environmental factor is present, and that environmental factor is present equally in all tissue types.  In this world cancer incidence across tissue types is perfectly correlated with the number of stem cell divisions, but nevertheless all cancer is caused by the environmental factor.
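That hypothetical world is easy to simulate.  In the Python sketch below, the division counts and the per-division probability q_with_factor are arbitrary assumptions of mine; the point is only that the correlation comes out essentially perfect even though removing the environmental factor removes every cancer.

```python
import math

def lifetime_risk(divisions, q):
    """R = 1 - (1 - q)**n, computed stably for tiny q."""
    return -math.expm1(divisions * math.log1p(-q))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical tissues and a hypothetical environmental risk per division.
divisions = [1e6, 1e7, 1e8, 1e9, 1e10, 1e11]
q_with_factor = 1e-13

# Same tissues with and without the environmental factor present.
risk_with = [lifetime_risk(n, q_with_factor) for n in divisions]
risk_without = [lifetime_risk(n, 0.0) for n in divisions]

correlation = pearson(
    [math.log10(n) for n in divisions],
    [math.log10(r) for r in risk_with],
)

print(correlation, risk_without)
```

The correlation says nothing about cause here: it is close to one, and yet every case in this toy world is attributable to the environmental factor.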

It's simply impossible to say anything about the importance of environmental factors in a statistical analysis without including those factors as an input to the analysis.

However, the press release also features the paragraph I quoted about 22 out of 31 cancer types being largely explained by bad luck.  Perhaps that's what they mean by two thirds.  To get this number, they devised an Extra Risk Score - ERS for short.  Then they used AI methods to divide cancer types into two groups based on the ERS values.  So what's the ERS?  The Supplement describes it as "the (negative value of the) area of the rectangle formed in the upper-left quadrant of Fig. 1 by the two coordinates (in logarithmic scale) of a data point as its sides." That is, it's the product of the {base-10 logarithm of stem cell divisions} and the {base-10 logarithm of lifetime cancer risk}.  (The cancer risk logarithm is negative (or zero) since lifetime risk is less than (or equal to) one.)

Shorn of the detail, it's the product of two logarithms.  How does that make sense?  Multiplying two logarithms is bizarre; for all ordinary purposes you're supposed to add them.  For this analysis, a simple measure would seem to be the ratio of lifetime incidence to stem cell divisions, or you might prefer the log of that ratio, which would be the log of the incidence minus the log of the stem cell divisions.
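To see that the choice of formula matters, here's a Python sketch comparing the ERS as I read it from the Supplement's description with the plain log ratio.  The two example tissues are hypothetical, picked by me so that the two measures disagree about which has the higher "extra risk".

```python
import math

def extra_risk_score(divisions, risk):
    """ERS as described above: the product of the two base-10 logarithms."""
    return math.log10(divisions) * math.log10(risk)

def log_ratio(divisions, risk):
    """Alternative measure: log10 of (lifetime risk / stem cell divisions)."""
    return math.log10(risk) - math.log10(divisions)

# Two hypothetical tissues (divisions, lifetime risk); NOT from the paper.
tissue_a = (1e6, 1e-3)
tissue_b = (1e10, 10 ** -1.5)

ers_a = extra_risk_score(*tissue_a)   # about 6 * -3   = -18
ers_b = extra_risk_score(*tissue_b)   # about 10 * -1.5 = -15
ratio_a = log_ratio(*tissue_a)        # about -3 - 6    = -9
ratio_b = log_ratio(*tissue_b)        # about -1.5 - 10 = -11.5

print("ERS ranks B above A:      ", ers_b > ers_a)
print("log ratio ranks A above B:", ratio_a > ratio_b)
```

On these made-up inputs the ERS puts tissue B ahead, while the log ratio puts tissue A ahead, so any conclusion drawn from the ranking depends on which formula you picked.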

(On further reflection, the number I'd use would be {log(1-incidence)/divisions}.  That doesn't give a defined answer for lifetime incidence of unity, but you can get a number by using an incidence of just less than unity.  Among the other cancer types, it picks out gallbladder cancer as having the highest environmental or hereditary risk, which is consistent with that cancer's unusual geographical variation of incidence.)

The Supplement attempts to justify multiplying the logarithms by explaining why dividing them wouldn't make sense.  Which is a bit like advocating playing football in ballet shoes because it would be foolish to wear stilettos.

Whatever ERS calculation you used, the clustering method would still divide the cancers into two groups, because that's what clustering methods do, but different calculations would put different cancer types in the high-ERS group.  If you want, as the senior author does, to draw conclusions from the composition of the high-ERS cluster, you need a sound justification for your ERS calculation.
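Here's a small Python demonstration of the "that's what clustering methods do" point.  I've used a bare-bones one-dimensional k-means with k = 2 as a stand-in, which is an assumption on my part about the kind of method involved; the function two_means and the evenly spaced scores are mine, not the paper's.

```python
def two_means(values, iterations=100):
    """Split 1-D values into a low and a high group with k-means (k = 2).

    Assumes the input contains at least two distinct values.
    """
    low, high = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        # Assign each value to the nearer centroid (ties go to the low group).
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if (new_low, new_high) == (low, high):
            break  # centroids stopped moving
        low, high = new_low, new_high
    return low_group, high_group

# Evenly spaced hypothetical scores: there is no natural gap to find.
scores = [float(i) for i in range(10)]
low_group, high_group = two_means(scores)

print(low_group)
print(high_group)
```

Even with perfectly evenly spaced scores the method reports two groups, so the existence of a "high" cluster tells you nothing by itself.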


To its credit, The Guardian has published a piece pointing out the correlation misunderstanding.  This piece is also highly unimpressed by the paper, and this review of it has mixed feelings.

Me, I suppose the underlying idea has some truth in it.  But the methodology is the worst I've ever seen in a prominently published paper.

Update: more commentary from Understanding Uncertainty and StatsGuy 

Update: Bradley J Fikes, author of this piece in the San Diego Union-Tribune, complains in comments here that the title of this post is ill-chosen.  He points out that he didn't write his story simply from the press release, but checked it with Johns Hopkins before it was published.  He's got a point about the title: more than half of what I say here is criticism of the paper, not of the press release.