Yesterday's *Times* has a front page story "Two thirds of cancer cases are the result of bad luck rather than poor lifestyle choices...". (paywall)

That doesn't match my preconceptions, so I looked for the story online. The Independent and the Telegraph agree. So does the Mail. And the Express. And the Mirror.

Reuters' headline agrees, but its story suggests something a bit different - that two thirds of an arbitrary selection of cancer *types* occur mainly at random.

The BBC speaks unambiguously of "most cancer types", and so does The Guardian.

The press release which must have given rise to this story features the phrase "two thirds of adult cancer incidence across tissues can be explained primarily by 'bad luck'". I can't make much sense of "cancer incidence across tissues", so I can't blame the journalists for stumbling over it either. But the reporters who got the story right must have managed to scan down to the paragraph where the press release explains that the researchers "found that 22 cancer types could be largely explained by the 'bad luck' factor of random DNA mutations during cell division. The other nine cancer types had incidences higher than predicted by 'bad luck' and were presumably due to a combination of bad luck plus environmental or inherited factors."

I emphasize that "two thirds of cancer types" is not at all the same as "two thirds of cancer cases". Two rare cancers apparently unrelated to environmental factors will count for far fewer cases than one common cancer in the other category.
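A toy calculation, with numbers invented purely for illustration, makes the gap between the two claims concrete:

```python
# Made-up case counts: two of three cancer *types* are "random",
# yet far fewer than two thirds of *cases* are.
cases = {
    "rare_random_1": 1_000,         # hypothetical randomly-occurring cancer
    "rare_random_2": 2_000,         # another hypothetical random cancer
    "common_environmental": 50_000, # hypothetical environmentally-driven cancer
}
random_types = {"rare_random_1", "rare_random_2"}

type_fraction = len(random_types) / len(cases)
case_fraction = sum(cases[t] for t in random_types) / sum(cases.values())
print(f"fraction of types: {type_fraction:.2f}")  # 0.67
print(f"fraction of cases: {case_fraction:.2f}")  # 0.06
```

With these (invented) counts, two thirds of the types account for barely six percent of the cases.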

So what of the paper behind the press release?

Here's the abstract, with a paywalled link to the whole paper. Or you can 'preview' the paper for free here, to the extent your conscience permits. Supplementary data and methodology descriptions are here.

The hypothesis behind the paper is that cancer is to a large extent caused by errors arising during stem cell division, at a rate which is independent of the tissue type involved. The researchers therefore obtain estimates of the lifetime number of stem cell divisions in various tissue types, and plot that against lifetime cancer incidence, obtaining a significant-looking scatter plot (Figure 1 in the published paper). So far so good.

But they've used a log-log plot, necessary to cover the orders-of-magnitude variations in the data. Now, if you think, as the researchers apparently do, that cancer risk is proportional to the number of stem cell divisions, it follows that the slope of a log-log plot should be unity. It isn't: by eye it's more like two thirds. The researchers, busy calculating a linear correlation between the log values, seem not to have noticed this surprising result. Instead they square the correlation to get an R^{2} of 65%, which may (it's not clear) be the source of the "two-thirds of cancer types" claim.
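To see how a high R^{2} can coexist with a slope well below unity, here's a sketch on synthetic data (the numbers are invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log-log data: 31 "tissue types" whose log10 lifetime risk
# follows log10(divisions) with a true slope of 2/3 plus noise.
# (Invented values -- not the paper's data.)
log_divisions = rng.uniform(5, 12, size=31)
log_risk = (2 / 3) * log_divisions - 9 + rng.normal(0, 0.5, size=31)

slope, intercept = np.polyfit(log_divisions, log_risk, 1)
r = np.corrcoef(log_divisions, log_risk)[0, 1]

print(f"slope = {slope:.2f}")   # near 2/3, not the 1 that proportionality implies
print(f"R^2   = {r ** 2:.2f}")  # yet the squared correlation is still high
```

The squared correlation measures scatter about the fitted line, whatever its slope; it says nothing about whether the slope is the unity that a proportionality hypothesis requires.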

If so, that claim is based on a total failure of comprehension of what correlation means. Imagine a hypothetical world in which cancer occurs during stem cell division with some significant probability only if a given environmental factor is present, and that environmental factor is present equally in all tissue types. In this world cancer incidence across tissue types is perfectly correlated with the number of stem cell divisions, but nevertheless all cancer is caused by the environmental factor.
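The thought experiment is easy to simulate. In the sketch below (all numbers invented), every cancer is caused by the environmental factor, yet log incidence and log divisions correlate almost perfectly:

```python
import numpy as np

# Hypothetical tissues with made-up stem cell division counts.
divisions = np.array([1e5, 1e7, 1e9, 1e11, 1e12])

# The environmental factor is present in every tissue, giving the same
# (made-up) per-division cancer probability everywhere.
p = 1e-13

# Lifetime incidence = 1 - (1 - p)^divisions, computed stably via expm1/log1p.
# By construction, 100% of these cancers are environmentally caused.
incidence = -np.expm1(divisions * np.log1p(-p))

r = np.corrcoef(np.log10(divisions), np.log10(incidence))[0, 1]
print(f"log-log correlation: {r:.4f}")  # essentially perfect
```

Perfect correlation with division counts, and yet removing the environmental factor would eliminate every case.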

It's simply impossible to say anything about the importance of environmental factors in a statistical analysis without including those factors as an input to the analysis.

However, the press release also features the paragraph I quoted about 22 out of 31 cancer types being largely explained by bad luck. Perhaps that's what they mean by two thirds. To get this number, they devised an *Extra Risk Score*, ERS for short. Then they used AI methods to divide cancer types into two groups based on the ERS values. So what's the ERS? The Supplement describes it as "the (negative value of the) area of the rectangle formed in the upper-left quadrant of Fig. 1 by the two coordinates (in logarithmic scale) of a data point as its sides". That is, it's the product of the {base-10 logarithm of stem cell divisions} and the {base-10 logarithm of lifetime cancer risk}. (The cancer risk logarithm is negative (or zero), since lifetime risk is less than (or equal to) one.)

Shorn of the detail, it's the product of two logarithms. How does that make sense? Multiplying two logarithms is bizarre; for all ordinary purposes you're supposed to add them. For this analysis, a simple measure would seem to be the ratio of lifetime incidence to stem cell divisions, or you might prefer the log of that ratio, which would be the log of the incidence minus the log of the stem cell divisions.
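A quick sketch (with invented numbers) shows that the product-of-logs score and the log-ratio can rank the same two cancers in opposite orders:

```python
import numpy as np

# Two hypothetical cancer types (numbers invented for illustration):
# "A": a high-division tissue with a fairly common cancer,
# "B": a low-division tissue with a rare cancer -- but a far higher
#      risk *per division* than A.
divisions = {"A": 1e10, "B": 1e6}
risk      = {"A": 1e-1, "B": 1e-3}

ers, log_ratio = {}, {}
for t in divisions:
    log_d, log_r = np.log10(divisions[t]), np.log10(risk[t])
    ers[t]       = log_d * log_r   # the Supplement's product of logs
    log_ratio[t] = log_r - log_d   # log10 of (lifetime risk / divisions)

print(ers)        # {'A': -10.0, 'B': -18.0}: A scores more extra risk
print(log_ratio)  # {'A': -11.0, 'B': -9.0}:  B scores more extra risk
```

Since cluster membership drives the paper's headline count, a score that reorders cancers this freely matters.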

(On further reflection, the number I'd use would be {log(1-incidence)/divisions}. That doesn't give a defined answer for a lifetime incidence of unity, but you can get a number by using an incidence of just less than unity. Among the other cancer types, it picks out gallbladder cancer as having the highest environmental or hereditary risk, which is consistent with that cancer's unusual geographical variation of incidence.)
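That measure is easy to compute; here's a minimal sketch, with invented incidence and division counts since the paper's table isn't reproduced here:

```python
import math

def extra_risk(incidence, divisions, eps=1e-12):
    """log(1 - incidence) / divisions, the score suggested above.

    Undefined at incidence == 1, so such values are nudged just below
    unity. More negative means more risk per division.
    """
    incidence = min(incidence, 1 - eps)
    return math.log1p(-incidence) / divisions

# Invented example values: a rare cancer in a low-division tissue can
# still carry a much larger per-division risk than a common cancer in
# a high-division tissue.
print(extra_risk(1e-3, 1e6))   # roughly -1e-09
print(extra_risk(1e-1, 1e10))  # roughly -1e-11
```

Unlike the product of logarithms, this has a direct interpretation: it is (minus) the per-division hazard, assuming independent risk at each division.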

The Supplement attempts to justify multiplying the logarithms by explaining why dividing them woudn't make sense. Which is a bit like advocating playing football in ballet shoes because it would be foolish to wear stilettos.

Whatever ERS calculation you used, the clustering method would still divide the cancers into two groups, because that's what clustering methods do, but different calculations would put different cancer types in the high-ERS group. If you want, as the senior author does, to draw conclusions from the composition of the high-ERS cluster, you need a sound justification for your ERS calculation.


To its credit, The Guardian has published a piece pointing out the correlation misunderstanding.

This piece is also highly unimpressed by the paper, and this review of it has mixed feelings.

Me, I suppose the underlying idea has some truth in it. But the methodology is the worst I've ever seen in a prominently published paper.

Update: more commentary from Understanding Uncertainty and StatsGuy.

Update: Bradley J Fikes, author of this piece in the *San Diego Union-Tribune*, complains in comments here that the title of this post is ill-chosen. He points out that he didn't write his story simply from the press release, but checked it with Johns Hopkins before it was published. He's got a point about the title: more than half of what I say here is criticism of the paper, not of the press release.