How many genes have been associated with cancer in PubMed?


Francisco Requena


March 20, 2021

In the biomedical literature, it is common to find sentences like:

“Besides, the gene [gene symbol] has been associated with [type of cancer(s)] [References]”

The structure of these sentences can change from article to article, but the underlying idea and goal are the same. I will try to summarise it in the following sentence:

“Hello reader/editor/reviewer, I was studying [any field], and I found this gene. I think it is a relevant/remarkable finding because it has been associated with cancer [references]. Therefore, it supports my hypothesis about the biological relevance of the gene in my field. Please, publish it.”

This approach is valid and logical as long as the gene <-> cancer association has been well described and validated by different experiments and research teams. Unfortunately, some of these associations are spurious and not well supported.

To explore this problem, we will count the number of articles in PubMed associating cancer with each one of the 19,205 protein-coding genes in the human genome.

To do so, we will write a short R script that sends one query per gene to PubMed using the fantastic rentrez package.

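The script has two simple steps: it reads the list of gene symbols and then, for each gene, it queries PubMed for articles mentioning both the gene and 'cancer' in the title or abstract and counts the hits.

You can find the code below: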

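# Packages used below
library(tidyverse)  # read_tsv, %>%, tibble, map_dfr, ggplot2
library(rentrez)    # entrez_search
library(scales)     # percent

# Read the gene symbols. The file path is left empty, as in the original.
# Assumption: the symbols sit in the first column of the file, hence pull(1).
gene_symbols <- read_tsv('') %>% 
  pull(1)

# Careful: it takes long to make all the queries
# Count the PubMed articles mentioning both the gene and 'cancer' in the title or abstract.
# n_hits is the number of returned IDs, so it is capped at retmax (600).
query_pubmed <- function(input_gene) {
  query_tmp <- entrez_search(db = "pubmed", 
                             term = paste0(input_gene, '[Title/Abstract] AND cancer[Title/Abstract]'), 
                             retmax = 600)
  tibble('gene' = input_gene, 'n_hits' = length(query_tmp[['ids']]))
}

# One query per gene, with the results bound into a single data frame
result_genes <- gene_symbols %>% map_dfr(~ query_pubmed(.x))

# Distribution of the number of articles per gene
result_genes %>%
  ggplot(aes(n_hits)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() +
  labs(x = 'Nº articles', y = 'Nº genes')

# Share of genes with 0, 1-5 and more than 5 articles
result_genes %>%
  mutate(category = case_when(
    n_hits == 0 ~ '0 articles',
    n_hits >= 1 & n_hits <= 5 ~ '1-5 articles',
    TRUE ~ '>5 articles'
  )) %>% 
  count(category) %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(reorder(category, perc), perc)) +
  geom_col(aes(fill = category), color = 'black') +
  scale_y_continuous(labels = percent, limits = c(0, 1)) +
  geom_label(aes(label = paste0(round(perc, 2) * 100, '%'))) +
  labs(fill = 'Category', x = 'Category', y = 'Percentage')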
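
To make the search term concrete, here is a minimal sketch of how the query string is assembled for a gene. Note that build_term is a hypothetical helper, not part of the script, and TP53 and BRCA1 are just example symbols. Nothing is sent to PubMed here; the function only builds the text that would be passed as the term argument.

```r
# Hypothetical helper (not in the original script): builds the PubMed search
# term that restricts both the gene symbol and the topic to title/abstract.
build_term <- function(gene, topic = 'cancer') {
  paste0(gene, '[Title/Abstract] AND ', topic, '[Title/Abstract]')
}

# Single gene
build_term('TP53')
# "TP53[Title/Abstract] AND cancer[Title/Abstract]"

# paste0 is vectorised, so several symbols can be turned into terms at once
terms <- build_term(c('TP53', 'BRCA1'))
terms
```

Pasting one of these strings into the PubMed search box is a quick way to check by hand what a single query returns.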

As you can see in the plot, 41% of the genes have been associated with cancer in more than five articles, 36% in 1-5 articles, and only 23% have no publications at all.

If I choose a random protein-coding gene from the human genome and query PubMed, I am more likely to find at least one article (77%) than none (23%).
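
The 77% figure follows directly from the three shares reported above. As a small worked check, assuming only the rounded percentages from the text (the names shares and p_at_least_one are mine, for illustration):

```r
# Rounded shares reported in the text (not recomputed from PubMed here)
shares <- c('>5 articles' = 0.41, '1-5 articles' = 0.36, '0 articles' = 0.23)

# Probability that a randomly chosen gene has at least one article:
# everything except the '0 articles' category
p_at_least_one <- 1 - shares[['0 articles']]
p_at_least_one

# Same number, obtained by adding the two non-zero categories
sum(shares[c('>5 articles', '1-5 articles')])
```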

These data reflect how easy it is to find articles associating cancer with most genes. Therefore, when readers find this kind of argument [my gene is important -> gene + cancer + references], they should take it with a grain of salt.

An interesting question is what lies behind these numbers. From a biological perspective, it is hard to believe that most of the human genome is relevant in cancer, even though cancer is a group of many different diseases, each with its own subgroups.

In the following points, I describe some of the reasons that might explain these numbers:

It is reasonable to think that a similar scenario happens with much published research that tries to link its analysis to some aspect of cancer even though the evidence is limited.

To clarify, this is by no means a way to discredit researchers whose work is related to cancer. It is a way to make people aware of the problem with finding articles in PubMed that describe gene A as associated with cancer and using them as evidence without further analysis.

Some ideas for a future version
