Analyzing drug consumption in Europe
Mar 9, 2017
Francisco Requena
7 minute read

The European Monitoring Centre for Drugs and Drug Addiction (EMCDDA) is a public organization that provides an overview of European drug problems with relevant data, such as the consumption of a given type of drug (cannabis, ecstasy, cocaine…), price, and purity. In addition, this information is broken down by relevant variables such as gender, age, or country.

We can access that knowledge through the website, but there is an inconvenience: all the information is provided as numerous tables that can be downloaded in .xlsx format, so merging multiple tables with the data of interest is slow and tedious. Moreover, there is no exploratory or visualization tool that helps the scientific community understand this valuable information, which remains hidden in hundreds of separate tables.

So I decided to develop an application that solves these problems and makes the information more accessible to the scientific community, combining two tools: R and Shiny.

The first step is to gather all the information provided in .xlsx format. I downloaded every table from the section “Prevalence of drug use” and put them together in a single dataframe. In total we have 504 columns (yes, a lot), where every column represents the consumption of a certain type of drug under a specific condition. Every column had to be identified by a descriptive name, so I devised a naming code built from short strings, with the following legends:

  if (name_drug == "can") {name_drug <- "cannabis"}
  if (name_drug == "coc") {name_drug <- "cocaine"}
  if (name_drug == "amp") {name_drug <- "amphetamines"}
  if (name_drug == "ecs") {name_drug <- "ecstasy"}
  if (name_drug == "lsd") {name_drug <- "lsd"}
  if (name_drug == "any") {name_drug <- "any illegal drugs"}
  if (name_drug == "alc") {name_drug <- "alcohol"}
  if (name_drug == "tob") {name_drug <- "tobacco"}

  if (name_cycle == "life") {name_cycle <- "lifetime"}
  if (name_cycle == "year") {name_cycle <- "last year"}
  if (name_cycle == "month") {name_cycle <- "last month"}

  if (name_interval == "all") {name_interval <- "(15-64 yr.)"}
  if (name_interval == "young") {name_interval <- "(15-34 yr.)"}
  if (name_interval == "15") {name_interval <- "(15-24 yr.)"}
  if (name_interval == "25") {name_interval <- "(25-34 yr.)"}
  if (name_interval == "35") {name_interval <- "(35-54 yr.)"}
  if (name_interval == "45") {name_interval <- "(45-54 yr.)"}
  if (name_interval == "55") {name_interval <- "(55-64 yr.)"}
  if (name_genre == "fe") {name_genre <- "female"}
  if (name_genre == "ma") {name_genre <- "male"}
  if (name_genre == "to") {name_genre <- "total"}

For example, the column with the following information:

(1) Consumption of cocaine + (2) Last year + (3) 35-54 yr. + (4) Male

has the column name “coc_year_35_ma”.
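The naming code can also be read in the other direction. The following is a minimal sketch, not the app's actual code, that decodes a column name back into its legend using named lookup vectors; the names `decode_column`, `drug_labels`, `cycle_labels`, `interval_labels` and `genre_labels` are hypothetical, and the labels are copied from the legends above.

```r
# Lookup tables mirroring the legends above (hypothetical names)
drug_labels <- c(can = "cannabis", coc = "cocaine", amp = "amphetamines",
                 ecs = "ecstasy", lsd = "lsd", any = "any illegal drugs",
                 alc = "alcohol", tob = "tobacco")
cycle_labels <- c(life = "lifetime", year = "last year", month = "last month")
interval_labels <- c(all = "(15-64 yr.)", young = "(15-34 yr.)",
                     "15" = "(15-24 yr.)", "25" = "(25-34 yr.)",
                     "35" = "(35-54 yr.)", "45" = "(45-54 yr.)",
                     "55" = "(55-64 yr.)")
genre_labels <- c(fe = "female", ma = "male", to = "total")

# Split a name such as "coc_year_35_ma" into its four codes and
# translate each one; an unknown code stops with an error.
decode_column <- function(column_name) {
  parts <- strsplit(column_name, "_", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 4)
  paste(drug_labels[[parts[1]]],
        cycle_labels[[parts[2]]],
        interval_labels[[parts[3]]],
        genre_labels[[parts[4]]],
        sep = ", ")
}

decode_column("coc_year_35_ma")
```

Compared with a chain of `if` statements, a lookup vector keeps each legend in one place and fails loudly when a code is not recognized.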

Below, I show you how the application gets that information:

Once the dataframe was created, with all the columns named according to this code, it was time to play with the data. I decided to use different visualization tools that show the same type of data, so the user can approach the same information from different perspectives:

In my opinion, graphics should be as simple as possible, because this helps the user understand the displayed information more easily. For this reason, I would like to emphasize two details:

1. As you can see, the color scale used in the map and in the bar plot of consumption is exactly the same. Keeping the same scale (1) across different graphics means the user has to think less about what the scale represents.

2. At first, I did not want to create an interactive graphic, because interactivity often adds unneeded complexity, but in this case it was necessary. The interactive box plot serves two purposes. First, the user can see a label next to every point with the country and the percentage of consumption. Second, most points fall within a narrow interval of values, so the user can zoom in on any interval of interest and appreciate the differences.

Finally, two features were added that help the user save the data produced by the application:

1. The user can download the chosen data in .csv format, so it can be imported into Excel to create custom graphics.

2. An R Markdown report in .html format with all the graphics generated by the app.
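The .csv download could be wired up in Shiny roughly as follows. This is a minimal sketch and not the app's actual code: the output id `download_csv`, the file name, and the `selected_data` data frame (filled with placeholder values, not real figures) are all assumptions made for illustration.

```r
library(shiny)

# Hypothetical stand-in for the data the user has selected in the app;
# the numbers are placeholders, not real EMCDDA values.
selected_data <- data.frame(
  country = c("Country A", "Country B"),
  consume = c(1.5, 0.5)
)

ui <- fluidPage(
  downloadButton("download_csv", "Download data (.csv)")
)

server <- function(input, output, session) {
  output$download_csv <- downloadHandler(
    # Name suggested to the browser for the downloaded file
    filename = function() paste0("drugsplot_", Sys.Date(), ".csv"),
    # Shiny passes a temporary file path; we write the table there
    content = function(file) write.csv(selected_data, file, row.names = FALSE)
  )
}

# shinyApp(ui, server)  # uncomment to launch the sketch
```

The .html report could plausibly follow the same pattern, with the `content` function calling `rmarkdown::render()` on a report template instead of `write.csv()`.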

We can draw some interpretations from the visualization of the data:

Consumption of cocaine between countries. If we compare the routes by which drug traffickers bring narcotics into European territory (as you can see in the picture below) (2) with the consumption by country, we can see a negative linear relationship between consumption and the distance the drug has to travel from the countries of arrival (Spain, Netherlands, Belgium, Italy and France) (2).

In addition, we could analyze quantitatively how cocaine consumption relates to the distance the drug has to travel. For each of the three largest cities of every country, we can calculate the sum of its distances to the five main countries of arrival, represented by their capitals.

Yes, but how could we do this? Of course, with R:

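# Load of libraries

library(dplyr)      # filter, select, group_by, top_n, mutate, left_join
library(geosphere)  # distHaversine, gcIntermediate
library(ggplot2)    # scatter plot with regression line
library(maps)       # map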

# Data source

# Load of world cities by longitude and latitude

list_cities <- read.csv('world_cities.csv')

# Load of cocaine consume by country (obtained with DrugsPlot)

data_consume <- read.csv('data_drugsplot.csv')
data_consume[,1] <- NULL
data_consume <- na.omit(data_consume)
colnames(data_consume) <- c("country","consume")

# Obtaining capitals of the main countries of arrival of cocaine (Spain, France, Italy, Netherlands, Belgium)

arrivals_cities <- filter(list_cities, city %in% c("Madrid", "Rome", "Brussels", "Paris", "Amsterdam" ))

# Getting the three cities with highest population of every country

list_cities <- select(list_cities, city, lat, lng, pop, country) %>%
  filter(country %in% data_consume$country) %>%
  group_by(country) %>% 
  top_n(3, pop) # Here, we can change the number of cities selected

# Getting the sum of distances from every city to every capital mentioned above.
# Note: distHaversine() expects each point as c(longitude, latitude).

list_cities <- list_cities %>%
  rowwise() %>%
  mutate(distance_m = distHaversine(c(lng, lat), c(arrivals_cities$lng[1], arrivals_cities$lat[1])) +
                    distHaversine(c(lng, lat), c(arrivals_cities$lng[2], arrivals_cities$lat[2])) +
                    distHaversine(c(lng, lat), c(arrivals_cities$lng[3], arrivals_cities$lat[3])) +
                    distHaversine(c(lng, lat), c(arrivals_cities$lng[4], arrivals_cities$lat[4])) +
                    distHaversine(c(lng, lat), c(arrivals_cities$lng[5], arrivals_cities$lat[5]))) %>%
  mutate(distance_km = distance_m / 1000) %>%
  left_join(data_consume, by = "country")

ggplot(list_cities, aes(distance_km, consume)) +
  geom_point() +
  stat_smooth(method = 'lm') +
  theme_bw() + 
  ggtitle('Consumption of cocaine vs distance to arrival countries') +
  ylab('Consumption of cocaine (%)') +
  xlab('Distance to the main countries of arrival (km)') +
  theme(plot.title = element_text(hjust = 0.5))

# Drawing network cities

map("world", fill = TRUE, col = "grey25", bg = "white", ylim = c(35, 70), xlim = c(-10, 45))
points(list_cities$lng, list_cities$lat, pch = 3, cex = 0.3, col = "chocolate1")
# Draw a great-circle line from every arrival capital to every selected city
for (i in 1:nrow(arrivals_cities)) {
  for (j in 1:nrow(list_cities)) {
    connection <- gcIntermediate(c(arrivals_cities$lng[i], arrivals_cities$lat[i]),
                                 c(list_cities$lng[j], list_cities$lat[j]))
    lines(connection, lwd = 0.3, col = "turquoise1")
  }
}
title(main = "Distances between the largest cities of every country \n and the five main countries of arrival of cocaine")

# Summary of the linear regression of consume on distance:
# Residuals:
#      Min       1Q   Median       3Q      Max
# -0.77725 -0.36353 -0.09222  0.25582  1.42392
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)  1.625e+00  1.787e-01   9.091  1.8e-13 ***
# distance    -1.054e-07  1.966e-08  -5.363  1.0e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.5408 on 70 degrees of freedom
# Multiple R-squared: 0.2912, Adjusted R-squared: 0.2811
# F-statistic: 28.76 on 1 and 70 DF, p-value: 9.997e-07
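
To make the distance step easier to check by hand, here is a self-contained sketch in base R, without geosphere. It is an illustration under stated assumptions rather than the code used for the results above: `haversine_km`, `arrivals` and `sum_distance_km` are hypothetical names, the Earth radius of 6371 km is an approximation, and the capital coordinates are rounded. Points are given as c(longitude, latitude), the same convention geosphere uses.

```r
# Great-circle distance in km between two points, each c(longitude, latitude)
haversine_km <- function(p1, p2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (p2[2] - p1[2]) * to_rad
  dlon <- (p2[1] - p1[1]) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(p1[2] * to_rad) * cos(p2[2] * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(pmin(1, a)))
}

# Rounded coordinates of the five arrival capitals (assumed values)
arrivals <- data.frame(
  city = c("Madrid", "Paris", "Rome", "Brussels", "Amsterdam"),
  lng  = c(-3.70, 2.35, 12.50, 4.35, 4.90),
  lat  = c(40.42, 48.86, 41.90, 50.85, 52.37)
)

# Sum of the distances from one city to the five capitals
sum_distance_km <- function(lng, lat) {
  sum(mapply(function(a_lng, a_lat) haversine_km(c(lng, lat), c(a_lng, a_lat)),
             arrivals$lng, arrivals$lat))
}

# Example: summed distance from Istanbul (roughly 28.98 E, 41.01 N)
sum_distance_km(28.98, 41.01)

# With the list_cities data frame built above, a regression summary of
# this kind could be obtained with something like:
# summary(lm(consume ~ distance_km, data = list_cities))
```

The reported coefficient is labelled `distance`, so the exact model call behind the summary above may have used a differently named or differently scaled variable than `distance_km`.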

We can also create a map of connections linking the three largest cities of every country with the five arrival countries (mentioned above), represented by their capitals:

Drug consumption in the female population. It is always lower than in the male population. Looking for information about this fact, I found an interesting article in the British newspaper “The Telegraph” that discussed the cultural factor:

Low values for Turkey. Under every condition analyzed, Turkey has the lowest consumption value. Clearly, cultural differences are a crucial element in drug consumption:

These analyses are just a few examples of what we can do with this data. I invite you to use the app (Drugsplot) and share your ideas in the comments section!


(1). You may wonder why I did not use a sequential color scale. In fact, it was my first option, but the problem is Turkey. Every country has a similar value (more or less), but Turkey, regardless of drug, age or gender, always has a remarkably low value (very healthy people, I guess…). So when I used a sequential scale, the contrast between Turkey and the rest of the countries was good, but the contrast among the remaining countries was very low. I therefore decided to use a scale with more than one color, such as “Spectral”, a diverging scale, where the contrast for this data is higher than with a sequential one.

(2). Source:
