Europe is a heterogeneous collection of countries, each with its own culture and qualities, but how do they compare on death rates? And are there factors that help or hinder healthcare structures? Prompted by these questions, I set out to explore the links between healthcare resources and death rates in European countries.
Healthcare is paramount in our society and of great personal interest, both for well-being and financially. I therefore chose to explore European datasets related to healthcare resources and to apply some advanced analytical methods to find out whether they held any data-driven insights.
Data analyst
My own curiosity
Excel
Python (statsmodels - Scikit-learn - Folium)
Tableau Public
I wanted to look into the Eurostat database collection and see if I could find some interesting and recent data sets on healthcare resources and death causes in Europe, containing geographical markers for spatial analysis.
I selected and combined the following data sets:
Causes of death - crude death rate by NUTS 2 region of residence: the main and most complete data set of the study.
Health care expenditure by provider: data available only on a country level, with no NUTS 2 grain.
Self-reported unmet needs for medical examination by main reason declared and NUTS 2 region
NOTE: An important aspect of these data sets is their common unit of counts per 100K inhabitants, which makes comparisons possible without having to normalize the counts by total population.
Two elements of the combined data set stand out:
Most variables have a right-skewed distribution, with many observations at low values;
Many variables have outliers, mostly because they contain both country-level and region-level values, which makes the country values outliers relative to the regional ones (comparing regions to countries doesn't make sense).
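One way to keep country and regional values apart is to rely on the length of the NUTS code: two characters for a country, four for a NUTS 2 region. The snippet below is only a minimal sketch of that idea; the rows and their values are made up for illustration and are not taken from the Eurostat data.

```python
# Hypothetical rows: (geo code, crude death rate per 100K). Values are illustrative only.
rows = [
    ("FR", 950.0),    # country-level value
    ("FR10", 720.0),  # NUTS 2 region
    ("FRK2", 880.0),  # NUTS 2 region
    ("PT", 1010.0),   # country-level value
    ("PT17", 940.0),  # NUTS 2 region
]


def nuts_level(geo_code: str) -> int:
    """Return the NUTS level implied by the code length (2 chars = level 0, 4 chars = level 2)."""
    return len(geo_code) - 2


# Keep the two grains in separate collections so they are never compared directly.
countries = [row for row in rows if nuts_level(row[0]) == 0]
regions = [row for row in rows if nuts_level(row[0]) == 2]

print("countries:", countries)
print("regions:", regions)
```

Splitting the grains this way keeps the skewed distributions intact while removing the outliers that only come from mixing levels.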
Since there are 91 different death causes referenced in the variables, I focused my attention on the relationship between the healthcare resources and the variable containing the aggregate of all death causes. The correlation matrix below is a good starting point to see which relationships are the strongest, if there are any:
Among the resources themselves, the strongest correlation is between the total number of beds and the total number of physicians. Most interestingly, the correlations with death rates are not very strong; in decreasing order of strength, they are the amount of euros spent, the number of beds and finally the number of physicians.
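For readers who want to reproduce this kind of matrix, here is a minimal sketch using pandas. The column names and the six rows of values are hypothetical placeholders, not the Eurostat figures, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Hypothetical values; column names are placeholders for the combined data set.
df = pd.DataFrame({
    "deaths_all_causes": [1300, 1250, 1100, 1000, 950, 900],
    "expenditure_eur": [800, 1000, 2000, 3500, 4000, 5200],
    "beds": [700, 650, 500, 520, 400, 450],
    "physicians": [300, 310, 350, 330, 400, 420],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr(method="pearson")

# Rank the resources by the absolute strength of their correlation with deaths.
ranked = (
    corr["deaths_all_causes"]
    .drop("deaths_all_causes")
    .abs()
    .sort_values(ascending=False)
)

print(corr.round(2))
print(ranked.round(2))
```

Ranking by absolute value is a deliberate choice here: a strong negative correlation is as informative as a strong positive one.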
At this stage of the study I could formulate the following hypothesis: if healthcare providers spend more money per inhabitant, then death rates are lower.
Before investigating this hypothesis further, I moved on to the spatial analysis to get a global overview of spending by country and death rates by region. I used the Folium library to create this map of the 2021 data, with darker blue for the highest values and yellowish for the lowest values:
Death rates vs euros spent by healthcare providers
Main insights:
Switzerland, Liechtenstein and Norway have very high healthcare expenditures and have among the smallest death rates;
Eastern countries have the lowest-spending healthcare providers and share the highest death rates;
Countries with very low healthcare provider expenditures are the ones with the most deaths (all causes) per 100K inhabitants (supporting the first hypothesis);
Portugal is an exception, with very low expenditures and low death rates (bacalhau magic?).
The natural next step to test the hypothesis was to look into the correlations between healthcare resources and death rates with a linear regression model. Unfortunately, the data set only included about 100 rows with the expenditures of healthcare providers, too few for a reliable analysis.
I chose to use both the number of beds and the number of physicians, since they were the next most correlated variables to the death counts and because they are well correlated with each other. I also assumed that providing and maintaining hospital beds and employing many physicians were sources of expenditure for healthcare providers.
The trend suggests that the more beds there are, the more deaths there are. However, a higher number of beds cannot be said to go with more or fewer deaths (all causes included) in Europe, since the model fits the data poorly (R-squared of 0.2).
The trend suggests that the fewer inhabitants per physician there are, the fewer deaths there are. With a very low R-squared of 0.06, the linear model explains only 6% of the variability in the data, meaning the trend isn't conclusive.
Even though the trends were visually satisfying on the scatterplots, the regression analysis made it evident that the relationship between death rates and these two elements of healthcare resources is very weak.
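To make the R-squared figures above more concrete, the sketch below fits a straight line to synthetic data and computes R-squared by hand. The data is randomly generated and the variable names are assumptions, so the resulting value is not one of the study's results; I use NumPy here for brevity, although statsmodels or scikit-learn would report the same quantity.

```python
import numpy as np

# Synthetic stand-in for "beds per 100K" and "deaths per 100K"; not the Eurostat data.
rng = np.random.default_rng(0)
beds = rng.uniform(200, 900, size=200)
deaths = 900 + 0.3 * beds + rng.normal(0, 150, size=200)

# Ordinary least squares fit of a straight line: deaths = intercept + slope * beds.
slope, intercept = np.polyfit(beds, deaths, deg=1)
predicted = intercept + slope * beds

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((deaths - predicted) ** 2)
ss_tot = np.sum((deaths - deaths.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, R-squared={r_squared:.3f}")
```

A positive slope with a low R-squared is exactly the situation described above: the line points in one direction, but most of the spread in death rates is left unexplained.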
To dig further into my dataset, I then moved on to a clustering analysis, trying to use all 91 death-cause variables. First, I had to reduce this large number of variables with a Principal Component Analysis (PCA). Using an elbow curve on the cumulative explained variance, I decided to keep 12 principal components.
Then, I used a K-means clustering model on these 12 components.
Deciding on the number of clusters from the elbow curve wasn't straightforward, so I tried 5 clusters, then 4. Testing showed that 5 clusters didn't bring much more clarity, and that the differences were easier to grasp visually with 4 clusters.
Unsurprisingly, the clusters were clearly visible on the scatterplot of total deaths (all causes) against the amount of euros spent per inhabitant:
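The PCA and K-means steps can be sketched as follows with scikit-learn. This is an illustration on random data with the same number of columns (91), not the actual pipeline: the standardization step, the range of cluster counts tested and the random seeds are all assumptions of mine.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 91 death-cause variables (300 hypothetical rows).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 91))

# Assumption: variables are standardized before PCA so none dominates by scale.
X_scaled = StandardScaler().fit_transform(X)

# Reduce the 91 variables to 12 principal components.
pca = PCA(n_components=12, random_state=0)
components = pca.fit_transform(X_scaled)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Elbow curve: inertia (within-cluster sum of squares) for several cluster counts.
inertias = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(components)
    inertias[k] = model.inertia_

# Final model with 4 clusters, the number retained in the text.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)

print(cumulative_variance.round(3))
print({k: round(v, 1) for k, v in inertias.items()})
```

Plotting `inertias` against `k` gives the elbow curve; when the bend is not obvious, comparing the resulting clusters visually, as done here with 5 and 4, is a reasonable fallback.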
with a strong geographical parameter
with a strong spending habit parameter
These clusters do not group death causes with one another; rather, they group the rows of the whole dataset (one for each year and country) according to their death rates across the components of death causes. We could say that some countries share similar death rates across different death causes, which allows for an easier geographical comparison and understanding of European death causes.
Looking a bit further into the Eurostat database collection, I found a monthly death rate series from 2019 to 2021, which could be a great basis for understanding the impact of COVID-19 in this particular time period. I then asked myself what the first half of 2022 would have looked like if COVID had kept spreading along the trend started in 2019.
I started by analysing the time series decomposition, looking mostly at trend and seasonality to assess whether the series is stationary:
From this decomposition:
There is a positive trend
There is seasonality, mostly in winter
There is some residual noise, which might be explained by the high COVID peaks
After applying a Dickey-Fuller test to make sure non-stationarity had to be dealt with, I standardized the time series before fitting an ARIMA model (with both autoregressive and moving-average components).
Assuming a comparable COVID-19 situation in 2022, death rates should increase at the beginning of 2022 and decrease as summer arrives.
The accuracy of the model isn't perfect, since it was only fed 36 data points, all from the COVID period, and doesn't account for the time before the pandemic.
There is a link between healthcare resources and death rates!
The datasets on healthcare resources show a stronger link between the amounts spent by healthcare providers (namely hospitals) and death rates. Rich countries with more resources have fewer deaths on the comparable unit of deaths per 100K inhabitants.
Human resources (physicians) and physical resources (available beds) matter less as far as the correlation goes, but shouldn't be underestimated since they both represent hospital expenditures.
European countries are marked by evident geographical differences; I was expecting fewer differences before conducting this analysis.
The data was limited by a short time period and was missing some data points on healthcare resources, mostly on provider expenditures.
Medical knowledge of all death causes would allow for a great study of differences by country, but this couldn't be done with this purely statistical approach.
To go further in the comparisons, a broader study could include GDP, lifespan, health insurance, and the many cornerstones of the population's quality of life (food, habits, sports, nature, air, etc.).
Also, the survey data wasn't exploited in the study; an analysis of perceived healthcare against actual numbers could be interesting to look at.