Chapter 3 Data transformation

We aim at merging all data into one tidy master data frame within which each observation will correspond to one country and one year.

3.1 Selecting the time range

We hope to keep a sufficiently wide range so as to observe clear change in energy consumption behavior across time, while at the same time reducing missing data for as many countries as possible.

We choose to restrict our multivariate analysis to the narrowest time range. This is exactly the intersection of all ranges : 1990-2019.

We note that this plot does not tell us anything about missing values. Indeed, a given observation might be missing although within the time range of the data set.

3.2 Selecting the variables

We hope to be able to analyze and understand the cross-correlation between energy consumption and human development. To do so, we need the following:

Data indicators for HDI and all its major components, namely: GDP, life_expectancy, years_of_schooling;
Figures for consumption per energy type;
Population data and other country indicators like geographic region.

3.3 Data transformation procedure

Our data transformation works as follows.

Determine the needed variables from the respective datasets
Create dataframes for each of the datasets that only contains the needed variables and time frames to avoid having an unnecessarily large final dataframe that would slow down analyses

Energy data
GDP data
Life expectancy data
Schooling data
Population data
HDI data

Filter the dataframes on common ISO codes (unique identifiers for countries) such that only countries for which all dataframes match up are selected. We find that there is little to no variance in the countries from the different datasets but that some include geographic regions that are filtered by this procedure as they do not have ISO codes.
Convert previously wide dataframes into a long format such that each row is one unique observation of country and year
Merge all dataframes on their ISO code and year
Add the country names from the library countrycode based on the ISO codes to guarantee a common naming scheme and simplify interpretation.

We use the resulting master_df as our database in all further analyses. Of course, we sometimes have to modify or calculate new variables based on the current question.

variable	dataset
ISO_code	worldbank_country_info
HDI	HDI
population	population
years_of_schooling	un_years_of_schooling
life_expectancy	life_expectancy
GDP	worldbank_gdp
fossil_fuel	owid_energy_data
hydro	owid_energy_data
low_carb	owid_energy_data
nuclear	owid_energy_data
other_renew	owid_energy_data
renew	owid_energy_data
solar	owid_energy_data
wind	owid_energy_data
coal	owid_energy_data
gas	owid_energy_data
oil	owid_energy_data
biofuel	owid_energy_data
primary	owid_energy_data