Chapter 3 Data transformation

We aim at merging all data into one tidy master data frame within which each observation will correspond to one country and one year.

3.1 Selecting the time range

We hope to keep a sufficiently wide range so as to observe clear change in energy consumption behavior across time, while at the same time reducing missing data for as many countries as possible.

We choose to restrict our multivariate analysis to the narrowest time range. This is exactly the intersection of all ranges : 1990-2019.

We note that this plot does not tell us anything about missing values. Indeed, a given observation might be missing although within the time range of the data set.


3.2 Selecting the variables

We hope to be able to analyze and understand the cross-correlation between energy consumption and human development. To do so, we need the following:

  • Data indicators for HDI and all its major components, namely: GDP, life_expectancy, years_of_schooling;
  • Figures for consumption per energy type;
  • Population data and other country indicators like geographic region.

3.3 Data transformation procedure

Our data transformation works as follows.

  1. Determine the needed variables from the respective datasets
  2. Create dataframes for each of the datasets that only contains the needed variables and time frames to avoid having an unnecessarily large final dataframe that would slow down analyses
  • Energy data
  • GDP data
  • Life expectancy data
  • Schooling data
  • Population data
  • HDI data
  1. Filter the dataframes on common ISO codes (unique identifiers for countries) such that only countries for which all dataframes match up are selected. We find that there is little to no variance in the countries from the different datasets but that some include geographic regions that are filtered by this procedure as they do not have ISO codes.
  2. Convert previously wide dataframes into a long format such that each row is one unique observation of country and year
  3. Merge all dataframes on their ISO code and year
  4. Add the country names from the library countrycode based on the ISO codes to guarantee a common naming scheme and simplify interpretation.

We use the resulting master_df as our database in all further analyses. Of course, we sometimes have to modify or calculate new variables based on the current question.

variable dataset
ISO_code worldbank_country_info
HDI HDI
population population
years_of_schooling un_years_of_schooling
life_expectancy life_expectancy
GDP worldbank_gdp
fossil_fuel owid_energy_data
hydro owid_energy_data
low_carb owid_energy_data
nuclear owid_energy_data
other_renew owid_energy_data
renew owid_energy_data
solar owid_energy_data
wind owid_energy_data
coal owid_energy_data
gas owid_energy_data
oil owid_energy_data
biofuel owid_energy_data
primary owid_energy_data