I wanted to take a look at how country rankings for the UNDP’s Gender Development Index and Gender Inequality Index have changed over time. Fortunately, the UNDP has an API to query the data directly, so I wouldn’t need to scrape PDFs or websites. Unfortunately, I had never done anything with RESTful APIs or JSON, so this is me figuring it out.
As always, thanks to the thankless, those generous souls who have posted their wisdom on the web to enlighten the rest of us. I found the following tutorials/answers particularly helpful:
- https://www.programmableweb.com/news/how-to-access-any-restful-api-using-r-language/how-to/2017/07/21?page=2
- https://medium.com/@traffordDataLab/querying-apis-in-r-39029b73d5f1
- https://stackoverflow.com/questions/26106408/create-a-ranking-variable-with-dplyr
- https://stackoverflow.com/questions/28391850/reverse-order-of-discrete-y-axis-in-ggplot2
First, I’ll fire up R and load my favorite libraries. I prefer ESS mode in Emacs. I love the tidyverse for obvious reasons. I’ll need httr and jsonlite to query and parse the data. I use a Linux workstation but often send visualizations to people using Windows, and the fonts in the PDF files don’t quite turn out right if I don’t use extrafont:
lapply(c("tidyverse","extrafont","httr","jsonlite"),library,character.only=TRUE)
Now I need to go to the United Nations Development Programme Human Development Report page to learn about the data and register for the API at http://hdr.undp.org/en/content/human-development-report-office-statistical-data-api. Registration is easy, but the password appears to go out in plain text so I let Firefox generate and remember a random strong password for me.
I’ll be looking at two indexes. Each has its own code (137906 for GDI and 68606 for GII) that I’ll use in the query. Let’s do GDI first:
res_gdi<-GET("http://ec2-54-174-131-205.compute-1.amazonaws.com/API/HDRO_API.php/indicator_id=137906")
That queries the API and returns the response. Then
d_gdi<-fromJSON(rawToChar(res_gdi$content))
parses the JSON into something a little easier to digest. At this point I got a little bit hung up. When I took a look at it, I got, not the first few rows, but a ginormous chunk of data:
> head(d_gdi)
$indicator_value
$indicator_value$AFG
$indicator_value$AFG$`137906`
$indicator_value$AFG$`137906`$`2000`
[1] 0.322

$indicator_value$AFG$`137906`$`2005`
[1] 0.519

$indicator_value$AFG$`137906`$`2010`
[1] 0.595

$indicator_value$AFG$`137906`$`2011`
[1] 0.609

...
And on and on. Going back through the tutorials and looking around, I took a look at the names:
> names(d_gdi)
[1] "indicator_value" "country_name" "indicator_name"
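A likely reason `head()` printed so much: `d_gdi` is a list of three large nested lists rather than a data frame, so its "first six elements" are all of it. Here is a minimal sketch of a gentler way to inspect that kind of object. It uses a tiny hand-written JSON string that only mimics the shape of the response; `sample_json`, and the indicator name inside it, are my own stand-ins, not the real payload.

```r
library(jsonlite)

# Hand-written stand-in for the API response: same nesting, far fewer entries.
sample_json <- '{
  "indicator_value": {"AFG": {"137906": {"2000": 0.322, "2005": 0.519}}},
  "country_name":    {"AFG": "Afghanistan"},
  "indicator_name":  {"137906": "Gender Development Index (GDI)"}
}'

d <- fromJSON(sample_json)

names(d)               # the top-level elements
str(d, max.level = 2)  # print only the first two levels of nesting
```

`str()` with `max.level` gives the outline without dumping every leaf value, which is usually enough to decide which element to dig into.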
It turns out that all the data I’m looking for (value, country, year) are in that first name, indicator_value. I can grab just that, but it helps to unlist it and put it in a data.frame to give me something I’m more familiar with:
df_gdi<-data.frame(unlist(d_gdi[[1]]))
> head(df_gdi)
unlist.d_gdi..1...
AFG.137906.2000 0.322
AFG.137906.2005 0.519
AFG.137906.2010 0.595
AFG.137906.2011 0.609
AFG.137906.2012 0.618
AFG.137906.2013 0.627
That’s more like it! Now I can see the country value, the indicator, the year, and the actual value. It’s still a little funky because half the info I want is now the row names, but I can create a column based on the row names, then separate that out into the columns I need.
df_gdi$ciy<-row.names(df_gdi)
> head(df_gdi)
unlist.d_gdi..1... ciy
AFG.137906.2000 0.322 AFG.137906.2000
AFG.137906.2005 0.519 AFG.137906.2005
AFG.137906.2010 0.595 AFG.137906.2010
AFG.137906.2011 0.609 AFG.137906.2011
AFG.137906.2012 0.618 AFG.137906.2012
AFG.137906.2013 0.627 AFG.137906.2013
Now I’ll get the values and put them in a column. I’m sure there’s a better way to do this, but in the interest of just getting it done and moving on to the next step…
df_gdi$values<-df_gdi[,1]
> head(df_gdi)
unlist.d_gdi..1... ciy values
AFG.137906.2000 0.322 AFG.137906.2000 0.322
AFG.137906.2005 0.519 AFG.137906.2005 0.519
AFG.137906.2010 0.595 AFG.137906.2010 0.595
AFG.137906.2011 0.609 AFG.137906.2011 0.609
AFG.137906.2012 0.618 AFG.137906.2012 0.618
AFG.137906.2013 0.627 AFG.137906.2013 0.627
Now I want to rename the rows to their numbers:
row.names(df_gdi)<-1:nrow(df_gdi)
> head(df_gdi)
unlist.d_gdi..1... ciy values
1 0.322 AFG.137906.2000 0.322
2 0.519 AFG.137906.2005 0.519
3 0.595 AFG.137906.2010 0.595
4 0.609 AFG.137906.2011 0.609
5 0.618 AFG.137906.2012 0.618
6 0.627 AFG.137906.2013 0.627
Looks like I still have the original column hanging around. How about if I just take the two columns I want and create a new data frame?
gdi<-data.frame(cbind(df_gdi$ciy,df_gdi$values))
> head(gdi)
X1 X2
1 AFG.137906.2000 0.322
2 AFG.137906.2005 0.519
3 AFG.137906.2010 0.595
4 AFG.137906.2011 0.609
5 AFG.137906.2012 0.618
6 AFG.137906.2013 0.627
Interesting. I didn’t think I was going to lose the column names in that operation. I’ll have to figure out what happened there one of these days. Meanwhile, just rename the columns:
colnames(gdi)<-c("ciy","values")
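A likely explanation for the lost names, offered as a sketch rather than a diagnosis of the exact session above: `cbind()` on two plain vectors builds a matrix, it only labels columns when the arguments are bare variable names (not expressions like `df$col`), and it coerces everything to a single type, here character. The `toy` data frame is a small stand-in for `df_gdi`.

```r
# Toy stand-in for df_gdi, using the first two rows shown above
toy <- data.frame(
  ciy    = c("AFG.137906.2000", "AFG.137906.2005"),
  values = c(0.322, 0.519),
  stringsAsFactors = FALSE
)

# cbind() returns a character matrix without column names,
# so data.frame() falls back to X1 and X2
via_cbind <- data.frame(cbind(toy$ciy, toy$values))
names(via_cbind)
is.numeric(via_cbind$X2)       # the numbers are no longer numeric

# Subsetting by column name keeps both the names and the numeric type
via_subset <- toy[, c("ciy", "values")]
names(via_subset)
is.numeric(via_subset$values)
```

The same coercion would also explain why the values column ends up as character rather than numeric.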
And finally, separate out the other values I want into their own columns:
gdi <- gdi %>% separate(ciy, c("country_code","indicator","year"), remove = TRUE)
> head(gdi)
country_code indicator year values
1 AFG 137906 2000 0.322
2 AFG 137906 2005 0.519
3 AFG 137906 2010 0.595
4 AFG 137906 2011 0.609
5 AFG 137906 2012 0.618
6 AFG 137906 2013 0.627
Perfect! Now I’ll do the same for the other indicator, the GII:
res_gii<-GET("http://ec2-54-174-131-205.compute-1.amazonaws.com/API/HDRO_API.php/indicator_id=68606")
d_gii<-fromJSON(rawToChar(res_gii$content))
df_gii<-data.frame(unlist(d_gii[[1]]))
df_gii$ciy<-row.names(df_gii)
df_gii$values<-df_gii[,1]
gii<-cbind(df_gii$ciy,as.numeric(df_gii$values))
row.names(gii)<-1:nrow(gii)
colnames(gii)<-c("ciy","values")
gii<-data.frame(gii)
gii<-gii %>% separate(ciy, c("country_code","indicator","year"), remove = TRUE)
> head(gii)
country_code indicator year values
1 AFG 68606 2005 0.745
2 AFG 68606 2010 0.751
3 AFG 68606 2011 0.743
4 AFG 68606 2012 0.734
5 AFG 68606 2013 0.724
6 AFG 68606 2014 0.714
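Since the same steps run once per indicator, they could be wrapped in two small functions. This is only a sketch under a few assumptions: `fetch_indicator()` and `tidy_indicator()` are names I made up, the base URL is simply the one used above (it may no longer respond), and the response is assumed to keep the `indicator_value` element seen earlier. The offline check uses a hand-built list shaped like the parsed response.

```r
library(httr)
library(jsonlite)
library(tidyr)

# Base URL as used in the post; the endpoint may have changed since.
base_url <- "http://ec2-54-174-131-205.compute-1.amazonaws.com/API/HDRO_API.php"

# Download one indicator and parse the JSON body into nested lists.
fetch_indicator <- function(indicator_id) {
  res <- GET(paste0(base_url, "/indicator_id=", indicator_id))
  stop_for_status(res)  # fail loudly on HTTP errors
  fromJSON(content(res, as = "text", encoding = "UTF-8"))
}

# Turn the nested indicator_value list into one row per country and year.
tidy_indicator <- function(parsed) {
  flat <- unlist(parsed[["indicator_value"]])  # names look like "AFG.137906.2000"
  df <- data.frame(ciy = names(flat), values = as.numeric(flat),
                   stringsAsFactors = FALSE)
  separate(df, ciy, into = c("country_code", "indicator", "year"), sep = "\\.")
}

# Offline check with a hand-built list shaped like the parsed response
sample_parsed <- list(indicator_value = list(
  AFG = list("137906" = list("2000" = 0.322, "2005" = 0.519, "2010" = 0.595))
))
tidy_indicator(sample_parsed)

# With network access: gdi <- tidy_indicator(fetch_indicator(137906))
```

Building the data frame from `names(flat)` and `as.numeric(flat)` skips the row-name shuffling and keeps the values numeric from the start.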
Wonderful! Now I’ll get the country names. It turns out they were returned in my first (and second) query, so I just need to parse them out into their own data frame (and name the columns) which I will merge into my two index value data frames:
countries <- data.frame(cbind(row.names(data.frame(unlist(d_gdi[[2]]))),data.frame(unlist(d_gdi[[2]]))[,1]))
colnames(countries) <- c("country_code","country")
> head(countries)
country_code country
1 AFG Afghanistan
2 AGO Angola
3 ALB Albania
4 ARE United Arab Emirates
5 ARG Argentina
6 ARM Armenia
gdi <- merge(gdi,countries) %>% select(year,values,country)
gii <- merge(gii,countries) %>% select(year,values,country)
> head(gdi)
year values country
1 2000 0.322 Afghanistan
2 2005 0.519 Afghanistan
3 2010 0.595 Afghanistan
4 2011 0.609 Afghanistan
5 2012 0.618 Afghanistan
6 2013 0.627 Afghanistan
> head(gii)
year values country
1 2015 0.702 Afghanistan
2 2005 0.745 Afghanistan
3 2010 0.751 Afghanistan
4 2011 0.743 Afghanistan
5 2012 0.734 Afghanistan
6 2013 0.724 Afghanistan
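Notice that the GII rows came back in a different order (2015 first). `merge()` sorts its result on the key and does not promise to keep the original row order, and by default it silently drops rows that have no match. A sketch of an alternative using `dplyr::left_join()`, with hand-built stand-ins (`country_name`, `gii_toy`) for the real objects:

```r
library(dplyr)

# Hand-built stand-ins for d_gdi[["country_name"]] and the tidy gii frame
country_name <- list(AFG = "Afghanistan", AGO = "Angola")
gii_toy <- data.frame(
  country_code = c("AFG", "AFG", "AFG"),
  year   = c("2005", "2010", "2011"),
  values = c(0.745, 0.751, 0.743),
  stringsAsFactors = FALSE
)

# Named list -> two-column lookup table
cn <- unlist(country_name)
countries <- data.frame(country_code = names(cn), country = unname(cn),
                        stringsAsFactors = FALSE)

# left_join() keeps every row of gii_toy, in its original order
gii_named <- gii_toy %>%
  left_join(countries, by = "country_code") %>%
  select(year, values, country)
gii_named
```

Row order does not matter for the grouped ranking that follows, but keeping it stable makes the `head()` output easier to compare between steps.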
Now it’s time to insert rankings. I really had a hard time figuring out how to do this because I wanted to rank the countries within each year. Also, it turns out it’s really important to decide how you want to handle ties when two elements of your list have the same value. It seems the UNDP handles ties much as the ‘first’ method of the base::rank function does. What I ended up doing was grouping by year, then mutating to get the rank (and note that GII is improved by approaching 0, while GDI is improved by approaching 1):
gii <- gii %>% group_by(year) %>% mutate(rank = rank(values, ties.method='first'))
gdi <- gdi %>% group_by(year) %>% mutate(rank = rank(desc(values), ties.method='first'))
> gii
# A tibble: 1,931 x 4
# Groups: year [13]
year values country rank
1 2015 0.702 Afghanistan 155
2 2005 0.745 Afghanistan 142
3 2010 0.751 Afghanistan 148
4 2011 0.743 Afghanistan 149
5 2012 0.734 Afghanistan 152
6 2013 0.724 Afghanistan 148
7 2014 0.714 Afghanistan 150
8 2016 0.69 Afghanistan 157
9 2017 0.673 Afghanistan 154
10 2019 0.655 Afghanistan 156
… with 1,921 more rows
> gdi
# A tibble: 2,069 x 4
# Groups: year [13]
year values country rank
1 2000 0.322 Afghanistan 146
2 2005 0.519 Afghanistan 157
3 2010 0.595 Afghanistan 163
4 2011 0.609 Afghanistan 164
5 2012 0.618 Afghanistan 164
6 2013 0.627 Afghanistan 163
7 2014 0.634 Afghanistan 164
8 2015 0.639 Afghanistan 164
9 2016 0.646 Afghanistan 164
10 2017 0.658 Afghanistan 165
… with 2,059 more rows
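To make the tie-handling choice concrete, here is a small base R sketch with four made-up scores, two of which are tied. It only illustrates how `ties.method` behaves and says nothing about which method the UNDP actually uses.

```r
# Four made-up scores with one tie (positions 1 and 3)
scores <- c(0.05, 0.03, 0.05, 0.01)

rank(scores, ties.method = "first")    # 3 2 4 1
rank(scores, ties.method = "min")      # 3 2 3 1
rank(scores, ties.method = "average")  # 3.5 2.0 3.5 1.0

# For "higher is better" indices such as GDI, rank the negated values
rank(-scores, ties.method = "first")   # 1 3 2 4
```

With `"first"`, tied values get distinct ranks in order of appearance, so every rank from 1 to n is used exactly once within a year. In dplyr, `row_number()` and `min_rank()` correspond to the `"first"` and `"min"` methods.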
Looks pretty good so far. Let’s take a glance at the top spot over the years, then at the first few spots for the last year in the data (2019):
> gii %>% filter(rank==1) %>% arrange(year)
# A tibble: 13 x 4
# Groups: year [13]
year values country rank
1 1995 0.09 Sweden 1
2 2000 0.061 Sweden 1
3 2005 0.052 Sweden 1
4 2010 0.05 Sweden 1
5 2011 0.049 Sweden 1
6 2012 0.048 Sweden 1
7 2013 0.046 Denmark 1
8 2014 0.044 Denmark 1
9 2015 0.044 Denmark 1
10 2016 0.042 Denmark 1
11 2017 0.041 Switzerland 1
12 2018 0.039 Switzerland 1
13 2019 0.025 Switzerland 1
> gdi %>% filter(year=='2019') %>% arrange(year, rank)
# A tibble: 167 x 4
# Groups: year [1]
year values country rank
1 2019 1.036 Latvia 1
2 2019 1.03 Lithuania 2
3 2019 1.03 Qatar 3
4 2019 1.023 Mongolia 4
5 2019 1.019 Panama 5
6 2019 1.017 Estonia 6
7 2019 1.016 Uruguay 7
8 2019 1.014 Lesotho 8
9 2019 1.014 Moldova (Republic of) 9
10 2019 1.012 Nicaragua 10
Hmmm…GII looks about right, but GDI is looking a little…unexpected? It turns out GDI has a lot of complexity in it. From the UNDP HDR website:
Estimating the female and male HDIs for all countries relies on many approximations, such as assuming wage ratios of 0.8 for many countries. Because of this the estimated HDIs need to be interpreted with caution. We prefer not to rank the countries based on these approximated HDIs. Instead, we group countries into five GDI groups by absolute deviation from gender parity in HDI values.
Group 1 countries have high equality in HDI achievements between women and men: absolute deviation less than 2.5 percent; group 2 has medium-high equality in HDI achievements between women and men: absolute deviation between 2.5 percent and 5 percent; group 3 has medium equality in HDI achievements between women and men: absolute deviation between 5 percent and 7.5 percent; group 4 has medium-low equality in HDI achievements between women and men: absolute deviation between 7.5 percent and 10 percent; and group 5 has low equality in HDI achievements between women and men: absolute deviation from gender parity greater than 10 percent.
What to do now? The raw data don’t have the country groupings. The current year’s report only has the groupings for this year (it seems likely, given the methodology, that the groupings changed over time). The methodology derives these groups by absolute deviation from parity, so perhaps we can calculate the absolute deviation from 1 (full parity for this index), then rank the countries and see what happens? Let’s give it a shot. Also, I just noticed my values are still character fields! I ought to have noticed that before. Let’s convert them to numeric, re-run the ranking, and include the deviation, then take a look at the number one ranked countries again:
gii$values <- as.numeric(gii$values)
gii <- gii %>% group_by(year) %>% mutate(rank = rank(values, ties.method='first'))
> head(gii)
# A tibble: 6 x 4
# Groups: year [6]
year values country rank
1 2015 0.702 Afghanistan 155
2 2005 0.745 Afghanistan 142
3 2010 0.751 Afghanistan 148
4 2011 0.743 Afghanistan 149
5 2012 0.734 Afghanistan 152
6 2013 0.724 Afghanistan 148
gdi$values <- as.numeric(gdi$values)
gdi$abs_variance <- abs(gdi$values-1)
gdi <- gdi %>% group_by(year) %>% mutate(rank = rank(abs_variance, ties.method='first'))
> gdi
# A tibble: 2,069 x 5
# Groups: year [13]
year values country rank abs_variance
1 2000 0.322 Afghanistan 146 0.6780000000
2 2005 0.519 Afghanistan 157 0.481
3 2010 0.595 Afghanistan 163 0.405
4 2011 0.609 Afghanistan 164 0.391
5 2012 0.618 Afghanistan 164 0.382
6 2013 0.627 Afghanistan 163 0.373
7 2014 0.634 Afghanistan 164 0.366
8 2015 0.639 Afghanistan 164 0.361
9 2016 0.646 Afghanistan 164 0.354
10 2017 0.658 Afghanistan 165 0.34200000
> gdi %>% filter(rank==1) %>% arrange(year)
# A tibble: 13 x 5
# Groups: year [13]
year values country rank abs_variance
1 1995 1 Sweden 1 0
2 2000 1 Uruguay 1 0
3 2005 1.001000000 Finland 1 0.001000000000
4 2010 1.001000000 Brazil 1 0.001000000000
5 2011 1.002 Thailand 1 0.002000000000
6 2012 1 Sweden 1 0
7 2013 1 Thailand 1 0
8 2014 1 Slovenia 1 0
9 2015 1.001000000 Namibia 1 0.001000000000
10 2016 1 Kazakhstan 1 0
11 2017 1 Dominican Republic 1 0
12 2018 1 Ukraine 1 0
13 2019 1 Ukraine 1 0
That looks a little better. Still some surprises. Clearly, we’re missing something from how UNDP calculated it. Notably, UNDP only provides the HDI ranking, not the GDI ranking. I also played around a bit with the ties.method, but it didn’t move the rankings much. Normally, this result would be something to dig into, but since the point of this exercise is the R and not the methodology per se, we’re going to continue on as though the ranking were correct, even though it clearly is not.
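One thing that can be computed directly from the methodology quoted above is the GDI group. Here is a sketch using `cut()`. The function name `gdi_group()` is mine, and the handling of values that fall exactly on a boundary is my assumption, since the quoted text leaves it open.

```r
# Assign the five GDI groups from the absolute deviation from parity, in percent.
# Boundary handling (exactly 2.5 goes to group 2, exactly 10 to group 5)
# is an assumption, not something the quoted methodology specifies.
gdi_group <- function(gdi_value) {
  deviation_pct <- 100 * abs(gdi_value - 1)
  cut(deviation_pct,
      breaks = c(0, 2.5, 5, 7.5, 10, Inf),
      labels = 1:5,
      right  = FALSE)  # intervals are [lower, upper)
}

# GDI values taken from the output earlier in the post
vals <- c(Latvia_2019 = 1.036, Lesotho_2019 = 1.014,
          Afghanistan_2017 = 0.658, Afghanistan_2000 = 0.322)
data.frame(gdi = vals, group = gdi_group(vals))
```

Under these assumptions Latvia’s 1.036 lands in group 2 (a 3.6 percent deviation) while Lesotho’s 1.014 lands in group 1, which fits the UNDP’s point that a larger GDI value is not automatically a better one.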
Next, I’ll want to graph my results. I tried looking at all the countries together, but the data are all over the place and it just looks like a mess. Behold:
ggplot(gii)+
  geom_line(
    aes(x=year, y=values, group=country, color=country),
    size=1)+
  theme(legend.position='none')
I thought it might be more interesting to look at just the highest, lowest, and median ranked, but it wasn’t any better, just bad in a different way:
gii_smry <- bind_rows(
  gii %>% group_by(year) %>% filter(values==min(values)),
  gii %>% group_by(year) %>% filter(values==max(values)),
  gii %>% group_by(year) %>% filter(rank==median(rank, na.rm=TRUE))
)
ggplot(gii_smry)+
  geom_line(
    aes(x=year, y=values, group=country, color=country),
    size=1)+
  geom_point(
    aes(x=year, y=values, group=country, color=country),
    size=1)+
  theme(legend.position='bottom')
I had to add in points because so many countries occupied the median position just once. This graph ended up being more a cheer for the Nordics and one long raspberry for Yemen. Not interesting.
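One thing to watch with the `rank==median(rank)` filter: when a year has an even number of countries, the median rank is a half value (2.5, say) that no row carries, so that year contributes no median row at all. Here is a sketch with made-up data; rounding up with `ceiling()` is just one possible convention.

```r
library(dplyr)

# Made-up single year with an even number of countries
toy <- data.frame(
  year    = "2019",
  country = c("A", "B", "C", "D"),
  values  = c(0.10, 0.20, 0.30, 0.40),
  stringsAsFactors = FALSE
) %>%
  group_by(year) %>%
  mutate(rank = rank(values, ties.method = "first"))

median(toy$rank)  # 2.5, which no row has as its rank

# An exact match finds nothing; rounding up picks the upper-middle country
exact   <- toy %>% filter(rank == median(rank))
rounded <- toy %>% filter(rank == ceiling(median(rank)))
nrow(exact)       # 0
rounded$country   # "C"
```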
In the end, I decided it would be most interesting just to take a country and see how its rankings on both indices changed over the years. I’d really like to use Shiny so an end user could select the country, but I don’t yet know how to do that. Maybe my next project?
Anyway, final result below for Sweden, prettied up a bit. I decided to graph the data by combining both indices and filtering by the desired country. I had to add an index column to each data set so I could distinguish them clearly on the graph:
gii$index='GII'; gdi$index='GDI'
ctry <- 'Sweden'
combined <- bind_rows(filter(gii,country==ctry), filter(gdi, country==ctry))
ggplot(combined)+
  geom_line(
    aes(x=year, y=reorder(rank,desc(rank)), group=index, color=index),
    size=2)+
  scale_color_manual(
    values=c("mediumpurple1","darkolivegreen3"))+
  geom_text(data=combined,
    aes(x=year, y=reorder(rank,desc(rank)), label=rank),
    vjust=1,
    family="Liberation Sans",
    size=6
  )+
  theme(
    panel.border=element_rect(fill=NA),
    panel.background=element_rect(fill="lavenderblush2"),
    panel.grid=element_line(color="white"),
    plot.title=element_text(hjust=.95,vjust=-15,size=24,family="Droid Sans",color="black"),
    axis.ticks.x=element_blank(),
    axis.ticks.y=element_blank(),
    axis.text.y=element_blank(),
    legend.position=c(.05,.15),
    legend.title=element_blank(),
    legend.background=element_rect(fill="lavenderblush2"),
    legend.key=element_rect(fill="lavenderblush2"),
    legend.box.background=element_rect(color='black', size=.5))+
  labs(title=paste(ctry,'Woman Empowerment\nRankings'),
    subtitle='',
    x='',y=''
  )+
  annotate("text",
    x=3,
    y=.7,
    label='Source: United Nations Development Programme, Human Development Report Office',
    family="Liberation Sans",
    fontface="italic",
    size=3)
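The `reorder(rank, desc(rank))` trick turns the ranks into a discrete axis, which spaces every distinct rank evenly. An alternative sketch keeps the rank numeric and flips the axis with `scale_y_reverse()`. The data here are made-up ranks for a fictional country, only to show the axis handling, and `toy` and `p` are names of my own.

```r
library(ggplot2)

# Made-up ranks for a fictional country, only to show the axis handling
toy <- data.frame(
  year  = rep(c(2017, 2018, 2019), times = 2),
  index = rep(c("GII", "GDI"), each = 3),
  rank  = c(3, 2, 1, 10, 12, 8)
)

p <- ggplot(toy, aes(x = year, y = rank, group = index, color = index)) +
  geom_line() +
  geom_text(aes(label = rank), vjust = 1.5, show.legend = FALSE) +
  scale_y_reverse() +                            # rank 1 at the top
  scale_x_continuous(breaks = unique(toy$year))  # one tick per year
p
```

With a numeric axis, the vertical distance between two points reflects the actual gap in rank, at the cost of more empty space when the two indices sit far apart.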