Gender Development, Inequality Indexes

I wanted to take a look at how country rankings for the UNDP’s Gender Development Index and Gender Inequality Index have changed over time. Fortunately, the UNDP has an API to query the data directly, so I wouldn’t need to scrape PDFs or websites. Unfortunately, I had never done anything with RESTful APIs or JSON, so this is me figuring it out.

As always, thanks to the thankless, those generous souls who have posted their wisdom on the web to enlighten the rest of us. I found the following tutorials/answers particularly helpful:

First, I’ll fire up R and load my favorite libraries. I prefer ESS mode in Emacs. I love the tidyverse for obvious reasons. I’ll need httr and jsonlite to query and parse the data. I use a Linux workstation but often send visualizations to people using Windows, and the fonts in the PDF files don’t quite turn out right if I don’t use extrafont:

lapply(c("tidyverse","extrafont","httr","jsonlite"),library,character.only=TRUE)

Now I need to go to the United Nations Development Programme Human Development Report page to learn about the data and register for the API at http://hdr.undp.org/en/content/human-development-report-office-statistical-data-api. Registration is easy, but the password appears to go out in plain text so I let Firefox generate and remember a random strong password for me.

I’ll be looking at two indexes. Each one has its own code (137906 for GDI and 68606 for GII) that I’ll use in the query. Let’s do GDI first:

res_gdi<-GET("http://ec2-54-174-131-205.compute-1.amazonaws.com/API/HDRO_API.php/indicator_id=137906")

To query the data and return the results, then

d_gdi<-fromJSON(rawToChar(res_gdi$content))

to parse the JSON into something a little easier to digest. This is where I got a little hung up at first. When I took a look at it, I got not the first few rows but a ginormous chunk of data:

> head(d_gdi)
$indicator_value
$indicator_value$AFG
$indicator_value$AFG$137906
$indicator_value$AFG$137906$2000
[1] 0.322
$indicator_value$AFG$137906$2005
[1] 0.519
$indicator_value$AFG$137906$2010
[1] 0.595
$indicator_value$AFG$137906$2011
[1] 0.609
...

And on and on. Going back through the tutorials and looking around, I took a look at the names:

> names(d_gdi)
[1] "indicator_value" "country_name" "indicator_name"

It turns out that all the data I’m looking for (value, country, year) are in that first name, indicator_value. I can grab just that, but it helps to unlist it and put it in a data.frame to give me something I’m more familiar with:

df_gdi<-data.frame(unlist(d_gdi[[1]]))
   > head(df_gdi)
                   unlist.d_gdi..1...
   AFG.137906.2000              0.322
   AFG.137906.2005              0.519
   AFG.137906.2010              0.595
   AFG.137906.2011              0.609
   AFG.137906.2012              0.618
   AFG.137906.2013              0.627

That’s more like it! Now I can see the country value, the indicator, the year, and the actual value. It’s still a little funky because half the info I want is now the row names, but I can create a column based on the row names, then separate that out into the columns I need.

df_gdi$ciy<-row.names(df_gdi)
   > head(df_gdi)
                   unlist.d_gdi..1...             ciy
   AFG.137906.2000              0.322 AFG.137906.2000
   AFG.137906.2005              0.519 AFG.137906.2005
   AFG.137906.2010              0.595 AFG.137906.2010
   AFG.137906.2011              0.609 AFG.137906.2011
   AFG.137906.2012              0.618 AFG.137906.2012
   AFG.137906.2013              0.627 AFG.137906.2013
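
As an aside, the dotted row names are just how unlist() flattens a nested list: it pastes the names at each level of nesting together with a period. Here’s a minimal sketch of that, using a small hand-written JSON string as a stand-in for the API response (same shape, only the first two Afghanistan values shown above; the name txt and the helper objects are mine, not part of the API):

```r
library(jsonlite)

# Hand-written stand-in for the API response (same nesting, tiny subset).
txt <- '{"indicator_value": {"AFG": {"137906": {"2000": 0.322, "2005": 0.519}}},
         "country_name": {"AFG": "Afghanistan"}}'

d <- fromJSON(txt)

# unlist() joins the names at each nesting level with ".".
flat <- unlist(d$indicator_value)
names(flat)

# Building the data frame from the names and values directly keeps
# the identifiers as a real column and the values as numbers.
df <- data.frame(ciy = names(flat), values = unname(flat),
                 stringsAsFactors = FALSE)
```

Going this route would skip the row-name shuffling, though the step-by-step version is easier to follow when you’re seeing the structure for the first time.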

Now I’ll get the values and put them in a column. I’m sure there’s a better way to do this, but in the interest of just getting it done and moving on to the next step…

df_gdi$values<-df_gdi[,1]    
   > head(df_gdi)
                   unlist.d_gdi..1...             ciy values
   AFG.137906.2000              0.322 AFG.137906.2000  0.322
   AFG.137906.2005              0.519 AFG.137906.2005  0.519
   AFG.137906.2010              0.595 AFG.137906.2010  0.595
   AFG.137906.2011              0.609 AFG.137906.2011  0.609
   AFG.137906.2012              0.618 AFG.137906.2012  0.618
   AFG.137906.2013              0.627 AFG.137906.2013  0.627

Now I want to rename the rows to their numbers:

row.names(df_gdi)<-1:nrow(df_gdi)
   > head(df_gdi)
     unlist.d_gdi..1...             ciy values
   1              0.322 AFG.137906.2000  0.322
   2              0.519 AFG.137906.2005  0.519
   3              0.595 AFG.137906.2010  0.595
   4              0.609 AFG.137906.2011  0.609
   5              0.618 AFG.137906.2012  0.618
   6              0.627 AFG.137906.2013  0.627

Looks like I still have the original column hanging around. How about if I just take the two columns I want and create a new data frame?

gdi<-data.frame(cbind(df_gdi$ciy,df_gdi$values))
   > head(gdi)
                  X1    X2
   1 AFG.137906.2000 0.322
   2 AFG.137906.2005 0.519
   3 AFG.137906.2010 0.595
   4 AFG.137906.2011 0.609
   5 AFG.137906.2012 0.618
   6 AFG.137906.2013 0.627

Interesting. I didn’t think I was going to lose the column names in that operation. I’ll have to figure out what happened there one of these days. Meanwhile, just rename the columns:

colnames(gdi)<-c("ciy","values")
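
For what it’s worth, the likely culprit is cbind(). Handed two bare vectors, it builds a matrix, and it only labels the columns when the arguments are plain variable names rather than expressions like df_gdi$ciy. A matrix can also hold only one type, so the numeric values get converted to character along the way. A small sketch on a made-up two-row data frame (toy, m and keep are illustrative names):

```r
toy <- data.frame(ciy = c("AFG.137906.2000", "AFG.137906.2005"),
                  values = c(0.322, 0.519),
                  stringsAsFactors = FALSE)

# cbind() on two vectors returns a character matrix with no column names.
m <- cbind(toy$ciy, toy$values)
colnames(m)   # NULL
typeof(m)     # "character"

# Subsetting the data frame keeps both the names and the column types.
keep <- toy[, c("ciy", "values")]
```

Subsetting by column name would avoid the rename step, and it would also keep the values column numeric instead of text.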

And finally, separate out the other values I want into their own columns:

gdi <- gdi %>% separate(ciy, c("country_code","indicator","year"), remove = TRUE)
   > head(gdi)
     country_code indicator year values
   1          AFG    137906 2000  0.322
   2          AFG    137906 2005  0.519
   3          AFG    137906 2010  0.595
   4          AFG    137906 2011  0.609
   5          AFG    137906 2012  0.618
   6          AFG    137906 2013  0.627

Perfect! Now I’ll do the same for the other indicator, the GII:

res_gii<-GET("http://ec2-54-174-131-205.compute-1.amazonaws.com/API/HDRO_API.php/indicator_id=68606")
d_gii<-fromJSON(rawToChar(res_gii$content))
df_gii<-data.frame(unlist(d_gii[[1]]))
df_gii$ciy<-row.names(df_gii)
df_gii$values<-df_gii[,1]
gii<-cbind(df_gii$ciy,as.numeric(df_gii$values))
row.names(gii)<-1:nrow(gii)
colnames(gii)<-c("ciy","values")
gii<-data.frame(gii)
gii<-gii %>% separate(ciy, c("country_code","indicator","year"), remove = TRUE)
   > head(gii)
     country_code indicator year values
   1          AFG     68606 2005  0.745
   2          AFG     68606 2010  0.751
   3          AFG     68606 2011  0.743
   4          AFG     68606 2012  0.734
   5          AFG     68606 2013  0.724
   6          AFG     68606 2014  0.714

Wonderful! Now I’ll get the country names. It turns out they were returned in my first (and second) query, so I just need to parse them out into their own data frame (and name the columns) which I will merge into my two index value data frames:

countries <- data.frame(cbind(row.names(data.frame(unlist(d_gdi[[2]]))),data.frame(unlist(d_gdi[[2]]))[,1]))
colnames(countries) <- c("country_code","country")
> head(countries)
   country_code              country
 1          AFG          Afghanistan
 2          AGO               Angola
 3          ALB              Albania
 4          ARE United Arab Emirates
 5          ARG            Argentina
 6          ARM              Armenia

gdi <- merge(gdi,countries) %>% select(year,values,country)
gii <- merge(gii,countries) %>% select(year,values,country)
   > head(gdi)
     year values     country
   1 2000  0.322 Afghanistan
   2 2005  0.519 Afghanistan
   3 2010  0.595 Afghanistan
   4 2011  0.609 Afghanistan
   5 2012  0.618 Afghanistan
   6 2013  0.627 Afghanistan
   > head(gii)
     year values     country
   1 2015  0.702 Afghanistan
   2 2005  0.745 Afghanistan
   3 2010  0.751 Afghanistan
   4 2011  0.743 Afghanistan
   5 2012  0.734 Afghanistan
   6 2013  0.724 Afghanistan
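
A note on the merge: with no by argument, merge() joins on whatever columns the two data frames have in common (here country_code), and it doesn’t promise to keep the original row order, which is presumably why 2015 shows up first for GII above. A minimal sketch with made-up rows (the Albania value is invented purely for illustration):

```r
vals <- data.frame(country_code = c("ALB", "AFG", "AFG"),
                   year = c("2010", "2005", "2010"),
                   values = c(0.3, 0.745, 0.751),
                   stringsAsFactors = FALSE)
lookup <- data.frame(country_code = c("AFG", "ALB"),
                     country = c("Afghanistan", "Albania"),
                     stringsAsFactors = FALSE)

# Naming the key explicitly documents the join and guards against
# accidentally matching on some other shared column.
out <- merge(vals, lookup, by = "country_code")

# Sort explicitly if the row order matters later on.
out <- out[order(out$country, out$year), c("year", "values", "country")]
```

Since the rankings are computed within each year, the row order doesn’t affect the results here; it only matters for how the printed output reads.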

Now it’s time to insert rankings. I really had a hard time figuring out how to do this because I wanted to rank the countries within each year. Also, it turns out it’s really important to decide how you want to handle ties when two elements of your list have the same value. It seems the UNDP handles ties in a way analogous to what we get when using the ‘first’ method for the base::rank function. What I ended up doing was grouping by year, then mutating to get the rank (note that GII improves as it approaches 0, while GDI improves as it approaches 1):

gii <- gii %>% group_by(year) %>% mutate(rank = rank(values, ties.method='first'))
gdi <- gdi %>% group_by(year) %>% mutate(rank = rank(desc(values), ties.method='first'))
> gii
 # A tibble: 1,931 x 4
 # Groups:   year [13]
    year  values country      rank              
  1 2015  0.702  Afghanistan   155
  2 2005  0.745  Afghanistan   142
  3 2010  0.751  Afghanistan   148
  4 2011  0.743  Afghanistan   149
  5 2012  0.734  Afghanistan   152
  6 2013  0.724  Afghanistan   148
  7 2014  0.714  Afghanistan   150
  8 2016  0.69   Afghanistan   157
  9 2017  0.673  Afghanistan   154
 10 2019  0.655  Afghanistan   156
 … with 1,921 more rows
> gdi
 # A tibble: 2,069 x 4
 # Groups:   year [13]
    year  values country      rank              
  1 2000  0.322  Afghanistan   146
  2 2005  0.519  Afghanistan   157
  3 2010  0.595  Afghanistan   163
  4 2011  0.609  Afghanistan   164
  5 2012  0.618  Afghanistan   164
  6 2013  0.627  Afghanistan   163
  7 2014  0.634  Afghanistan   164
  8 2015  0.639  Afghanistan   164
  9 2016  0.646  Afghanistan   164
 10 2017  0.658  Afghanistan   165
 … with 2,059 more rows
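
To make the tie handling concrete, here’s a minimal sketch on made-up data (countries A, B and C are placeholders) comparing ties.method='first' with 'min', grouped by year the same way as above:

```r
library(dplyr)

toy <- data.frame(year = c("2019", "2019", "2019", "2018", "2018"),
                  country = c("A", "B", "C", "A", "B"),
                  values = c(0.10, 0.10, 0.05, 0.20, 0.15))

ranked <- toy %>%
  group_by(year) %>%
  mutate(
    # Lower is better; ties are broken by row order within the year.
    rank_first = rank(values, ties.method = "first"),
    # Lower is better; tied rows share the lowest rank they would occupy.
    rank_min = rank(values, ties.method = "min"),
    # Higher is better; desc() flips the ordering before ranking.
    rank_desc = rank(desc(values), ties.method = "first")
  ) %>%
  ungroup()
```

In 2019, A and B are tied at 0.10: 'first' gives them ranks 2 and 3, while 'min' gives both rank 2. With 'first', which of two tied countries comes out ahead depends only on the order of the rows.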

Looks pretty good so far. Let’s take a glance at the top spot over the years, then at the first few spots for the last year in the data (2019):

> gii %>% filter(rank==1) %>% arrange(year)
 # A tibble: 13 x 4
 # Groups:   year [13]
    year  values country      rank              
  1 1995  0.09   Sweden          1
  2 2000  0.061  Sweden          1
  3 2005  0.052  Sweden          1
  4 2010  0.05   Sweden          1
  5 2011  0.049  Sweden          1
  6 2012  0.048  Sweden          1
  7 2013  0.046  Denmark         1
  8 2014  0.044  Denmark         1
  9 2015  0.044  Denmark         1
 10 2016  0.042  Denmark         1
 11 2017  0.041  Switzerland     1
 12 2018  0.039  Switzerland     1
 13 2019  0.025  Switzerland     1 

> gdi %>% filter(year=='2019') %>% arrange(year, rank)
 # A tibble: 167 x 4
 # Groups:   year [1]
    year  values country                rank                    
  1 2019  1.036  Latvia                    1
  2 2019  1.03   Lithuania                 2
  3 2019  1.03   Qatar                     3
  4 2019  1.023  Mongolia                  4
  5 2019  1.019  Panama                    5
  6 2019  1.017  Estonia                   6
  7 2019  1.016  Uruguay                   7
  8 2019  1.014  Lesotho                   8
  9 2019  1.014  Moldova (Republic of)     9
 10 2019  1.012  Nicaragua                10

Hmmm…GII looks about right, but GDI is looking a little…unexpected? It turns out GDI has a lot of complexity in it. From the UNDP HDR website:

Estimating the female and male HDIs for all countries relies on many approximations, such as assuming wage ratios of 0.8 for many countries. Because of this the estimated HDIs need to be interpreted with caution. We prefer not to rank the countries based on these approximated HDIs. Instead, we group countries into five GDI groups by absolute deviation from gender parity in HDI values.

Group 1 countries have high equality in HDI achievements between women and men: absolute deviation less than 2.5 percent; group 2 has medium-high equality in HDI achievements between women and men: absolute deviation between 2.5 percent and 5 percent; group 3 has medium equality in HDI achievements between women and men: absolute deviation between 5 percent and 7.5 percent; group 4 has medium-low equality in HDI achievements between women and men: absolute deviation between 7.5 percent and 10 percent; and group 5 has low equality in HDI achievements between women and men: absolute deviation from gender parity greater than 10 percent.

What to do now? The raw data don’t have the country groupings. The current year’s report only has the groupings for this year (it seems likely, given the methodology, that the groupings changed over time). The methodology derives these groups by absolute deviation from parity, so perhaps we can calculate the absolute deviation from 1 (full parity for this index), then rank the countries and see what happens? Let’s give it a shot. Also, I just noticed my values are still character fields! I ought to have noticed that before. Let’s make them numeric, re-run the ranking, and include the deviation, then take a look at the number one ranked countries again:

gii$values <- as.numeric(gii$values)
gii <- gii %>% group_by(year) %>% mutate(rank = rank(values, ties.method='first'))
   > head(gii)
   # A tibble: 6 x 4
   # Groups:   year [6]
     year  values country      rank      
   1 2015   0.702 Afghanistan   155
   2 2005   0.745 Afghanistan   142
   3 2010   0.751 Afghanistan   148
   4 2011   0.743 Afghanistan   149
   5 2012   0.734 Afghanistan   152
   6 2013   0.724 Afghanistan   148 
gdi$values <- as.numeric(gdi$values)
gdi$abs_variance <- abs(gdi$values-1)
gdi <- gdi %>% group_by(year) %>% mutate(rank = rank(abs_variance, ties.method='first'))
> gdi
 # A tibble: 2,069 x 5
 # Groups:   year [13]
    year  values country      rank abs_variance                  
  1 2000   0.322 Afghanistan   146 0.6780000000
  2 2005   0.519 Afghanistan   157 0.481       
  3 2010   0.595 Afghanistan   163 0.405       
  4 2011   0.609 Afghanistan   164 0.391       
  5 2012   0.618 Afghanistan   164 0.382       
  6 2013   0.627 Afghanistan   163 0.373       
  7 2014   0.634 Afghanistan   164 0.366       
  8 2015   0.639 Afghanistan   164 0.361       
  9 2016   0.646 Afghanistan   164 0.354       
 10 2017   0.658 Afghanistan   165 0.34200000 
> gdi %>% filter(rank==1) %>% arrange(year)
 # A tibble: 13 x 5
 # Groups:   year [13]
    year       values country             rank   abs_variance                                
  1 1995  1           Sweden                 1 0             
  2 2000  1           Uruguay                1 0             
  3 2005  1.001000000 Finland                1 0.001000000000
  4 2010  1.001000000 Brazil                 1 0.001000000000
  5 2011  1.002       Thailand               1 0.002000000000
  6 2012  1           Sweden                 1 0             
  7 2013  1           Thailand               1 0             
  8 2014  1           Slovenia               1 0             
  9 2015  1.001000000 Namibia                1 0.001000000000
 10 2016  1           Kazakhstan             1 0             
 11 2017  1           Dominican Republic     1 0             
 12 2018  1           Ukraine                1 0             
 13 2019  1           Ukraine                1 0

That looks a little better. Still some surprises. Clearly, we’re missing something from how UNDP calculated it. Notably, UNDP only provides the HDI ranking, not the GDI ranking. I also played around a bit with the ties.method, but it didn’t move the rankings much. Normally, this result would be something to dig into, but since the point of this exercise is the R and not the methodology per se, we’re going to continue on as though the ranking were correct, even though it clearly is not.
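
As an aside, the five groups from the UNDP text quoted above could be approximated straight from the deviation rather than from a ranking. This is only a sketch: gdi_group is a hypothetical helper of mine, the quoted text doesn’t say which group a country sitting exactly on a cut-off belongs to (I assume the higher-numbered one), and UNDP may well work from unrounded values.

```r
# Assign GDI groups 1-5 from the absolute deviation from parity, in percent.
# Assumption: a value exactly on a cut-off goes to the higher-numbered group.
gdi_group <- function(gdi_value) {
  deviation_pct <- 100 * abs(gdi_value - 1)
  # findInterval() counts how many cut-offs the deviation has reached.
  findInterval(deviation_pct, c(2.5, 5, 7.5, 10)) + 1L
}

# One example value per group; 0.627 is Afghanistan's 2013 GDI from above.
gdi_group(c(0.99, 0.96, 1.06, 0.91, 0.627))
```

Note that a value above 1, like 1.06, is treated the same as one equally far below 1, which matches the “absolute deviation from gender parity” wording.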

Next, I’ll want to graph my results. I tried looking at all the countries together, but the data are all over the place and it just looks like a mess. Behold:

ggplot(gii)+
     geom_line(
         aes(
             x=year,
             y=values,
             group=country,
             color=country),
         size=1)+
     theme(
         legend.position='none')

I thought it might be more interesting to look at just the highest, lowest, and median ranked, but it wasn’t any better, just bad in a different way:

gii_smry <- bind_rows(
     gii %>% group_by(year) %>% filter(values==min(values)),
     gii %>% group_by(year) %>% filter(values==max(values)),
     gii %>% group_by(year) %>% filter(rank==median(rank, na.rm=TRUE))
 )

ggplot(gii_smry)+
     geom_line(
         aes(
             x=year,
             y=values,
             group=country,
             color=country),
         size=1)+
     geom_point(
         aes(
             x=year,
             y=values,
             group=country,
             color=country),
         size=1)+
     theme(
         legend.position='bottom')

I had to add in points because so many countries occupied the median position just once. This graph ended up being more a cheer for the Nordics and one long raspberry for Yemen. Not interesting.
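
One caveat on the median filter, as far as I can tell: when a year has an even number of countries, median(rank) lands halfway between two ranks (2.5 for four countries, say), so rank==median(rank) matches no row at all for that year. Here’s a sketch of one possible workaround on made-up data, picking the lower of the two middle ranks instead. It assumes the ranks within a year run 1 to n with no gaps, which is what ties.method='first' produces.

```r
library(dplyr)

toy <- data.frame(year = rep(c("2018", "2019"), times = c(4, 3)),
                  country = c("A", "B", "C", "D", "A", "B", "C"),
                  rank = c(1, 2, 3, 4, 1, 2, 3))

# Exact match on the median: the four-country year (median 2.5) drops out.
exact <- toy %>%
  group_by(year) %>%
  filter(rank == median(rank)) %>%
  ungroup()

# Lower-middle rank: always a whole number, so every year keeps one row.
middle <- toy %>%
  group_by(year) %>%
  filter(rank == ceiling(n() / 2)) %>%
  ungroup()
```

Here exact keeps only the 2019 row, while middle keeps one row for each year.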

In the end, I decided it would be most interesting just to take a country and see how its rankings on both indices changed over the years. I’d really like to use Shiny so an end user could select the country, but I don’t yet know how to do that. Maybe my next project?

Anyway, final result below for Sweden, prettied up a bit. I decided to graph the data by combining both indices and filtering by the desired country. I had to add an index column to each data set so I could distinguish them clearly on the graph:

gii$index='GII'; gdi$index='GDI'
 ctry <- 'Sweden'
 combined <- bind_rows(filter(gii,country==ctry), filter(gdi, country==ctry))
ggplot(combined)+
     geom_line(
         aes(x=year, y=reorder(rank,desc(rank)), group=index, color=index),
         size=2)+
     scale_color_manual(
         values=c("mediumpurple1","darkolivegreen3"))+
     geom_text(data=combined,
               aes(x=year, y=reorder(rank,desc(rank)), label=rank),
               vjust=1,
               family="Liberation Sans",
               size=6
               )+
     theme(
         panel.border=element_rect(fill=NA),
         panel.background=element_rect(fill="lavenderblush2"),
         panel.grid=element_line(color="white"),
         plot.title=element_text(hjust=.95,vjust=-15,size=24,family="Droid Sans",color="black"),
         axis.ticks.x=element_blank(),
         axis.ticks.y=element_blank(),
         axis.text.y=element_blank(),
         legend.position=c(.05,.15),
         legend.title=element_blank(),
         legend.background=element_rect(fill="lavenderblush2"),
         legend.key=element_rect(fill="lavenderblush2"),
         legend.box.background=element_rect(color='black', size=.5))+
     labs(title=paste(ctry,'Woman Empowerment\nRankings'),
          subtitle='',
          x='',y=''
          )+
     annotate("text",
              x=3,
              y=.7,
              label='Source: United Nations Development Programme, Human Development Report Office',
              family="Liberation Sans",
              fontface="italic",
              size=3)
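
As a small step toward letting someone else pick the country, the combining step can be wrapped in a function. This is just a sketch: combine_indices is a hypothetical helper name, and the toy data frames are made up to show the shape of the result.

```r
library(dplyr)

# Hypothetical helper: stack the GII and GDI rows for one country,
# tagging each row with the index it came from.
combine_indices <- function(gii, gdi, ctry) {
  bind_rows(
    gii %>% ungroup() %>% filter(country == ctry) %>% mutate(index = "GII"),
    gdi %>% ungroup() %>% filter(country == ctry) %>% mutate(index = "GDI")
  )
}

# Made-up ranks, only to show how the helper is called.
gii_toy <- data.frame(year = c("2018", "2019", "2019"),
                      country = c("Sweden", "Sweden", "Norway"),
                      rank = c(2, 3, 1))
gdi_toy <- data.frame(year = c("2018", "2019"),
                      country = c("Sweden", "Sweden"),
                      rank = c(10, 12))

combined_toy <- combine_indices(gii_toy, gdi_toy, "Sweden")
```

The result can be passed to ggplot() in place of combined, with ctry reused for the title.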
