Predictors of Gender Equality

An in-depth look at the decision tree process and further conclusions

Variable Importance Chart

Through decision tree analysis, I was able to narrow down which indicators in the GSoDi were most important for predicting the level of gender equality. The analysis yielded the following as the most important variables. The relative importance measure (as graphed below) is defined as the proportion of the total decision-making power achieved by each variable in a decision tree for gender equality built solely upon the indicators listed in the chart.

Variable Importance Chart for Gender Equality
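The code behind this analysis isn't shown here, but the "proportion of total decision-making power" idea maps directly onto normalized feature importances in a fitted tree. Here is a minimal sketch in Python with scikit-learn; the data and indicator count are invented for illustration:

```python
# Hypothetical sketch: "relative importance" as each variable's share of
# the total decision-making power in a fitted tree. Data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 3))                          # three made-up indicators
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)   # synthetic target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# scikit-learn normalizes importances so they sum to 1, i.e. each value
# is that feature's proportion of the tree's total importance.
importance = tree.feature_importances_
print(importance)
```

Because the importances are already proportions, they can be graphed directly as a variable importance chart.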

This variable importance chart is based on a decision tree whose output was a discretized version of the gender equality feature. To convert gender equality into a discrete variable (in the GSoDi dataset, all variables are continuous between 0 and 1), I binned the data into four parts: 0.4 and below, over 0.4 and up to and including 0.6, over 0.6 and up to and including 0.8, and over 0.8. Based upon those bins, a decision tree with only the variables in the chart above predicts with 76.659% accuracy the correct bin for a given observation's measure of gender equality. Furthermore, 99.489% of classifications made with this decision tree place the observation in either its correct bin or an adjacent one. For example, if the true value of a gender equality observation belongs in the (0.8, 1.0] bin, this decision tree is 99.489% likely to predict that the observation belongs in either the (0.8, 1.0] bin or the (0.6, 0.8] bin.
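The binning step above can be sketched in a few lines of Python with pandas; the column name and sample values here are invented, but the bin edges are the ones described in the text:

```python
# Sketch of the discretization described above: continuous [0, 1] values
# cut into the four bins (<=0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0].
# The column name "gender_equality" is an assumption.
import pandas as pd

df = pd.DataFrame({"gender_equality": [0.12, 0.40, 0.55, 0.60, 0.79, 0.81, 1.0]})
bins = [0.0, 0.4, 0.6, 0.8, 1.0]

# include_lowest=True keeps 0.0 itself in the first bin;
# right=True (the default) makes each bin closed on the right.
df["ge_bin"] = pd.cut(df["gender_equality"], bins=bins, include_lowest=True)

print(df["ge_bin"].tolist())
```

Note that with right-closed bins, a value of exactly 0.60 falls in the (0.4, 0.6] bin, matching the "up to and including" wording above.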

Further Analysis

To see how each of our important variables relates to gender equality, click on "Toggle show/hide" below for each variable to see a scatterplot of that variable plotted against gender equality. The points in these scatterplots come from all countries across all years of observation. The variables are listed in order of greatest importance (as noted by the variable importance chart above). Some variables have a more obviously strong relationship with gender equality than others, but ultimately each of these important variables has a noticeably positive correlation with our output (gender equality).

Social Group Exclusion

Social Group Index versus Gender Equality

Urban/Rural Location Exclusion

Urban/Rural Location Index versus Gender Equality

Socio-Economic Exclusion

Socio-Economic Group Index versus Gender Equality

Political Exclusion

Political Group Index versus Gender Equality

Justice for Women

Access to Justice for Women versus Gender Equality

Justice for Men

Access to Justice for Men versus Gender Equality

Social Group Equality: Civil Liberties

Social Group Equality: Civil Liberties versus Gender Equality

Power Distribution

Power Distribution versus Gender Equality

Freedom of Foreign Movement

Freedom of Foreign Movement versus Gender Equality

Freedom of Academic and Cultural Expression

Freedom of Academic and Cultural Expression versus Gender Equality

Freedom of Domestic Movement: Women

Freedom of Domestic Movement for Women versus Gender Equality

Freedom of Domestic Movement: Men

Freedom of Domestic Movement for Men versus Gender Equality

Freedom of Discussion for Men

Freedom of Discussion for Men versus Gender Equality

Print/Broadcast Censorship

Print/Broadcast Censorship Effort versus Gender Equality

Media Self-Censorship

Media Self-Censorship versus Gender Equality

Decision Tree Process

Choosing Indicators

If you recall, our dataset has multiple levels of granularity, from the domain and subdomain level all the way down to the individual indicator level. I leveraged this structure to inform my variable-selection process. I started with a decision tree built on the highest-level variables (i.e., at the domain level), and used the variable importance chart from each decision tree to decide which variables to include in the next analysis. I gradually worked down from domain-level variables to individual indicator variables, never including anything that was a product of, or a component indicator of, the gender equality variable. You can see more details about this process in the code.
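That drill-down loop can be sketched as follows. Everything here is hypothetical: the hierarchy mapping, variable names, and data are invented, and the real analysis works with the GSoDi's own domain/subdomain/indicator structure.

```python
# Hypothetical sketch of the top-down selection loop: fit a tree at one
# level of granularity, keep the variables that matter, then swap each
# for its sub-variables at the next level down. Data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 400
a_1, a_2, b_1 = rng.random(n), rng.random(n), rng.random(n)
data = {
    "a_1": a_1, "a_2": a_2, "b_1": b_1,
    "domain_a": (a_1 + a_2) / 2,   # domain score aggregates its indicators
    "domain_b": b_1,
}
# toy hierarchy: each domain maps to its lower-level indicators
children = {"domain_a": ["a_1", "a_2"], "domain_b": ["b_1"]}
y = (a_1 > 0.5).astype(int)        # synthetic target driven by one indicator

features = ["domain_a", "domain_b"]        # start at the domain level
for _ in range(2):                         # two levels in this toy example
    X = np.column_stack([data[f] for f in features])
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    important = [f for f, imp in zip(features, tree.feature_importances_)
                 if imp > 0.01]
    # drill down: swap each important variable for its sub-variables
    features = [c for f in important for c in children.get(f, [f])]

print(features)
```

In this toy setup the loop correctly homes in on the one indicator that actually drives the target.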

Pruning the Tree

Once I had drilled down to the lowest level of data granularity, I had a decision tree with high accuracy but 40 indicator variables. A model with that many variables is prone to overfitting, so I wanted to guard against that while still maintaining a highly accurate decision tree. To do this, I "pruned" the decision tree by keeping only the indicators which contributed at least 1% of the total importance in decision making.
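The 1% cutoff amounts to filtering on normalized importances. A minimal sketch, with invented variable names and synthetic data in place of the 40 real indicators:

```python
# Sketch of the pruning step: keep only variables contributing at least
# 1% of total importance. Names and data are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.random((600, 6))
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)   # only two informative columns
names = [f"v_{i}" for i in range(6)]

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# feature_importances_ sums to 1, so the 1% threshold is simply 0.01
keep = [n for n, imp in zip(names, tree.feature_importances_) if imp >= 0.01]
print(keep)
```

A new tree is then fit on just the kept variables, trading a little training accuracy for a much simpler, less overfit model.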

Calculating Accuracy

Something really important to the process of selecting a model and determining its accuracy was having both query data and test data to make decisions and conclusions with. While 60% of the total dataset was training data (that is, the data used to build the decision tree models), 20% of the data was held back as query data and another 20% as test data. The query data was used to estimate the accuracy of multiple decision tree models and was a helpful tool in selecting which model to pursue. The test data never touched the model-building process until the final model was selected, at which point it was used to estimate the accuracy of the decision tree by introducing "new" data to the model and seeing how it fared. Accuracy was calculated in the usual way: dividing the number of correctly classified observations by the total number of observations.
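One common way to produce such a 60/20/20 split is two successive random splits. A sketch in Python with scikit-learn (an assumption — the original analysis may have used different tooling), on synthetic data:

```python
# Sketch of the 60/20/20 train/query/test split and the accuracy
# calculation described above. Data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.random((1000, 4))
y = (X[:, 0] > 0.5).astype(int)

# first carve off 40%, then split that 40% evenly into query and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_query, X_test, y_query, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# accuracy = correctly classified observations / total observations
query_acc = (tree.predict(X_query) == y_query).mean()
test_acc = (tree.predict(X_test) == y_test).mean()
print(query_acc, test_acc)
```

The query accuracy guides model selection; the test accuracy, computed only once at the end, is the honest estimate of how the final tree fares on unseen data.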

I also wanted a measurement that, in a general sense, explained how often the decision tree was "close" in predicting the level of gender equality. For this, I expanded "successful classifications" to additionally include anything classified in a bin adjacent to its true value. For instance, for observations whose true value was in the (0.6, 0.8] bin, not only correct predictions counted as successful, but also predictions of (0.4, 0.6] or (0.8, 1.0]. The total number of "successful classifications" was then divided by the total number of observations.
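If the four bins are indexed 0 through 3, "correct or adjacent" reduces to the predicted index being within 1 of the true index. A small sketch with made-up predictions:

```python
# Sketch of the "close classification" metric: a prediction is successful
# if it lands in the true bin or an adjacent one. Values are made up.
import numpy as np

# bins indexed 0..3 for (<=0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
true_bins = np.array([0, 1, 2, 3, 3, 2, 1])
pred_bins = np.array([0, 2, 2, 2, 3, 0, 1])   # hypothetical predictions

# success = |true index - predicted index| <= 1
adjacent_rate = (np.abs(true_bins - pred_bins) <= 1).mean()
print(adjacent_rate)   # 6 of 7 predictions are correct or adjacent
```

This is the rate reported above as 99.489% for the final tree.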

Final Tree

Ultimately, the final tree can be plotted as follows:

Decision Tree for Gender Equality

In case you don't have your cheat sheet handy, the variables mentioned in this plot are:

  • v_22_06: freedom of academic and cultural expression
  • v_22_31: freedom of foreign movement
  • v_22_32: CSO women's participation
  • v_23_06: exclusion by socio-economic group (inverted)
  • v_23_07: exclusion by political group index (inverted)
  • v_23_08: exclusion by social group index (inverted)
  • v_23_09: exclusion by urban/rural location index (inverted)

Dig Deeper

Still want to see more of the process? Check out the full code on GitHub!