Ariel Wentworth | Gender Equality

Variable Importance Chart

Through decision tree analysis, I narrowed down which indicators in the GSoDi were most important for predicting the level of gender equality. The analysis yielded the following as the most important variables. The relative importance measure (as graphed below) is defined as the proportion of the total decision-making power achieved by each variable in a decision tree for gender equality based solely upon the indicators listed in the chart.

This variable importance chart is based on a decision tree where the output was a discretized version of the gender equality feature. To convert gender equality into a discrete variable (in the GSoDi dataset, all variables are continuous between 0 and 1), I simply binned the data into 4 parts: 0.4 and below, over 0.4 and up to or including 0.6, over 0.6 and up to or including 0.8, and over 0.8. Based upon those bins, a decision tree with only the variables in the chart above predicts the correct bin for a given observation's measure of gender equality with 76.659% accuracy. Furthermore, 99.489% of classifications made with this decision tree place the observation in either its correct bin or an adjacent one. For example, if the true value of a gender equality observation belongs in the (0.8, 1.0] bin, a prediction of either the (0.8, 1.0] bin or the (0.6, 0.8] bin counts toward that 99.489%.
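The binning rule described above can be sketched in a few lines of Python. This is only an illustration of the bin boundaries, not the code used for the analysis; the names `BIN_EDGES`, `BIN_LABELS`, `bin_index`, and `bin_label` are made up for this example.

```python
from bisect import bisect_left

# Upper edges of the first three bins; the fourth bin is everything above 0.8.
BIN_EDGES = [0.4, 0.6, 0.8]
BIN_LABELS = ["[0, 0.4]", "(0.4, 0.6]", "(0.6, 0.8]", "(0.8, 1.0]"]


def bin_index(score):
    """Return the bin number (0 to 3) for a gender equality score in [0, 1]."""
    # bisect_left counts the edges strictly below the score, so a score that
    # sits exactly on an edge (for example 0.6) stays in the lower bin.
    return bisect_left(BIN_EDGES, score)


def bin_label(score):
    """Return the interval label for a gender equality score in [0, 1]."""
    return BIN_LABELS[bin_index(score)]
```

Using integer bin numbers rather than labels makes it easy to tell later whether two bins are adjacent: they are adjacent exactly when their numbers differ by 1.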

Decision Tree Process

Choosing Indicators

If you recall, our dataset has multiple levels of granularity, starting from the domain and subdomain level, all the way down to the individual indicator level. I leveraged this fact to inform my variable-selection process. I started with a decision tree built on the highest-level variables (i.e. at the domain level), and used the variable importance chart from each decision tree to decide which variables to include in my next analysis. I gradually worked down from domain-level variables to the individual indicator variables, never including anything that was a product of or indicator for the gender equality variable. You can see more details about how this process went in the code.

Pruning the Tree

Once I had drilled down to the lowest level of data granularity, I had a decision tree with high accuracy but 40 indicator variables. This situation seemed really dangerous and prone to overfitting, so I wanted to take steps to guard against that while still maintaining a highly accurate decision tree. To do this, I "pruned" the decision tree by keeping only the indicators which contributed at least 1% of the total importance in decision making.
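As a rough sketch of that 1% rule, assume the fitted tree reports a raw importance score per indicator (most decision tree libraries expose something like this). The function name `keep_important` and its arguments are hypothetical; the tree would still need to be refit on the surviving indicators afterwards.

```python
def keep_important(importances, threshold=0.01):
    """Return the indicators whose share of total importance meets the threshold.

    `importances` maps indicator name -> raw importance score (any positive scale).
    The result is ordered from most to least important.
    """
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("total importance must be positive")
    # Convert raw scores to proportions of the total so they sum to 1.
    shares = {name: value / total for name, value in importances.items()}
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return [name for name, share in ranked if share >= threshold]
```

Note that this is variable selection by importance rather than pruning branches of the tree itself, which is why "pruned" appears in quotes above.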

Calculating Accuracy

An important part of selecting a model and determining its accuracy was having both query data and test data to draw on. While 60% of the total dataset was training data (that is, the data used to build up the decision tree models), 20% of the data was kept back as query data and another 20% as test data. The query data was used to estimate the accuracy of multiple decision tree models, and was a helpful tool in selecting which model to pursue. The test data never touched the model building process until the final model was selected, and at that point was used to estimate the accuracy of the decision tree by introducing "new" data to the model and seeing how it fared. Accuracy was calculated in the usual way, dividing the number of correctly classified observations by the total number of observations.
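One way to produce a 60/20/20 split like the one described is to shuffle the row numbers once and slice them. The sketch below assumes a simple random split with a fixed seed; the write-up does not say how rows were actually assigned, and `split_indices` is a name invented for this example.

```python
import random


def split_indices(n_rows, seed=0, train_frac=0.6, query_frac=0.2):
    """Shuffle row numbers 0..n_rows-1 and split them into train/query/test lists."""
    indices = list(range(n_rows))
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_frac)
    n_query = round(n_rows * query_frac)
    train = indices[:n_train]
    query = indices[n_train:n_train + n_query]
    # Whatever is left over (about 20% with the defaults) is the test set.
    test = indices[n_train + n_query:]
    return train, query, test
```

Models are fit on `train` and compared on `query`; `test` is set aside until a single final model has been chosen.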

I also wanted a measurement that, in a general sense, captured how often the decision tree was "close" in predicting the level of gender equality. For this, I widened the definition of a "successful classification" to include anything classified in a bin adjacent to its true value. For instance, for observations whose true value was in the (0.6, 0.8] bin, not only predictions of (0.6, 0.8] counted as successful, but also predictions of (0.4, 0.6] or (0.8, 1.0]. The total number of "successful classifications" was then divided by the total number of observations.
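Both measures can be written down compactly if each bin is represented by its number (0 for the lowest bin through 3 for the highest), so that adjacent bins differ by exactly 1. This is a minimal sketch under that assumption; the function names are illustrative only.

```python
def exact_accuracy(true_bins, predicted_bins):
    """Share of observations whose predicted bin equals the true bin."""
    pairs = list(zip(true_bins, predicted_bins))
    if not pairs:
        raise ValueError("need at least one observation")
    return sum(t == p for t, p in pairs) / len(pairs)


def within_one_bin_accuracy(true_bins, predicted_bins):
    """Share of observations predicted in the true bin or an adjacent one."""
    pairs = list(zip(true_bins, predicted_bins))
    if not pairs:
        raise ValueError("need at least one observation")
    # Bins are numbered in order, so "adjacent" means the numbers differ by 1.
    return sum(abs(t - p) <= 1 for t, p in pairs) / len(pairs)
```

The second measure can never be lower than the first, since every exact match is also within one bin.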

Final Tree

Ultimately, the final tree can be plotted as follows:

In case you don't have your cheat sheet handy, the variables mentioned in this plot are:

v_22_06: freedom of academic and cultural expression
v_22_31: freedom of foreign movement
v_22_32: CSO women's participation
v_23_06: exclusion by socio-economic group (inverted)
v_23_07: exclusion by political group index (inverted)
v_23_08: exclusion by social group index (inverted)
v_23_09: exclusion by urban/rural location index (inverted)
