Variable Importance Chart
Through decision tree analysis, I was able to narrow down which indicators in the GSoDi were most important for predicting the level of gender equality. The chart below shows the variables the analysis identified as most important. Relative importance is defined as the proportion of the total decision-making power achieved by each variable in a decision tree for gender equality built solely from the indicators listed in the chart.
![Variable Importance Chart for Gender Equality](../assets/genderEquality.png)
This variable importance chart is based on a decision tree where the output was a discretized version of the gender equality feature. To convert gender equality into a discrete variable (in the GSoDi dataset, all variables are continuous between 0 and 1), I simply binned the data into 4 parts: 0.4 and below, over 0.4 and up to and including 0.6, over 0.6 and up to and including 0.8, and over 0.8. Based upon those bins, a decision tree with only the variables in the chart above predicts the correct bin for a given observation's gender equality measure with 76.659% accuracy. Furthermore, 99.489% of classifications made with this decision tree place the observation in either its correct bin or an adjacent one. For example, if the true value of a gender equality observation belongs in the (0.8, 1.0] bin, a prediction counts toward that 99.489% when it places the observation in either the (0.8, 1.0] bin or the (0.6, 0.8] bin.
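The binning described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the analysis: the names `BIN_EDGES`, `BIN_LABELS`, and `bin_index` are hypothetical, and the only thing taken from the text is the set of cut points (0.4, 0.6, 0.8) with right-closed bins.

```python
from bisect import bisect_left

# Upper edges of the first three bins; scores above 0.8 fall in the last bin.
BIN_EDGES = [0.4, 0.6, 0.8]
BIN_LABELS = ["[0, 0.4]", "(0.4, 0.6]", "(0.6, 0.8]", "(0.8, 1.0]"]


def bin_index(score):
    """Map a gender equality score in [0, 1] to a bin index from 0 to 3."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    # bisect_left places a score equal to an edge in the lower bin,
    # which matches "up to and including" in the bin definitions.
    return bisect_left(BIN_EDGES, score)


if __name__ == "__main__":
    for score in (0.25, 0.4, 0.55, 0.8, 0.93):
        print(score, "->", BIN_LABELS[bin_index(score)])
```

Using integer bin indices (rather than labels) also makes it easy to check later whether a prediction landed in a neighboring bin.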
Further Analysis
To see how each of the important variables relates to gender equality, click "Toggle show/hide" under each variable below to view a scatterplot of that variable against gender equality. The points in these scatterplots come from all countries across all years of observation. The variables are listed in order of greatest importance (as noted by the variable importance chart above). Some variables have a more obviously strong relationship with gender equality than others, but ultimately each of these important variables had a noticeably positive correlation with our output (gender equality).
Social Group Exclusion
Urban/Rural Location Exclusion
![Urban/Rural Location Index versus Gender Equality](../assets/gender-equality-relationships/v_23_09.png)
Socio-Economic Exclusion
![Socio-Economic Group Index versus Gender Equality](../assets/gender-equality-relationships/v_23_06.png)
Political Exclusion
![Political Group Index versus Gender Equality](../assets/gender-equality-relationships/v_23_07.png)
Justice for Women
![Access to Justice for Women versus Gender Equality](../assets/gender-equality-relationships/v_21_02.png)
Justice for Men
![Access to Justice for Men versus Gender Equality](../assets/gender-equality-relationships/v_21_01.png)
Social Group Equality: Civil Liberties
![Social Group Equality: Civil Liberties versus Gender Equality](../assets/gender-equality-relationships/v_23_02.png)
Power Distribution
![Power Distribution versus Gender Equality](../assets/gender-equality-relationships/v_23_04.png)
Freedom of Foreign Movement
![Freedom of Foreign Movement versus Gender Equality](../assets/gender-equality-relationships/v_22_31.png)
Freedom of Academic and Cultural Expression
![Freedom of Academic and Cultural Expression versus Gender Equality](../assets/gender-equality-relationships/v_22_06.png)
Freedom of Domestic Movement: Women
![Freedom of Domestic Movement for Women versus Gender Equality](../assets/gender-equality-relationships/v_22_32.png)
Freedom of Domestic Movement: Men
![Freedom of Domestic Movement for Men versus Gender Equality](../assets/gender-equality-relationships/v_22_33.png)
Freedom of Discussion for Men
![Freedom of Discussion for Men versus Gender Equality](../assets/gender-equality-relationships/v_22_05.png)
Print/Broadcast Censorship
![Print/Broadcast Censorship Effort versus Gender Equality](../assets/gender-equality-relationships/v_22_01.png)
Media Self-Censorship
![Media Self-Censorship versus Gender Equality](../assets/gender-equality-relationships/v_22_03.png)
Decision Tree Process
Choosing Indicators
If you recall, our dataset has multiple levels of granularity, starting from the domain and subdomain level, all the way down to the individual indicator level. I leveraged this fact to inform my variable-selection process. I started with a decision tree with the highest level variables (i.e. at the domain level), and used the variable importance chart from each decision tree to inform which variables to include in my next analysis. I gradually worked down from domain-level variables to the individual indicator variables, never including anything that was a product of or indicator for the gender equality variable. You can see more details about how this process went in the code.
Pruning the Tree
Once I had drilled down to the lowest level of data granularity, I had a decision tree with high accuracy but 40 indicator variables. A tree with that many inputs seemed dangerously prone to overfitting, so I wanted to guard against that while still maintaining a highly accurate decision tree. To do this, I "pruned" the decision tree by keeping only the indicators which contributed at least 1% of the total importance in decision making.
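As a rough sketch of the 1% cutoff, the function below normalizes raw importance scores so they sum to 1 and keeps the variables whose share meets the threshold. The function name `select_important` and the indicator names and numbers in the example are made up for illustration; they are not values from the actual analysis, and the real code may compute or filter importances differently.

```python
def select_important(importances, threshold=0.01):
    """Return variable names whose share of total importance is >= threshold.

    `importances` maps a variable name to its raw importance score.
    Names are returned from most to least important.
    """
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("total importance must be positive")
    # Convert raw scores to proportions of the total.
    shares = {name: value / total for name, value in importances.items()}
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return [name for name, share in ranked if share >= threshold]


if __name__ == "__main__":
    # Hypothetical raw importances, not taken from the GSoDi analysis.
    raw = {"indicator_a": 50.0, "indicator_b": 49.5, "indicator_c": 0.5}
    print(select_important(raw))  # indicator_c holds 0.5% and is dropped
```

After filtering, the tree would be refit on the surviving indicators so that the reported accuracy reflects the smaller model.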
Calculating Accuracy
Having both query data and test data was essential for selecting a model and determining its accuracy. While 60% of the total dataset was training data (that is, the data used to build up the decision tree models), 20% of the data was kept back as query data and another 20% as test data. The query data was used to estimate the accuracy of multiple decision tree models, and was a helpful tool in selecting which model to pursue. The test data never touched the model building process until the final model was selected, and at that point was used to estimate the accuracy of the decision tree by introducing "new" data to the model and seeing how it fared. Accuracy was calculated in the usual way, dividing the number of correctly classified observations by the total number of observations.
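The split and the accuracy calculation might look something like the sketch below. The 60/20/20 proportions and the accuracy formula come from the text; the random shuffle, the fixed seed, and the function names `split_60_20_20` and `accuracy` are assumptions made for the example, since the write-up does not say how observations were assigned to each set.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle rows and split them into training, query, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # assumed: a simple random split
    n = len(shuffled)
    n_train = n * 6 // 10  # integer arithmetic avoids float rounding surprises
    n_query = n * 2 // 10
    train = shuffled[:n_train]
    query = shuffled[n_train:n_train + n_query]
    test = shuffled[n_train + n_query:]  # remainder, roughly 20%
    return train, query, test


def accuracy(true_bins, predicted_bins):
    """Share of observations whose predicted bin equals the true bin."""
    if not true_bins or len(true_bins) != len(predicted_bins):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_bins, predicted_bins))
    return correct / len(true_bins)


if __name__ == "__main__":
    train, query, test = split_60_20_20(range(100))
    print(len(train), len(query), len(test))
    print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))
```

Candidate models would be compared using `accuracy` on the query set, with the test set scored only once for the final model.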
I also wanted to create a measurement that, in a general sense, explained how often the decision tree was "close" in predicting the level of gender equality. For this, I broadened the definition of a "successful classification" to include anything classified in a bin adjacent to its true bin. For instance, for observations whose true value was in the (0.6, 0.8] bin, a prediction of (0.4, 0.6] or (0.8, 1.0] counted as successful alongside a correct prediction of (0.6, 0.8]. The total number of "successful classifications" was then divided by the total number of observations.
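With the bins encoded as consecutive integers (0 for the lowest bin through 3 for the highest), this "close enough" measure reduces to checking whether the predicted and true bin indices differ by at most one. The sketch below assumes that integer encoding; the function name `within_one_bin_accuracy` is hypothetical.

```python
def within_one_bin_accuracy(true_bins, predicted_bins):
    """Share of predictions that land in the true bin or an adjacent bin.

    Bins are assumed to be encoded as consecutive integers, so two bins
    are adjacent exactly when their indices differ by 1.
    """
    if not true_bins or len(true_bins) != len(predicted_bins):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(t - p) <= 1 for t, p in zip(true_bins, predicted_bins))
    return hits / len(true_bins)


if __name__ == "__main__":
    true_bins = [0, 1, 2, 3]
    predicted = [0, 2, 0, 3]  # exact, off by one, off by two, exact
    print(within_one_bin_accuracy(true_bins, predicted))
```

By construction this measure can never be lower than plain accuracy, because every exact match is also within one bin.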
Final Tree
Ultimately, the final tree can be plotted as follows:
![Decision Tree for Gender Equality](../assets/genderEqDecisionTree.png)
In case you don't have your cheat sheet handy, the variables mentioned in this plot are:
v_22_06
: freedom of academic and cultural expression
v_22_31
: freedom of foreign movement
v_22_32
: CSO women's participation
v_23_06
: exclusion by socio-economic group (inverted)
v_23_07
: exclusion by political group index (inverted)
v_23_08
: exclusion by social group index (inverted)
v_23_09
: exclusion by urban/rural location index (inverted)
Dig Deeper
Still want to see more of the process? Check out the full code on GitHub!