```r
importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
head(importance)
```

```
##             Feature        Gain      Cover  Frequency
## 1:              Age 0.622031651 0.67251706 0.67241379
## 2: TreatmentPlacebo 0.285750607 0.11916656 0.10344828
## 3:          SexMale 0.048744054 0.04522027 0.08620690
## 4:      AgeDiscret6 0.016604647 0.04784637 0.05172414
## 5:      AgeDiscret3 0.016373791 0.08028939 0.05172414
## 6:      AgeDiscret4 0.009270558 0.02858801 0.01724138
```

> The column `Gain` provides the information we are looking for.
>
> As you can see, features are sorted by `Gain`.

`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there were some wrongly classified elements; after adding the split on this feature, there are...

`Cover` measures the relative number of observations affected by a feature.

`Frequency` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
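To make the three columns concrete, here is a minimal base-R sketch that derives them from a table of individual splits. The `splits` data frame and its values are invented for illustration, and the normalisation (each column sums to 1 across features) is an assumption about how the importance table is built, not a copy of the `xgb.importance` implementation.

```r
# Hypothetical per-split records from a tiny ensemble (values invented).
splits <- data.frame(
  Feature = c("Age", "Age", "TreatmentPlacebo", "Age", "SexMale"),
  Gain    = c(10, 4, 5, 0.5, 0.5),  # improvement brought by each split
  Cover   = c(80, 40, 40, 20, 20),  # observations reaching each split (simplified)
  stringsAsFactors = FALSE
)

# Turn raw totals into shares that sum to 1.
share <- function(x) x / sum(x)

# Sum Gain and Cover per feature, and count how many splits use each feature.
per_feature <- aggregate(cbind(Gain, Cover) ~ Feature, data = splits, FUN = sum)
n_splits <- as.numeric(table(splits$Feature)[as.character(per_feature$Feature)])

importance_toy <- data.frame(
  Feature   = as.character(per_feature$Feature),
  Gain      = share(per_feature$Gain),
  Cover     = share(per_feature$Cover),
  Frequency = share(n_splits),
  stringsAsFactors = FALSE
)

# Sort by decreasing Gain, like the table above.
importance_toy <- importance_toy[order(-importance_toy$Gain), ]
print(importance_toy)
```

In this toy table `TreatmentPlacebo` and `SexMale` have the same `Frequency` (one split each) but very different `Gain`, which illustrates why counting splits alone can be misleading.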

#### Improvement in the interpretability of feature importance data.table

We can go deeper in the analysis of the model. In the `data.table` above, we have discovered which features count when predicting whether the illness will go or not. But we don't yet know the role of these features. For instance, one of the questions we may wa...

One simple solution is to count the co-occurrences of a feature and a class of the classification.

For that purpose we will execute the same function as above but using two more parameters, `data` and `label`.


```r
importanceRaw <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)

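# Cleaning for better display.
# `:=` modifies a data.table by reference, so work on a copy to leave
# importanceRaw untouched.
importanceClean <- copy(importanceRaw)[,`:=`(Cover=NULL, Frequency=NULL)]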

head(importanceClean)
```

```
##             Feature        Split       Gain RealCover RealCover %
## 1: TreatmentPlacebo -1.00136e-05 0.28575061         7   0.2500000
## 2:              Age         61.5 0.16374034        12   0.4285714
## 3:              Age           39 0.08705750         8   0.2857143
## 4:              Age         57.5 0.06947553        11   0.3928571
## 5:          SexMale -1.00136e-05 0.04874405         4   0.1428571
## 6:              Age         53.5 0.04620627        10   0.3571429
```

> In the table above we have removed two unneeded columns and selected only the first rows.

The first thing you notice is the new column `Split`. It is the split applied to the feature on a branch of one of the trees. Every split is listed, so a feature can appear several times in this table. Here we can see the feature `Age` is used seve...

How is the split applied to count the co-occurrences? It is always `<`. For instance, in the second line, we measure the number of persons under 61.5 years with the illness gone after the treatment.

The two other new columns are `RealCover` and `RealCover %`. The first measures the number of observations in the dataset where the split is respected and the label is marked as `1`. The second column is the percentage of the whole populati...

Therefore, according to our findings, getting a placebo doesn't seem to help, but being younger than 61 years may help (which seems logical).
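The co-occurrence count described above can be sketched in a few lines of base R. The `age` and `label` vectors and the `real_cover` helper are hypothetical, made up for illustration; they are not the dataset or the internals of `xgb.importance`. The divisor 28 in the last step is inferred from the printed table values, not taken from the function's documentation.

```r
# Hypothetical toy data: ages and a 0/1 label,
# where 1 means the illness is gone after treatment.
age   <- c(35, 48, 52, 59, 63, 66, 70, 41)
label <- c( 1,  0,  1,  1,  0,  1,  0,  1)

# Co-occurrence count for one split: the rule is always "feature < split",
# and only observations labelled 1 are counted.
real_cover <- function(feature, split, label) {
  sum(feature < split & label == 1)
}

toy_count <- real_cover(age, 61.5, label)
print(toy_count)

# Every `RealCover %` value printed in the table above equals RealCover / 28.
table_real_cover <- c(7, 12, 8, 11, 4, 10)
table_real_cover_pct <- round(table_real_cover / 28, 7)
print(table_real_cover_pct)
```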

> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`; therefore, looking for one-hot encoded categorical observations validating the rule `< 1.00001` is like just looking for `1` for th...

### Plotting the feature importance


All these things are nice, but it would be even better to plot the results.


```r
# Plot the plain importance table from the first call. `importanceRaw` has a
# different set of columns, which xgb.plot.importance() rejects with
# "Importance matrix is not correct (column names issue)".
xgb.plot.importance(importance_matrix = importance)
```

Features have automatically been divided into 2 clusters: the interesting features... and the others.

> Depending on the dataset and the learning parameters you may have more than two clusters. The default is to limit them to `10`, but you can increase this limit. Look at the function documentation for more information.

According to the plot above, the most important features in this dataset to predict if the treatment will work are:

* the Age;
* having received a placebo or not;
* the sex is third but already included in the not interesting features group;
* then we see our generated features (AgeDiscret). We can see that their contribution is very low.

### Do these results make sense?


Let's check some **Chi2** between each of these features and the label.

A higher **Chi2** statistic suggests a stronger association between the feature and the label.


```r
c2 <- chisq.test(df$Age, output_vector)
print(c2)
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  df$Age and output_vector
## X-squared = 35.475, df = 35, p-value = 0.4458
```

The Pearson chi-squared statistic between Age and the illness disappearing is **35.48**.
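The statistic is easier to read alongside its degrees of freedom and p-value: in the output above, 35.48 on 35 degrees of freedom gives a p-value of 0.4458, which is weak evidence of association. The sketch below shows how these three numbers can be pulled out of the object returned by `chisq.test`. The `group` and `outcome` vectors are hypothetical toy data, not the dataset used in this document.

```r
# Hypothetical toy data: two groups of 30 and a 0/1 outcome.
group   <- rep(c("a", "b"), each = 30)
outcome <- c(rep(1, 20), rep(0, 10), rep(1, 10), rep(0, 20))

# correct = FALSE disables Yates' continuity correction, which chisq.test
# applies by default to 2x2 tables.
res <- chisq.test(group, outcome, correct = FALSE)

chi2    <- unname(res$statistic)              # the X-squared value
dof     <- as.integer(unname(res$parameter))  # degrees of freedom
p_value <- res$p.value                        # small values suggest an association

cat(sprintf("X-squared = %.3f, df = %d, p-value = %.4f\n", chi2, dof, p_value))
```

Comparing raw statistics across features is only meaningful when the degrees of freedom are similar; otherwise the p-value is the safer basis for comparison.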


```r
c2 <- chisq.test(df$AgeDiscret, output_vector)
print(c2)
```
