Alien-XGBoost

 view release on metacpan or  search on metacpan

xgboost/R-package/vignettes/discoverYourData.Rmd  view on Meta::CPAN

##### Grouping per 10 years

For the first feature we create groups of age by rounding the real age.

Note that we transform it to `factor` so the algorithm treat these age groups as independent values.

Therefore, 20 is not closer to 30 than 60. To make it short, the distance between ages is lost in this transformation.

```{r}
head(df[,AgeDiscret := as.factor(round(Age/10,0))])
```

##### Random split into two groups

Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. We choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (you may al...

```{r}
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
```

##### Risks in adding correlated features

These new features are highly correlated to the `Age` feature because they are simple transformations of this feature.

For many machine learning algorithms, using correlated features is not a good idea. It may sometimes make prediction less accurate, and most of the time make interpretation of the model almost impossible. GLM, for instance, assumes that the features ...

Fortunately, decision tree algorithms (including boosted trees) are very robust to these features. Therefore we have nothing to do to manage this situation.

##### Cleaning data

We remove ID as there is nothing to learn from this feature (it would just add some noise).

```{r, results='hide'}
df[,ID:=NULL]
```

We will list the different values for the column `Treatment`:

```{r}
levels(df[,Treatment])
```


#### Encoding categorical features

Next step, we will transform the categorical data to dummy variables.
Several encoding methods exist, e.g., [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) is a common approach.
We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it producess "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013...

The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.

For example, the column `Treatment` will be replaced by two columns, `TreatmentPlacebo`, and `TreatmentTreated`. Each of them will be *binary*. Therefore, an observation which has the value `Placebo` in column `Treatment` before the transformation wi...

Column `Improved` is excluded because it will be our `label` column, the one we want to predict.

```{r, warning=FALSE,message=FALSE}
sparse_matrix <- sparse.model.matrix(Improved ~ ., data = df)[,-1]
head(sparse_matrix)
```

> Formula `Improved ~ .` used above means transform all *categorical* features but column `Improved` to binary values. The `-1` column selection removes the intercept column which is full of `1` (this column is generated by the conversion). For more ...

Create the output `numeric` vector (not as a sparse `Matrix`):

```{r}
output_vector = df[,Improved] == "Marked"
```

1. set `Y` vector to `0`;
2. set `Y` to `1` for rows where `Improved == Marked` is `TRUE` ;
3. return `Y` vector.

Build the model
---------------

The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).

```{r}
bst <- xgboost(data = sparse_matrix, label = output_vector, max_depth = 4,
               eta = 1, nthread = 2, nrounds = 10,objective = "binary:logistic")

```

You can see some `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well the model explains your data. Lower is better.

A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copy/paste too much the past, and won't be that good to predict the future).

> Here you can see the numbers decrease until line 7 and then increase.
>
> It probably means we are overfitting. To fix that I should reduce the number of rounds to `nrounds = 4`. I will let things like that because I don't really care for the purpose of this example :-)

Feature importance
------------------

## Measure feature importance


### Build the feature importance data.table

Remember, each binary column corresponds to a single value of one of *categorical* features.

```{r}
importance <- xgb.importance(feature_names = colnames(sparse_matrix), model = bst)
head(importance)
```

> The column `Gain` provide the information we are looking for.
>
> As you can see, features are classified by `Gain`.

`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are...

`Cover` measures the relative quantity of observations concerned by a feature.

`Frequency` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).

#### Improvement in the interpretability of feature importance data.table

We can go deeper in the analysis of the model. In the `data.table` above, we have discovered which features counts to predict if the illness will go or not. But we don't yet know the role of these features. For instance, one of the question we may wa...

One simple solution is to count the co-occurrences of a feature and a class of the classification.

For that purpose we will execute the same function as above but using two more parameters, `data` and `label`.

```{r}
importanceRaw <- xgb.importance(feature_names = colnames(sparse_matrix), model = bst, data = sparse_matrix, label = output_vector)

# Cleaning for better display
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequency=NULL)]

head(importanceClean)
```

> In the table above we have removed two not needed columns and select only the first lines.

First thing you notice is the new column `Split`. It is the split applied to the feature on a branch of one of the tree. Each split is present, therefore a feature can appear several times in this table. Here we can see the feature `Age` is used seve...

How the split is applied to count the co-occurrences? It is always `<`. For instance, in the second line, we measure the number of persons under 61.5 years with the illness gone after the treatment.

The two other new columns are `RealCover` and `RealCover %`. In the first column it measures the number of observations in the dataset where the split is respected and the label marked as `1`. The second column is the percentage of the whole populati...

Therefore, according to our findings, getting a placebo doesn't seem to help but being younger than 61 years may help (seems logic).

> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`, therefore, looking for one hot-encoded categorical observations validating the rule `< 1.00001` is like just looking for `1` for th...

### Plotting the feature importance


All these things are nice, but it would be even better to plot the results.

```{r, fig.width=8, fig.height=5, fig.align='center'}
xgb.plot.importance(importance_matrix = importance)
```

Feature have automatically been divided in 2 clusters: the interesting features... and the others.

> Depending of the dataset and the learning parameters you may have more than two clusters. Default value is to limit them to `10`, but you can increase this limit. Look at the function documentation for more information.

According to the plot above, the most important features in this dataset to predict if the treatment will work are :

* the Age ;
* having received a placebo or not ;
* the sex is third but already included in the not interesting features group ;
* then we see our generated features (AgeDiscret). We can see that their contribution is very low.

### Do these results make sense?


Let's check some **Chi2** between each of these features and the label.

Higher **Chi2** means better correlation.

```{r, warning=FALSE, message=FALSE}
c2 <- chisq.test(df$Age, output_vector)
print(c2)
```

Pearson correlation between Age and illness disapearing is **`r round(c2$statistic, 2 )`**.

```{r, warning=FALSE, message=FALSE}
c2 <- chisq.test(df$AgeDiscret, output_vector)
print(c2)
```

Our first simplification of Age gives a Pearson correlation is **`r round(c2$statistic, 2)`**.

```{r, warning=FALSE, message=FALSE}
c2 <- chisq.test(df$AgeCat, output_vector)
print(c2)
```

The perfectly random split I did between young and old at 30 years old have a low correlation of **`r round(c2$statistic, 2)`**. It's a result we may expect as may be in my mind > 30 years is being old (I am 32 and starting feeling old, this may expl...

Morality: don't let your *gut* lower the quality of your model.



( run in 1.103 second using v1.01-cache-2.11-cpan-13bb782fe5a )