---
title: "Understand your dataset with Xgboost"
output:
  rmarkdown::html_vignette:
    css: vignette.css
    number_sections: yes
    toc: yes
author: Tianqi Chen, Tong He, Michaël Benesty, Yuan Tang
vignette: >
  %\VignetteIndexEntry{Discover your data}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

Understand your dataset with XGBoost
====================================

Introduction
------------

The purpose of this vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.

This vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features...

Package loading:

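```{r libLoading, results='hold', message=FALSE, warning=FALSE}
library(xgboost)
library(Matrix)
library(data.table)
# vcd is needed only for the Arthritis dataset; install it if it is missing
if (!require("vcd")) {
  install.packages("vcd")
  library(vcd)  # attach the freshly installed package
}
```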

> The **vcd** package is used only for one of its embedded datasets.

Preparation of the dataset
--------------------------

### Numeric vs. categorical variables


**Xgboost** manages only `numeric` vectors.

What to do when you have *categorical* data?

A *categorical* variable has a fixed number of different values. For instance, if a variable called *Colour* can have only one of these three values, *red*, *blue* or *green*, then *Colour* is a *categorical* variable.

> In **R**, a *categorical* variable is called a `factor`.
>
> Type `?factor` in the console for more information.
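
As a minimal sketch of the *Colour* example above (the vector `colour` is a hypothetical toy variable, not part of the dataset used in this vignette), a `factor` stores a fixed set of levels and an integer code for each value:

```{r}
# Hypothetical toy data: a categorical variable with three possible values
colour <- factor(c("red", "blue", "green", "blue"))

levels(colour)      # the fixed set of values, sorted alphabetically
as.integer(colour)  # internal integer codes; these are labels, not quantities
```

The integer codes only index the levels, so they should not be passed to a model as if they were measurements.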

To answer the question above, we will convert *categorical* variables to `numeric` ones.

### Conversion from categorical to numeric variables

#### Looking at the raw data

In this vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables into a very *sparse* matrix (*sparse* = lots of zeroes in the matrix) of `numeric` features.

The method we are going to see is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot).
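
To make the idea concrete, here is a small sketch of one-hot encoding on a hypothetical toy `data.frame` (`toy_colours` is an illustration only). It assumes base R's `model.matrix()` and `sparse.model.matrix()` from the **Matrix** package; it is one possible way to do the encoding, not necessarily the one used on the real dataset:

```{r}
library(Matrix)

# Hypothetical toy data with a single categorical column
toy_colours <- data.frame(colour = factor(c("red", "blue", "green", "blue")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
dense_onehot <- model.matrix(~ colour - 1, data = toy_colours)
colnames(dense_onehot)

# Same encoding stored as a sparse matrix: only the non-zero cells are kept
sparse_onehot <- sparse.model.matrix(~ colour - 1, data = toy_colours)
dim(sparse_onehot)
```

Each row contains exactly one `1`, in the column of its level, which is why the result becomes very sparse as the number of levels grows.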

The first step is to load the `Arthritis` dataset into memory and wrap it with the `data.table` package.

```{r, results='hide'}
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)
```

> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-ca...

The first thing we want to do is to have a look at the first few lines of the `data.table`:

```{r}
head(df)
```

Now we will check the format of each column.

```{r}
str(df)
```

Two columns have the `factor` type, and one has the `ordinal` type (an ordered factor).

> An `ordinal` variable:
>
> * can take a limited number of values (like `factor`);
> * has values that are ordered (unlike `factor`). Here these ordered values are: `Marked > Some > None`.
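
A short sketch of how such an ordered variable behaves in **R**, using a hypothetical vector `improvement` built with the same three levels (it is not taken from the dataset):

```{r}
# Hypothetical toy vector; levels are listed from lowest to highest
improvement <- factor(c("Some", "None", "Marked"),
                      levels = c("None", "Some", "Marked"),
                      ordered = TRUE)

is.ordered(improvement)            # TRUE
improvement[3] > improvement[1]    # "Marked" > "Some", so TRUE
max(improvement)                   # the highest level present: Marked
```

Comparisons such as `>` are only meaningful because the levels were declared as ordered.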

#### Creation of new features based on old ones

We will add some new *categorical* features to see if it helps.

##### Grouping per 10 years

For the first feature we create age groups by dividing the real age by 10 and rounding the result.

Note that we transform it to `factor` so the algorithm treats these age groups as independent values.

Therefore, 20 is not closer to 30 than to 60. To make it short, the distance between ages is lost in this transformation.

```{r}
head(df[,AgeDiscret := as.factor(round(Age/10,0))])
```
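
The effect of this transformation can be checked on a few hypothetical ages (`toy_ages` is an illustration, not data from `Arthritis`):

```{r}
# Hypothetical ages, chosen only for illustration
toy_ages <- c(23, 27, 34, 58, 61)

toy_groups <- as.factor(round(toy_ages / 10, 0))
levels(toy_groups)      # "2" "3" "6"
as.integer(toy_groups)  # 1 2 2 3 3
```

Group `"6"` is simply the third level: once the value is a `factor`, nothing records that it lies further from `"2"` than `"3"` does.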

##### Arbitrary split into two groups

Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. We choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (you may al...

```{r}
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
```
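
As a quick sanity check of the rule on hypothetical ages (`toy_split_ages` is made up for illustration), note that the comparison is strict, so a person aged exactly 30 falls in the `"Young"` group:

```{r}
# Hypothetical ages, including the boundary value 30
toy_split_ages <- c(23, 27, 30, 34, 58)

toy_split <- as.factor(ifelse(toy_split_ages > 30, "Old", "Young"))
as.character(toy_split)  # "Young" "Young" "Young" "Old" "Old"
table(toy_split)         # counts per group
```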

##### Risks in adding correlated features

These new features are highly correlated to the `Age` feature because they are simple transformations of this feature.

For many machine learning algorithms, using correlated features is not a good idea. It may sometimes make prediction less accurate, and most of the time make interpretation of the model almost impossible. GLM, for instance, assumes that the features ...

Fortunately, decision tree algorithms (including boosted trees) are very robust to these features. Therefore, we do not need to do anything to handle this situation.
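
The correlation itself is easy to observe. The sketch below uses hypothetical ages (`toy_corr_age` is not taken from the dataset) and numeric versions of the two derived features, since `cor()` needs numbers rather than factors:

```{r}
# Hypothetical ages, chosen only for illustration
toy_corr_age <- c(23, 27, 34, 41, 46, 52, 58, 63, 69, 74)

toy_decade <- round(toy_corr_age / 10, 0)    # numeric version of the 10-year grouping
toy_is_old <- as.integer(toy_corr_age > 30)  # numeric version of the two-group split

cor(toy_corr_age, toy_decade)  # close to 1: the grouping tracks age closely
cor(toy_corr_age, toy_is_old)  # positive, but weaker
```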

##### Cleaning data

We remove the `ID` column as there is nothing to learn from this feature (it would just add some noise).

```{r, results='hide'}
df[,ID:=NULL]
```
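
In `data.table`, assigning `NULL` with `:=` deletes a column by reference, so no reassignment is needed. A minimal sketch on a hypothetical table (`toy_dt` and its columns are made up for illustration):

```{r}
library(data.table)

# Hypothetical toy table with an identifier column
toy_dt <- data.table(id = 1:3, age = c(23, 34, 58))

toy_dt[, id := NULL]  # modifies toy_dt in place
names(toy_dt)         # "age"
```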