Alien-XGBoost

xgboost/demo/kaggle-otto/understandingXGBoostModel.Rmd

```
> `magrittr` and `data.table` are here to make the code cleaner and much faster.

Let's explore the dataset.

```{r explore}
# Train dataset dimensions
dim(train)

# Training content
train[1:6, 1:5, with = FALSE]

# Test dataset dimensions
dim(test)

# Test content
test[1:6, 1:5, with = FALSE]
```
> We only display the first 6 rows and the first 5 columns for convenience.

Each *column* represents a feature measured by an `integer`. Each *row* is an **Otto** product.

The first column (`id`) is only a row identifier and doesn't contain any useful information.

To let the algorithm focus on the real features, we will delete it.

```{r clean, results='hide'}
# Delete ID column in training dataset
train[, id := NULL]

# Delete ID column in testing dataset
test[, id := NULL]
```
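
As a minimal sketch of what `:=` does here, the chunk below drops a column by reference from a small table. The table `toy` and its values are hypothetical and only stand in for the real data.

```{r cleanSketch}
library(data.table)

# Toy table standing in for the real data (hypothetical values)
toy <- data.table(id = 1:3, feat_1 = c(0L, 2L, 1L), feat_2 = c(5L, 0L, 0L))

# `:=` modifies the table in place; assigning NULL drops the column
toy[, id := NULL]

# Only the feature columns remain
names(toy)
```

Because the deletion happens in place, there is no need to assign the result back to `toy`.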

According to its description, the **Otto** challenge is a multi-class classification challenge. We need to extract the labels (here, the names of the different classes) from the dataset. We only have two files (test and training), so it seems logical that...

```{r searchLabel}
# Check the content of the last column
train[1:6, ncol(train), with = FALSE]
# Save the name of the last column
nameLastCol <- names(train)[ncol(train)]
```

The classes are provided as character strings in the `r ncol(train)`th column, called `r nameLastCol`. As you may know, **XGBoost** doesn't support anything other than numbers, so we will convert the classes to `integer`. Moreover, according to the document...

For that purpose, we will:

* extract the target column
* remove `Class_` from each class name
* convert to `integer`
* subtract `1` from the new value

```{r classToIntegers}
# Convert from classes to numbers
y <- train[, nameLastCol, with = F][[1]] %>% gsub('Class_','',.) %>% {as.integer(.) -1}

# Display the first 5 labels
y[1:5]
```
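
The same conversion can be sketched in base R on a few made-up labels. The vector `toyClasses` is hypothetical; the point is only to show each step of the list above and the resulting zero-based range, which is what XGBoost's multi-class objectives expect.

```{r classToIntegersSketch}
# Hypothetical class labels in the same "Class_<n>" format
toyClasses <- c("Class_1", "Class_9", "Class_3", "Class_1")

# Strip the prefix, convert to integer, and shift to a zero-based range
toyLabels <- as.integer(gsub("Class_", "", toyClasses, fixed = TRUE)) - 1L

toyLabels

# Number of classes implied by the zero-based labels
max(toyLabels) + 1
```

With zero-based labels, the number of classes is `max(label) + 1`, which is how `numberOfClasses` is derived later on.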

We remove the label column from the training dataset, otherwise **XGBoost** would use it to guess the labels!

```{r deleteCols, results='hide'}
# Parentheses make data.table use the column name stored in nameLastCol
train[, (nameLastCol) := NULL]
```

`data.table` is an awesome implementation of data.frame. Unfortunately, it is not a format supported natively by **XGBoost**, so we need to convert both datasets (training and test) to `numeric` matrix format.

```{r convertToNumericMatrix}
trainMatrix <- train[,lapply(.SD,as.numeric)] %>% as.matrix
testMatrix <- test[,lapply(.SD,as.numeric)] %>% as.matrix
```
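
To see what this conversion produces, here is a small sketch on a hypothetical two-column table (`toyDt` is an assumed name, not part of the demo data). It uses a nested call instead of the pipe so that it only depends on `data.table`.

```{r convertSketch}
library(data.table)

# Hypothetical integer features
toyDt <- data.table(feat_1 = c(0L, 2L, 1L), feat_2 = c(5L, 0L, 0L))

# Convert every column to numeric (double), then to a matrix
toyMatrix <- as.matrix(toyDt[, lapply(.SD, as.numeric)])

dim(toyMatrix)
storage.mode(toyMatrix)
```

The column names are carried over to the matrix, so feature names stay available after the conversion.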

Model training
==============

Before training, we will use cross-validation to evaluate our error rate.

Basically, **XGBoost** will divide the training data into `nfold` parts, hold out the first part as test data, and train on the rest. Then it will reintegrate the first part, hold out the second part, train again and...

You can look at the function documentation for more information.
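
The fold rotation described above can be sketched in base R. This is only an illustration of the idea, not how `xgb.cv` builds its folds internally; `nObs`, `nFold` and `foldId` are hypothetical names.

```{r foldSketch}
set.seed(42)

nObs <- 10   # hypothetical number of training rows
nFold <- 3   # plays the same role as `nfold` in `xgb.cv`

# Randomly assign each row to one fold, keeping fold sizes balanced
foldId <- sample(rep(seq_len(nFold), length.out = nObs))

for (k in seq_len(nFold)) {
  heldOut <- which(foldId == k)     # rows used for evaluation in this round
  trainRows <- which(foldId != k)   # rows used for training in this round
  cat("fold", k, "- held out:", heldOut, "\n")
}
```

Every row is held out exactly once, so each observation contributes to the evaluation without ever being scored by a model that saw it during training.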

```{r crossValidation}
numberOfClasses <- max(y) + 1

param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)

cv.nround <- 5
cv.nfold <- 3

# `params` is the full argument name; `param` only worked through partial matching
bst.cv <- xgb.cv(params = param, data = trainMatrix, label = y,
                 nfold = cv.nfold, nrounds = cv.nround)
```
> As we can see, the error rate is low on the test dataset (for a model trained in 5 min).
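
The metric reported by the cross-validation is `mlogloss` (multi-class log loss) rather than a misclassification rate. As a minimal sketch, assuming the standard definition (the mean negative log-probability assigned to the true class), it can be computed by hand on hypothetical predictions. Real implementations typically also clip probabilities away from zero, which is omitted here.

```{r mloglossSketch}
# Hypothetical predicted probabilities: one row per observation, one column per class
toyProb <- matrix(c(0.7, 0.2, 0.1,
                    0.1, 0.8, 0.1,
                    0.2, 0.3, 0.5),
                  nrow = 3, byrow = TRUE)

# Zero-based true labels, in the same convention as `y`
toyTruth <- c(0L, 1L, 2L)

# Probability assigned to the true class of each observation
pTrue <- toyProb[cbind(seq_along(toyTruth), toyTruth + 1L)]

# Multi-class log loss: mean negative log-probability of the true class
toyLogLoss <- -mean(log(pTrue))
toyLogLoss
```

A perfect model would assign probability 1 to every true class and reach a log loss of 0; confident wrong predictions are penalised heavily.
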

Finally, we are ready to train the real model!!!

```{r modelTraining}
nround <- 50
bst <- xgboost(params = param, data = trainMatrix, label = y, nrounds = nround)
```

Model understanding
===================

Feature importance
------------------

So far, we have built a model made of **`r nround`** trees.

To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products).

Each division operation is called a *split*.

Each group at each division level is called a branch and the deepest level is called a *leaf*.

In the final model, these *leaves* are supposed to be as pure as possible for each tree, meaning in our case that each *leaf* should contain only one class of **Otto** product (of course this is not true in practice, but that's what we try to achieve in a minimu...

**Not all *splits* are equally important**. Basically, the first *split* of a tree will have more impact on the purity than, for instance, the deepest *split*. Intuitively, we understand that the first *split* does most of the work, and the following...
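
To make the notion of purity concrete, the sketch below measures it with Gini impurity before and after a single *split* on made-up data. This is an assumption chosen for illustration: **XGBoost** itself selects *splits* using a gain computed from gradient statistics of the loss, not Gini impurity. The names `giniImpurity`, `toyClass` and `toyFeat` are hypothetical.

```{r puritySketch}
# Gini impurity of a group of class labels: 0 means the group is pure
giniImpurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Hypothetical class labels and one feature for 8 products
toyClass <- c("A", "A", "A", "B", "B", "B", "B", "B")
toyFeat  <- c(0, 0, 0, 1, 3, 4, 5, 5)

# A single split on the feature produces two branches
left  <- toyClass[toyFeat < 2]
right <- toyClass[toyFeat >= 2]

# Impurity before the split
impurityBefore <- giniImpurity(toyClass)

# Impurity after the split, weighted by the size of each branch
impurityAfter <- (length(left) * giniImpurity(left) +
                  length(right) * giniImpurity(right)) / length(toyClass)

c(before = impurityBefore, after = impurityAfter)
```

Here one *split* removes most of the impurity, and a further *split* of the left branch could only remove what little is left.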


