best results from the CPAN

Alien-XGBoost

```{r dataSize, message=F, warning=F}
dim(train$data)
dim(test$data)
```

This dataset is very small to not make the **R** package too heavy, however **XGBoost** is built to manage huge dataset very efficiently.

As seen below, the `data` are stored in a `dgCMatrix` which is a *sparse* matrix and `label` vector is a `numeric` vector (`{0,1}`):

```{r dataClass, message=F, warning=F}
class(train$data)[1]
class(train$label)
```

### Basic Training using XGBoost


This step is the most critical part of the process for the quality of our model.

#### Basic training

We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.

In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. It is very usual to have such dataset.

We will train decision tree model using the following parameters:

* `objective = "binary:logistic"`: we will train a binary classification model ;
* `max_depth = 2`: the trees won't be deep, because our case is very simple ;
* `nthread = 2`: the number of cpu threads we are going to use;
* `nrounds = 2`: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.

```{r trainingSparse, message=F, warning=F}
bstSparse <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
```

> More complex the relationship between your features and your `label` is, more passes you need.

#### Parameter variations

##### Dense matrix

Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.

```{r trainingDense, message=F, warning=F}
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
```

##### xgb.DMatrix

**XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be useful for the most advanced features we will discover later.

```{r trainingDmatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
```

##### Verbose option

**XGBoost** has several features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.

One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced technics).

```{r trainingVerbose0, message=T, warning=F}
# verbose = 0, no message
bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 0)
```

```{r trainingVerbose1, message=T, warning=F}
# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 1)
```

```{r trainingVerbose2, message=T, warning=F}
# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 2)
```

## Basic prediction using XGBoost


## Perform the prediction


The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.

```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

# limit display of predictions to the first 10
print(head(pred))
```

These numbers doesn't look like *binary classification* `{0,1}`. We need to perform a simple transformation before being able to use these results.

## Transform the regression in a binary classification


The only thing that **XGBoost** does is a *regression*. **XGBoost** is using `label` vector to build its *regression* model.

How can we use a *regression* model to perform a binary classification?

If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum will be classified as `1`. Therefore, we will set the rule that if this probability for a specific datum is `> 0.5` then the observat...

```{r predictingTest, message=F, warning=F}
prediction <- as.numeric(pred > 0.5)
print(head(prediction))
```

## Measuring model performance


To measure the model performance, we will compute a simple metric, the *average error*.

```{r predictingAverageError, message=F, warning=F}
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
( run in 0.548 second using v1.01-cache-2.11-cpan-9581c071862 )