This step is the most critical part of the process for the quality of our model.

#### Basic training

We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.
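If you want to confirm this layout yourself, a quick way (an illustrative sketch, not part of the original walkthrough) is to inspect the list:

```r
# show the top-level structure of the training list:
# $data (a sparse feature matrix) and $label (a numeric vector)
str(train, max.level = 1)
```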

In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. It is very common to have such a dataset.
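As an aside, here is a minimal sketch (with a made-up toy matrix) of how a dense **R** matrix can be converted to the sparse `dgCMatrix` class that **XGBoost** accepts, using the `Matrix` package:

```r
require(Matrix)

# a small dense matrix dominated by zeros
dense <- matrix(c(1, 0, 0, 0,
                  0, 0, 2, 0,
                  0, 0, 0, 3), nrow = 3, byrow = TRUE)

# convert to a sparse dgCMatrix: only the three non-zero cells are stored
sparse <- Matrix(dense, sparse = TRUE)
print(class(sparse))  # "dgCMatrix"
```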

We will train a decision tree model using the following parameters:

* `objective = "binary:logistic"`: we will train a binary classification model;
* `max.depth = 2`: the trees won't be deep, because our case is very simple;
* `nthread = 2`: the number of CPU threads we are going to use;
* `nround = 2`: there will be two passes on the data; the second one will enhance the model by further reducing the difference between ground truth and prediction.


```r
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
```

```
## [0]	train-error:0.046522
## [1]	train-error:0.022263
```

> The more complex the relationship between your features and your `label` is, the more passes you need.
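For instance (an illustrative sketch only; our toy dataset does not need it), giving the booster more passes is just a matter of raising `nround`:

```r
# same call as above, but with ten boosting rounds instead of two
bst10 <- xgboost(data = train$data, label = train$label, max.depth = 2,
                 eta = 1, nthread = 2, nround = 10, objective = "binary:logistic")
```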

#### Parameter variations

##### Dense matrix

Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.


```r
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
```

```
## [0]	train-error:0.046522
## [1]	train-error:0.022263
```

##### xgb.DMatrix

**XGBoost** offers a way to group `data` and `label` in an `xgb.DMatrix` object. You can even add other metadata to it. This will be useful for the more advanced features we will discover later.


```r
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
```

```
## [0]	train-error:0.046522
## [1]	train-error:0.022263
```
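For example, per-observation weights can be attached to the `xgb.DMatrix` with `setinfo` and read back with `getinfo`. A hedged sketch follows; the uniform weights are purely illustrative and leave the model unchanged:

```r
# attach a weight to each training row (all equal here, for illustration)
weights <- rep(1, length(train$label))
setinfo(dtrain, "weight", weights)

# read metadata back out of the xgb.DMatrix
head(getinfo(dtrain, "label"))
```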

##### Verbose option

**XGBoost** has several features to help you view how the learning progresses internally. The purpose is to help you set the best parameters, which is the key to your model's quality.

One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).


```r
# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 0)
```


```r
# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 1)
```

```
## [0]	train-error:0.046522
## [1]	train-error:0.022263
```


```r
# verbose = 2, also print information about the trees
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)
```

```
## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
## [0]	train-error:0.046522
## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
## [1]	train-error:0.022263
```

## Basic prediction using XGBoost


#### Perform the prediction


The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.


```r
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))
```

```
## [1] 1611
```
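The prediction vector should contain exactly one value per row of the test matrix; a quick sanity check (an illustrative addition) makes that explicit:

```r
# one prediction per test observation
stopifnot(length(pred) == nrow(test$data))
```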

```r
# limit display of predictions to the first 6
print(head(pred))
```

```
## [1] 0.28583017 0.92392391 0.28583017 0.28583017 0.05169873 0.92392391
```
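
These numbers are probabilities, not classes. To turn them into a binary classification, a common step (sketched here, with the conventional `0.5` cut-off as an assumption) is to threshold them:

```r
# convert probabilities to 0/1 predictions using a 0.5 cut-off
prediction <- as.numeric(pred > 0.5)
print(head(prediction))
```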