---
title: "Understand your dataset with Xgboost"
output:
  rmarkdown::html_vignette:
    css: vignette.css
    number_sections: yes
    toc: yes
author: Tianqi Chen, Tong He, Michaël Benesty, Yuan Tang
vignette: >
  %\VignetteIndexEntry{Discover your data}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
Understand your dataset with XGBoost
====================================
Introduction
------------
The purpose of this vignette is to show you how to use **XGBoost** to explore and better understand your own dataset.

This vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features...
Package loading:
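```{r libLoading, results='hold', message=FALSE, warning=FALSE}
require(xgboost)
require(Matrix)
require(data.table)
# vcd is needed only for the Arthritis dataset: install it if missing, then load it
if (!require('vcd')) {
  install.packages('vcd')
  library(vcd)
}
```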
> The **vcd** package is used only for one of its embedded datasets.
Preparation of the dataset
--------------------------
### Numeric vs. categorical variables
**XGBoost** handles only `numeric` vectors.

What to do when you have *categorical* data?
A *categorical* variable has a fixed number of different values. For instance, if a variable called *Colour* can have only one of these three values, *red*, *blue* or *green*, then *Colour* is a *categorical* variable.
> In **R**, a *categorical* variable is called a `factor`.
>
> Type `?factor` in the console for more information.
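
As a minimal sketch of what a `factor` looks like, the snippet below builds one for the *Colour* example above. The values are hypothetical toy data, not part of the dataset used in this vignette.

```r
# Hypothetical toy data: the Colour variable from the example above
colour <- factor(c("red", "blue", "green", "blue"))

# The fixed set of possible values (sorted alphabetically by default)
levels(colour)

# The integer codes R stores internally for each observation
as.integer(colour)
```

The integer codes are only labels: their order (here alphabetical) carries no meaning, which is one reason a plain integer conversion is usually not a good numeric representation of a categorical variable.
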
To answer the question above, we will convert *categorical* variables to `numeric` ones.
### Conversion from categorical to numeric variables
#### Looking at the raw data
In this vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables into a very *sparse* matrix (*sparse* = lots of zeroes in the matrix) of `numeric` features.

The method we are going to see is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot).
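
To make the idea concrete, here is a small sketch of one-hot encoding on hypothetical toy data (the `toy` data frame and its `Colour` column are illustrative only). It uses `model.matrix` from base R for a dense result and `sparse.model.matrix` from the **Matrix** package for a sparse one. This illustrates the general technique and is not necessarily the exact call applied to the real dataset.

```r
library(Matrix)

# Hypothetical toy data, not the dataset used in this vignette
toy <- data.frame(Colour = factor(c("red", "blue", "green", "blue")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
onehot <- model.matrix(~ Colour - 1, data = toy)
onehot

# Same encoding, stored as a sparse matrix (only non-zero cells are kept)
sparse_onehot <- sparse.model.matrix(~ Colour - 1, data = toy)
sparse_onehot
```

Each row contains exactly one `1`, in the column of the level observed for that row. With many levels most cells are zero, which is why a sparse representation pays off.
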
The first step is to load the `Arthritis` dataset into memory and wrap it with the `data.table` package.
```{r, results='hide'}
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)
```
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-ca...
The first thing we want to do is to have a look at the first few lines of the `data.table`: