Alien-XGBoost

 view release on metacpan or  search on metacpan

xgboost/R-package/vignettes/discoverYourData.Rmd  view on Meta::CPAN

---
title: "Understand your dataset with Xgboost"
output:
  rmarkdown::html_vignette:
    css: vignette.css
    number_sections: yes
    toc: yes
author: Tianqi Chen, Tong He, Michaël Benesty, Yuan Tang
vignette: >
  %\VignetteIndexEntry{Discover your data}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

Understand your dataset with XGBoost
====================================

Introduction
------------

The purpose of this vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.

This vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features...

Package loading:

```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
require(Matrix)
require(data.table)
if (!require('vcd')) install.packages('vcd')
```

> **VCD** package is used for one of its embedded dataset only.

Preparation of the dataset
--------------------------

### Numeric v.s. categorical variables


**Xgboost** manages only `numeric` vectors.

What to do when you have *categorical* data?

A *categorical* variable has a fixed number of different values. For instance, if a variable called *Colour* can have only one of these three values, *red*, *blue* or *green*, then *Colour* is a *categorical* variable.

> In **R**, a *categorical* variable is called `factor`.
>
> Type `?factor` in the console for more information.

To answer the question above we will convert *categorical* variables to `numeric` one.

### Conversion from categorical to numeric variables

#### Looking at the raw data

In this Vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features.

The method we are going to see is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot).

The first step is to load `Arthritis` dataset in memory and wrap it with `data.table` package.

```{r, results='hide'}
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = F)
```

> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-ca...

The first thing we want to do is to have a look to the first few lines of the `data.table`:



( run in 0.454 second using v1.01-cache-2.11-cpan-e93a5daba3e )