\documentclass{article}
\RequirePackage{url}
\usepackage{hyperref}
\RequirePackage{amsmath}
\RequirePackage{natbib}
\RequirePackage[a4paper,lmargin={1.25in},rmargin={1.25in},tmargin={1in},bmargin={1in}]{geometry}

\makeatletter
% \VignetteIndexEntry{xgboost: eXtreme Gradient Boosting}
%\VignetteKeywords{xgboost, gbm, gradient boosting machines}
%\VignettePackage{xgboost}
% \VignetteEngine{knitr::knitr}
\makeatother

\begin{document}
%\SweaveOpts{concordance=TRUE}

<<knitropts,echo=FALSE,message=FALSE>>=
if (require('knitr')) opts_chunk$set(fig.width = 5, fig.height = 5, fig.align = 'center', tidy = FALSE, warning = FALSE, cache = TRUE)
@

%
<<prelim,echo=FALSE>>=
xgboost.version <- packageDescription("xgboost")$Version

@
%

    \begin{center}
    \vspace*{6\baselineskip}
    \rule{\textwidth}{1.6pt}\vspace*{-\baselineskip}\vspace*{2pt}
    \rule{\textwidth}{0.4pt}\\[2\baselineskip]
    {\LARGE \textbf{xgboost: eXtreme Gradient Boosting}}\\[1.2\baselineskip]
    \rule{\textwidth}{0.4pt}\vspace*{-\baselineskip}\vspace{3.2pt}
    \rule{\textwidth}{1.6pt}\\[2\baselineskip]
    {\Large Tianqi Chen, Tong He}\\[\baselineskip]
    {\large Package Version: \Sexpr{xgboost.version}}\\[\baselineskip]
    {\large \today}\par
    \vfill
    \end{center}

\thispagestyle{empty}

\clearpage

\setcounter{page}{1}

\section{Introduction}

This document introduces the \verb@xgboost@ package for R.

\verb@xgboost@ is short for eXtreme Gradient Boosting. It is an efficient
and scalable implementation of the gradient boosting framework \citep{friedman2001greedy, friedman2000additive}.
The package includes an efficient linear model solver and a tree learning algorithm.
It supports various objective functions, including regression, classification
and ranking. The package is designed to be extensible, so users can also define their own objectives easily. It has several features:
\begin{enumerate}
    \item{Speed: }{\verb@xgboost@ runs in parallel automatically on 
    Windows and Linux using OpenMP. It is generally over 10 times faster than
    \verb@gbm@.}
    \item{Input Type: }{\verb@xgboost@ takes several types of input data:}
    \begin{itemize}
        \item{Dense Matrix: }{R's dense matrix, i.e. \verb@matrix@}
        \item{Sparse Matrix: }{R's sparse matrix \verb@Matrix::dgCMatrix@}
        \item{Data File: }{Local data files}
        \item{xgb.DMatrix: }{\verb@xgboost@'s own class. Recommended.}
    \end{itemize}
    \item{Sparsity: }{\verb@xgboost@ accepts sparse input for both tree booster 
    and linear booster, and is optimized for sparse input.}
    \item{Customization: }{\verb@xgboost@ supports customized objective 
    and evaluation functions.}
    \item{Performance: }{\verb@xgboost@ has better performance on several different
    datasets.}
\end{enumerate}
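
The input types listed above can be sketched as follows. This is a minimal
illustration rather than part of the original example: the toy matrix
\verb@dense_x@ and its labels are made up, and the chunk assumes the
\verb@Matrix@ package is installed. It builds an \verb@xgb.DMatrix@ from a dense
\verb@matrix@ and from a sparse \verb@dgCMatrix@ holding the same values.

<<input-types-sketch>>=
library(xgboost)
library(Matrix)

# Hypothetical toy data: 6 rows, 3 mostly-zero features, binary labels
dense_x <- matrix(c(1, 0, 0, 1, 0, 0,
                    0, 2, 0, 0, 2, 0,
                    0, 0, 3, 0, 0, 3), nrow = 6)
labels <- c(1, 0, 1, 1, 0, 1)

# Same values stored in compressed sparse column form
sparse_x <- Matrix(dense_x, sparse = TRUE)

# Both representations can be wrapped in xgboost's own container
dtrain_dense <- xgb.DMatrix(data = dense_x, label = labels)
dtrain_sparse <- xgb.DMatrix(data = sparse_x, label = labels)

# Both containers report the same number of rows and columns
dim(dtrain_dense)
dim(dtrain_sparse)
@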


\section{Example with Mushroom data}

In this section, we illustrate some common uses of \verb@xgboost@. The 
Mushroom data comes from the UCI Machine Learning Repository \citep{Bache+Lichman:2013}.

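<<Training and prediction with mushroom data>>=
library(xgboost)
# Load the bundled Mushroom (agaricus) training and test sets
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
# Train a small model: two boosting rounds of depth-2 trees
bst <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1, 
               nrounds = 2, objective = "binary:logistic")
# Save the model to a binary file, then load it back
xgb.save(bst, 'model.save')
bst <- xgb.load('model.save')
# Predict on the test set
pred <- predict(bst, test$data)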
@

\verb@xgboost@ is the main function to train a \verb@Booster@, i.e. a model.
\verb@predict@ returns predictions from the trained model.
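
With the \verb@binary:logistic@ objective, \verb@predict@ returns one
probability per row rather than a class label. The chunk below is a sketch of
one way to turn those probabilities into labels and measure the test error.
It assumes the same \verb@xgboost()@ interface as the chunk above; the 0.5
cutoff and the names \verb@bst_check@, \verb@prob@, \verb@pred_label@ and
\verb@err@ are illustrative choices, not part of the package.

<<prediction-error-sketch>>=
library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')

# Retrain with the same settings as above so the chunk stands alone
bst_check <- xgboost(data = agaricus.train$data,
                     label = agaricus.train$label,
                     max_depth = 2, eta = 1, nrounds = 2,
                     objective = "binary:logistic", verbose = 0)

# One predicted probability per test row
prob <- predict(bst_check, agaricus.test$data)

# Assumed decision rule: label 1 when the probability exceeds 0.5
pred_label <- as.numeric(prob > 0.5)

# Fraction of test rows whose predicted label differs from the true label
err <- mean(pred_label != agaricus.test$label)
print(paste("test-error =", err))
@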

Here we save the model to a local binary file and load it back when needed.
The trees inside the binary file cannot be inspected directly. However, 
\verb@xgb.dump@ saves the model in plain text. 
<<Dump Model>>=
xgb.dump(bst, 'model.dump')
@
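
When the trees only need to be inspected in the current session, writing a
file is optional. As a sketch, assuming the \verb@fname@ argument of
\verb@xgb.dump@ may be left out (in which case the dump is returned as a
character vector), the same text can be printed directly. The names
\verb@bst_text@ and \verb@dump_lines@ are illustrative, and the model is
retrained with the settings used earlier so that the chunk stands alone.

<<dump-to-vector-sketch>>=
library(xgboost)
data(agaricus.train, package = 'xgboost')

# Same training settings as in the example above
bst_text <- xgboost(data = agaricus.train$data,
                    label = agaricus.train$label,
                    max_depth = 2, eta = 1, nrounds = 2,
                    objective = "binary:logistic", verbose = 0)

# Without a file name, the dump comes back as a character vector
dump_lines <- xgb.dump(bst_text)

# Print the first few lines of the text dump
cat(head(dump_lines), sep = "\n")
@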

The output looks like:

\begin{verbatim}
booster[0]:
0:[f28<1.00001] yes=1,no=2,missing=2
  1:[f108<1.00001] yes=3,no=4,missing=4
    3:leaf=1.85965
    4:leaf=-1.94071
  2:[f55<1.00001] yes=5,no=6,missing=6