---
layout: post
title:  XGBoost4J: Portable Distributed Tree Boosting in DataFlow
date:   2016-03-15 12:00:00
author: Nan Zhu, Tianqi Chen
comments: true
---

## Introduction
[XGBoost](https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting. The gradient boosting tree model was originally proposed by Friedman et al. By embracing multi-threading and introducing regularization, XGBoost delivers high...
XGBoost provides native interfaces for C++, R, Python, Julia and Java users.
It is used in both [data exploration and production scenarios](https://github.com/dmlc/xgboost/tree/master/demo#usecases) to solve real-world machine learning problems.

The distributed XGBoost is described in the [recently published paper](http://arxiv.org/abs/1603.02754).
In short, the XGBoost system runs orders of magnitude faster than existing distributed ML alternatives
and uses far fewer resources. Readers are welcome to refer to the paper for more details.

Despite this success, one of our ultimate goals is to make XGBoost even more widely available across all production scenarios.
Programming languages and data processing/storage systems based on the Java Virtual Machine (JVM) play a significant role in the big data ecosystem. [Hadoop](http://hadoop.apache.org/), [Spark](http://spark.apache.org/) and more recently introduced [Fl...

On the other side, the emerging demands of machine learning and deep learning
have inspired many excellent machine learning libraries.
Many of these libraries (e.g. [XGBoost](https://github.com/dmlc/xgboost)/[MXNet](https://github.com/dmlc/mxnet))
require new computation abstractions and native support (e.g. C++ for GPU computing).
They are also often [much more efficient](http://arxiv.org/abs/1603.02754).

The gap between the implementation fundamentals of general data processing frameworks and those of more specialized machine learning libraries/systems prevents a smooth connection between these two types of systems, and thus brings unnecessary inconvenien...

We want the best of both worlds, so that we can use data processing frameworks like Spark and Flink together with
the best distributed machine learning solutions.
To resolve the situation, we introduce the newly brewed [XGBoost4J](https://github.com/dmlc/xgboost/tree/master/jvm-packages),
<b>XGBoost</b> for the <b>J</b>VM Platform. We aim to provide clean Java/Scala APIs and integration with the most popular data processing systems developed in JVM-based languages.

## Unix Philosophy in Machine Learning

XGBoost and XGBoost4J adopt the Unix philosophy.
XGBoost **does its best in one thing -- tree boosting** and is **being designed to work with other systems**.
We strongly believe that machine learning solutions should not be restricted to a certain language or a certain platform.

Specifically, users will be able to use distributed XGBoost in both Spark and Flink, and possibly more frameworks in the future.
We have designed the API to be portable, so it **can be easily ported to other Dataflow frameworks provided by the Cloud**.
XGBoost4J shares its core with the other XGBoost libraries, which means data scientists can use R/Python
to read and visualize a model that was trained in a distributed setting.
It also means that users can start with the single-machine version for exploration,
which can already handle hundreds of millions of examples.

## System Overview

The following figure shows the overall architecture of XGBoost4J. XGBoost4J provides a Java/Scala API that calls the core functionality of the XGBoost library. Most importantly, it not only supports single-machine model training, but also pr...

![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/xgboost4j.png)


By calling the XGBoost4J API, users can scale model training out to a cluster. XGBoost4J runs XGBoost worker instances inside Spark/Flink tasks across the cluster. The communication among the distributed model training tasks ...

With the abstraction of XGBoost4J, users can build a unified data analytic application ranging from Extract-Transform-Load, data exploration and machine learning model training to the final data product service. The following figure illustrates an e...

![XGBoost4J Unified Pipeline](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline.png)


## Single-machine Training Walk-through

In this section, we will walk through the APIs of XGBoost4J by example.
We will be using Scala for demonstration, but we also have a complete API for Java users.

To start the model training and evaluation, we need to prepare the training and test set:

```scala
import ml.dmlc.xgboost4j.scala.DMatrix

// load the demo training and test sets from text files
val trainMax = new DMatrix("../../demo/data/agaricus.txt.train")
val testMax = new DMatrix("../../demo/data/agaricus.txt.test")
```
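
The agaricus demo files are plain text in LibSVM format, where each line holds a label followed by sparse `index:value` pairs. To make that layout concrete, here is a minimal sketch of parsing one such line in plain Scala. The `SparseRow` type, the `parseLibSvmLine` helper, and the sample line are hypothetical illustrations and not part of XGBoost4J, which reads these files itself through `DMatrix`.

```scala
// Hypothetical types and helper for illustration only; not part of XGBoost4J.
final case class SparseRow(label: Float, indices: Array[Int], values: Array[Float])

// Parse a line of the form "<label> <index>:<value> <index>:<value> ..."
def parseLibSvmLine(line: String): SparseRow = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toFloat
  val pairs = tokens.tail.map { token =>
    val Array(index, value) = token.split(":", 2)
    (index.toInt, value.toFloat)
  }
  SparseRow(label, pairs.map(_._1), pairs.map(_._2))
}

// A made-up sample line: label 1 with three non-zero features.
val row = parseLibSvmLine("1 3:1 10:1 11:1")
```

Only the non-zero features appear on each line, which keeps files compact when the data is sparse.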

After preparing the data, we can train our model:

```scala
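import scala.collection.mutable
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// training parameters
val params = new mutable.HashMap[String, Any]()
params += "eta" -> 1.0
params += "max_depth" -> 2
params += "silent" -> 1
params += "objective" -> "binary:logistic"

// datasets whose evaluation metrics are reported during training
val watches = new mutable.HashMap[String, DMatrix]
watches += "train" -> trainMax
watches += "test" -> testMax

// number of boosting rounds
val round = 2
// train a model
val booster = XGBoost.train(trainMax, params.toMap, round, watches.toMap)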
```

We then evaluate our model:

```scala
val predicts = booster.predict(testMax)
```
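
To turn the raw predictions into a single quality number, one option is a classification error rate. The sketch below is plain Scala and makes several assumptions: that the predictions arrive as one single-element row per example holding a probability (as expected with the `binary:logistic` objective), that labels are 0 or 1, and that 0.5 is a sensible decision threshold. The `errorRate` helper and the sample values are hypothetical and not part of XGBoost4J.

```scala
// Hypothetical helper for illustration only; not part of XGBoost4J.
// Fraction of examples whose thresholded prediction differs from the label.
def errorRate(predicts: Array[Array[Float]],
              labels: Array[Float],
              threshold: Float = 0.5f): Double = {
  require(labels.nonEmpty, "need at least one example")
  require(predicts.length == labels.length, "one prediction row per label")
  val wrong = predicts.zip(labels).count { case (row, label) =>
    // Assumption: row(0) is the predicted probability of the positive class.
    val predicted = if (row(0) > threshold) 1.0f else 0.0f
    predicted != label
  }
  wrong.toDouble / labels.length
}

// Made-up values standing in for real model output and test labels.
val samplePredicts = Array(Array(0.9f), Array(0.2f), Array(0.7f), Array(0.4f))
val sampleLabels = Array(1.0f, 0.0f, 0.0f, 1.0f)
val err = errorRate(samplePredicts, sampleLabels)
```

With the objects from the walk-through, this could presumably be called as `errorRate(predicts, testMax.getLabel)`, assuming `getLabel` returns the test labels in the same order as the predictions.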