## Introduction

In March 2016, we released the first version of [XGBoost4J](http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), a set of packages providing Java/Scala interfaces of XGBoost and its integration with JVM-based data processing frameworks such as Spark and Flink.

The integrations with Spark and Flink, a.k.a. <b>XGBoost4J-Spark</b> and <b>XGBoost-Flink</b>, have received tremendously positive feedback from the community. They enable users to build a unified pipeline, embedding XGBoost into data processing systems built on these widely deployed frameworks.

![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline.png)

In recent months, we have communicated extensively with users and gained a deeper understanding of their latest usage scenarios and requirements:

* XGBoost keeps gaining deployments in production environments as well as adoption in machine learning competitions ([link](http://datascience.la/xgboost-workshop-and-meetup-talk-with-tianqi-chen/)).

* While Spark remains the mainstream data processing tool in most scenarios, more and more users are porting their RDD-based Spark programs to the [DataFrame/Dataset APIs](http://spark.apache.org/docs/latest/sql-programming-guide.html) for their well-designed interfaces to structured data and the performance improvements they bring.

* Spark itself has presented a clear roadmap in which DataFrame/Dataset is the foundation of its latest and future features, e.g. the latest version of the [ML pipeline](http://spark.apache.org/docs/latest/ml-guide.html) and [Structured Streaming](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).

Based on this feedback, we observed a gap between the original RDD-based XGBoost4J-Spark and users' latest usage scenarios as well as the future direction of the Spark ecosystem. To fill this gap, we started working on the <b><i>integration of XGBoost and the DataFrame/Dataset abstraction</i></b>.


## A Full Integration of XGBoost and DataFrame/Dataset

The following figure illustrates the new pipeline architecture with the latest XGBoost4J-Spark.

![XGBoost4J New Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline_new.png)

Unlike the previous version, users can now work with both the low- and high-level memory abstractions in Spark, i.e. RDD and DataFrame/Dataset. The DataFrame/Dataset abstraction lets the user manipulate structured datasets and apply Spark's built-in routines or user-defined functions before handing the data to XGBoost:

```scala
// load sales records saved in json files
val salesDF = spark.read.json("sales.json")
// call XGBoost API to train with the DataFrame-represented training set
val xgboostModel = XGBoost.trainWithDataFrame(
      salesDF, paramMap, numRound, nWorkers, useExternalMemory)
```
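
The snippet above assumes that paramMap, numRound, nWorkers, and useExternalMemory are defined elsewhere. A minimal sketch of such a setup might look like the following; the specific parameter values are purely illustrative:

```scala
// illustrative training configuration, passed through to XGBoost
val paramMap = Map(
  "eta" -> 0.1,                // learning rate
  "max_depth" -> 6,            // maximum depth of each tree
  "objective" -> "reg:linear") // regression objective
val numRound = 100             // number of boosting rounds
val nWorkers = 4               // number of parallel XGBoost workers
val useExternalMemory = false  // keep training data in memory
```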

By integrating with DataFrame/Dataset, XGBoost4J-Spark not only enables users to call the DataFrame/Dataset APIs directly but also makes DataFrame/Dataset-based Spark features, e.g. the ML package, available to XGBoost users.

### Integration with ML Package

Spark's ML package provides a set of convenient tools for feature extraction/transformation/selection. Additionally, with the model selection tools in the ML package, users can select the best model through an automatic parameter-search process, as sketched below.
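
One sketch of such a search uses Spark ML's TrainValidationSplit with a parameter grid. It assumes the XGBoostEstimator class shipped with XGBoost4J-Spark at the time, with the boosting round count and learning rate exposed as Spark ML params (round and eta); the features/sales columns and the trainingDF DataFrame are hypothetical:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator

// an XGBoost estimator reading the hypothetical "features" and "sales" columns
val xgbEstimator = new XGBoostEstimator(Map("objective" -> "reg:linear"))
  .setFeaturesCol("features")
  .setLabelCol("sales")

// search over the number of boosting rounds and the learning rate
// (assumes round and eta are exposed as Spark ML params by the estimator)
val paramGrid = new ParamGridBuilder()
  .addGrid(xgbEstimator.round, Array(20, 50))
  .addGrid(xgbEstimator.eta, Array(0.1, 0.4))
  .build()

// evaluate each candidate on a held-out split and keep the best model
val tv = new TrainValidationSplit()
  .setEstimator(xgbEstimator)
  .setEvaluator(new RegressionEvaluator().setLabelCol("sales"))
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)
val bestModel = tv.fit(trainingDF)
```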

#### Feature Extraction/Transformation/Selection

The following example shows a feature transformer that converts the string-typed storeType feature into the numeric storeTypeIndex. The transformed DataFrame is then used to train an XGBoost model.

```scala
import org.apache.spark.ml.feature.StringIndexer

// load sales records saved in json files
val salesDF = spark.read.json("sales.json")

// transform the string-represented storeType feature to numeric storeTypeIndex
val indexer = new StringIndexer()
  .setInputCol("storeType")
  .setOutputCol("storeTypeIndex")
// index storeType and drop the original string-typed column
val indexed = indexer.fit(salesDF).transform(salesDF).drop("storeType")

// use the transformed dataframe as training dataset
val xgboostModel = XGBoost.trainWithDataFrame(
      indexed, paramMap, numRound, nWorkers, useExternalMemory)
```
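
Since trainWithDataFrame returns a model that follows Spark ML's Transformer convention, scoring new data is a single transform call. A brief sketch, where testDF is a hypothetical hold-out DataFrame prepared in the same way as the training set:

```scala
// score a hold-out DataFrame with the trained model (testDF is hypothetical)
val predictions = xgboostModel.transform(testDF)
predictions.show()
```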

#### Pipelining

Spark's ML package allows the user to build a complete pipeline from feature extraction/transformation/selection to model training. We integrated XGBoost with the ML package so that XGBoost can be embedded into such a pipeline seamlessly. The following example chains the StringIndexer from the previous section and XGBoost training into a single pipeline:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator

// load sales records saved in json files
val salesDF = spark.read.json("sales.json")

// transform the string-represented storeType feature to numeric storeTypeIndex
val indexer = new StringIndexer()
  .setInputCol("storeType")
  .setOutputCol("storeTypeIndex")

// chain the indexer and XGBoost4J-Spark's XGBoostEstimator into one pipeline
val pipeline = new Pipeline().setStages(Array(indexer, new XGBoostEstimator(paramMap)))
val pipelineModel = pipeline.fit(salesDF)
```

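Once fitted, the pipeline model applies indexing and prediction to new data in one step. A short usage sketch, where newSalesDF is a hypothetical DataFrame with the same schema as sales.json:

```scala
// run the whole pipeline (indexing + XGBoost prediction) over fresh records
val predictions = pipelineModel.transform(newSalesDF)
```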