Alien-XGBoost
view release on metacpan - search on metacpan
view release on metacpan or search on metacpan
xgboost/doc/jvm/xgboost4j_full_integration.md view on Meta::CPAN
## Introduction
On March 2016, we released the first version of [XGBoost4J](http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), which is a set of packages providing Java/Scala interfaces of XGBoost and the integration ...
The integrations with Spark/Flink, a.k.a. <b>XGBoost4J-Spark</b> and <b>XGBoost-Flink</b>, receive the tremendous positive feedbacks from the community. It enables users to build a unified pipeline, embedding XGBoost into the data processing system ...
![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline.png)
In the last months, we have a lot of communication with the users and gain the deeper understanding of the users' latest usage scenario and requirements:
* XGBoost keeps gaining more and more deployments in the production environment and the adoption in machine learning competitions [Link](http://datascience.la/xgboost-workshop-and-meetup-talk-with-tianqi-chen/).
* While Spark is still the mainstream data processing tool in most of scenarios, more and more users are porting their RDD-based Spark programs to [DataFrame/Dataset APIs](http://spark.apache.org/docs/latest/sql-programming-guide.html) for the well-d...
* Spark itself has presented a clear roadmap that DataFrame/Dataset would be the base of the latest and future features, e.g. latest version of [ML pipeline](http://spark.apache.org/docs/latest/ml-guide.html) and [Structured Streaming](http://spark.a...
Based on these feedbacks from the users, we observe a gap between the original RDD-based XGBoost4J-Spark and the users' latest usage scenario as well as the future direction of Spark ecosystem. To fill this gap, we start working on the <b><i>integrat...
## A Full Integration of XGBoost and DataFrame/Dataset
The following figure illustrates the new pipeline architecture with the latest XGBoost4J-Spark.
![XGBoost4J New Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline_new.png)
Being different with the previous version, users are able to use both low- and high-level memory abstraction in Spark, i.e. RDD and DataFrame/Dataset. The DataFrame/Dataset abstraction grants the user to manipulate structured datasets and utilize the...
```scala
// load sales records saved in json files
val salesDF = spark.read.json("sales.json")
// call XGBoost API to train with the DataFrame-represented training set
val xgboostModel = XGBoost.trainWithDataFrame(
salesDF, paramMap, numRound, nWorkers, useExternalMemory)
```
By integrating with DataFrame/Dataset, XGBoost4J-Spark not only enables users to call DataFrame/Dataset APIs directly but also make DataFrame/Dataset-based Spark features available to XGBoost users, e.g. ML Package.
### Integration with ML Package
ML package of Spark provides a set of convenient tools for feature extraction/transformation/selection. Additionally, with the model selection tool in ML package, users can select the best model through an automatic parameter searching process which ...
#### Feature Extraction/Transformation/Selection
The following example shows a feature transformer which converts the string-typed storeType feature to the numeric storeTypeIndex. The transformed DataFrame is then fed to train XGBoost model.
```scala
import org.apache.spark.ml.feature.StringIndexer
// load sales records saved in json files
val salesDF = spark.read.json("sales.json")
// transform the string-represented storeType feature to numeric storeTypeIndex
val indexer = new StringIndexer()
.setInputCol("storeType")
.setOutputCol("storeTypeIndex")
// drop the extra column
val indexed = indexer.fit(salesDF).transform(df).drop("storeType")
// use the transformed dataframe as training dataset
val xgboostModel = XGBoost.trainWithDataFrame(
indexed, paramMap, numRound, nWorkers, useExternalMemory)
```
#### Pipelining
Spark ML package allows the user to build a complete pipeline from feature extraction/transformation/selection to model training. We integrate XGBoost with ML package and make it feasible to embed XGBoost into such a pipeline seamlessly. The followin...
```scala
import org.apache.spark.ml.feature.StringIndexer
// load sales records saved in json files
val salesDF = spark.read.json("sales.json")
// transform the string-represented storeType feature to numeric storeTypeIndex
val indexer = new StringIndexer()
view all matches for this distributionview release on metacpan - search on metacpan
( run in 2.373 seconds using v1.00-cache-2.02-grep-82fe00e-cpan-b63e86051f13 )