Distributed XGBoost YARN on AWS
===============================
This is a step-by-step tutorial on how to set up and run distributed [XGBoost](https://github.com/dmlc/xgboost)
on an AWS EC2 cluster. Distributed XGBoost runs on various platforms such as MPI, SGE and Hadoop YARN.
In this tutorial, we use YARN as an example since this is a widely used solution for distributed computing.

Prerequisites
-------------
We need an [AWS key pair](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html)
to access the AWS services. Let us assume that we are using a key ```mykey``` and the corresponding permission file ```mypem.pem```.

We also need [AWS credentials](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html),
which include an `ACCESS_KEY_ID` and a `SECRET_ACCESS_KEY`.

Finally, we will need an S3 bucket to host the data and the model, for example ```s3://mybucket/```.

Set Up a Hadoop YARN Cluster
----------------------------
This section shows how to start a Hadoop YARN cluster from scratch.
You can skip this step if you already have one.
We will be using [yarn-ec2](https://github.com/tqchen/yarn-ec2) to start the cluster.

We can first clone the yarn-ec2 script with the following command:
```bash
git clone https://github.com/tqchen/yarn-ec2
```

To use the script, we must set the environment variables `AWS_ACCESS_KEY_ID` and
`AWS_SECRET_ACCESS_KEY` properly. This can be done by adding the following two lines to
`~/.bashrc` (replacing the strings with the correct ones):

```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
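Before running the launch script, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch; `check_aws_env` is a hypothetical helper written for this tutorial and is not part of yarn-ec2. It only checks that the values are non-empty, not that AWS accepts them.

```bash
# Hypothetical helper (not part of yarn-ec2): report which of the two
# credential variables are missing or empty in the current shell.
check_aws_env() {
  missing=0
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ]; then
    echo "AWS_ACCESS_KEY_ID is not set" >&2
    missing=1
  fi
  if [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS_SECRET_ACCESS_KEY is not set" >&2
    missing=1
  fi
  # Returns 0 when both variables are set, 1 otherwise.
  return "$missing"
}
```

After opening a new terminal (or running `source ~/.bashrc`), calling `check_aws_env` should print nothing and return 0 when both variables are set.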

Now we can launch the master machine of the cluster on EC2:
```bash
./yarn-ec2 -k mykey -i mypem.pem launch xgboost
```
Wait a few minutes until the master machine is up.

After the master machine gets up, we can query the public DNS of the master machine using the following command.
```bash
./yarn-ec2 -k mykey -i mypem.pem get-master xgboost
```
It will show the public DNS of the master machine, like ```ec2-xx-xx-xx.us-west-2.compute.amazonaws.com```.
Now we can open a browser and enter the following address (replacing the DNS with the master's DNS):
```
ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088
```
This will show the job tracker of the YARN cluster. Note that we may have to wait a few minutes before the master finishes bootstrapping and starts the
job tracker.
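Instead of reloading the page by hand, the wait can be scripted. The sketch below polls port 8088 with `curl` until the page answers. `wait_for_ui`, its default retry count, and its default delay are assumptions made for illustration and are not part of yarn-ec2; it also assumes `curl` is installed on the local machine.

```bash
# Hypothetical helper: poll the master's web UI on port 8088 until it answers.
# Arguments: master DNS, max attempts (default 30), seconds between attempts (default 10).
wait_for_ui() {
  host=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failure, -o /dev/null: discard the page body
    if curl -sf -o /dev/null --max-time 5 "http://$host:8088/"; then
      echo "web UI is up at http://$host:8088"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up after $attempts attempts" >&2
  return 1
}
```

A typical call would be `wait_for_ui ec2-xx-xx-xx.us-west-2.compute.amazonaws.com`, with the placeholder replaced by the master's DNS.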

After the master machine is up, we can freely add more slave machines to the cluster.
The following command adds two m3.xlarge instances to the cluster.
```bash
./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addslave xgboost
```
We can also choose to add two spot instances:
```bash
./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addspot xgboost
```
The slave machines will start up, bootstrap, and report to the master.
You can check whether the slave machines are connected by clicking the Nodes link on the job tracker,
or simply open the following URL (replacing the DNS with the master's DNS):
```
ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088/cluster/nodes
```
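The same check can be done from a terminal. The sketch below assumes the cluster runs a Hadoop version whose ResourceManager exposes the standard REST endpoint `/ws/v1/cluster/nodes` on the same port 8088; this tutorial does not confirm that for yarn-ec2 clusters. Both function names are hypothetical, and counting with `grep` is a rough approximation rather than real JSON parsing.

```bash
# Hypothetical helper: fetch the node list as JSON from the ResourceManager.
# Argument: master DNS.
list_nodes_json() {
  curl -s -H 'Accept: application/json' "http://$1:8088/ws/v1/cluster/nodes"
}

# Hypothetical helper: read that JSON on stdin and count the nodes whose
# state is RUNNING. This is a plain text match, not a JSON parser.
count_running_nodes() {
  grep -o '"state" *: *"RUNNING"' | wc -l | tr -d ' '
}
```

For example, `list_nodes_json ec2-xx-xx-xx.us-west-2.compute.amazonaws.com | count_running_nodes` (with the master's DNS substituted) prints the number of nodes currently reported as running.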

Note that not all the links in the job tracker work.
This is because many of them use private AWS IP addresses, which can only be accessed from within EC2.
