alive results from the CPAN

Alien-XGBoost

#define DMLC_ARRAY_VIEW_H_

#include <vector>
#include <array>

namespace dmlc {

/*!
 * \brief Read-only data structure that references a contiguous memory region of an array.
 * Provides a unified view over vector, array and C-style array.
 * This data structure does not guarantee that the referenced array stays alive.
 *
 * Do not use array_view to capture data in async function closures.
 * Also, do not use array_view to reference a temporary data structure.
 *
 * \tparam ValueType The value
 *
 * \code
 *  std::vector<int> myvec{1,2,3};
 *  dmlc::array_view<int> view(myvec);
 *  // indexed visit to the view.

xgboost/dmlc-core/include/dmlc/lua.h

   * \return the corresponding c type.
   */
  template<typename T>
  inline T Get() const;
  /*!
   * \brief Get user data pointer from LuaRef
   *
   *  Be CAREFUL when getting userdata (e.g. a pointer to a Tensor's storage) from a LuaRef.
   *  Remember that it is managed by Lua and can be deleted once every
   *  LuaRef to the userdata is destructed. A good practice is to always hold a LuaRef that
   *  keeps the userdata alive while you need it on the C++ side.
   *
   * \tparam T the type of pointer to be fetched.
   * \return the corresponding c type.
   */
  template<typename T>
  inline T* GetUDataPtr() const;
  /*! \return whether the value is nil */
  inline bool is_nil() const;
  /*!
   * \brief invoke the LuaRef as function

xgboost/dmlc-core/tracker/dmlc_tracker/local.py

# pylint: disable=invalid-name
from __future__ import absolute_import

import sys
import os
import subprocess
import logging
from threading import Thread
from . import tracker

keepalive = """
nrep=0
rc=254
while [ $rc -ne 0 ];
do
    export DMLC_NUM_ATTEMPT=$nrep
    %s
    rc=$?;
    nrep=$((nrep+1));
done
"""

xgboost/dmlc-core/tracker/dmlc_tracker/local.py


    ntrial = 0
    while True:
        if os.name == 'nt':
            env['DMLC_NUM_ATTEMPT'] = str(ntrial)
            ret = subprocess.call(cmd, shell=True, env=env)
            if ret != 0:
                ntrial += 1
                continue
        else:
            bash = keepalive % (cmd)
            ret = subprocess.call(bash, shell=True, executable='bash', env=env)
        if ret == 0:
            logging.debug('Thread %d exit with 0', taskid)
            return
        else:
            if os.name == 'nt':
                sys.exit(-1)
            else:
                raise RuntimeError('Get nonzero return code=%d' % ret)
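
The retry pattern in the excerpt above (rerun the command until it exits with 0, exporting the attempt number as `DMLC_NUM_ATTEMPT`) can be sketched on its own as follows. This is a minimal illustration, not the tracker's code: the function name `run_until_success` and the `max_attempts` cap are assumptions added here, whereas the original loop retries without a limit.

```python
import os
import subprocess
import sys


def run_until_success(cmd, env=None, max_attempts=5):
    """Rerun cmd until it exits with 0.

    Returns the attempt index (0-based) on which the command succeeded.
    max_attempts is a hypothetical safety cap, not part of the original loop.
    """
    env = dict(os.environ if env is None else env)
    for attempt in range(max_attempts):
        # Tell the child process how many times it has been restarted.
        env['DMLC_NUM_ATTEMPT'] = str(attempt)
        ret = subprocess.call(cmd, env=env)
        if ret == 0:
            return attempt
    raise RuntimeError('command still failing after %d attempts' % max_attempts)


if __name__ == '__main__':
    # Child that fails on attempts 0 and 1 and succeeds from attempt 2 on.
    child = ("import os, sys; "
             "sys.exit(0 if int(os.environ['DMLC_NUM_ATTEMPT']) >= 2 else 1)")
    print(run_until_success([sys.executable, '-c', child]))
```

Passing the command as a list avoids `shell=True`; the excerpt uses a shell because it wraps the command in a bash script on non-Windows platforms.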

xgboost/jvm-packages/xgboost4j/src/main/scala/ml/dmlc/xgboost4j/scala/rabit/handler/RabitWorkerHandler.scala

  case object AwaitingCommand extends State
  // [3] Brokers connections between workers per ring/tree/parent link map.
  case object BuildingLinkMap extends State
  // [4] A transient state in which the worker reports the number of errors in establishing
  // connections to other peer workers. If no errors, transition to next state.
  case object AwaitingErrorCount extends State
  // [5] Waiting for the worker to report its port number for accepting connections from peer workers.
  // This port number information is later forwarded to linked workers.
  case object AwaitingPortNumber extends State
  // [6] Final state after completing the setup with the connecting worker. At this stage, the
  // worker will have closed the Tcp connection. The actor remains alive to handle messages from
  // peer actors representing workers with pending setups.
  case object SetupComplete extends State

  sealed trait DataField
  case object IntField extends DataField
  // an integer preceding the actual string
  case object StringField extends DataField
  case object IntSeqField extends DataField

  object DataStruct {

xgboost/rabit/doc/guide.md

* The other nodes wait in the call of the second Allreduce in order to help node 1 to recover.
* When node 1 restarts, it will call ```LoadCheckPoint```, and get the latest checkpoint from one of the existing nodes.
* Then node 1 can start from the latest checkpoint and continue running.
* When node 1 calls the first Allreduce again, as the other nodes already know the result, node 1 can get it from one of them.
* When node 1 reaches the second Allreduce, the other nodes find out that node 1 has caught up and they can continue the program normally.

This fault tolerance model is based on a key property of Allreduce and
Broadcast: all the nodes get the same result after calling Allreduce/Broadcast.
Because of this property, any node can record the results of past
Allreduce/Broadcast calls.  When a node recovers, it can fetch the lost
results from the surviving nodes and rebuild its model.

The checkpoint is introduced so that we can discard the recorded results of
Allreduce/Broadcast calls made before the latest checkpoint. This reduces the
memory used for backup.  The checkpoint of each node is a model defined by
users and can be split into 2 parts: a global model and a local model. The
global model is shared by all nodes and can be backed up by any node. The
local model of a node is replicated to some other nodes (selected using a ring
replication strategy).  The checkpoint is saved only in memory, without
touching the disk, which makes rabit programs more efficient.  The strategy of
rabit is different from the fail-restart strategy, where all the nodes restart
from the same checkpoint when any of them fails.  In rabit, all the surviving
nodes block in the Allreduce call and help the recovery.  To catch up, the
recovered node fetches its latest checkpoint, and the results of the
Allreduce/Broadcast calls made after that checkpoint, from some surviving nodes.

This is just a conceptual introduction to rabit's fault tolerance model. The actual implementation is more sophisticated,
and can deal with more complicated cases such as multiple node failures and node failure during the recovery phase.
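
The recovery idea described above can be illustrated with a toy, single-process simulation. This is only a sketch under simplifying assumptions: the `Node` class and the `allreduce_sum`, `checkpoint` and `recover` helpers are hypothetical names invented here and are not part of the rabit API, and there is no networking, ring replication or local model.

```python
class Node:
    """Toy stand-in for a worker: a checkpointed model plus recorded results."""

    def __init__(self, rank):
        self.rank = rank
        self.checkpoint = None  # latest global model checkpoint (kept in memory)
        self.history = []       # Allreduce results recorded since that checkpoint


def allreduce_sum(nodes, values):
    """Every node obtains the same sum and records it for later replay."""
    result = sum(values)
    for node in nodes:
        node.history.append(result)
    return result


def checkpoint(nodes, model):
    """Store the model on every node and discard the results recorded so far."""
    for node in nodes:
        node.checkpoint = model
        node.history.clear()


def recover(failed_rank, nodes):
    """Rebuild a restarted node from any surviving node's checkpoint and history."""
    donor = next(node for node in nodes if node.rank != failed_rank)
    fresh = Node(failed_rank)
    fresh.checkpoint = donor.checkpoint
    fresh.history = list(donor.history)
    return fresh


if __name__ == '__main__':
    nodes = [Node(rank) for rank in range(3)]
    checkpoint(nodes, model=10)
    allreduce_sum(nodes, [1, 2, 3])
    # Node 1 dies and restarts: it catches up from a surviving node.
    nodes[1] = recover(1, nodes)
    print(nodes[1].checkpoint, nodes[1].history)
```

Because every node holds the same Allreduce results, any survivor can act as the donor; the sketch simply picks the first one.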

xgboost/rabit/src/allreduce_base.cc

    }
    if (!match) all_links.push_back(r);
  }
  // close listening sockets
  sock_listen.Close();
  this->parent_index = -1;
  // setup tree links and ring structure
  tree_links.plinks.clear();
  for (size_t i = 0; i < all_links.size(); ++i) {
    utils::Assert(!all_links[i].sock.BadSocket(), "ReConnectLink: bad socket");
    // set the socket to non-blocking mode, enable TCP keepalive
    all_links[i].sock.SetNonBlock(true);
    all_links[i].sock.SetKeepAlive(true);
    if (tree_neighbors.count(all_links[i].rank) != 0) {
      if (all_links[i].rank == parent_rank) {
        parent_index = static_cast<int>(tree_links.plinks.size());
      }
      tree_links.plinks.push_back(&all_links[i]);
    }
    if (all_links[i].rank == prev_rank) ring_prev = &all_links[i];
    if (all_links[i].rank == next_rank) ring_next = &all_links[i];

xgboost/rabit/src/socket.h

 * \brief a wrapper of TCP socket that is hopefully cross-platform
 */
class TCPSocket : public Socket{
 public:
  // constructor
  TCPSocket(void) : Socket(INVALID_SOCKET) {
  }
  explicit TCPSocket(SOCKET sockfd) : Socket(sockfd) {
  }
  /*!
   * \brief enable/disable TCP keepalive
   * \param keepalive whether to set the keep alive option on
   */
  inline void SetKeepAlive(bool keepalive) {
    int opt = static_cast<int>(keepalive);
    if (setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE,
                   reinterpret_cast<char*>(&opt), sizeof(opt)) < 0) {
      Socket::Error("SetKeepAlive");
    }
  }
  /*!
   * \brief create the socket, call this before using socket
   * \param af domain
   */
  inline void Create(int af = PF_INET) {
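
For comparison, the same socket option that `SetKeepAlive` sets through `setsockopt` can be enabled from Python's standard `socket` module. This is a minimal sketch; the helper name `make_keepalive_socket` is an assumption made for illustration and does not appear in rabit.

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with SO_KEEPALIVE enabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Same option as in SetKeepAlive: SOL_SOCKET level, SO_KEEPALIVE flag.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


if __name__ == '__main__':
    sock = make_keepalive_socket()
    # A nonzero value means keepalive is on; the exact value is platform dependent.
    print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    sock.close()
```

Keepalive probe timing is controlled by separate, platform-specific options that this sketch leaves at their defaults.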

xgboost/rabit/test/README.md

====
This folder contains internal testcases that check the correctness and efficiency of the rabit API.

Example scripts for running the testcases are given in test.mk
* type ```make -f test.mk testcasename``` to run a given testcase


Helper Scripts
====
* test.mk: Makefile that documents all testcases
* keepalive.sh: helper bash script that restarts a program when it dies abnormally

List of Programs
====
* speed_test: test the running speed of rabit API
* test_local_recover: test recovery of local state when an error happens
* test_model_recover: test recovery of global state when an error happens

xgboost/rabit/test/test.mk

# this is a makefile used to show testcases of rabit
.PHONY: all

all: model_recover_10_10k  model_recover_10_10k_die_same model_recover_10_10k_die_hard local_recover_10_10k

# this experiment tests recovery with an actual process exit; keepalive is used to keep the program alive
model_recover_10_10k:
	../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 model_recover 10000 mock=0,0,1,0 mock=1,1,1,0

model_recover_10_10k_die_same:
	../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 model_recover 10000 mock=0,0,1,0 mock=1,1,1,0 mock=0,1,1,0 mock=4,1,1,0 mock=9,1,1,0

model_recover_10_10k_die_hard:
	../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 model_recover 10000 mock=0,0,1,0 mock=1,1,1,0 mock=1,1,1,1 mock=0,1,1,0 mock=4,1,1,0 mock=9,1,1,0 mock=8,1,2,0 mock=4,1,3,0

local_recover_10_10k:
