png results from the CPAN

Alien-XGBoost


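    use Alien::XGBoost;
    use FFI::Platypus;

    # Point FFI::Platypus at the shared library located (or built) by Alien::XGBoost
    my $ffi = FFI::Platypus->new;
    $ffi->lib(Alien::XGBoost->dynamic_libs);
    $ffi->attach(XGBGetLastError => [] => 'string');
    my $error = XGBGetLastError();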

=head1 DESCRIPTION

Alien package to find, and build if necessary, the XGBoost dynamic library.

=for markdown [![Build Status](https://travis-ci.org/pablrod/p5-Alien-XGBoost.png?branch=master)](https://travis-ci.org/pablrod/p5-Alien-XGBoost)
[![Build status](https://ci.appveyor.com/api/projects/status/nl46f64tjiyaveql/branch/master?svg=true)](https://ci.appveyor.com/project/pablrod/p5-alien-xgboost/branch/master)

This module is meant to be used by other modules that need the XGBoost
dynamic library to be available. I wrote it for L<AI::XGBoost>.

If you only want to use XGBoost in your Perl programs, just use
L<AI::XGBoost> and forget this module. If you want to write other XGBoost
wrappers or use it from XS, then continue reading.

For now, there is no support for compiling your modules against XGBoost.

xgboost/R-package/R/xgb.plot.tree.R view on Meta::CPAN

#' xgb.plot.tree(model = bst)
#' # plot only the first tree and display the node ID:
#' xgb.plot.tree(model = bst, trees = 0, show_node_id = TRUE)
#' 
#' \dontrun{
#' # Below is an example of how to save this plot to a file. 
#' # Note that for `export_graph` to work, the DiagrammeRsvg and rsvg packages must also be installed.
#' library(DiagrammeR)
#' gr <- xgb.plot.tree(model=bst, trees=0:1, render=FALSE)
#' export_graph(gr, 'tree.pdf', width=1500, height=1900)
#' export_graph(gr, 'tree.png', width=1500, height=1900)
#' }
#' 
#' @export
xgb.plot.tree <- function(feature_names = NULL, model = NULL, trees = NULL, plot_width = NULL, plot_height = NULL,
                          render = TRUE, show_node_id = FALSE, ...){
  check.deprecation(...)
  if (!inherits(model, "xgb.Booster")) {
    stop("model: Has to be an object of class xgb.Booster")
  }

xgboost/R-package/man/xgb.plot.tree.Rd view on Meta::CPAN

xgb.plot.tree(model = bst)
# plot only the first tree and display the node ID:
xgb.plot.tree(model = bst, trees = 0, show_node_id = TRUE)

\dontrun{
# Below is an example of how to save this plot to a file. 
# Note that for `export_graph` to work, the DiagrammeRsvg and rsvg packages must also be installed.
library(DiagrammeR)
gr <- xgb.plot.tree(model=bst, trees=0:1, render=FALSE)
export_graph(gr, 'tree.pdf', width=1500, height=1900)
export_graph(gr, 'tree.png', width=1500, height=1900)
}

}

xgboost/README.md view on Meta::CPAN

<img src=https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/xgboost.png width=135/>  eXtreme Gradient Boosting
===========
[![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost)
[![Build Status](https://ci.appveyor.com/api/projects/status/5ypa8vaed6kpmli8?svg=true)](https://ci.appveyor.com/project/tqchen/xgboost)
[![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](https://xgboost.readthedocs.org)
[![GitHub license](http://dmlc.github.io/img/apache2.svg)](./LICENSE)
[![CRAN Status Badge](http://www.r-pkg.org/badges/version/xgboost)](http://cran.r-project.org/web/packages/xgboost)
[![PyPI version](https://badge.fury.io/py/xgboost.svg)](https://pypi.python.org/pypi/xgboost/)
[![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

[Documentation](https://xgboost.readthedocs.org) |

xgboost/cub/README.md view on Meta::CPAN

  - Compatible with CUDA dynamic parallelism
- [<b><em>Block-wide "collective" primitives</em></b>](https://nvlabs.github.com/cub/group___block_module.html)
  - I/O, sort, prefix scan, reduction, histogram, etc.
  - Compatible with arbitrary thread block sizes and types
- [<b><em>Warp-wide "collective" primitives</em></b>](https://nvlabs.github.com/cub/group___warp_module.html)
  - Warp-wide prefix scan, reduction, etc.
  - Safe and architecture-specific
- [<b><em>Thread and resource utilities</em></b>](https://nvlabs.github.com/cub/group___thread_module.html)
  - PTX intrinsics, device reflection, texture-caching iterators, caching memory allocators, etc. 

![Orientation of collective primitives within the CUDA software stack](http://nvlabs.github.com/cub/cub_overview.png)

<br><hr>
<h3>A Simple Example</h3>

```C++
#include <cub/cub.cuh>
 
// Block-sorting CUDA kernel
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
    // ... (kernel body truncated in this excerpt)
}
```

xgboost/cub/cub/block/block_discontinuity.cuh view on Meta::CPAN

#include "../util_ptx.cuh"
#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {

/**
 * \brief The BlockDiscontinuity class provides [<em>collective</em>](index.html#sec0) methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. ![](discont_logo.png)
 * \ingroup BlockModule
 *
 * \tparam T                The data type to be flagged.
 * \tparam BLOCK_DIM_X      The thread block length in threads along the X dimension
 * \tparam BLOCK_DIM_Y      <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z      <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH         <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * - A set of "head flags" (or "tail flags") is often used to indicate corresponding items

xgboost/cub/cub/block/block_exchange.cuh view on Meta::CPAN

#include "../util_type.cuh"
#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {

/**
 * \brief The BlockExchange class provides [<em>collective</em>](index.html#sec0) methods for rearranging data partitioned across a CUDA thread block. ![](transpose_logo.png)
 * \ingroup BlockModule
 *
 * \tparam T                    The data type to be exchanged.
 * \tparam BLOCK_DIM_X          The thread block length in threads along the X dimension
 * \tparam ITEMS_PER_THREAD     The number of items partitioned onto each thread.
 * \tparam WARP_TIME_SLICING    <b>[optional]</b> When \p true, only use enough shared memory for a single warp's worth of tile data, time-slicing the block-wide exchange over multiple synchronized rounds.  Yields a smaller memory footprint at the ex...
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH             <b>[optional]</b> \ptxversion
 *

xgboost/cub/cub/block/block_histogram.cuh view on Meta::CPAN

};



/******************************************************************************
 * Block histogram
 ******************************************************************************/


/**
 * \brief The BlockHistogram class provides [<em>collective</em>](index.html#sec0) methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. ![](histogram_logo.png)
 * \ingroup BlockModule
 *
 * \tparam T                    The sample type being histogrammed (must be castable to an integer bin identifier)
 * \tparam BLOCK_DIM_X          The thread block length in threads along the X dimension
 * \tparam ITEMS_PER_THREAD     The number of items per thread
 * \tparam BINS                 The number of bins within the histogram
 * \tparam ALGORITHM            <b>[optional]</b> cub::BlockHistogramAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_HISTO_SORT)
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH             <b>[optional]</b> \ptxversion

xgboost/cub/cub/block/block_load.cuh view on Meta::CPAN

     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items loaded per thread.
     * - Provisions less shared memory temporary storage, but incurs larger
     *   latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.
     */
    BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED,
};


/**
 * \brief The BlockLoad class provides [<em>collective</em>](index.html#sec0) data movement methods for loading a linear segment of items from memory into a [<em>blocked arrangement</em>](index.html#sec5sec3) across a CUDA thread block.  ![](block_lo...
 * \ingroup BlockModule
 * \ingroup UtilIo
 *
 * \tparam InputT               The data type to read into (which must be convertible from the input iterator's value type).
 * \tparam BLOCK_DIM_X          The thread block length in threads along the X dimension
 * \tparam ITEMS_PER_THREAD     The number of consecutive items partitioned onto each thread.
 * \tparam ALGORITHM            <b>[optional]</b> cub::BlockLoadAlgorithm tuning policy.  default: cub::BLOCK_LOAD_DIRECT.
 * \tparam WARP_TIME_SLICING    <b>[optional]</b> Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage). (defa...
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)

xgboost/cub/cub/block/block_radix_sort.cuh view on Meta::CPAN

#include "../util_type.cuh"
#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {

/**
 * \brief The BlockRadixSort class provides [<em>collective</em>](index.html#sec0) methods for sorting items partitioned across a CUDA thread block using a radix sorting method.  ![](sorting_logo.png)
 * \ingroup BlockModule
 *
 * \tparam KeyT                 Key type
 * \tparam BLOCK_DIM_X          The thread block length in threads along the X dimension
 * \tparam ITEMS_PER_THREAD     The number of items per thread
 * \tparam ValueT               <b>[optional]</b> Value type (default: cub::NullType, which indicates a keys-only sort)
 * \tparam RADIX_BITS           <b>[optional]</b> The number of radix bits per digit place (default: 4 bits)
 * \tparam MEMOIZE_OUTER_SCAN   <b>[optional]</b> Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise).
 * \tparam INNER_SCAN_ALGORITHM <b>[optional]</b> The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)
 * \tparam SMEM_CONFIG          <b>[optional]</b> Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte)

xgboost/cub/cub/block/block_raking_layout.cuh view on Meta::CPAN

#include "../util_type.cuh"
#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {

/**
 * \brief BlockRakingLayout provides a conflict-free shared memory layout abstraction for 1D raking across thread block data.    ![](raking.png)
 * \ingroup BlockModule
 *
 * \par Overview
 * This type facilitates a shared memory usage pattern where a block of CUDA
 * threads places elements into shared memory and then reduces the active
 * parallelism to one "raking" warp of threads for serially aggregating consecutive
 * sequences of shared items.  Padding is inserted to eliminate bank conflicts
 * (for most data types).
 *
 * \tparam T                        The data type to be exchanged.
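
The padding idea in the overview above can be illustrated on the host. The sketch below is not taken from CUB: it assumes 32 shared-memory banks and one unused slot per raking segment, and the names `kBanks`, `padded_index` and `distinct_banks` are hypothetical. It only shows why shifting each segment by one slot spreads simultaneous accesses across banks.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed bank count for illustration only.
constexpr std::size_t kBanks = 32;

// Map a logical element index to a padded storage index by inserting one
// unused slot after every segment of `segment_length` elements.
std::size_t padded_index(std::size_t i, std::size_t segment_length) {
    assert(segment_length > 0);
    return i + i / segment_length;
}

// Count the distinct banks touched when every raking thread reads the first
// element of its own segment in the same step.
std::size_t distinct_banks(std::size_t raking_threads,
                           std::size_t segment_length,
                           bool padded) {
    std::set<std::size_t> banks;
    for (std::size_t t = 0; t < raking_threads; ++t) {
        const std::size_t logical = t * segment_length;
        const std::size_t slot =
            padded ? padded_index(logical, segment_length) : logical;
        banks.insert(slot % kBanks);
    }
    return banks.size();
}
```

With four raking threads and 32-element segments, the unpadded layout maps every segment start to the same bank, while the padded layout maps them to four different banks.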

xgboost/cub/cub/block/block_reduce.cuh view on Meta::CPAN

     * \par
     * Execution is comprised of three phases:
     * -# Upsweep sequential reduction in registers (if threads contribute more
     *    than one input each).  Threads in warps other than the first warp place
     *    their partial reductions into shared memory.
     * -# Upsweep sequential reduction in shared memory.  Threads within the first
     *    warp continue to accumulate by raking across segments of shared partial reductions
     * -# A warp-synchronous Kogge-Stone style reduction within the raking warp.
     *
     * \par
     * \image html block_reduce.png
     * <div class="centercaption">\p BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
     *
     * \par Performance Considerations
     * - This variant performs less communication than BLOCK_REDUCE_RAKING_NON_COMMUTATIVE
     *   and is preferable when the reduction operator is commutative.  This variant
     *   applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall
     *   throughput across the GPU when suitably occupied.  However, turn-around latency may be
     *   higher than with BLOCK_REDUCE_WARP_REDUCTIONS and thus less desirable
     *   when the GPU is under-occupied.
     */

xgboost/cub/cub/block/block_reduce.cuh view on Meta::CPAN

     * \par
     * Execution is comprised of three phases:
     * -# Upsweep sequential reduction in registers (if threads contribute more
     *    than one input each).  Each thread then places the partial reduction
     *    of its item(s) into shared memory.
     * -# Upsweep sequential reduction in shared memory.  Threads within a
     *    single warp rake across segments of shared partial reductions.
     * -# A warp-synchronous Kogge-Stone style reduction within the raking warp.
     *
     * \par
     * \image html block_reduce.png
     * <div class="centercaption">\p BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
     *
     * \par Performance Considerations
     * - This variant performs more communication than BLOCK_REDUCE_RAKING
     *   and is only preferable when the reduction operator is non-commutative.  This variant
     *   applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS, and can provide higher overall
     *   throughput across the GPU when suitably occupied.  However, turn-around latency may be
     *   higher than with BLOCK_REDUCE_WARP_REDUCTIONS and thus less desirable
     *   when the GPU is under-occupied.
     */
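
The three phases listed above can be walked through with a sequential host-side simulation. This is a sketch under stated assumptions, not CUB's implementation: threads are plain loop indices, "shared memory" is a `std::vector`, the operator is fixed to integer addition, and `raking_block_sum` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side simulation of a three-phase raking sum for one "thread block".
// `items` holds block_threads * items_per_thread values in blocked order.
int raking_block_sum(const std::vector<int>& items,
                     std::size_t block_threads,
                     std::size_t raking_threads) {
    assert(block_threads > 0 && raking_threads > 0);
    assert(block_threads % raking_threads == 0);
    assert(items.size() % block_threads == 0);
    const std::size_t items_per_thread = items.size() / block_threads;

    // Phase 1: each thread reduces its own items "in registers" and writes
    // one partial to the shared array.
    std::vector<int> shared(block_threads, 0);
    for (std::size_t t = 0; t < block_threads; ++t)
        for (std::size_t k = 0; k < items_per_thread; ++k)
            shared[t] += items[t * items_per_thread + k];

    // Phase 2: each raking thread serially reduces one contiguous segment.
    const std::size_t segment = block_threads / raking_threads;
    std::vector<int> raked(raking_threads, 0);
    for (std::size_t r = 0; r < raking_threads; ++r)
        for (std::size_t k = 0; k < segment; ++k)
            raked[r] += shared[r * segment + k];

    // Phase 3: log-step (Kogge-Stone style) reduction across the raking warp.
    // Lane i adds the value from lane i + offset; offsets double each step.
    for (std::size_t offset = 1; offset < raking_threads; offset *= 2) {
        std::vector<int> next = raked;
        for (std::size_t i = 0; i + offset < raking_threads; ++i)
            next[i] = raked[i] + raked[i + offset];
        raked = next;
    }
    return raked[0];  // lane 0 holds the block-wide aggregate
}
```

Calling it with 32 items, 16 threads and 4 raking threads mirrors the "16-thread threadblock and 4-thread raking warp" configuration in the caption above.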

xgboost/cub/cub/block/block_reduce.cuh view on Meta::CPAN

     * Execution is comprised of three phases:
     * -# Upsweep sequential reduction in registers (if threads contribute more
     *    than one input each).  Each thread then places the partial reduction
     *    of its item(s) into shared memory.
     * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style
     *    reduction within each warp.
     * -# A propagation phase where the warp reduction outputs in each warp are
     *    updated with the aggregate from each preceding warp.
     *
     * \par
     * \image html block_scan_warpscans.png
     * <div class="centercaption">\p BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
     *
     * \par Performance Considerations
     * - This variant applies more reduction operators than BLOCK_REDUCE_RAKING
     *   or BLOCK_REDUCE_RAKING_NON_COMMUTATIVE, which may result in lower overall
     *   throughput across the GPU.  However turn-around latency may be lower and
     *   thus useful when the GPU is under-occupied.
     */
    BLOCK_REDUCE_WARP_REDUCTIONS,
};


/******************************************************************************
 * Block reduce
 ******************************************************************************/

/**
 * \brief The BlockReduce class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block. ![](reduce_logo.png)
 * \ingroup BlockModule
 *
 * \tparam T                Data type being reduced
 * \tparam BLOCK_DIM_X      The thread block length in threads along the X dimension
 * \tparam ALGORITHM        <b>[optional]</b> cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_WARP_REDUCTIONS)
 * \tparam BLOCK_DIM_Y      <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z      <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH         <b>[optional]</b> \ptxversion
 *
 * \par Overview

xgboost/cub/cub/block/block_scan.cuh view on Meta::CPAN

    /**
     * \par Overview
     * An efficient "raking reduce-then-scan" prefix scan algorithm.  Execution is comprised of five phases:
     * -# Upsweep sequential reduction in registers (if threads contribute more than one input each).  Each thread then places the partial reduction of its item(s) into shared memory.
     * -# Upsweep sequential reduction in shared memory.  Threads within a single warp rake across segments of shared partial reductions.
     * -# A warp-synchronous Kogge-Stone style exclusive scan within the raking warp.
     * -# Downsweep sequential exclusive scan in shared memory.  Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output.
     * -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
     *
     * \par
     * \image html block_scan_raking.png
     * <div class="centercaption">\p BLOCK_SCAN_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
     *
     * \par Performance Considerations
     * - Although this variant may suffer longer turnaround latencies when the
     *   GPU is under-occupied, it can often provide higher overall throughput
     *   across the GPU when suitably occupied.
     */
    BLOCK_SCAN_RAKING,
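
The five phases of the raking reduce-then-scan description can be traced with a sequential host-side simulation. This is a sketch, not CUB's code: it assumes an exclusive prefix sum over `int`, models threads as loop indices and shared memory as a `std::vector`, and uses the hypothetical name `raking_exclusive_sum`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side simulation of a five-phase raking exclusive prefix sum.
// `items` holds block_threads * items_per_thread values in blocked order.
std::vector<int> raking_exclusive_sum(const std::vector<int>& items,
                                      std::size_t block_threads,
                                      std::size_t raking_threads) {
    assert(block_threads > 0 && raking_threads > 0);
    assert(block_threads % raking_threads == 0);
    assert(items.size() % block_threads == 0);
    const std::size_t ipt = items.size() / block_threads;
    const std::size_t segment = block_threads / raking_threads;

    // Phase 1: per-thread upsweep reduction into "shared memory".
    std::vector<int> shared(block_threads, 0);
    for (std::size_t t = 0; t < block_threads; ++t)
        for (std::size_t k = 0; k < ipt; ++k)
            shared[t] += items[t * ipt + k];

    // Phase 2: each raking thread reduces its segment of partials.
    std::vector<int> raked(raking_threads, 0);
    for (std::size_t r = 0; r < raking_threads; ++r)
        for (std::size_t k = 0; k < segment; ++k)
            raked[r] += shared[r * segment + k];

    // Phase 3: Kogge-Stone inclusive scan across the raking warp, then a
    // shift by one lane to make it exclusive.
    std::vector<int> inclusive = raked;
    for (std::size_t offset = 1; offset < raking_threads; offset *= 2) {
        std::vector<int> next = inclusive;
        for (std::size_t i = offset; i < raking_threads; ++i)
            next[i] = inclusive[i] + inclusive[i - offset];
        inclusive = next;
    }
    std::vector<int> seed(raking_threads, 0);
    for (std::size_t r = 1; r < raking_threads; ++r) seed[r] = inclusive[r - 1];

    // Phase 4: downsweep in shared memory, seeded with the warp-scan output.
    for (std::size_t r = 0; r < raking_threads; ++r) {
        int running = seed[r];
        for (std::size_t k = 0; k < segment; ++k) {
            const int partial = shared[r * segment + k];
            shared[r * segment + k] = running;  // exclusive prefix for that thread
            running += partial;
        }
    }

    // Phase 5: downsweep in registers, seeded with each thread's prefix.
    std::vector<int> out(items.size());
    for (std::size_t t = 0; t < block_threads; ++t) {
        int running = shared[t];
        for (std::size_t k = 0; k < ipt; ++k) {
            out[t * ipt + k] = running;
            running += items[t * ipt + k];
        }
    }
    return out;
}
```

Only phase 3 involves cross-lane communication in log-many steps. The other phases are purely sequential per thread, which is where the throughput advantage noted above comes from.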

xgboost/cub/cub/block/block_scan.cuh view on Meta::CPAN


    /**
     * \par Overview
     * A quick "tiled warpscans" prefix scan algorithm.  Execution is comprised of four phases:
     * -# Upsweep sequential reduction in registers (if threads contribute more than one input each).  Each thread then places the partial reduction of its item(s) into shared memory.
     * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style scan within each warp.
     * -# A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp.
     * -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
     *
     * \par
     * \image html block_scan_warpscans.png
     * <div class="centercaption">\p BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
     *
     * \par Performance Considerations
     * - Although this variant may suffer lower overall throughput across the
     *   GPU due to a heavy reliance on inefficient warpscans, it can
     *   often provide lower turnaround latencies when the GPU is under-occupied.
     */
    BLOCK_SCAN_WARP_SCANS,
};
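
For comparison, the "tiled warpscans" phases can be sketched in the same sequential style. The code below is an illustration only, not CUB's implementation. It assumes one item per thread (so the register upsweep and downsweep phases are no-ops), an inclusive sum over `int`, and the hypothetical names `warp_inclusive_scan` and `warpscans_inclusive_sum`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kogge-Stone inclusive scan over one warp's lanes: log2(n) steps, each lane
// adding the value held `offset` lanes to its left.
std::vector<int> warp_inclusive_scan(std::vector<int> lanes) {
    for (std::size_t offset = 1; offset < lanes.size(); offset *= 2) {
        std::vector<int> next = lanes;
        for (std::size_t i = offset; i < lanes.size(); ++i)
            next[i] = lanes[i] + lanes[i - offset];
        lanes = next;
    }
    return lanes;
}

// Scan each warp independently, then add the aggregate of all preceding
// warps (the "propagation phase").
std::vector<int> warpscans_inclusive_sum(const std::vector<int>& items,
                                         std::size_t warp_size) {
    assert(warp_size > 0 && items.size() % warp_size == 0);
    std::vector<int> out;
    out.reserve(items.size());
    int carry = 0;  // aggregate of all preceding warps
    for (std::size_t base = 0; base < items.size(); base += warp_size) {
        std::vector<int> lanes(
            items.begin() + static_cast<std::ptrdiff_t>(base),
            items.begin() + static_cast<std::ptrdiff_t>(base + warp_size));
        const std::vector<int> scanned = warp_inclusive_scan(lanes);
        for (int v : scanned) out.push_back(v + carry);
        carry += scanned.back();
    }
    return out;
}
```

Every lane performs an addition in every step, so the scan does more total work than a raking scan but finishes in few steps. That is the throughput-versus-latency trade-off described above.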


/******************************************************************************
 * Block scan
 ******************************************************************************/

/**
 * \brief The BlockScan class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. ![](block_scan_logo.png)
 * \ingroup BlockModule
 *
 * \tparam T                Data type being scanned
 * \tparam BLOCK_DIM_X      The thread block length in threads along the X dimension
 * \tparam ALGORITHM        <b>[optional]</b> cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING)
 * \tparam BLOCK_DIM_Y      <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z      <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH         <b>[optional]</b> \ptxversion
 *
 * \par Overview

xgboost/cub/cub/block/block_store.cuh view on Meta::CPAN

     *   of items written per thread.
     * - Provisions less shared memory temporary storage, but incurs larger
     *   latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.
     */
    BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED,

};


/**
 * \brief The BlockStore class provides [<em>collective</em>](index.html#sec0) data movement methods for writing a [<em>blocked arrangement</em>](index.html#sec5sec3) of items partitioned across a CUDA thread block to a linear segment of memory.  ![]...
 * \ingroup BlockModule
 * \ingroup UtilIo
 *
 * \tparam T                    The type of data to be written.
 * \tparam BLOCK_DIM_X          The thread block length in threads along the X dimension
 * \tparam ITEMS_PER_THREAD     The number of consecutive items partitioned onto each thread.
 * \tparam ALGORITHM            <b>[optional]</b> cub::BlockStoreAlgorithm tuning policy enumeration.  default: cub::BLOCK_STORE_DIRECT.
 * \tparam WARP_TIME_SLICING    <b>[optional]</b> Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any store-related data transpositions (versus each warp having its own storage). (defa...
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)

xgboost/cub/cub/device/device_histogram.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DeviceHistogram provides device-wide parallel operations for constructing histogram(s) from a sequence of sample data residing within device-accessible memory. ![](histogram_logo.png)
 * \ingroup SingleModule
 *
 * \par Overview
 * A <a href="http://en.wikipedia.org/wiki/Histogram"><em>histogram</em></a>
 * counts the number of observations that fall into each of the disjoint categories (known as <em>bins</em>).
 *
 * \par Usage Considerations
 * \cdp_class{DeviceHistogram}
 *
 */
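
As a plain reference for what "counting observations into disjoint bins" means, here is a sequential host-side sketch. It is not the DeviceHistogram API: it assumes equal-width bins over a half-open range, skips samples outside that range, and `histogram_even` is a hypothetical free function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count samples into `num_bins` equal-width bins covering [lower, upper).
// Samples outside the range are ignored in this sketch.
std::vector<std::size_t> histogram_even(const std::vector<double>& samples,
                                        std::size_t num_bins,
                                        double lower, double upper) {
    assert(num_bins > 0 && lower < upper);
    std::vector<std::size_t> bins(num_bins, 0);
    const double width = (upper - lower) / static_cast<double>(num_bins);
    for (double s : samples) {
        if (s < lower || s >= upper) continue;
        std::size_t b = static_cast<std::size_t>((s - lower) / width);
        if (b >= num_bins) b = num_bins - 1;  // guard against rounding at the top edge
        ++bins[b];
    }
    return bins;
}
```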

xgboost/cub/cub/device/device_partition.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DevicePartition provides device-wide, parallel operations for partitioning sequences of data items residing within device-accessible memory. ![](partition_logo.png)
 * \ingroup SingleModule
 *
 * \par Overview
 * These operations apply a selection criterion to construct a partitioned output sequence from items selected/unselected from
 * a specified input sequence.
 *
 * \par Usage Considerations
 * \cdp_class{DevicePartition}
 *
 * \par Performance
 * \linear_performance{partition}
 *
 * \par
 * The following chart illustrates DevicePartition::If
 * performance across different CUDA architectures for \p int32 items,
 * where 50% of the items are randomly selected for the first partition.
 * \plots_below
 *
 * \image html partition_if_int32_50_percent.png
 *
 */
struct DevicePartition
{
    /**
     * \brief Uses the \p d_flags sequence to split the corresponding items from \p d_in into a partitioned sequence \p d_out.  The total number of items copied into the first partition is written to \p d_num_selected_out. ![](partition_flags_logo.pn...
     *
     * \par
     * - The value type of \p d_flags must be castable to \p bool (e.g., \p bool, \p char, \p int, etc.).
     * - Copies of the selected items are compacted into \p d_out and maintain their original
     *   relative ordering, however copies of the unselected items are compacted into the
     *   rear of \p d_out in reverse order.
     * - \devicestorage
     *
     * \par Snippet
     * The code snippet below illustrates the compaction of items selected from an \p int device vector.
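
The ordering rule stated above (selected items keep their relative order, unselected items land at the rear of the output in reverse order) can be checked against a sequential host-side reference. This is neither CUB code nor the upstream snippet that the excerpt cuts off. `partition_flagged` is a hypothetical name and the function runs on ordinary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for the partitioning behaviour described above:
// flagged items go to the front in their original order, unflagged items fill
// the output from the back, which leaves them in reverse order.
// Returns the number of selected items.
std::size_t partition_flagged(const std::vector<int>& in,
                              const std::vector<char>& flags,
                              std::vector<int>& out) {
    assert(in.size() == flags.size());
    out.assign(in.size(), 0);
    std::size_t front = 0;
    std::size_t back = in.size();
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flags[i]) out[front++] = in[i];
        else          out[--back] = in[i];
    }
    return front;
}
```

Filling from both ends means a single output buffer of the input's size is enough, and the returned count marks the boundary between the two partitions.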

xgboost/cub/cub/device/device_partition.cuh view on Meta::CPAN

            d_num_selected_out,
            SelectOp(),
            EqualityOp(),
            num_items,
            stream,
            debug_synchronous);
    }


    /**
     * \brief Uses the \p select_op functor to split the corresponding items from \p d_in into a partitioned sequence \p d_out.  The total number of items copied into the first partition is written to \p d_num_selected_out. ![](partition_logo.png)
     *
     * \par
     * - Copies of the selected items are compacted into \p d_out and maintain their original
     *   relative ordering, however copies of the unselected items are compacted into the
     *   rear of \p d_out in reverse order.
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated partition-if performance across different
     * CUDA architectures for \p int32 and \p int64 items, respectively.  Items are
     * selected for the first partition with 50% probability.
     *
     * \image html partition_if_int32_50_percent.png
     * \image html partition_if_int64_50_percent.png
     *
     * \par
     * The following charts are similar, but 5% selection probability for the first partition:
     *
     * \image html partition_if_int32_5_percent.png
     * \image html partition_if_int64_5_percent.png
     *
     * \par Snippet
     * The code snippet below illustrates the compaction of items selected from an \p int device vector.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_partition.cuh>
     *
     * // Functor type for selecting values less than some criteria
     * struct LessThan
     * {

xgboost/cub/cub/device/device_radix_sort.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DeviceRadixSort provides device-wide, parallel operations for computing a radix sort across a sequence of data items residing within device-accessible memory. ![](sorting_logo.png)
 * \ingroup SingleModule
 *
 * \par Overview
 * The [<em>radix sorting method</em>](http://en.wikipedia.org/wiki/Radix_sort) arranges
 * items into ascending (or descending) order.  The algorithm relies upon a positional representation for
 * keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits,
 * characters, etc.) specified from least-significant to most-significant.  For a
 * given input sequence of keys and a set of rules specifying a total ordering
 * of the symbolic alphabet, the radix sorting method produces a lexicographic
 * ordering of those keys.
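
The least-significant-to-most-significant digit processing described above can be illustrated with a sequential host-side radix sort. This is a sketch and not DeviceRadixSort itself: it assumes unsigned 32-bit keys and uses a stable counting sort per digit. The name `lsd_radix_sort` and its `begin_bit`, `end_bit` and `radix_bits` parameters are hypothetical, chosen to illustrate sorting on a half-open bit subrange `[begin_bit, end_bit)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential least-significant-digit radix sort over key bits
// [begin_bit, end_bit). Each pass is a stable counting sort on one digit of
// at most `radix_bits` bits.
void lsd_radix_sort(std::vector<std::uint32_t>& keys,
                    unsigned begin_bit = 0, unsigned end_bit = 32,
                    unsigned radix_bits = 4) {
    assert(begin_bit <= end_bit && end_bit <= 32);
    assert(radix_bits > 0 && radix_bits <= 16);
    std::vector<std::uint32_t> scratch(keys.size());
    for (unsigned bit = begin_bit; bit < end_bit; bit += radix_bits) {
        const unsigned width =
            (end_bit - bit < radix_bits) ? end_bit - bit : radix_bits;
        const std::uint32_t mask = (std::uint32_t{1} << width) - 1;
        // Histogram of digit values, then a prefix sum to get bucket offsets.
        std::vector<std::size_t> offsets((std::size_t{1} << width) + 1, 0);
        for (std::uint32_t k : keys) ++offsets[((k >> bit) & mask) + 1];
        for (std::size_t d = 1; d < offsets.size(); ++d)
            offsets[d] += offsets[d - 1];
        // Stable scatter: keys with equal digits keep their relative order.
        for (std::uint32_t k : keys) scratch[offsets[(k >> bit) & mask]++] = k;
        keys.swap(scratch);
    }
}
```

Restricting the bit range reduces the number of passes, which is one way a bit subrange can lower sorting overhead when only some key bits differ.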

xgboost/cub/cub/device/device_radix_sort.cuh view on Meta::CPAN

 * that ensure lexicographic key ordering.
 *
 * \par Usage Considerations
 * \cdp_class{DeviceRadixSort}
 *
 * \par Performance
 * \linear_performance{radix sort} The following chart illustrates DeviceRadixSort::SortKeys
 * performance across different CUDA architectures for uniform-random \p uint32 keys.
 * \plots_below
 *
 * \image html lsb_radix_sort_int32_keys.png
 *
 */
struct DeviceRadixSort
{

    /******************************************************************//**
     * \name Key-value pairs
     *********************************************************************/
    //@{

xgboost/cub/cub/device/device_radix_sort.cuh view on Meta::CPAN

     * - The contents of the input data are not altered by the sorting operation
     * - An optional bit subrange <tt>[begin_bit, end_bit)</tt> of differentiating key bits can be specified.  This can reduce overall sorting overhead and yield a corresponding performance improvement.
     * - \devicestorageNP  For sorting using only <em>O</em>(<tt>P</tt>) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated sorting performance across different
     * CUDA architectures for uniform-random <tt>uint32,uint32</tt> and
     * <tt>uint64,uint64</tt> pairs, respectively.
     *
     * \image html lsb_radix_sort_int32_pairs.png
     * \image html lsb_radix_sort_int64_pairs.png
     *
     * \par Snippet
     * The code snippet below illustrates the sorting of a device vector of \p int keys
     * with associated vector of \p int values.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_radix_sort.cuh>
     *
     * // Declare, allocate, and initialize device-accessible pointers for sorting data
     * int  num_items;          // e.g., 7

xgboost/cub/cub/device/device_radix_sort.cuh view on Meta::CPAN

     *   specified and the targeted device architecture).
     * - An optional bit subrange <tt>[begin_bit, end_bit)</tt> of differentiating key bits can be specified.  This can reduce overall sorting overhead and yield a corresponding performance improvement.
     * - \devicestorageP
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated sorting performance across different
     * CUDA architectures for uniform-random <tt>uint32,uint32</tt> and
     * <tt>uint64,uint64</tt> pairs, respectively.
     *
     * \image html lsb_radix_sort_int32_pairs.png
     * \image html lsb_radix_sort_int64_pairs.png
     *
     * \par Snippet
     * The code snippet below illustrates the sorting of a device vector of \p int keys
     * with associated vector of \p int values.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_radix_sort.cuh>
     *
     * // Declare, allocate, and initialize device-accessible pointers for sorting data
     * int  num_items;          // e.g., 7

xgboost/cub/cub/device/device_radix_sort.cuh view on Meta::CPAN

     * \par
     * - The contents of the input data are not altered by the sorting operation
     * - An optional bit subrange <tt>[begin_bit, end_bit)</tt> of differentiating key bits can be specified.  This can reduce overall sorting overhead and yield a corresponding performance improvement.
     * - \devicestorageNP  For sorting using only <em>O</em>(<tt>P</tt>) temporary storage, see the sorting interface using DoubleBuffer wrappers below.
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated sorting performance across different
     * CUDA architectures for uniform-random \p uint32 and \p uint64 keys, respectively.
     *
     * \image html lsb_radix_sort_int32_keys.png
     * \image html lsb_radix_sort_int64_keys.png
     *
     * \par Snippet
     * The code snippet below illustrates the sorting of a device vector of \p int keys.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_radix_sort.cuh>
     *
     * // Declare, allocate, and initialize device-accessible pointers for sorting data
     * int  num_items;          // e.g., 7
     * int  *d_keys_in;         // e.g., [8, 6, 7, 5, 3, 0, 9]

xgboost/cub/cub/device/device_radix_sort.cuh view on Meta::CPAN

     *   now contains the sorted output sequence (a function of the number of key bits
     *   specified and the targeted device architecture).
     * - An optional bit subrange <tt>[begin_bit, end_bit)</tt> of differentiating key bits can be specified.  This can reduce overall sorting overhead and yield a corresponding performance improvement.
     * - \devicestorageP
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated sorting performance across different
     * CUDA architectures for uniform-random \p uint32 and \p uint64 keys, respectively.
     *
     * \image html lsb_radix_sort_int32_keys.png
     * \image html lsb_radix_sort_int64_keys.png
     *
     * \par Snippet
     * The code snippet below illustrates the sorting of a device vector of \p int keys.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_radix_sort.cuh>
     *
     * // Declare, allocate, and initialize device-accessible pointers for sorting data
     * int  num_items;          // e.g., 7
     * int  *d_key_buf;         // e.g., [8, 6, 7, 5, 3, 0, 9]

xgboost/cub/cub/device/device_reduce.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DeviceReduce provides device-wide, parallel operations for computing a reduction across a sequence of data items residing within device-accessible memory. ![](reduce_logo.png)
 * \ingroup SingleModule
 *
 * \par Overview
 * A <a href="http://en.wikipedia.org/wiki/Reduce_(higher-order_function)"><em>reduction</em></a> (or <em>fold</em>)
 * uses a binary combining operator to compute a single aggregate from a sequence of input elements.
 *
 * \par Usage Considerations
 * \cdp_class{DeviceReduce}
 *
 * \par Performance
 * \linear_performance{reduction, reduce-by-key, and run-length encode}
 *
 * \par
 * The following chart illustrates DeviceReduce::Sum
 * performance across different CUDA architectures for \p int32 keys.
 *
 * \image html reduce_int32.png
 *
 * \par
 * The following chart illustrates DeviceReduce::ReduceByKey (summation)
 * performance across different CUDA architectures for \p fp32
 * values.  Segments are identified by \p int32 keys, and have lengths uniformly sampled from [1,1000].
 *
 * \image html reduce_by_key_fp32_len_500.png
 *
 * \par
 * \plots_below
 *
 */
struct DeviceReduce
{
    /**
     * \brief Computes a device-wide reduction using the specified binary \p reduction_op functor and initial value \p init.
     *

xgboost/cub/cub/device/device_reduce.cuh view on Meta::CPAN

     *
     * \par
     * - Uses \p 0 as the initial value of the reduction.
     * - Does not support \p + operators that are non-commutative.
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated sum-reduction performance across different
     * CUDA architectures for \p int32 and \p int64 items, respectively.
     *
     * \image html reduce_int32.png
     * \image html reduce_int64.png
     *
     * \par Snippet
     * The code snippet below illustrates the sum-reduction of a device vector of \p int data elements.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>
     *
     * // Declare, allocate, and initialize device-accessible pointers for input and output
     * int  num_items;      // e.g., 7
     * int  *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]

xgboost/cub/cub/device/device_reduce.cuh view on Meta::CPAN

     *
     * \par
     * - The <tt>==</tt> equality operator is used to determine whether keys are equivalent
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate reduction-by-key (sum) performance across
     * different CUDA architectures for \p fp32 and \p fp64 values, respectively.  Segments
     * are identified by \p int32 keys, and have lengths uniformly sampled from [1,1000].
     *
     * \image html reduce_by_key_fp32_len_500.png
     * \image html reduce_by_key_fp64_len_500.png
     *
     * \par
     * The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
     *
     * \image html reduce_by_key_fp32_len_5.png
     * \image html reduce_by_key_fp64_len_5.png
     *
     * \par Snippet
     * The code snippet below illustrates the segmented reduction of \p int values grouped
     * by runs of associated \p int keys.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>
     *
     * // CustomMin functor
     * struct CustomMin
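
As a host-side illustration of the reduce-by-key semantics described above, the sketch below folds each run of consecutive equal keys into one aggregate. It is a sequential reference under stated assumptions, not the CUB API: the function name `reduce_by_key` and its return type are hypothetical, and `CustomMin` is written here only to mirror the functor name that the truncated snippet introduces.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side reference (not part of CUB): reduce each run of
// consecutive equal keys to one (key, aggregate) pair using reduction_op.
template <typename Key, typename Value, typename ReductionOp>
std::pair<std::vector<Key>, std::vector<Value>>
reduce_by_key(const std::vector<Key>& keys, const std::vector<Value>& values,
              ReductionOp reduction_op)
{
    std::vector<Key> unique_keys;
    std::vector<Value> aggregates;
    for (std::size_t i = 0; i < keys.size(); ++i)
    {
        if (i == 0 || !(keys[i] == keys[i - 1]))
        {
            // A new run starts here; keys are compared with ==
            unique_keys.push_back(keys[i]);
            aggregates.push_back(values[i]);
        }
        else
        {
            // Same run: fold the value into the running aggregate
            aggregates.back() = reduction_op(aggregates.back(), values[i]);
        }
    }
    return {unique_keys, aggregates};
}

// Minimum functor (an assumed definition, named after the snippet's CustomMin)
struct CustomMin
{
    template <typename T>
    T operator()(const T& a, const T& b) const { return (b < a) ? b : a; }
};
```

With keys `[0, 2, 2, 9, 5, 5, 5, 8]` and values `[0, 7, 1, 6, 2, 5, 3, 4]`, this yields unique keys `[0, 2, 9, 5, 8]` and per-run minima `[0, 1, 6, 2, 4]`.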

xgboost/cub/cub/device/device_run_length_encode.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DeviceRunLengthEncode provides device-wide, parallel operations for demarcating "runs" of same-valued items within a sequence residing within device-accessible memory. ![](run_length_encode_logo.png)
 * \ingroup SingleModule
 *
 * \par Overview
 * A <a href="http://en.wikipedia.org/wiki/Run-length_encoding"><em>run-length encoding</em></a>
 * computes a simple compressed representation of a sequence of input elements such that each
 * maximal "run" of consecutive same-valued data items is encoded as a single data value along with a
 * count of the elements in that run.
 *
 * \par Usage Considerations
 * \cdp_class{DeviceRunLengthEncode}
 *
 * \par Performance
 * \linear_performance{run-length encode}
 *
 * \par
 * The following chart illustrates DeviceRunLengthEncode::RunLengthEncode performance across
 * different CUDA architectures for \p int32 items.
 * Segments have lengths uniformly sampled from [1,1000].
 *
 * \image html rle_int32_len_500.png
 *
 * \par
 * \plots_below
 *
 */
struct DeviceRunLengthEncode
{

    /**
     * \brief Computes a run-length encoding of the sequence \p d_in.

xgboost/cub/cub/device/device_run_length_encode.cuh view on Meta::CPAN

     *   respectively.
     * - The total number of runs encountered is written to \p d_num_runs_out.
     * - The <tt>==</tt> equality operator is used to determine whether values are equivalent
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated encode performance across different
     * CUDA architectures for \p int32 and \p int64 items, respectively.  Segments have
     * lengths uniformly sampled from [1,1000].
     *
     * \image html rle_int32_len_500.png
     * \image html rle_int64_len_500.png
     *
     * \par
     * The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
     *
     * \image html rle_int32_len_5.png
     * \image html rle_int64_len_5.png
     *
     * \par Snippet
     * The code snippet below illustrates the run-length encoding of a sequence of \p int values.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_run_length_encode.cuh>
     *
     * // Declare, allocate, and initialize device-accessible pointers for input and output
     * int          num_items;          // e.g., 8
     * int          *d_in;              // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
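
To make the run-length encoding described above concrete, here is a minimal sequential sketch in plain C++. It is a host-side reference only, not the CUB device routine; the struct `RunLengthEncoding` and the function `run_length_encode` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type (not part of CUB)
struct RunLengthEncoding
{
    std::vector<int> unique_values;  // one entry per maximal run
    std::vector<int> counts;         // length of each run
};

// Sequential reference: encode each maximal run of equal consecutive items
RunLengthEncoding run_length_encode(const std::vector<int>& in)
{
    RunLengthEncoding out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (i > 0 && in[i] == in[i - 1])
        {
            ++out.counts.back();            // extend the current run
        }
        else
        {
            out.unique_values.push_back(in[i]);   // start a new run
            out.counts.push_back(1);
        }
    }
    return out;
}
```

For the example input `[0, 2, 2, 9, 5, 5, 5, 8]`, the unique values are `[0, 2, 9, 5, 8]`, the counts are `[1, 2, 1, 3, 1]`, and the number of runs is 5.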

xgboost/cub/cub/device/device_scan.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory. ![](device_scan.png)
 * \ingroup SingleModule
 *
 * \par Overview
 * Given a sequence of input elements and a binary reduction operator, a [<em>prefix scan</em>](http://en.wikipedia.org/wiki/Prefix_sum)
 * produces an output sequence where each element is computed to be the reduction
 * of the elements occurring earlier in the input sequence.  <em>Prefix sum</em>
 * connotes a prefix scan with the addition operator. The term \em inclusive indicates
 * that the <em>i</em><sup>th</sup> output reduction incorporates the <em>i</em><sup>th</sup> input.
 * The term \em exclusive indicates the <em>i</em><sup>th</sup> input is not incorporated into
 * the <em>i</em><sup>th</sup> output reduction.
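
The difference between inclusive and exclusive scans is easiest to see side by side. The sketch below is a sequential host-side reference in plain C++, not the DeviceScan API; the function names `inclusive_sum` and `exclusive_sum` are used here only for illustration.

```cpp
#include <vector>

// Inclusive prefix sum: output i incorporates input i
std::vector<int> inclusive_sum(const std::vector<int>& in)
{
    std::vector<int> out;
    out.reserve(in.size());
    int running = 0;
    for (int x : in)
    {
        running += x;            // add first, then emit
        out.push_back(running);
    }
    return out;
}

// Exclusive prefix sum: output i covers only inputs before i,
// so the first output is the initial value 0
std::vector<int> exclusive_sum(const std::vector<int>& in)
{
    std::vector<int> out;
    out.reserve(in.size());
    int running = 0;
    for (int x : in)
    {
        out.push_back(running);  // emit first, then add
        running += x;
    }
    return out;
}
```

For the input `[8, 6, 7, 5, 3, 0, 9]`, the inclusive sum is `[8, 14, 21, 26, 29, 29, 38]` and the exclusive sum is `[0, 8, 14, 21, 26, 29, 29]`.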

xgboost/cub/cub/device/device_scan.cuh view on Meta::CPAN

 * \cdp_class{DeviceScan}
 *
 * \par Performance
 * \linear_performance{prefix scan}
 *
 * \par
 * The following chart illustrates DeviceScan::ExclusiveSum
 * performance across different CUDA architectures for \p int32 keys.
 * \plots_below
 *
 * \image html scan_int32.png
 *
 */
struct DeviceScan
{
    /******************************************************************//**
     * \name Exclusive scans
     *********************************************************************/
    //@{

    /**
     * \brief Computes a device-wide exclusive prefix sum.  The value of 0 is applied as the initial value, and is assigned to *d_out.
     *
     * \par
     * - Supports non-commutative sum operators.
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated exclusive sum performance across different
     * CUDA architectures for \p int32 and \p int64 items, respectively.
     *
     * \image html scan_int32.png
     * \image html scan_int64.png
     *
     * \par Snippet
     * The code snippet below illustrates the exclusive prefix sum of an \p int device vector.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
     *
     * // Declare, allocate, and initialize device-accessible pointers for input and output
     * int  num_items;      // e.g., 7
     * int  *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]

xgboost/cub/cub/device/device_segmented_radix_sort.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory. ![](segmented_sorting_logo.png)
 * \ingroup SegmentedModule
 *
 * \par Overview
 * The [<em>radix sorting method</em>](http://en.wikipedia.org/wiki/Radix_sort) arranges
 * items into ascending (or descending) order.  The algorithm relies upon a positional representation for
 * keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits,
 * characters, etc.) specified from least-significant to most-significant.  For a
 * given input sequence of keys and a set of rules specifying a total ordering
 * of the symbolic alphabet, the radix sorting method produces a lexicographic
 * ordering of those keys.

xgboost/cub/cub/device/device_segmented_reduce.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DeviceSegmentedReduce provides device-wide, parallel operations for computing a reduction across multiple sequences of data items residing within device-accessible memory. ![](reduce_logo.png)
 * \ingroup SegmentedModule
 *
 * \par Overview
 * A <a href="http://en.wikipedia.org/wiki/Reduce_(higher-order_function)"><em>reduction</em></a> (or <em>fold</em>)
 * uses a binary combining operator to compute a single aggregate from a sequence of input elements.
 *
 * \par Usage Considerations
 * \cdp_class{DeviceSegmentedReduce}
 *
 */

xgboost/cub/cub/device/device_select.cuh view on Meta::CPAN

#include "../util_namespace.cuh"

/// Optional outer namespace(s)
CUB_NS_PREFIX

/// CUB namespace
namespace cub {


/**
 * \brief DeviceSelect provides device-wide, parallel operations for compacting selected items from sequences of data items residing within device-accessible memory. ![](select_logo.png)
 * \ingroup SingleModule
 *
 * \par Overview
 * These operations apply a selection criterion to selectively copy
 * items from a specified input sequence to a compact output sequence.
 *
 * \par Usage Considerations
 * \cdp_class{DeviceSelect}
 *
 * \par Performance
 * \linear_performance{select-flagged, select-if, and select-unique}
 *
 * \par
 * The following chart illustrates DeviceSelect::If
 * performance across different CUDA architectures for \p int32 items,
 * where 50% of the items are randomly selected.
 *
 * \image html select_if_int32_50_percent.png
 *
 * \par
 * The following chart illustrates DeviceSelect::Unique
 * performance across different CUDA architectures for \p int32 items
 * where segments have lengths uniformly sampled from [1,1000].
 *
 * \image html select_unique_int32_len_500.png
 *
 * \par
 * \plots_below
 *
 */
struct DeviceSelect
{
    /**
     * \brief Uses the \p d_flags sequence to selectively copy the corresponding items from \p d_in into \p d_out.  The total number of items selected is written to \p d_num_selected_out. ![](select_flags_logo.png)
     *
     * \par
     * - The value type of \p d_flags must be castable to \p bool (e.g., \p bool, \p char, \p int, etc.).
     * - Copies of the selected items are compacted into \p d_out and maintain their original relative ordering.
     * - \devicestorage
     *
     * \par Snippet
     * The code snippet below illustrates the compaction of items selected from an \p int device vector.
     * \par
     * \code

xgboost/cub/cub/device/device_select.cuh view on Meta::CPAN

            d_num_selected_out,
            SelectOp(),
            EqualityOp(),
            num_items,
            stream,
            debug_synchronous);
    }


    /**
     * \brief Uses the \p select_op functor to selectively copy items from \p d_in into \p d_out.  The total number of items selected is written to \p d_num_selected_out. ![](select_logo.png)
     *
     * \par
     * - Copies of the selected items are compacted into \p d_out and maintain their original relative ordering.
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated select-if performance across different
     * CUDA architectures for \p int32 and \p int64 items, respectively.  Items are
     * selected with 50% probability.
     *
     * \image html select_if_int32_50_percent.png
     * \image html select_if_int64_50_percent.png
     *
     * \par
     * The following charts are similar, but with 5% selection probability:
     *
     * \image html select_if_int32_5_percent.png
     * \image html select_if_int64_5_percent.png
     *
     * \par Snippet
     * The code snippet below illustrates the compaction of items selected from an \p int device vector.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>
     *
     * // Functor type for selecting values less than some criteria
     * struct LessThan
     * {

xgboost/cub/cub/device/device_select.cuh view on Meta::CPAN

            d_num_selected_out,
            select_op,
            EqualityOp(),
            num_items,
            stream,
            debug_synchronous);
    }


    /**
     * \brief Given an input sequence \p d_in having runs of consecutive equal-valued keys, only the first key from each run is selectively copied to \p d_out.  The total number of items selected is written to \p d_num_selected_out. ![](unique_logo.p...
     *
     * \par
     * - The <tt>==</tt> equality operator is used to determine whether keys are equivalent
     * - Copies of the selected items are compacted into \p d_out and maintain their original relative ordering.
     * - \devicestorage
     *
     * \par Performance
     * The following charts illustrate saturated select-unique performance across different
     * CUDA architectures for \p int32 and \p int64 items, respectively.  Segments have
     * lengths uniformly sampled from [1,1000].
     *
     * \image html select_unique_int32_len_500.png
     * \image html select_unique_int64_len_500.png
     *
     * \par
     * The following charts are similar, but with segment lengths uniformly sampled from [1,10]:
     *
     * \image html select_unique_int32_len_5.png
     * \image html select_unique_int64_len_5.png
     *
     * \par Snippet
     * The code snippet below illustrates the compaction of items selected from an \p int device vector.
     * \par
     * \code
     * #include <cub/cub.cuh>       // or equivalently <cub/device/device_select.cuh>
     *
     * // Declare, allocate, and initialize device-accessible pointers for input and output
     * int  num_items;              // e.g., 8
     * int  *d_in;                  // e.g., [0, 2, 2, 9, 5, 5, 5, 8]
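
The select-unique behaviour described above (keep only the first item of each run, preserving relative order) corresponds to what `std::unique_copy` does on the host. The following sketch is a sequential reference, not the DeviceSelect API; the wrapper name `select_unique` is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical host-side reference (not part of CUB): copy the first item of
// each run of consecutive equal items, compared with ==, in original order.
std::vector<int> select_unique(const std::vector<int>& in)
{
    std::vector<int> out;
    std::unique_copy(in.begin(), in.end(), std::back_inserter(out));
    return out;
}
```

For the example input `[0, 2, 2, 9, 5, 5, 5, 8]`, the result is `[0, 2, 9, 5, 8]`, so the number of selected items is 5. Note that only adjacent duplicates are collapsed; equal items separated by other values are all kept.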

xgboost/cub/cub/util_ptx.cuh view on Meta::CPAN

    unsigned int ret;
    asm volatile("mov.u32 %0, %%lanemask_ge;" : "=r"(ret) );
    return ret;
}

/** @} */       // end group UtilPtx



/**
 * \brief Shuffle-up for any data type.  Each <em>warp-lane<sub>i</sub></em> obtains the value \p input contributed by <em>warp-lane</em><sub><em>i</em>-<tt>src_offset</tt></sub>.  For thread lanes \e i < src_offset, the thread's own \p input is retu...
 * \ingroup WarpModule
 *
 * \par
 * - Available only for SM3.0 or newer
 *
 * \par Snippet
 * The code snippet below illustrates each thread obtaining a \p double value from the
 * predecessor of its predecessor.
 * \par
 * \code

xgboost/cub/cub/util_ptx.cuh view on Meta::CPAN

    {
        shuffle_word       = SHFL_UP_SYNC((unsigned int)input_alias[WORD], src_offset, first_lane, member_mask);
        output_alias[WORD] = shuffle_word;
    }

    return output;
}
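
The shuffle-up data movement documented above can be modelled on the host. The sketch below is a plain C++ simulation under simplifying assumptions, not CUB code: element `i` of the vector stands for the value contributed by warp-lane `i`, the first lane is taken to be lane 0, and the name `simulate_shuffle_up` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model of one warp (not CUB code): element i of
// `inputs` stands for the value contributed by warp-lane i.
template <typename T>
std::vector<T> simulate_shuffle_up(const std::vector<T>& inputs, std::size_t src_offset)
{
    std::vector<T> outputs(inputs.size());
    for (std::size_t lane = 0; lane < inputs.size(); ++lane)
    {
        // Lanes below src_offset have no source lane and keep their own input;
        // every other lane reads from lane - src_offset
        outputs[lane] = (lane < src_offset) ? inputs[lane] : inputs[lane - src_offset];
    }
    return outputs;
}
```

With four lanes holding `[1.0, 2.0, 3.0, 4.0]` and `src_offset = 2`, lanes 0 and 1 keep their own values and lanes 2 and 3 receive the values of lanes 0 and 1, giving `[1.0, 2.0, 1.0, 2.0]`.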


/**
 * \brief Shuffle-down for any data type.  Each <em>warp-lane<sub>i</sub></em> obtains the value \p input contributed by <em>warp-lane</em><sub><em>i</em>+<tt>src_offset</tt></sub>.  For thread lanes \e i >= WARP_THREADS, the thread's own \p input is...
 * \ingroup WarpModule
 *
 * \par
 * - Available only for SM3.0 or newer
 *
 * \par Snippet
 * The code snippet below illustrates each thread obtaining a \p double value from the
 * successor of its successor.
 * \par
 * \code

xgboost/cub/cub/util_ptx.cuh view on Meta::CPAN

        output_alias[WORD] = shuffle_word;
    }

    return output;
}


/**
 * \brief Shuffle-broadcast for any data type.  Each <em>warp-lane<sub>i</sub></em> obtains the value \p input
 * contributed by <em>warp-lane</em><sub><tt>src_lane</tt></sub>.  For \p src_lane < 0 or \p src_lane >= WARP_THREADS,
 * the thread's own \p input is returned to the thread. ![](shfl_broadcast_logo.png)
 *
 * \ingroup WarpModule
 *
 * \par
 * - Available only for SM3.0 or newer
 *
 * \par Snippet
 * The code snippet below illustrates each thread obtaining a \p double value from <em>warp-lane</em><sub>0</sub>.
 *
 * \par

xgboost/cub/cub/warp/warp_reduce.cuh view on Meta::CPAN

/// CUB namespace
namespace cub {


/**
 * \addtogroup WarpModule
 * @{
 */

/**
 * \brief The WarpReduce class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread warp. ![](warp_reduce_logo.png)
 *
 * \tparam T                        The reduction input/output element type
 * \tparam LOGICAL_WARP_THREADS     <b>[optional]</b> The number of threads per "logical" warp (may be less than the number of hardware warp threads).  Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM20).
 * \tparam PTX_ARCH                 <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * - A <a href="http://en.wikipedia.org/wiki/Reduce_(higher-order_function)"><em>reduction</em></a> (or <em>fold</em>)
 *   uses a binary combining operator to compute a single aggregate from a list of input elements.
 * - Supports "logical" warps smaller than the physical warp size (e.g., logical warps of 8 threads)
 * - The number of entrant threads must be a multiple of \p LOGICAL_WARP_THREADS

xgboost/cub/cub/warp/warp_scan.cuh view on Meta::CPAN


/// CUB namespace
namespace cub {

/**
 * \addtogroup WarpModule
 * @{
 */

/**
 * \brief The WarpScan class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel prefix scan of items partitioned across a CUDA thread warp.  ![](warp_scan_logo.png)
 *
 * \tparam T                        The scan input/output element type
 * \tparam LOGICAL_WARP_THREADS     <b>[optional]</b> The number of threads per "logical" warp (may be less than the number of hardware warp threads).  Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g...
 * \tparam PTX_ARCH                 <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * - Given a list of input elements and a binary reduction operator, a [<em>prefix scan</em>](http://en.wikipedia.org/wiki/Prefix_sum)
 *   produces an output list where each element is computed to be the reduction
 *   of the elements occurring earlier in the input list.  <em>Prefix sum</em>
 *   connotes a prefix scan with the addition operator. The term \em inclusive indicates

xgboost/dmlc-core/doc/Doxyfile view on Meta::CPAN

TREEVIEW_WIDTH         = 250

# When the EXT_LINKS_IN_WINDOW option is set to YES doxygen will open
# links to external symbols imported via tag files in a separate window.

EXT_LINKS_IN_WINDOW    = NO

# Use this tag to change the font size of Latex formulas included
# as images in the HTML documentation. The default is 10. Note that
# when you change the font size after a successful doxygen run you need
# to manually remove any form_*.png images from the HTML output directory
# to force them to be regenerated.

FORMULA_FONTSIZE       = 10

# Use the FORMULA_TRANSPARENT tag to determine whether or not the images
# generated for formulas are transparent PNGs. Transparent PNGs are
# not supported properly for IE 6.0, but are supported on all modern browsers.
# Note that when changing this option you need to delete any form_*.png files
# in the HTML output before the changes have effect.

FORMULA_TRANSPARENT    = YES

# Enable the USE_MATHJAX option to render LaTeX formulas using MathJax
# (see http://www.mathjax.org) which uses client side Javascript for the
# rendering instead of using prerendered bitmaps. Use this if you do not
# have LaTeX installed or if you want formulas to look prettier in the HTML
# output. When enabled you also need to install MathJax separately and
# configure the path to it using the MATHJAX_RELPATH option.

xgboost/dmlc-core/doc/Doxyfile view on Meta::CPAN

GRAPHICAL_HIERARCHY    = YES

# If the DIRECTORY_GRAPH, SHOW_DIRECTORIES and HAVE_DOT tags are set to YES
# then doxygen will show the dependencies a directory has on other directories
# in a graphical way. The dependency relations are determined by the #include
# relations between the files in the directories.

DIRECTORY_GRAPH        = YES

# The DOT_IMAGE_FORMAT tag can be used to set the image format of the images
# generated by dot. Possible values are svg, png, jpg, or gif.
# If left blank png will be used. If you choose svg you need to set
# HTML_FILE_EXTENSION to xhtml in order to make the SVG files
# visible in IE 9+ (other browsers do not have this requirement).

DOT_IMAGE_FORMAT       = png

# If DOT_IMAGE_FORMAT is set to svg, then this option can be set to YES to
# enable generation of interactive SVG images that allow zooming and panning.
# Note that this requires a modern browser other than Internet Explorer.
# Tested and working are Firefox, Chrome, Safari, and Opera. For IE 9+ you
# need to set HTML_FILE_EXTENSION to xhtml in order to make the SVG files
# visible. Older versions of IE do not have SVG support.

INTERACTIVE_SVG        = NO

xgboost/doc/Doxyfile view on Meta::CPAN


# When the EXT_LINKS_IN_WINDOW option is set to YES doxygen will open links to
# external symbols imported via tag files in a separate window.
# The default value is: NO.
# This tag requires that the tag GENERATE_HTML is set to YES.

EXT_LINKS_IN_WINDOW    = NO

# Use this tag to change the font size of LaTeX formulas included as images in
# the HTML documentation. When you change the font size after a successful
# doxygen run you need to manually remove any form_*.png images from the HTML
# output directory to force them to be regenerated.
# Minimum value: 8, maximum value: 50, default value: 10.
# This tag requires that the tag GENERATE_HTML is set to YES.

FORMULA_FONTSIZE       = 10

# Use the FORMULA_TRANSPARENT tag to determine whether or not the images
# generated for formulas are transparent PNGs. Transparent PNGs are not
# supported properly for IE 6.0, but are supported on all modern browsers.
#
# Note that when changing this option you need to delete any form_*.png files in
# the HTML output directory before the changes have effect.
# The default value is: YES.
# This tag requires that the tag GENERATE_HTML is set to YES.

FORMULA_TRANSPARENT    = YES

# Enable the USE_MATHJAX option to render LaTeX formulas using MathJax (see
# http://www.mathjax.org) which uses client side Javascript for the rendering
# instead of using prerendered bitmaps. Use this if you do not have LaTeX
# installed or if you want formulas to look prettier in the HTML output. When

xgboost/doc/Doxyfile view on Meta::CPAN

# The default value is: YES.
# This tag requires that the tag HAVE_DOT is set to YES.

DIRECTORY_GRAPH        = YES

# The DOT_IMAGE_FORMAT tag can be used to set the image format of the images
# generated by dot.
# Note: If you choose svg you need to set HTML_FILE_EXTENSION to xhtml in order
# to make the SVG files visible in IE 9+ (other browsers do not have this
# requirement).
# Possible values are: png, png:cairo, png:cairo:cairo, png:cairo:gd, png:gd,
# png:gd:gd, jpg, jpg:cairo, jpg:cairo:gd, jpg:gd, jpg:gd:gd, gif, gif:cairo,
# gif:cairo:gd, gif:gd, gif:gd:gd and svg.
# The default value is: png.
# This tag requires that the tag HAVE_DOT is set to YES.

DOT_IMAGE_FORMAT       = png

# If DOT_IMAGE_FORMAT is set to svg, then this option can be set to YES to
# enable generation of interactive SVG images that allow zooming and panning.
#
# Note that this requires a modern browser other than Internet Explorer. Tested
# and working are Firefox, Chrome, Safari, and Opera.
# Note: For IE 9+ you need to set HTML_FILE_EXTENSION to xhtml in order to make
# the SVG files visible. Older versions of IE do not have SVG support.
# The default value is: NO.
# This tag requires that the tag HAVE_DOT is set to YES.

xgboost/doc/_static/xgboost-theme/layout.html view on Meta::CPAN

    {%- endif %}
    {%- if next %}
    <link rel="next" title="{{ next.title|striptags|e }}" href="{{ next.link|e }}" />
    {%- endif %}
    {%- if prev %}
    <link rel="prev" title="{{ prev.title|striptags|e }}" href="{{ prev.link|e }}" />
    {%- endif %}
{%- endblock %}
{%- block extrahead %} {% endblock %}

    <link rel="icon" type="image/png" href="https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/image/mxnet-icon.png">
  </head>
  <body role="document">
    {%- include "navbar.html" %}

    {% if pagename != 'index' %}
    <div class="container">
      <div class="row">
        {{ sidebar() }}
        <div class="content">
          {% block body %} {% endblock %}

xgboost/doc/jvm/xgboost4j-intro.md view on Meta::CPAN

We have designed the API to be portable, so it **can be easily ported to other Dataflow frameworks provided by the Cloud**.
XGBoost4J shares its core with the other XGBoost libraries, which means data scientists can use R/Python
to read and visualize a model trained in a distributed fashion.
It also means that users can start with the single-machine version for exploration,
which can already handle hundreds of millions of examples.

## System Overview

In the following Figure, we describe the overall architecture of XGBoost4J. XGBoost4J provides the Java/Scala API calling the core functionality of XGBoost library. Most importantly, it not only supports the single-machine model training, but also pr...

![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/xgboost4j.png)


By calling the XGBoost4J API, users can scale the model training to the cluster. XGBoost4J calls the running instance of XGBoost worker in Spark/Flink task and run them across the cluster. The communication among the distributed model training tasks ...

With the abstraction of XGBoost4J, users can build a unified data analytics application spanning Extract-Transform-Load, data exploration, machine learning model training, and the final data product service. The following figure illustrates an e...

![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline.png)


## Single-machine Training Walk-through

In this section, we will walk through the APIs of XGBoost4J by example.
We use Scala for the demonstrations, but a complete API is also available for Java users.

To start the model training and evaluation, we need to prepare the training and test set:

```scala
