Alien-XGBoost

 view release on metacpan or  search on metacpan

xgboost/NEWS.md  view on Meta::CPAN

  - Remove most of the raw pointers to smart ptrs, for RAII safety.
* Add official option to approximate algorithm `tree_method` to parameter.
  - Change default behavior to switch to prefer faster algorithm.
  - User will get a message when approximate algorithm is chosen.
* Change library name to libxgboost.so
* Backward compatiblity
  - The binary buffer file is not backward compatible with previous version.
  - The model file is backward compatible on 64 bit platforms.
* The model file is compatible between 64/32 bit platforms(not yet tested).
* External memory version and other advanced features will be exposed to R library as well on linux.
  - Previously some of the features are blocked due to C++11 and threading limits.
  - The windows version is still blocked due to Rtools do not support ```std::thread```.
* rabit and dmlc-core are maintained through git submodule
  - Anyone can open PR to update these dependencies now.
* Improvements
  - Rabit and xgboost libs are not thread-safe and use thread local PRNGs
  - This could fix some of the previous problem which runs xgboost on multiple threads.
* JVM Package
  - Enable xgboost4j for java and scala
  - XGBoost distributed now runs on Flink and Spark.
* Support model attributes listing for meta data.
  - https://github.com/dmlc/xgboost/pull/1198

xgboost/cub/CHANGE_LOG.TXT  view on Meta::CPAN

        - Issue #47: Caching allocator needs to clean up cuda error upon successful retry 
        - Issue #46: Very high amount of needed memory from the cub::DeviceHistogram::HistogramEven routine
        - Issue #45: Caching Device Allocator fails with debug output enabled
        - Fix for generic-type reduce-by-key warpscan (sm3.x and newer)

//-----------------------------------------------------------------------------

1.5.2    03/21/2016
	- Improved medium-size scan performance for sm5x (Maxwell)
    - Refactored caching allocator for device memory
   		- Spends less time locked
		- Failure to allocate a block from the runtime will retry once after
		  freeing cached allocations
		- Now respects max-bin (issue where blocks in excess of max-bin were
		  still being retained in free cache)
		- Uses C++11 mutex when available
    - Bug fixes: 
        - Fix for generic-type reduce-by-key warpscan (sm3.x and newer)
          
//-----------------------------------------------------------------------------

xgboost/cub/README.md  view on Meta::CPAN

 
     // Allocate shared memory
     __shared__ union {
         typename BlockRadixSort::TempStorage  sort;
         typename BlockLoad::TempStorage       load; 
         typename BlockStore::TempStorage      store; 
     } temp_storage; 

     int block_offset = blockIdx.x * (128 * 16);	  // OffsetT for this block's ment

     // Obtain a segment of 2048 consecutive keys that are blocked across threads
     int thread_keys[16];
     BlockLoad(temp_storage.load).Load(d_in + block_offset, thread_keys);
     __syncthreads();

     // Collectively sort the keys
     BlockRadixSort(temp_storage.sort).Sort(thread_keys);
     __syncthreads();

     // Store the sorted segment 
     BlockStore(temp_storage.store).Store(d_out + block_offset, thread_keys);

xgboost/cub/README.md  view on Meta::CPAN


Each thread block uses cub::BlockRadixSort to collectively sort 
its own input segment.  The class is specialized by the 
data type being sorted, by the number of threads per block, by the number of 
keys per thread, and implicitly by the targeted compilation architecture.  

The cub::BlockLoad and cub::BlockStore classes are similarly specialized.    
Furthermore, to provide coalesced accesses to device memory, these primitives are 
configured to access memory using a striped access pattern (where consecutive threads 
simultaneously access consecutive items) and then <em>transpose</em> the keys into 
a [<em>blocked arrangement</em>](index.html#sec4sec3) of elements across threads. 

Once specialized, these classes expose opaque \p TempStorage member types.  
The thread block uses these storage types to statically allocate the union of 
shared memory needed by the thread block.  (Alternatively these storage types 
could be aliased to global memory allocations).

<br><hr>
<h3>Stable Releases</h3>

CUB releases are labeled using version identifiers having three fields: 

xgboost/cub/cub/agent/agent_spmv_orig.cuh  view on Meta::CPAN

            ValueT  value               = wd_values[tile_start_coord.y + nonzero_idx];

            ValueT  vector_value        = spmv_params.t_vector_x[column_idx];
#if (CUB_PTX_ARCH >= 350)
            vector_value                = wd_vector_x[column_idx];
#endif

            nonzeros[ITEM]              = value * vector_value;
        }

        // Exchange striped->blocked
        BlockExchangeT(temp_storage.exchange).StripedToBlocked(nonzeros);

        CTA_SYNC();

        // Compute an inclusive prefix sum
        BlockPrefixSumT(temp_storage.prefix_sum).InclusiveSum(nonzeros, nonzeros);

        CTA_SYNC();

        if (threadIdx.x == 0)
            s_tile_nonzeros[0] = 0.0;

xgboost/cub/cub/agent/agent_spmv_row_based.cuh  view on Meta::CPAN

            row_start = temp_storage.nonzeros[local_row_nonzero_idx];
            temp_storage.nonzeros[local_row_nonzero_idx] = NAN_TOKEN;
        }

        CTA_SYNC();

        //
        // Segmented scan
        //

        // Read strip of nonzeros into thread-blocked order, setup segment flags
        KeyValuePairT scan_items[NNZ_PER_THREAD];
        for (int ITEM = 0; ITEM < NNZ_PER_THREAD; ++ITEM)
        {
            int     local_nonzero_idx   = (threadIdx.x * NNZ_PER_THREAD) + ITEM;
            ValueT  value               = temp_storage.nonzeros[local_nonzero_idx];
            bool    is_nan              = (value != value);

            scan_items[ITEM].value  = (is_nan) ? 0.0 : value;
            scan_items[ITEM].key    = is_nan;
        }

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

 * \tparam T                The data type to be flagged.
 * \tparam BLOCK_DIM_X      The thread block length in threads along the X dimension
 * \tparam BLOCK_DIM_Y      <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z      <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH         <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * - A set of "head flags" (or "tail flags") is often used to indicate corresponding items
 *   that differ from their predecessors (or successors).  For example, head flags are convenient
 *   for demarcating disjoint data segments as part of a segmented scan or reduction.
 * - \blocked
 *
 * \par Performance Considerations
 * - \granularity
 *
 * \par A Simple Example
 * \blockcollective{BlockDiscontinuity}
 * \par
 * The code snippet below illustrates the head flagging of 512 integer items that
 * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
 * where each thread owns 4 consecutive items.
 * \par
 * \code
 * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
 *
 * __global__ void ExampleKernel(...)
 * {
 *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
 *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
 *
 *     // Allocate shared memory for BlockDiscontinuity
 *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
 *
 *     // Obtain a segment of consecutive items that are blocked across threads
 *     int thread_data[4];
 *     ...
 *
 *     // Collectively compute head flags for discontinuities in the segment
 *     int head_flags[4];
 *     BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality());
 *
 * \endcode
 * \par
 * Suppose the set of input \p thread_data across the block of threads is

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

    /**
     * \brief Sets head flags indicating discontinuities between items partitioned across the thread block, for which the first item has no reference and is always flagged.
     *
     * \par
     * - The flag <tt>head_flags<sub><em>i</em></sub></tt> is set for item
     *   <tt>input<sub><em>i</em></sub></tt> when
     *   <tt>flag_op(</tt><em>previous-item</em><tt>, input<sub><em>i</em></sub>)</tt>
     *   returns \p true (where <em>previous-item</em> is either the preceding item
     *   in the same thread or the last item in the previous thread).
     * - For <em>thread</em><sub>0</sub>, item <tt>input<sub>0</sub></tt> is always flagged.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the head-flagging of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
     *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
     *
     *     // Allocate shared memory for BlockDiscontinuity
     *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute head flags for discontinuities in the segment
     *     int head_flags[4];
     *     BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality());
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

     * \brief Sets head flags indicating discontinuities between items partitioned across the thread block.
     *
     * \par
     * - The flag <tt>head_flags<sub><em>i</em></sub></tt> is set for item
     *   <tt>input<sub><em>i</em></sub></tt> when
     *   <tt>flag_op(</tt><em>previous-item</em><tt>, input<sub><em>i</em></sub>)</tt>
     *   returns \p true (where <em>previous-item</em> is either the preceding item
     *   in the same thread or the last item in the previous thread).
     * - For <em>thread</em><sub>0</sub>, item <tt>input<sub>0</sub></tt> is compared
     *   against \p tile_predecessor_item.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the head-flagging of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
     *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
     *
     *     // Allocate shared memory for BlockDiscontinuity
     *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Have thread0 obtain the predecessor item for the entire tile
     *     int tile_predecessor_item;
     *     if (threadIdx.x == 0) tile_predecessor_item == ...
     *
     *     // Collectively compute head flags for discontinuities in the segment
     *     int head_flags[4];
     *     BlockDiscontinuity(temp_storage).FlagHeads(

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

     * \brief Sets tail flags indicating discontinuities between items partitioned across the thread block, for which the last item has no reference and is always flagged.
     *
     * \par
     * - The flag <tt>tail_flags<sub><em>i</em></sub></tt> is set for item
     *   <tt>input<sub><em>i</em></sub></tt> when
     *   <tt>flag_op(input<sub><em>i</em></sub>, </tt><em>next-item</em><tt>)</tt>
     *   returns \p true (where <em>next-item</em> is either the next item
     *   in the same thread or the first item in the next thread).
     * - For <em>thread</em><sub><em>BLOCK_THREADS</em>-1</sub>, item
     *   <tt>input</tt><sub><em>ITEMS_PER_THREAD</em>-1</sub> is always flagged.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the tail-flagging of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
     *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
     *
     *     // Allocate shared memory for BlockDiscontinuity
     *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute tail flags for discontinuities in the segment
     *     int tail_flags[4];
     *     BlockDiscontinuity(temp_storage).FlagTails(tail_flags, thread_data, cub::Inequality());
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

     *
     * \par
     * - The flag <tt>tail_flags<sub><em>i</em></sub></tt> is set for item
     *   <tt>input<sub><em>i</em></sub></tt> when
     *   <tt>flag_op(input<sub><em>i</em></sub>, </tt><em>next-item</em><tt>)</tt>
     *   returns \p true (where <em>next-item</em> is either the next item
     *   in the same thread or the first item in the next thread).
     * - For <em>thread</em><sub><em>BLOCK_THREADS</em>-1</sub>, item
     *   <tt>input</tt><sub><em>ITEMS_PER_THREAD</em>-1</sub> is compared
     *   against \p tile_successor_item.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the tail-flagging of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
     *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
     *
     *     // Allocate shared memory for BlockDiscontinuity
     *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Have thread127 obtain the successor item for the entire tile
     *     int tile_successor_item;
     *     if (threadIdx.x == 127) tile_successor_item == ...
     *
     *     // Collectively compute tail flags for discontinuities in the segment
     *     int tail_flags[4];
     *     BlockDiscontinuity(temp_storage).FlagTails(

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

     *   returns \p true (where <em>previous-item</em> is either the preceding item
     *   in the same thread or the last item in the previous thread).
     * - For <em>thread</em><sub>0</sub>, item <tt>input<sub>0</sub></tt> is always flagged.
     * - The flag <tt>tail_flags<sub><em>i</em></sub></tt> is set for item
     *   <tt>input<sub><em>i</em></sub></tt> when
     *   <tt>flag_op(input<sub><em>i</em></sub>, </tt><em>next-item</em><tt>)</tt>
     *   returns \p true (where <em>next-item</em> is either the next item
     *   in the same thread or the first item in the next thread).
     * - For <em>thread</em><sub><em>BLOCK_THREADS</em>-1</sub>, item
     *   <tt>input</tt><sub><em>ITEMS_PER_THREAD</em>-1</sub> is always flagged.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the head- and tail-flagging of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
     *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
     *
     *     // Allocate shared memory for BlockDiscontinuity
     *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute head and flags for discontinuities in the segment
     *     int head_flags[4];
     *     int tail_flags[4];
     *     BlockDiscontinuity(temp_storage).FlagTails(
     *         head_flags, tail_flags, thread_data, cub::Inequality());
     *
     * \endcode

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

     *   in the same thread or the last item in the previous thread).
     * - For <em>thread</em><sub>0</sub>, item <tt>input<sub>0</sub></tt> is always flagged.
     * - The flag <tt>tail_flags<sub><em>i</em></sub></tt> is set for item
     *   <tt>input<sub><em>i</em></sub></tt> when
     *   <tt>flag_op(input<sub><em>i</em></sub>, </tt><em>next-item</em><tt>)</tt>
     *   returns \p true (where <em>next-item</em> is either the next item
     *   in the same thread or the first item in the next thread).
     * - For <em>thread</em><sub><em>BLOCK_THREADS</em>-1</sub>, item
     *   <tt>input</tt><sub><em>ITEMS_PER_THREAD</em>-1</sub> is compared
     *   against \p tile_predecessor_item.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the head- and tail-flagging of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
     *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
     *
     *     // Allocate shared memory for BlockDiscontinuity
     *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Have thread127 obtain the successor item for the entire tile
     *     int tile_successor_item;
     *     if (threadIdx.x == 127) tile_successor_item == ...
     *
     *     // Collectively compute head and flags for discontinuities in the segment
     *     int head_flags[4];
     *     int tail_flags[4];

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

     *   in the same thread or the last item in the previous thread).
     * - For <em>thread</em><sub>0</sub>, item <tt>input<sub>0</sub></tt> is compared
     *   against \p tile_predecessor_item.
     * - The flag <tt>tail_flags<sub><em>i</em></sub></tt> is set for item
     *   <tt>input<sub><em>i</em></sub></tt> when
     *   <tt>flag_op(input<sub><em>i</em></sub>, </tt><em>next-item</em><tt>)</tt>
     *   returns \p true (where <em>next-item</em> is either the next item
     *   in the same thread or the first item in the next thread).
     * - For <em>thread</em><sub><em>BLOCK_THREADS</em>-1</sub>, item
     *   <tt>input</tt><sub><em>ITEMS_PER_THREAD</em>-1</sub> is always flagged.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the head- and tail-flagging of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
     *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
     *
     *     // Allocate shared memory for BlockDiscontinuity
     *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Have thread0 obtain the predecessor item for the entire tile
     *     int tile_predecessor_item;
     *     if (threadIdx.x == 0) tile_predecessor_item == ...
     *
     *     // Have thread127 obtain the successor item for the entire tile
     *     int tile_successor_item;
     *     if (threadIdx.x == 127) tile_successor_item == ...

xgboost/cub/cub/block/block_discontinuity.cuh  view on Meta::CPAN

     * - For <em>thread</em><sub>0</sub>, item <tt>input<sub>0</sub></tt> is compared
     *   against \p tile_predecessor_item.
     * - The flag <tt>tail_flags<sub><em>i</em></sub></tt> is set for item
     *   <tt>input<sub><em>i</em></sub></tt> when
     *   <tt>flag_op(input<sub><em>i</em></sub>, </tt><em>next-item</em><tt>)</tt>
     *   returns \p true (where <em>next-item</em> is either the next item
     *   in the same thread or the first item in the next thread).
     * - For <em>thread</em><sub><em>BLOCK_THREADS</em>-1</sub>, item
     *   <tt>input</tt><sub><em>ITEMS_PER_THREAD</em>-1</sub> is compared
     *   against \p tile_successor_item.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the head- and tail-flagging of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_discontinuity.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockDiscontinuity for a 1D block of 128 threads on type int
     *     typedef cub::BlockDiscontinuity<int, 128> BlockDiscontinuity;
     *
     *     // Allocate shared memory for BlockDiscontinuity
     *     __shared__ typename BlockDiscontinuity::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Have thread0 obtain the predecessor item for the entire tile
     *     int tile_predecessor_item;
     *     if (threadIdx.x == 0) tile_predecessor_item == ...
     *
     *     // Have thread127 obtain the successor item for the entire tile
     *     int tile_successor_item;
     *     if (threadIdx.x == 127) tile_successor_item == ...

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

 * \tparam ITEMS_PER_THREAD     The number of items partitioned onto each thread.
 * \tparam WARP_TIME_SLICING    <b>[optional]</b> When \p true, only use enough shared memory for a single warp's worth of tile data, time-slicing the block-wide exchange over multiple synchronized rounds.  Yields a smaller memory footprint at the ex...
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH             <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * - It is commonplace for blocks of threads to rearrange data items between
 *   threads.  For example, the device-accessible memory subsystem prefers access patterns
 *   where data items are "striped" across threads (where consecutive threads access consecutive items),
 *   yet most block-wide operations prefer a "blocked" partitioning of items across threads
 *   (where consecutive items belong to a single thread).
 * - BlockExchange supports the following types of data exchanges:
 *   - Transposing between [<em>blocked</em>](index.html#sec5sec3) and [<em>striped</em>](index.html#sec5sec3) arrangements
 *   - Transposing between [<em>blocked</em>](index.html#sec5sec3) and [<em>warp-striped</em>](index.html#sec5sec3) arrangements
 *   - Scattering ranked items to a [<em>blocked arrangement</em>](index.html#sec5sec3)
 *   - Scattering ranked items to a [<em>striped arrangement</em>](index.html#sec5sec3)
 * - \rowmajor
 *
 * \par A Simple Example
 * \blockcollective{BlockExchange}
 * \par
 * The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement
 * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
 * \par
 * \code
 * #include <cub/cub.cuh>   // or equivalently <cub/block/block_exchange.cuh>
 *
 * __global__ void ExampleKernel(int *d_data, ...)
 * {
 *     // Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
 *     typedef cub::BlockExchange<int, 128, 4> BlockExchange;
 *
 *     // Allocate shared memory for BlockExchange
 *     __shared__ typename BlockExchange::TempStorage temp_storage;
 *
 *     // Load a tile of data striped across threads
 *     int thread_data[4];
 *     cub::LoadDirectStriped<128>(threadIdx.x, d_data, thread_data);
 *
 *     // Collectively exchange data into a blocked arrangement across threads
 *     BlockExchange(temp_storage).StripedToBlocked(thread_data);
 *
 * \endcode
 * \par
 * Suppose the set of striped input \p thread_data across the block of threads is
 * <tt>{ [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }</tt>.
 * The corresponding output \p thread_data in those threads will be
 * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
 *
 * \par Performance Considerations
 * - Proper device-specific padding ensures zero bank conflicts for most types.

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN


    /// Internal storage allocator
    __device__ __forceinline__ _TempStorage& PrivateStorage()
    {
        __shared__ _TempStorage private_storage;
        return private_storage;
    }


    /**
     * Transposes data items from <em>blocked</em> arrangement to <em>striped</em> arrangement.  Specialized for no timeslicing.
     */
    template <typename OutputT>
    __device__ __forceinline__ void BlockedToStriped(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        Int2Type<false> /*time_slicing*/)
    {
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM;
            if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
            temp_storage.buff[item_offset] = input_items[ITEM];
        }

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid;
            if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
            output_items[ITEM] = temp_storage.buff[item_offset];
        }
    }


    /**
     * Transposes data items from <em>blocked</em> arrangement to <em>striped</em> arrangement.  Specialized for warp-timeslicing.
     */
    template <typename OutputT>
    __device__ __forceinline__ void BlockedToStriped(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        Int2Type<true>  /*time_slicing*/)
    {
        InputT temp_items[ITEMS_PER_THREAD];

        #pragma unroll
        for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++)
        {
            const int SLICE_OFFSET  = SLICE * TIME_SLICED_ITEMS;
            const int SLICE_OOB     = SLICE_OFFSET + TIME_SLICED_ITEMS;

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

        // Copy
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            output_items[ITEM] = temp_items[ITEM];
        }
    }


    /**
     * Transposes data items from <em>blocked</em> arrangement to <em>warp-striped</em> arrangement. Specialized for no timeslicing
     */
    template <typename OutputT>
    __device__ __forceinline__ void BlockedToWarpStriped(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        Int2Type<false> /*time_slicing*/)
    {
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = warp_offset + ITEM + (lane_id * ITEMS_PER_THREAD);
            if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
            temp_storage.buff[item_offset] = input_items[ITEM];
        }

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = warp_offset + (ITEM * WARP_TIME_SLICED_THREADS) + lane_id;
            if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
            output_items[ITEM] = temp_storage.buff[item_offset];
        }
    }

    /**
     * Transposes data items from <em>blocked</em> arrangement to <em>warp-striped</em> arrangement. Specialized for warp-timeslicing
     */
    template <typename OutputT>
    __device__ __forceinline__ void BlockedToWarpStriped(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        Int2Type<true>  /*time_slicing*/)
    {
        if (warp_id == 0)
        {
            #pragma unroll
            for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
            {
                int item_offset = ITEM + (lane_id * ITEMS_PER_THREAD);
                if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
                temp_storage.buff[item_offset] = input_items[ITEM];

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

                    int item_offset = (ITEM * WARP_TIME_SLICED_THREADS) + lane_id;
                    if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
                    output_items[ITEM] = temp_storage.buff[item_offset];
                }
            }
        }
    }


    /**
     * Transposes data items from <em>striped</em> arrangement to <em>blocked</em> arrangement.  Specialized for no timeslicing.
     */
    template <typename OutputT>
    __device__ __forceinline__ void StripedToBlocked(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        Int2Type<false> /*time_slicing*/)
    {
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid;
            if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
            temp_storage.buff[item_offset] = input_items[ITEM];
        }

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM;
            if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
            output_items[ITEM] = temp_storage.buff[item_offset];
        }
    }


    /**
     * Transposes data items from <em>striped</em> arrangement to <em>blocked</em> arrangement.  Specialized for warp-timeslicing.
     */
    template <typename OutputT>
    __device__ __forceinline__ void StripedToBlocked(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        Int2Type<true>  /*time_slicing*/)
    {
        // Warp time-slicing
        InputT temp_items[ITEMS_PER_THREAD];

        #pragma unroll
        for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++)
        {
            const int SLICE_OFFSET  = SLICE * TIME_SLICED_ITEMS;
            const int SLICE_OOB     = SLICE_OFFSET + TIME_SLICED_ITEMS;

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

        // Copy
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            output_items[ITEM] = temp_items[ITEM];
        }
    }


    /**
     * Transposes data items from <em>warp-striped</em> arrangement to <em>blocked</em> arrangement.  Specialized for no timeslicing
     */
    template <typename OutputT>
    __device__ __forceinline__ void WarpStripedToBlocked(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        Int2Type<false> /*time_slicing*/)
    {
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = warp_offset + (ITEM * WARP_TIME_SLICED_THREADS) + lane_id;
            if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
            temp_storage.buff[item_offset] = input_items[ITEM];
        }

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = warp_offset + ITEM + (lane_id * ITEMS_PER_THREAD);
            if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
            output_items[ITEM] = temp_storage.buff[item_offset];
        }
    }


    /**
     * Transposes data items from <em>warp-striped</em> arrangement to <em>blocked</em> arrangement.  Specialized for warp-timeslicing
     */
    template <typename OutputT>
    __device__ __forceinline__ void WarpStripedToBlocked(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        Int2Type<true>  /*time_slicing*/)
    {
        #pragma unroll
        for (unsigned int SLICE = 0; SLICE < TIME_SLICES; ++SLICE)
        {
            CTA_SYNC();

            if (warp_id == SLICE)
            {
                #pragma unroll

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

                    int item_offset = ITEM + (lane_id * ITEMS_PER_THREAD);
                    if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS;
                    output_items[ITEM] = temp_storage.buff[item_offset];
                }
            }
        }
    }


    /**
     * Exchanges data items annotated by rank into <em>blocked</em> arrangement.  Specialized for no timeslicing.
     */
    template <typename OutputT, typename OffsetT>
    __device__ __forceinline__ void ScatterToBlocked(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OffsetT         ranks[ITEMS_PER_THREAD],    ///< [in] Corresponding scatter ranks
        Int2Type<false> /*time_slicing*/)
    {
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = ranks[ITEM];
            if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
            temp_storage.buff[item_offset] = input_items[ITEM];
        }

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM;
            if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
            output_items[ITEM] = temp_storage.buff[item_offset];
        }
    }

    /**
     * Exchanges data items annotated by rank into <em>blocked</em> arrangement.  Specialized for warp-timeslicing.
     */
    template <typename OutputT, typename OffsetT>
    __device__ __forceinline__ void ScatterToBlocked(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OffsetT         ranks[ITEMS_PER_THREAD],    ///< [in] Corresponding scatter ranks
        Int2Type<true>  /*time_slicing*/)
    {
        InputT temp_items[ITEMS_PER_THREAD];

        #pragma unroll
        for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++)
        {
            CTA_SYNC();

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

            output_items[ITEM] = temp_items[ITEM];
        }
    }


    /**
     * Exchanges data items annotated by rank into <em>striped</em> arrangement.  Specialized for no timeslicing.
     */
    template <typename OutputT, typename OffsetT>
    __device__ __forceinline__ void ScatterToStriped(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OffsetT         ranks[ITEMS_PER_THREAD],    ///< [in] Corresponding scatter ranks
        Int2Type<false> /*time_slicing*/)
    {
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = ranks[ITEM];
            if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
            temp_storage.buff[item_offset] = input_items[ITEM];
        }

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

            output_items[ITEM] = temp_storage.buff[item_offset];
        }
    }


    /**
     * Exchanges data items annotated by rank into <em>striped</em> arrangement.  Specialized for warp-timeslicing.
     */
    template <typename OutputT, typename OffsetT>
    __device__ __forceinline__ void ScatterToStriped(
        InputT          input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OutputT         output_items[ITEMS_PER_THREAD],     ///< [out] Items to exchange, converting between <em>blocked</em> and <em>striped</em> arrangements.
        OffsetT         ranks[ITEMS_PER_THREAD],    ///< [in] Corresponding scatter ranks
        Int2Type<true> /*time_slicing*/)
    {
        InputT temp_items[ITEMS_PER_THREAD];

        #pragma unroll
        for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++)
        {
            const int SLICE_OFFSET  = SLICE * TIME_SLICED_ITEMS;
            const int SLICE_OOB     = SLICE_OFFSET + TIME_SLICED_ITEMS;

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

    {}


    //@}  end member group
    /******************************************************************//**
     * \name Structured exchanges
     *********************************************************************/
    //@{

    /**
     * \brief Transposes data items from <em>striped</em> arrangement to <em>blocked</em> arrangement.
     *
     * \par
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the conversion from a "striped" to a "blocked" arrangement
     * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_exchange.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, ...)
     * {
     *     // Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockExchange<int, 128, 4> BlockExchange;
     *
     *     // Allocate shared memory for BlockExchange
     *     __shared__ typename BlockExchange::TempStorage temp_storage;
     *
     *     // Load a tile of ordered data into a striped arrangement across block threads
     *     int thread_data[4];
     *     cub::LoadDirectStriped<128>(threadIdx.x, d_data, thread_data);
     *
     *     // Collectively exchange data into a blocked arrangement across threads
     *     BlockExchange(temp_storage).StripedToBlocked(thread_data, thread_data);
     *
     * \endcode
     * \par
     * Suppose the set of striped input \p thread_data across the block of threads is
     * <tt>{ [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }</tt> after loading from device-accessible memory.
     * The corresponding output \p thread_data in those threads will be
     * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
     *
     */
    template <typename OutputT>
    __device__ __forceinline__ void StripedToBlocked(
        InputT      input_items[ITEMS_PER_THREAD],    ///< [in] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OutputT     output_items[ITEMS_PER_THREAD])   ///< [out] Items from exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
    {
        StripedToBlocked(input_items, output_items, Int2Type<WARP_TIME_SLICING>());
    }


    /**
     * \brief Transposes data items from <em>blocked</em> arrangement to <em>striped</em> arrangement.
     *
     * \par
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement
     * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_exchange.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, ...)
     * {
     *     // Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockExchange<int, 128, 4> BlockExchange;
     *
     *     // Allocate shared memory for BlockExchange
     *     __shared__ typename BlockExchange::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively exchange data into a striped arrangement across threads
     *     BlockExchange(temp_storage).BlockedToStriped(thread_data, thread_data);
     *
     *     // Store data striped across block threads into an ordered tile
     *     cub::StoreDirectStriped<STORE_DEFAULT, 128>(threadIdx.x, d_data, thread_data);
     *
     * \endcode
     * \par
     * Suppose the set of blocked input \p thread_data across the block of threads is
     * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
     * The corresponding output \p thread_data in those threads will be
     * <tt>{ [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }</tt> in
     * preparation for storing to device-accessible memory.
     *
     */
    template <typename OutputT>
    __device__ __forceinline__ void BlockedToStriped(
        InputT      input_items[ITEMS_PER_THREAD],    ///< [in] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OutputT     output_items[ITEMS_PER_THREAD])   ///< [out] Items from exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
    {
        BlockedToStriped(input_items, output_items, Int2Type<WARP_TIME_SLICING>());
    }



    /**
     * \brief Transposes data items from <em>warp-striped</em> arrangement to <em>blocked</em> arrangement.
     *
     * \par
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the conversion from a "warp-striped" to a "blocked" arrangement
     * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_exchange.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, ...)
     * {
     *     // Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockExchange<int, 128, 4> BlockExchange;
     *
     *     // Allocate shared memory for BlockExchange
     *     __shared__ typename BlockExchange::TempStorage temp_storage;
     *
     *     // Load a tile of ordered data into a warp-striped arrangement across warp threads
     *     int thread_data[4];
     *     cub::LoadSWarptriped<LOAD_DEFAULT>(threadIdx.x, d_data, thread_data);
     *
     *     // Collectively exchange data into a blocked arrangement across threads
     *     BlockExchange(temp_storage).WarpStripedToBlocked(thread_data);
     *
     * \endcode
     * \par
     * Suppose the set of warp-striped input \p thread_data across the block of threads is
     * <tt>{ [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] }</tt>
     * after loading from device-accessible memory.  (The first 128 items are striped across
     * the first warp of 32 threads, the second 128 items are striped across the second warp, etc.)
     * The corresponding output \p thread_data in those threads will be
     * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
     *
     */
    template <typename OutputT>
    __device__ __forceinline__ void WarpStripedToBlocked(
        InputT      input_items[ITEMS_PER_THREAD],    ///< [in] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OutputT     output_items[ITEMS_PER_THREAD])   ///< [out] Items from exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
    {
        WarpStripedToBlocked(input_items, output_items, Int2Type<WARP_TIME_SLICING>());
    }



    /**
     * \brief Transposes data items from <em>blocked</em> arrangement to <em>warp-striped</em> arrangement.
     *
     * \par
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the conversion from a "blocked" to a "warp-striped" arrangement
     * of 512 integer items partitioned across 128 threads where each thread owns 4 items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_exchange.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, ...)
     * {
     *     // Specialize BlockExchange for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockExchange<int, 128, 4> BlockExchange;
     *
     *     // Allocate shared memory for BlockExchange
     *     __shared__ typename BlockExchange::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively exchange data into a warp-striped arrangement across threads
     *     BlockExchange(temp_storage).BlockedToWarpStriped(thread_data, thread_data);
     *
     *     // Store data striped across warp threads into an ordered tile
     *     cub::StoreDirectStriped<STORE_DEFAULT, 128>(threadIdx.x, d_data, thread_data);
     *
     * \endcode
     * \par
     * Suppose the set of blocked input \p thread_data across the block of threads is
     * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
     * The corresponding output \p thread_data in those threads will be
     * <tt>{ [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] }</tt>
     * in preparation for storing to device-accessible memory. (The first 128 items are striped across
     * the first warp of 32 threads, the second 128 items are striped across the second warp, etc.)
     *
     */
    template <typename OutputT>
    __device__ __forceinline__ void BlockedToWarpStriped(
        InputT      input_items[ITEMS_PER_THREAD],    ///< [in] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OutputT     output_items[ITEMS_PER_THREAD])   ///< [out] Items from exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
    {
        BlockedToWarpStriped(input_items, output_items, Int2Type<WARP_TIME_SLICING>());
    }



    //@}  end member group
    /******************************************************************//**
     * \name Scatter exchanges
     *********************************************************************/
    //@{


    /**
     * \brief Exchanges data items annotated by rank into <em>blocked</em> arrangement.
     *
     * \par
     * - \smemreuse
     *
     * \tparam OffsetT                              <b>[inferred]</b> Signed integer type for local offsets
     */
    template <typename OutputT, typename OffsetT>
    __device__ __forceinline__ void ScatterToBlocked(
        InputT      input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OutputT     output_items[ITEMS_PER_THREAD],     ///< [out] Items from exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OffsetT     ranks[ITEMS_PER_THREAD])            ///< [in] Corresponding scatter ranks
    {
        ScatterToBlocked(input_items, output_items, ranks, Int2Type<WARP_TIME_SLICING>());
    }



    /**
     * \brief Exchanges data items annotated by rank into <em>striped</em> arrangement.
     *
     * \par
     * - \smemreuse
     *
     * \tparam OffsetT                              <b>[inferred]</b> Signed integer type for local offsets
     */
    template <typename OutputT, typename OffsetT>
    __device__ __forceinline__ void ScatterToStriped(
        InputT      input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OutputT     output_items[ITEMS_PER_THREAD],     ///< [out] Items from exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OffsetT     ranks[ITEMS_PER_THREAD])            ///< [in] Corresponding scatter ranks
    {
        ScatterToStriped(input_items, output_items, ranks, Int2Type<WARP_TIME_SLICING>());
    }



    /**
     * \brief Exchanges data items annotated by rank into <em>striped</em> arrangement.  Items with rank -1 are not exchanged.
     *
     * \par
     * - \smemreuse
     *
     * \tparam OffsetT                              <b>[inferred]</b> Signed integer type for local offsets
     */
    template <typename OutputT, typename OffsetT>
    __device__ __forceinline__ void ScatterToStripedGuarded(
        InputT      input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OutputT     output_items[ITEMS_PER_THREAD],     ///< [out] Items from exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OffsetT     ranks[ITEMS_PER_THREAD])            ///< [in] Corresponding scatter ranks
    {
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = ranks[ITEM];
            if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
            if (ranks[ITEM] >= 0)
                temp_storage.buff[item_offset] = input_items[ITEM];
        }

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

     * \brief Exchanges valid data items annotated by rank into <em>striped</em> arrangement.
     *
     * \par
     * - \smemreuse
     *
     * \tparam OffsetT                              <b>[inferred]</b> Signed integer type for local offsets
     * \tparam ValidFlag                            <b>[inferred]</b> FlagT type denoting which items are valid
     */
    template <typename OutputT, typename OffsetT, typename ValidFlag>
    __device__ __forceinline__ void ScatterToStripedFlagged(
        InputT      input_items[ITEMS_PER_THREAD],      ///< [in] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OutputT     output_items[ITEMS_PER_THREAD],     ///< [out] Items from exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OffsetT     ranks[ITEMS_PER_THREAD],            ///< [in] Corresponding scatter ranks
        ValidFlag   is_valid[ITEMS_PER_THREAD])         ///< [in] Corresponding flag denoting item validity
    {
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            int item_offset = ranks[ITEM];
            if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset);
            if (is_valid[ITEM])
                temp_storage.buff[item_offset] = input_items[ITEM];

xgboost/cub/cub/block/block_exchange.cuh  view on Meta::CPAN

    }


    //@}  end member group



#ifndef DOXYGEN_SHOULD_SKIP_THIS    // Do not document


    __device__ __forceinline__ void StripedToBlocked(
        InputT      items[ITEMS_PER_THREAD])   ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
    {
        StripedToBlocked(items, items);
    }

    __device__ __forceinline__ void BlockedToStriped(
        InputT      items[ITEMS_PER_THREAD])   ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
    {
        BlockedToStriped(items, items);
    }

    __device__ __forceinline__ void WarpStripedToBlocked(
        InputT      items[ITEMS_PER_THREAD])    ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
    {
        WarpStripedToBlocked(items, items);
    }

    __device__ __forceinline__ void BlockedToWarpStriped(
        InputT      items[ITEMS_PER_THREAD])    ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
    {
        BlockedToWarpStriped(items, items);
    }

    template <typename OffsetT>
    __device__ __forceinline__ void ScatterToBlocked(
        InputT      items[ITEMS_PER_THREAD],    ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OffsetT     ranks[ITEMS_PER_THREAD])    ///< [in] Corresponding scatter ranks
    {
        ScatterToBlocked(items, items, ranks);
    }

    template <typename OffsetT>
    __device__ __forceinline__ void ScatterToStriped(
        InputT      items[ITEMS_PER_THREAD],    ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OffsetT     ranks[ITEMS_PER_THREAD])    ///< [in] Corresponding scatter ranks
    {
        ScatterToStriped(items, items, ranks);
    }

    template <typename OffsetT>
    __device__ __forceinline__ void ScatterToStripedGuarded(
        InputT      items[ITEMS_PER_THREAD],    ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OffsetT     ranks[ITEMS_PER_THREAD])    ///< [in] Corresponding scatter ranks
    {
        ScatterToStripedGuarded(items, items, ranks);
    }

    template <typename OffsetT, typename ValidFlag>
    __device__ __forceinline__ void ScatterToStripedFlagged(
        InputT      items[ITEMS_PER_THREAD],        ///< [in-out] Items to exchange, converting between <em>striped</em> and <em>blocked</em> arrangements.
        OffsetT     ranks[ITEMS_PER_THREAD],        ///< [in] Corresponding scatter ranks
        ValidFlag   is_valid[ITEMS_PER_THREAD])     ///< [in] Corresponding flag denoting item validity
    {
        ScatterToStriped(items, items, ranks, is_valid);
    }

#endif // DOXYGEN_SHOULD_SKIP_THIS


};

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

/// CUB namespace
namespace cub {

/**
 * \addtogroup UtilIo
 * @{
 */


/******************************************************************//**
 * \name Blocked arrangement I/O (direct)
 *********************************************************************/
//@{


/**
 * \brief Load a linear segment of items into a blocked arrangement across the thread block.
 *
 * \blocked
 *
 * \tparam T                    <b>[inferred]</b> The data type to load.
 * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
 * \tparam InputIteratorT       <b>[inferred]</b> The random-access iterator type for input \iterator.
 */
template <
    typename        InputT,
    int             ITEMS_PER_THREAD,
    typename        InputIteratorT>
__device__ __forceinline__ void LoadDirectBlocked(
    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
    InputIteratorT  block_itr,                  ///< [in] The thread block's base input iterator for loading from
    InputT          (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
{
    InputIteratorT thread_itr = block_itr + (linear_tid * ITEMS_PER_THREAD);

    // Load directly in thread-blocked order
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        items[ITEM] = thread_itr[ITEM];
    }
}


/**
 * \brief Load a linear segment of items into a blocked arrangement across the thread block, guarded by range.
 *
 * \blocked
 *
 * \tparam T                    <b>[inferred]</b> The data type to load.
 * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
 * \tparam InputIteratorT       <b>[inferred]</b> The random-access iterator type for input \iterator.
 */
template <
    typename        InputT,
    int             ITEMS_PER_THREAD,
    typename        InputIteratorT>
__device__ __forceinline__ void LoadDirectBlocked(
    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
    InputIteratorT  block_itr,                  ///< [in] The thread block's base input iterator for loading from
    InputT          (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
    int             valid_items)                ///< [in] Number of valid items to load
{
    InputIteratorT thread_itr = block_itr + (linear_tid * ITEMS_PER_THREAD);

    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        if ((linear_tid * ITEMS_PER_THREAD) + ITEM < valid_items)
        {
            items[ITEM] = thread_itr[ITEM];
        }
    }
}


/**
 * \brief Load a linear segment of items into a blocked arrangement across the thread block, guarded by range, with a fall-back assignment of out-of-bound elements..
 *
 * \blocked
 *
 * \tparam T                    <b>[inferred]</b> The data type to load.
 * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
 * \tparam InputIteratorT       <b>[inferred]</b> The random-access iterator type for input \iterator.
 */
template <
    typename        InputT,
    typename        DefaultT,
    int             ITEMS_PER_THREAD,
    typename        InputIteratorT>
__device__ __forceinline__ void LoadDirectBlocked(
    int             linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
    InputIteratorT  block_itr,                  ///< [in] The thread block's base input iterator for loading from
    InputT          (&items)[ITEMS_PER_THREAD], ///< [out] Data to load
    int             valid_items,                ///< [in] Number of valid items to load
    DefaultT        oob_default)                ///< [in] Default value to assign out-of-bound items
{
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        items[ITEM] = oob_default;

    LoadDirectBlocked(linear_tid, block_itr, items, valid_items);
}


#ifndef DOXYGEN_SHOULD_SKIP_THIS    // Do not document

/**
 * Internal implementation for load vectorization
 */
template <
    CacheLoadModifier   MODIFIER,
    typename            T,
    int                 ITEMS_PER_THREAD>
__device__ __forceinline__ void InternalLoadDirectBlockedVectorized(
    int    linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
    T      *block_ptr,                 ///< [in] Input pointer for loading from
    T      (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
{
    // Biggest memory access word that T is a whole multiple of
    typedef typename UnitWord<T>::DeviceWord DeviceWord;

    enum
    {
        TOTAL_WORDS = sizeof(items) / sizeof(DeviceWord),

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN


    // Vector type
    typedef typename CubVector<DeviceWord, VECTOR_SIZE>::Type Vector;

    // Vector items
    Vector vec_items[VECTORS_PER_THREAD];

    // Aliased input ptr
    Vector* vec_ptr = reinterpret_cast<Vector*>(block_ptr) + (linear_tid * VECTORS_PER_THREAD);

    // Load directly in thread-blocked order
    #pragma unroll
    for (int ITEM = 0; ITEM < VECTORS_PER_THREAD; ITEM++)
    {
        vec_items[ITEM] = ThreadLoad<MODIFIER>(vec_ptr + ITEM);
    }

    // Copy
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        items[ITEM] = *(reinterpret_cast<T*>(vec_items) + ITEM);
    }
}

#endif // DOXYGEN_SHOULD_SKIP_THIS


/**
 * \brief Load a linear segment of items into a blocked arrangement across the thread block.
 *
 * \blocked
 *
 * The input offset (\p block_ptr + \p block_offset) must be quad-item aligned
 *
 * The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
 *   - \p ITEMS_PER_THREAD is odd
 *   - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
 *
 * \tparam T                    <b>[inferred]</b> The data type to load.
 * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
 */
template <
    typename        T,
    int             ITEMS_PER_THREAD>
__device__ __forceinline__ void LoadDirectBlockedVectorized(
    int linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
    T   *block_ptr,                 ///< [in] Input pointer for loading from
    T   (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
{
    InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
}


//@}  end member group
/******************************************************************//**
 * \name Striped arrangement I/O (direct)
 *********************************************************************/
//@{


xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN


/** @} */       // end group UtilIo



//-----------------------------------------------------------------------------
// Generic BlockLoad abstraction
//-----------------------------------------------------------------------------

/**
 * \brief cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
 */

/**
 * \brief cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
 */
enum BlockLoadAlgorithm
{
    /**
     * \par Overview
     *
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) of data is read
     * directly from memory.
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) decreases as the
     *   access stride between threads increases (i.e., the number items per thread).
     */
    BLOCK_LOAD_DIRECT,

    /**
     * \par Overview
     *
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) of data is read
     * from memory using CUDA's built-in vectorized loads as a coalescing optimization.
     * For example, <tt>ld.global.v4.s32</tt> instructions will be generated
     * when \p T = \p int and \p ITEMS_PER_THREAD % 4 == 0.
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high until the the
     *   access stride between threads (i.e., the number items per thread) exceeds the
     *   maximum vector load width (typically 4 items or 64B, whichever is lower).
     * - The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
     *   - \p ITEMS_PER_THREAD is odd

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

     *   - The block input offset is not quadword-aligned
     *   - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
     */
    BLOCK_LOAD_VECTORIZE,

    /**
     * \par Overview
     *
     * A [<em>striped arrangement</em>](index.html#sec5sec3) of data is read
     * efficiently from memory and then locally transposed into a
     * [<em>blocked arrangement</em>](index.html#sec5sec3).
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items loaded per thread.
     * - The local reordering incurs slightly longer latencies and throughput than the
     *   direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
     */
    BLOCK_LOAD_TRANSPOSE,


    /**
     * \par Overview
     *
     * A [<em>warp-striped arrangement</em>](index.html#sec5sec3) of data is
     * read efficiently from memory and then locally transposed into a
     * [<em>blocked arrangement</em>](index.html#sec5sec3).
     *
     * \par Usage Considerations
     * - BLOCK_THREADS must be a multiple of WARP_THREADS
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items loaded per thread.
     * - The local reordering incurs slightly larger latencies than the
     *   direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
     * - Provisions more shared storage, but incurs smaller latencies than the
     *   BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED alternative.
     */
    BLOCK_LOAD_WARP_TRANSPOSE,


    /**
     * \par Overview
     *
     * Like \p BLOCK_LOAD_WARP_TRANSPOSE, a [<em>warp-striped arrangement</em>](index.html#sec5sec3)
     * of data is read directly from memory and then is locally transposed into a
     * [<em>blocked arrangement</em>](index.html#sec5sec3). To reduce the shared memory
     * requirement, only one warp's worth of shared memory is provisioned and is
     * subsequently time-sliced among warps.
     *
     * \par Usage Considerations
     * - BLOCK_THREADS must be a multiple of WARP_THREADS
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items loaded per thread.
     * - Provisions less shared memory temporary storage, but incurs larger
     *   latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.
     */
    BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED,
};


/**
 * \brief The BlockLoad class provides [<em>collective</em>](index.html#sec0) data movement methods for loading a linear segment of items from memory into a [<em>blocked arrangement</em>](index.html#sec5sec3) across a CUDA thread block.  ![](block_lo...
 * \ingroup BlockModule
 * \ingroup UtilIo
 *
 * \tparam InputT               The data type to read into (which must be convertible from the input iterator's value type).
 * \tparam BLOCK_DIM_X          The thread block length in threads along the X dimension
 * \tparam ITEMS_PER_THREAD     The number of consecutive items partitioned onto each thread.
 * \tparam ALGORITHM            <b>[optional]</b> cub::BlockLoadAlgorithm tuning policy.  default: cub::BLOCK_LOAD_DIRECT.
 * \tparam WARP_TIME_SLICING    <b>[optional]</b> Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage). (defa...
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH             <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * - The BlockLoad class provides a single data movement abstraction that can be specialized
 *   to implement different cub::BlockLoadAlgorithm strategies.  This facilitates different
 *   performance policies for different architectures, data types, granularity sizes, etc.
 * - BlockLoad can be optionally specialized by different data movement strategies:
 *   -# <b>cub::BLOCK_LOAD_DIRECT</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3)
 *      of data is read directly from memory.  [More...](\ref cub::BlockLoadAlgorithm)
 *   -# <b>cub::BLOCK_LOAD_VECTORIZE</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3)
 *      of data is read directly from memory using CUDA's built-in vectorized loads as a
 *      coalescing optimization.    [More...](\ref cub::BlockLoadAlgorithm)
 *   -# <b>cub::BLOCK_LOAD_TRANSPOSE</b>.  A [<em>striped arrangement</em>](index.html#sec5sec3)
 *      of data is read directly from memory and is then locally transposed into a
 *      [<em>blocked arrangement</em>](index.html#sec5sec3).  [More...](\ref cub::BlockLoadAlgorithm)
 *   -# <b>cub::BLOCK_LOAD_WARP_TRANSPOSE</b>.  A [<em>warp-striped arrangement</em>](index.html#sec5sec3)
 *      of data is read directly from memory and is then locally transposed into a
 *      [<em>blocked arrangement</em>](index.html#sec5sec3).  [More...](\ref cub::BlockLoadAlgorithm)
 *   -# <b>cub::BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED,</b>.  A [<em>warp-striped arrangement</em>](index.html#sec5sec3)
 *      of data is read directly from memory and is then locally transposed into a
 *      [<em>blocked arrangement</em>](index.html#sec5sec3) one warp at a time.  [More...](\ref cub::BlockLoadAlgorithm)
 * - \rowmajor
 *
 * \par A Simple Example
 * \blockcollective{BlockLoad}
 * \par
 * The code snippet below illustrates the loading of a linear
 * segment of 512 integers into a "blocked" arrangement across 128 threads where each
 * thread owns 4 consecutive items.  The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE,
 * meaning memory references are efficiently coalesced using a warp-striped access
 * pattern (after which items are locally reordered among threads).
 * \par
 * \code
 * #include <cub/cub.cuh>   // or equivalently <cub/block/block_load.cuh>
 *
 * __global__ void ExampleKernel(int *d_data, ...)
 * {
 *     // Specialize BlockLoad for a 1D block of 128 threads owning 4 integer items each
 *     typedef cub::BlockLoad<int, 128, 4, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
 *
 *     // Allocate shared memory for BlockLoad
 *     __shared__ typename BlockLoad::TempStorage temp_storage;
 *
 *     // Load a segment of consecutive items that are blocked across threads
 *     int thread_data[4];
 *     BlockLoad(temp_storage).Load(d_data, thread_data);
 *
 * \endcode
 * \par
 * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, 5, ...</tt>.
 * The set of \p thread_data across the block of threads in those threads will be
 * <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
 *
 */

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

        :
            linear_tid(linear_tid)
        {}

        /// Load a linear segment of items from memory
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load
        {
            LoadDirectBlocked(linear_tid, block_itr, items);
        }

        /// Load a linear segment of items from memory, guarded by range
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items)                    ///< [in] Number of valid items to load
        {
            LoadDirectBlocked(linear_tid, block_itr, items, valid_items);
        }

        /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
        template <typename InputIteratorT, typename DefaultT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items,                    ///< [in] Number of valid items to load
            DefaultT        oob_default)                    ///< [in] Default value to assign out-of-bound items
        {
            LoadDirectBlocked(linear_tid, block_itr, items, valid_items, oob_default);
        }

    };


    /**
     * BLOCK_LOAD_VECTORIZE specialization of load helper
     */
    template <int DUMMY>
    struct LoadInternal<BLOCK_LOAD_VECTORIZE, DUMMY>

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

        :
            linear_tid(linear_tid)
        {}

        /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization)
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputT               *block_ptr,                     ///< [in] The thread block's base input iterator for loading from
            InputT               (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load
        {
            InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
        }

        /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization)
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            const InputT         *block_ptr,                     ///< [in] The thread block's base input iterator for loading from
            InputT               (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load
        {
            InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
        }

        /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization)
        template <
            CacheLoadModifier   MODIFIER,
            typename            ValueType,
            typename            OffsetT>
        __device__ __forceinline__ void Load(
            CacheModifiedInputIterator<MODIFIER, ValueType, OffsetT>    block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT                                                     (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load
        {
            InternalLoadDirectBlockedVectorized<MODIFIER>(linear_tid, block_itr.ptr, items);
        }

        /// Load a linear segment of items from memory, specialized for opaque input iterators (skips vectorization)
        template <typename _InputIteratorT>
        __device__ __forceinline__ void Load(
            _InputIteratorT   block_itr,                    ///< [in] The thread block's base input iterator for loading from
            InputT           (&items)[ITEMS_PER_THREAD])   ///< [out] Data to load
        {
            LoadDirectBlocked(linear_tid, block_itr, items);
        }

        /// Load a linear segment of items from memory, guarded by range (skips vectorization)
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items)                    ///< [in] Number of valid items to load
        {
            LoadDirectBlocked(linear_tid, block_itr, items, valid_items);
        }

        /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements (skips vectorization)
        template <typename InputIteratorT, typename DefaultT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items,                    ///< [in] Number of valid items to load
            DefaultT          oob_default)                    ///< [in] Default value to assign out-of-bound items
        {
            LoadDirectBlocked(linear_tid, block_itr, items, valid_items, oob_default);
        }

    };


    /**
     * BLOCK_LOAD_TRANSPOSE specialization of load helper
     */
    template <int DUMMY>
    struct LoadInternal<BLOCK_LOAD_TRANSPOSE, DUMMY>

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

            linear_tid(linear_tid)
        {}

        /// Load a linear segment of items from memory
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load{
        {
            LoadDirectStriped<BLOCK_THREADS>(linear_tid, block_itr, items);
            BlockExchange(temp_storage).StripedToBlocked(items, items);
        }

        /// Load a linear segment of items from memory, guarded by range
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items)                    ///< [in] Number of valid items to load
        {
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            LoadDirectStriped<BLOCK_THREADS>(linear_tid, block_itr, items, temp_storage.valid_items);
            BlockExchange(temp_storage).StripedToBlocked(items, items);
        }

        /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
        template <typename InputIteratorT, typename DefaultT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items,                    ///< [in] Number of valid items to load
            DefaultT        oob_default)                    ///< [in] Default value to assign out-of-bound items
        {
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            LoadDirectStriped<BLOCK_THREADS>(linear_tid, block_itr, items, temp_storage.valid_items, oob_default);
            BlockExchange(temp_storage).StripedToBlocked(items, items);
        }

    };


    /**
     * BLOCK_LOAD_WARP_TRANSPOSE specialization of load helper
     */
    template <int DUMMY>
    struct LoadInternal<BLOCK_LOAD_WARP_TRANSPOSE, DUMMY>

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

            linear_tid(linear_tid)
        {}

        /// Load a linear segment of items from memory
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load{
        {
            LoadDirectWarpStriped(linear_tid, block_itr, items);
            BlockExchange(temp_storage).WarpStripedToBlocked(items, items);
        }

        /// Load a linear segment of items from memory, guarded by range
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items)                    ///< [in] Number of valid items to load
        {
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            LoadDirectWarpStriped(linear_tid, block_itr, items, temp_storage.valid_items);
            BlockExchange(temp_storage).WarpStripedToBlocked(items, items);
        }


        /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
        template <typename InputIteratorT, typename DefaultT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items,                    ///< [in] Number of valid items to load
            DefaultT        oob_default)                    ///< [in] Default value to assign out-of-bound items
        {
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            LoadDirectWarpStriped(linear_tid, block_itr, items, temp_storage.valid_items, oob_default);
            BlockExchange(temp_storage).WarpStripedToBlocked(items, items);
        }
    };


    /**
     * BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED specialization of load helper
     */
    template <int DUMMY>
    struct LoadInternal<BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED, DUMMY>
    {

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

            linear_tid(linear_tid)
        {}

        /// Load a linear segment of items from memory
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load{
        {
            LoadDirectWarpStriped(linear_tid, block_itr, items);
            BlockExchange(temp_storage).WarpStripedToBlocked(items, items);
        }

        /// Load a linear segment of items from memory, guarded by range
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items)                    ///< [in] Number of valid items to load
        {
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            LoadDirectWarpStriped(linear_tid, block_itr, items, temp_storage.valid_items);
            BlockExchange(temp_storage).WarpStripedToBlocked(items, items);
        }


        /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
        template <typename InputIteratorT, typename DefaultT>
        __device__ __forceinline__ void Load(
            InputIteratorT  block_itr,                      ///< [in] The thread block's base input iterator for loading from
            InputT          (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
            int             valid_items,                    ///< [in] Number of valid items to load
            DefaultT        oob_default)                    ///< [in] Default value to assign out-of-bound items
        {
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            LoadDirectWarpStriped(linear_tid, block_itr, items, temp_storage.valid_items, oob_default);
            BlockExchange(temp_storage).WarpStripedToBlocked(items, items);
        }
    };


    /******************************************************************************
     * Type definitions
     ******************************************************************************/

    /// Internal load implementation to use
    typedef LoadInternal<ALGORITHM, 0> InternalLoad;

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

    /******************************************************************//**
     * \name Data movement
     *********************************************************************/
    //@{


    /**
     * \brief Load a linear segment of items from memory.
     *
     * \par
     * - \blocked
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the loading of a linear
     * segment of 512 integers into a "blocked" arrangement across 128 threads where each
     * thread owns 4 consecutive items.  The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE,
     * meaning memory references are efficiently coalesced using a warp-striped access
     * pattern (after which items are locally reordered among threads).
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_load.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, ...)
     * {
     *     // Specialize BlockLoad for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockLoad<int, 128, 4, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
     *
     *     // Allocate shared memory for BlockLoad
     *     __shared__ typename BlockLoad::TempStorage temp_storage;
     *
     *     // Load a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     BlockLoad(temp_storage).Load(d_data, thread_data);
     *
     * \endcode
     * \par
     * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, 5, ...</tt>.
     * The set of \p thread_data across the block of threads in those threads will be
     * <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.
     *
     */

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

        InputT          (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load
    {
        InternalLoad(temp_storage, linear_tid).Load(block_itr, items);
    }


    /**
     * \brief Load a linear segment of items from memory, guarded by range.
     *
     * \par
     * - \blocked
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the guarded loading of a linear
     * segment of 512 integers into a "blocked" arrangement across 128 threads where each
     * thread owns 4 consecutive items.  The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE,
     * meaning memory references are efficiently coalesced using a warp-striped access
     * pattern (after which items are locally reordered among threads).
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_load.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, int valid_items, ...)
     * {
     *     // Specialize BlockLoad for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockLoad<int, 128, 4, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
     *
     *     // Allocate shared memory for BlockLoad
     *     __shared__ typename BlockLoad::TempStorage temp_storage;
     *
     *     // Load a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     BlockLoad(temp_storage).Load(d_data, thread_data, valid_items);
     *
     * \endcode
     * \par
     * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, 5, 6...</tt> and \p valid_items is \p 5.
     * The set of \p thread_data across the block of threads in those threads will be
     * <tt>{ [0,1,2,3], [4,?,?,?], ..., [?,?,?,?] }</tt>, with only the first two threads
     * being unmasked to load portions of valid data (and other items remaining unassigned).
     *

xgboost/cub/cub/block/block_load.cuh  view on Meta::CPAN

        int             valid_items)                ///< [in] Number of valid items to load
    {
        InternalLoad(temp_storage, linear_tid).Load(block_itr, items, valid_items);
    }


    /**
     * \brief Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements
     *
     * \par
     * - \blocked
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the guarded loading of a linear
     * segment of 512 integers into a "blocked" arrangement across 128 threads where each
     * thread owns 4 consecutive items.  The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE,
     * meaning memory references are efficiently coalesced using a warp-striped access
     * pattern (after which items are locally reordered among threads).
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_load.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, int valid_items, ...)
     * {
     *     // Specialize BlockLoad for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockLoad<int, 128, 4, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
     *
     *     // Allocate shared memory for BlockLoad
     *     __shared__ typename BlockLoad::TempStorage temp_storage;
     *
     *     // Load a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     BlockLoad(temp_storage).Load(d_data, thread_data, valid_items, -1);
     *
     * \endcode
     * \par
     * Suppose the input \p d_data is <tt>0, 1, 2, 3, 4, 5, 6...</tt>,
     * \p valid_items is \p 5, and the out-of-bounds default is \p -1.
     * The set of \p thread_data across the block of threads in those threads will be
     * <tt>{ [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }</tt>, with only the first two threads
     * being unmasked to load portions of valid data (and other items are assigned \p -1)

xgboost/cub/cub/block/block_radix_rank.cuh  view on Meta::CPAN

 * \tparam MEMOIZE_OUTER_SCAN   <b>[optional]</b> Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise).  Se...
 * \tparam INNER_SCAN_ALGORITHM <b>[optional]</b> The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS)
 * \tparam SMEM_CONFIG          <b>[optional]</b> Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte)
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH             <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * Blah...
 * - Keys must be in a form suitable for radix ranking (i.e., unsigned bits).
 * - \blocked
 *
 * \par Performance Considerations
 * - \granularity
 *
 * \par Examples
 * \par
 * - <b>Example 1:</b> Simple radix rank of 32-bit integer keys
 *      \code
 *      #include <cub/cub.cuh>
 *

xgboost/cub/cub/block/block_radix_sort.cuh  view on Meta::CPAN

 *   that ensure lexicographic key ordering.
 * - \rowmajor
 *
 * \par Performance Considerations
 * - \granularity
 *
 * \par A Simple Example
 * \blockcollective{BlockRadixSort}
 * \par
 * The code snippet below illustrates a sort of 512 integer keys that
 * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
 * where each thread owns 4 consecutive items.
 * \par
 * \code
 * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
 *
 * __global__ void ExampleKernel(...)
 * {
 *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer items each
 *     typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
 *
 *     // Allocate shared memory for BlockRadixSort
 *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
 *
 *     // Obtain a segment of consecutive items that are blocked across threads
 *     int thread_keys[4];
 *     ...
 *
 *     // Collectively sort the keys
 *     BlockRadixSort(temp_storage).Sort(thread_keys);
 *
 *     ...
 * \endcode
 * \par
 * Suppose the set of input \p thread_keys across the block of threads is

xgboost/cub/cub/block/block_radix_sort.cuh  view on Meta::CPAN

        int             pass_bits,
        Int2Type<true>  /*is_descending*/)
    {
        DescendingBlockRadixRank(temp_storage.descending_ranking_storage).RankKeys(
            unsigned_keys,
            ranks,
            begin_bit,
            pass_bits);
    }

    /// ExchangeValues (specialized for key-value sort, to-blocked arrangement)
    __device__ __forceinline__ void ExchangeValues(
        ValueT          (&values)[ITEMS_PER_THREAD],
        int             (&ranks)[ITEMS_PER_THREAD],
        Int2Type<false> /*is_keys_only*/,
        Int2Type<true>  /*is_blocked*/)
    {
        CTA_SYNC();

        // Exchange values through shared memory in blocked arrangement
        BlockExchangeValues(temp_storage.exchange_values).ScatterToBlocked(values, ranks);
    }

    /// ExchangeValues (specialized for key-value sort, to-striped arrangement)
    __device__ __forceinline__ void ExchangeValues(
        ValueT          (&values)[ITEMS_PER_THREAD],
        int             (&ranks)[ITEMS_PER_THREAD],
        Int2Type<false> /*is_keys_only*/,
        Int2Type<false> /*is_blocked*/)
    {
        CTA_SYNC();

        // Exchange values through shared memory in blocked arrangement
        BlockExchangeValues(temp_storage.exchange_values).ScatterToStriped(values, ranks);
    }

    /// ExchangeValues (specialized for keys-only sort)
    template <int IS_BLOCKED>
    __device__ __forceinline__ void ExchangeValues(
        ValueT                  (&/*values*/)[ITEMS_PER_THREAD],
        int                     (&/*ranks*/)[ITEMS_PER_THREAD],
        Int2Type<true>          /*is_keys_only*/,
        Int2Type<IS_BLOCKED>    /*is_blocked*/)
    {}

    /// Sort blocked arrangement
    template <int DESCENDING, int KEYS_ONLY>
    __device__ __forceinline__ void SortBlocked(
        KeyT                    (&keys)[ITEMS_PER_THREAD],          ///< Keys to sort
        ValueT                  (&values)[ITEMS_PER_THREAD],        ///< Values to sort
        int                     begin_bit,                          ///< The beginning (least-significant) bit index needed for key comparison
        int                     end_bit,                            ///< The past-the-end (most-significant) bit index needed for key comparison
        Int2Type<DESCENDING>    is_descending,                      ///< Tag whether is a descending-order sort
        Int2Type<KEYS_ONLY>     is_keys_only)                       ///< Tag whether is keys-only sort
    {
        UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] =
            reinterpret_cast<UnsignedBits (&)[ITEMS_PER_THREAD]>(keys);

xgboost/cub/cub/block/block_radix_sort.cuh  view on Meta::CPAN

        for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
        {
            unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]);
        }

        // Radix sorting passes
        while (true)
        {
            int pass_bits = CUB_MIN(RADIX_BITS, end_bit - begin_bit);

            // Rank the blocked keys
            int ranks[ITEMS_PER_THREAD];
            RankKeys(unsigned_keys, ranks, begin_bit, pass_bits, is_descending);
            begin_bit += RADIX_BITS;

            CTA_SYNC();

            // Exchange keys through shared memory in blocked arrangement
            BlockExchangeKeys(temp_storage.exchange_keys).ScatterToBlocked(keys, ranks);

            // Exchange values through shared memory in blocked arrangement
            ExchangeValues(values, ranks, is_keys_only, Int2Type<true>());

            // Quit if done
            if (begin_bit >= end_bit) break;

            CTA_SYNC();
        }

        // Untwiddle bits if necessary
        #pragma unroll
        for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
        {
            unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]);
        }
    }

public:

#ifndef DOXYGEN_SHOULD_SKIP_THIS    // Do not document

    /// Sort blocked -> striped arrangement
    template <int DESCENDING, int KEYS_ONLY>
    __device__ __forceinline__ void SortBlockedToStriped(
        KeyT                    (&keys)[ITEMS_PER_THREAD],          ///< Keys to sort
        ValueT                  (&values)[ITEMS_PER_THREAD],        ///< Values to sort
        int                     begin_bit,                          ///< The beginning (least-significant) bit index needed for key comparison
        int                     end_bit,                            ///< The past-the-end (most-significant) bit index needed for key comparison
        Int2Type<DESCENDING>    is_descending,                      ///< Tag whether is a descending-order sort
        Int2Type<KEYS_ONLY>     is_keys_only)                       ///< Tag whether is keys-only sort
    {
        UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] =
            reinterpret_cast<UnsignedBits (&)[ITEMS_PER_THREAD]>(keys);

xgboost/cub/cub/block/block_radix_sort.cuh  view on Meta::CPAN

        for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
        {
            unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]);
        }

        // Radix sorting passes
        while (true)
        {
            int pass_bits = CUB_MIN(RADIX_BITS, end_bit - begin_bit);

            // Rank the blocked keys
            int ranks[ITEMS_PER_THREAD];
            RankKeys(unsigned_keys, ranks, begin_bit, pass_bits, is_descending);
            begin_bit += RADIX_BITS;

            CTA_SYNC();

            // Check if this is the last pass
            if (begin_bit >= end_bit)
            {
                // Last pass exchanges keys through shared memory in striped arrangement
                BlockExchangeKeys(temp_storage.exchange_keys).ScatterToStriped(keys, ranks);

                // Last pass exchanges through shared memory in striped arrangement
                ExchangeValues(values, ranks, is_keys_only, Int2Type<false>());

                // Quit
                break;
            }

            // Exchange keys through shared memory in blocked arrangement
            BlockExchangeKeys(temp_storage.exchange_keys).ScatterToBlocked(keys, ranks);

            // Exchange values through shared memory in blocked arrangement
            ExchangeValues(values, ranks, is_keys_only, Int2Type<true>());

            CTA_SYNC();
        }

        // Untwiddle bits if necessary
        #pragma unroll
        for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++)
        {
            unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]);

xgboost/cub/cub/block/block_radix_sort.cuh  view on Meta::CPAN

    __device__ __forceinline__ BlockRadixSort(
        TempStorage &temp_storage)             ///< [in] Reference to memory allocation having layout type TempStorage
    :
        temp_storage(temp_storage.Alias()),
        linear_tid(RowMajorTid(BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z))
    {}


    //@}  end member group
    /******************************************************************//**
     * \name Sorting (blocked arrangements)
     *********************************************************************/
    //@{

    /**
     * \brief Performs an ascending block-wide radix sort over a [<em>blocked arrangement</em>](index.html#sec5sec3) of keys.
     *
     * \par
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sort of 512 integer keys that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive keys.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
     *     typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
     *
     *     // Allocate shared memory for BlockRadixSort
     *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_keys[4];
     *     ...
     *
     *     // Collectively sort the keys
     *     BlockRadixSort(temp_storage).Sort(thread_keys);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_keys across the block of threads is
     * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>.
     * The corresponding output \p thread_keys in those threads will be
     * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
     */
    __device__ __forceinline__ void Sort(
        KeyT    (&keys)[ITEMS_PER_THREAD],          ///< [in-out] Keys to sort
        int     begin_bit   = 0,                    ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
        int     end_bit     = sizeof(KeyT) * 8)      ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
    {
        NullType values[ITEMS_PER_THREAD];

        SortBlocked(keys, values, begin_bit, end_bit, Int2Type<false>(), Int2Type<KEYS_ONLY>());
    }


    /**
     * \brief Performs an ascending block-wide radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec3) of keys and values.
     *
     * \par
     * - BlockRadixSort can only accommodate one associated tile of values. To "truck along"
     *   more than one tile of values, simply perform a key-value sort of the keys paired
     *   with a temporary value array that enumerates the key indices.  The reordered indices
     *   can then be used as a gather-vector for exchanging other associated tile data through
     *   shared memory.
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sort of 512 integer keys and values that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive pairs.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
     *     typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
     *
     *     // Allocate shared memory for BlockRadixSort
     *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_keys[4];
     *     int thread_values[4];
     *     ...
     *
     *     // Collectively sort the keys and values among block threads
     *     BlockRadixSort(temp_storage).Sort(thread_keys, thread_values);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_keys across the block of threads is

xgboost/cub/cub/block/block_radix_sort.cuh  view on Meta::CPAN

     * corresponding output \p thread_keys in those threads will be
     * <tt>{ [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }</tt>.
     *
     */
    __device__ __forceinline__ void Sort(
        KeyT    (&keys)[ITEMS_PER_THREAD],          ///< [in-out] Keys to sort
        ValueT  (&values)[ITEMS_PER_THREAD],        ///< [in-out] Values to sort
        int     begin_bit   = 0,                    ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
        int     end_bit     = sizeof(KeyT) * 8)      ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
    {
        SortBlocked(keys, values, begin_bit, end_bit, Int2Type<false>(), Int2Type<KEYS_ONLY>());
    }

    /**
     * \brief Performs a descending block-wide radix sort over a [<em>blocked arrangement</em>](index.html#sec5sec3) of keys.
     *
     * \par
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sort of 512 integer keys that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive keys.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
     *     typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
     *
     *     // Allocate shared memory for BlockRadixSort
     *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_keys[4];
     *     ...
     *
     *     // Collectively sort the keys
     *     BlockRadixSort(temp_storage).Sort(thread_keys);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_keys across the block of threads is
     * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>.
     * The corresponding output \p thread_keys in those threads will be
     * <tt>{ [511,510,509,508], [11,10,9,8], [7,6,5,4], ..., [3,2,1,0] }</tt>.
     */
    __device__ __forceinline__ void SortDescending(
        KeyT    (&keys)[ITEMS_PER_THREAD],          ///< [in-out] Keys to sort
        int     begin_bit   = 0,                    ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
        int     end_bit     = sizeof(KeyT) * 8)      ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
    {
        NullType values[ITEMS_PER_THREAD];

        SortBlocked(keys, values, begin_bit, end_bit, Int2Type<true>(), Int2Type<KEYS_ONLY>());
    }


    /**
     * \brief Performs a descending block-wide radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec3) of keys and values.
     *
     * \par
     * - BlockRadixSort can only accommodate one associated tile of values. To "truck along"
     *   more than one tile of values, simply perform a key-value sort of the keys paired
     *   with a temporary value array that enumerates the key indices.  The reordered indices
     *   can then be used as a gather-vector for exchanging other associated tile data through
     *   shared memory.
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sort of 512 integer keys and values that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive pairs.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
     *     typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
     *
     *     // Allocate shared memory for BlockRadixSort
     *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_keys[4];
     *     int thread_values[4];
     *     ...
     *
     *     // Collectively sort the keys and values among block threads
     *     BlockRadixSort(temp_storage).Sort(thread_keys, thread_values);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_keys across the block of threads is

xgboost/cub/cub/block/block_radix_sort.cuh  view on Meta::CPAN

     * corresponding output \p thread_keys in those threads will be
     * <tt>{ [511,510,509,508], [11,10,9,8], [7,6,5,4], ..., [3,2,1,0] }</tt>.
     *
     */
    __device__ __forceinline__ void SortDescending(
        KeyT    (&keys)[ITEMS_PER_THREAD],          ///< [in-out] Keys to sort
        ValueT  (&values)[ITEMS_PER_THREAD],        ///< [in-out] Values to sort
        int     begin_bit   = 0,                    ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
        int     end_bit     = sizeof(KeyT) * 8)      ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
    {
        SortBlocked(keys, values, begin_bit, end_bit, Int2Type<true>(), Int2Type<KEYS_ONLY>());
    }


    //@}  end member group
    /******************************************************************//**
     * \name Sorting (blocked arrangement -> striped arrangement)
     *********************************************************************/
    //@{


    /**
     * \brief Performs an ascending radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec3) of keys, leaving them in a [<em>striped arrangement</em>](index.html#sec5sec3).
     *
     * \par
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sort of 512 integer keys that
     * are initially partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive keys.  The final partitioning is striped.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
     *     typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
     *
     *     // Allocate shared memory for BlockRadixSort
     *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_keys[4];
     *     ...
     *
     *     // Collectively sort the keys
     *     BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_keys across the block of threads is
     * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>.  The
     * corresponding output \p thread_keys in those threads will be
     * <tt>{ [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }</tt>.
     *
     */
    __device__ __forceinline__ void SortBlockedToStriped(
        KeyT    (&keys)[ITEMS_PER_THREAD],          ///< [in-out] Keys to sort
        int     begin_bit   = 0,                    ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
        int     end_bit     = sizeof(KeyT) * 8)      ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
    {
        NullType values[ITEMS_PER_THREAD];

        SortBlockedToStriped(keys, values, begin_bit, end_bit, Int2Type<false>(), Int2Type<KEYS_ONLY>());
    }


    /**
     * \brief Performs an ascending radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec3) of keys and values, leaving them in a [<em>striped arrangement</em>](index.html#sec5sec3).
     *
     * \par
     * - BlockRadixSort can only accommodate one associated tile of values. To "truck along"
     *   more than one tile of values, simply perform a key-value sort of the keys paired
     *   with a temporary value array that enumerates the key indices.  The reordered indices
     *   can then be used as a gather-vector for exchanging other associated tile data through
     *   shared memory.
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sort of 512 integer keys and values that
     * are initially partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive pairs.  The final partitioning is striped.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
     *     typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
     *
     *     // Allocate shared memory for BlockRadixSort
     *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_keys[4];
     *     int thread_values[4];
     *     ...
     *
     *     // Collectively sort the keys and values among block threads
     *     BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_keys across the block of threads is
     * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>.  The
     * corresponding output \p thread_keys in those threads will be
     * <tt>{ [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }</tt>.
     *
     */
    __device__ __forceinline__ void SortBlockedToStriped(
        KeyT    (&keys)[ITEMS_PER_THREAD],          ///< [in-out] Keys to sort
        ValueT  (&values)[ITEMS_PER_THREAD],        ///< [in-out] Values to sort
        int     begin_bit   = 0,                    ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
        int     end_bit     = sizeof(KeyT) * 8)      ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
    {
        SortBlockedToStriped(keys, values, begin_bit, end_bit, Int2Type<false>(), Int2Type<KEYS_ONLY>());
    }


    /**
     * \brief Performs a descending radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec3) of keys, leaving them in a [<em>striped arrangement</em>](index.html#sec5sec3).
     *
     * \par
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sort of 512 integer keys that
     * are initially partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive keys.  The final partitioning is striped.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys each
     *     typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
     *
     *     // Allocate shared memory for BlockRadixSort
     *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_keys[4];
     *     ...
     *
     *     // Collectively sort the keys
     *     BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_keys across the block of threads is
     * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>.  The
     * corresponding output \p thread_keys in those threads will be
     * <tt>{ [511,383,255,127], [386,258,130,2], [385,257,128,1], ..., [384,256,128,0] }</tt>.
     *
     */
    __device__ __forceinline__ void SortDescendingBlockedToStriped(
        KeyT    (&keys)[ITEMS_PER_THREAD],          ///< [in-out] Keys to sort
        int     begin_bit   = 0,                    ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
        int     end_bit     = sizeof(KeyT) * 8)      ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
    {
        NullType values[ITEMS_PER_THREAD];

        SortBlockedToStriped(keys, values, begin_bit, end_bit, Int2Type<true>(), Int2Type<KEYS_ONLY>());
    }


    /**
     * \brief Performs a descending radix sort across a [<em>blocked arrangement</em>](index.html#sec5sec3) of keys and values, leaving them in a [<em>striped arrangement</em>](index.html#sec5sec3).
     *
     * \par
     * - BlockRadixSort can only accommodate one associated tile of values. To "truck along"
     *   more than one tile of values, simply perform a key-value sort of the keys paired
     *   with a temporary value array that enumerates the key indices.  The reordered indices
     *   can then be used as a gather-vector for exchanging other associated tile data through
     *   shared memory.
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sort of 512 integer keys and values that
     * are initially partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive pairs.  The final partitioning is striped.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer keys and values each
     *     typedef cub::BlockRadixSort<int, 128, 4, int> BlockRadixSort;
     *
     *     // Allocate shared memory for BlockRadixSort
     *     __shared__ typename BlockRadixSort::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_keys[4];
     *     int thread_values[4];
     *     ...
     *
     *     // Collectively sort the keys and values among block threads
     *     BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_keys across the block of threads is
     * <tt>{ [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }</tt>.  The
     * corresponding output \p thread_keys in those threads will be
     * <tt>{ [511,383,255,127], [386,258,130,2], [385,257,128,1], ..., [384,256,128,0] }</tt>.
     *
     */
    __device__ __forceinline__ void SortDescendingBlockedToStriped(
        KeyT    (&keys)[ITEMS_PER_THREAD],          ///< [in-out] Keys to sort
        ValueT  (&values)[ITEMS_PER_THREAD],        ///< [in-out] Values to sort
        int     begin_bit   = 0,                    ///< [in] <b>[optional]</b> The beginning (least-significant) bit index needed for key comparison
        int     end_bit     = sizeof(KeyT) * 8)      ///< [in] <b>[optional]</b> The past-the-end (most-significant) bit index needed for key comparison
    {
        SortBlockedToStriped(keys, values, begin_bit, end_bit, Int2Type<true>(), Int2Type<KEYS_ONLY>());
    }


    //@}  end member group

};

/**
 * \example example_block_radix_sort.cu
 */

xgboost/cub/cub/block/block_reduce.cuh  view on Meta::CPAN

     *   higher than to BLOCK_REDUCE_WARP_REDUCTIONS and thus less-desirable
     *   when the GPU is under-occupied.
     */
    BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY,


    /**
     * \par Overview
     * An efficient "raking" reduction algorithm that supports commutative
     * (e.g., addition) and non-commutative (e.g., string concatenation) reduction
     * operators. \blocked.
     *
     * \par
     * Execution is comprised of three phases:
     * -# Upsweep sequential reduction in registers (if threads contribute more
     *    than one input each).  Each thread then places the partial reduction
     *    of its item(s) into shared memory.
     * -# Upsweep sequential reduction in shared memory.  Threads within a
     *    single warp rake across segments of shared partial reductions.
     * -# A warp-synchronous Kogge-Stone style reduction within the raking warp.
     *

xgboost/cub/cub/block/block_reduce.cuh  view on Meta::CPAN

 * - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
 *   - Summation (<b><em>vs.</em></b> generic reduction)
 *   - \p BLOCK_THREADS is a multiple of the architecture's warp size
 *   - Every thread has a valid input (i.e., full <b><em>vs.</em></b> partial-tiles)
 * - See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives
 *
 * \par A Simple Example
 * \blockcollective{BlockReduce}
 * \par
 * The code snippet below illustrates a sum reduction of 512 integer items that
 * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
 * where each thread owns 4 consecutive items.
 * \par
 * \code
 * #include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>
 *
 * __global__ void ExampleKernel(...)
 * {
 *     // Specialize BlockReduce for a 1D block of 128 threads on type int
 *     typedef cub::BlockReduce<int, 128> BlockReduce;
 *
 *     // Allocate shared memory for BlockReduce
 *     __shared__ typename BlockReduce::TempStorage temp_storage;
 *
 *     // Obtain a segment of consecutive items that are blocked across threads
 *     int thread_data[4];
 *     ...
 *
 *     // Compute the block-wide sum for thread0
 *     int aggregate = BlockReduce(temp_storage).Sum(thread_data);
 *
 * \endcode
 *
 */
template <

xgboost/cub/cub/block/block_reduce.cuh  view on Meta::CPAN

    /**
     * \brief Computes a block-wide reduction for thread<sub>0</sub> using the specified binary reduction functor.  Each thread contributes an array of consecutive input elements.
     *
     * \par
     * - The return value is undefined in threads other than thread<sub>0</sub>.
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a max reduction of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockReduce for a 1D block of 128 threads on type int
     *     typedef cub::BlockReduce<int, 128> BlockReduce;
     *
     *     // Allocate shared memory for BlockReduce
     *     __shared__ typename BlockReduce::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Compute the block-wide max for thread0
     *     int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
     *
     * \endcode
     *
     * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
     * \tparam ReductionOp          <b>[inferred]</b> Binary reduction functor  type having member <tt>T operator()(const T &a, const T &b)</tt>

xgboost/cub/cub/block/block_reduce.cuh  view on Meta::CPAN

    /**
     * \brief Computes a block-wide reduction for thread<sub>0</sub> using addition (+) as the reduction operator.  Each thread contributes an array of consecutive input elements.
     *
     * \par
     * - The return value is undefined in threads other than thread<sub>0</sub>.
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a sum reduction of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockReduce for a 1D block of 128 threads on type int
     *     typedef cub::BlockReduce<int, 128> BlockReduce;
     *
     *     // Allocate shared memory for BlockReduce
     *     __shared__ typename BlockReduce::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Compute the block-wide sum for thread0
     *     int aggregate = BlockReduce(temp_storage).Sum(thread_data);
     *
     * \endcode
     *
     * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
     */

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

 * - Incurs zero bank conflicts for most types
 * - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
 *   - Prefix sum variants (<b><em>vs.</em></b> generic scan)
 *   - \blocksize
 * - See cub::BlockScanAlgorithm for performance details regarding algorithmic alternatives
 *
 * \par A Simple Example
 * \blockcollective{BlockScan}
 * \par
 * The code snippet below illustrates an exclusive prefix sum of 512 integer items that
 * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
 * where each thread owns 4 consecutive items.
 * \par
 * \code
 * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
 *
 * __global__ void ExampleKernel(...)
 * {
 *     // Specialize BlockScan for a 1D block of 128 threads on type int
 *     typedef cub::BlockScan<int, 128> BlockScan;
 *
 *     // Allocate shared memory for BlockScan
 *     __shared__ typename BlockScan::TempStorage temp_storage;
 *
 *     // Obtain a segment of consecutive items that are blocked across threads
 *     int thread_data[4];
 *     ...
 *
 *     // Collectively compute the block-wide exclusive prefix sum
 *     BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
 *
 * \endcode
 * \par
 * Suppose the set of input \p thread_data across the block of threads is
 * <tt>{[1,1,1,1], [1,1,1,1], ..., [1,1,1,1]}</tt>.

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Initialize running total
     *     BlockPrefixCallbackOp prefix_op(0);
     *
     *     // Have the block iterate over segments of items
     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128)
     *     {
     *         // Load a segment of consecutive items that are blocked across threads
     *         int thread_data = d_data[block_offset];
     *
     *         // Collectively compute the block-wide exclusive prefix sum
     *         BlockScan(temp_storage).ExclusiveSum(
     *             thread_data, thread_data, prefix_op);
     *         CTA_SYNC();
     *
     *         // Store scanned items to output segment
     *         d_data[block_offset] = thread_data;
     *     }

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     * \name Exclusive prefix sum operations (multiple data per thread)
     *********************************************************************/
    //@{


    /**
     * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator.  Each thread contributes an array of consecutive input elements.  The value of 0 is applied as the initial value, and is assigned to \p output[0] in <...
     *
     * \par
     * - \identityzero
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates an exclusive prefix sum of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockScan for a 1D block of 128 threads on type int
     *     typedef cub::BlockScan<int, 128> BlockScan;
     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute the block-wide exclusive prefix sum
     *     BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is <tt>{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }</tt>.  The
     * corresponding output \p thread_data in those threads will be <tt>{ [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }</tt>.

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

        T initial_value = 0;
        ExclusiveScan(input, output, initial_value, cub::Sum());
    }


    /**
     * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator.  Each thread contributes an array of consecutive input elements.  The value of 0 is applied as the initial value, and is assigned to \p output[0] in <...
     *
     * \par
     * - \identityzero
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates an exclusive prefix sum of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockScan for a 1D block of 128 threads on type int
     *     typedef cub::BlockScan<int, 128> BlockScan;
     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute the block-wide exclusive prefix sum
     *     int block_aggregate;
     *     BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is <tt>{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }</tt>.  The

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN


    /**
     * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator.  Each thread contributes an array of consecutive input elements.  Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_c...
     *
     * \par
     * - \identityzero
     * - The \p block_prefix_callback_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
     *   The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
     *   The functor will be invoked by the first warp of threads in the block, however only the return value from
     *   <em>lane</em><sub>0</sub> is applied as the block-wide prefix.  Can be stateful.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a single thread block that progressively
     * computes an exclusive prefix sum over multiple "tiles" of input using a
     * prefix functor to maintain a running total between block-wide scans.  Each tile consists
     * of 512 integer items that are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3)
     * across 128 threads where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * // A stateful callback functor that maintains a running prefix to be applied
     * // during consecutive scan operations.
     * struct BlockPrefixCallbackOp
     * {
     *     // Running prefix

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     *         typename BlockScan::TempStorage     scan;
     *         typename BlockStore::TempStorage    store;
     *     } temp_storage;
     *
     *     // Initialize running total
     *     BlockPrefixCallbackOp prefix_op(0);
     *
     *     // Have the block iterate over segments of items
     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
     *     {
     *         // Load a segment of consecutive items that are blocked across threads
     *         int thread_data[4];
     *         BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
     *         CTA_SYNC();
     *
     *         // Collectively compute the block-wide exclusive prefix sum
     *         int block_aggregate;
     *         BlockScan(temp_storage.scan).ExclusiveSum(
     *             thread_data, thread_data, prefix_op);
     *         CTA_SYNC();
     *

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Initialize running total
     *     BlockPrefixCallbackOp prefix_op(INT_MIN);
     *
     *     // Have the block iterate over segments of items
     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128)
     *     {
     *         // Load a segment of consecutive items that are blocked across threads
     *         int thread_data = d_data[block_offset];
     *
     *         // Collectively compute the block-wide exclusive prefix max scan
     *         BlockScan(temp_storage).ExclusiveScan(
     *             thread_data, thread_data, INT_MIN, cub::Max(), prefix_op);
     *         CTA_SYNC();
     *
     *         // Store scanned items to output segment
     *         d_data[block_offset] = thread_data;
     *     }

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     * \name Exclusive prefix scan operations (multiple data per thread)
     *********************************************************************/
    //@{


    /**
     * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.
     *
     * \par
     * - Supports non-commutative scan operators.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates an exclusive prefix max scan of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockScan for a 1D block of 128 threads on type int
     *     typedef cub::BlockScan<int, 128> BlockScan;
     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute the block-wide exclusive prefix max scan
     *     BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max());
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is
     * <tt>{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }</tt>.

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

        // Exclusive scan in registers with prefix as seed
        ThreadScanExclusive(input, output, scan_op, thread_prefix);
    }


    /**
     * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.  Also provides every thread with the block-wide \p block_aggregate of all input...
     *
     * \par
     * - Supports non-commutative scan operators.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates an exclusive prefix max scan of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockScan for a 1D block of 128 threads on type int
     *     typedef cub::BlockScan<int, 128> BlockScan;
     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute the block-wide exclusive prefix max scan
     *     int block_aggregate;
     *     BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is <tt>{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }</tt>.  The

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN


    /**
     * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.  the call-back functor \p block_prefix_callback_op is invoked by the first warp...
     *
     * \par
     * - The \p block_prefix_callback_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
     *   The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
     *   The functor will be invoked by the first warp of threads in the block, however only the return value from
     *   <em>lane</em><sub>0</sub> is applied as the block-wide prefix.  Can be stateful.
     * - Supports non-commutative scan operators.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a single thread block that progressively
     * computes an exclusive prefix max scan over multiple "tiles" of input using a
     * prefix functor to maintain a running total between block-wide scans.  Each tile consists
     * of 128 integer items that are partitioned across 128 threads.
     * \par
     * \code

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     *         typename BlockScan::TempStorage     scan;
     *         typename BlockStore::TempStorage    store;
     *     } temp_storage;
     *
     *     // Initialize running total
     *     BlockPrefixCallbackOp prefix_op(0);
     *
     *     // Have the block iterate over segments of items
     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
     *     {
     *         // Load a segment of consecutive items that are blocked across threads
     *         int thread_data[4];
     *         BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
     *         CTA_SYNC();
     *
     *         // Collectively compute the block-wide exclusive prefix max scan
     *         BlockScan(temp_storage.scan).ExclusiveScan(
     *             thread_data, thread_data, INT_MIN, cub::Max(), prefix_op);
     *         CTA_SYNC();
     *
     *         // Store scanned items to output segment

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     * \name Exclusive prefix scan operations (no initial value, multiple data per thread)
     *********************************************************************/
    //@{


    /**
     * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.  With no initial value, the output computed for <em>thread</em><sub>0</sub> is ...
     *
     * \par
     * - Supports non-commutative scan operators.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
     * \tparam ScanOp               <b>[inferred]</b> Binary scan functor  type having member <tt>T operator()(const T &a, const T &b)</tt>
     */
    template <
        int             ITEMS_PER_THREAD,
        typename        ScanOp>
    __device__ __forceinline__ void ExclusiveScan(

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

        // Exclusive scan in registers with prefix
        ThreadScanExclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
    }


    /**
     * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.  Also provides every thread with the block-wide \p block_aggregate of all input...
     *
     * \par
     * - Supports non-commutative scan operators.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
     * \tparam ScanOp               <b>[inferred]</b> Binary scan functor  type having member <tt>T operator()(const T &a, const T &b)</tt>
     */
    template <
        int             ITEMS_PER_THREAD,
        typename        ScanOp>
    __device__ __forceinline__ void ExclusiveScan(

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Initialize running total
     *     BlockPrefixCallbackOp prefix_op(0);
     *
     *     // Have the block iterate over segments of items
     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128)
     *     {
     *         // Load a segment of consecutive items that are blocked across threads
     *         int thread_data = d_data[block_offset];
     *
     *         // Collectively compute the block-wide inclusive prefix sum
     *         BlockScan(temp_storage).InclusiveSum(
     *             thread_data, thread_data, prefix_op);
     *         CTA_SYNC();
     *
     *         // Store scanned items to output segment
     *         d_data[block_offset] = thread_data;
     *     }

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

    /******************************************************************//**
     * \name Inclusive prefix sum operations (multiple data per thread)
     *********************************************************************/
    //@{


    /**
     * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator.  Each thread contributes an array of consecutive input elements.
     *
     * \par
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates an inclusive prefix sum of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockScan for a 1D block of 128 threads on type int
     *     typedef cub::BlockScan<int, 128> BlockScan;
     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute the block-wide inclusive prefix sum
     *     BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is <tt>{ [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }</tt>.  The
     * corresponding output \p thread_data in those threads will be <tt>{ [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }</tt>.

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

            // Inclusive scan in registers with prefix as seed
            ThreadScanInclusive(input, output, scan_op, thread_prefix, (linear_tid != 0));
        }
    }


    /**
     * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator.  Each thread contributes an array of consecutive input elements.  Also provides every thread with the block-wide \p block_aggregate of all inputs.
     *
     * \par
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates an inclusive prefix sum of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockScan for a 1D block of 128 threads on type int
     *     typedef cub::BlockScan<int, 128> BlockScan;
     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute the block-wide inclusive prefix sum
     *     int block_aggregate;
     *     BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN



    /**
     * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator.  Each thread contributes an array of consecutive input elements.  Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_c...
     *
     * \par
     * - The \p block_prefix_callback_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
     *   The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
     *   The functor will be invoked by the first warp of threads in the block, however only the return value from
     *   <em>lane</em><sub>0</sub> is applied as the block-wide prefix.  Can be stateful.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a single thread block that progressively
     * computes an inclusive prefix sum over multiple "tiles" of input using a
     * prefix functor to maintain a running total between block-wide scans.  Each tile consists
     * of 512 integer items that are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3)
     * across 128 threads where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * // A stateful callback functor that maintains a running prefix to be applied
     * // during consecutive scan operations.
     * struct BlockPrefixCallbackOp
     * {
     *     // Running prefix

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     *         typename BlockScan::TempStorage     scan;
     *         typename BlockStore::TempStorage    store;
     *     } temp_storage;
     *
     *     // Initialize running total
     *     BlockPrefixCallbackOp prefix_op(0);
     *
     *     // Have the block iterate over segments of items
     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
     *     {
     *         // Load a segment of consecutive items that are blocked across threads
     *         int thread_data[4];
     *         BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
     *         CTA_SYNC();
     *
     *         // Collectively compute the block-wide inclusive prefix sum
     *         BlockScan(temp_storage.scan).IncluisveSum(
     *             thread_data, thread_data, prefix_op);
     *         CTA_SYNC();
     *
     *         // Store scanned items to output segment

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Initialize running total
     *     BlockPrefixCallbackOp prefix_op(INT_MIN);
     *
     *     // Have the block iterate over segments of items
     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128)
     *     {
     *         // Load a segment of consecutive items that are blocked across threads
     *         int thread_data = d_data[block_offset];
     *
     *         // Collectively compute the block-wide inclusive prefix max scan
     *         BlockScan(temp_storage).InclusiveScan(
     *             thread_data, thread_data, cub::Max(), prefix_op);
     *         CTA_SYNC();
     *
     *         // Store scanned items to output segment
     *         d_data[block_offset] = thread_data;
     *     }

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     * \name Inclusive prefix scan operations (multiple data per thread)
     *********************************************************************/
    //@{


    /**
     * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.
     *
     * \par
     * - Supports non-commutative scan operators.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates an inclusive prefix max scan of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockScan for a 1D block of 128 threads on type int
     *     typedef cub::BlockScan<int, 128> BlockScan;
     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute the block-wide inclusive prefix max scan
     *     BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max());
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is <tt>{ [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }</tt>.  The
     * corresponding output \p thread_data in those threads will be <tt>{ [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }</tt>.

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

            ThreadScanInclusive(input, output, scan_op, thread_prefix, (linear_tid != 0));
        }
    }


    /**
     * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.  Also provides every thread with the block-wide \p block_aggregate of all input...
     *
     * \par
     * - Supports non-commutative scan operators.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates an inclusive prefix max scan of 512 integer items that
     * are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
     * where each thread owns 4 consecutive items.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_scan.cuh>
     *
     * __global__ void ExampleKernel(...)
     * {
     *     // Specialize BlockScan for a 1D block of 128 threads on type int
     *     typedef cub::BlockScan<int, 128> BlockScan;
     *
     *     // Allocate shared memory for BlockScan
     *     __shared__ typename BlockScan::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Collectively compute the block-wide inclusive prefix max scan
     *     int block_aggregate;
     *     BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate);
     *
     * \endcode
     * \par
     * Suppose the set of input \p thread_data across the block of threads is

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN


    /**
     * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor.  Each thread contributes an array of consecutive input elements.  the call-back functor \p block_prefix_callback_op is invoked by the first warp...
     *
     * \par
     * - The \p block_prefix_callback_op functor must implement a member function <tt>T operator()(T block_aggregate)</tt>.
     *   The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
     *   The functor will be invoked by the first warp of threads in the block, however only the return value from
     *   <em>lane</em><sub>0</sub> is applied as the block-wide prefix.  Can be stateful.
     * - Supports non-commutative scan operators.
     * - \blocked
     * - \granularity
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates a single thread block that progressively
     * computes an inclusive prefix max scan over multiple "tiles" of input using a
     * prefix functor to maintain a running total between block-wide scans.  Each tile consists
     * of 128 integer items that are partitioned across 128 threads.
     * \par
     * \code

xgboost/cub/cub/block/block_scan.cuh  view on Meta::CPAN

     *         typename BlockScan::TempStorage     scan;
     *         typename BlockStore::TempStorage    store;
     *     } temp_storage;
     *
     *     // Initialize running total
     *     BlockPrefixCallbackOp prefix_op(0);
     *
     *     // Have the block iterate over segments of items
     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
     *     {
     *         // Load a segment of consecutive items that are blocked across threads
     *         int thread_data[4];
     *         BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
     *         CTA_SYNC();
     *
     *         // Collectively compute the block-wide inclusive prefix max scan
     *         BlockScan(temp_storage.scan).InclusiveScan(
     *             thread_data, thread_data, cub::Max(), prefix_op);
     *         CTA_SYNC();
     *
     *         // Store scanned items to output segment

xgboost/cub/cub/block/block_shuffle.cuh  view on Meta::CPAN


        unsigned int offset = threadIdx.x + distance;
        if (offset >= BLOCK_THREADS)
            offset -= BLOCK_THREADS;

        output = temp_storage[offset].prev;
    }


    /**
     * \brief The thread block rotates its [<em>blocked arrangement</em>](index.html#sec5sec3) of \p input items, shifting it up by one item
     *
     * \par
     * - \blocked
     * - \granularity
     * - \smemreuse
     */
    template <int ITEMS_PER_THREAD>
    __device__ __forceinline__ void Up(
        T (&input)[ITEMS_PER_THREAD],   ///< [in] The calling thread's input items
        T (&prev)[ITEMS_PER_THREAD])    ///< [out] The corresponding predecessor items (may be aliased to \p input).  The item \p prev[0] is not updated for <em>thread</em><sub>0</sub>.
    {
        temp_storage[linear_tid].prev = input[ITEMS_PER_THREAD - 1];

xgboost/cub/cub/block/block_shuffle.cuh  view on Meta::CPAN

        for (int ITEM = ITEMS_PER_THREAD - 1; ITEM > 0; --ITEM)
            prev[ITEM] = input[ITEM - 1];


        if (linear_tid > 0)
            prev[0] = temp_storage[linear_tid - 1].prev;
    }


    /**
     * \brief The thread block rotates its [<em>blocked arrangement</em>](index.html#sec5sec3) of \p input items, shifting it up by one item.  All threads receive the \p input provided by <em>thread</em><sub><tt>BLOCK_THREADS-1</tt></sub>.
     *
     * \par
     * - \blocked
     * - \granularity
     * - \smemreuse
     */
    template <int ITEMS_PER_THREAD>
    __device__ __forceinline__ void Up(
        T (&input)[ITEMS_PER_THREAD],   ///< [in] The calling thread's input items
        T (&prev)[ITEMS_PER_THREAD],    ///< [out] The corresponding predecessor items (may be aliased to \p input).  The item \p prev[0] is not updated for <em>thread</em><sub>0</sub>.
        T &block_suffix)                ///< [out] The item \p input[ITEMS_PER_THREAD-1] from <em>thread</em><sub><tt>BLOCK_THREADS-1</tt></sub>, provided to all threads
    {
        Up(input, prev);
        block_suffix = temp_storage[BLOCK_THREADS - 1].prev;
    }


    /**
     * \brief The thread block rotates its [<em>blocked arrangement</em>](index.html#sec5sec3) of \p input items, shifting it down by one item
     *
     * \par
     * - \blocked
     * - \granularity
     * - \smemreuse
     */
    template <int ITEMS_PER_THREAD>
    __device__ __forceinline__ void Down(
        T (&input)[ITEMS_PER_THREAD],   ///< [in] The calling thread's input items
        T (&prev)[ITEMS_PER_THREAD])    ///< [out] The corresponding predecessor items (may be aliased to \p input).  The value \p prev[0] is not updated for <em>thread</em><sub>BLOCK_THREADS-1</sub>.
    {
        temp_storage[linear_tid].prev = input[ITEMS_PER_THREAD - 1];

xgboost/cub/cub/block/block_shuffle.cuh  view on Meta::CPAN

        #pragma unroll
        for (int ITEM = ITEMS_PER_THREAD - 1; ITEM > 0; --ITEM)
            prev[ITEM] = input[ITEM - 1];

        if (linear_tid > 0)
            prev[0] = temp_storage[linear_tid - 1].prev;
    }


    /**
     * \brief The thread block rotates its [<em>blocked arrangement</em>](index.html#sec5sec3) of input items, shifting it down by one item.  All threads receive \p input[0] provided by <em>thread</em><sub><tt>0</tt></sub>.
     *
     * \par
     * - \blocked
     * - \granularity
     * - \smemreuse
     */
    template <int ITEMS_PER_THREAD>
    __device__ __forceinline__ void Down(
        T (&input)[ITEMS_PER_THREAD],   ///< [in] The calling thread's input items
        T (&prev)[ITEMS_PER_THREAD],    ///< [out] The corresponding predecessor items (may be aliased to \p input).  The value \p prev[0] is not updated for <em>thread</em><sub>BLOCK_THREADS-1</sub>.
        T &block_prefix)                ///< [out] The item \p input[0] from <em>thread</em><sub><tt>0</tt></sub>, provided to all threads
    {
        Up(input, prev);

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

/// CUB namespace
namespace cub {

/**
 * \addtogroup UtilIo
 * @{
 */


/******************************************************************//**
 * \name Blocked arrangement I/O (direct)
 *********************************************************************/
//@{

/**
 * \brief Store a blocked arrangement of items across a thread block into a linear segment of items.
 *
 * \blocked
 *
 * \tparam T                    <b>[inferred]</b> The data type to store.
 * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
 * \tparam OutputIteratorT      <b>[inferred]</b> The random-access iterator type for output \iterator.
 */
template <
    typename            T,
    int                 ITEMS_PER_THREAD,
    typename            OutputIteratorT>
__device__ __forceinline__ void StoreDirectBlocked(
    int                 linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
    OutputIteratorT     block_itr,                  ///< [in] The thread block's base output iterator for storing to
    T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
{
    OutputIteratorT thread_itr = block_itr + (linear_tid * ITEMS_PER_THREAD);

    // Store directly in thread-blocked order
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        thread_itr[ITEM] = items[ITEM];
    }
}


/**
 * \brief Store a blocked arrangement of items across a thread block into a linear segment of items, guarded by range
 *
 * \blocked
 *
 * \tparam T                    <b>[inferred]</b> The data type to store.
 * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
 * \tparam OutputIteratorT      <b>[inferred]</b> The random-access iterator type for output \iterator.
 */
template <
    typename            T,
    int                 ITEMS_PER_THREAD,
    typename            OutputIteratorT>
__device__ __forceinline__ void StoreDirectBlocked(
    int                 linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
    OutputIteratorT     block_itr,                  ///< [in] The thread block's base output iterator for storing to
    T                   (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
    int                 valid_items)                ///< [in] Number of valid items to write
{
    OutputIteratorT thread_itr = block_itr + (linear_tid * ITEMS_PER_THREAD);

    // Store directly in thread-blocked order
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        if (ITEM + (linear_tid * ITEMS_PER_THREAD) < valid_items)
        {
            thread_itr[ITEM] = items[ITEM];
        }
    }
}


/**
 * \brief Store a blocked arrangement of items across a thread block into a linear segment of items.
 *
 * \blocked
 *
 * The output offset (\p block_ptr + \p block_offset) must be quad-item aligned,
 * which is the default starting offset returned by \p cudaMalloc()
 *
 * \par
 * The following conditions will prevent vectorization and storing will fall back to cub::BLOCK_STORE_DIRECT:
 *   - \p ITEMS_PER_THREAD is odd
 *   - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
 *
 * \tparam T                    <b>[inferred]</b> The data type to store.
 * \tparam ITEMS_PER_THREAD     <b>[inferred]</b> The number of consecutive items partitioned onto each thread.
 *
 */
template <
    typename            T,
    int                 ITEMS_PER_THREAD>
__device__ __forceinline__ void StoreDirectBlockedVectorized(
    int                 linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., <tt>(threadIdx.y * blockDim.x) + linear_tid</tt> for 2D thread blocks)
    T                   *block_ptr,                 ///< [in] Input pointer for storing from
    T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
{
    enum
    {
        // Maximum CUDA vector size is 4 elements
        MAX_VEC_SIZE = CUB_MIN(4, ITEMS_PER_THREAD),

        // Vector size must be a power of two and an even divisor of the items per thread

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

    T *raw_items = reinterpret_cast<T*>(raw_vector);

    // Copy
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        raw_items[ITEM] = items[ITEM];
    }

    // Direct-store using vector types
    StoreDirectBlocked(linear_tid, block_ptr_vectors, raw_vector);
}



//@}  end member group
/******************************************************************//**
 * \name Striped arrangement I/O (direct)
 *********************************************************************/
//@{

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN



/** @} */       // end group UtilIo


//-----------------------------------------------------------------------------
// Generic BlockStore abstraction
//-----------------------------------------------------------------------------

/**
 * \brief cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.
 */
enum BlockStoreAlgorithm
{
    /**
     * \par Overview
     *
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) of data is written
     * directly to memory.
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) decreases as the
     *   access stride between threads increases (i.e., the number items per thread).
     */
    BLOCK_STORE_DIRECT,

    /**
     * \par Overview
     *
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) of data is written directly
     * to memory using CUDA's built-in vectorized stores as a coalescing optimization.
     * For example, <tt>st.global.v4.s32</tt> instructions will be generated
     * when \p T = \p int and \p ITEMS_PER_THREAD % 4 == 0.
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high until the the
     *   access stride between threads (i.e., the number items per thread) exceeds the
     *   maximum vector store width (typically 4 items or 64B, whichever is lower).
     * - The following conditions will prevent vectorization and writing will fall back to cub::BLOCK_STORE_DIRECT:
     *   - \p ITEMS_PER_THREAD is odd
     *   - The \p OutputIteratorT is not a simple pointer type
     *   - The block output offset is not quadword-aligned
     *   - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
     */
    BLOCK_STORE_VECTORIZE,

    /**
     * \par Overview
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) is locally
     * transposed and then efficiently written to memory as a [<em>striped arrangement</em>](index.html#sec5sec3).
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items written per thread.
     * - The local reordering incurs slightly longer latencies and throughput than the
     *   direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
     */
    BLOCK_STORE_TRANSPOSE,

    /**
     * \par Overview
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) is locally
     * transposed and then efficiently written to memory as a
     * [<em>warp-striped arrangement</em>](index.html#sec5sec3)
     *
     * \par Usage Considerations
     * - BLOCK_THREADS must be a multiple of WARP_THREADS
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items written per thread.
     * - The local reordering incurs slightly longer latencies and throughput than the
     *   direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
     */
    BLOCK_STORE_WARP_TRANSPOSE,

    /**
     * \par Overview
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) is locally
     * transposed and then efficiently written to memory as a
     * [<em>warp-striped arrangement</em>](index.html#sec5sec3)
     * To reduce the shared memory requirement, only one warp's worth of shared
     * memory is provisioned and is subsequently time-sliced among warps.
     *
     * \par Usage Considerations
     * - BLOCK_THREADS must be a multiple of WARP_THREADS
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items written per thread.
     * - Provisions less shared memory temporary storage, but incurs larger
     *   latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.
     */
    BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED,

};


/**
 * \brief The BlockStore class provides [<em>collective</em>](index.html#sec0) data movement methods for writing a [<em>blocked arrangement</em>](index.html#sec5sec3) of items partitioned across a CUDA thread block to a linear segment of memory.  ![]...
 * \ingroup BlockModule
 * \ingroup UtilIo
 *
 * \tparam T                    The type of data to be written.
 * \tparam BLOCK_DIM_X          The thread block length in threads along the X dimension
 * \tparam ITEMS_PER_THREAD     The number of consecutive items partitioned onto each thread.
 * \tparam ALGORITHM            <b>[optional]</b> cub::BlockStoreAlgorithm tuning policy enumeration.  default: cub::BLOCK_STORE_DIRECT.
 * \tparam WARP_TIME_SLICING    <b>[optional]</b> Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage). (defa...
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH             <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * - The BlockStore class provides a single data movement abstraction that can be specialized
 *   to implement different cub::BlockStoreAlgorithm strategies.  This facilitates different
 *   performance policies for different architectures, data types, granularity sizes, etc.
 * - BlockStore can be optionally specialized by different data movement strategies:
 *   -# <b>cub::BLOCK_STORE_DIRECT</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3) of data is written
 *      directly to memory. [More...](\ref cub::BlockStoreAlgorithm)
 *   -# <b>cub::BLOCK_STORE_VECTORIZE</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3)
 *      of data is written directly to memory using CUDA's built-in vectorized stores as a
 *      coalescing optimization.  [More...](\ref cub::BlockStoreAlgorithm)
 *   -# <b>cub::BLOCK_STORE_TRANSPOSE</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3)
 *      is locally transposed into a [<em>striped arrangement</em>](index.html#sec5sec3) which is
 *      then written to memory.  [More...](\ref cub::BlockStoreAlgorithm)
 *   -# <b>cub::BLOCK_STORE_WARP_TRANSPOSE</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3)
 *      is locally transposed into a [<em>warp-striped arrangement</em>](index.html#sec5sec3) which is
 *      then written to memory.  [More...](\ref cub::BlockStoreAlgorithm)
 * - \rowmajor
 *
 * \par A Simple Example
 * \blockcollective{BlockStore}
 * \par
 * The code snippet below illustrates the storing of a "blocked" arrangement
 * of 512 integers across 128 threads (where each thread owns 4 consecutive items)
 * into a linear segment of memory.  The store is specialized for \p BLOCK_STORE_WARP_TRANSPOSE,
 * meaning items are locally reordered among threads so that memory references will be
 * efficiently coalesced using a warp-striped access pattern.
 * \par
 * \code
 * #include <cub/cub.cuh>   // or equivalently <cub/block/block_store.cuh>
 *
 * __global__ void ExampleKernel(int *d_data, ...)
 * {
 *     // Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
 *     typedef cub::BlockStore<int, 128, 4, BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
 *
 *     // Allocate shared memory for BlockStore
 *     __shared__ typename BlockStore::TempStorage temp_storage;
 *
 *     // Obtain a segment of consecutive items that are blocked across threads
 *     int thread_data[4];
 *     ...
 *
 *     // Store items to linear memory
 *     int thread_data[4];
 *     BlockStore(temp_storage).Store(d_data, thread_data);
 *
 * \endcode
 * \par
 * Suppose the set of \p thread_data across the block of threads is

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

        :
            linear_tid(linear_tid)
        {}

        /// Store items into a linear segment of memory
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT     block_itr,                  ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
        {
            StoreDirectBlocked(linear_tid, block_itr, items);
        }

        /// Store items into a linear segment of memory, guarded by range
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT     block_itr,                  ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
            int                 valid_items)                ///< [in] Number of valid items to write
        {
            StoreDirectBlocked(linear_tid, block_itr, items, valid_items);
        }
    };


    /**
     * BLOCK_STORE_VECTORIZE specialization of store helper
     */
    template <int DUMMY>
    struct StoreInternal<BLOCK_STORE_VECTORIZE, DUMMY>
    {

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

            int linear_tid)
        :
            linear_tid(linear_tid)
        {}

        /// Store items into a linear segment of memory, specialized for native pointer types (attempts vectorization)
        __device__ __forceinline__ void Store(
            T                   *block_ptr,                 ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
        {
            StoreDirectBlockedVectorized(linear_tid, block_ptr, items);
        }

        /// Store items into a linear segment of memory, specialized for opaque input iterators (skips vectorization)
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT    block_itr,                  ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
        {
            StoreDirectBlocked(linear_tid, block_itr, items);
        }

        /// Store items into a linear segment of memory, guarded by range
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT     block_itr,                  ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
            int                 valid_items)                ///< [in] Number of valid items to write
        {
            StoreDirectBlocked(linear_tid, block_itr, items, valid_items);
        }
    };


    /**
     * BLOCK_STORE_TRANSPOSE specialization of store helper
     */
    template <int DUMMY>
    struct StoreInternal<BLOCK_STORE_TRANSPOSE, DUMMY>
    {

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

            temp_storage(temp_storage.Alias()),
            linear_tid(linear_tid)
        {}

        /// Store items into a linear segment of memory
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT     block_itr,                  ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
        {
            BlockExchange(temp_storage).BlockedToStriped(items);
            StoreDirectStriped<BLOCK_THREADS>(linear_tid, block_itr, items);
        }

        /// Store items into a linear segment of memory, guarded by range
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT   block_itr,                  ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
            int                 valid_items)                ///< [in] Number of valid items to write
        {
            BlockExchange(temp_storage).BlockedToStriped(items);
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            StoreDirectStriped<BLOCK_THREADS>(linear_tid, block_itr, items, temp_storage.valid_items);
        }
    };


    /**
     * BLOCK_STORE_WARP_TRANSPOSE specialization of store helper

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

            temp_storage(temp_storage.Alias()),
            linear_tid(linear_tid)
        {}

        /// Store items into a linear segment of memory
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT   block_itr,                    ///< [in] The thread block's base output iterator for storing to
            T                 (&items)[ITEMS_PER_THREAD])   ///< [in] Data to store
        {
            BlockExchange(temp_storage).BlockedToWarpStriped(items);
            StoreDirectWarpStriped(linear_tid, block_itr, items);
        }

        /// Store items into a linear segment of memory, guarded by range
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT   block_itr,                    ///< [in] The thread block's base output iterator for storing to
            T                 (&items)[ITEMS_PER_THREAD],   ///< [in] Data to store
            int               valid_items)                  ///< [in] Number of valid items to write
        {
            BlockExchange(temp_storage).BlockedToWarpStriped(items);
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            StoreDirectWarpStriped(linear_tid, block_itr, items, temp_storage.valid_items);
        }
    };


    /**
     * BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED specialization of store helper

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

            temp_storage(temp_storage.Alias()),
            linear_tid(linear_tid)
        {}

        /// Store items into a linear segment of memory
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT     block_itr,                  ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
        {
            BlockExchange(temp_storage).BlockedToWarpStriped(items);
            StoreDirectWarpStriped(linear_tid, block_itr, items);
        }

        /// Store items into a linear segment of memory, guarded by range
        template <typename OutputIteratorT>
        __device__ __forceinline__ void Store(
            OutputIteratorT   block_itr,                  ///< [in] The thread block's base output iterator for storing to
            T                   (&items)[ITEMS_PER_THREAD], ///< [in] Data to store
            int                 valid_items)                ///< [in] Number of valid items to write
        {
            BlockExchange(temp_storage).BlockedToWarpStriped(items);
            if (linear_tid == 0)
                temp_storage.valid_items = valid_items;     // Move through volatile smem as a workaround to prevent RF spilling on subsequent loads
            CTA_SYNC();
            StoreDirectWarpStriped(linear_tid, block_itr, items, temp_storage.valid_items);
        }
    };

    /******************************************************************************
     * Type definitions
     ******************************************************************************/

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

    /******************************************************************//**
     * \name Data movement
     *********************************************************************/
    //@{


    /**
     * \brief Store items into a linear segment of memory.
     *
     * \par
     * - \blocked
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the storing of a "blocked" arrangement
     * of 512 integers across 128 threads (where each thread owns 4 consecutive items)
     * into a linear segment of memory.  The store is specialized for \p BLOCK_STORE_WARP_TRANSPOSE,
     * meaning items are locally reordered among threads so that memory references will be
     * efficiently coalesced using a warp-striped access pattern.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_store.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, ...)
     * {
     *     // Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockStore<int, 128, 4, BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
     *
     *     // Allocate shared memory for BlockStore
     *     __shared__ typename BlockStore::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Store items to linear memory
     *     int thread_data[4];
     *     BlockStore(temp_storage).Store(d_data, thread_data);
     *
     * \endcode
     * \par
     * Suppose the set of \p thread_data across the block of threads is

xgboost/cub/cub/block/block_store.cuh  view on Meta::CPAN

        OutputIteratorT     block_itr,                  ///< [in] The thread block's base output iterator for storing to
        T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
    {
        InternalStore(temp_storage, linear_tid).Store(block_itr, items);
    }

    /**
     * \brief Store items into a linear segment of memory, guarded by range.
     *
     * \par
     * - \blocked
     * - \smemreuse
     *
     * \par Snippet
     * The code snippet below illustrates the guarded storing of a "blocked" arrangement
     * of 512 integers across 128 threads (where each thread owns 4 consecutive items)
     * into a linear segment of memory.  The store is specialized for \p BLOCK_STORE_WARP_TRANSPOSE,
     * meaning items are locally reordered among threads so that memory references will be
     * efficiently coalesced using a warp-striped access pattern.
     * \par
     * \code
     * #include <cub/cub.cuh>   // or equivalently <cub/block/block_store.cuh>
     *
     * __global__ void ExampleKernel(int *d_data, int valid_items, ...)
     * {
     *     // Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
     *     typedef cub::BlockStore<int, 128, 4, BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
     *
     *     // Allocate shared memory for BlockStore
     *     __shared__ typename BlockStore::TempStorage temp_storage;
     *
     *     // Obtain a segment of consecutive items that are blocked across threads
     *     int thread_data[4];
     *     ...
     *
     *     // Store items to linear memory
     *     int thread_data[4];
     *     BlockStore(temp_storage).Store(d_data, thread_data, valid_items);
     *
     * \endcode
     * \par
     * Suppose the set of \p thread_data across the block of threads is

xgboost/cub/cub/block/specializations/block_histogram_sort.cuh  view on Meta::CPAN


    // Composite data onto an existing histogram
    template <
        typename            CounterT     >
    __device__ __forceinline__ void Composite(
        T                   (&items)[ITEMS_PER_THREAD],     ///< [in] Calling thread's input values to histogram
        CounterT            histogram[BINS])                 ///< [out] Reference to shared/device-accessible memory histogram
    {
        enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };

        // Sort bytes in blocked arrangement
        BlockRadixSortT(temp_storage.sort).Sort(items);

        CTA_SYNC();

        // Initialize the shared memory's run_begin and run_end for each bin
        int histo_offset = 0;

        #pragma unroll
        for(; histo_offset + BLOCK_THREADS <= BINS; histo_offset += BLOCK_THREADS)
        {

xgboost/cub/cub/device/dispatch/dispatch_radix_sort.cuh  view on Meta::CPAN


    // Load values
    if (!KEYS_ONLY)
    {
        BlockLoadValues(temp_storage.load_values).Load(d_values_in, values, num_items);

        CTA_SYNC();
    }

    // Sort tile
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(
        keys,
        values,
        current_bit,
        end_bit,
        Int2Type<IS_DESCENDING>(),
        Int2Type<KEYS_ONLY>());

    // Store keys and values
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)

xgboost/cub/cub/host/mutex.cuh  view on Meta::CPAN

         * Compiler read/write barrier
         */
        __forceinline__ void _ReadWriteBarrier()
        {
            __sync_synchronize();
        }

        /**
         * Atomic exchange
         */
        __forceinline__ long _InterlockedExchange(volatile int * const Target, const int Value)
        {
            // NOTE: __sync_lock_test_and_set would be an acquire barrier, so we force a full barrier
            _ReadWriteBarrier();
            return __sync_lock_test_and_set(Target, Value);
        }

        /**
         * Pause instruction to prevent excess processor bus usage
         */
        __forceinline__ void YieldProcessor()

xgboost/cub/cub/host/mutex.cuh  view on Meta::CPAN

         */
        Mutex() : lock(0) {}

        /**
         * Return when the specified spinlock has been acquired
         */
        __forceinline__ void Lock()
        {
            while (1)
            {
                if (!_InterlockedExchange(&lock, 1)) return;
                while (lock) YieldProcessor();
            }
        }


        /**
         * Release the specified spinlock
         */
        __forceinline__ void Unlock()
        {

xgboost/cub/examples/block/example_block_radix_sort.cu  view on Meta::CPAN

    int         BLOCK_THREADS,
    int         ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
    Key         *d_in,          // Tile of input
    Key         *d_out,         // Tile of output
    clock_t     *d_elapsed)     // Elapsed cycle count of block scan
{
    enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };

    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;

    // Specialize BlockRadixSort type for our thread block
    typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;

    // Shared memory
    __shared__ union
    {
        typename BlockLoadT::TempStorage        load;
        typename BlockRadixSortT::TempStorage   sort;
    } temp_storage;

    // Per-thread tile items
    Key items[ITEMS_PER_THREAD];

    // Our current block's offset
    int block_offset = blockIdx.x * TILE_SIZE;

    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);

    // Barrier for smem reuse
    __syncthreads();

    // Start cycle timer
    clock_t start = clock();

    // Sort keys
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);

    // Stop cycle timer
    clock_t stop = clock();

    // Store output in striped fashion
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);

    // Store elapsed clocks
    if (threadIdx.x == 0)
    {

xgboost/cub/examples/block/example_block_radix_sort.cu  view on Meta::CPAN


    // Check for kernel errors and STDIO from the kernel, if any
    CubDebugExit(cudaDeviceSynchronize());

    // Display timing results
    float avg_millis            = elapsed_millis / g_timing_iterations;
    float avg_items_per_sec     = float(TILE_SIZE * g_grid_size) / avg_millis / 1000.0f;
    double avg_clocks           = double(elapsed_clocks) / g_timing_iterations / g_grid_size;
    double avg_clocks_per_item  = avg_clocks / TILE_SIZE;

    printf("\tAverage BlockRadixSort::SortBlocked clocks: %.3f\n", avg_clocks);
    printf("\tAverage BlockRadixSort::SortBlocked clocks per item: %.3f\n", avg_clocks_per_item);
    printf("\tAverage kernel millis: %.4f\n", avg_millis);
    printf("\tAverage million items / sec: %.4f\n", avg_items_per_sec);
    fflush(stdout);

    // Cleanup
    if (h_in) delete[] h_in;
    if (h_reference) delete[] h_reference;
    if (h_elapsed) delete[] h_elapsed;
    if (d_in) CubDebugExit(cudaFree(d_in));
    if (d_out) CubDebugExit(cudaFree(d_out));

xgboost/cub/examples/block/example_block_scan.cu  view on Meta::CPAN

 */
template <
    int                     BLOCK_THREADS,
    int                     ITEMS_PER_THREAD,
    BlockScanAlgorithm      ALGORITHM>
__global__ void BlockPrefixSumKernel(
    int         *d_in,          // Tile of input
    int         *d_out,         // Tile of output
    clock_t     *d_elapsed)     // Elapsed cycle count of block scan
{
    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;

    // Specialize BlockStore type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_STORE_WARP_TRANSPOSE> BlockStoreT;

    // Specialize BlockScan type for our thread block
    typedef BlockScan<int, BLOCK_THREADS, ALGORITHM> BlockScanT;

    // Shared memory
    __shared__ union
    {
        typename BlockLoadT::TempStorage    load;
        typename BlockStoreT::TempStorage   store;
        typename BlockScanT::TempStorage    scan;
    } temp_storage;

    // Per-thread tile data
    int data[ITEMS_PER_THREAD];

    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in, data);

    // Barrier for smem reuse
    __syncthreads();

    // Start cycle timer
    clock_t start = clock();

    // Compute exclusive prefix sum
    int aggregate;
    BlockScanT(temp_storage.scan).ExclusiveSum(data, data, aggregate);

    // Stop cycle timer
    clock_t stop = clock();

    // Barrier for smem reuse
    __syncthreads();

    // Store items from a blocked arrangement
    BlockStoreT(temp_storage.store).Store(d_out, data);

    // Store aggregate and elapsed clocks
    if (threadIdx.x == 0)
    {
        *d_elapsed = (start > stop) ? start - stop : stop - start;
        d_out[BLOCK_THREADS * ITEMS_PER_THREAD] = aggregate;
    }
}

xgboost/cub/experimental/defunct/example_coo_spmv.cu  view on Meta::CPAN


    // Head flag type
    typedef int HeadFlag;

    // Partial dot product type
    typedef PartialProduct<VertexId, Value> PartialProduct;

    // Parameterized BlockScan type for reduce-value-by-row scan
    typedef BlockScan<PartialProduct, BLOCK_THREADS, BLOCK_SCAN_RAKING_MEMOIZE> BlockScan;

    // Parameterized BlockExchange type for exchanging rows between warp-striped -> blocked arrangements
    typedef BlockExchange<VertexId, BLOCK_THREADS, ITEMS_PER_THREAD, true> BlockExchangeRows;

    // Parameterized BlockExchange type for exchanging values between warp-striped -> blocked arrangements
    typedef BlockExchange<Value, BLOCK_THREADS, ITEMS_PER_THREAD, true> BlockExchangeValues;

    // Parameterized BlockDiscontinuity type for setting head-flags for each new row segment
    typedef BlockDiscontinuity<HeadFlag, BLOCK_THREADS> BlockDiscontinuity;

    // Shared memory type for this threadblock
    struct TempStorage
    {
        union
        {

xgboost/cub/experimental/defunct/example_coo_spmv.cu  view on Meta::CPAN

        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
#if CUB_PTX_ARCH >= 350
            values[ITEM] *= ThreadLoad<LOAD_LDG>(d_vector + columns[ITEM]);
#else
            values[ITEM] *= TexVector<Value>::Load(columns[ITEM]);
#endif
        }

        // Transpose from warp-striped to blocked arrangement
        BlockExchangeValues(temp_storage.exchange_values).WarpStripedToBlocked(values);

        __syncthreads();

        // Transpose from warp-striped to blocked arrangement
        BlockExchangeRows(temp_storage.exchange_rows).WarpStripedToBlocked(rows);

        // Barrier for smem reuse and coherence
        __syncthreads();

        // FlagT row heads by looking for discontinuities
        BlockDiscontinuity(temp_storage.discontinuity).FlagHeads(
            head_flags,                     // (Out) Head flags
            rows,                           // Original row ids
            NewRowOp(),                     // Functor for detecting start of new rows
            prefix_op.running_prefix.row);  // Last row ID from previous tile to compare with first row ID in this tile

xgboost/cub/experimental/defunct/example_coo_spmv.cu  view on Meta::CPAN

    {
        VertexId        rows[ITEMS_PER_THREAD];
        PartialProduct  partial_sums[ITEMS_PER_THREAD];
        HeadFlag        head_flags[ITEMS_PER_THREAD];

        // Load a tile of block partials from previous kernel
        if (FULL_TILE)
        {
            // Full tile
#if CUB_PTX_ARCH >= 350
            LoadDirectBlocked<LOAD_LDG>(threadIdx.x, d_block_partials + block_offset, partial_sums);
#else
            LoadDirectBlocked(threadIdx.x, d_block_partials + block_offset, partial_sums);
#endif
        }
        else
        {
            // Partial tile (extend zero-valued coordinates of the last partial-product for out-of-bounds items)
            PartialProduct default_sum;
            default_sum.row = temp_storage.last_block_row;
            default_sum.partial = Value(0);

#if CUB_PTX_ARCH >= 350
            LoadDirectBlocked<LOAD_LDG>(threadIdx.x, d_block_partials + block_offset, partial_sums, guarded_items, default_sum);
#else
            LoadDirectBlocked(threadIdx.x, d_block_partials + block_offset, partial_sums, guarded_items, default_sum);
#endif
        }

        // Copy out row IDs for row-head flagging
        #pragma unroll
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        {
            rows[ITEM] = partial_sums[ITEM].row;
        }

xgboost/cub/test/test_block_radix_sort.cu  view on Meta::CPAN


bool                    g_verbose = false;
CachingDeviceAllocator  g_allocator(true);


//---------------------------------------------------------------------
// Test kernels
//---------------------------------------------------------------------


/// Specialized descending, blocked -> blocked
template <int BLOCK_THREADS, typename BlockRadixSort, int ITEMS_PER_THREAD, typename Key, typename Value>
__device__ __forceinline__ void TestBlockSort(
    typename BlockRadixSort::TempStorage &temp_storage,
    Key                         (&keys)[ITEMS_PER_THREAD],
    Value                       (&values)[ITEMS_PER_THREAD],
    Key                         *d_keys,
    Value                       *d_values,
    int                         begin_bit,
    int                         end_bit,
    clock_t                     &stop,
    Int2Type<true>              is_descending,
    Int2Type<true>              is_blocked_output)
{
    BlockRadixSort(temp_storage).SortDescending(keys, values, begin_bit, end_bit);
    stop = clock();
    StoreDirectBlocked(threadIdx.x, d_keys, keys);
    StoreDirectBlocked(threadIdx.x, d_values, values);
}

/// Specialized descending, blocked -> striped
template <int BLOCK_THREADS, typename BlockRadixSort, int ITEMS_PER_THREAD, typename Key, typename Value>
__device__ __forceinline__ void TestBlockSort(
    typename BlockRadixSort::TempStorage &temp_storage,
    Key                         (&keys)[ITEMS_PER_THREAD],
    Value                       (&values)[ITEMS_PER_THREAD],
    Key                         *d_keys,
    Value                       *d_values,
    int                         begin_bit,
    int                         end_bit,
    clock_t                     &stop,
    Int2Type<true>              is_descending,
    Int2Type<false>             is_blocked_output)
{
    BlockRadixSort(temp_storage).SortDescendingBlockedToStriped(keys, values, begin_bit, end_bit);
    stop = clock();
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_keys, keys);
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_values, values);
}

/// Specialized ascending, blocked -> blocked
template <int BLOCK_THREADS, typename BlockRadixSort, int ITEMS_PER_THREAD, typename Key, typename Value>
__device__ __forceinline__ void TestBlockSort(
    typename BlockRadixSort::TempStorage &temp_storage,
    Key                         (&keys)[ITEMS_PER_THREAD],
    Value                       (&values)[ITEMS_PER_THREAD],
    Key                         *d_keys,
    Value                       *d_values,
    int                         begin_bit,
    int                         end_bit,
    clock_t                     &stop,
    Int2Type<false>             is_descending,
    Int2Type<true>              is_blocked_output)
{
    BlockRadixSort(temp_storage).Sort(keys, values, begin_bit, end_bit);
    stop = clock();
    StoreDirectBlocked(threadIdx.x, d_keys, keys);
    StoreDirectBlocked(threadIdx.x, d_values, values);
}

/// Specialized ascending, blocked -> striped
template <int BLOCK_THREADS, typename BlockRadixSort, int ITEMS_PER_THREAD, typename Key, typename Value>
__device__ __forceinline__ void TestBlockSort(
    typename BlockRadixSort::TempStorage &temp_storage,
    Key                         (&keys)[ITEMS_PER_THREAD],
    Value                       (&values)[ITEMS_PER_THREAD],
    Key                         *d_keys,
    Value                       *d_values,
    int                         begin_bit,
    int                         end_bit,
    clock_t                     &stop,
    Int2Type<false>             is_descending,
    Int2Type<false>             is_blocked_output)
{
    BlockRadixSort(temp_storage).SortBlockedToStriped(keys, values, begin_bit, end_bit);
    stop = clock();
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_keys, keys);
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_values, values);
}



/**
 * BlockRadixSort kernel
 */

xgboost/cub/test/test_block_radix_sort.cu  view on Meta::CPAN

            SMEM_CONFIG>
        BlockRadixSortT;

    // Allocate temp storage in shared memory
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    // Items per thread
    Key     keys[ITEMS_PER_THREAD];
    Value   values[ITEMS_PER_THREAD];

    LoadDirectBlocked(threadIdx.x, d_keys, keys);
    LoadDirectBlocked(threadIdx.x, d_values, values);

    // Start cycle timer
    clock_t stop;
    clock_t start = clock();

    TestBlockSort<BLOCK_THREADS, BlockRadixSortT>(
        temp_storage, keys, values, d_keys, d_values, begin_bit, end_bit, stop, Int2Type<DESCENDING>(), Int2Type<BLOCKED_OUTPUT>());

    // Store time
    if (threadIdx.x == 0)

xgboost/cub/test/test_block_radix_sort.cu  view on Meta::CPAN

    cudaSharedMemConfig     SMEM_CONFIG,
    bool                    DESCENDING,
    bool                    BLOCKED_OUTPUT,
    typename                Key,
    typename                Value>
void TestValid(Int2Type<false> fits_smem_capacity)
{}


/**
 * Test ascending/descending and to-blocked/to-striped
 */
template <
    int                     BLOCK_THREADS,
    int                     ITEMS_PER_THREAD,
    int                     RADIX_BITS,
    bool                    MEMOIZE_OUTER_SCAN,
    BlockScanAlgorithm      INNER_SCAN_ALGORITHM,
    cudaSharedMemConfig     SMEM_CONFIG,
    typename                Key,
    typename                Value>

xgboost/cub/test/test_block_radix_sort.cu  view on Meta::CPAN


#if defined(SM100) || defined(SM110) || defined(SM130)
    Int2Type<sizeof(typename BlockRadixSortT::TempStorage) <= 16 * 1024> fits_smem_capacity;
#else
    Int2Type<(sizeof(typename BlockRadixSortT::TempStorage) <= 48 * 1024)> fits_smem_capacity;
#endif

    // Sort-ascending, to-striped
    TestValid<BLOCK_THREADS, ITEMS_PER_THREAD, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, true, false, Key, Value>(fits_smem_capacity);

    // Sort-descending, to-blocked
    TestValid<BLOCK_THREADS, ITEMS_PER_THREAD, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, false, true, Key, Value>(fits_smem_capacity);

    // Not necessary
//    TestValid<BLOCK_THREADS, ITEMS_PER_THREAD, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, false, false, Key, Value>(fits_smem_capacity);
//    TestValid<BLOCK_THREADS, ITEMS_PER_THREAD, RADIX_BITS, MEMOIZE_OUTER_SCAN, INNER_SCAN_ALGORITHM, SMEM_CONFIG, true, true, Key, Value>(fits_smem_capacity);
}


/**
 * Test value type and smem config

xgboost/cub/test/test_block_reduce.cu  view on Meta::CPAN

    int linear_tid = RowMajorTid(BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z);

    // Per-thread tile data
    T data[ITEMS_PER_THREAD];

    // Load first tile of data
    int block_offset = 0;

    if (block_offset < TILE_SIZE * tiles)
    {
        LoadDirectBlocked(linear_tid, d_in + block_offset, data);
        block_offset += TILE_SIZE;

        // Start cycle timer
        clock_t start = clock();

        // Cooperative reduce first tile
        BlockReduceT block_reduce(temp_storage) ;
        T block_aggregate = DeviceTest(block_reduce, data, reduction_op);

        // Stop cycle timer

xgboost/cub/test/test_block_reduce.cu  view on Meta::CPAN

#endif // CUB_PTX_ARCH == 100
        clock_t elapsed = (start > stop) ? start - stop : stop - start;

        // Loop over input tiles
        while (block_offset < TILE_SIZE * tiles)
        {
            // TestBarrier between threadblock reductions
            __syncthreads();
    
            // Load tile of data
            LoadDirectBlocked(linear_tid, d_in + block_offset, data);
            block_offset += TILE_SIZE;

            // Start cycle timer
            clock_t start = clock();

            // Cooperatively reduce the tile's aggregate
            BlockReduceT block_reduce(temp_storage) ;
            T tile_aggregate = DeviceTest(block_reduce, data, reduction_op);

            // Stop cycle timer

xgboost/cub/test/test_block_scan.cu  view on Meta::CPAN

    // Parameterize BlockScan type for our thread block
    typedef BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z> BlockScanT;

    // Allocate temp storage in shared memory
    __shared__ typename BlockScanT::TempStorage temp_storage;

    int linear_tid = RowMajorTid(BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z);

    // Per-thread tile data
    T data[ITEMS_PER_THREAD];
    LoadDirectBlocked(linear_tid, d_in, data);

    __threadfence_block();      // workaround to prevent clock hoisting
    clock_t start = clock();
    __threadfence_block();      // workaround to prevent clock hoisting

    // Test scan
    T                                   block_aggregate;
    BlockScanT                          block_scan(temp_storage);
    BlockPrefixCallbackOp<T, ScanOpT>   prefix_op(linear_tid, initial_value, scan_op);

    DeviceTest(block_scan, data, initial_value, scan_op, block_aggregate, prefix_op,
        Int2Type<SCAN_MODE>(), Int2Type<TEST_MODE>(), Int2Type<Traits<T>::PRIMITIVE>());

    // Stop cycle timer
    __threadfence_block();      // workaround to prevent clock hoisting
    clock_t stop = clock();
    __threadfence_block();      // workaround to prevent clock hoisting

    // Store output
    StoreDirectBlocked(linear_tid, d_out, data);

    // Store block_aggregate
    if (TEST_MODE != BASIC)
        d_aggregate[linear_tid] = block_aggregate;

    // Store prefix
    if (TEST_MODE == PREFIX)
    {
        if (linear_tid == 0)
            d_out[TILE_SIZE] = prefix_op.prefix;

xgboost/nccl/src/libwrap.h  view on Meta::CPAN

    NVML_ERROR_INSUFFICIENT_SIZE = 7,   //!< An input argument is not large enough
    NVML_ERROR_INSUFFICIENT_POWER = 8,  //!< A device's external power cables are not properly attached
    NVML_ERROR_DRIVER_NOT_LOADED = 9,   //!< NVIDIA driver is not loaded
    NVML_ERROR_TIMEOUT = 10,            //!< User provided timeout passed
    NVML_ERROR_IRQ_ISSUE = 11,          //!< NVIDIA Kernel detected an interrupt issue with a GPU
    NVML_ERROR_LIBRARY_NOT_FOUND = 12,  //!< NVML Shared Library couldn't be found or loaded
    NVML_ERROR_FUNCTION_NOT_FOUND = 13, //!< Local version of NVML doesn't implement this function
    NVML_ERROR_CORRUPTED_INFOROM = 14,  //!< infoROM is corrupted
    NVML_ERROR_GPU_IS_LOST = 15,        //!< The GPU has fallen off the bus or has otherwise become inaccessible
    NVML_ERROR_RESET_REQUIRED = 16,     //!< The GPU requires a reset before it can be used again
    NVML_ERROR_OPERATING_SYSTEM = 17,   //!< The GPU control device has been blocked by the operating system/cgroups
    NVML_ERROR_LIB_RM_VERSION_MISMATCH = 18,   //!< RM detects a driver/library version mismatch
    NVML_ERROR_IN_USE = 19,             //!< An operation cannot be performed because the GPU is currently in use
    NVML_ERROR_UNKNOWN = 999            //!< An internal driver error occurred
} nvmlReturn_t;
/* End of nvml.h */

ncclResult_t wrapSymbols(void);

ncclResult_t wrapNvmlInit(void);
ncclResult_t wrapNvmlShutdown(void);



( run in 0.905 second using v1.01-cache-2.11-cpan-49f99fa48dc )