Alien-XGBoost
view release on metacpan or search on metacpan
xgboost/cub/cub/block/block_scan.cuh view on Meta::CPAN
* scan, allowing the "downsweep" not to re-read them from shared memory.
*/
BLOCK_SCAN_RAKING_MEMOIZE,
/**
* \par Overview
* A quick "tiled warpscans" prefix scan algorithm. Execution is comprised of four phases:
* -# Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
* -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style scan within each warp.
* -# A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp.
* -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
*
* \par
* \image html block_scan_warpscans.png
* <div class="centercaption">\p BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
*
* \par Performance Considerations
* - Although this variant may suffer lower overall throughput across the
* GPU because due to a heavy reliance on inefficient warpscans, it can
* often provide lower turnaround latencies when the GPU is under-occupied.
*/
BLOCK_SCAN_WARP_SCANS,
};
/******************************************************************************
* Block scan
******************************************************************************/
/**
* \brief The BlockScan class provides [<em>collective</em>](index.html#sec0) methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. 
* \ingroup BlockModule
*
* \tparam T Data type being scanned
* \tparam BLOCK_DIM_X The thread block length in threads along the X dimension
* \tparam ALGORITHM <b>[optional]</b> cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING)
* \tparam BLOCK_DIM_Y <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
* \tparam BLOCK_DIM_Z <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
* \tparam PTX_ARCH <b>[optional]</b> \ptxversion
*
* \par Overview
* - Given a list of input elements and a binary reduction operator, a [<em>prefix scan</em>](http://en.wikipedia.org/wiki/Prefix_sum)
* produces an output list where each element is computed to be the reduction
* of the elements occurring earlier in the input list. <em>Prefix sum</em>
* connotes a prefix scan with the addition operator. The term \em inclusive indicates
* that the <em>i</em><sup>th</sup> output reduction incorporates the <em>i</em><sup>th</sup> input.
* The term \em exclusive indicates the <em>i</em><sup>th</sup> input is not incorporated into
* the <em>i</em><sup>th</sup> output reduction.
* - \rowmajor
* - BlockScan can be optionally specialized by algorithm to accommodate different workload profiles:
* -# <b>cub::BLOCK_SCAN_RAKING</b>. An efficient (high throughput) "raking reduce-then-scan" prefix scan algorithm. [More...](\ref cub::BlockScanAlgorithm)
* -# <b>cub::BLOCK_SCAN_RAKING_MEMOIZE</b>. Similar to cub::BLOCK_SCAN_RAKING, but having higher throughput at the expense of additional register pressure for intermediate storage. [More...](\ref cub::BlockScanAlgorithm)
* -# <b>cub::BLOCK_SCAN_WARP_SCANS</b>. A quick (low latency) "tiled warpscans" prefix scan algorithm. [More...](\ref cub::BlockScanAlgorithm)
*
* \par Performance Considerations
* - \granularity
* - Uses special instructions when applicable (e.g., warp \p SHFL)
* - Uses synchronization-free communication between warp lanes when applicable
* - Invokes a minimal number of minimal block-wide synchronization barriers (only
* one or two depending on algorithm selection)
* - Incurs zero bank conflicts for most types
* - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
* - Prefix sum variants (<b><em>vs.</em></b> generic scan)
* - \blocksize
* - See cub::BlockScanAlgorithm for performance details regarding algorithmic alternatives
*
* \par A Simple Example
* \blockcollective{BlockScan}
* \par
* The code snippet below illustrates an exclusive prefix sum of 512 integer items that
* are partitioned in a [<em>blocked arrangement</em>](index.html#sec5sec3) across 128 threads
* where each thread owns 4 consecutive items.
* \par
* \code
* #include <cub/cub.cuh> // or equivalently <cub/block/block_scan.cuh>
*
* __global__ void ExampleKernel(...)
* {
* // Specialize BlockScan for a 1D block of 128 threads on type int
* typedef cub::BlockScan<int, 128> BlockScan;
*
* // Allocate shared memory for BlockScan
* __shared__ typename BlockScan::TempStorage temp_storage;
*
* // Obtain a segment of consecutive items that are blocked across threads
* int thread_data[4];
* ...
*
* // Collectively compute the block-wide exclusive prefix sum
* BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
*
* \endcode
* \par
* Suppose the set of input \p thread_data across the block of threads is
* <tt>{[1,1,1,1], [1,1,1,1], ..., [1,1,1,1]}</tt>.
* The corresponding output \p thread_data in those threads will be
* <tt>{[0,1,2,3], [4,5,6,7], ..., [508,509,510,511]}</tt>.
*
*/
template <
typename T,
int BLOCK_DIM_X,
BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING,
int BLOCK_DIM_Y = 1,
int BLOCK_DIM_Z = 1,
int PTX_ARCH = CUB_PTX_ARCH>
class BlockScan
{
private:
/******************************************************************************
* Constants and type definitions
******************************************************************************/
/// Constants
enum
{
/// The thread block size in threads
BLOCK_THREADS = BLOCK_DIM_X * BLOCK_DIM_Y * BLOCK_DIM_Z,
};
( run in 0.574 second using v1.01-cache-2.11-cpan-e1769b4cff6 )