png results from the CPAN

Alien-XGBoost
     *   - \p ITEMS_PER_THREAD is odd
     *   - The \p OutputIteratorT is not a simple pointer type
     *   - The block output offset is not quadword-aligned
     *   - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.)
     */
    BLOCK_STORE_VECTORIZE,

    /**
     * \par Overview
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) is locally
     * transposed and then efficiently written to memory as a [<em>striped arrangement</em>](index.html#sec5sec3).
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items written per thread.
     * - The local reordering incurs slightly longer latencies and throughput than the
     *   direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
     */
    BLOCK_STORE_TRANSPOSE,

    /**
     * \par Overview
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) is locally
     * transposed and then efficiently written to memory as a
     * [<em>warp-striped arrangement</em>](index.html#sec5sec3)
     *
     * \par Usage Considerations
     * - BLOCK_THREADS must be a multiple of WARP_THREADS
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items written per thread.
     * - The local reordering incurs slightly longer latencies and throughput than the
     *   direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
     */
    BLOCK_STORE_WARP_TRANSPOSE,

    /**
     * \par Overview
     * A [<em>blocked arrangement</em>](index.html#sec5sec3) is locally
     * transposed and then efficiently written to memory as a
     * [<em>warp-striped arrangement</em>](index.html#sec5sec3)
     * To reduce the shared memory requirement, only one warp's worth of shared
     * memory is provisioned and is subsequently time-sliced among warps.
     *
     * \par Usage Considerations
     * - BLOCK_THREADS must be a multiple of WARP_THREADS
     *
     * \par Performance Considerations
     * - The utilization of memory transactions (coalescing) remains high regardless
     *   of items written per thread.
     * - Provisions less shared memory temporary storage, but incurs larger
     *   latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.
     */
    BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED,

};


/**
 * \brief The BlockStore class provides [<em>collective</em>](index.html#sec0) data movement methods for writing a [<em>blocked arrangement</em>](index.html#sec5sec3) of items partitioned across a CUDA thread block to a linear segment of memory.  ![]...
 * \ingroup BlockModule
 * \ingroup UtilIo
 *
 * \tparam T                    The type of data to be written.
 * \tparam BLOCK_DIM_X          The thread block length in threads along the X dimension
 * \tparam ITEMS_PER_THREAD     The number of consecutive items partitioned onto each thread.
 * \tparam ALGORITHM            <b>[optional]</b> cub::BlockStoreAlgorithm tuning policy enumeration.  default: cub::BLOCK_STORE_DIRECT.
 * \tparam WARP_TIME_SLICING    <b>[optional]</b> Whether or not only one warp's worth of shared memory should be allocated and time-sliced among block-warps during any load-related data transpositions (versus each warp having its own storage). (defa...
 * \tparam BLOCK_DIM_Y          <b>[optional]</b> The thread block length in threads along the Y dimension (default: 1)
 * \tparam BLOCK_DIM_Z          <b>[optional]</b> The thread block length in threads along the Z dimension (default: 1)
 * \tparam PTX_ARCH             <b>[optional]</b> \ptxversion
 *
 * \par Overview
 * - The BlockStore class provides a single data movement abstraction that can be specialized
 *   to implement different cub::BlockStoreAlgorithm strategies.  This facilitates different
 *   performance policies for different architectures, data types, granularity sizes, etc.
 * - BlockStore can be optionally specialized by different data movement strategies:
 *   -# <b>cub::BLOCK_STORE_DIRECT</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3) of data is written
 *      directly to memory. [More...](\ref cub::BlockStoreAlgorithm)
 *   -# <b>cub::BLOCK_STORE_VECTORIZE</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3)
 *      of data is written directly to memory using CUDA's built-in vectorized stores as a
 *      coalescing optimization.  [More...](\ref cub::BlockStoreAlgorithm)
 *   -# <b>cub::BLOCK_STORE_TRANSPOSE</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3)
 *      is locally transposed into a [<em>striped arrangement</em>](index.html#sec5sec3) which is
 *      then written to memory.  [More...](\ref cub::BlockStoreAlgorithm)
 *   -# <b>cub::BLOCK_STORE_WARP_TRANSPOSE</b>.  A [<em>blocked arrangement</em>](index.html#sec5sec3)
 *      is locally transposed into a [<em>warp-striped arrangement</em>](index.html#sec5sec3) which is
 *      then written to memory.  [More...](\ref cub::BlockStoreAlgorithm)
 * - \rowmajor
 *
 * \par A Simple Example
 * \blockcollective{BlockStore}
 * \par
 * The code snippet below illustrates the storing of a "blocked" arrangement
 * of 512 integers across 128 threads (where each thread owns 4 consecutive items)
 * into a linear segment of memory.  The store is specialized for \p BLOCK_STORE_WARP_TRANSPOSE,
 * meaning items are locally reordered among threads so that memory references will be
 * efficiently coalesced using a warp-striped access pattern.
 * \par
 * \code
 * #include <cub/cub.cuh>   // or equivalently <cub/block/block_store.cuh>
 *
 * __global__ void ExampleKernel(int *d_data, ...)
 * {
 *     // Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
 *     typedef cub::BlockStore<int, 128, 4, BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
 *
 *     // Allocate shared memory for BlockStore
 *     __shared__ typename BlockStore::TempStorage temp_storage;
 *
 *     // Obtain a segment of consecutive items that are blocked across threads
 *     int thread_data[4];
 *     ...
 *
 *     // Store items to linear memory
 *     int thread_data[4];
 *     BlockStore(temp_storage).Store(d_data, thread_data);
 *
 * \endcode
 * \par
( run in 0.501 second using v1.01-cache-2.11-cpan-5623c5533a1 )