Alien-XGBoost
xgboost/cub/CHANGE_LOG.TXT
cub::DoubleBuffer in which it could end up in either buffer)
- New cub::DeviceRunLengthEncode::NonTrivialRuns for finding the starting
offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in
a given sequence. (Useful for top-down partitioning algorithms like
MSD sorting of very-large keys.)
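
To make the meaning of "non-trivial runs" concrete, the following is a minimal sequential sketch in host C++. It is not CUB's implementation or API; the names Runs and NonTrivialRunsReference are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch (not part of CUB).
struct Runs {
    std::vector<std::size_t> offsets;  // starting index of each non-trivial run
    std::vector<std::size_t> lengths;  // length of each non-trivial run (always > 1)
};

// Sequential reference: report every maximal run of equal keys whose
// length exceeds 1. Runs of length 1 are skipped entirely.
template <typename Key>
Runs NonTrivialRunsReference(const std::vector<Key>& keys) {
    Runs out;
    std::size_t i = 0;
    while (i < keys.size()) {
        // Advance j to one past the end of the run that starts at i.
        std::size_t j = i + 1;
        while (j < keys.size() && keys[j] == keys[i]) ++j;
        if (j - i > 1) {
            out.offsets.push_back(i);
            out.lengths.push_back(j - i);
        }
        i = j;
    }
    // Every reported run has exactly one offset and one length.
    assert(out.offsets.size() == out.lengths.size());
    return out;
}
```

For the keys 0, 2, 2, 9, 5, 5, 5, 8 this sketch reports two runs: one starting at offset 1 with length 2 and one starting at offset 4 with length 3.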
//-----------------------------------------------------------------------------
1.3.2 07/28/2014
- Bug fixes:
- Fix for cub::DeviceReduce where reductions of small problems
(small enough to only dispatch a single threadblock) would run in
the default stream (stream zero) regardless of whether an alternate
stream was specified.
//-----------------------------------------------------------------------------
1.3.1 05/23/2014
- Bug fixes:
- Workaround for a benign WAW race warning reported by cuda-memcheck
in BlockScan specialized for BLOCK_SCAN_WARP_SCANS algorithm.
- Fix for bug in DeviceRadixSort where the algorithm may sort more
key bits than the caller specified (up to the nearest radix digit).
- Fix for ~3% DeviceRadixSort performance regression on Kepler and
Fermi that was introduced in v1.3.0.
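
The DeviceRadixSort bit-range fix above can be illustrated with a small sequential sketch. This is not CUB's code: SortBitRange and the 4-bit digit width kRadixBits are assumptions made for the example. The point is that the final digit pass must be clamped so that no key bit at or above end_bit influences the ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed digit width for this sketch; real implementations choose their own.
constexpr int kRadixBits = 4;

// Stable LSD radix sort that orders keys by bits [begin_bit, end_bit) only.
std::vector<std::uint32_t> SortBitRange(std::vector<std::uint32_t> keys,
                                        int begin_bit, int end_bit) {
    assert(0 <= begin_bit && begin_bit <= end_bit && end_bit <= 32);
    std::vector<std::uint32_t> tmp(keys.size());
    for (int bit = begin_bit; bit < end_bit; bit += kRadixBits) {
        // Clamp the last digit so bits at or above end_bit are never examined.
        const int pass_bits = std::min(kRadixBits, end_bit - bit);
        const std::uint32_t mask = (1u << pass_bits) - 1u;

        // Histogram of digit values, shifted by one slot for the prefix sum.
        std::vector<std::size_t> count((std::size_t(1) << pass_bits) + 1, 0);
        for (std::uint32_t k : keys) ++count[((k >> bit) & mask) + 1];
        for (std::size_t d = 1; d < count.size(); ++d) count[d] += count[d - 1];

        // Stable scatter: keys with equal digits keep their relative order.
        for (std::uint32_t k : keys) tmp[count[(k >> bit) & mask]++] = k;
        keys.swap(tmp);
    }
    return keys;
}
```

With begin_bit = 0 and end_bit = 2, the keys 0x5 and 0x1 compare equal (both have low bits 01), so a stable sort leaves them in input order. A version that rounded up to a full 4-bit digit would reorder them.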
//-----------------------------------------------------------------------------
1.3.0 05/12/2014
- New features:
- CUB's collective (block-wide, warp-wide) primitives underwent a minor
interface refactoring:
- To provide the appropriate support for multidimensional thread blocks,
the interfaces for collective classes are now template-parameterized
by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being
optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the
constructors that accept remapped linear thread-identifiers have been
removed: all primitives now assume a row-major thread-ranking for
multidimensional thread blocks.
- To allow the host program (compiled by the host-pass) to
accurately determine the device-specific storage requirements for
a given collective (compiled for each device-pass), the interfaces
for collective classes are now (optionally) template-parameterized
by the desired PTX compute capability. This is useful when
aliasing collective storage to shared memory that has been
allocated dynamically by the host at the kernel call site.
- Most CUB programs having typical 1D usage should not require any
changes to accommodate these updates.
- Added new "combination" WarpScan methods for efficiently computing
both inclusive and exclusive prefix scans (and sums).
- Bug fixes:
- Fixed bug in cub::WarpScan (which affected cub::BlockScan and
cub::DeviceScan) where incorrect results (e.g., NAN) would often be
returned when parameterized for floating-point types (fp32, fp64).
- Workaround-fix for ptxas error when compiling with the -G flag on Linux
(for debug instrumentation)
- Misc. workaround-fixes for certain scan scenarios (using custom
scan operators) where code compiled for SM1x is run on newer
GPUs of higher compute-capability: the compiler could not tell
which memory space was being used by collective operations and was
mistakenly using global ops instead of shared ops.
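
The row-major thread ranking mentioned in the 1.3.0 interface notes can be sketched as a plain function. RowMajorRank is a hypothetical helper written for this illustration, not a CUB entry point. It assumes the usual CUDA convention that x varies fastest, then y, then z.

```cpp
#include <cassert>

// Linear rank of thread (x, y, z) inside a block of
// dim_x * dim_y * dim_z threads, with x varying fastest.
inline int RowMajorRank(int x, int y, int z,
                        int dim_x, int dim_y, int dim_z) {
    assert(0 <= x && x < dim_x);
    assert(0 <= y && y < dim_y);
    assert(0 <= z && z < dim_z);
    return (z * dim_y + y) * dim_x + x;
}
```

For a 1D block (dim_y and dim_z equal to 1) the rank reduces to x, which is consistent with the note that typical 1D usage should not require changes.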
//-----------------------------------------------------------------------------
1.2.3 04/01/2014
- Bug fixes:
- Fixed access violation bug in DeviceReduce::ReduceByKey for non-primitive value types
- Fixed code-snippet bug in ArgIndexInputIteratorT documentation
//-----------------------------------------------------------------------------
1.2.2 03/03/2014
- New features:
- Added MS VC++ project solutions for device-wide and block-wide examples
- Performance:
- Added a third algorithmic variant of cub::BlockReduce for improved performance
when using commutative operators (e.g., numeric addition)
- Bug fixes:
- Fixed bug where inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly
//-----------------------------------------------------------------------------
1.2.0 02/25/2014
- New features:
- Added device-wide reduce-by-key (DeviceReduce::ReduceByKey, DeviceReduce::RunLengthEncode)
- Performance:
- Improved DeviceScan, DeviceSelect, DevicePartition performance
- Documentation and testing:
- Compatible with CUDA 6.0
- Added performance-portability plots for many device-wide primitives to doc
- Updated doc and tests to reflect iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
- Bug fixes:
- Revised the operation of temporary tile status bookkeeping for DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences)
- Fixed DeviceScan bug where Win32 alignment disagreements between host and device regarding user-defined data types would corrupt tile status
- Fixed BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block
- Added workaround for TexRefInputIteratorT to work with CUDA 6.0
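
As an illustration of what the device-wide reduce-by-key added in 1.2.0 computes, here is a sequential host-side sketch. ReduceByKeyReference is a hypothetical name and signature chosen for this example; it does not mirror the CUB call, which operates on device pointers and temporary storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference: for each run of consecutive equal keys, emit the key
// once and the reduction (under op) of the values in that run.
// Returns the number of runs found.
template <typename Key, typename Value, typename ReductionOp>
std::size_t ReduceByKeyReference(const std::vector<Key>& keys,
                                 const std::vector<Value>& values,
                                 ReductionOp op,
                                 std::vector<Key>& unique_out,
                                 std::vector<Value>& aggregates_out) {
    assert(keys.size() == values.size());
    unique_out.clear();
    aggregates_out.clear();
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i == 0 || !(keys[i] == keys[i - 1])) {
            // A new run starts here: record its key and seed its aggregate.
            unique_out.push_back(keys[i]);
            aggregates_out.push_back(values[i]);
        } else {
            // Same run as the previous item: fold the value in.
            aggregates_out.back() = op(aggregates_out.back(), values[i]);
        }
    }
    return unique_out.size();
}
```

Only consecutive equal keys are merged. A key that reappears later in the sequence starts a new run, so unsorted input can yield the same key more than once in the output.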
//-----------------------------------------------------------------------------
1.1.1 12/11/2013
- New features:
- Added TexObjInputIteratorT, TexRefInputIteratorT, CacheModifiedInputIteratorT, and CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. Compatible with Thrust API.
- Added descending sorting to DeviceRadixSort and BlockRadixSort
- Added min, max, arg-min, and arg-max to DeviceReduce
- Added DeviceSelect (select-unique, select-if, and select-flagged)
- Added DevicePartition (partition-if, partition-flagged)
- Added generic cub::ShuffleUp(), cub::ShuffleDown(), and cub::ShuffleIndex() for warp-wide communication of arbitrary data types (SM3x+)
- Added cub::MaxSmOccupancy() for accurately determining SM occupancy for any given kernel function pointer
- Performance:
- Improved DeviceScan and DeviceRadixSort performance for older architectures (SM10-SM30)
- Interface changes:
- Refactored block-wide I/O (BlockLoad and BlockStore), removing cache-modifiers from their interfaces. The CacheModifiedInputIteratorT and CacheModifiedOutputIterator should now be used with BlockLoad and BlockStore to effect that behavior.
- Renamed device-wide "stream_synchronous" param to "debug_synchronous" to avoid confusion about usage
- Documentation and testing:
- Added simple examples of device-wide methods
- Improved doxygen documentation and example snippets
- Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform)
- Bug fixes:
- Fixed misc DeviceScan, BlockScan, DeviceReduce, and BlockReduce bugs when operating on non-primitive types for older architectures SM10-SM13
- Fixed DeviceScan / WarpReduction bug: SHFL-based segmented reduction producing incorrect results for multi-word types (size > 4B) on Linux
- Fixed BlockScan bug: For warpscan-based scans, not all threads in the first warp were entering the prefix callback functor
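
The three DeviceSelect modes added in 1.1.1 (select-if, select-flagged, select-unique) can be summarized with sequential host-side sketches. The function names below are hypothetical and the signatures are simplified for illustration; the device-wide versions work on device memory and are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// select-if: keep items for which pred(item) is true, preserving order.
template <typename T, typename Pred>
std::vector<T> SelectIfReference(const std::vector<T>& in, Pred pred) {
    std::vector<T> out;
    for (const T& item : in) {
        if (pred(item)) out.push_back(item);
    }
    return out;
}

// select-flagged: keep items whose corresponding flag is nonzero.
template <typename T>
std::vector<T> SelectFlaggedReference(const std::vector<T>& in,
                                      const std::vector<char>& flags) {
    assert(in.size() == flags.size());
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flags[i]) out.push_back(in[i]);
    }
    return out;
}

// select-unique: keep the first item of each run of consecutive equal items.
template <typename T>
std::vector<T> SelectUniqueReference(const std::vector<T>& in) {
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i == 0 || !(in[i] == in[i - 1])) out.push_back(in[i]);
    }
    return out;
}
```

All three keep the selected items in their original relative order, which is what distinguishes selection (stream compaction) from a general reordering.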