Alien-XGBoost
xgboost/cub/CHANGE_LOG.TXT
cub::DeviceScan) where incorrect results (e.g., NAN) would often be
returned when parameterized for floating-point types (fp32, fp64).
- Workaround-fix for ptxas error when compiling with -G flag on Linux
(for debug instrumentation)
- Misc. workaround-fixes for certain scan scenarios (using custom
scan operators) where code compiled for SM1x is run on newer
GPUs of higher compute-capability: the compiler could not tell
which memory space was being used by collective operations and was
mistakenly using global ops instead of shared ops.
//-----------------------------------------------------------------------------
1.2.3 04/01/2014
- Bug fixes:
- Fixed access violation bug in DeviceReduce::ReduceByKey for non-primitive value types
- Fixed code-snippet bug in ArgIndexInputIteratorT documentation
//-----------------------------------------------------------------------------
1.2.2 03/03/2014
- New features:
- Added MS VC++ project solutions for device-wide and block-wide examples
- Performance:
- Added a third algorithmic variant of cub::BlockReduce for improved performance
when using commutative operators (e.g., numeric addition)
- Bug fixes:
- Fixed bug where inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly
//-----------------------------------------------------------------------------
1.2.0 02/25/2014
- New features:
- Added device-wide reduce-by-key (DeviceReduce::ReduceByKey, DeviceReduce::RunLengthEncode)
- Performance
- Improved DeviceScan, DeviceSelect, DevicePartition performance
- Documentation and testing:
- Compatible with CUDA 6.0
- Added performance-portability plots for many device-wide primitives to doc
- Update doc and tests to reflect iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
- Bug fixes
- Revised the operation of temporary tile status bookkeeping for DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences)
- Fixed DeviceScan bug where Win32 alignment disagreements between host and device regarding user-defined data types would corrupt tile status
- Fixed BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block
- Added workaround for TexRefInputIteratorT to work with CUDA 6.0
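For readers unfamiliar with the reduce-by-key and run-length-encode operations named in the 1.2.0 feature list, the following is a minimal sequential C++ sketch of what such primitives compute. It runs on the host, is not CUB code, and does not use the CUB API. The function names (reduce_by_key_reference, run_length_encode_reference) are hypothetical, and the sketch assumes that only runs of consecutive equal keys are merged.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side reference (not the CUB API): collapses each run of
// consecutive equal keys into one (key, reduced value) pair using `op`.
template <typename Key, typename Value, typename ReduceOp>
std::vector<std::pair<Key, Value>> reduce_by_key_reference(
    const std::vector<Key>& keys,
    const std::vector<Value>& values,
    ReduceOp op)
{
    assert(keys.size() == values.size());
    std::vector<std::pair<Key, Value>> out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (!out.empty() && out.back().first == keys[i]) {
            // Same key as the current run: fold the value into the run total.
            out.back().second = op(out.back().second, values[i]);
        } else {
            // Key changed: start a new run seeded with this value.
            out.push_back(std::make_pair(keys[i], values[i]));
        }
    }
    return out;
}

// Run-length encoding viewed as reduce-by-key where every value is 1 and the
// operator is addition.
template <typename Key>
std::vector<std::pair<Key, int>> run_length_encode_reference(
    const std::vector<Key>& keys)
{
    std::vector<int> ones(keys.size(), 1);
    return reduce_by_key_reference(keys, ones,
                                   [](int a, int b) { return a + b; });
}
```

Because each run is seeded with its first value, the value type in this sketch needs no default constructor or identity element, only a copy and the reduction operator.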
//-----------------------------------------------------------------------------
1.1.1 12/11/2013
- New features:
- Added TexObjInputIteratorT, TexRefInputIteratorT, CacheModifiedInputIteratorT, and CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. Compatible with Thrust API.
- Added descending sorting to DeviceRadixSort and BlockRadixSort
- Added min, max, arg-min, and arg-max to DeviceReduce
- Added DeviceSelect (select-unique, select-if, and select-flagged)
- Added DevicePartition (partition-if, partition-flagged)
- Added generic cub::ShuffleUp(), cub::ShuffleDown(), and cub::ShuffleIndex() for warp-wide communication of arbitrary data types (SM3x+)
- Added cub::MaxSmOccupancy() for accurately determining SM occupancy for any given kernel function pointer
- Performance
- Improved DeviceScan and DeviceRadixSort performance for older architectures (SM10-SM30)
- Interface changes:
- Refactored block-wide I/O (BlockLoad and BlockStore), removing cache-modifiers from their interfaces. The CacheModifiedInputIteratorT and CacheModifiedOutputIterator should now be used with BlockLoad and BlockStore to effect that behavior.
- Renamed device-wide "stream_synchronous" param to "debug_synchronous" to avoid confusion about usage
- Documentation and testing:
- Added simple examples of device-wide methods
- Improved doxygen documentation and example snippets
- Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform)
- Bug fixes
- Fixed misc DeviceScan, BlockScan, DeviceReduce, and BlockReduce bugs when operating on non-primitive types for older architectures SM10-SM13
- Fixed DeviceScan / WarpReduction bug: SHFL-based segmented reduction producing incorrect results for multi-word types (size > 4B) on Linux
- Fixed BlockScan bug: For warpscan-based scans, not all threads in the first warp were entering the prefix callback functor
- Fixed DeviceRadixSort bug: race condition with key-value pairs for pre-SM35 architectures
- Fixed DeviceRadixSort bug: incorrect bitfield-extract behavior with long keys on 64bit Linux
- Fixed BlockDiscontinuity bug: compilation error for types other than int32/uint32
- CDP (device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts
//-----------------------------------------------------------------------------
1.0.2 08/23/2013
- Corrections to code snippet examples for BlockLoad, BlockStore, and BlockDiscontinuity
- Cleaned up unnecessary/missing header includes. You can now safely #include a specific .cuh (instead of cub.cuh)
- Bug/compilation fixes for BlockHistogram
//-----------------------------------------------------------------------------
1.0.1 08/08/2013
- New collective interface idiom (specialize::construct::invoke).
- Added best-in-class DeviceRadixSort. Implements short-circuiting for homogeneous digit passes.
- Added best-in-class DeviceScan. Implements single-pass "adaptive-lookback" strategy.
- Significantly improved documentation (with example code snippets)
- More extensive regression test suite for aggressively testing collective variants
- Allow non-trivially-constructed types (previously unions had prevented aliasing temporary storage of those types)
- Improved support for Kepler SHFL (collective ops now use SHFL for types larger than 32b)
- Better code generation for 64-bit addressing within BlockLoad/BlockStore
- DeviceHistogram now supports histograms of arbitrary bins
- Misc. fixes
- Workarounds for SM10 codegen issues in uncommonly-used WarpScan/Reduce specializations
- Updates to accommodate CUDA 5.5 dynamic parallelism
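The "short-circuiting for homogeneous digit passes" noted for DeviceRadixSort can be pictured with a sequential least-significant-digit radix sort that skips any pass in which every key carries the same digit, since such a pass would leave the order unchanged. The sketch below runs on the host and is not CUB's implementation. The function name radix_sort_short_circuit, the 8-bit digit width, and the 32-bit unsigned key type are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side sketch (not CUB code): LSD radix sort on 32-bit
// unsigned keys using 8-bit digits. Returns the number of digit passes that
// actually scattered data, so the effect of short-circuiting is observable.
inline int radix_sort_short_circuit(std::vector<std::uint32_t>& keys)
{
    const int kDigitBits = 8;                        // assumed digit width
    const std::size_t kBuckets = 1u << kDigitBits;   // 256 buckets
    std::vector<std::uint32_t> scratch(keys.size());
    int passes_executed = 0;

    for (int shift = 0; shift < 32; shift += kDigitBits) {
        // Histogram of the current digit over all keys.
        std::array<std::size_t, 256> counts{};
        for (std::uint32_t k : keys) {
            ++counts[(k >> shift) & (kBuckets - 1)];
        }

        // Short-circuit: if one bucket holds every key, the stable scatter
        // for this digit would not reorder anything, so skip it.
        bool homogeneous = false;
        for (std::size_t c : counts) {
            if (c == keys.size()) { homogeneous = true; break; }
        }
        if (homogeneous) continue;

        // Exclusive prefix sum turns bucket counts into starting offsets.
        std::size_t offset = 0;
        for (std::size_t b = 0; b < kBuckets; ++b) {
            const std::size_t c = counts[b];
            counts[b] = offset;
            offset += c;
        }

        // Stable scatter into the scratch buffer, then swap buffers.
        for (std::uint32_t k : keys) {
            scratch[counts[(k >> shift) & (kBuckets - 1)]++] = k;
        }
        keys.swap(scratch);
        ++passes_executed;
    }
    return passes_executed;
}
```

In this sketch, keys that all fit in the low 8 bits need a single scatter pass instead of four; the histogram is still computed for every digit, which is the cost paid to detect the homogeneous case.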
//-----------------------------------------------------------------------------
0.9.4 05/07/2013
- Fixed compilation errors for SM10-SM13
- Fixed compilation errors for some WarpScan entrypoints on SM30+
- Added block-wide histogram (BlockHistogram256)
- Added device-wide histogram (DeviceHistogram256)
- Added new BlockScan algorithm variant BLOCK_SCAN_RAKING_MEMOIZE (which
trades more register consumption for less shared memory I/O)
- Updates to BlockRadixRank to use BlockScan (which improves performance
on Kepler due to SHFL instruction)
- Allow types other than C++ primitives to be used in WarpScan::*Sum methods
if they only have operator + overloaded. (Previously they were also required
to support assignment from int(0).)
- Update BlockReduce's BLOCK_REDUCE_WARP_REDUCTIONS algorithm to work even
when block size is not an even multiple of warp size
- Added work management utility descriptors (GridQueue, GridEvenShare)
- Refactoring of DeviceAllocator interface and CachingDeviceAllocator
implementation
- Misc. documentation updates and corrections.
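The 0.9.4 change allowing WarpScan::*Sum on types that only overload operator + rests on the observation that an inclusive prefix sum can be seeded from its first element and so needs no additive identity. The sequential host-side sketch below illustrates that idea only; it is not CUB's warp-level implementation. The type Vec2 and the function inclusive_sum_reference are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical user-defined type: supports operator+ but has no default
// constructor and cannot be constructed or assigned from the integer 0.
struct Vec2 {
    float x;
    float y;
    Vec2(float x_, float y_) : x(x_), y(y_) {}
};

inline Vec2 operator+(const Vec2& a, const Vec2& b)
{
    return Vec2(a.x + b.x, a.y + b.y);
}

// Hypothetical host-side reference (not CUB code): inclusive prefix sum that
// is seeded from the first input element, so T only needs operator+ and a
// copy constructor.
template <typename T>
std::vector<T> inclusive_sum_reference(const std::vector<T>& in)
{
    std::vector<T> out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i == 0) {
            out.push_back(in[0]);                 // seed, no zero value needed
        } else {
            out.push_back(out.back() + in[i]);    // running total so far + next
        }
    }
    return out;
}
```

An exclusive sum, by contrast, has to produce some value for the first output position, which is why an identity or caller-supplied initial value is needed there.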