"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
xgboost/R-package/R/xgb.importance.R
#' For that reason, in order to obtain a meaningful ranking by importance for a linear model,
#' the features need to be on the same scale (which you also would want to do when using either
#' L1 or L2 regularization).
#'
#' @return
#'
#' For a tree model, a \code{data.table} with the following columns:
#' \itemize{
#' \item \code{Features} names of the features used in the model;
#' \item \code{Gain} represents the fractional contribution of each feature to the model, based on
#' the total gain of this feature's splits. A higher percentage means a more important
#' predictive feature.
#' \item \code{Cover} metric of the number of observations related to this feature;
#' \item \code{Frequency} percentage representing the relative number of times
#' a feature has been used in trees.
#' }
#'
#' A linear model's importance \code{data.table} has the following columns:
#' \itemize{
#' \item \code{Features} names of the features used in the model;
#' \item \code{Weight} the linear coefficient of this feature;
#' \item \code{Class} (only for multiclass models) class label.
#' }
#'
xgboost/R-package/man/xgb.importance.Rd
\item{label}{deprecated.}
\item{target}{deprecated.}
}
\value{
For a tree model, a \code{data.table} with the following columns:
\itemize{
\item \code{Features} names of the features used in the model;
\item \code{Gain} represents the fractional contribution of each feature to the model, based on
the total gain of this feature's splits. A higher percentage means a more important
predictive feature.
\item \code{Cover} metric of the number of observations related to this feature;
\item \code{Frequency} percentage representing the relative number of times
a feature has been used in trees.
}
A linear model's importance \code{data.table} has the following columns:
\itemize{
\item \code{Features} names of the features used in the model;
\item \code{Weight} the linear coefficient of this feature;
\item \code{Class} (only for multiclass models) class label.
}
xgboost/R-package/vignettes/discoverYourData.Rmd
head(importanceClean)
```
> In the table above we have removed two unneeded columns and selected only the first lines.
The first thing you will notice is the new column `Split`. It is the split applied to the feature on a branch of one of the trees. Each split is present; therefore, a feature can appear several times in this table. Here we can see the feature `Age` is used several times with different splits.
How is the split used when counting the co-occurrences? The comparison is always `<`. For instance, in the second line, we measure the number of persons under 61.5 years with the illness gone after the treatment.
The two other new columns are `RealCover` and `RealCover %`. The first column measures the number of observations in the dataset where the split is respected and the label is marked as `1`. The second column is the percentage of the whole population that `RealCover` represents.
Therefore, according to our findings, getting a placebo doesn't seem to help, but being younger than 61 years may help (which seems logical).
> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`; therefore, looking for one-hot-encoded categorical observations validating the rule `< 1.00001` is like just looking for `1` for this feature.
### Plotting the feature importance
All these things are nice, but it would be even better to plot the results.
xgboost/cub/cub/device/device_partition.cuh
*
* \par Performance
* \linear_performance{partition}
*
* \par
* The following chart illustrates DevicePartition::If
* performance across different CUDA architectures for \p int32 items,
* where 50% of the items are randomly selected for the first partition.
* \plots_below
*
* \image html partition_if_int32_50_percent.png
*
*/
struct DevicePartition
{
/**
 * \brief Uses the \p d_flags sequence to split the corresponding items from \p d_in into a partitioned sequence \p d_out. The total number of items copied into the first partition is written to \p d_num_selected_out.
xgboost/cub/cub/device/device_partition.cuh
* - Copies of the selected items are compacted into \p d_out and maintain their original
* relative ordering, however copies of the unselected items are compacted into the
* rear of \p d_out in reverse order.
* - \devicestorage
*
* \par Performance
* The following charts illustrate saturated partition-if performance across different
* CUDA architectures for \p int32 and \p int64 items, respectively. Items are
* selected for the first partition with 50% probability.
*
* \image html partition_if_int32_50_percent.png
* \image html partition_if_int64_50_percent.png
*
* \par
* The following charts are similar, but 5% selection probability for the first partition:
*
* \image html partition_if_int32_5_percent.png
* \image html partition_if_int64_5_percent.png
*
* \par Snippet
* The code snippet below illustrates the compaction of items selected from an \p int device vector.
* \par
* \code
* #include <cub/cub.cuh> // or equivalently <cub/device/device_partition.cuh>
*
* // Functor type for selecting values less than some criteria
 * struct LessThan
 * {
 *     int compare;
 *     CUB_RUNTIME_FUNCTION __forceinline__
 *     LessThan(int compare) : compare(compare) {}
 *     CUB_RUNTIME_FUNCTION __forceinline__
 *     bool operator()(const int &a) const { return (a < compare); }
 * };
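 *
 * // The rest of this snippet is a reconstructed sketch of the standard
 * // two-phase CUB dispatch (size the temporary storage, then run);
 * // buffer contents and sizes below are illustrative assumptions.
 * int      num_items;             // e.g., 8
 * int      *d_in;                 // device input, e.g., [0, 2, 3, 9, 5, 2, 81, 8]
 * int      *d_out;                // device output for the partitioned items
 * int      *d_num_selected_out;   // device scalar: size of the first partition
 * LessThan select_op(7);
 *
 * // Determine temporary device storage requirements (d_temp_storage is NULL)
 * void     *d_temp_storage = NULL;
 * size_t   temp_storage_bytes = 0;
 * cub::DevicePartition::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
 *
 * // Allocate temporary storage, then run the partition
 * cudaMalloc(&d_temp_storage, temp_storage_bytes);
 * cub::DevicePartition::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
 * \endcode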
xgboost/cub/cub/device/device_select.cuh
* \cdp_class{DeviceSelect}
*
* \par Performance
* \linear_performance{select-flagged, select-if, and select-unique}
*
* \par
* The following chart illustrates DeviceSelect::If
* performance across different CUDA architectures for \p int32 items,
* where 50% of the items are randomly selected.
*
* \image html select_if_int32_50_percent.png
*
* \par
* The following chart illustrates DeviceSelect::Unique
* performance across different CUDA architectures for \p int32 items
* where segments have lengths uniformly sampled from [1,1000].
*
* \image html select_unique_int32_len_500.png
*
* \par
* \plots_below
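As a rough sketch of how these selection primitives are driven: the two-phase dispatch below is the standard CUB pattern (first call sizes the temporary storage, second call does the work); the wrapper function and its buffer arguments are illustrative assumptions, not part of the header above.

```
#include <cub/cub.cuh>   // or equivalently <cub/device/device_select.cuh>

// Illustrative wrapper: compact the first item of each run of identical
// values in d_in into d_out, writing the count to d_num_selected_out.
// All pointers are assumed to be valid device allocations.
void RunUnique(int *d_in, int *d_out, int *d_num_selected_out, int num_items)
{
    void   *d_temp_storage    = NULL;
    size_t temp_storage_bytes = 0;
    // First call: d_temp_storage is NULL, so only the required size is computed.
    cub::DeviceSelect::Unique(d_temp_storage, temp_storage_bytes,
                              d_in, d_out, d_num_selected_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call: performs the actual selection.
    cub::DeviceSelect::Unique(d_temp_storage, temp_storage_bytes,
                              d_in, d_out, d_num_selected_out, num_items);
    cudaFree(d_temp_storage);
}
```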
xgboost/cub/cub/device/device_select.cuh
*
* \par
* - Copies of the selected items are compacted into \p d_out and maintain their original relative ordering.
* - \devicestorage
*
* \par Performance
* The following charts illustrate saturated select-if performance across different
* CUDA architectures for \p int32 and \p int64 items, respectively. Items are
* selected with 50% probability.
*
* \image html select_if_int32_50_percent.png
* \image html select_if_int64_50_percent.png
*
* \par
* The following charts are similar, but 5% selection probability:
*
* \image html select_if_int32_5_percent.png
* \image html select_if_int64_5_percent.png
*
* \par Snippet
* The code snippet below illustrates the compaction of items selected from an \p int device vector.
* \par
* \code
* #include <cub/cub.cuh> // or equivalently <cub/device/device_select.cuh>
*
* // Functor type for selecting values less than some criteria
 * struct LessThan
 * {
 *     int compare;
 *     CUB_RUNTIME_FUNCTION __forceinline__
 *     LessThan(int compare) : compare(compare) {}
 *     CUB_RUNTIME_FUNCTION __forceinline__
 *     bool operator()(const int &a) const { return (a < compare); }
 * };
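 *
 * // The rest of this snippet is a reconstructed sketch of the same
 * // two-phase CUB dispatch used throughout these examples; buffer
 * // contents and sizes below are illustrative assumptions.
 * int      num_items;             // e.g., 8
 * int      *d_in;                 // device input, e.g., [0, 2, 3, 9, 5, 2, 81, 8]
 * int      *d_out;                // device output for the selected items
 * int      *d_num_selected_out;   // device scalar: number of selected items
 * LessThan select_op(7);
 *
 * // Determine temporary device storage requirements (d_temp_storage is NULL)
 * void     *d_temp_storage = NULL;
 * size_t   temp_storage_bytes = 0;
 * cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
 *
 * // Allocate temporary storage, then run the selection
 * cudaMalloc(&d_temp_storage, temp_storage_bytes);
 * cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
 * \endcode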
xgboost/doc/R-package/discoverYourData.md
## 5: SexMale -1.00136e-05 0.04874405 4 0.1428571
## 6: Age 53.5 0.04620627 10 0.3571429
```
> In the table above we have removed two unneeded columns and selected only the first lines.
The first thing you will notice is the new column `Split`. It is the split applied to the feature on a branch of one of the trees. Each split is present; therefore, a feature can appear several times in this table. Here we can see the feature `Age` is used several times with different splits.
How is the split used when counting the co-occurrences? The comparison is always `<`. For instance, in the second line, we measure the number of persons under 61.5 years with the illness gone after the treatment.
The two other new columns are `RealCover` and `RealCover %`. The first column measures the number of observations in the dataset where the split is respected and the label is marked as `1`. The second column is the percentage of the whole population that `RealCover` represents.
Therefore, according to our findings, getting a placebo doesn't seem to help, but being younger than 61 years may help (which seems logical).
> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`; therefore, looking for one-hot-encoded categorical observations validating the rule `< 1.00001` is like just looking for `1` for this feature.
### Plotting the feature importance
All these things are nice, but it would be even better to plot the results.
xgboost/src/tree/fast_hist_param.h
#define XGBOOST_TREE_FAST_HIST_PARAM_H_
namespace xgboost {
namespace tree {
/*! \brief training parameters for histogram-based training */
struct FastHistParam : public dmlc::Parameter<FastHistParam> {
// integral data type to be used with columnar data storage
enum class DataType { uint8 = 1, uint16 = 2, uint32 = 4 };
int colmat_dtype;
// percentage threshold for treating a feature as sparse
// e.g. 0.2 indicates a feature with fewer than 20% nonzeros is considered sparse
double sparse_threshold;
// use feature grouping? (default: no)
int enable_feature_grouping;
// when grouping features, how many "conflicts" to allow.
// conflict is when an instance has nonzero values for two or more features
// default is 0, meaning features should be strictly complementary
double max_conflict_rate;
// when grouping features, how much effort to expend to prevent singleton groups
// we'll try to insert each feature into existing groups before creating a new group
xgboost/src/tree/fast_hist_param.h
DMLC_DECLARE_PARAMETER(FastHistParam) {
DMLC_DECLARE_FIELD(colmat_dtype)
.set_default(static_cast<int>(DataType::uint32))
.add_enum("uint8", static_cast<int>(DataType::uint8))
.add_enum("uint16", static_cast<int>(DataType::uint16))
.add_enum("uint32", static_cast<int>(DataType::uint32))
.describe("Integral data type to be used with columnar data storage. "
"May carry marginal performance implications. Reserved for "
"advanced use");
DMLC_DECLARE_FIELD(sparse_threshold).set_range(0, 1.0).set_default(0.2)
.describe("percentage threshold for treating a feature as sparse");
DMLC_DECLARE_FIELD(enable_feature_grouping).set_lower_bound(0).set_default(0)
.describe("if >0, enable feature grouping to ameliorate work imbalance "
"among worker threads");
DMLC_DECLARE_FIELD(max_conflict_rate).set_range(0, 1.0).set_default(0)
.describe("when grouping features, how many \"conflicts\" to allow. "
"A conflict is when an instance has nonzero values for two or more features. "
"Default is 0, meaning features should be strictly complementary.");
DMLC_DECLARE_FIELD(max_search_group).set_lower_bound(0).set_default(100)
.describe("when grouping features, how much effort to expend to prevent "
"singleton groups. We'll try to insert each feature into existing "
"groups before creating a new group.");