iota2 and scikit-learn machine learning algorithms
iota2 can use some of the machine learning algorithms provided by scikit-learn (more specifically, the ensemble methods and SVC).
This documentation explains how to configure iota2 to use the scikit-learn library.
All scikit-learn related parameters are set in the scikit_models_parameters section. Some of them map directly to the keyword arguments of scikit-learn classifiers.
Scikit-learn parameters table
| Parameter Key | Parameter Type | Default value | Parameter purpose |
|---|---|---|---|
| standardization | Boolean | False | Apply feature standardization before the learning and classification processes |
| cross_validation_parameters | Dictionary | {} | Range of estimator parameters to be tested during cross-validation |
| cross_validation_grouped | Boolean | False | If false, cross-validation folds can contain mixed samples from different polygons |
| cross_validation_folds | Integer | 5 | Number of cross-validation folds |
| model_type | String | None | Name of the scikit-learn classifier |
| keyword_arguments | Dictionary | {} | Additional arguments to be passed to the model |
About standardization
Standardize features by removing the mean and scaling to unit variance.
Note
The standardization implemented in iota2 relies on the scikit-learn StandardScaler class, used with its default values: StandardScaler(copy=True, with_mean=True, with_std=True)
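To make the effect of standardization concrete, here is a minimal sketch that applies StandardScaler with the arguments quoted in the note to a small, made-up feature matrix. The array values and variable names are illustrative assumptions and do not come from iota2.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: 4 samples, 2 features (made-up values).
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Same arguments as in the note; they are the scikit-learn defaults.
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
scaled = scaler.fit_transform(features)

# After scaling, each column has zero mean and unit variance.
column_means = scaled.mean(axis=0)
column_stds = scaled.std(axis=0)
```

In a real workflow the scaler fitted on the training samples would also be applied to the data to classify, so that both go through the same transformation.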
Cross validation parameters
Cross-validation is used to find the estimator parameters that give the best value of a scorer function (overall accuracy).
The user has to provide the list of estimator parameters to optimize, together with
the values to test, as a Python dictionary. For instance, with
a RandomForestClassifier, the configuration file
could contain:
scikit_models_parameters:
{
    model_type: "RandomForestClassifier"
    cross_validation_parameters: {'n_estimators': [50, 100, 150],
                                  'max_depth': [5, 10, 20]}
}
Here, n_estimators and max_depth are two parameters of RandomForestClassifier.
In this case, every pair of values taken from [50, 100, 150] and [5, 10, 20] is tested, and the best one,
with respect to the estimated scorer value, is used to build the RandomForestClassifier model.
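The set of tested combinations is the Cartesian product of the value lists. The following sketch enumerates it with the standard library only. It illustrates the idea and is not iota2's code; the variable names are assumptions.

```python
from itertools import product

# Same dictionary as in the configuration example.
cross_validation_parameters = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 20],
}

# Build one candidate dictionary per combination of values.
names = list(cross_validation_parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*cross_validation_parameters.values())
]

for candidate in candidates:
    print(candidate)
```

With three values for each of the two parameters, nine candidate settings are evaluated.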
Note
The cross-validation workflow implemented in iota2 relies on the scikit-learn GridSearchCV class.
Note
Once the cross-validation is complete, a text file called *_cross_val_param.cv is created next to the models.
This file contains the cross-validation score of every tested parameter combination, as well as the chosen parameters.
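The sketch below shows how a grid search of this kind can be run directly with GridSearchCV on synthetic data. Several points are assumptions made for illustration: the data and the polygon_ids array are invented, the grid is reduced so the example runs quickly, and using GroupKFold to keep samples of one polygon in the same fold is only one plausible reading of cross_validation_grouped, not a description of iota2 internals.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)

# Synthetic samples: 6 hypothetical polygons, 10 samples each, 4 features.
polygon_ids = np.repeat(np.arange(6), 10)
labels = polygon_ids % 2  # one class per polygon
samples = rng.normal(size=(60, 4)) + 2.0 * labels[:, None]

# Reduced grid (assumption) to keep the example fast.
param_grid = {"n_estimators": [10, 20], "max_depth": [5, 10]}

cross_validation_grouped = True
cross_validation_folds = 3

# Grouped folds never split a polygon between training and validation.
cv = GroupKFold(n_splits=cross_validation_folds) if cross_validation_grouped else cross_validation_folds

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(samples, labels, groups=polygon_ids)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```

The attributes best_params_ and cv_results_ hold the same kind of information as the one described for the *_cross_val_param.cv file: the selected parameters and the score of each tested combination.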
Model’s keyword arguments
Every classifier from the ensemble methods, as well as SVC, is
accessible in iota2, each one with its own set of input parameters. For instance, with the
RandomForestClassifier,
the user can configure n_estimators, criterion, max_leaf_nodes, etc. To configure these
parameters, use the keyword_arguments dictionary. For example:
scikit_models_parameters:
{
    model_type: "RandomForestClassifier"
    keyword_arguments: {
        criterion: "entropy"
        min_samples_split: 4
    }
    cross_validation_parameters: {'n_estimators': [50, 100, 150],
                                  'max_depth': [5, 10, 20]}
}
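As an illustration of how a classifier name and a keyword_arguments dictionary can be turned into a scikit-learn estimator, here is a minimal sketch. The helper build_classifier is hypothetical and is not part of iota2; it only looks the class up in sklearn.ensemble and sklearn.svm and forwards the keyword arguments to its constructor.

```python
import sklearn.ensemble
import sklearn.svm


def build_classifier(model_type, keyword_arguments):
    """Hypothetical helper: instantiate a scikit-learn classifier by name."""
    for module in (sklearn.ensemble, sklearn.svm):
        if hasattr(module, model_type):
            # Forward the dictionary as constructor keyword arguments.
            return getattr(module, model_type)(**keyword_arguments)
    raise ValueError(f"Unknown classifier name: {model_type}")


# Values taken from the configuration example.
model = build_classifier(
    "RandomForestClassifier",
    {"criterion": "entropy", "min_samples_split": 4},
)
print(model)
```

A misspelled key in keyword_arguments would make the constructor raise a TypeError, which is a quick way to check a configuration against the scikit-learn documentation of the chosen classifier.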
Configuration file example
Here is an example of a configuration file,
fully operational with the downloadable dataset,
that uses scikit-learn machine learning algorithms.