iota2 and scikit-learn machine learning algorithms

iota2 can use some of the machine learning algorithms provided by scikit-learn (more specifically, the ensemble methods and SVC).

This documentation explains how to configure iota2 to use the scikit-learn library.

All scikit-learn parameters are available in the scikit_models_parameters section. Some of them map directly to the keyword arguments of the chosen scikit-learn classifier.

Scikit-learn parameters table

Parameter Key	Parameter Type	Default value	Parameter purpose
standardization	Boolean	False	Apply feature standardization before the learning and classification steps
cross_validation_parameters	Dictionary	{}	Range of estimator’s parameters to be tested during cross-validation.
cross_validation_grouped	Boolean	False	If False, samples from the same polygon can end up in different cross-validation folds
cross_validation_folds	Integer	5	Number of cross validation folds
model_type	String	None	scikit-learn classifier’s name
keyword_arguments	Dictionary	{}	Additional keyword arguments to be passed to the model

About standardization

Standardize features by removing the mean and scaling to unit variance.

Note

The standardization implemented in iota2 relies on scikit-learn’s StandardScaler, used with its default values: StandardScaler(copy=True, with_mean=True, with_std=True)
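As an illustration of what this standardization does, here is a minimal sketch that applies StandardScaler with the same settings to a small feature matrix. The array values are made up for the example, and the snippet only shows the scikit-learn behaviour; it is not iota2's internal code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: 4 samples, 2 features (illustrative values only).
features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0],
                     [4.0, 40.0]])

# Same settings as quoted in the note (these are scikit-learn's defaults).
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
scaled = scaler.fit_transform(features)

# After scaling, each column has zero mean and unit variance.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

The scaler must be fitted on the training samples and then reused unchanged at classification time, so that both steps see features on the same scale.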

Cross validation parameters

Cross validation is used to find the best estimator parameters according to a scorer function (overall accuracy). The user has to provide the estimator parameters to optimize, together with the candidate values for each of them, as a Python dictionary. For instance, considering a RandomForestClassifier, the configuration file could contain:

scikit_models_parameters:
{
    model_type: "RandomForestClassifier"
    cross_validation_parameters: {'n_estimators': [50, 100, 150],
                                  'max_depth': [5, 10, 20]}
}

Here, n_estimators and max_depth are two parameters of RandomForestClassifier. In this case, every combination of a value from [50, 100, 150] with a value from [5, 10, 20] (9 combinations in total) will be tested, and the best one, with respect to the estimated scorer value, will be used to build the RandomForestClassifier model.

Note

The cross validation workflow implemented in iota2 comes from the scikit-learn GridSearchCV method.
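To make the grid search concrete, the sketch below runs GridSearchCV directly in scikit-learn with the parameter grid from the configuration example. The synthetic dataset, the random_state values and the use of scoring="accuracy" with cv=5 are assumptions chosen to mirror the description in this page; how iota2 wires these options internally is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic samples standing in for the learning samples (assumption).
X, y = make_classification(n_samples=120, n_features=8,
                           n_informative=4, random_state=0)

# Same dictionary as cross_validation_parameters in the example.
param_grid = {"n_estimators": [50, 100, 150],
              "max_depth": [5, 10, 20]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",  # overall accuracy as the scorer
    cv=5,                # matches the default of cross_validation_folds
)
search.fit(X, y)

# One entry per tested combination: 3 x 3 values.
print(len(search.cv_results_["params"]))
# Parameters of the best-scoring combination.
print(search.best_params_)
```

By default GridSearchCV refits the estimator on all samples with the best combination, so search.best_estimator_ can be used directly afterwards.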

Note

Once the cross validation is complete, a text file named *_cross_val_param.cv is created next to the models. This file contains the cross validation scores for every tested parameter value, along with the chosen parameters.
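The cross_validation_grouped option from the parameters table controls whether samples from one polygon may be spread across folds. The sketch below shows the grouped behaviour using scikit-learn's GroupKFold. This is an assumption made for illustration: the splitter iota2 actually uses is not stated in this page, and the polygon identifiers and data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 samples drawn from 6 polygons (2 samples each).
polygon_ids = np.repeat(np.arange(6), 2)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.tile([0, 1], 6)

# 3 folds keep the toy example small; the documented default is 5.
splitter = GroupKFold(n_splits=3)

disjoint_flags = []
for train_idx, test_idx in splitter.split(X, y, groups=polygon_ids):
    train_polygons = {int(p) for p in polygon_ids[train_idx]}
    test_polygons = {int(p) for p in polygon_ids[test_idx]}
    # With grouped folds, no polygon appears on both sides of a split.
    disjoint_flags.append(train_polygons.isdisjoint(test_polygons))
    print(sorted(test_polygons), disjoint_flags[-1])
```

Keeping whole polygons together avoids scoring a model on samples that are near-duplicates of its training samples, which would otherwise inflate the cross validation score.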

Model keyword arguments

Every classifier from the ensemble methods, as well as SVC, is accessible in iota2, each with its own set of input parameters. For instance, with RandomForestClassifier, the user can configure n_estimators, criterion, max_leaf_nodes, etc. To configure these parameters, use the keyword_arguments dictionary. For example:

scikit_models_parameters:
{
    model_type: "RandomForestClassifier"
    keyword_arguments: {
        criterion: "entropy"
        min_samples_split: 4
    }
    cross_validation_parameters: {'n_estimators': [50, 100, 150],
                                  'max_depth': [5, 10, 20]}
}
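In scikit-learn terms, the model_type and keyword_arguments entries above amount to picking a classifier class by name and passing the dictionary as keyword arguments to its constructor. The sketch below illustrates that idea; build_classifier is a hypothetical helper written for this example and is not part of iota2.

```python
import sklearn.ensemble
import sklearn.svm


def build_classifier(model_type, keyword_arguments):
    """Hypothetical helper: find a classifier class by name and instantiate it."""
    # Only the ensemble methods and SVC are searched, as described above.
    for module in (sklearn.ensemble, sklearn.svm):
        model_class = getattr(module, model_type, None)
        if model_class is not None:
            return model_class(**keyword_arguments)
    raise ValueError(f"Unknown model_type: {model_type!r}")


# Same values as in the configuration example.
model = build_classifier(
    "RandomForestClassifier",
    {"criterion": "entropy", "min_samples_split": 4},
)
print(model.get_params()["criterion"])
print(model.get_params()["min_samples_split"])
```

Parameters listed in cross_validation_parameters are then searched on top of these fixed values, so the same parameter should not normally be set in both dictionaries.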

Configuration file example

Here is an example of a configuration file that uses scikit-learn machine learning algorithms and is fully operational with the downloadable dataset.