Sklearn classification example¶
In this page, we will show you how to train an sklearn classifier using a PyTorch feature extractor. The example demonstrates how to create custom experiments starting from the default settings.
Training¶
Dataset¶
Let's start with the dataset that we are going to use. Since we are using the base datamodule, we need to organize the data in a specific way. The base datamodule expects the following structure:
dataset/
├── class_0
│   ├── abc.xyz
│   └── ...
├── class_1
│   ├── abc.xyz
│   └── ...
├── class_N
│   ├── abc.xyz
│   └── ...
├── None
│   ├── abc.xyz
│   └── ...
├── train.txt # optional
└── test.txt # optional
Note
A None class representing unknown data can be used only for inference. If a folder with the same name is present during training, it will be interpreted as a regular class.
The train.txt and test.txt files are optional and can be used to specify the list of images to use for training (and validation) and testing. If they are not provided, the datamodule will create the splits on its own, based on its other parameters (such as val_size). Each file should contain image paths relative to the dataset root folder. For example, if the dataset is organized as above, the train.txt file could be:
class_0/abc.xyz
...
class_1/abc.xyz
...
class_N/abc.xyz
...
The standard datamodule configuration for this kind of task is found under datamodule/base/sklearn_classification.yaml.
_target_: quadra.datamodules.SklearnClassificationDataModule
data_path: ???
exclude_filter:
include_filter:
val_size: 0.3
class_to_idx:
label_map:
seed: ${core.seed}
batch_size: 32
num_workers: 8
train_transform: ${transforms.train_transform}
val_transform: ${transforms.val_transform}
test_transform: ${transforms.test_transform}
roi:
n_splits: 1
phase:
cache: false
limit_training_data:
train_split_file:
test_split_file:
The only required parameter is data_path, which should point to the dataset root folder. The other parameters can be used to customize the datamodule behavior; the most important ones are:
- val_size: The fraction of the dataset to use for validation (if test_split_file is not specified)
- class_to_idx: A dictionary mapping class names to class indexes
- label_map: A dictionary mapping groups of classes to a single class (e.g. "good": ["class_1", "class_2"]); it may be useful for testing different scenarios or for simplifying the classification task (see the sketch after this list)
- roi: Optional region of interest in the format (x_upper_left, y_upper_left, x_bottom_right, y_bottom_right)
- n_splits: The number of splits used to partition the dataset; if 1 the dataset is split into train and test, if > 1 cross-validation is used
- cache: If cross-validation is used, it is possible to cache the features extracted from the backbone to speed up the process
- limit_training_data: If specified, the datamodule will use only a subset of the training data (useful for debugging)
- train_split_file: If specified, the datamodule will use the given file to create the train dataset (which will also be the base for validation)
- test_split_file: If specified, the datamodule will use the given file to create the test dataset
Whether cross-validation or a standard train/test split is used, the final model is trained on the whole training dataset; the splits are only used to validate the model.
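As a quick illustration of class_to_idx, label_map and roi, an experiment could override the datamodule like this (a hypothetical sketch: the class names, the "good"/"bad" grouping and the roi coordinates are placeholders, not values required by the library):
datamodule:
  class_to_idx:
    class_0: 0
    class_1: 1
    class_2: 2
  # Group class_1 and class_2 under a single "good" label and keep class_0 as "bad"
  label_map:
    good: [class_1, class_2]
    bad: [class_0]
  # Crop applied to every image: (x_upper_left, y_upper_left, x_bottom_right, y_bottom_right)
  roi: [0, 0, 224, 224]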
Experiment¶
Suppose we want to run an experiment on the given dataset; we can define a config starting from the base config:
# @package _global_
defaults:
  - override /model: logistic_regression
  - override /transforms: default_resize
  - override /task: sklearn_classification
  - override /backbone: dino_vitb8
  - override /trainer: sklearn_classification
  - override /datamodule: base/sklearn_classification

export:
  types: [pytorch, torchscript]

backbone:
  model:
    pretrained: true
    freeze: true

core:
  tag: "run"
  name: "sklearn-classification"

datamodule:
  num_workers: 8
  batch_size: 32
  phase: train
  n_splits: 1
By default the experiment will use dino_vitb8 as the backbone, resize the images to 224x224 and train a logistic regression classifier. Setting the n_splits parameter to 1 will use a standard 70/30 train/validation split (derived from the val_size specified in the base datamodule) instead of cross-validation. The model will also be exported in two formats, torchscript and pytorch.
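For reference, the 70/30 ratio comes from val_size: 0.3 in the base datamodule config shown above; a different ratio could be requested with an override such as (the value is only an example):
datamodule:
  n_splits: 1
  val_size: 0.2 # 80/20 train/validation split instead of the default 70/30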
An actual configuration file based on the above could be this one (suppose it's saved under configs/experiment/custom_experiment/sklearn_classification.yaml):
# @package _global_
defaults:
  - base/classification/sklearn_classification
  - override /backbone: resnet18
  - _self_

core:
  name: experiment-name

export:
  types: [pytorch, torchscript]

datamodule:
  data_path: path_to_dataset
  batch_size: 64
  class_to_idx:
    class_0: 0
    class_1: 1
    class_2: 2
  n_splits: 5
  train_split_file: ${datamodule.data_path}/train.txt
  test_split_file: ${datamodule.data_path}/test.txt

task:
  device: cuda:0
  half_precision: false
  gradcam: false
  automatic_batch_size:
    starting_batch_size: 1024
    disable: true
  save_model_summary: false
  output:
    folder: classification_experiment
    report: true
    example: true
    test_full_data: true
This will train a logistic regression classifier using a resnet18 backbone, resizing the images to 224x224 and using 5-fold cross-validation. The class_to_idx parameter maps class names to the indexes used to train the classifier. The output parameter specifies the output folder and the type of output to save. The export.types parameter can be used to export the model in different formats; at the moment torchscript, onnx and pytorch are supported.
The backbone (in torchscript and pytorch format) will be saved along with the classifier. test_full_data specifies whether a final test should be performed on all the data (after training on both the training and validation datasets).
It's possible to enable half precision training by setting half_precision to true and to export gradcam results by setting gradcam to true.
Optionally, it's possible to enable the automatic batch size finder by setting automatic_batch_size.disable to false. This will try to find the maximum batch size that can be used on the given device without running out of memory. The starting_batch_size parameter specifies the batch size from which the search starts; the algorithm will repeatedly halve it until it no longer runs out of memory.
Finally, the save_model_summary parameter can be used to save the backbone information in a text file called model_summary.txt located in the root of the output folder.
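Putting these task options together, a hypothetical set of overrides enabling all of them could look like the following (the field names match the config above, the values are just an example):
task:
  half_precision: true # enable half precision training
  gradcam: true # also export gradcam results
  automatic_batch_size:
    starting_batch_size: 1024 # first batch size tried by the finder
    disable: false # enable the automatic batch size search
  save_model_summary: true # write model_summary.txt to the output folder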
Run¶
Assuming that you have created a virtual environment and installed the quadra library, you can run the experiment with the following command:
quadra experiment=custom_experiment/sklearn_classification
This will run the experiment, training a classifier and saving metrics and reports under the classification_experiment folder.
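Since quadra experiments are configured through Hydra, individual parameters can typically also be overridden directly from the command line; for example (the path and name below are placeholders):
quadra experiment=custom_experiment/sklearn_classification datamodule.data_path=/path/to/dataset core.name=my-run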
The output folder should contain the following entries:
classification_experiment
classification_experiment_0
classification_experiment_1
classification_experiment_2
classification_experiment_3
classification_experiment_4
config_resolved.yaml
config_tree.txt
data
deployment_model
main.log
test
Each classification_experiment_X folder contains the metrics for the corresponding fold, while the classification_experiment folder contains the metrics computed by aggregating the results of all the folds.
The data folder contains a joblib version of the datamodule, storing parameters and splits for reproducibility. The deployment_model folder contains the backbone exported in torchscript and pytorch format alongside the joblib version of the trained classifier. The test folder contains the metrics for the final test on all the data, performed after the model has been trained on both train and validation.
Evaluation¶
The same datamodule specified before can be used for inference by setting the phase parameter to test.
Experiment¶
The default experiment config is found under configs/experiment/base/classification/sklearn_classification_test.yaml.
# @package _global_
defaults:
  - override /transforms: default_resize
  - override /task: sklearn_classification_test
  - override /trainer: sklearn_classification
  - override /datamodule: base/sklearn_classification

core:
  tag: run
  name: sklearn-classification-test

datamodule:
  num_workers: 8
  batch_size: 32
  phase: test
An actual configuration file based on the above could be this one (suppose it's saved under configs/experiment/custom_experiment/sklearn_classification_test.yaml):
# @package _global_
defaults:
  - base/classification/sklearn_classification_test
  - _self_

core:
  name: experiment-test-name

datamodule:
  data_path: path_to_test_dataset
  batch_size: 64

task:
  device: cuda:0
  gradcam: true
  output:
    folder: classification_test_experiment
    report: true
    example: true
  model_path: ???
This will test the model trained in the previous experiment on the given dataset. The experiment results will be saved under the classification_test_experiment folder. If gradcam is set to true, the original images and the gradcam outputs will be saved when generate_report() is called.
model_path must point to a model file: it can be either a .pt/.pth file or a backbone_config .yaml file.
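For example, model_path could point to the model exported by the training run; the file name below is only a placeholder, use whatever file is actually present in your deployment_model folder:
task:
  model_path: path/to/training_run/deployment_model/model.pt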
Run¶
As above, assuming that you have created a virtual environment and installed the quadra library, you can run the experiment with the following command:
quadra experiment=custom_experiment/sklearn_classification_test