Data configuration#

This page details the configuration values related to the data pipeline. The configuration groups and values in data are:

  • seed: the seed for the random number generator in the data pipeline, which is essential for the experiment reproducibility;

  • dataset: for loading the dataset and transforming it to the appropriate format;

  • splitter: for splitting the data among the clients and the server (including the data distribution when partitioning);

  • processing: for performing data processing;

  • cleaning: optional, for performing data cleaning;

  • loading: optional, for loading static data (split, cleaned, tensors) that were saved during previous experiments instead of loading the raw dataset;

  • saving: optional, for saving data at one or multiple specific steps of the data pipeline;

  • others: for the other config values that do not fall into the previous categories.
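
As a rough sketch of how these groups fit together, the fragment below lays out the top level of a data config. It assumes that all groups end up under a single data key once the config is composed. Every value shown is a placeholder, not an ARMLET default.

```yaml
# Sketch only: assumes all groups are composed under one `data` key.
# Values are placeholders, not ARMLET defaults.
data:
  seed: 42            # seed of the data pipeline's random number generator
  dataset: {}         # dataset_name, _target_, and loader arguments
  splitter: {}        # distribution group and split sizes
  processing: {}      # ordered processing steps
  cleaning: {}        # optional cleaning steps
  loading: {}         # optional: reload data saved by a previous experiment
  saving: {}          # optional: save data at chosen pipeline steps
  others:
    sampling_perc: 1.0  # placeholder value
```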

See also

See the tutorial Understanding the data pipeline in armlet to learn more about how the framework uses these configurations to run the data pipeline.

Dataset configuration#

The dataset field requires two mandatory config values:

  • dataset_name: the name of the dataset, used by the logger;

  • _target_: the function used to load the dataset;

  • and additional config values depending on the chosen loading function (these are dynamically passed to the function as arguments).
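
For illustration, a dataset group could look like the sketch below. The loading function my_project.data.load_my_dataset and its path argument are hypothetical names invented for this example, and _target_ is assumed to be a dotted import path to the function.

```yaml
dataset:
  dataset_name: my_dataset                   # name used by the logger
  _target_: my_project.data.load_my_dataset  # hypothetical loading function
  path: ./data/my_dataset                    # hypothetical argument passed to the loader
```

Any key other than dataset_name and _target_ would be forwarded to the loading function, so the available keys depend on that function's signature.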

Tip

By default, ARMLET provides multiple dataset loading functions in armlet.data.datasets and pre-defined dataset YAML config files in the ARMLET_DIR/configs/data/dataset folder.

Data splitter configuration#

The splitter field includes all the config values required to split the data. Most of them mirror the parameters of Fluke’s data configuration (its distribution field and its other data fields):

  • distribution: config group for specifying the data distribution when partitioning (similar to the one provided by Fluke, but adapted to the data format required by ARMLET). The only mandatory config value in this group is _target_, which is the distribution function used during the data splitting step;

  • client_split: percentage of the client’s data that will be used as its test set;

  • client_val_split: percentage of the client’s test set that will be used as a validation set;

  • keep_test: specifies whether you want to keep the test set as provided by the dataset loading function;

  • server_test: specifies whether the server has a test set;

  • server_test_union: specifies whether the server test set is the union of all client test sets (this also applies to the server validation set). It requires server_test and keep_test to be set to false;

  • server_split: size of the server split with respect to the entire dataset (only used when keep_test is set to false and server_test is set to true);

  • server_val_split: percentage of the server’s data that will be used as a server validation set;

  • uniform_test: specifies whether to use a client-side IID test set distribution regardless of the training data distribution.
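
The fragment below sketches one possible combination of these values: the test set shipped with the dataset is discarded (keep_test is false) and the server receives its own split (server_test is true), so server_split is taken into account. The distribution function my_project.data.my_distribution is a hypothetical name, and the split sizes are assumed to be fractions between 0 and 1, as in Fluke.

```yaml
splitter:
  distribution:
    _target_: my_project.data.my_distribution  # hypothetical distribution function
  client_split: 0.2         # share of each client's data used as its test set
  client_val_split: 0.5     # share of each client's test set used for validation
  keep_test: false          # do not keep the test set provided by the loader
  server_test: true         # the server has a test set
  server_test_union: false  # server test set is not the union of client test sets
  server_split: 0.1         # size of the server split w.r.t. the entire dataset
  server_val_split: 0.2     # share of the server's data used for validation
  uniform_test: false       # client test sets are not forced to be IID
```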

Tip

By default, ARMLET provides some distribution functions in armlet.data.splitter and pre-defined distribution YAML config files in the ARMLET_DIR/configs/data/splitter/distribution folder.

See also

For more information about keep_test, server_test, client_split, server_split, and uniform_test, see Data configuration in the Fluke documentation.

Data processing configuration#

The processing field consists of several config groups, each of which defines a processing step to be applied to the data. Each of these config groups requires the following config values:

  • _target_: the function corresponding to the data processing to perform;

  • _apply_directly_to_subdata_: boolean that determines whether the processing step is applied directly to the subdata (local training data for each client, server test data, etc.). This can be useful when the processing must access all data before processing the subdata, such as for feature encoding or normalization;

  • and additional config values depending on the chosen data processing function (these are dynamically passed to the function as arguments).

Note that the names of the config groups only serve to organize the configs and have no impact on the processing steps.

Important

The order of the config groups in the processing field is important as ARMLET will perform the different steps in this order.
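
A two-step processing field might be sketched as follows. The group names (encode_features, normalize), both target functions, and the columns argument are hypothetical. The values chosen for _apply_directly_to_subdata_ are arbitrary and should be set per step according to the description above.

```yaml
processing:
  encode_features:                                      # group name is only a label
    _target_: my_project.processing.encode_categorical  # hypothetical function
    _apply_directly_to_subdata_: false                  # arbitrary value for this sketch
    columns: [color, shape]                             # hypothetical function argument
  normalize:                                            # listed second, so it runs second
    _target_: my_project.processing.standardize         # hypothetical function
    _apply_directly_to_subdata_: false                  # arbitrary value for this sketch
```

Swapping the two groups would change the order in which the steps run, whereas renaming them would change nothing.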

Tip

ARMLET provides multiple data processing functions in armlet.data.processing and pre-defined YAML config files in the ARMLET_DIR/configs/data/processing folder.

Data cleaning configuration#

The cleaning field is optional and consists of several config groups, each of which defines a cleaning step to be applied to the data. Each of these config groups (the name being the type of corrupted data/error to be handled) requires the following config values:

  • _target_: the class corresponding to the data cleaning step to perform;

  • and additional config values depending on the chosen data cleaning class (these are dynamically passed to it as arguments).

For the moment, only armlet.data.cleaning.missing_values.RemoveMV is implemented in ARMLET. It removes any sample containing missing values (i.e., NaN). Feel free to implement your own methods for handling your type of corrupted data/error (e.g., missing values, label errors, outliers).

Note that the cleaning field also has a config value, name, which is only used to name the combination of cleaning steps (for logging).
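
A minimal cleaning field using the class mentioned above could be written as in this sketch. The group name missing_values and the value of name are illustrative, and the sketch assumes RemoveMV needs no additional arguments.

```yaml
cleaning:
  name: remove_mv   # label for this combination of cleaning steps (logging only)
  missing_values:   # group named after the type of error handled
    _target_: armlet.data.cleaning.missing_values.RemoveMV
```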

Data loading configuration#

The loading field can be used to load static data (split, cleaned, tensors) that were saved during previous experiments (using the saving module). The config values are:

  • static: boolean for activating the static data loading;

  • load_dir: directory from which the data will be loaded.
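
A hedged example of the loading field is given below. The path is a placeholder and should point to wherever a previous experiment saved its data.

```yaml
loading:
  static: true                  # activate static data loading
  load_dir: path/to/saved/data  # placeholder path to previously saved data
```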

Data saving configuration#

The saving field can be used to save the data at one or multiple specific steps of the data pipeline. The config values are:

  • save_dir: directory where the data will be saved. A subfolder will be created, with a name that depends on the type of saving (e.g., “after_processing” or “no_cleaning”);

  • save_data_before_cleaning: boolean for saving data before cleaning (set to False by default);

  • save_data_after_cleaning: boolean for saving data after cleaning (set to False by default);

  • save_data_after_processing: boolean for saving data after processing (set to False by default). This can be particularly useful when dealing with images, as it will save tensors instead of raw images.
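
For example, saving the data only after the processing step could be configured as sketched here. The directory is a placeholder.

```yaml
saving:
  save_dir: path/to/saved/data      # placeholder; a subfolder is created inside it
  save_data_before_cleaning: false
  save_data_after_cleaning: false
  save_data_after_processing: true  # keep the processed data (e.g., tensors)
```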

Others fields#

The others field includes all the other parameters that do not fall into the previous categories.

  • sampling_perc: sampling percentage when loading the dataset during a training iteration.

See also

For more information about sampling_perc, see Data configuration in the Fluke documentation.