Data configuration#
This page details the configuration values related to the data pipeline.
The configuration groups or values in data are:
dataset: for loading the dataset and transforming it to the appropriate format;
distribution: for specifying the data distribution when partitioning;
others: for the other parameters used to split the data among the clients and the server;
seed: the seed for the random number generator in the data pipeline, which is essential for experiment reproducibility;
processing: optional, for performing data processing;
cleaning: optional, for performing data cleaning;
loading: optional, for dynamically loading the split datasets that were saved during previous experiments;
saving: optional, for saving data at one or more specific steps of the data pipeline.
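As a rough orientation, the groups listed above could be laid out as in the following sketch. The nesting under a top-level data key and the seed value are assumptions made for illustration; the actual file layout used by ARMLET may differ.

```yaml
# Illustrative skeleton only: the nesting and the seed value are assumptions.
data:
  dataset:        # how the dataset is loaded (see "Dataset configuration")
  distribution:   # how the data is partitioned (see "Data distribution")
  others:         # remaining splitting parameters (see "Others fields")
  seed: 42        # any fixed integer makes the data pipeline reproducible
  # Optional groups, left out when not needed:
  # processing:
  # cleaning:
  # loading:
  # saving:
```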
Dataset configuration#
The dataset field has two mandatory config values:
dataset_name: the name of the dataset, used by the logger;
_target_: the function used to load the dataset.
Other config values can be specified according to the chosen loading function.
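A minimal sketch of a dataset config could look like the following. The function path my_package.datasets.load_my_dataset and the data_dir key are hypothetical and not part of ARMLET; the sketch also assumes that extra keys are handed to the loading function as arguments.

```yaml
# Hypothetical dataset config: only dataset_name and _target_ are mandatory.
dataset_name: my_dataset                        # name reported by the logger
_target_: my_package.datasets.load_my_dataset   # hypothetical loading function
data_dir: ./data                                # hypothetical extra value for the loading function
```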
Important
By default, ARMLET provides multiple dataset loading functions in armlet.data.datasets and pre-defined dataset YAML config files in the ARMLET_DIR/configs/data/dataset folder.
Data distribution#
The distribution field is similar to the one provided by Fluke, but is adapted to the data format required by ARMLET.
The only mandatory config value is _target_, which is the distribution function used during the data splitting step.
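In its smallest form, a distribution config therefore contains a single key. The function path below is a hypothetical placeholder, not one of the functions shipped with ARMLET.

```yaml
# Hypothetical distribution config with only the mandatory key.
_target_: my_package.splitters.my_distribution   # hypothetical distribution function
```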
Important
By default, ARMLET provides some distribution functions in armlet.data.splitter and pre-defined distribution YAML config files in the ARMLET_DIR/configs/data/distribution folder.
Others fields#
The others field includes all the other config values required to split the data.
Most of them match the parameters of the others field in Fluke’s data configuration:
sampling_perc: sampling percentage when loading the dataset during a training iteration;
client_split: percentage of the client’s data that will be used as its test set;
client_val_split: percentage of the client’s test set that will be used as a validation set;
keep_test: specifies whether to keep the test set as provided by the dataset loading function;
server_test: specifies whether the server has a test set;
server_test_union: specifies whether the server test set is the union of all client test sets (this also applies to the server validation set). It requires server_test and keep_test to be set to false;
server_split: size of the server split with respect to the entire dataset (only used when keep_test is set to false and server_test is set to true);
server_val_split: percentage of the server’s data that will be used as a server validation set;
uniform_test: specifies whether to use a client-side IID test set distribution regardless of the training data distribution.
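The values above could be combined as in the sketch below. All numbers are illustrative, and expressing percentages as fractions between 0 and 1 is an assumption borrowed from Fluke's convention rather than something stated on this page.

```yaml
# Illustrative values only; percentages are assumed to be fractions in [0, 1].
sampling_perc: 1.0        # use the whole dataset
client_split: 0.2         # 20% of each client's data becomes its test set
client_val_split: 0.5     # half of each client's test set becomes a validation set
keep_test: true           # keep the test set returned by the loading function
server_test: true         # the server has a test set
server_test_union: false  # server test set is not built from the client test sets
server_split: 0.0         # not used here, because keep_test is true
server_val_split: 0.0     # no server validation set in this sketch
uniform_test: false       # client test sets follow the training distribution
```

In this sketch server_test_union stays false, so its requirement on server_test and keep_test does not come into play.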
See also
For more information about sampling_perc, keep_test, server_test, client_split, server_split, and uniform_test, see the Data configuration page of the Fluke documentation.
Data processing configuration#
[TODO]
Data cleaning configuration#
[TODO]
Data loading configuration#
[TODO]
Data saving configuration#
[TODO]