Datasets

TCBench supports the following public traffic classification datasets:

Table: Dataset properties

Name	Applications	License	Our curation
`ucdavis-icdm19`	5
`mirage19`	20	NC-ND	-
`mirage22`	9	NC-ND	-
`utmobilenet21`	17	GPLv3

At a glance, these datasets:

Are collections of either CSV or JSON files.
Report either individual packet-level information or per-flow time series and metrics.
May be organized in subfolders, namely partitions, that reflect the related measurement campaign (see ucdavis-icdm19, utmobilenet21).
May have file names that carry semantics.
May require preprocessing to remove "background" noise, i.e., traffic unrelated to a target application (see mirage19 and mirage22).
Do not have reference train/validation/test splits.

In other words, these datasets need to be curated before they can be used.
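
As an illustration of how subfolders and file names can carry meaning, the following minimal sketch derives a partition and a label from a file's relative path. The layout `<partition>/<label>_<suffix>.<ext>` and the function `describe_file` are hypothetical and chosen only for this example; they do not reflect the actual layout of any dataset listed above, nor any tcbench API.

```python
from pathlib import PurePosixPath


def describe_file(relative_path: str) -> dict:
    """Derive metadata from a file's location.

    Assumes a hypothetical layout <partition>/<label>_<suffix>.<ext>;
    real datasets each follow their own conventions.
    """
    path = PurePosixPath(relative_path)
    # The first subfolder (if any) is treated as the partition name.
    partition = path.parts[0] if len(path.parts) > 1 else None
    # The part of the file name before the first underscore is the label.
    label = path.stem.split("_")[0]
    # The extension tells us whether the file is CSV or JSON.
    file_format = path.suffix.lstrip(".")
    return {"partition": partition, "label": label, "format": file_format}


info_nested = describe_file("campaign-1/app-a_0001.csv")
info_flat = describe_file("app-b.json")
```

Because each dataset encodes this information differently, a parser of this kind would have to be written once per dataset, which is part of what curation takes care of.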

Important

The integration of these datasets in tcbench neither alters the original licensing of the data nor affects their ownership. Rather, the integration aims to ease access to these datasets. We thus encourage researchers and practitioners interested in using these datasets to cite the original publications (see links in the table above).

Terminology

When describing datasets and related processing we use the following conventions:

A partition is a set of samples pre-defined by the authors of the dataset. For instance, a partition can relate to a specific set of samples to use for training/test (see ucdavis-icdm19).
A split is a set of indexes of samples that need to be used for train/validation/test.
An unfiltered dataset corresponds to a monolithic parquet file containing the original raw data of a dataset (no filtering is applied).
A curated dataset is generated by processing the unfiltered parquet file to clean noise, remove small flows, etc. Each dataset has slightly different curation rules.
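
The terms above can be made concrete with a small, library-agnostic sketch. It is not the tcbench implementation: the `Flow` record, the `"background"` label, the `min_pkts` threshold, and the 80/10/10 proportions are all assumptions made for illustration. The sketch treats an in-memory list as the unfiltered dataset, derives a curated dataset from it, and builds a split as lists of sample indexes.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Flow:
    app: str              # application label ("background" = unrelated traffic, assumed)
    partition: str        # partition name pre-defined by the dataset authors
    pkts_size: List[int]  # per-packet sizes, i.e. a per-flow time series


def curate(unfiltered: List[Flow], min_pkts: int = 10) -> List[Flow]:
    """Drop background traffic and flows with fewer than min_pkts packets."""
    return [
        flow
        for flow in unfiltered
        if flow.app != "background" and len(flow.pkts_size) >= min_pkts
    ]


def make_split(n_samples: int, train_frac: float = 0.8,
               val_frac: float = 0.1, seed: int = 0) -> Dict[str, List[int]]:
    """Return a split: disjoint lists of sample indexes for train/val/test."""
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)  # seeded, hence reproducible
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    return {
        "train": indexes[:n_train],
        "val": indexes[n_train:n_train + n_val],
        "test": indexes[n_train + n_val:],
    }


# Unfiltered data: 10 usable flows, 1 background flow, 1 flow that is too short.
unfiltered = [Flow("app-a", "campaign-1", [100] * 20) for _ in range(10)]
unfiltered.append(Flow("background", "campaign-1", [60] * 20))
unfiltered.append(Flow("app-a", "campaign-1", [100] * 3))

curated = curate(unfiltered)
split = make_split(len(curated))
```

Note that the split stores indexes into the curated dataset rather than copies of the samples, so several splits (for example, with different seeds) can coexist over the same curated data.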