Datasets
TCBench supports the following public traffic classification datasets.
Table: Datasets properties
Name | Applications | Links | License | Our curation |
---|---|---|---|---|
ucdavis-icdm19 | 5 | | | |
mirage19 | 20 | | NC-ND | - |
mirage22 | 9 | | NC-ND | - |
utmobilenet21 | 17 | | GPLv3 | |
At a glance, these datasets:
- Are collections of either CSV or JSON files.
- Report either individual packet-level information or per-flow time series and metrics.
- May be organized in subfolders, namely partitions, that reflect the related measurement campaign (see ucdavis-icdm19, utmobilenet21).
- May have file names that carry semantic information.
- May require preprocessing to remove "background" noise, i.e., traffic unrelated to a target application (see mirage19 and mirage22).
- Do not have reference train/validation/test splits.
In other words, these datasets need to be curated to be used.
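As an illustration of the raw layout described above, the following minimal sketch gathers records from a folder of CSV and JSON files and tags each record with the partition (subfolder) and file it came from. It is not part of tcbench: the function name `load_records`, the assumption that each first-level subfolder is one partition, and the assumption that JSON files contain a list of records are all hypothetical, and the real datasets may be organized differently.

```python
import csv
import json
from pathlib import Path


def load_records(root):
    """Collect rows from CSV/JSON files under `root`.

    Assumptions (hypothetical layout, not the tcbench loader):
    - each first-level subfolder of `root` is one partition;
    - CSV files hold one record per row, with a header line;
    - JSON files hold a list of record objects.
    """
    records = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in {".csv", ".json"}:
            continue
        # The first path component below root is used as the partition name;
        # files sitting directly in root get no partition.
        rel = path.relative_to(root)
        partition = rel.parts[0] if len(rel.parts) > 1 else None
        if path.suffix == ".csv":
            with path.open(newline="") as fh:
                rows = list(csv.DictReader(fh))
        else:
            with path.open() as fh:
                rows = json.load(fh)
        for row in rows:
            # Keep the file name, since it may carry semantic information.
            records.append({**row, "partition": partition, "source_file": path.name})
    return records
```

Keeping the partition and file name next to each record preserves the information that the folder structure and file names may encode, so later curation steps can still use it.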
Important
The integration of these datasets in tcbench breaks neither the original licensing of the data nor their ownership. Rather, the integration aims at easing access to these datasets. We thus encourage researchers and practitioners interested in using these datasets to cite the original publications (see links in the table above).
Terminology
When describing datasets and related processing we use the following conventions:
- A partition is a set of samples pre-defined by the authors of the dataset. For instance, a partition can relate to a specific set of samples to use for training/test (see ucdavis-icdm19).
- A split is a set of indexes of the samples to be used for train/validation/test.
- An unfiltered dataset corresponds to a monolithic parquet file containing the original raw data of a dataset (no filtering is applied).
- A curated dataset is generated by processing the unfiltered parquet to clean noise, remove small flows, etc. Each dataset has slightly different curation rules.
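The relation between an unfiltered dataset, a curated dataset and a split can be sketched as follows. This is a simplified illustration, not the curation actually applied by tcbench: flows are kept in memory as a list of dictionaries rather than read from parquet, and the field names (`app`, `pkts_size`), the `background` label, the 10-packet threshold and the 80/10/10 proportions are all assumptions chosen for the example.

```python
import random


def curate(flows, min_pkts=10, background_label="background"):
    """Drop background flows and flows with fewer than `min_pkts` packets.

    The threshold and label are hypothetical; each dataset has its own rules.
    """
    return [
        flow
        for flow in flows
        if flow["app"] != background_label and len(flow["pkts_size"]) >= min_pkts
    ]


def make_split(n_samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Return disjoint lists of sample indexes for train/val/test."""
    indexes = list(range(n_samples))
    # A seeded shuffle makes the split reproducible.
    random.Random(seed).shuffle(indexes)
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    return {
        "train": indexes[:n_train],
        "val": indexes[n_train:n_train + n_val],
        "test": indexes[n_train + n_val:],
    }


# Toy "unfiltered" data: one background flow and one flow that is too short.
unfiltered = [
    {"app": "youtube", "pkts_size": [100] * 20},
    {"app": "background", "pkts_size": [60] * 50},
    {"app": "zoom", "pkts_size": [80] * 3},
    {"app": "zoom", "pkts_size": [80] * 15},
]
curated = curate(unfiltered)
split = make_split(len(curated))
```

Note that the split stores indexes into the curated dataset rather than copies of the samples, which matches the definition of a split given above.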