Curation & Meta-data
Datasets curation¶
The curation process operated by tcbench pursues the following objectives.

Curation objectives

- Transform all datasets into a common format, apply cleaning if/where needed, and create a reference set of data splits for modeling.
- All data should reflect per-flow records where each flow is associated with a packet time series input.

To do so, tcbench applies an opinionated preprocessing starting from the original raw version of the data (i.e., the data offered by the authors of each dataset).
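The per-flow representation described above can be sketched as follows. This is illustrative only: the column names and values are hypothetical, not tcbench's actual schema.

```python
# Illustrative per-flow records: one row per flow, where the model input
# is a set of packet-level time series (sizes, directions, inter-arrival
# times) and the label is the generating application.
# NOTE: column names are hypothetical, not tcbench's actual schema.
import pandas as pd

flows = pd.DataFrame({
    "flow_id": [0, 1],
    "app": ["youtube", "whatsapp"],                # classification label
    "pkts_size": [[1500, 64, 1500], [120, 80]],    # bytes per packet
    "pkts_dir": [[0, 1, 0], [0, 1]],               # 0=outgoing, 1=incoming
    "pkts_iat": [[0.0, 0.01, 0.02], [0.0, 0.5]],   # inter-arrival times (s)
})

# Each flow carries one label and variable-length packet series
print(flows.loc[0, "pkts_size"])  # [1500, 64, 1500]
```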
The raw data go through the following steps:

- Install: tcbench can fetch the raw data from the Internet or can take as input a folder where the raw data was already downloaded. This data is then "installed" by unpacking it into the python environment where tcbench is installed.
- Preprocess: Once unpacked, the raw data is converted into monolithic parquet files. Such files are then left untouched, i.e., they simply serve as a re-organization of the original data (with a per-flow view where needed) with a homogeneous format across datasets.
- Filter and split: The monolithic parquet files are first filtered (e.g., removing very short flows or flows related to invalid IP addresses) and then used to generate train/validation/test splits.

We reiterate that all steps are a necessity to enable ML/DL modeling. Yet, they are opinionated processes, so tcbench embodies just one option to perform them.
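As a rough sketch of the last step, with made-up filtering thresholds and split ratios (the actual rules applied by tcbench differ per dataset):

```python
# Hypothetical "filter and split" step: drop flows with too few packets,
# shuffle, and carve out 80/10/10 train/validation/test splits.
# Threshold and ratios are illustrative, not tcbench's actual settings.
import pandas as pd

MIN_PKTS = 10
flows = pd.DataFrame({
    "flow_id": range(100),
    "n_pkts": [5] * 20 + [50] * 80,  # 20 short flows to be filtered out
})

kept = flows[flows["n_pkts"] >= MIN_PKTS].sample(frac=1.0, random_state=12345)

n = len(kept)
train = kept.iloc[: int(n * 0.8)]
val = kept.iloc[int(n * 0.8) : int(n * 0.9)]
test = kept.iloc[int(n * 0.9) :]
print(len(train), len(val), len(test))  # 64 8 8
```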
Datasets Meta-data¶
As part of the curation process, tcbench can easily show the meta-data related to the datasets.
For instance, you can see the datasets summary by running the datasets info subcommand.
Output
Datasets
├── ucdavis-icdm19
│ └── 🚩 classes: 5
│ 🔗 paper_url: https://arxiv.org/pdf/1812.09761.pdf
│ 🔗 website: https://github.com/shrezaei/Semi-supervised-Learning-QUIC-
│ 🔗 data: https://drive.google.com/drive/folders/1Pvev0hJ82usPh6dWDlz7Lv8L6h3JpWhE
│ 🔗 curated data: https://figshare.com/ndownloader/files/42437043
│ ➕ curated data MD5: 9828cce0c3a092ff19ed77f9e07f317c
│ 📁 installed: None
│ 📁 preprocessed: None
│ 📁 data splits: None
├── mirage19
│ └── 🚩 classes: 20
│ 🔗 paper_url: http://wpage.unina.it/antonio.montieri/pubs/MIRAGE_ICCCS_2019.pdf
│ 🔗 website: https://traffic.comics.unina.it/mirage/mirage-2019.html
│ 🔗 data: https://traffic.comics.unina.it/mirage/MIRAGE/MIRAGE-2019_traffic_dataset_downloadable_v2.tar.gz
│ 📁 installed: None
│ 📁 preprocessed: None
│ 📁 data splits: None
├── mirage22
│ └── 🚩 classes: 9
│ 🔗 paper_url: http://wpage.unina.it/antonio.montieri/pubs/_C__IEEE_CAMAD_2021___Traffic_Classification_Covid_app.pdf
│ 🔗 website: https://traffic.comics.unina.it/mirage/mirage-covid-ccma-2022.html
│ 🔗 data: https://traffic.comics.unina.it/mirage/MIRAGE/MIRAGE-COVID-CCMA-2022.zip
│ 📁 installed: None
│ 📁 preprocessed: None
│ 📁 data splits: None
└── utmobilenet21
└── 🚩 classes: 17
🔗 paper_url: https://ieeexplore.ieee.org/abstract/document/9490678/
🔗 website: https://github.com/YuqiangHeng/UTMobileNetTraffic2021
🔗 data: https://utexas.app.box.com/s/okrimcsz1mn9ec4j667kbb00d9gt16ii
🔗 curated data: https://figshare.com/ndownloader/files/42436353
➕ curated data MD5: e1fdcffa41a0f01d63eaf57c198485ce
📁 installed: None
📁 preprocessed: None
📁 data splits: None
Notice the three properties installed, preprocessed and data splits. These show the absolute paths where the data is stored. As mentioned before, the data is installed inside the python environment where tcbench is installed.

The example above refers to an environment with no dataset installed yet. The following instead shows a case where all datasets are installed.
Output
Datasets
├── ucdavis-icdm19
│ └── 🚩 classes: 5
│ 🔗 paper_url: https://arxiv.org/pdf/1812.09761.pdf
│ 🔗 website: https://github.com/shrezaei/Semi-supervised-Learning-QUIC-
│ 🔗 data: https://drive.google.com/drive/folders/1Pvev0hJ82usPh6dWDlz7Lv8L6h3JpWhE
│ 🔗 curated data: https://figshare.com/ndownloader/files/42437043
│ ➕ curated data MD5: 9828cce0c3a092ff19ed77f9e07f317c
│ 📁 installed: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/raw
│ 📁 preprocessed: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed
│       📁 data splits: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23
├── mirage19
│ └── 🚩 classes: 20
│ 🔗 paper_url: http://wpage.unina.it/antonio.montieri/pubs/MIRAGE_ICCCS_2019.pdf
│ 🔗 website: https://traffic.comics.unina.it/mirage/mirage-2019.html
│ 🔗 data: https://traffic.comics.unina.it/mirage/MIRAGE/MIRAGE-2019_traffic_dataset_downloadable_v2.tar.gz
│ 📁 installed: None
│ 📁 preprocessed: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19/preprocessed
│ 📁 data splits: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19/preprocessed/imc23
├── mirage22
│ └── 🚩 classes: 9
│ 🔗 paper_url: http://wpage.unina.it/antonio.montieri/pubs/_C__IEEE_CAMAD_2021___Traffic_Classification_Covid_app.pdf
│ 🔗 website: https://traffic.comics.unina.it/mirage/mirage-covid-ccma-2022.html
│ 🔗 data: https://traffic.comics.unina.it/mirage/MIRAGE/MIRAGE-COVID-CCMA-2022.zip
│ 📁 installed: None
│ 📁 preprocessed: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage22/preprocessed
│ 📁 data splits: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage22/preprocessed/imc23
└── utmobilenet21
└── 🚩 classes: 17
🔗 paper_url: https://ieeexplore.ieee.org/abstract/document/9490678/
🔗 website: https://github.com/YuqiangHeng/UTMobileNetTraffic2021
🔗 data: https://utexas.app.box.com/s/okrimcsz1mn9ec4j667kbb00d9gt16ii
🔗 curated data: https://figshare.com/ndownloader/files/42436353
➕ curated data MD5: e1fdcffa41a0f01d63eaf57c198485ce
📁 installed: None
📁 preprocessed: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21/preprocessed
📁 data splits: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21/preprocessed/imc23
Notice that

- installed may or may not be None, depending on whether the dataset was installed from the raw data or via tcbench artifacts (see XXX).
- preprocessed and data splits are always reported unless the dataset was not installed.
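The paths above sit under the tcbench package directory of the active python environment. The general mechanism for resolving such a path can be sketched with the standard library (demonstrated with the stdlib json package as a stand-in, since tcbench may not be installed here):

```python
# Resolve the directory a python package is installed in. Per the output
# above, tcbench stores dataset files under its own package tree, e.g.
# <package_root>/libtcdatasets/datasets/<dataset-name>/
import importlib.util
from pathlib import Path

def package_root(name: str) -> Path:
    """Return the directory where a package is installed."""
    spec = importlib.util.find_spec(name)
    return Path(spec.origin).parent

# Demonstrated with a stdlib package; for tcbench the datasets folder
# would be something like package_root("tcbench") / "libtcdatasets" / "datasets"
root = package_root("json")
print(root.name)  # json
```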
Dataset files¶
As mentioned, the dataset files are installed within the python environment where tcbench is installed.

Generally speaking, the tcbench API is designed to ease the process of loading the data without the need to master the internal organization of a dataset. Check the XXX tutorial.
In case of need, you can inspect the internal organization of a dataset using the datasets lsfiles subcommand. For instance, for the ucdavis-icdm19 dataset:
tcbench datasets lsfiles --name ucdavis-icdm19
Output
Datasets
└── ucdavis-icdm19
└── 📁 preprocessed/
├── ucdavis-icdm19.parquet
└── 📁 imc23/
├── test_split_human.parquet
├── test_split_script.parquet
├── train_split_0.parquet
├── train_split_1.parquet
├── train_split_2.parquet
├── train_split_3.parquet
└── train_split_4.parquet
While the info subcommand shows the absolute root path where datasets are installed, lsfiles shows the relative paths and internal hierarchical structure of how the data is locally installed.

In the sample above

- the raw/ folder holding the original raw data of the dataset is never displayed.
- preprocessed/ucdavis-icdm19.parquet corresponds to the monolithic parquet file obtained by reformatting (but not curating) the raw data.
- the files under preprocessed/imc23/ correspond to the curation applied to the dataset based on our IMC23 paper.
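To make the layout concrete, the hierarchy listed above can be rebuilt and traversed with the standard library (empty placeholder files in a temporary directory stand in for the real parquet files):

```python
# Recreate the ucdavis-icdm19 preprocessed/ hierarchy shown by lsfiles
# and walk it with pathlib; file names are copied from the output above,
# but the files themselves are empty placeholders.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "ucdavis-icdm19" / "preprocessed"
(root / "imc23").mkdir(parents=True)

(root / "ucdavis-icdm19.parquet").touch()  # monolithic parquet
splits = ["test_split_human", "test_split_script"] + [
    f"train_split_{i}" for i in range(5)
]
for name in splits:  # curated IMC23 splits
    (root / "imc23" / f"{name}.parquet").touch()

files = sorted(p.relative_to(root) for p in root.rglob("*.parquet"))
for f in files:
    print(f)
```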