
Tutorial: dataset loading APIs

Let's import tcbench and alias it as tcb.

The module automatically imports a few functions and constants.

import tcbench as tcb

The .get_datasets_root_folder() method

You can first discover the path where the datasets are installed using .get_datasets_root_folder()

root_folder = tcb.get_datasets_root_folder()
root_folder
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets')

The function returns a pathlib.Path object, so you can use it to navigate the subfolder structure.

For instance:

list(root_folder.iterdir())
[PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21.BACKUP'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage22'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19')]

As shown in the output, each dataset is mapped to a separate folder named after the dataset itself. This means that, again taking advantage of pathlib, you can compose paths from strings.

For instance:

list((root_folder / 'ucdavis-icdm19').iterdir())
[PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/raw')]

The .DATASETS enum

A more polished way to reference datasets is via the tcbench.DATASETS attribute, which corresponds to a Python enumeration object.

type(tcb.DATASETS), list(tcb.DATASETS)
(enum.EnumMeta,
[<DATASETS.UCDAVISICDM19: 'ucdavis-icdm19'>,
 <DATASETS.UTMOBILENET21: 'utmobilenet21'>,
 <DATASETS.MIRAGE19: 'mirage19'>,
 <DATASETS.MIRAGE22: 'mirage22'>])
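
Since DATASETS is a standard Python enum, you can also look up a member from its string value using plain enum call syntax (this is stock enum behavior, not a tcbench-specific helper):

# convert a string into the corresponding enum member
dataset = tcb.DATASETS('ucdavis-icdm19')

# ...and read the string value back from the member
dataset.value
'ucdavis-icdm19'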

The .get_dataset_folder() method

For instance, you can bypass composing a dataset folder path and directly call .get_dataset_folder() to obtain the folder of the dataset you are looking for.

dataset_folder = tcb.get_dataset_folder(tcb.DATASETS.UCDAVISICDM19)
dataset_folder
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19')

Listing files

Via pathlib you can easily discover all parquet files composing a dataset

list(dataset_folder.rglob('*.parquet'))
[PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/ucdavis-icdm19.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_0.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_1.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_human.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_4.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_3.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_script.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_2.parquet')]
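
Since these are ordinary parquet files, you can also open any of them directly with pandas, bypassing tcbench entirely (a minimal sketch reusing the files discovered above):

import pandas as pd

# pick one of the parquet files listed above and load it with pandas directly
files = sorted(dataset_folder.rglob('*.parquet'))
df = pd.read_parquet(files[0])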

But you can also programmatically obtain the output of the datasets lsparquet CLI subcommand using get_rich_tree_parquet_files()

from tcbench.libtcdatasets.datasets_utils import get_rich_tree_parquet_files
get_rich_tree_parquet_files(tcb.DATASETS.UCDAVISICDM19)
Datasets
└── ucdavis-icdm19
    └── 📁 preprocessed/
        ├── ucdavis-icdm19.parquet
        ├── LICENSE
        └── 📁 imc23/
            ├── test_split_human.parquet
            ├── test_split_script.parquet
            ├── train_split_0.parquet
            ├── train_split_1.parquet
            ├── train_split_2.parquet
            ├── train_split_3.parquet
            └── train_split_4.parquet

The .load_parquet() method

Finally, the generic .load_parquet() can be used to load one of the parquet files.

For instance, the following loads the unfiltered monolithic file of the ucdavis-icdm19 dataset

df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19)
df.head(2)
row_id app flow_id partition num_pkts duration bytes unixtime timetofirst pkts_size pkts_dir pkts_iat
0 0 google-doc GoogleDoc-100 pretraining 2925 116.348 816029 [1527993495.652867, 1527993495.685678, 1527993... [0.0, 0.0328109, 0.261392, 0.262656, 0.263943,... [354, 87, 323, 1412, 1412, 107, 1412, 180, 141... [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, ... [0.0, 0.0328109, 0.2285811, 0.0012639999999999...
1 1 google-doc GoogleDoc-1000 pretraining 2813 116.592 794628 [1527987720.40456, 1527987720.422811, 15279877... [0.0, 0.0182509, 0.645106, 0.646344, 0.647689,... [295, 87, 301, 1412, 1412, 1412, 180, 113, 141... [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ... [0.0, 0.0182509, 0.6268551, 0.0012380000000000...
df.groupby(['partition', 'app'])['app'].value_counts()
partition                    app
pretraining                  google-doc       1221
                             google-drive     1634
                             google-music      592
                             google-search    1915
                             youtube          1077
retraining-human-triggered   google-doc         15
                             google-drive       18
                             google-music       15
                             google-search      15
                             youtube            20
retraining-script-triggered  google-doc         30
                             google-drive       30
                             google-music       30
                             google-search      30
                             youtube            30
Name: count, dtype: int64
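
The same per-partition breakdown can be obtained as a compact table with a plain pandas crosstab (just an equivalent view, not a tcbench API):

import pandas as pd

# rows = partition, columns = app, cells = number of flows
pd.crosstab(df['partition'], df['app'])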

Besides the dataset name, only two other parameters (min_pkts and split) control which file is loaded, but their semantics and values are intertwined with the curation process adopted.

tcb.load_parquet?
Signature:
tcb.load_parquet(
    dataset_name: 'str | DATASETS',
    min_pkts: 'int' = -1,
    split: 'str' = None,
    columns: 'List[str]' = None,
    animation: 'bool' = False,
) -> 'pd.DataFrame'
Docstring:
Load and returns a dataset parquet file

Arguments:
    dataset_name: The name of the dataset
    min_pkts: the filtering rule applied when curating the datasets.
        If -1, load the unfiltered dataset
    split: if min_pkts!=-1, is used to request the loading of
        the split file. For DATASETS.UCDAVISICDM19
        values can be "human", "script" or a number
        between 0 and 4.
        For all other dataset split can be anything
        which is not None (e.g., True)
    columns: A list of columns to load (if None, load all columns)
    animation: if True, create a loading animation on the console

Returns:
    A pandas dataframe and the related parquet file used to load the dataframe
File:      ~/.conda/envs/super-tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets_utils.py
Type:      function
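
For instance, the columns parameter limits the loading to a subset of columns, which can speed things up for large datasets (a minimal sketch reusing two of the columns shown earlier):

# load only the application label and partition columns
df = tcb.load_parquet(
    tcb.DATASETS.UCDAVISICDM19,
    columns=['app', 'partition'],
)
list(df.columns)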

How .load_parquet() maps to parquet files

The logic to follow to load specific files can be confusing. The table below reports a global view across datasets:

Dataset         min_pkts=-1  min_pkts=10  min_pkts=1000  split=True            split=0..4       split=human  split=script
ucdavis-icdm19  yes          -            -              -                     yes (train+val)  yes (test)   yes (test)
mirage19        yes          yes          -              yes (train/val/test)  -                -            -
mirage22        yes          yes          yes            yes (train/val/test)  -                -            -
utmobilenet21   yes          yes          -              yes (train/val/test)  -                -            -
  • min_pkts=-1 is set by default and corresponds to loading the unfiltered parquet files, i.e., the files stored immediately under /preprocessed. All other files are stored under the imc23 subfolder

  • For ucdavis-icdm19, the parameter min_pkts is not used. The loading of training(+validation) and test data is controlled by split

  • For all other datasets, min_pkts specifies which filtered version of the data to use, while split=True loads the split indexes (see the sketch below)
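
As a concrete reading of the table, mirage22 is the only dataset offering an extra min_pkts=1000 filtered version (a minimal sketch of the valid combinations for it):

# unfiltered version (default, min_pkts=-1)
df_unfiltered = tcb.load_parquet(tcb.DATASETS.MIRAGE22)

# filtered versions (mirage22 is the only dataset with a min_pkts=1000 variant)
df_minpkts10 = tcb.load_parquet(tcb.DATASETS.MIRAGE22, min_pkts=10)
df_minpkts1000 = tcb.load_parquet(tcb.DATASETS.MIRAGE22, min_pkts=1000)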

Loading ucdavis-icdm19

For instance, to load the human test split of ucdavis-icdm19 you can run

df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split='human')
df['app'].value_counts()
app
youtube          20
google-drive     18
google-doc       15
google-music     15
google-search    15
Name: count, dtype: int64

And the logic is very similar for the script partition

df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split='script')
df['app'].value_counts()
app
google-doc       30
google-drive     30
google-music     30
google-search    30
youtube          30
Name: count, dtype: int64

However, to load a specific train split, pass the split index

df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split='0')
df['app'].value_counts()
app
google-doc       100
google-drive     100
google-music     100
google-search    100
youtube          100
Name: count, dtype: int64
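
Since there are five train splits (0 to 4, matching the train_split_*.parquet files listed earlier), you can load them all in a loop:

# load all five ucdavis-icdm19 train splits
train_splits = {
    i: tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split=str(i))
    for i in range(5)
}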

Loading other datasets

By default, without any parameter besides the dataset name, the function loads the unfiltered version of a dataset

df = tcb.load_parquet(tcb.DATASETS.MIRAGE19)
df.shape
(122007, 135)

Recall the structure of the mirage19 dataset

get_rich_tree_parquet_files(tcb.DATASETS.MIRAGE19)
Datasets
└── mirage19
    └── 📁 preprocessed/
        ├── mirage19.parquet
        └── 📁 imc23/
            ├── mirage19_filtered_minpkts10.parquet
            └── mirage19_filtered_minpkts10_splits.parquet

So there is only one filtered version, obtained with min_pkts=10

df = tcb.load_parquet(tcb.DATASETS.MIRAGE19, min_pkts=10)
df.shape
(64172, 20)

Based on the dataframe shape, we can indeed see that we loaded a reduced version of the unfiltered dataset.

While for ucdavis-icdm19 the "split" files contain 100 samples per app selected for training (because there are two ad hoc test splits), for all other datasets the "split" files contain indexes indicating the rows to use for train/val/test.

Thus, passing split=True is enough to load the split table.

df_split = tcb.load_parquet(tcb.DATASETS.MIRAGE19, min_pkts=10, split=True)
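
The exact schema of the split table is not shown here, but assuming it stores per-split row indexes (the column name train_indexes below is hypothetical; check df_split.columns for the real one), you could then slice the filtered dataframe loaded earlier:

# HYPOTHETICAL column name: replace with the real one from df_split.columns
train_indexes = df_split.iloc[0]['train_indexes']

# use .iloc if the stored indexes are positional, .loc if they are labels of df.index
df_train = df.iloc[train_indexes]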