Tutorial: dataset loading APIs¶
Let's import tcbench and alias it as tcb. The module automatically imports a few functions and constants.
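(A sketch of the import; the tcb alias is the one used throughout the rest of this tutorial.)

```python
# import tcbench and alias it as tcb
import tcbench as tcb
```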
The .get_datasets_root_folder() method¶
You can first discover where the datasets are stored by calling .get_datasets_root_folder().
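For example (a minimal sketch; the variable name root is illustrative and the actual path depends on where tcbench is installed):

```python
# root folder where tcbench stores the curated datasets
root = tcb.get_datasets_root_folder()
root
```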
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets')
The function returns a pathlib path, so you can take advantage of it to navigate the subfolder structure. For instance:
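(A sketch using iterdir(); root is the path obtained above.)

```python
# list the per-dataset subfolders under the datasets root
[path for path in root.iterdir() if path.is_dir()]
```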
[PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21.BACKUP'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage22'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19')]
As shown in the output, each dataset is mapped to a different folder named after the dataset itself. This means that, again taking advantage of pathlib, you can compose paths from strings. For instance:
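(A sketch; the folder name string matches the dataset name and root is the path obtained above.)

```python
# compose a dataset path from its name and list its content
list((root / "ucdavis-icdm19").iterdir())
```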
[PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/raw')]
The .DATASETS enum¶
A more polished way to reference datasets is via the tcbench.DATASETS attribute, which corresponds to a Python enumeration object (an enum.EnumMeta).
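For example, listing its members:

```python
# enumerate the datasets known to tcbench
list(tcb.DATASETS)
```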
[<DATASETS.UCDAVISICDM19: 'ucdavis-icdm19'>,
 <DATASETS.UTMOBILENET21: 'utmobilenet21'>,
 <DATASETS.MIRAGE19: 'mirage19'>,
 <DATASETS.MIRAGE22: 'mirage22'>]
The .get_dataset_folder() method¶
For instance, you can bypass composing a dataset folder path manually and directly call .get_dataset_folder() to find the folder of the specific dataset you are looking for.
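(A sketch of the call, passing a DATASETS member as done for the other helpers in this tutorial.)

```python
# resolve the folder of a specific dataset
tcb.get_dataset_folder(tcb.DATASETS.UCDAVISICDM19)
```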
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19')
Listing files¶
Via pathlib you can easily discover all the parquet files composing a dataset.
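For instance, a sketch using rglob() on the dataset folder:

```python
# recursively collect all parquet files of the ucdavis-icdm19 dataset
dataset_folder = tcb.get_dataset_folder(tcb.DATASETS.UCDAVISICDM19)
list(dataset_folder.rglob("*.parquet"))
```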
[PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/ucdavis-icdm19.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_0.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_1.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_human.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_4.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_3.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_script.parquet'),
PosixPath('./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_2.parquet')]
But you can also programmatically trigger the same logic behind the datasets lsparquet subcommand of the CLI using get_rich_tree_parquet_files():
from tcbench.libtcdatasets.datasets_utils import get_rich_tree_parquet_files
get_rich_tree_parquet_files(tcb.DATASETS.UCDAVISICDM19)
Datasets
└── ucdavis-icdm19
    └── 📁 preprocessed/
        ├── ucdavis-icdm19.parquet
        ├── LICENSE
        └── 📁 imc23/
            ├── test_split_human.parquet
            ├── test_split_script.parquet
            ├── train_split_0.parquet
            ├── train_split_1.parquet
            ├── train_split_2.parquet
            ├── train_split_3.parquet
            └── train_split_4.parquet
The .load_parquet() method¶
Finally, the generic .load_parquet() can be used to load one of the parquet files. For instance, the following loads the unfiltered monolithic file of the ucdavis-icdm19 dataset:
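(A sketch of the call based on the signature reported below; df and the head(2) preview are illustrative.)

```python
# load the unfiltered monolithic parquet file of ucdavis-icdm19
df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19)
df.head(2)
```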
|   | row_id | app | flow_id | partition | num_pkts | duration | bytes | unixtime | timetofirst | pkts_size | pkts_dir | pkts_iat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | google-doc | GoogleDoc-100 | pretraining | 2925 | 116.348 | 816029 | [1527993495.652867, 1527993495.685678, 1527993... | [0.0, 0.0328109, 0.261392, 0.262656, 0.263943,... | [354, 87, 323, 1412, 1412, 107, 1412, 180, 141... | [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, ... | [0.0, 0.0328109, 0.2285811, 0.0012639999999999... |
| 1 | 1 | google-doc | GoogleDoc-1000 | pretraining | 2813 | 116.592 | 794628 | [1527987720.40456, 1527987720.422811, 15279877... | [0.0, 0.0182509, 0.645106, 0.646344, 0.647689,... | [295, 87, 301, 1412, 1412, 1412, 180, 113, 141... | [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ... | [0.0, 0.0182509, 0.6268551, 0.0012380000000000... |
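The composition of the dataset can then be inspected by counting samples per (partition, app) pair. The exact aggregation cell is an assumption, but a value_counts() such as the following yields the breakdown below:

```python
# count samples for each (partition, app) pair, preserving the category order
df.value_counts(["partition", "app"], sort=False)
```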
partition app
pretraining google-doc 1221
google-drive 1634
google-music 592
google-search 1915
youtube 1077
retraining-human-triggered google-doc 15
google-drive 18
google-music 15
google-search 15
youtube 20
retraining-script-triggered google-doc 30
google-drive 30
google-music 30
google-search 30
youtube 30
Name: count, dtype: int64
Besides the dataset name, the function has only two other key parameters (min_pkts and split), but their semantics and accepted values are intertwined with the curation process adopted.
Signature:
tcb.load_parquet(
dataset_name: 'str | DATASETS',
min_pkts: 'int' = -1,
split: 'str' = None,
columns: 'List[str]' = None,
animation: 'bool' = False,
) -> 'pd.DataFrame'
Docstring:
Load and returns a dataset parquet file
Arguments:
dataset_name: The name of the dataset
min_pkts: the filtering rule applied when curating the datasets.
If -1, load the unfiltered dataset
split: if min_pkts!=-1, is used to request the loading of
the split file. For DATASETS.UCDAVISICDM19
values can be "human", "script" or a number
between 0 and 4.
For all other dataset split can be anything
which is not None (e.g., True)
columns: A list of columns to load (if None, load all columns)
animation: if True, create a loading animation on the console
Returns:
A pandas dataframe and the related parquet file used to load the dataframe
File: ~/.conda/envs/super-tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets_utils.py
Type: function
How .load_parquet() maps to parquet files¶
The logic to follow to load specific files can be confusing. The table below reports a global view across datasets:
| Dataset | min_pkts=-1 | min_pkts=10 | min_pkts=1000 | split=True | split=0..4 | split=human | split=script |
|---|---|---|---|---|---|---|---|
| ucdavis-icdm19 | yes | - | - | - | yes (train+val) | yes (test) | yes (test) |
| mirage19 | yes | yes | - | yes (train/val/test) | - | - | - |
| mirage22 | yes | yes | yes | yes (train/val/test) | - | - | - |
| utmobilenet21 | yes | yes | - | yes (train/val/test) | - | - | - |
- min_pkts=-1 is set by default and corresponds to loading the unfiltered parquet files, i.e., the files stored immediately under /preprocessed. All other files are stored under the imc23 subfolder.
- For ucdavis-icdm19, the parameter min_pkts is not used. The loading of training(+validation) and test data is controlled by split.
- For all other datasets, min_pkts specifies which filtered version of the data to use, while split=True loads the split indexes.
Loading ucdavis-icdm19¶
For instance, to load the human test split of ucdavis-icdm19 you can run:
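(A sketch; df_human and the value_counts() breakdown are illustrative.)

```python
# load the "human" test split and check its per-app composition
df_human = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split="human")
df_human["app"].value_counts()
```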
app
youtube 20
google-drive 18
google-doc 15
google-music 15
google-search 15
Name: count, dtype: int64
And the logic is very similar for the script partition:
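(The same sketch as above, with split="script".)

```python
# load the "script" test split and check its per-app composition
df_script = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split="script")
df_script["app"].value_counts()
```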
app
google-doc 30
google-drive 30
google-music 30
google-search 30
youtube 30
Name: count, dtype: int64
However, to load a specific train split:
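(A sketch requesting split 0; any value in 0..4 is valid, and whether the index is passed as an integer or a string is an assumption here.)

```python
# load train split 0 and check its per-app composition
df_train = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19, split=0)
df_train["app"].value_counts()
```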
app
google-doc 100
google-drive 100
google-music 100
google-search 100
youtube 100
Name: count, dtype: int64
Loading other datasets¶
By default, without any parameter besides the dataset name, the function loads the unfiltered version of a dataset:
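(A sketch for mirage19; the shape check is illustrative.)

```python
# load the unfiltered version of mirage19
df = tcb.load_parquet(tcb.DATASETS.MIRAGE19)
df.shape
```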
(122007, 135)
Recall the structure of the mirage19 dataset:
Datasets
└── mirage19
    └── 📁 preprocessed/
        ├── mirage19.parquet
        └── 📁 imc23/
            ├── mirage19_filtered_minpkts10.parquet
            └── mirage19_filtered_minpkts10_splits.parquet
So there is only one filtered version, obtained with min_pkts=10:
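(A sketch of the call.)

```python
# load the min_pkts=10 filtered version of mirage19
df = tcb.load_parquet(tcb.DATASETS.MIRAGE19, min_pkts=10)
df.shape
```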
(64172, 20)
Based on the dataframe shape, we can see that we indeed loaded a reduced version of the unfiltered dataset.
While for ucdavis-icdm19 the "split" files represent 100 samples per app selected for training (because there are two ad-hoc test splits), for all other datasets the "split" files contain indexes indicating the rows to use for train/val/test. Thus, passing split=True is enough to indicate the need to load the split table.
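For example, a sketch of loading the split table for mirage19 (the head() preview is illustrative; the content of the split table is described above):

```python
# load the split indexes for the min_pkts=10 filtered version of mirage19
df_splits = tcb.load_parquet(tcb.DATASETS.MIRAGE19, min_pkts=10, split=True)
df_splits.head()
```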