# ucdavis-icdm19
In the literature this dataset is also known as UCDAVIS19 or QUIC-DATASET and refers to five QUIC-based Google services (Google Drive, Google Docs, Google Search, Google Music, YouTube).
The authors of the dataset (Rezaei et al.) describe it as follows:

> This is a dataset captured in our lab at UC Davis and contains 5 Google services: Google Drive, Youtube, Google Docs, Google Search, and Google Music [5]. We used several systems with various configurations, including Windows 7, 8, 10, Ubuntu 16.4, and 17 operating systems. We wrote several scripts using Selenium WebDriver [17] and AutoIt [1] tools to mimic human behavior when capturing data. This approach allowed us to capture a large dataset without significant human effort. Such approach has been used in many other studies [14, 8, 3]. Furthermore, we also captured a few samples of real human interactions to show how much the accuracy of a model trained on scripted samples will degrade when it is tested on real human samples. During preprocessing, we removed all non-QUIC traffic. Note that all flows in our dataset are labeled, but we did not use labels during the pre-training step. We used class labels of all flows to show the accuracy gap between a fully-supervised and semi-supervised approach.
@article{DBLP:journals/corr/abs-1812-09761,
author = {Shahbaz Rezaei and
Xin Liu},
title = {How to Achieve High Classification Accuracy with Just a Few Labels:
{A} Semi-supervised Approach Using Sampled Packets},
journal = {CoRR},
volume = {abs/1812.09761},
year = {2018},
url = {http://arxiv.org/abs/1812.09761},
eprinttype = {arXiv},
eprint = {1812.09761},
timestamp = {Thu, 07 Nov 2019 09:05:08 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1812-09761.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
## Raw data
The original dataset is a collection of three different zip archives.
<root>/
├── pretraining.zip
├── Retraining(human-triggered).zip
└── Retraining(script-triggered).zip
Each archive is a different partition that Rezaei et al. named to reflect different modeling scopes:
* `pretraining` contains thousands of samples and is meant for pre-training models.
* `Retraining(human-triggered)` and `Retraining(script-triggered)` contain tens of samples and are meant for testing or fine-tuning models.
When all zips are unpacked, the folder structure becomes
downloads/
├── pretraining
│ ├── Google Doc
│ ├── Google Drive
│ ├── Google Music
│ ├── Google Search
│ └── Youtube
├── Retraining(human-triggered)
│ ├── Google Doc
│ ├── Google Drive
│ ├── Google Music
│ ├── Google Search
│ └── Youtube
└── Retraining(script-triggered)
├── Google Doc
├── Google Drive
├── Google Music
├── Google Search
└── Youtube
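For a quick sanity check after unpacking, a few lines of Python can count the files in each service folder. This is a minimal sketch assuming the `downloads/` layout above:

```python
from pathlib import Path

# Count the files inside each <partition>/<service> folder
# of the unpacked archives.
root = Path("downloads")
for folder in sorted(root.glob("*/*")):
    if folder.is_dir():
        num_files = sum(1 for f in folder.iterdir() if f.is_file())
        print(f"{folder.relative_to(root)}: {num_files} files")
```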
Inside each nested folder there is a collection of CSV files. Here is an extract from one of those CSVs:
Output
1527987720.404560000 0 295 1
1527987720.422811000 0.0182509 87 0
1527987721.049666000 0.645106 301 0
1527987721.050904000 0.646344 1412 0
1527987721.052249000 0.647689 1412 0
1527987721.053456000 0.648896 1412 0
1527987721.054619000 0.650059 180 0
1527987721.055299000 0.650739 113 1
1527987721.055848000 0.651288 1412 0
1527987721.057053000 0.652493 1412 0
Each file represents an individual flow, and each row in a file carries information for an individual packet of that flow. Specifically, the columns correspond to:
- The packet unix time (in seconds).
- The packet time relative to the first packet of the flow.
- The packet size (in bytes).
- The packet direction (either 0 or 1).
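Concretely, one of these per-flow files can be loaded with pandas along the following lines. This is a sketch: the file path is hypothetical, the whitespace separator is inferred from the extract above, and the column names are chosen to mirror the curated schema described later:

```python
import pandas as pd

# Per-flow CSVs carry no header; the four columns follow the order above.
# The path is hypothetical -- pick any file under downloads/.
flow = pd.read_csv(
    "downloads/pretraining/Google Doc/example_flow.csv",
    sep=r"\s+",
    header=None,
    names=["unixtime", "timetofirst", "pkts_size", "pkts_dir"],
)
print(flow.head())
```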
## Curation
The raw dataset provided by Rezaei et al. is already cleaned, i.e., the authors already filtered the data they collected and provide logs referring only to traffic generated by the 5 targeted Google services.
As such, tcbench does NOT perform any additional filtering.
However, the organization of the raw data can be improved: it is a collection of many individual CSV files, with class labels encoded in folder and file names.

So, the curation process performed by tcbench aims to:

- Create a monolithic parquet file where each row represents one flow and packet time series are collected into numpy arrays (a sketch of this aggregation follows the list).
- Preserve the semantics of the original folder structure by adding extra columns (`partition` and `flow_id`).
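As an illustration of what this aggregation amounts to, here is a minimal sketch (not tcbench's actual code) that collapses one per-packet CSV into a single flow record whose fields mirror the schema in the table below; the `duration` and `pkts_iat` derivations in particular are assumptions:

```python
import numpy as np
import pandas as pd

def csv_to_flow_record(path: str, partition: str, app: str) -> dict:
    """Collapse a per-packet CSV into one per-flow record (illustrative only)."""
    pkts = pd.read_csv(
        path, sep=r"\s+", header=None,
        names=["unixtime", "timetofirst", "pkts_size", "pkts_dir"],
    )
    return {
        "app": app,
        "flow_id": path,  # the original filename doubles as the flow id
        "partition": partition,
        "num_pkts": len(pkts),
        "duration": float(pkts["timetofirst"].iloc[-1]),  # assumed definition
        "bytes": int(pkts["pkts_size"].sum()),
        "unixtime": pkts["unixtime"].to_numpy(),
        "timetofirst": pkts["timetofirst"].to_numpy(),
        "pkts_size": pkts["pkts_size"].to_numpy(),
        "pkts_dir": pkts["pkts_dir"].to_numpy(),
        # inter-arrival times derived from relative timestamps (assumed)
        "pkts_iat": np.diff(pkts["timetofirst"].to_numpy(), prepend=0.0),
    }
```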
The following table describes the schema of the curated dataset.
| Field | Description |
|---|---|
| `row_id` | A unique row id |
| `app` | The label of the flow, encoded as a pandas `category` |
| `flow_id` | The original filename |
| `partition` | The partition related to the flow |
| `num_pkts` | Number of packets in the flow |
| `duration` | The duration of the flow |
| `bytes` | The number of bytes of the flow |
| `unixtime` | Numpy array with the absolute time of each packet |
| `timetofirst` | Numpy array with the delta between each packet and the first packet of the flow |
| `pkts_size` | Numpy array with the packet size time series |
| `pkts_dir` | Numpy array with the packet direction time series |
| `pkts_iat` | Numpy array with the packet inter-arrival time series |
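Once curated, the monolithic parquet can be inspected directly with pandas; for example (the path below is abbreviated, the full location appears in the install output further down):

```python
import pandas as pd

df = pd.read_parquet("ucdavis-icdm19.parquet")  # one row per flow
print(df.dtypes)
# Reproduces the per-partition sample counts reported during install.
print(df.groupby(["partition", "app"], observed=True).size())
```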
## Splits
The three partitions created by Rezaei et al. need to be complemented with actual train/test splits before models can be trained. The splits generated for this dataset relate to our IMC23 paper. Specifically:

- From `pretraining` we generate 5 random splits, each with 100 samples per class.
- The other two partitions are left as-is and are used for testing.
Both training and testing splits are "materialized", i.e., the splits are NOT collections of row indexes but rather already-filtered views of the monolithic parquet file. Hence, all splits have the same columns as in the previous table.
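Because the splits are materialized, each one loads like any other parquet file. A quick way to verify the 100-samples-per-class design (path abbreviated, see the install output below for the full location):

```python
import pandas as pd

train = pd.read_parquet("imc23/train_split_0.parquet")
# Expected: 100 samples for each of the 5 classes.
print(train["app"].value_counts())
```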
## Install
The dataset zip archives are stored in a Google Drive folder. To install them in tcbench you need to download the 3 zip files manually and place them into a local folder, e.g., `/downloads`. To trigger the installation, run the following:
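The invocation below reflects tcbench's `datasets install` subcommand; treat the exact flag names as an assumption and double-check with `tcbench datasets install --help`:

```bash
tcbench datasets install --name ucdavis-icdm19 --input-folder downloads/
```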
Output
╭──────╮
│unpack│
╰──────╯
opening: downloads/pretraining.zip
opening: downloads/Retraining(human-triggered).zip
opening: downloads/Retraining(script-triggered).zip
╭──────────╮
│preprocess│
╰──────────╯
found 6672 CSV files to load
Converting CSVs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
concatenating files
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/ucdavis-icdm19.parquet
samples count : unfiltered
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ partition ┃ app ┃ samples ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ pretraining │ google-doc │ 1221 │
│ │ google-drive │ 1634 │
│ │ google-music │ 592 │
│ │ google-search │ 1915 │
│ │ youtube │ 1077 │
│ │ __total__ │ 6439 │
├─────────────────────────────┼───────────────┼─────────┤
│ retraining-human-triggered │ google-doc │ 15 │
│ │ google-drive │ 18 │
│ │ google-music │ 15 │
│ │ google-search │ 15 │
│ │ youtube │ 20 │
│ │ __total__ │ 83 │
├─────────────────────────────┼───────────────┼─────────┤
│ retraining-script-triggered │ google-doc │ 30 │
│ │ google-drive │ 30 │
│ │ google-music │ 30 │
│ │ google-search │ 30 │
│ │ youtube │ 30 │
│ │ __total__ │ 150 │
└─────────────────────────────┴───────────────┴─────────┘
╭───────────────╮
│generate splits│
╰───────────────╯
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_0.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_1.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_2.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_3.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_4.parquet
samples count : train_split = 0 to 4
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ google-doc │ 100 │
│ google-drive │ 100 │
│ google-music │ 100 │
│ google-search │ 100 │
│ youtube │ 100 │
├───────────────┼─────────┤
│ __total__ │ 500 │
└───────────────┴─────────┘
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_human.parquet
samples count : test_split_human
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ youtube │ 20 │
│ google-drive │ 18 │
│ google-doc │ 15 │
│ google-music │ 15 │
│ google-search │ 15 │
├───────────────┼─────────┤
│ __total__ │ 83 │
└───────────────┴─────────┘
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_script.parquet
samples count : test_split_script
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ google-doc │ 30 │
│ google-drive │ 30 │
│ google-music │ 30 │
│ google-search │ 30 │
│ youtube │ 30 │
├───────────────┼─────────┤
│ __total__ │ 150 │
└───────────────┴─────────┘
The console output shows a few sample count reports related to the processing performed on the dataset:

- The first report relates to the overall monolithic parquet file obtained by consolidating all CSVs. This is labeled as unfiltered in the console output.
- The next report relates to the generation of the 5 train splits (obtained by processing the pretraining partition only).
- The last two reports relate to the two predefined testing partitions.
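After installation, the curated dataset can also be loaded programmatically rather than by pointing pandas at the parquet files directly. The snippet below follows tcbench's loading API as documented; if the names differ in your version, consult the tcbench API reference:

```python
import tcbench as tcb

# Load the curated monolithic parquet for this dataset.
df = tcb.load_parquet(tcb.DATASETS.UCDAVISICDM19)
print(df["partition"].value_counts())
```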