ucdavis-icdm19

In the literature this dataset is also known as UCDAVIS19 or QUIC-DATASET. It covers 5 QUIC-based Google services (Google Drive, Google Docs, Google Search, Google Music, YouTube).

The authors of the dataset (Rezaei et al.) describe it as follows:

This is a dataset captured in our lab at UC Davis and contains 5 Google services: Google Drive, Youtube, Google Docs, Google Search, and Google Music [5]. We used several systems with various configurations, including Windows 7, 8, 10, Ubuntu 16.4, and 17 operating systems. We wrote several scripts using Selenium WebDriver [17] and AutoIt [1] tools to mimic human behavior when capturing data. This approach allowed us to capture a large dataset without significant human effort. Such approach has been used in many other studies [14, 8, 3]. Furthermore, we also captured a few samples of real human interactions to show how much the accuracy of a model trained on scripted samples will degrade when it is tested on real human samples. During preprocessing, we removed all non-QUIC traffic. Note that all flows in our dataset are labeled, but we did not use labels during the pre-training step. We used class labels of all flows to show the accuracy gap between a fully-supervised and semi-supervised approach.

@article{DBLP:journals/corr/abs-1812-09761,
  author       = {Shahbaz Rezaei and
                  Xin Liu},
  title        = {How to Achieve High Classification Accuracy with Just a Few Labels:
                  {A} Semi-supervised Approach Using Sampled Packets},
  journal      = {CoRR},
  volume       = {abs/1812.09761},
  year         = {2018},
  url          = {http://arxiv.org/abs/1812.09761},
  eprinttype    = {arXiv},
  eprint       = {1812.09761},
  timestamp    = {Thu, 07 Nov 2019 09:05:08 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-1812-09761.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Raw data

The original dataset is a collection of three different zip archives.

<root>/
├── pretraining.zip
├── Retraining(human-triggered).zip
└── Retraining(script-triggered).zip

Each archive is a different partition that Rezaei et al. named to reflect different modeling scopes:

  • pretraining contains thousands of samples and is meant for pre-training models.

  • Retraining(human-triggered) and Retraining(script-triggered) contain tens of samples and are meant for testing or fine-tuning models.

When all zips are unpacked, the folder structure becomes

downloads/
├── pretraining
│   ├── Google Doc
│   ├── Google Drive
│   ├── Google Music
│   ├── Google Search
│   └── Youtube
├── Retraining(human-triggered)
│   ├── Google Doc
│   ├── Google Drive
│   ├── Google Music
│   ├── Google Search
│   └── Youtube
└── Retraining(script-triggered)
    ├── Google Doc
    ├── Google Drive
    ├── Google Music
    ├── Google Search
    └── Youtube

Inside each nested folder there is a collection of CSV files (saved with a .txt extension).

Here is an extract from one of those CSVs:

head <root>/pretraining/Google Doc/GoogleDoc-1000.txt

Output

1527987720.404560000    0       295     1
1527987720.422811000    0.0182509       87      0
1527987721.049666000    0.645106        301     0
1527987721.050904000    0.646344        1412    0
1527987721.052249000    0.647689        1412    0
1527987721.053456000    0.648896        1412    0
1527987721.054619000    0.650059        180     0
1527987721.055299000    0.650739        113     1
1527987721.055848000    0.651288        1412    0
1527987721.057053000    0.652493        1412    0

Each file represents an individual flow and each row carries the information of an individual packet of that flow. Specifically, the columns correspond to the following (a loading example is given after the list):

  • The packet unixtime (in seconds).

  • The packet relative time with respect to the first packet of the flow.

  • The packet size.

  • The packet direction (either 0 or 1).
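
As a minimal sketch, one of these per-flow files could be loaded with plain pandas as shown below (this is just an illustration, not a tcbench API; the column names mirror the curated schema described later):

import pandas as pd

# Load one per-flow file; columns follow the description above.
flow = pd.read_csv(
    "downloads/pretraining/Google Doc/GoogleDoc-1000.txt",
    sep=r"\s+",      # fields are tab/whitespace separated
    header=None,
    names=["unixtime", "timetofirst", "pkts_size", "pkts_dir"],
)
print(flow.head())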

Curation

The raw dataset provided by Rezaei et al. is already cleaned, i.e., the authors filtered the traffic they collected so that the logs refer only to the 5 targeted Google services.

As such, tcbench does NOT perform any additional filtering.

However, the organization of the raw data can be improved: it is a collection of many individual CSV files, with class labels encoded in folder and file names.

So, the curation process performed by tcbench aims to

  • Create a monolithic parquet file where each row represents one flow and the packet series are collected into numpy arrays.

  • Retain the semantics of the original folder structure, which is preserved during curation by adding extra columns (partition and flow_id). A sketch of this consolidation is provided after the schema table below.

The following table describes the schema of the curated dataset.

Field        Description
row_id       A unique row id
app          The label of the flow, encoded as a pandas category
flow_id      The original filename
partition    The partition related to the flow
num_pkts     Number of packets in the flow
duration     The duration of the flow
bytes        The number of bytes of the flow
unixtime     Numpy array with the absolute time of each packet
timetofirst  Numpy array with the delta between each packet and the first packet of the flow
pkts_size    Numpy array with the packet size time series
pkts_dir     Numpy array with the packet direction time series
pkts_iat     Numpy array with the packet inter-arrival time series
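
To make the curation concrete, here is a minimal sketch of this consolidation (for illustration only, not tcbench's actual implementation; it assumes the downloads/ layout shown above and requires pandas, numpy, and pyarrow):

from pathlib import Path
import numpy as np
import pandas as pd

rows = []
for path in sorted(Path("downloads").glob("*/*/*.txt")):
    pkts = pd.read_csv(
        path, sep=r"\s+", header=None,
        names=["unixtime", "timetofirst", "pkts_size", "pkts_dir"],
    )
    unixtime = pkts["unixtime"].to_numpy()
    rows.append(dict(
        app=path.parent.name.lower().replace(" ", "-"),   # e.g. "Google Doc" -> "google-doc"
        flow_id=path.name,
        partition=path.parent.parent.name,                # folder name; tcbench uses a normalized label
        num_pkts=len(pkts),
        duration=float(pkts["timetofirst"].iloc[-1]),
        bytes=int(pkts["pkts_size"].sum()),
        unixtime=unixtime,
        timetofirst=pkts["timetofirst"].to_numpy(),
        pkts_size=pkts["pkts_size"].to_numpy(),
        pkts_dir=pkts["pkts_dir"].to_numpy(),
        pkts_iat=np.diff(unixtime, prepend=unixtime[0]),  # inter-arrival times (0 for the first packet)
    ))

df = pd.DataFrame(rows)
df.insert(0, "row_id", range(len(df)))
df["app"] = df["app"].astype("category")
df.to_parquet("ucdavis-icdm19.parquet")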

Splits

The 3 partitions created by Rezaei et al. need to be complemented with actual folds before models can be trained.

The splits generated for this dataset relate to our IMC23 paper.

Specifically:

  • From pretraining we generate 5 random splits, each with 100 samples per class.

  • The other two partitions are left as is and are used for testing.

Both training and testing splits are "materialized", i.e., the splits are NOT collections of row indexes but rather already filtered views of the monolithic parquet file.

Hence, all splits have the same columns as in the previous table.
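
Because the splits are materialized, loading one is just a matter of reading the corresponding parquet file, e.g., with plain pandas (the full paths are printed during installation, see the next section):

import pandas as pd

split = pd.read_parquet("train_split_0.parquet")  # adjust the path to your installation
print(split.columns.tolist())                     # same schema as the table above
print(split["app"].value_counts())                # 100 samples per class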

Install

The dataset zip archives are stored in a Google Drive folder.

To install them in tcbench you need to download the 3 zip files manually and place them into a local folder, e.g., ./downloads.

To trigger the installation, run the following:

tcbench datasets install \
    --name ucdavis-icdm19 \
    --input-folder ./downloads/

Output

╭──────╮
│unpack│
╰──────╯
opening: downloads/pretraining.zip
opening: downloads/Retraining(human-triggered).zip
opening: downloads/Retraining(script-triggered).zip

╭──────────╮
│preprocess│
╰──────────╯
found 6672 CSV files to load
Converting CSVs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
concatenating files
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/ucdavis-icdm19.parquet
samples count : unfiltered
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ partition                   ┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ pretraining                 │ google-doc    │    1221 │
│                             │ google-drive  │    1634 │
│                             │ google-music  │     592 │
│                             │ google-search │    1915 │
│                             │ youtube       │    1077 │
│                             │ __total__     │    6439 │
├─────────────────────────────┼───────────────┼─────────┤
│ retraining-human-triggered  │ google-doc    │      15 │
│                             │ google-drive  │      18 │
│                             │ google-music  │      15 │
│                             │ google-search │      15 │
│                             │ youtube       │      20 │
│                             │ __total__     │      83 │
├─────────────────────────────┼───────────────┼─────────┤
│ retraining-script-triggered │ google-doc    │      30 │
│                             │ google-drive  │      30 │
│                             │ google-music  │      30 │
│                             │ google-search │      30 │
│                             │ youtube       │      30 │
│                             │ __total__     │     150 │
└─────────────────────────────┴───────────────┴─────────┘

╭───────────────╮
│generate splits│
╰───────────────╯
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_0.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_1.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_2.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_3.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_4.parquet
samples count : train_split = 0 to 4
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ google-doc    │     100 │
│ google-drive  │     100 │
│ google-music  │     100 │
│ google-search │     100 │
│ youtube       │     100 │
├───────────────┼─────────┤
│ __total__     │     500 │
└───────────────┴─────────┘

saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_human.parquet
samples count : test_split_human
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ youtube       │      20 │
│ google-drive  │      18 │
│ google-doc    │      15 │
│ google-music  │      15 │
│ google-search │      15 │
├───────────────┼─────────┤
│ __total__     │      83 │
└───────────────┴─────────┘

saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_script.parquet
samples count : test_split_script
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ google-doc    │      30 │
│ google-drive  │      30 │
│ google-music  │      30 │
│ google-search │      30 │
│ youtube       │      30 │
├───────────────┼─────────┤
│ __total__     │     150 │
└───────────────┴─────────┘

The console output shows a few sample count reports related to the processing performed on the dataset:

  1. The first report relates to the overall monolithic parquet file obtained by consolidating all CSVs. This is labeled as unfiltered in the console output (see the snippet after this list).

  2. The next report relates to the generation of the 5 splits (obtained by processing the pretraining partition only).

  3. The last two reports relate to the two predefined testing partitions.
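
For instance, the first report can be reproduced from the monolithic parquet file with a simple groupby (plain pandas, with the path adjusted to your environment):

import pandas as pd

df = pd.read_parquet("ucdavis-icdm19.parquet")  # path printed during installation
print(df.groupby(["partition", "app"], observed=True).size())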