ucdavis-icdm19

In the literature this dataset is also known as UCDAVIS19 or QUIC-DATASET. It covers 5 QUIC-based Google services (Google Drive, Google Docs, Google Search, Google Music, YouTube).

The authors of the dataset (Rezaei et al.) describe it as follows:

This is a dataset captured in our lab at UC Davis and contains 5 Google services: Google Drive, Youtube, Google Docs, Google Search, and Google Music [5]. We used several systems with various configurations, including Windows 7, 8, 10, Ubuntu 16.4, and 17 operating systems. We wrote several scripts using Selenium WebDriver [17] and AutoIt [1] tools to mimic human behavior when capturing data. This approach allowed us to capture a large dataset without significant human effort. Such approach has been used in many other studies [14, 8, 3]. Furthermore, we also captured a few samples of real human interactions to show how much the accuracy of a model trained on scripted samples will degrade when it is tested on real human samples. During preprocessing, we removed all non-QUIC traffic. Note that all flows in our dataset are labeled, but we did not use labels during the pre-training step. We used class labels of all flows to show the accuracy gap between a fully-supervised and semi-supervised approach.

@article{DBLP:journals/corr/abs-1812-09761,
  author       = {Shahbaz Rezaei and
                  Xin Liu},
  title        = {How to Achieve High Classification Accuracy with Just a Few Labels:
                  {A} Semi-supervised Approach Using Sampled Packets},
  journal      = {CoRR},
  volume       = {abs/1812.09761},
  year         = {2018},
  url          = {http://arxiv.org/abs/1812.09761},
  eprinttype    = {arXiv},
  eprint       = {1812.09761},
  timestamp    = {Thu, 07 Nov 2019 09:05:08 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-1812-09761.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Raw data

The original dataset is a collection of three different zip archives.

<root>/
├── pretraining.zip
├── Retraining(human-triggered).zip
└── Retraining(script-triggered).zip

Each archive is a different partition that Rezaei et al. named to reflect different modeling scopes:

  • pretraining contains thousands of samples and is meant for pre-training models.

  • Retraining(human-triggered) and Retraining(script-triggered) contain tens of samples and are meant for testing or fine-tuning models.

When all zips are unpacked, the folder structure becomes

downloads/
├── pretraining
│   ├── Google Doc
│   ├── Google Drive
│   ├── Google Music
│   ├── Google Search
│   └── Youtube
├── Retraining(human-triggered)
│   ├── Google Doc
│   ├── Google Drive
│   ├── Google Music
│   ├── Google Search
│   └── Youtube
└── Retraining(script-triggered)
    ├── Google Doc
    ├── Google Drive
    ├── Google Music
    ├── Google Search
    └── Youtube

Inside each nested folder there is a collection of CSV files (saved with a .txt extension).

Here is an extract from one of those CSVs:

head <root>/pretraining/Google Doc/GoogleDoc-1000.txt

Output

1527987720.404560000    0       295     1
1527987720.422811000    0.0182509       87      0
1527987721.049666000    0.645106        301     0
1527987721.050904000    0.646344        1412    0
1527987721.052249000    0.647689        1412    0
1527987721.053456000    0.648896        1412    0
1527987721.054619000    0.650059        180     0
1527987721.055299000    0.650739        113     1
1527987721.055848000    0.651288        1412    0
1527987721.057053000    0.652493        1412    0

Each file represents an individual flow and each row carries the information of an individual packet of that flow. Specifically, the columns correspond to the following (a loading example is given after the list):

  • The packet unixtime (in seconds).

  • The packet relative time with respect to the first packet of the flow.

  • The packet size.

  • The packet direction (either 0 or 1).
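
As a minimal sketch, one of these per-flow files could be loaded with plain pandas as shown below (this is just an illustration, not a tcbench API; the column names mirror the curated schema described later):

import pandas as pd

# Load one per-flow file; columns follow the description above.
flow = pd.read_csv(
    "downloads/pretraining/Google Doc/GoogleDoc-1000.txt",
    sep=r"\s+",      # fields are tab/whitespace separated
    header=None,
    names=["unixtime", "timetofirst", "pkts_size", "pkts_dir"],
)
print(flow.head())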

Curation

The raw dataset provided by Rezaei et al. is already cleaned, i.e., the authors filtered the traffic they collected so that the logs refer only to the 5 targeted Google services.

As such, tcbench does NOT perform any additional filtering.

However, the organization of the raw data can be improved: it is a collection of many individual CSV files, with class labels encoded in folder and file names.

So, the curation process performed by tcbench aims to

  • Create a monolithic parquet file where each row represents one flow and the packet series are collected into numpy arrays.

  • Retain the semantics of the original folder structure, which is preserved during curation by adding extra columns (partition and flow_id). A sketch of this consolidation is provided after the schema table below.

The following table describes the schema of the curated dataset.

Field        Description
row_id       A unique row id
app          The label of the flow, encoded as a pandas category
flow_id      The original filename
partition    The partition related to the flow
num_pkts     Number of packets in the flow
duration     The duration of the flow
bytes        The number of bytes of the flow
unixtime     Numpy array with the absolute time of each packet
timetofirst  Numpy array with the delta between each packet and the first packet of the flow
pkts_size    Numpy array with the packet size time series
pkts_dir     Numpy array with the packet direction time series
pkts_iat     Numpy array with the packet inter-arrival time series
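
To make the curation concrete, here is a minimal sketch of this consolidation (for illustration only, not tcbench's actual implementation; it assumes the downloads/ layout shown above and requires pandas, numpy, and pyarrow):

from pathlib import Path
import numpy as np
import pandas as pd

rows = []
for path in sorted(Path("downloads").glob("*/*/*.txt")):
    pkts = pd.read_csv(
        path, sep=r"\s+", header=None,
        names=["unixtime", "timetofirst", "pkts_size", "pkts_dir"],
    )
    unixtime = pkts["unixtime"].to_numpy()
    rows.append(dict(
        app=path.parent.name.lower().replace(" ", "-"),   # e.g. "Google Doc" -> "google-doc"
        flow_id=path.name,
        partition=path.parent.parent.name,                # folder name; tcbench uses a normalized label
        num_pkts=len(pkts),
        duration=float(pkts["timetofirst"].iloc[-1]),
        bytes=int(pkts["pkts_size"].sum()),
        unixtime=unixtime,
        timetofirst=pkts["timetofirst"].to_numpy(),
        pkts_size=pkts["pkts_size"].to_numpy(),
        pkts_dir=pkts["pkts_dir"].to_numpy(),
        pkts_iat=np.diff(unixtime, prepend=unixtime[0]),  # inter-arrival times (0 for the first packet)
    ))

df = pd.DataFrame(rows)
df.insert(0, "row_id", range(len(df)))
df["app"] = df["app"].astype("category")
df.to_parquet("ucdavis-icdm19.parquet")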

Splits

The 3 partitions created by Rezaei et al. need to be complemented with actual folds before models can be trained.

The splits generated for this dataset relate to our IMC23 paper.

Specifically:

  • From pretraining we generate 5 random splits, each with 100 samples per class.

  • The other two partitions are left as is and are used for testing.

Both training and testing splits are "materialized", i.e., the splits are NOT collections of row indexes but rather already filtered views of the monolithic parquet file.

Hence, all splits have the same columns as in the previous table.
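
Because the splits are materialized, loading one is just a matter of reading the corresponding parquet file, e.g., with plain pandas (the full paths are printed during installation, see the next section):

import pandas as pd

split = pd.read_parquet("train_split_0.parquet")  # adjust the path to your installation
print(split.columns.tolist())                     # same schema as the table above
print(split["app"].value_counts())                # 100 samples per class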

Install

The dataset zip archives are stored in a Google Drive folder.

To install them in tcbench you need to download the 3 zip files manually and place them into a local folder, e.g., ./downloads.

To trigger the installation, run the following:

tcbench datasets install \
    --name ucdavis-icdm19 \
    --input-folder ./downloads/

Output

╭──────╮
│unpack│
╰──────╯
opening: downloads/pretraining.zip
opening: downloads/Retraining(human-triggered).zip
opening: downloads/Retraining(script-triggered).zip

╭──────────╮
│preprocess│
╰──────────╯
found 6672 CSV files to load
Converting CSVs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
concatenating files
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/ucdavis-icdm19.parquet
samples count : unfiltered
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ partition                   ┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ pretraining                 │ google-doc    │    1221 │
│                             │ google-drive  │    1634 │
│                             │ google-music  │     592 │
│                             │ google-search │    1915 │
│                             │ youtube       │    1077 │
│                             │ __total__     │    6439 │
├─────────────────────────────┼───────────────┼─────────┤
│ retraining-human-triggered  │ google-doc    │      15 │
│                             │ google-drive  │      18 │
│                             │ google-music  │      15 │
│                             │ google-search │      15 │
│                             │ youtube       │      20 │
│                             │ __total__     │      83 │
├─────────────────────────────┼───────────────┼─────────┤
│ retraining-script-triggered │ google-doc    │      30 │
│                             │ google-drive  │      30 │
│                             │ google-music  │      30 │
│                             │ google-search │      30 │
│                             │ youtube       │      30 │
│                             │ __total__     │     150 │
└─────────────────────────────┴───────────────┴─────────┘

╭───────────────╮
│generate splits│
╰───────────────╯
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_0.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_1.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_2.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_3.parquet
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/train_split_4.parquet
samples count : train_split = 0 to 4
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ google-doc    │     100 │
│ google-drive  │     100 │
│ google-music  │     100 │
│ google-search │     100 │
│ youtube       │     100 │
├───────────────┼─────────┤
│ __total__     │     500 │
└───────────────┴─────────┘

saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_human.parquet
samples count : test_split_human
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ youtube       │      20 │
│ google-drive  │      18 │
│ google-doc    │      15 │
│ google-music  │      15 │
│ google-search │      15 │
├───────────────┼─────────┤
│ __total__     │      83 │
└───────────────┴─────────┘

saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/ucdavis-icdm19/preprocessed/imc23/test_split_script.parquet
samples count : test_split_script
┏━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ google-doc    │      30 │
│ google-drive  │      30 │
│ google-music  │      30 │
│ google-search │      30 │
│ youtube       │      30 │
├───────────────┼─────────┤
│ __total__     │     150 │
└───────────────┴─────────┘

The console output shows a few sample count reports related to the processing performed on the dataset:

  1. The first report relates to the overall monolithic parquet file obtained by consolidating all CSVs. This is labeled as unfiltered in the console output (see the snippet after this list).

  2. The next report relates to the generation of the 5 splits (obtained by processing the pretraining partition only).

  3. The last two reports relate to the two predefined testing partitions.
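
For instance, the first report can be reproduced from the monolithic parquet file with a simple groupby (plain pandas, with the path adjusted to your environment):

import pandas as pd

df = pd.read_parquet("ucdavis-icdm19.parquet")  # path printed during installation
print(df.groupby(["partition", "app"], observed=True).size())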