mirage19
The dataset collects traffic from 20 Android mobile apps (accuweather, comixology, dropbox, duolingo, facebook, foursquared, groupon, iliga, messenger, pinterest, slither, spotify, subito.it, trello, tripadvisor, twitter, viber, waze, wish.com, youtube).
The authors of the dataset (Aceto et al.) describe it as follows:
We have collected the MIRAGE-2019 dataset in the ARCLAB laboratory at the University of Napoli “Federico II”. The capture sessions (cf. Sec. III-A) span from May 2017 to May 2019. We employed three devices to generate the mobile traffic, namely: (i) Xiaomi Mi5, (ii) Google Nexus 7, and (iii) Samsung Galaxy A5. In detail, we installed the custom firmware CyanogenMod v13.0 (corresponding to the Android version 6.0.1) on all the devices and enabled the root mode. More than 280 experimenters took part to the dataset construction on a voluntary basis, by performing one or two experimental sessions each. The experimenters involved in this activity were students of three different courses held at the University of Napoli “Federico II”, aged 19÷25 years, with a 85/15% share between males and females. Each experimental session lasted two hours, at most. Altogether, during each experimental session, each experimenter performed 12 capture sessions of 5÷10 minutes (each resulting in one PCAP traffic trace and one strace log-file, cf. Sec. III). In each capture session the experimenter was asked to perform activities mimicking common uses of a single app with the intent to explore its functionalities in addition to first-time install, login, registration. We report the ethical considerations underlying the aforementioned traffic-capture procedure in Sec. VI. Overall, the MIRAGE-2019 dataset gathers the traffic generated by 40 Android apps belonging to 16 different categories according to Google Play apps distribution portal [17].
@INPROCEEDINGS{aceto2019mirage,
  author={G. {Aceto} and D. {Ciuonzo} and A. {Montieri} and V. {Persico} and A. {Pescap{\`e}}},
  booktitle={IEEE 4th International Conference on Computing, Communication and Security (ICCCS 2019)},
  title={MIRAGE: Mobile-app Traffic Capture and Ground-truth Creation},
  year={2019},
  volume={},
  number={},
  pages={},
  abstract={Network traffic analysis, i.e. the umbrella of procedures for distilling information from network traffic, represents the enabler for highly-valuable profiling information, other than being the workhorse for several key network management tasks. While it is currently being revolutionized in its nature by the rising share of traffic generated by mobile and hand-held devices, existing design solutions are mainly evaluated on private traffic traces, and only a few public datasets are available, thus clearly limiting repeatability and further advances on the topic. To this end, this paper introduces and describes MIRAGE, a reproducible architecture for mobile-app traffic capture and ground-truth creation. The outcome of this system is MIRAGE-2019, a human-generated dataset for mobile traffic analysis (with associated ground-truth) having the goal of advancing the state-of-the-art in mobile app traffic analysis. A first statistical characterization of the mobile-app traffic in the dataset is provided in this paper. Still, MIRAGE is expected to be capitalized by the networking community for different tasks related to mobile traffic analysis.},
  keywords={Android apps; encrypted traffic; mobile apps; mobile traffic; reproducible research; open dataset; traffic classification},
  doi={},
  ISSN={},
  month={Oct},
}
20 or 40 apps?
The quote reports that the dataset has traffic from 40 apps, but the website describing the dataset reports only 20.
Through communication with Aceto et al. we learned that the public version of the dataset contains only 20 apps. Although this is not reported on the website, the data for the remaining 20 apps is available only upon request.
That said, tcbench considers only the public portion of the dataset.
Raw data
The dataset is a single tarball that once unpacked has the following structure
The subfolders contain collections of JSON files, each representing a different experiment.
The JSON schema of each file is not officially documented. Its semantics are not very difficult to reverse engineer (especially with domain knowledge in traffic processing), but the nested structure makes the files hard to process.
The target Android app for an experiment is encoded both in the filename and in the metadata of the JSON schema.
The actual structure in the schema is flow → metadata → bf → label.
Each JSON file collects per-flow data, but metrics are scattered across different nested layers. For example, aggregate flow metrics are hierarchically separated from packet time series, which are further separated from other metadata.
Last, each JSON file reports time series of packet properties (e.g., packet size and direction) as well as packet payloads (each payload is encoded as a list of integers).
Curation
The curation process has the following objectives:
- Combine all JSON files into a monolithic parquet file.
- Flatten the JSON nested structure. For instance, the nested input dictionary {"layer1":{"col1":1, "col2":2}} would be flattened into a table with columns "layer1_col1" and "layer1_col2" with the respective values 1 and 2.
- Remove "background" traffic. Each JSON file carries the Android app name in its file name, but the traffic in an experiment can also come from a different app or service running in parallel. The dataset offers the column flow_metadata_bf_label, which contains the Android app name that netstat linked to each network socket during an experiment. This implies that, knowing the expected app of an experiment, one can define as "background" any flow where flow_metadata_bf_label != expected Android app name.
- Remove "small data". This includes filtering out data based on the following rules:
    - Remove ACK packets from time series.
    - Remove flows with < 10 packets.
    - Remove apps generating < 100 samples.
- As mentioned, the dataset contains raw packet bytes across multiple packets of a flow. We process these series to search for ASCII strings. This is a simple heuristic to find TLS handshake information (i.e., rather than actually decoding packet headers, we simply search for sequences of bytes that look like ASCII strings).
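Three of the curation steps listed above (flattening, background detection and the ASCII string search) can be sketched in a few lines of Python. This is only an illustration and not the tcbench implementation: the function names are hypothetical, the minimum string length of 4 bytes is an assumption, and the expected app is assumed to be available per row in the android_name column.

```python
import re

# Runs of at least 4 printable ASCII bytes (space..tilde).
# The minimum length of 4 is an assumption, not a value stated in this page.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def flatten(nested, prefix=""):
    """Flatten nested dicts by joining the keys of each layer with "_"."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def is_background(row):
    """True when the netstat label differs from the app expected for the experiment."""
    return row["flow_metadata_bf_label"] != row["android_name"]


def ascii_strings(payload):
    """Return printable ASCII runs found in a payload given as a list of integers."""
    return [m.decode("ascii") for m in PRINTABLE_RUN.findall(bytes(payload))]


flat = flatten({"layer1": {"col1": 1, "col2": 2}})
print(flat)  # {'layer1_col1': 1, 'layer1_col2': 2}

# Hypothetical rows: the second one carries a label from another app.
rows = [
    {"android_name": "com.waze", "flow_metadata_bf_label": "com.waze"},
    {"android_name": "com.waze", "flow_metadata_bf_label": "other.app"},
]
foreground = [row for row in rows if not is_background(row)]

# Hypothetical payload: three non-printable bytes, a hostname, then noise.
payload = list(b"\x16\x03\x01example.com\x00\x01ab")
strings = ascii_strings(payload)
```

The string search deliberately ignores protocol structure, so it can also return printable runs that have nothing to do with a TLS handshake.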
Given that the curation requires filtering, we provide two versions of the dataset:
- unfiltered contains all data (including background, ACK packets, etc.); the related parquet has 135 columns (most generated by unfolding the JSON nested structure).
- preprocessed contains the curated data and an opinionated selection of 20 columns to make the parquet files more manageable.
For both formats, the most important columns of the dataset are the following.
Field | Description |
---|---|
packet_data_packet_dir | The time series of the packet direction |
packet_data_l4_payload_bytes | The time series of the packet size |
packet_data_iat | The time series of the packet inter-arrival time |
flow_metadata_bf_label | The label gathered via netstat |
strings | The ASCII string recovered from the payload analysis |
android_name | The app used for an experiment |
app | The final label encoded as a pandas category |
row_id | A unique row identifier |
Please refer to the datasets schema page for more details.
Splits
The preprocessed parquet file is associated with nominally 80/10/10 train/validation/test splits (about 81/9/10 in practice, since the two 90/10 splits are chained), created with the following logic.
- Shuffle the rows.
- Perform a 90/10 split where the 10-part is used for testing.
- From the 90-part, do a second 90/10 split to define train and validation.
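The three steps above can be sketched as follows. This is a minimal illustration and not the tcbench code: the function name, the use of one random seed per split and the plain (non-stratified) shuffle are assumptions, while the output field names follow the splits table on this page.

```python
import random


def make_split(row_ids, seed=0, test_frac=0.1, val_frac=0.1):
    """Shuffle, hold out 10% for test, then 10% of the remainder for validation."""
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)

    n_test = round(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]

    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]

    return {"train_indexes": train, "val_indexes": val, "test_indexes": test}


# Five splits (split_index 0..4), each built with a different (assumed) seed.
splits = [dict(make_split(range(1000), seed=i), split_index=i) for i in range(5)]

first = splits[0]
sizes = (
    len(first["train_indexes"]),
    len(first["val_indexes"]),
    len(first["test_indexes"]),
)
print(sizes)  # (810, 90, 100)
```

With 1000 rows the chained 90/10 splits give 810/90/100 samples, which is why the resulting proportions are closer to 81/9/10 than to an exact 80/10/10.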
The splits are collections of row indexes that need to be applied to the filtered monolithic parquet in order to obtain the data for modeling.
The structure of the splits table is as follows:
Field | Description |
---|---|
train_indexes | A numpy array with the row_id related to the train split |
val_indexes | ... validation split |
test_indexes | ... test split |
split_index | The index of the split (0..4) |
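Applying a split then amounts to selecting the rows whose row_id appears in the corresponding array. The sketch below uses small in-memory stand-ins instead of the real parquet files; the helper name `select_rows` and the toy values are hypothetical, and only the column and field names come from the tables on this page.

```python
import numpy as np
import pandas as pd


def select_rows(df, row_ids):
    """Keep the rows of df whose row_id is listed in row_ids."""
    return df[df["row_id"].isin(row_ids)]


# Toy stand-in for the filtered monolithic parquet.
df = pd.DataFrame(
    {
        "row_id": [0, 1, 2, 3, 4],
        "app": ["com.waze", "com.trello", "com.waze", "com.trello", "com.waze"],
    }
)

# Toy stand-in for one row of the splits table.
split = {
    "train_indexes": np.array([0, 2, 4]),
    "val_indexes": np.array([1]),
    "test_indexes": np.array([3]),
    "split_index": 0,
}

train_df = select_rows(df, split["train_indexes"])
val_df = select_rows(df, split["val_indexes"])
test_df = select_rows(df, split["test_indexes"])
```

With the real files, `df` and the splits table would instead be loaded with `pd.read_parquet` from the paths printed in the install output.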
Install
The installation does not require you to pre-download the dataset tarball and can be triggered with the following command
Output
╭─────────────────╮
│download & unpack│
╰─────────────────╯
Downloading... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5 GB / 1.5 GB eta 0:00:00
opening: /tmp/tmpxcdzy8tw/MIRAGE-2019_traffic_dataset_downloadable_v2.tar.gz
╭──────────╮
│preprocess│
╰──────────╯
found 1642 JSON files to load
Converting JSONs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1642/1642 0:00:11
merging files...
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19/preprocessed/mirage19.parquet
╭────────────────────────╮
│filter & generate splits│
╰────────────────────────╯
loading: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19/preprocessed/mirage19.parquet
samples count : unfiltered
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app ┃ samples ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ com.waze │ 11737 │
│ de.motain.iliga │ 10810 │
│ com.accuweather.android │ 10631 │
│ com.duolingo │ 8319 │
│ it.subito │ 8167 │
│ com.contextlogic.wish │ 6507 │
│ com.spotify.music │ 6431 │
│ com.joelapenna.foursquared │ 6399 │
│ com.google.android.youtube │ 6346 │
│ com.iconology.comics │ 5516 │
│ com.facebook.katana │ 5368 │
│ com.dropbox.android │ 4815 │
│ com.twitter.android │ 4734 │
│ background │ 4439 │
│ com.pinterest │ 4078 │
│ com.facebook.orca │ 4018 │
│ com.tripadvisor.tripadvisor │ 3572 │
│ air.com.hypah.io.slither │ 3088 │
│ com.viber.voip │ 2740 │
│ com.trello │ 2306 │
│ com.groupon │ 1986 │
├─────────────────────────────┼─────────┤
│ __total__ │ 122007 │
└─────────────────────────────┴─────────┘
stats : number packets per-flow (unfiltered)
┏━━━━━━━┳━━━━━━━━━━┓
┃ stat ┃ value ┃
┡━━━━━━━╇━━━━━━━━━━┩
│ count │ 122007.0 │
│ mean │ 23.11 │
│ std │ 9.73 │
│ min │ 1.0 │
│ 25% │ 17.0 │
│ 50% │ 26.0 │
│ 75% │ 32.0 │
│ max │ 32.0 │
└───────┴──────────┘
filtering min_pkts=10...
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19/preprocessed/imc23/mirage19_filtered_minpkts10.parquet
samples count : filtered (min_pkts=10)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app ┃ samples ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ de.motain.iliga │ 7505 │
│ com.waze │ 7214 │
│ com.duolingo │ 4583 │
│ it.subito │ 4299 │
│ com.contextlogic.wish │ 3927 │
│ com.accuweather.android │ 3737 │
│ com.joelapenna.foursquared │ 3627 │
│ com.spotify.music │ 3300 │
│ com.dropbox.android │ 3189 │
│ com.facebook.katana │ 2878 │
│ com.iconology.comics │ 2812 │
│ com.twitter.android │ 2805 │
│ com.google.android.youtube │ 2728 │
│ com.pinterest │ 2450 │
│ com.tripadvisor.tripadvisor │ 2052 │
│ com.facebook.orca │ 1783 │
│ com.viber.voip │ 1618 │
│ com.trello │ 1478 │
│ com.groupon │ 1174 │
│ air.com.hypah.io.slither │ 1013 │
├─────────────────────────────┼─────────┤
│ __total__ │ 64172 │
└─────────────────────────────┴─────────┘
stats : number packets per-flow (min_pkts=10)
┏━━━━━━━┳━━━━━━━━━┓
┃ stat ┃ value ┃
┡━━━━━━━╇━━━━━━━━━┩
│ count │ 64172.0 │
│ mean │ 17.01 │
│ std │ 4.43 │
│ min │ 11.0 │
│ 25% │ 14.0 │
│ 50% │ 17.0 │
│ 75% │ 19.0 │
│ max │ 32.0 │
└───────┴─────────┘
saving: ./envs/tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/mirage19/preprocessed/imc23/mirage19_filtered_minpkts10_splits.parquet
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ app ┃ train_samples ┃ val_samples ┃ test_samples ┃ all_samples ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ de.motain.iliga │ 6079 │ 675 │ 751 │ 7505 │
│ com.waze │ 5844 │ 649 │ 721 │ 7214 │
│ com.duolingo │ 3712 │ 413 │ 458 │ 4583 │
│ it.subito │ 3482 │ 387 │ 430 │ 4299 │
│ com.contextlogic.wish │ 3181 │ 353 │ 393 │ 3927 │
│ com.accuweather.android │ 3027 │ 336 │ 374 │ 3737 │
│ com.joelapenna.foursquared │ 2938 │ 326 │ 363 │ 3627 │
│ com.spotify.music │ 2673 │ 297 │ 330 │ 3300 │
│ com.dropbox.android │ 2583 │ 287 │ 319 │ 3189 │
│ com.facebook.katana │ 2331 │ 259 │ 288 │ 2878 │
│ com.iconology.comics │ 2278 │ 253 │ 281 │ 2812 │
│ com.twitter.android │ 2272 │ 252 │ 281 │ 2805 │
│ com.google.android.youtube │ 2209 │ 246 │ 273 │ 2728 │
│ com.pinterest │ 1984 │ 221 │ 245 │ 2450 │
│ com.tripadvisor.tripadvisor │ 1662 │ 185 │ 205 │ 2052 │
│ com.facebook.orca │ 1444 │ 161 │ 178 │ 1783 │
│ com.viber.voip │ 1310 │ 146 │ 162 │ 1618 │
│ com.trello │ 1197 │ 133 │ 148 │ 1478 │
│ com.groupon │ 951 │ 106 │ 117 │ 1174 │
│ air.com.hypah.io.slither │ 821 │ 91 │ 101 │ 1013 │
├─────────────────────────────┼───────────────┼─────────────┼──────────────┼─────────────┤
│ __total__ │ 51978 │ 5776 │ 6418 │ 64172 │
└─────────────────────────────┴───────────────┴─────────────┴──────────────┴─────────────┘
The console output shows a few sample-count reports related to the processing performed on the dataset:
- The first report relates to the unfiltered dataset, i.e., the monolithic parquet file obtained by consolidating all JSON files before applying any curation. At first glance, this dataset seems to contain a lot of flows. However, the report that follows shows the number of packets per flow and suggests that most of the flows in the raw dataset are very short (and thus meaningless for a classification task).
- The second group of reports shows the same information after removing flows with fewer than 10 packets.
- The last report shows the number of train/validation/test samples per application for the first split (the same counts hold for all splits).