utmobilenet21

The dataset collects traffic from 17 mobile Android apps (youtube, reddit, google-maps, spotify, netflix, pinterest, hulu, instagram, dropbox, facebook, twitter, gmail, pandora, messenger, google-drive, hangout, skype).

The authors of the dataset (Heng et al.) describe it as follows:

We select 16 of the most popular mobile applications listed on [18], so that the collected data is representative of the modern data consumption patterns. Typical user activities are selected for each application, such as browsing and posting for Reddit and sending and opening emails for Gmail. For each activity, a series of actions are implemented through the Android API to emulate a sequence of user interactions with the smartphone. Examples of actions include scrolling the smartphone screen, clicking on part of the displayed content to go to a different screen, and waiting for the displayed media to play for a certain time.

[...]

After packet recording is complete, the raw pcap data files are transferred to the laptop, where they are processed using TShark and converted to the csv format.

[...]

Three different sets of data were collected. In the deterministic automated dataset, the action parameters for each activity are fixed. For instance, when performing the activity of scrolling news feed on Facebook, the BASH script will always scroll the feed 3 times, wait for 5 seconds, and repeat for 5 times. Although the actions are fixed, the context and content displayed are up to the application. In the randomized automated dataset, the action parameters for each activity are randomized, such as the number of scrolls, the wait time and the number of repetitions for Facebook news feed scrolling. This makes the collected data more diverse and realistic. The third dataset is generated by human users and is the most realistic in terms of representing user activity. It includes two subsets: an application-specific dataset and an activity-specific dataset. In the former, human users perform each activity using applications in Table III. In the latter, human users use each application normally without constraints on the activities to perform.

@ARTICLE{9490678,
  author={Heng, Yuqiang and Chandrasekhar, Vikram and Andrews, Jeffrey G.},
  journal={IEEE Networking Letters}, 
  title={UTMobileNetTraffic2021: A Labeled Public Network Traffic Dataset}, 
  year={2021},
  volume={3},
  number={3},
  pages={156-160},
  doi={10.1109/LNET.2021.3098455}}

Raw data

The dataset is a single zip file. Once unpacked it contains the following structure.

csvs
├── Action-Specific Wild Test Data
├── Deterministic Automated Data
├── Randomized Automated Data
└── Wild Test Data

The structure reflects the four data collection campaigns described by the authors. Within each folder there is a collection of CSV files with ground-truth labels encoded in the filename.

For instance

> ls -1 csvs/Action-Specific\ Wild\ Test\ Data/ | head
dropbox_man-download_2019-04-30_19-07-09_4fd1c357.csv
dropbox_man-upload_2019-04-30_19-16-06_4fd1c357.csv
facebook_man-scroll-newsfeed_2019-04-19_14-36-52_d56097ed.csv
gmail_man-open-email_2019-04-19_15-08-28_d56097ed.csv
gmail_man-send-email_2019-04-19_15-26-04_d56097ed.csv
google-drive_man-download_2019-04-30_19-22-16_4fd1c357.csv
google-drive_man-upload_2019-04-19_15-40-09_d56097ed.csv
google-drive_man-upload_2019-04-30_19-27-21_4fd1c357.csv
google-maps_man-explore_2019-04-24_15-55-57_4fd1c357.csv
google-maps_man-explore_2019-04-24_16-26-39_4fd1c357.csv
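The label components can be recovered by splitting a filename on underscores (the app names use hyphens rather than underscores, so the split is unambiguous). A minimal sketch, where `parse_label` is a hypothetical helper name:

```python
from pathlib import Path

def parse_label(fname: str) -> dict:
    """Split a dataset CSV filename into its label components.

    Expected pattern: <app>_<activity>_<date>_<time>_<id>.csv
    """
    app, activity, date, time, capture_id = Path(fname).stem.split("_")
    return {"app": app, "activity": activity, "date": date,
            "time": time, "capture_id": capture_id}

print(parse_label("google-drive_man-upload_2019-04-30_19-27-21_4fd1c357.csv"))
```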

Each CSV is generated with tshark, so it gathers per-packet information across all flows of a pcap.

For instance

> head csvs/Action-Specific\ Wild\ Test\ Data/dropbox_man-download_2019-04-30_19-07-09_4fd1c357.csv
,frame.number,frame.time,frame.len,frame.cap_len,sll.pkttype,sll.hatype,sll.halen,sll.src.eth,sll.unused,sll.etype,ip.hdr_len,ip.dsfield.ecn,ip.len,ip.id,ip.frag_offset,ip.ttl,ip.proto,ip.checksum,ip.src,ip.dst,tcp.hdr_len,tcp.len,tcp.srcport,tcp.dstport,tcp.seq,tcp.ack,tcp.flags.ns,tcp.flags.fin,tcp.window_size_value,tcp.checksum,tcp.urgent_pointer,tcp.option_kind,tcp.option_len,tcp.options.timestamp.tsval,tcp.options.timestamp.tsecr,udp.srcport,udp.dstport,udp.length,udp.checksum,gquic.puflags.rsv,gquic.packet_number,location
0,1,"Apr 30, 2019 19:07:17.184823000 CDT",886,68,4,1,6,98:f1:70:7c:4b:27,0000,0x00000800,20.0,0.0,870.0,0x00004405,0.0,64.0,6.0,0x00001eb4,10.145.31.196,162.125.8.7,32.0,818.0,59576.0,443.0,1.0,1.0,0.0,0.0,406.0,0x000016e0,0.0,"1,1,8",10,19415791.0,4190900276.0,,,,,,,EER
1,2,"Apr 30, 2019 19:07:17.207030000 CDT",68,68,0,1,6,00:6c:bc:1c:5f:b9,0000,0x00000800,20.0,0.0,52.0,0x00001556,0.0,55.0,6.0,0x0000594d,162.125.8.7,10.145.31.196,32.0,0.0,443.0,59576.0,1.0,819.0,0.0,0.0,66.0,0x0000ce00,0.0,"1,1,8",10,4190904398.0,19415791.0,,,,,,,EER
2,3,"Apr 30, 2019 19:07:17.338454000 CDT",550,68,0,1,6,00:6c:bc:1c:5f:b9,0000,0x00000800,20.0,0.0,534.0,0x00001557,0.0,55.0,6.0,0x0000576a,162.125.8.7,10.145.31.196,32.0,482.0,443.0,59576.0,1.0,819.0,0.0,0.0,66.0,0x00001382,0.0,"1,1,8",10,4190904525.0,19415791.0,,,,,,,EER
3,4,"Apr 30, 2019 19:07:17.338645000 CDT",102,68,0,1,6,00:6c:bc:1c:5f:b9,0000,0x00000800,20.0,0.0,86.0,0x00001558,0.0,55.0,6.0,0x00005929,162.125.8.7,10.145.31.196,32.0,34.0,443.0,59576.0,483.0,819.0,0.0,0.0,66.0,0x00005a59,0.0,"1,1,8",10,4190904525.0,19415791.0,,,,,,,EER
4,5,"Apr 30, 2019 19:07:17.340360000 CDT",68,68,4,1,6,98:f1:70:7c:4b:27,0000,0x00000800,20.0,0.0,52.0,0x00004406,0.0,64.0,6.0,0x000021e5,10.145.31.196,162.125.8.7,32.0,0.0,59576.0,443.0,819.0,517.0,0.0,0.0,415.0,0x0000d4ff,0.0,"1,1,8",10,19415838.0,4190904525.0,,,,,,,EER
5,6,"Apr 30, 2019 19:07:17.398784000 CDT",80,44,4,1,6,98:f1:70:7c:4b:27,0000,0x00000800,20.0,0.0,64.0,0x0000589f,0.0,64.0,17.0,0x00007e3c,10.145.31.196,128.83.185.41,,,,,,,,,,,,,,,,56035.0,53.0,44.0,0x00009d5d,,,EER
6,7,"Apr 30, 2019 19:07:17.399295000 CDT",80,44,4,1,6,98:f1:70:7c:4b:27,0000,0x00000800,20.0,0.0,64.0,0x000058a0,0.0,64.0,17.0,0x00007e3b,10.145.31.196,128.83.185.41,,,,,,,,,,,,,,,,48733.0,53.0,44.0,0x00009158,,,EER
7,8,"Apr 30, 2019 19:07:17.400595000 CDT",126,44,0,1,6,00:6c:bc:1c:5f:b9,0000,0x00000800,20.0,0.0,110.0,0x0000f478,0.0,62.0,17.0,0x000023ed,128.83.185.41,10.145.31.196,,,,,,,,,,,,,,,,53.0,56035.0,90.0,0x0000c3e8,,,EER
8,9,"Apr 30, 2019 19:07:17.400908000 CDT",126,44,0,1,6,00:6c:bc:1c:5f:b9,0000,0x00000800,20.0,0.0,110.0,0x0000f479,0.0,62.0,17.0,0x000023ec,128.83.185.41,10.145.31.196,,,,,,,,,,,,,,,,53.0,48733.0,90.0,0x0000b7e3,,,EER
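Such per-packet CSVs can be loaded with pandas, but missing values force numeric columns (e.g., ports) to float. A minimal sketch, using a small hypothetical inline excerpt with only a few of the columns shown above:

```python
import io
import pandas as pd

# Hypothetical inline excerpt mimicking the tshark column layout
# (only a handful of columns, for illustration).
sample = io.StringIO(
    ",frame.number,frame.time,ip.proto,tcp.srcport,udp.srcport\n"
    '0,1,"Apr 30, 2019 19:07:17.184823000 CDT",6.0,59576.0,\n'
    '1,2,"Apr 30, 2019 19:07:17.398784000 CDT",17.0,,56035.0\n'
)

df = pd.read_csv(
    sample,
    index_col=0,
    on_bad_lines="skip",  # tolerate occasionally malformed rows
)

# Missing values make the port columns float; the nullable Int64
# dtype restores integers while preserving the missing entries.
for col in ("tcp.srcport", "udp.srcport"):
    df[col] = df[col].astype("Int64")
```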

Curation

The curation process has the following objectives:

  1. The CSVs can generate problems when loaded. For instance, some rows have broken formats (e.g., utilities such as pandas.read_csv() fail to parse them) and columns have missing values or mixed types (e.g., ports can be either ints or floats). Extra care is therefore required to properly ingest the CSVs.

  2. The CSVs contain packets with protocols other than TCP/UDP. These need to be filtered out.

  3. As the CSVs are per-packet, they need to be reassembled into flows to obtain packet time series.

  4. Remove "small data". This includes filtering out data based on the following rules:

    • Remove flows with < 10 samples.

    • Remove apps generating < 100 samples.
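Steps 2–4 can be sketched with pandas as follows. The toy per-packet table and column names are illustrative stand-ins, not the exact schema or code used by tcbench:

```python
import pandas as pd

# Toy per-packet table standing in for a parsed tshark CSV
# (hypothetical values): one 12-packet TCP flow, one 3-packet
# UDP flow, and one ICMP packet.
pkts = pd.DataFrame({
    "ip.src":    ["10.0.0.1"] * 12 + ["10.0.0.2"] * 3 + ["10.0.0.3"],
    "ip.dst":    ["1.1.1.1"]  * 12 + ["2.2.2.2"]  * 3 + ["3.3.3.3"],
    "src_port":  [5000] * 12 + [6000] * 3 + [0],
    "dst_port":  [443]  * 12 + [53]   * 3 + [0],
    "ip.proto":  [6] * 12 + [17] * 3 + [1],  # 6=TCP, 17=UDP, 1=ICMP
    "frame.len": list(range(12)) + [100, 101, 102] + [64],
})

# (2) keep only TCP/UDP packets
pkts = pkts[pkts["ip.proto"].isin((6, 17))]

# (3) reassemble packets into flows keyed by the 5-tuple
flows = (
    pkts.groupby(["ip.src", "ip.dst", "src_port", "dst_port", "ip.proto"])
        .agg(packets=("frame.len", "size"), bytes=("frame.len", "sum"))
        .reset_index()
)

# (4) drop flows with fewer than 10 packets
flows = flows[flows["packets"] >= 10]
```

The short UDP flow and the ICMP packet are discarded, leaving only the 12-packet TCP flow.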

The final monolithic parquet file has the following columns

Field Description
row_id A unique flow id
src_ip The source ip of the flow
src_port The source port of the flow
dst_ip The destination ip of the flow
dst_port The destination port of the flow
ip_proto The protocol of the flow (TCP or UDP)
first Timestamp of the first packet
last Timestamp of the last packet
duration Duration of the flow
packets Number of packets in the flow
bytes Number of bytes in the flow
partition From which folder the flow was originally stored
location A label originally provided by the dataset (see the related paper for details)
fname The original filename where the packets of the flow come from
app The final label of the flow, encoded as pandas category
pkts_size The numpy array for the packet size time series
pkts_dir The numpy array for the packet direction time series
timetofirst The numpy array with the delta between each packet timestamp and the first packet of the flow
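The three time-series columns relate to the raw packets of a flow as in the following sketch (toy values, with a hypothetical +1/-1 direction encoding):

```python
import numpy as np

# Toy flow with four packets (hypothetical values).
timestamps = np.array([0.00, 0.02, 0.13, 0.14])  # seconds
sizes      = np.array([886, 68, 550, 102])       # -> pkts_size
dirs       = np.array([1, -1, -1, -1])           # -> pkts_dir (encoding assumed)

# timetofirst: delta between each packet timestamp and the first one
timetofirst = timestamps - timestamps[0]

# The scalar flow columns follow directly from the series.
duration = timestamps[-1] - timestamps[0]  # -> duration
n_packets = len(sizes)                     # -> packets
n_bytes = int(sizes.sum())                 # -> bytes
```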

Splits

Once preprocessed, the monolithic dataset is further processed to define five 80/10/10 train/val/test splits with the following logic:

  1. Shuffle the rows.
  2. Perform a 90/10 split where the 10-part is used for testing.
  3. From the 90-part, do a second 90/10 to define train and validation.

The splits are NOT materialized, i.e., each split is a collection of row indexes that need to be applied to the filtered monolithic parquet in order to obtain the data for modeling.
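The three steps above can be sketched with numpy as follows. This is a minimal, non-stratified version (the per-app counts in the split report suggest the actual splits preserve class proportions), with a hypothetical function name and seeding scheme:

```python
import numpy as np

def make_split(n_rows: int, seed: int):
    """One shuffled 90/10 + 90/10 split over row indexes."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)            # (1) shuffle the rows
    n_test = n_rows // 10
    test, rest = idx[:n_test], idx[n_test:]  # (2) 90/10: the 10-part is test
    n_val = len(rest) // 10
    val, train = rest[:n_val], rest[n_val:]  # (3) second 90/10: val vs train
    return train, val, test

# Five independent splits, one per seed (seeding scheme assumed).
splits = [make_split(9460, seed) for seed in range(5)]
```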

The structure of the split table is

Field Description
train_indexes A numpy array with the row_id related to the train split
val_indexes A numpy array with the row_id related to the validation split
test_indexes A numpy array with the row_id related to the test split
split_index The index of the split (0..4)
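Materializing one split then amounts to selecting the rows of the monolithic parquet whose row_id appears in the corresponding index array. A minimal sketch with toy stand-in tables:

```python
import pandas as pd

# Hypothetical, minimal stand-in for the filtered monolithic parquet.
df = pd.DataFrame({
    "row_id": [10, 11, 12, 13, 14],
    "app": ["youtube", "reddit", "hulu", "spotify", "netflix"],
})

# Hypothetical stand-in for one row of the split table.
split = {
    "train_indexes": [10, 12, 14],
    "val_indexes": [11],
    "test_indexes": [13],
    "split_index": 0,
}

# Select the rows belonging to the train portion of this split.
train_df = df[df["row_id"].isin(split["train_indexes"])]
```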

Install

The dataset zip archive is stored in a box.com folder. You need to manually download the archive and save it into a local folder, e.g., /downloads.

To trigger the installation, run the following:

tcbench datasets install \
    --name utmobilenet21 \
    --input-folder downloads/

Output

╭──────╮
│unpack│
╰──────╯
opening: downloads/UTMobileNet2021.zip

╭──────────╮
│preprocess│
╰──────────╯
processing: ./envs/tcbench/lib/python3.10/site-packages/libtcdatasets/datasets/utmobilenet21/raw/Action-Specific Wild Test Data
found 43 files
Converting CSVs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 0:01:15
stage1 completed
stage2 completed
stage3 completed
stage4 completed
saving: /tmp/processing-utmobilenet21/action-specific_wild_test_data.parquet

processing: ./envs/tcbench/lib/python3.10/site-packages/libtcdatasets/datasets/utmobilenet21/raw/Wild Test Data
found 14 files
Converting CSVs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 0:03:12
stage1 completed
stage2 completed
stage3 completed
stage4 completed
saving: /tmp/processing-utmobilenet21/wild_test_data.parquet

processing: ./envs/tcbench/lib/python3.10/site-packages/libtcdatasets/datasets/utmobilenet21/raw/Randomized Automated Data
found 288 files
Converting CSVs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 288/288 0:01:35
stage1 completed
stage2 completed
stage3 completed
stage4 completed
saving: /tmp/processing-utmobilenet21/randomized_automated_data.parquet

processing: ./envs/tcbench/lib/python3.10/site-packages/libtcdatasets/datasets/utmobilenet21/raw/Deterministic Automated Data
found 3438 files
Converting CSVs... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3438/3438 0:08:26
stage1 completed
stage2 completed
stage3 completed
stage4 completed
saving: /tmp/processing-utmobilenet21/deterministic_automated_data.parquet
merging all partitions
saving: ./envs/tcbench/lib/python3.10/site-packages/libtcdatasets/datasets/utmobilenet21/preprocessed/utmobilenet21.parquet

╭────────────────────────╮
│filter & generate splits│
╰────────────────────────╯
loading: ./envs/super-tcbench/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21/preprocessed/utmobilenet21.parquet
samples count : unfiltered
┏━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app          ┃ samples ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ youtube      │    5591 │
│ reddit       │    4370 │
│ google-maps  │    4347 │
│ spotify      │    2550 │
│ netflix      │    2237 │
│ pinterest    │    2165 │
│ hulu         │    1839 │
│ instagram    │    1778 │
│ dropbox      │    1752 │
│ facebook     │    1654 │
│ twitter      │    1494 │
│ gmail        │    1133 │
│ pandora      │     949 │
│ messenger    │     837 │
│ google-drive │     803 │
│ hangout      │     720 │
│ skype        │     159 │
├──────────────┼─────────┤
│ __total__    │   34378 │
└──────────────┴─────────┘
stats : number packets per-flow (unfiltered)
┏━━━━━━━┳━━━━━━━━━━━┓
┃ stat  ┃     value ┃
┡━━━━━━━╇━━━━━━━━━━━┩
│ count │   34378.0 │
│ mean  │    663.96 │
│ std   │  18455.95 │
│ min   │       1.0 │
│ 25%   │       2.0 │
│ 50%   │       2.0 │
│ 75%   │      18.0 │
│ max   │ 1973657.0 │
└───────┴───────────┘

saving: ./envs/tcbench-johndoe/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21/preprocessed/imc23/utmobilenet21_filtered_minpkts10.parquet
samples count : filtered (min_pkts=10)
┏━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ app          ┃ samples ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ youtube      │    2496 │
│ google-maps  │    1798 │
│ hulu         │    1169 │
│ reddit       │     816 │
│ spotify      │     664 │
│ netflix      │     483 │
│ pinterest    │     436 │
│ twitter      │     365 │
│ instagram    │     274 │
│ hangout      │     254 │
│ dropbox      │     238 │
│ pandora      │     200 │
│ facebook     │     137 │
│ google-drive │     130 │
├──────────────┼─────────┤
│ __total__    │    9460 │
└──────────────┴─────────┘
stats : number packets per-flow (min_pkts=10)
┏━━━━━━━┳━━━━━━━━━━━┓
┃ stat  ┃     value ┃
┡━━━━━━━╇━━━━━━━━━━━┩
│ count │    9460.0 │
│ mean  │   2366.32 │
│ std   │  35109.17 │
│ min   │      11.0 │
│ 25%   │      25.0 │
│ 50%   │      51.0 │
│ 75%   │     182.0 │
│ max   │ 1973657.0 │
└───────┴───────────┘

saving: ./envs/tcbench-johndoe/lib/python3.10/site-packages/tcbench/libtcdatasets/datasets/utmobilenet21/preprocessed/imc23/utmobilenet21_filtered_minpkts10_splits.parquet
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ app          ┃ train_samples ┃ val_samples ┃ test_samples ┃ all_samples ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ youtube      │          2021 │         225 │          250 │        2496 │
│ google-maps  │          1456 │         162 │          180 │        1798 │
│ hulu         │           947 │         105 │          117 │        1169 │
│ reddit       │           661 │          73 │           82 │         816 │
│ spotify      │           538 │          60 │           66 │         664 │
│ netflix      │           391 │          44 │           48 │         483 │
│ pinterest    │           353 │          39 │           44 │         436 │
│ twitter      │           296 │          33 │           36 │         365 │
│ instagram    │           222 │          25 │           27 │         274 │
│ hangout      │           206 │          23 │           25 │         254 │
│ dropbox      │           193 │          21 │           24 │         238 │
│ pandora      │           162 │          18 │           20 │         200 │
│ facebook     │           111 │          12 │           14 │         137 │
│ google-drive │           105 │          12 │           13 │         130 │
├──────────────┼───────────────┼─────────────┼──────────────┼─────────────┤
│ __total__    │          7662 │         852 │          946 │        9460 │
└──────────────┴───────────────┴─────────────┴──────────────┴─────────────┘

The console output shows a few sample count reports related to the processing performed on the dataset:

  1. The first report relates to the unfiltered dataset, i.e., the monolithic parquet file obtained by consolidating all CSV files before applying any curation. At first glance, this dataset seems to have a lot of flows. However, the following report shows the number of packets per flow and reveals that many flows are very short.

  2. The second and third groups of reports show the same information as the first group, but after filtering out flows with fewer than 10 packets.

  3. The last report shows the number of train/validation/test samples for each application in the first split (the counters are identical across all splits).