ucdavis-icdm19¶
Below we report the sample counts for each version of the dataset.
Semantics of the splits
The splits available for this dataset relate to our IMC23 paper.
unfiltered¶
The unfiltered version contains all data before curation.
Output
unfiltered
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ partition                   ┃ app           ┃ samples ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ pretraining                 │ google-doc    │    1221 │
│                             │ google-drive  │    1634 │
│                             │ google-music  │     592 │
│                             │ google-search │    1915 │
│                             │ youtube       │    1077 │
│                             │ __total__     │    6439 │
├─────────────────────────────┼───────────────┼─────────┤
│ retraining-human-triggered  │ google-doc    │      15 │
│                             │ google-drive  │      18 │
│                             │ google-music  │      15 │
│                             │ google-search │      15 │
│                             │ youtube       │      20 │
│                             │ __total__     │      83 │
├─────────────────────────────┼───────────────┼─────────┤
│ retraining-script-triggered │ google-doc    │      30 │
│                             │ google-drive  │      30 │
│                             │ google-music  │      30 │
│                             │ google-search │      30 │
│                             │ youtube       │      30 │
│                             │ __total__     │     150 │
└─────────────────────────────┴───────────────┴─────────┘
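A report of this shape (one row per partition and app, plus a `__total__` row per partition) can be sketched with pandas as below. This is only an illustration, not the code used to generate the table above: the helper `count_samples` is hypothetical, and it assumes the samples sit in a DataFrame with one row per sample and columns named `partition` and `app`. The toy input simply replays the retraining-human-triggered counts listed in the table.

```python
import pandas as pd


def count_samples(df: pd.DataFrame) -> pd.DataFrame:
    """Count samples per (partition, app) and append a __total__ row per partition.

    Hypothetical helper: assumes one row per sample and columns
    named 'partition' and 'app'.
    """
    # One row per (partition, app) pair with the number of samples.
    counts = (
        df.groupby(["partition", "app"])
        .size()
        .rename("samples")
        .reset_index()
    )
    # One extra row per partition holding the sum over its apps.
    totals = (
        counts.groupby("partition", as_index=False)["samples"]
        .sum()
        .assign(app="__total__")
    )
    # A stable sort on partition keeps the per-app rows ahead of the total row.
    return (
        pd.concat([counts, totals], ignore_index=True)
        .sort_values("partition", kind="mergesort")
        .reset_index(drop=True)
    )


# Toy input replaying the retraining-human-triggered counts from the table.
human_counts = {
    "google-doc": 15,
    "google-drive": 18,
    "google-music": 15,
    "google-search": 15,
    "youtube": 20,
}
rows = [
    {"partition": "retraining-human-triggered", "app": app}
    for app, n in human_counts.items()
    for _ in range(n)
]
report = count_samples(pd.DataFrame(rows))
print(report.to_string(index=False))
```

The total is computed from the per-app counts rather than from the raw rows, so the `__total__` row always agrees with the rows printed above it.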
First training split¶
Output
human test split¶
This is equivalent to the retraining-human-triggered partition of the unfiltered version.
Output
script test split¶
This is equivalent to the retraining-script-triggered partition of the unfiltered version.