Gaoussou Youssouf Kebe1, Padraig Higgins1, Patrick Jenkins1, Kasra Darvish1, Rishabh Sachdeva1, Ryan Barron1, John Winder1, 3, Don Engel1, Edward Raff1, 2, Francis Ferraro1, Cynthia Matuszek1
1 University of Maryland, Baltimore County (UMBC)
2 Booz Allen Hamilton
3 Johns Hopkins Applied Physics Laboratory
The Grounded Language Dataset, or GoLD, is a grounded language learning dataset in four modalities: RGB, depth, text, and speech. The data contains 207 instances of 47 object classes drawn from five high-level categories: food, home, medical, office, and tool. Each instance is captured from multiple angles for a total of 825 images. Text and speech descriptions were collected using Amazon Mechanical Turk (AMT), for a total of 16,500 text descriptions and 16,500 speech descriptions.
The data is intended for use in multimodal grounded language acquisition tasks for domestic robots and for testing algorithmic differences between the domains.
The dataset consists of a directory of images, a directory of wav files, and two tsv files with descriptions. Each image filename is formatted as <object name>_<instance number>_<frame number>. wav files are named <object name>_<instance number>_<frame number>_<description number>.
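As a quick illustration of this naming scheme, the sketch below splits a filename back into its parts. The helper name parse_name is hypothetical and not part of the dataset release; it peels numeric fields off the right because object names themselves may contain underscores (e.g. allen_wrench).

from pathlib import Path

def parse_name(filename: str) -> dict:
    """Split a GoLD file name into object, instance, frame (and description for wavs)."""
    stem = Path(filename).stem          # drop the .png / .wav extension
    parts = stem.split("_")
    numbers = []
    while parts and parts[-1].isdigit():
        numbers.insert(0, int(parts.pop()))
    info = {"object": "_".join(parts), "instance": numbers[0], "frame": numbers[1]}
    if len(numbers) == 3:               # wav names carry one extra description number
        info["description"] = numbers[2]
    return info

print(parse_name("allen_wrench_1_2.png"))
# {'object': 'allen_wrench', 'instance': 1, 'frame': 2}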
The structure of the images directory looks like:
images
├── RGB
│   ├── allen_wrench
│   │   ├── allen_wrench_1
│   │   │   ├── allen_wrench_1_1.png
│   │   │   ├── allen_wrench_1_2.png
│   │   │   └── ...
│   │   ├── allen_wrench_2
│   │   │   ├── allen_wrench_2_1.png
│   │   │   ├── allen_wrench_2_2.png
│   │   │   └── ...
│   │   └── ...
│   ├── apple
│   │   ├── apple_1
│   │   │   ├── apple_1_1.png
│   │   │   ├── apple_1_2.png
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── RGB_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── RGB_raw
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth_raw
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── pcd
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── pcd_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
└── pcd_visualization
    ├── allen_wrench
    │   └── ...
    ├── apple
    │   └── ...
    └── ...
images contains 9 folders (a loading sketch follows this list):
- RGB: RGB images with background masked out
- RGB_cropped: RGB images with background cropped out
- RGB_raw: Full RGB images
- depth: Depth images with background masked out
- depth_cropped: Depth images with background cropped out
- depth_raw: Full depth images
- pcd: Full point clouds
- pcd_cropped: Point clouds with background cropped out
- pcd_visualization: Visualizations of the point clouds
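As a minimal sketch of how these folders fit together, assuming a local dataset root named gold, that depth frames share the png format and naming of the RGB frames, and using OpenCV (the load_rgbd helper is hypothetical):

import cv2

DATA_ROOT = "gold"  # assumed local path to the unpacked dataset

def load_rgbd(obj, instance, frame):
    """Load a masked RGB image and its matching masked depth image."""
    item = f"{obj}_{instance}_{frame}"
    rgb_path = f"{DATA_ROOT}/images/RGB/{obj}/{obj}_{instance}/{item}.png"
    depth_path = f"{DATA_ROOT}/images/depth/{obj}/{obj}_{instance}/{item}.png"
    rgb = cv2.imread(rgb_path, cv2.IMREAD_COLOR)
    depth = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)  # keep raw depth values
    if rgb is None or depth is None:
        raise FileNotFoundError(f"missing files for {item}")
    return rgb, depth

rgb, depth = load_rgbd("apple", 1, 2)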
speech.tsv contains 6 fields (a reading sketch follows this list):
- hit_id: AMT hit id
- worker_id: anonymized worker id
- worktime_s: time in seconds to complete the AMT task
- item_id: label for the object, instance, and frame number
- wav: name of the related wav file in the speech directory
- transcription: the Google speech-to-text transcription
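A minimal sketch for reading speech.tsv with pandas and resolving each row's wav file, assuming the tsv and the speech directory sit under the same local root (the gold/ paths and flat speech/ layout are assumptions; adjust if the release nests wav files by object):

import pandas as pd

speech = pd.read_csv("gold/speech.tsv", sep="\t")

# Each row ties one AMT spoken description to an object/instance/frame.
speech["wav_path"] = "gold/speech/" + speech["wav"]

# Example: all spoken descriptions recorded for the first apple instance.
apple_rows = speech[speech["item_id"].str.startswith("apple_1")]
print(apple_rows[["item_id", "wav_path", "transcription"]].head())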
text.tsv contains 5 fields (a second sketch follows this list):
- hit_id: AMT hit id
- worker_id: anonymized worker id
- worktime_s: time in seconds to complete the AMT task
- item_id: label for the object, instance, and frame number
- text: a single text description for this instance
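Under the same assumed layout, text.tsv can be grouped by item_id to pair the written descriptions with the frame they describe (the item_id value shown is a guess at the labeling scheme):

import pandas as pd

text = pd.read_csv("gold/text.tsv", sep="\t")

# Collect every written description for each object/instance/frame label.
by_item = text.groupby("item_id")["text"].apply(list)

# Example: the text descriptions collected for one frame of the first apple instance.
print(by_item.get("apple_1_2", []))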
Video files are available upon request.
If you use GoLD in your work, please cite:
@inproceedings{
kebe2021a,
title={A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning},
author={Gaoussou Youssouf Kebe and Padraig Higgins and Patrick Jenkins and Kasra Darvish and Rishabh Sachdeva and Ryan Barron and John Winder and Donald Engel and Edward Raff and Francis Ferraro and Cynthia Matuszek},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
year={2021},
url={https://openreview.net/forum?id=Yx9jT3fkBaD}
}