Stanford Internet Research Data Repository

Despite many efforts to automatically identify toxic comments online (including sexual harassment, threats, and identity attacks), modern systems fail to generalize to the diverse concerns of Internet users. This dataset consists of 107,620 social media comments annotated by 17,280 unique participants. It was collected to understand how user expectations for what constitutes toxic content differ across demographics, beliefs, and personal experiences. The dataset is encrypted; please contact Deepak Kumar for the password.

Study Details

Study: Designing Toxic Content Classification for a Diversity of Perspectives
USENIX Symposium on Usable Privacy and Security (SOUPS) 2021
Authors: Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, Michael Bailey
Contact: Deepak Kumar

Dataset Details

107,620 social media comments, each labeled by five annotators.

File Download

File Name	MetaData	SHA-1 Fingerprint	Size	Updated At
toxicity_ratings.zip	unavailable	unavailable	14.98 MB	2021-06-09
