Relevant Datasets - weiyinc11/HateSpeechModerationTwitch GitHub Wiki
Malicious URLs
- https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset/data
- https://github.com/mitchellkrogza/The-Big-List-of-Hacked-Malware-Web-Sites/blob/master/.dev-tools/_strip_domains/domains.txt
- https://research.aalto.fi/en/datasets/phishstorm-phishing-legitimate-url-dataset
- https://www.virustotal.com/gui/home/url
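
One way a domain list such as `domains.txt` above might be used is to check links in chat messages against it. The sketch below is only an illustration, not part of the project: it assumes the list holds one domain per line, and the helper names (`load_blocklist`, `extract_host`, `is_blocked`) are hypothetical.

```python
from urllib.parse import urlsplit


def load_blocklist(lines):
    """Build a set of lowercase domains from an iterable of text lines.

    Assumes one domain per line; blank lines and '#' comments are skipped.
    """
    domains = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            domains.add(entry)
    return domains


def extract_host(url):
    """Return the lowercase hostname of a URL, or None if none is found."""
    # urlsplit only recognises a hostname after "//", so bare links such as
    # "example.com/path" get a "//" prefix first.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # Malformed input (for example, unbalanced IPv6 brackets).
        return None
    return host.lower() if host else None


def is_blocked(url, blocklist):
    """True if the URL's host, or any parent domain of it, is in the blocklist."""
    host = extract_host(url)
    if host is None:
        return False
    parts = host.split(".")
    # Try a.b.example.com, then b.example.com, then example.com.
    # The bare top-level label is skipped unless the host has only one label.
    depth = max(1, len(parts) - 1)
    return any(".".join(parts[i:]) in blocklist for i in range(depth))
```

In practice the list would be read from a file, for example `load_blocklist(open("domains.txt", encoding="utf-8"))`. Matching parent domains is a design choice made here so that subdomains of a listed site are also flagged.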
Swear Words
Hate Speech
- https://github.com/aymeam/Datasets-for-Hate-Speech-Detection?tab=readme-ov-file
-
- used this dataset (SBIC) from 2020.
- 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups. Focuses on implications and bias. Hand-labeled with free-text justifications. Limited coverage of in-group messages.
- They also used this dataset (HateXplain)
- 20k posts from Twitter and Gab, annotated by Amazon Mechanical Turk (MTurk) workers. The annotations include the groups referenced.
- used this dataset (SBIC) from 2020.
- This paper combines 3 datasets:
- This paper expands HateXplain