CHIMPS Lab - Detecting Offensive Tweets via Topical Feature Discovery over a Large Scale Twitter Corpus

Authors

Guang Xiang, Bin Fan, Ling Wang, Jason Hong, and Carolyn P. Rose

Venue

Conference on Information and Knowledge Management (CIKM)

Published

Work in Progress

Abstract

In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content in Twitter. Our approach exploits linguistic regularities in profane language via statistical topic modeling on a huge Twitter corpus, and detects offensive tweets using these automatically generated features. Our approach performs competitively with a variety of machine learning (ML) algorithms. For instance, our approach achieves a true positive rate (TP) of 75.1% over 4029 testing tweets using Logistic Regression, significantly outperforming the popular keyword matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline at about 3.77%. Our approach provides an alternative to large scale hand annotation efforts required by fully supervised learning approaches.
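The abstract compares classifiers by true positive rate and false positive rate against a keyword matching baseline. A minimal sketch of how such a baseline and the two rates could be computed follows. The keyword list, the tokenizer, the function names, and the toy tweets are all hypothetical placeholders for illustration. They are not the paper's lexicon, data, or code.

```python
import re

# Hypothetical placeholder lexicon; the paper's actual keyword list is not given here.
OFFENSIVE_KEYWORDS = {"badword", "slur"}


def keyword_baseline(tweet, keywords=OFFENSIVE_KEYWORDS):
    """Flag a tweet as offensive if any lowercase token is in the keyword list."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return any(tok in keywords for tok in tokens)


def tp_fp_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy labeled examples (invented for illustration only).
labeled_tweets = [
    ("you are such a badword", True),
    ("that b4dword again", True),  # obfuscated spelling slips past exact matching
    ("lovely weather today", False),
    ("the word slur appears in this linguistics note", False),  # benign mention
]

y_true = [label for _, label in labeled_tweets]
y_pred = [keyword_baseline(text) for text, _ in labeled_tweets]
tpr, fpr = tp_fp_rates(y_true, y_pred)
```

On this toy data the baseline misses the obfuscated spelling and flags the benign mention, which are the two kinds of error that the TP and FP figures in the abstract measure.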

Tags

machine learning, topic modeling, twitter

Files

Paper