Readme

Hate Speech on Twitter: A Natural Language Processing Challenge


Want to look at the “official” presentation that goes with this project? Click here!

Overall Summary

About the Dataset:

The aim of this project is to determine whether tweets in a set of 30,000 contain hate speech relating to sexism and racism, so that predictive models can be built to identify such language in the future. Tweets are pre-labeled as either 0 (not containing hate speech) or 1 (containing hate speech).

This dataset is available online as a part of the Analytics Vidhya challenge series.

Skills Used in This Project:
Throughout the course of this project I had the chance to strengthen the following skills:
  • Data Wrangling: The dataset came with only three features: id, label, and tweet. It was up to me to break down the info and get it to make sense. This involved cleaning the tweets for extraneous characters and symbols, creating my own “stop word” list of common filler words that didn’t have much impact on the sentiment of the tweet, and analyzing the accuracy of the labels.
  • Natural Language Processing: Since this was my first time experimenting with Natural Language Processing, I really dove into how keywords can (or can’t) signify semantics (meaning) in text. Using scikit-learn, I processed the tweets into individual word features and looked at the relationship between groups of words and meaning.
  • Machine Learning/Predictive Modeling: While working through this project, I quickly learned that using base keywords to create labels for machine learning can be completely inaccurate! Fun! This led me to create my own labels for the tweets using my own (ever-growing) list of keywords.
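
As a rough illustration of the wrangling and keyword-labeling steps described above, here is a minimal sketch that uses only Python's standard library rather than scikit-learn. The stop-word list, the keyword list (with the placeholder term "badword"), the helper names, and the sample tweets are all made up for illustration; they are not the project's actual lists, code, or data.

```python
import re
from collections import Counter

# Hypothetical lists for illustration only; the project's real
# stop-word and keyword lists are not shown in this README.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "for"}
FLAG_KEYWORDS = {"badword"}  # placeholder term


def clean_tweet(text):
    """Lowercase a tweet, strip links, handles, and symbols, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)          # user handles
    text = re.sub(r"[^a-z\s]", " ", text)      # symbols, digits, emoji
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def keyword_label(tokens):
    """Return 1 if any token appears in the keyword list, else 0."""
    return int(any(tok in FLAG_KEYWORDS for tok in tokens))


# Made-up sample tweets
tweets = [
    "@user Thanks for the #lovely day!! https://example.com",
    "That is such a badword thing to say...",
]

cleaned = [clean_tweet(t) for t in tweets]
labels = [keyword_label(toks) for toks in cleaned]

# Word frequencies across all cleaned tweets (a simple bag of words)
word_counts = Counter(tok for toks in cleaned for tok in toks)
```

A keyword match like this is exactly the kind of blunt rule that can mislabel tweets, which is why the keyword list needs to keep growing and the labels need checking by hand.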

Prerequisites/Project Process

Exploratory Data Analysis (EDA)
Please see the provided presentation for a full break-down of my EDA process and (very pretty) visualizations!


Testing/Main Purpose of Project
Can we use our pre-labeled dataset to predict whether a tweet is hate speech?
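
To make the question concrete, here is a minimal sketch of one way to predict a 0/1 label from word features: a tiny multinomial Naive Bayes classifier written in plain Python. This is a stand-in chosen for illustration, not necessarily the model used in this project, and the function names, the toy training data, and the placeholder term "badword" are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {tok for counts in word_counts.values() for tok in counts}
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-posterior for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up, already-cleaned toy data: 0 = not hate speech, 1 = hate speech
train_docs = [
    ["lovely", "day"],
    ["thanks", "friend"],
    ["badword", "people"],
    ["such", "badword"],
]
train_labels = [0, 0, 1, 1]

model = train_naive_bayes(train_docs, train_labels)
prediction = predict(model, ["badword", "day"])
```

On real data the same idea would be trained on one portion of the labeled tweets and checked against a held-out portion, since a model is only as trustworthy as the labels it learns from.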

Lecture

Below is the slide deck for this project.

Code

Click here to see the original GitHub repository for this code!
