Sarcasm Detection using Machine Learning.


I’ll walk you through the task of detecting sarcasm with machine learning using the Python programming language.

The program reads a dataset of headlines labeled as sarcastic or non-sarcastic, maps the numeric labels to human-readable form, and converts the headline text into a matrix of token counts using CountVectorizer.

The data is then split into training and testing sets, and a Bernoulli Naive Bayes classifier is trained on the training set. The model’s accuracy is evaluated on the test set, and it can also predict whether new user-inputted text is sarcastic or not.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

These lines import the necessary libraries:

pandas (pd) for data manipulation.

numpy (np) for numerical operations.

CountVectorizer from sklearn for converting text data into a matrix of token counts.

BernoulliNB from sklearn for implementing the Bernoulli Naive Bayes classifier.

train_test_split from sklearn for splitting data into training and testing sets.

data = pd.read_json(, lines=True)

This line reads JSON data from the given URL into a pandas DataFrame. The lines=True argument specifies that each line in the file is a separate JSON object.
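To make the lines=True behaviour concrete, here is a minimal sketch that parses two made-up records from an in-memory buffer instead of a URL. The sample headlines and the names sample and df are assumptions for illustration only; they are not taken from the real dataset.

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"headline": "local man wins award for doing nothing", "is_sarcastic": 1}\n'
    '{"headline": "city council approves new budget", "is_sarcastic": 0}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object.
df = pd.read_json(sample, lines=True)

print(df.shape)          # number of rows and columns
print(list(df.columns))  # column names taken from the JSON keys
```

Each JSON key becomes a column and each line becomes a row, which is the same layout the rest of this walkthrough relies on.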


data.head()

Displays the first few rows of the DataFrame to give an overview of the data.


data.tail()

Displays the last few rows of the DataFrame to give another overview of the data.


data.columns

Shows the column names of the DataFrame.


data.shape

Displays the dimensions (number of rows and columns) of the DataFrame.

data["is_sarcastic"] = data["is_sarcastic"].map({0: "No Sarcasm", 1: "Sarcasm"})

Maps the values in the is_sarcastic column from 0 and 1 to ‘No Sarcasm’ and ‘Sarcasm’ respectively.
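The mapping step can be tried on a small Series first. This is only a sketch with made-up values; it also shows one thing worth knowing about map: any value that is missing from the dictionary becomes NaN rather than raising an error.

```python
import pandas as pd

label_names = {0: "No Sarcasm", 1: "Sarcasm"}

# Hypothetical numeric labels, mapped to readable strings.
labels = pd.Series([0, 1, 1, 0])
readable = labels.map(label_names)
print(readable.tolist())

# A value that is not a key in the dictionary (2 here) is mapped to NaN.
unexpected = pd.Series([0, 2]).map(label_names)
print(unexpected.isna().tolist())
```

If the real column ever contained values other than 0 and 1, checking for NaN after the mapping would be a quick way to notice.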


data.head()

Displays the first few rows of the DataFrame again to show the updated is_sarcastic column.

data = data[["headline", "is_sarcastic"]]

Selects only the headline and is_sarcastic columns from the DataFrame for further analysis.

x = np.array(data["headline"])
y = np.array(data["is_sarcastic"])

Converts the headline and is_sarcastic columns to numpy arrays, assigning them to x and y respectively.

cv = CountVectorizer()

Creates an instance of CountVectorizer to transform the text data into a matrix of token counts.

X = cv.fit_transform(x)

Fits the CountVectorizer to the headlines and transforms them into a sparse matrix of token counts, assigned to X.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Splits the data into training and testing sets. 80% of the data is used for training and 20% for testing. The random_state=42 ensures reproducibility.
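The effect of test_size and random_state is easy to check on a toy array. This sketch uses ten made-up samples; the toy_ and split variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each, plus alternating labels.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array(["Sarcasm", "No Sarcasm"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Splitting again with the same random_state gives the same partition.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)         # 8 training rows, 2 test rows
print(np.array_equal(X_te, X_te2))    # True: the split is reproducible
```

train_test_split also accepts a stratify argument (for example stratify=toy_y) if you want both sets to keep the same class proportions; the walkthrough above does not use it.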

model = BernoulliNB()

Creates an instance of the Bernoulli Naive Bayes classifier.

model.fit(X_train, y_train)

Trains the model using the training data (X_train and y_train).
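One detail worth knowing: BernoulliNB models each feature as present or absent. With its default binarize=0.0 setting it converts every count greater than zero to 1 before fitting, so how often a word appears in a headline does not matter, only whether it appears. The sketch below checks this on a small made-up count matrix; all names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical token counts for four texts and three words.
counts = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 0, 0],
    [0, 1, 2],
])
labels = np.array(["Sarcasm", "No Sarcasm", "Sarcasm", "No Sarcasm"])

# Fit once on raw counts and once on explicit 0/1 presence indicators.
on_counts = BernoulliNB().fit(counts, labels)
on_binary = BernoulliNB().fit((counts > 0).astype(int), labels)

# Both models learn the same per-class feature probabilities.
same = np.allclose(on_counts.feature_log_prob_, on_binary.feature_log_prob_)
print(same)
```

If word frequency turned out to matter for your data, MultinomialNB would be the count-aware alternative, but that is a different model from the one used in this post.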

print(model.score(X_test, y_test))

Prints the accuracy of the model on the test data.
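For a classifier, score returns the mean accuracy: the fraction of samples whose predicted label matches the true label. The sketch below confirms this on made-up data. It scores on the training rows purely to keep the example short, which is not how a real evaluation should be done.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features and labels.
features = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
targets = np.array(
    ["Sarcasm", "Sarcasm", "No Sarcasm", "No Sarcasm", "Sarcasm", "No Sarcasm"]
)

clf = BernoulliNB().fit(features, targets)

# Accuracy computed by hand: share of predictions equal to the true label.
manual_accuracy = float(np.mean(clf.predict(features) == targets))
reported_accuracy = clf.score(features, targets)

print(manual_accuracy, reported_accuracy)
```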

user = input("Enter the text here: ")

Prompts the user to enter a piece of text for sarcasm detection.

data = cv.transform([user]).toarray()

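Transforms the user input text into the same token-count format as the training data, using the vocabulary learned earlier. The call to .toarray() converts the sparse result into a dense array. Note that this reuses the name data, so the original DataFrame is no longer available under that name.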

output = model.predict(data)

Uses the trained model to predict whether the user input text is sarcastic or not.


print(output)

Prints the prediction result.
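As an optional variation, not the approach used in the steps above, the vectorizer and classifier can be bundled with scikit-learn's make_pipeline so that raw text goes in and a label comes out. The toy headlines, labels, and the helper name predict_label below are all hypothetical; a model trained on four sentences is only useful for showing the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data, far too small for a real model.
toy_headlines = [
    "area man thrilled to spend weekend doing taxes",
    "nation delighted by yet another monday",
    "city council approves new budget",
    "scientists publish study on ocean temperatures",
]
toy_labels = ["Sarcasm", "Sarcasm", "No Sarcasm", "No Sarcasm"]

# The pipeline vectorizes the text and then classifies it in one call.
pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
pipeline.fit(toy_headlines, toy_labels)


def predict_label(text):
    """Return the predicted label for a single piece of text."""
    return pipeline.predict([text])[0]


result = predict_label("area man thrilled about monday")
print(result)
```

The benefit of this design is that the same vocabulary is guaranteed to be used for training and prediction, since both steps live in one object.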

You can find the dataset here and the Colab notebook here. You can also follow me on GitHub.

Happy Coding!
