Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments – Part 1

I have trained a Sarcasm Detection AI model using Reddit comments. This is how you can do it too.

Requirements:
Google Colab
Reddit API Credentials
Lots of time
Coffee

First we will import the necessary libraries.

import asyncio # For asynchronous programming in Python.
import asyncpraw # Python Reddit API Wrapper for asynchronous Reddit API interactions.
import pandas as pd # Data manipulation and analysis tool.
import nest_asyncio # Necessary for allowing nested asyncio run loops.
import re # Regular expressions for pattern matching and text manipulation.
from sklearn.model_selection import train_test_split # Splits data into training and testing sets.
from sklearn.feature_extraction.text import TfidfVectorizer # Converts text data into TF-IDF feature vectors.
from sklearn.ensemble import RandomForestClassifier # Random Forest classifier for machine learning.
from sklearn.metrics import accuracy_score, classification_report # Metrics for evaluating model performance.
from imblearn.over_sampling import SMOTE # Oversampling technique for handling class imbalance.
from sklearn.pipeline import Pipeline # Constructs a pipeline of transformations and estimators.
from sklearn.model_selection import GridSearchCV # Performs grid search over specified parameter values.

Connecting to Reddit API
Get your API credentials from https://www.reddit.com/prefs/apps

`client_id = 'your_client_id'
client_secret = 'your_client_secret'
user_agent = 'MyRedditApp/0.1 by your_username'

reddit = asyncpraw.Reddit(client_id=client_id,
                          client_secret=client_secret,
                          user_agent=user_agent)`

This code sets up the authentication credentials (client_id, client_secret, user_agent) and uses them to create a Reddit API connection. It uses asyncpraw rather than praw, because asyncpraw is the library imported above. The Reddit object lets the Python script query Reddit and retrieve data programmatically. The collection function below builds its own client from the same three credentials.
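If you would rather not paste secrets into a shared notebook, one option is to read them from environment variables. The following is a minimal sketch, not part of the original tutorial; the helper name `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` are hypothetical and can be anything you choose.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from a mapping of environment variables.

    The variable names are assumptions made for this sketch.
    """
    keys = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a clear message if anything is missing or empty.
    missing = [var for var in keys.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in keys.items()}
```

The returned dictionary uses the same keyword names as the client constructor, so it could be passed as `asyncpraw.Reddit(**load_reddit_credentials())`.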

Initialization and Setup

`nest_asyncio.apply()`

This line patches asyncio so that it can be used in a nested manner. That is necessary in environments that already have an event loop running, such as Google Colab and other Jupyter-style notebooks.
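To see the problem that nest_asyncio works around, here is a small sketch using only the standard library. It is meant to be run as a plain Python script, and the function names `inner` and `outer` are made up for the illustration. Calling `asyncio.run()` while a loop is already running raises a `RuntimeError`, which is the situation a notebook puts you in.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here.
    coro = inner()
    try:
        asyncio.run(coro)  # Not allowed without nest_asyncio
    except RuntimeError as exc:
        coro.close()  # Avoid a "never awaited" warning
        return f"nested run failed: {exc}"
    return "nested run succeeded"


result = asyncio.run(outer())
print(result)
```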

Asynchronous Function Definition

`async def collect_reddit_comments(subreddit_name, keyword, limit=1000):
    reddit = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent
    )`

Defines an asynchronous function collect_reddit_comments to retrieve comments from Reddit. It initializes a Reddit instance using asyncpraw, passing in credentials (client_id, client_secret, user_agent) for API authentication.

Fetching Subreddit and Comments

`    subreddit = await reddit.subreddit(subreddit_name)
    comments = []
    count = 0
    after = None`

Asynchronously fetches the subreddit object based on subreddit_name. Initializes an empty list comments to store comment data, and sets counters (count) and pagination marker (after) for comment retrieval.

Looping Through Submissions and Comments

`    while len(comments) < limit:
        try:
            async for submission in subreddit.search(keyword, limit=None, params={'after': after}):
                await submission.load()
                submission.comment_limit = 0
                await submission.comments.replace_more(limit=0)  # replace_more is a coroutine in asyncpraw and must be awaited`

Enters a loop to fetch submissions matching keyword within the specified subreddit. Asynchronously loads each submission along with its comment tree. Calling replace_more with limit=0 removes the "load more comments" placeholders without fetching what is behind them, so only the comments that were already loaded are kept.

Collecting and Storing Comments

`                for comment in submission.comments.list():
                    if isinstance(comment, asyncpraw.models.Comment):
                        author_name = comment.author.name if comment.author else '[deleted]'
                        comments.append([comment.body, author_name, comment.created_utc])
                        count += 1

                        if count >= limit:
                            break

                after = submission.fullname  # Reddit's 'after' parameter expects a fullname such as 't3_abc123', not a bare id

                if count >= limit:
                    break`

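Iterates through each comment in the submission, checking that it is a real Comment object. Collects the comment body, the author name (or '[deleted]' when the author is gone), and the creation time (created_utc). The count and limit checks stop both loops once the requested number of comments has been collected. Two caveats: depending on your asyncpraw version, comments.list() may itself be a coroutine, in which case iterate over `await submission.comments.list()` instead. Also, if the search runs out of submissions before limit is reached, the outer while loop simply starts the search again, so consider adding a break for that case.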
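The row-building logic can be tried without touching the Reddit API. The sketch below pulls it into a standalone helper and feeds it stand-in objects built with `SimpleNamespace`. The helper name `comments_to_rows` and the sample comments are invented for the illustration; only the three fields mirror what the loop above reads.

```python
from types import SimpleNamespace


def comments_to_rows(comment_objects, limit):
    """Turn comment-like objects into [body, author, created_utc] rows."""
    rows = []
    for comment in comment_objects:
        if len(rows) >= limit:
            break
        # Deleted accounts come back with no author object.
        author = getattr(comment, "author", None)
        author_name = author.name if author else "[deleted]"
        rows.append([comment.body, author_name, comment.created_utc])
    return rows


# Stand-in data; these are not real Reddit comments.
fake_comments = [
    SimpleNamespace(body="Oh great, another Monday.",
                    author=SimpleNamespace(name="alice"),
                    created_utc=1700000000.0),
    SimpleNamespace(body="Sure, that will work.",
                    author=None,
                    created_utc=1700000060.0),
    SimpleNamespace(body="Third comment",
                    author=SimpleNamespace(name="bob"),
                    created_utc=1700000120.0),
]

rows = comments_to_rows(fake_comments, limit=2)
```

Keeping this step separate from the network code makes it easy to check the '[deleted]' handling and the limit before spending API calls.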

Handling API Exceptions

`        except asyncpraw.exceptions.APIException as e:
            print(f"API exception occurred: {e}")
            wait_time = 60  # Wait for 1 minute before retrying
            print(f"Waiting for {wait_time} seconds before retrying...")
            await asyncio.sleep(wait_time)`

Catches and handles API exceptions that may occur during Reddit API interactions. Prints the exception message, waits for a minute (wait_time) before retrying, and then resumes fetching comments.
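The fixed one-minute wait works, but it retries forever. As an alternative (a different technique from the one used above, not the author's method), here is a rough sketch of a retry helper with exponential backoff and a capped number of attempts. The name `retry_async` and its parameters are hypothetical.

```python
import asyncio


async def retry_async(operation, retries=3, base_delay=1.0,
                      exceptions=(Exception,)):
    """Await operation(), retrying on failure with exponential backoff.

    operation must be a zero-argument callable returning an awaitable.
    """
    for attempt in range(retries):
        try:
            return await operation()
        except exceptions:
            if attempt == retries - 1:
                raise  # Out of attempts: let the caller see the error
            # Wait base_delay, then 2x, then 4x, and so on.
            await asyncio.sleep(base_delay * 2 ** attempt)
```

With asyncpraw you might pass `exceptions=(asyncpraw.exceptions.APIException,)` and a `base_delay` of several seconds; suitable values depend on how Reddit is rate-limiting you.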

Returning Results

`    return comments[:limit]  # Returns up to 'limit' comments`

Returns the collected comments, truncated to at most limit entries.

Main Function to Execute Collection

`async def main():
    comments = await collect_reddit_comments('sarcasm', 'sarcastic', limit=5000)  # Adjust limit as needed
    df = pd.DataFrame(comments, columns=['comment', 'author', 'created_utc'])
    df.to_csv('reddit_comments.csv', index=False)
    print(f"Total comments collected: {len(df)}")
    print(df.head())`

Defines an asynchronous main function to orchestrate the comment collection process. Calls collect_reddit_comments with subreddit_name='sarcasm', keyword='sarcastic', and limit=5000 (adjust as needed). Converts the collected comments into a Pandas DataFrame (df), saves it as a CSV file (reddit_comments.csv), and prints the number of comments collected along with the first few rows.
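The created_utc column holds Unix timestamps (seconds since 1970, in UTC), which are hard to read in the printed preview. A small sketch of converting one value with the standard library follows; the function name `utc_timestamp_to_iso` is made up for this example.

```python
from datetime import datetime, timezone


def utc_timestamp_to_iso(created_utc):
    """Convert a Unix timestamp in seconds to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()


print(utc_timestamp_to_iso(1700000000.0))
```

For the whole DataFrame, pandas offers `pd.to_datetime(df['created_utc'], unit='s', utc=True)`, which does the same conversion column-wise.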

Running the Main Function

`await main()`

Executes the main function asynchronously, initiating the process of collecting Reddit comments, processing them into a DataFrame, saving them to a CSV file, and providing feedback on the number of comments collected and a preview of the data.
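A bare `await main()` only works where top-level await is supported, such as a Colab or Jupyter cell. If you move the code into a regular .py script, the usual pattern is `asyncio.run()`. The sketch below uses a stand-in `main` that does no real work, purely to show the entry point.

```python
import asyncio


async def main():
    # Stand-in for the tutorial's main(); the real one collects comments.
    await asyncio.sleep(0)
    return "finished"


if __name__ == "__main__":
    # Creates an event loop, runs main() to completion, then closes the loop.
    print(asyncio.run(main()))
```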

Read Part 2 – Sarcasm Detection From Reddit Comments: Cleaning & Saving The Data
