I have trained a Sarcasm Detection AI model using Reddit comments. This is how you can do it too.
Requirements:
Google Colab
Reddit API Credentials
Lots of time
Coffee
First we will import the necessary libraries.
import asyncio # Event loop utilities; needed later for asyncio.sleep when retrying.
import asyncpraw # Python Reddit API Wrapper for asynchronous Reddit API interactions.
import pandas as pd # Data manipulation and analysis tool.
import nest_asyncio # Necessary for allowing nested asyncio run loops.
import re # Regular expressions for pattern matching and text manipulation.
from sklearn.model_selection import train_test_split # Splits data into training and testing sets.
from sklearn.feature_extraction.text import TfidfVectorizer # Converts text data into TF-IDF feature vectors.
from sklearn.ensemble import RandomForestClassifier # Random Forest classifier for machine learning.
from sklearn.metrics import accuracy_score, classification_report # Metrics for evaluating model performance.
from imblearn.over_sampling import SMOTE # Oversampling technique for handling class imbalance.
from sklearn.pipeline import Pipeline # Constructs a pipeline of transformations and estimators.
from sklearn.model_selection import GridSearchCV # Performs grid search over specified parameter values.
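The scikit-learn and imblearn imports above are used for model training in later parts of this series. As a rough preview (a minimal sketch on invented toy data, not the article's actual training code), a Pipeline chains TfidfVectorizer with RandomForestClassifier so raw text goes in and predictions come out:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Toy labelled data: 1 = sarcastic, 0 = not sarcastic (illustrative only).
texts = [
    "oh great, another meeting",
    "wow, what a surprise",
    "the meeting starts at noon",
    "the report is attached",
]
labels = [1, 1, 0, 0]

# Chain TF-IDF feature extraction with a Random Forest classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])
clf.fit(texts, labels)

preds = clf.predict(["oh great, more work"])
print(len(preds))  # one prediction per input text
```

Because the pipeline is a single estimator, it can later be dropped into GridSearchCV, and SMOTE from imblearn can be slotted in to balance the sarcastic/non-sarcastic classes before training.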
Connecting to Reddit API
Get your API credentials from https://www.reddit.com/prefs/apps
client_id = 'your_client_id'
client_secret = 'your_client_secret'
user_agent = 'MyRedditApp/0.1 by your_username'
reddit = asyncpraw.Reddit(client_id=client_id,
                          client_secret=client_secret,
                          user_agent=user_agent)
This code sets up authentication credentials (client_id, client_secret, user_agent) to create a Reddit API connection using asyncpraw. The Reddit object initializes a connection to Reddit's API, allowing the Python script to interact with Reddit, retrieve data, and perform actions programmatically on the platform.
Initialization and Setup
nest_asyncio.apply()
This line ensures that asyncio can be used in a nested manner, which is necessary when running asynchronous code in environments such as Google Colab or Jupyter that already have an event loop running.
Asynchronous Function Definition
async def collect_reddit_comments(subreddit_name, keyword, limit):
    reddit = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent
    )
Defines an asynchronous function collect_reddit_comments to retrieve comments from Reddit. It initializes a Reddit instance using asyncpraw, passing in credentials (client_id, client_secret, user_agent) for API authentication.
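For readers new to async/await: an async def function returns a coroutine that must be awaited (or driven by an event loop) before it actually runs. A minimal standalone illustration, unrelated to Reddit:

```python
import asyncio

async def fetch_items(limit):
    # Stand-in for an API call: collect items one at a time,
    # yielding control back to the event loop between items.
    items = []
    for i in range(limit):
        await asyncio.sleep(0)  # cooperative yield point
        items.append(i)
    return items

# asyncio.run() creates an event loop and drives the coroutine to completion.
result = asyncio.run(fetch_items(3))
print(result)  # [0, 1, 2]
```

Inside collect_reddit_comments, the same await keyword is what lets the function pause on network calls without blocking the rest of the program.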
Fetching Subreddit and Comments
    subreddit = await reddit.subreddit(subreddit_name)
    comments = []
    count = 0
    after = None
Asynchronously fetches the subreddit object based on subreddit_name. Initializes an empty list comments to store comment data, and sets a counter (count) and a pagination marker (after) for comment retrieval.
Looping Through Submissions and Comments
    try:
        async for submission in subreddit.search(keyword, limit=None, params={'after': after}):
            await submission.load()
            submission.comment_limit = 0
            await submission.comments.replace_more(limit=0)
Enters a loop to fetch submissions matching keyword within the specified subreddit. Asynchronously loads each submission's details and expands its comment tree, resolving nested "load more comments" placeholders (replace_more).
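Conceptually, Reddit comments form a tree (replies nested under parents); replace_more(limit=0) strips the "load more comments" placeholders so the whole tree can then be walked flat. A toy sketch of that flattening using plain dicts (not asyncpraw objects):

```python
# A toy comment tree: each node has a body and a list of replies.
tree = {
    "body": "top-level comment",
    "replies": [
        {"body": "first reply", "replies": []},
        {"body": "second reply", "replies": [
            {"body": "nested reply", "replies": []},
        ]},
    ],
}

def flatten(node):
    # Depth-first walk, collecting every comment body in the subtree.
    bodies = [node["body"]]
    for reply in node["replies"]:
        bodies.extend(flatten(reply))
    return bodies

print(flatten(tree))  # ['top-level comment', 'first reply', 'second reply', 'nested reply']
```

asyncpraw's submission.comments.list() performs the equivalent flattening on real comment forests.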
Collecting and Storing Comments
            for comment in submission.comments.list():
                if isinstance(comment, asyncpraw.models.Comment):
                    author_name = comment.author.name if comment.author else '[deleted]'
                    comments.append([comment.body, author_name, comment.created_utc])
                    count += 1
                    if count >= limit:
                        break
            after = submission.id  # Sets the 'after' parameter for pagination
            if count >= limit:
                break
Iterates through each comment in the submission, checking that it is a valid Comment object. Collects the comment body, author name, and creation time (created_utc). The count and limit variables stop the loop once the requested number of comments has been collected.
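The count/limit bookkeeping above can be sketched in isolation: stop appending rows once the cap is reached, even if more data is available. Hypothetical stand-in rows, not real Reddit comments:

```python
# Hypothetical stream of (body, author, created_utc) rows.
incoming = [
    ("nice weather we're having", "alice", 1700000000),
    ("sure, that'll work", "bob", 1700000100),
    ("totally believable", None, 1700000200),  # deleted author
    ("great idea", "carol", 1700000300),
]

limit = 3
comments = []
count = 0
for body, author, created_utc in incoming:
    author_name = author if author else "[deleted]"  # same [deleted] fallback as above
    comments.append([body, author_name, created_utc])
    count += 1
    if count >= limit:
        break

print(count)           # 3
print(comments[2][1])  # [deleted]
```

The fourth row is never touched: the break fires as soon as count reaches limit, which is exactly how the collector avoids overshooting the requested 5000 comments.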
Handling API Exceptions
    except Exception as e:  # Broad catch for Reddit API errors (e.g. rate limiting)
        print(f"API exception occurred: {e}")
        wait_time = 60  # Wait for 1 minute before retrying
        print(f"Waiting for {wait_time} seconds before retrying...")
        await asyncio.sleep(wait_time)
Catches exceptions that may occur during Reddit API interactions (the code catches Exception broadly; narrowing it to the specific asyncprawcore exception classes would be safer). Prints the exception message and waits a minute (wait_time) before fetching resumes.
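The wait-and-retry idea can be generalized into a small retry helper. A sketch with a fake flaky coroutine and a tiny delay (the article's code uses a fixed 60-second wait); with_retries and flaky_fetch are hypothetical names for illustration:

```python
import asyncio

async def with_retries(coro_factory, retries=3, wait_time=0.01):
    # Call the coroutine factory, pausing and retrying on failure.
    for attempt in range(retries):
        try:
            return await coro_factory()
        except Exception as e:
            print(f"API exception occurred: {e}")
            await asyncio.sleep(wait_time)  # back off before retrying
    raise RuntimeError("all retries failed")

calls = {"n": 0}

async def flaky_fetch():
    # Fails on the first call, succeeds on the second.
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("rate limited")
    return ["a comment"]

result = asyncio.run(with_retries(flaky_fetch))
print(result)      # ['a comment']
print(calls["n"])  # 2
```

Wrapping the fetch in a helper like this keeps the retry policy in one place instead of scattering sleep calls through the collection loop.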
Returning Results
    return comments[:limit]
Returns the list of collected comments, sliced to the specified limit so that only the required number of comments is returned.
Main Function to Execute Collection
async def main():
    comments = await collect_reddit_comments('sarcasm', 'sarcastic', limit=5000)  # Adjust limit as needed
    df = pd.DataFrame(comments, columns=['comment', 'author', 'created_utc'])
    df.to_csv('reddit_comments.csv', index=False)
    print(f"Total comments collected: {len(df)}")
    print(df.head())
Defines an asynchronous main function to orchestrate the comment collection process. Calls collect_reddit_comments with subreddit_name='sarcasm', keyword='sarcastic', and limit=5000 (adjust as needed). Converts the collected comments into a pandas DataFrame (df), saves it as a CSV file (reddit_comments.csv), and prints summary information about the collected data.
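The DataFrame-to-CSV step can be verified on toy rows. This sketch round-trips through an in-memory buffer instead of writing reddit_comments.csv to disk (illustrative data only):

```python
import io
import pandas as pd

rows = [
    ["oh, fantastic", "alice", 1700000000],
    ["what could go wrong", "[deleted]", 1700000100],
]
df = pd.DataFrame(rows, columns=["comment", "author", "created_utc"])

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Reading it back recovers the same columns and row count.
df2 = pd.read_csv(buf)
print(len(df2))           # 2
print(list(df2.columns))  # ['comment', 'author', 'created_utc']
```

index=False matters here: without it, pandas writes the row index as an extra unnamed column that would pollute the cleaning step in Part 2.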
Running the Main Function
asyncio.run(main())
Executes the main function asynchronously, initiating the process of collecting Reddit comments, processing them into a DataFrame, saving them to a CSV file, and printing the number of comments collected along with a preview of the data.
Read Part 2 – Sarcasm Detection From Reddit Comments: Cleaning & Saving the Data.