I have trained a Sarcasm Detection AI model using Reddit comments. This is how you can do it too.
Requirements:
Google Colab
Reddit API Credentials
Lots of time
Coffee
First we will import the necessary libraries.
import asyncio # Event loop utilities; needed later for asyncio.sleep when retrying.
import asyncpraw # Python Reddit API Wrapper for asynchronous Reddit API interactions.
import pandas as pd # Data manipulation and analysis tool.
import nest_asyncio # Necessary for allowing nested asyncio run loops.
import re # Regular expressions for pattern matching and text manipulation.
from sklearn.model_selection import train_test_split # Splits data into training and testing sets.
from sklearn.feature_extraction.text import TfidfVectorizer # Converts text data into TF-IDF feature vectors.
from sklearn.ensemble import RandomForestClassifier # Random Forest classifier for machine learning.
from sklearn.metrics import accuracy_score, classification_report # Metrics for evaluating model performance.
from imblearn.over_sampling import SMOTE # Oversampling technique for handling class imbalance.
from sklearn.pipeline import Pipeline # Constructs a pipeline of transformations and estimators.
from sklearn.model_selection import GridSearchCV # Performs grid search over specified parameter values.
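The scikit-learn and imblearn imports above are used for model training in later parts of this series. As a rough preview (a minimal sketch on invented toy data, not the article's actual training code), a Pipeline chains TfidfVectorizer with RandomForestClassifier so raw text goes in and predictions come out:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Toy labelled data: 1 = sarcastic, 0 = not sarcastic (illustrative only).
texts = [
    "oh great, another meeting",
    "wow, what a surprise",
    "the meeting starts at noon",
    "the report is attached",
]
labels = [1, 1, 0, 0]

# Chain TF-IDF feature extraction with a Random Forest classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])
clf.fit(texts, labels)

preds = clf.predict(["oh great, more work"])
print(len(preds))  # one prediction per input text
```

Because the pipeline is a single estimator, it can later be dropped into GridSearchCV, and SMOTE from imblearn can be slotted in to balance the sarcastic/non-sarcastic classes before training.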
Connecting to Reddit API
Get your API credentials from https://www.reddit.com/prefs/apps
client_id = 'your_client_id'
client_secret = 'your_client_secret'
user_agent = 'MyRedditApp/0.1 by your_username'
reddit = asyncpraw.Reddit(client_id=client_id,
                          client_secret=client_secret,
                          user_agent=user_agent)
This code sets up authentication credentials (client_id, client_secret, user_agent) to create a Reddit API connection using asyncpraw. The Reddit object initializes a connection to Reddit's API, allowing the Python script to interact with Reddit, retrieve data, and perform actions programmatically on the platform.
Initialization and Setup
nest_asyncio.apply()
This line ensures that asyncio can be used in a nested manner, which is necessary when running asynchronous code in environments such as Google Colab or Jupyter that already have an event loop running.
Asynchronous Function Definition
async def collect_reddit_comments(subreddit_name, keyword, limit):
    reddit = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent
    )
Defines an asynchronous function collect_reddit_comments to retrieve comments from Reddit. It initializes a Reddit instance using asyncpraw, passing in credentials (client_id, client_secret, user_agent) for API authentication.
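For readers new to async/await: an async def function returns a coroutine that must be awaited (or driven by an event loop) before it actually runs. A minimal standalone illustration, unrelated to Reddit:

```python
import asyncio

async def fetch_items(limit):
    # Stand-in for an API call: collect items one at a time,
    # yielding control back to the event loop between items.
    items = []
    for i in range(limit):
        await asyncio.sleep(0)  # cooperative yield point
        items.append(i)
    return items

# asyncio.run() creates an event loop and drives the coroutine to completion.
result = asyncio.run(fetch_items(3))
print(result)  # [0, 1, 2]
```

Inside collect_reddit_comments, the same await keyword is what lets the function pause on network calls without blocking the rest of the program.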
Fetching Subreddit and Comments
    subreddit = await reddit.subreddit(subreddit_name)
    comments = []
    count = 0
    after = None
Asynchronously fetches the subreddit object based on subreddit_name. Initializes an empty list comments to store comment data, and sets a counter (count) and a pagination marker (after) for comment retrieval.
Looping Through Submissions and Comments
    try:
        async for submission in subreddit.search(keyword, limit=None, params={'after': after}):
            await submission.load()
            submission.comment_limit = 0
            await submission.comments.replace_more(limit=0)
Enters a loop to fetch submissions matching keyword within the specified subreddit. Asynchronously loads each submission's details and expands its comment tree, resolving nested "load more comments" placeholders (replace_more).
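Conceptually, Reddit comments form a tree (replies nested under parents); replace_more(limit=0) strips the "load more comments" placeholders so the whole tree can then be walked flat. A toy sketch of that flattening using plain dicts (not asyncpraw objects):

```python
# A toy comment tree: each node has a body and a list of replies.
tree = {
    "body": "top-level comment",
    "replies": [
        {"body": "first reply", "replies": []},
        {"body": "second reply", "replies": [
            {"body": "nested reply", "replies": []},
        ]},
    ],
}

def flatten(node):
    # Depth-first walk, collecting every comment body in the subtree.
    bodies = [node["body"]]
    for reply in node["replies"]:
        bodies.extend(flatten(reply))
    return bodies

print(flatten(tree))  # ['top-level comment', 'first reply', 'second reply', 'nested reply']
```

asyncpraw's submission.comments.list() performs the equivalent flattening on real comment forests.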
Collecting and Storing Comments
            for comment in submission.comments.list():
                if isinstance(comment, asyncpraw.models.Comment):
                    author_name = comment.author.name if comment.author else '[deleted]'
                    comments.append([comment.body, author_name, comment.created_utc])
                    count += 1
                    if count >= limit:
                        break
            after = submission.id  # Sets the 'after' parameter for pagination
            if count >= limit:
                break
Iterates through each comment in the submission, checking that it is a valid Comment object. Collects the comment body, author name, and creation time (created_utc). The count and limit variables stop the loop once the requested number of comments has been collected.
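The count/limit bookkeeping above can be sketched in isolation: stop appending rows once the cap is reached, even if more data is available. Hypothetical stand-in rows, not real Reddit comments:

```python
# Hypothetical stream of (body, author, created_utc) rows.
incoming = [
    ("nice weather we're having", "alice", 1700000000),
    ("sure, that'll work", "bob", 1700000100),
    ("totally believable", None, 1700000200),  # deleted author
    ("great idea", "carol", 1700000300),
]

limit = 3
comments = []
count = 0
for body, author, created_utc in incoming:
    author_name = author if author else "[deleted]"  # same [deleted] fallback as above
    comments.append([body, author_name, created_utc])
    count += 1
    if count >= limit:
        break

print(count)           # 3
print(comments[2][1])  # [deleted]
```

The fourth row is never touched: the break fires as soon as count reaches limit, which is exactly how the collector avoids overshooting the requested 5000 comments.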
Handling API Exceptions
    except Exception as e:  # Broad catch for Reddit API errors (e.g. rate limiting)
        print(f"API exception occurred: {e}")
        wait_time = 60  # Wait for 1 minute before retrying
        print(f"Waiting for {wait_time} seconds before retrying...")
        await asyncio.sleep(wait_time)
Catches exceptions that may occur during Reddit API interactions (the code catches Exception broadly; narrowing it to the specific asyncprawcore exception classes would be safer). Prints the exception message and waits a minute (wait_time) before fetching resumes.
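The wait-and-retry idea can be generalized into a small retry helper. A sketch with a fake flaky coroutine and a tiny delay (the article's code uses a fixed 60-second wait); with_retries and flaky_fetch are hypothetical names for illustration:

```python
import asyncio

async def with_retries(coro_factory, retries=3, wait_time=0.01):
    # Call the coroutine factory, pausing and retrying on failure.
    for attempt in range(retries):
        try:
            return await coro_factory()
        except Exception as e:
            print(f"API exception occurred: {e}")
            await asyncio.sleep(wait_time)  # back off before retrying
    raise RuntimeError("all retries failed")

calls = {"n": 0}

async def flaky_fetch():
    # Fails on the first call, succeeds on the second.
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("rate limited")
    return ["a comment"]

result = asyncio.run(with_retries(flaky_fetch))
print(result)      # ['a comment']
print(calls["n"])  # 2
```

Wrapping the fetch in a helper like this keeps the retry policy in one place instead of scattering sleep calls through the collection loop.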
Returning Results
    return comments[:limit]
Returns the list of collected comments, sliced to the specified limit so that only the required number of comments is returned.
Main Function to Execute Collection
async def main():
    comments = await collect_reddit_comments('sarcasm', 'sarcastic', limit=5000)  # Adjust limit as needed
    df = pd.DataFrame(comments, columns=['comment', 'author', 'created_utc'])
    df.to_csv('reddit_comments.csv', index=False)
    print(f"Total comments collected: {len(df)}")
    print(df.head())
Defines an asynchronous main function to orchestrate the comment collection process. Calls collect_reddit_comments with subreddit_name='sarcasm', keyword='sarcastic', and limit=5000 (adjust as needed). Converts the collected comments into a pandas DataFrame (df), saves it as a CSV file (reddit_comments.csv), and prints summary information about the collected data.
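The DataFrame-to-CSV step can be verified on toy rows. This sketch round-trips through an in-memory buffer instead of writing reddit_comments.csv to disk (illustrative data only):

```python
import io
import pandas as pd

rows = [
    ["oh, fantastic", "alice", 1700000000],
    ["what could go wrong", "[deleted]", 1700000100],
]
df = pd.DataFrame(rows, columns=["comment", "author", "created_utc"])

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Reading it back recovers the same columns and row count.
df2 = pd.read_csv(buf)
print(len(df2))           # 2
print(list(df2.columns))  # ['comment', 'author', 'created_utc']
```

index=False matters here: without it, pandas writes the row index as an extra unnamed column that would pollute the cleaning step in Part 2.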
Running the Main Function
asyncio.run(main())
Executes the main function asynchronously, initiating the process of collecting Reddit comments, processing them into a DataFrame, saving them to a CSV file, and printing the number of comments collected along with a preview of the data.
Read Part 2 – Sarcasm Detection From Reddit Comments: Cleaning & Saving the Data.