Using Natural Language Processing (NLP) to Classify Reddit Posts
Can we differentiate between Bodyweight Fitness and Weightlifting subreddit posts?
I’m a data nerd. Yes, I admit it: I love data science and all things data. But I also have a passion for exercise and fitness and, as a certified personal trainer, what better way to combine my passions than a data project that attempts to understand a few of the many fitness interests out there? It’s a perfect marriage of interests that lets me grow in my understanding of both.
First, though, what exactly is Natural Language Processing or NLP and how can it be used to study exercise? Well, distilled to the absolute basics, NLP is the ability of computers to understand human language whether written or spoken. Apple’s Siri and Amazon’s Alexa? Yeah, those involve NLP. Specific to this project, however, NLP will be applied to the text of several thousand Reddit posts to extract insights about those posts and the type of exercise that the writers of those posts engage in.
So on to the issue at hand. Can we predict what subreddit a post belongs to? Specifically, can we predict the categorization of r/bodyweightfitness and r/weightlifting postings? Bodyweight fitness is just what you’d expect — people using their bodies and gravity for resistance to achieve fitness goals — think pushups and pullups. Weightlifting involves individuals who use bars and weight plates to achieve resistance. You’ve probably seen examples of this if you’ve ever watched a CrossFit workout or Olympic weightlifting.
This project used data scraped from Reddit with the Pushshift API, drawing from two subreddits: r/bodyweightfitness and r/weightlifting. The ultimate goal was to build a statistical model that correctly predicts each post’s class (that is, the subreddit it belongs to) from user-submitted text. In total, 40,000 posts were scraped, 20,000 from each subreddit. The collection excluded all video posts and posts generated by the automated AutoModerator account.
During the initial examination of the data, it became clear that cleaning and restructuring were necessary before any analysis. The first step was to remove any rows that had been flagged as deleted by a moderator, the original poster, or Reddit. Further cleaning steps were taken which:
- Removed duplicates. No duplicates were found, but the step was still programmed in case it becomes necessary in the future, should more data be collected.
- Kept only relevant columns. Columns kept for further processing were ‘selftext’, the text of the post, and ‘subreddit’, the subreddit the post was scraped from.
- Removed any additional unnecessary posts. Upon further examination, posts that had been deleted by an ‘AutoModerator’ were removed as were those where ‘is_self’ was set to False. The ‘is_self’=False cases did not have an author and the content was flagged as deleted.
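The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the ‘author’ column and the ‘[deleted]’ and ‘[removed]’ placeholder strings are assumptions about how the scraped data marks deleted posts, and the toy dataframe is made up.

```python
import pandas as pd


def clean_posts(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a frame of scraped posts."""
    df = raw.copy()
    # Drop posts whose text is missing or was deleted/removed
    # (the placeholder strings are an assumption about the raw data).
    df = df.dropna(subset=["selftext"])
    df = df[~df["selftext"].isin(["[deleted]", "[removed]"])]
    # Drop AutoModerator posts and posts where is_self is False
    # (the 'author' column name is an assumption).
    df = df[(df["author"] != "AutoModerator") & df["is_self"].astype(bool)]
    # Remove duplicates, even if none are expected in the current sample.
    df = df.drop_duplicates(subset=["selftext", "subreddit"])
    # Keep only the columns needed for modeling.
    return df[["selftext", "subreddit"]].reset_index(drop=True)


# Made-up rows, purely for illustration.
raw = pd.DataFrame({
    "selftext": ["How do I progress pushups?", "[removed]", "Snatch form check",
                 "Snatch form check", "Weekly thread", "link post"],
    "subreddit": ["bodyweightfitness", "bodyweightfitness", "weightlifting",
                  "weightlifting", "weightlifting", "weightlifting"],
    "author": ["a", "b", "c", "c", "AutoModerator", "d"],
    "is_self": [True, True, True, True, True, False],
})
cleaned = clean_posts(raw)
print(cleaned)
```

Of the six toy rows, only the first pushup post and one copy of the snatch post survive the filters.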
After the additional cleaning noted above, 24,549 posts remained in the dataframe: 10,606 from r/weightlifting and 13,943 from r/bodyweightfitness. Before finalizing the dataset, r/weightlifting and r/bodyweightfitness were coded as 0 and 1, respectively, in a dummy variable column called subreddit_bodyweightfitness. This newly cleaned and pared-down dataframe was then exported as a CSV for the subsequent analytic steps.
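One way the dummy variable and export could look in pandas is sketched below. The column name subreddit_bodyweightfitness comes from the text above; the three example rows are invented, and the CSV is rendered to a string here only so the snippet runs without touching the disk.

```python
import pandas as pd

# Invented rows standing in for the cleaned dataframe.
posts = pd.DataFrame({
    "selftext": ["pushup progression", "snatch form check", "pullup plateau"],
    "subreddit": ["bodyweightfitness", "weightlifting", "bodyweightfitness"],
})

# 1 for r/bodyweightfitness, 0 for r/weightlifting.
posts["subreddit_bodyweightfitness"] = (
    posts["subreddit"] == "bodyweightfitness"
).astype(int)

# With no path, to_csv returns the CSV as a string;
# pass a file path instead to write the export to disk.
csv_text = posts.to_csv(index=False)
print(csv_text)
```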
Below is a graph outlining how much of the sample was drawn from r/bodyweightfitness (57 percent) and r/weightlifting (43 percent).
Exploratory Data Analysis
Prior to model building, high-level snapshots of the data were taken to get a better understanding of the text and classes in the dataset. Also, and perhaps most importantly, these snapshots showed whether any further cleaning was needed.
Below is an image of the most frequent words before a stop words list was implemented. Based on this snapshot, it was deemed necessary to remove stop words, as they did not appear to enrich the analyses. The custom stop words list was a modified version of the one built into sklearn. After its removal, the most frequent words in the dataframe included things like workout, body, training, day, and weight.
Several models were examined to see which was best suited to predict the classification of r/bodyweightfitness and r/weightlifting posts. Multinomial Naïve Bayes, Logistic Regression, and Random Forest were all considered and tested. Each model’s parameters were adjusted over several iterations, and each was modeled with TF-IDF, or Term Frequency Inverse Document Frequency. This is a word vectorization that scores how distinctive a word is by comparing how often it appears in a document (row) with how many documents it appears in overall. Each classification method was also modeled with the text vectorizer CVEC, or Count Vectorizer. However, TF-IDF produced superior results and so was used in the final modeling.
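To make the TF-IDF idea concrete, here is a small sketch using sklearn's TfidfVectorizer with its default settings on three invented documents. The word "workout" appears in every document, while "squat" appears in only one, so "squat" should receive the larger weight in the first document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus: "workout" is in every document, the other words in one each.
docs = ["squat workout", "pullup workout", "snatch workout"]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # maps each word to its column index

squat_weight = weights[0, vocab["squat"]]
workout_weight = weights[0, vocab["workout"]]
print(f"squat: {squat_weight:.3f}, workout: {workout_weight:.3f}")
```

A Count Vectorizer would give both words a count of 1 in that document, which is one reason the two vectorizers can lead to different model results.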
The two models with the best accuracy and the least overfitting were selected: Logistic Regression with default settings and Naïve Bayes, both using TF-IDF word vectorization.
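A comparison like this one could be set up with sklearn pipelines, roughly as sketched below. The evaluate helper, the split settings, and the toy posts are all assumptions made for illustration; the real project used the cleaned Reddit text and its own tuning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate(model, texts, labels, cv=5, seed=42):
    """Return cross-validation, train, and test accuracy for a TF-IDF pipeline."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, stratify=labels, random_state=seed
    )
    pipe = make_pipeline(TfidfVectorizer(), model)
    cv_score = cross_val_score(pipe, X_train, y_train, cv=cv).mean()
    pipe.fit(X_train, y_train)
    return {
        "cv": float(cv_score),
        "train": float(pipe.score(X_train, y_train)),
        "test": float(pipe.score(X_test, y_test)),
    }


# Invented toy posts: 1 = r/bodyweightfitness, 0 = r/weightlifting.
bodyweight = ["pushup pullup progression at home", "handstand and dip routine"] * 10
lifting = ["snatch and clean and jerk with barbell", "front squat with plates"] * 10
texts = bodyweight + lifting
labels = [1] * len(bodyweight) + [0] * len(lifting)

models = {
    "logistic_regression": LogisticRegression(),  # default settings
    "naive_bayes": MultinomialNB(),
}
results = {name: evaluate(model, texts, labels) for name, model in models.items()}
for name, scores in results.items():
    print(name, scores)
```

Comparing the train and test scores side by side is what reveals overfitting: the wider the gap, the more the model has memorized the training posts.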
Finally, some of the top words for each subreddit can be found below. While there is some overlap, the most common words for each subreddit are quite distinct.
A table of results is below.
The Logistic Regression model has a higher cross validation score and a higher training accuracy. However, the difference between the train and test accuracy is larger than with the Naïve Bayes model. The Naïve Bayes model has a cross validation score of 0.88, a train accuracy of 0.89 and a test accuracy of 0.88.
Accuracy is the overall percentage of observations that were correctly predicted. In this study, that means how accurately the model assigned posts to r/bodyweightfitness or r/weightlifting.
For the other metrics, the Logistic Regression also seemed to have slightly superior specificity, sensitivity, and precision.
Both models performed well on sensitivity, though. Each had a sensitivity of around 0.94, meaning it correctly identified about 94 percent of the posts belonging to r/bodyweightfitness.
Specificity in this case is how well the model correctly predicted posts in the 0, or r/weightlifting, class. The models had specificities between 0.82 and 0.85, meaning they were correct for that class between 82 and 85 percent of the time.
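For readers who want to see how these metrics relate, the sketch below computes accuracy, sensitivity, specificity, and precision from a confusion matrix. The labels and predictions are invented; they are not the project's results.

```python
from sklearn.metrics import confusion_matrix


def classification_metrics(y_true, y_pred):
    """Compute the metrics discussed above, with 1 as the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of class 1 posts found
        "specificity": tn / (tn + fp),   # share of class 0 posts found
        "precision": tp / (tp + fp),     # share of class 1 predictions that are right
    }


# Invented example: 1 = r/bodyweightfitness, 0 = r/weightlifting.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```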
A condensed data dictionary is below. It was not possible to provide a complete list of all variables due to the way the submission text was vectorized.
Both models predicted the subreddit a post belonged to better than the baseline, or null, model, which had 57% accuracy. The models performed similarly, although most of the metrics for the Logistic Regression appeared to edge out those of the Naïve Bayes. The exception was the gap between train and test accuracy: the Logistic Regression seemed to be slightly more overfit than the Naïve Bayes.
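The 57% baseline follows directly from the class counts reported earlier: a null model that always predicts the larger class is right exactly as often as that class occurs. A quick check of the arithmetic, using the post counts from the cleaned dataset:

```python
# Post counts after cleaning, as reported earlier in the article.
counts = {"bodyweightfitness": 13_943, "weightlifting": 10_606}

total = sum(counts.values())
majority_class, majority_count = max(counts.items(), key=lambda item: item[1])

# Always guessing the majority class is correct this share of the time.
baseline_accuracy = majority_count / total
print(f"Always predicting r/{majority_class}: {baseline_accuracy:.1%} accuracy")
```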
Lastly, there are interesting differences in the top words found in each subreddit. Perhaps unsurprisingly to fitness enthusiasts, the lists shown earlier have some obvious differences.