Published on: July 13, 2023 · Python · NLTK · VADER · K-Means
Analysed comments from multiple social media platforms to classify sentiment and identify dominant topic clusters, combining unsupervised clustering with a lexicon-based sentiment model.
We started by pulling comments from several social media platforms using platform APIs and web scraping tools. The comments came in multiple languages and were translated before analysis, since both the TF-IDF stop-word list and the VADER lexicon used below are English.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sns.set()
After cleaning and translating the data, we vectorised comments using TF-IDF and applied K-Means clustering to group comments by topic similarity. The Elbow Method was used to find the optimal number of clusters.
text = final_df['translated_text'].astype(str)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(text)
# Find optimal k using Elbow Method
inertias = []
K = range(2, 10)
for k in K:
    km = KMeans(n_clusters=k, max_iter=200, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
plt.plot(K, inertias, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method — Optimal k')
plt.show()
# Optimal k = 6 based on elbow curve
model = KMeans(n_clusters=6, init='k-means++', max_iter=200, n_init=10, random_state=42)
model.fit(X)
clusters = pd.DataFrame({'text': text, 'cluster': model.labels_})
print(clusters.sort_values('cluster'))
The elbow curve flattened beyond k=6, so we settled on six topic clusters for the comment dataset.
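Reading the elbow off a plot is a judgement call. As a rough cross-check, the bend can be estimated numerically from the inertia values. The sketch below is a minimal illustration, not the method used in this project: `elbow_k` is a hypothetical helper, and the inertia values are made up for the example rather than taken from the comment dataset.

```python
def elbow_k(ks, inertias):
    """Return the k where the inertia curve bends most sharply.

    The bend at position i is the discrete second difference:
    inertias[i-1] - 2*inertias[i] + inertias[i+1].
    The first and last k cannot be scored, so they are never returned.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up inertia values for illustration only (hypothetical, not project results).
example_ks = list(range(2, 10))
example_inertias = [1000.0, 860.0, 730.0, 610.0, 500.0, 480.0, 465.0, 452.0]

print(elbow_k(example_ks, example_inertias))  # prints 6
```

A heuristic like this is sensitive to noise in the curve, so it is best treated as a sanity check alongside the plot rather than a replacement for it.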
VADER (Valence Aware Dictionary and sEntiment Reasoner) was applied to each comment. VADER is well suited to social media text: it handles slang, emojis, and capitalisation without requiring labelled training data.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
def classify_sentiment(text):
    score = analyzer.polarity_scores(str(text))['compound']
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'
final_df['sentiment'] = final_df['translated_text'].apply(classify_sentiment)
print(final_df['sentiment'].value_counts(normalize=True))
Running VADER on the full dataset produced a strong positive skew: 92% of comments were classified as positive. The negative comments were concentrated around delivery and pricing complaints.
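One way to see which topic clusters carry the negative comments is to cross-tabulate cluster labels against sentiment labels. The following is a minimal sketch on a toy DataFrame: the rows, the `cluster` column, and the variable names are assumptions made for illustration and do not reproduce the project's data.

```python
import pandas as pd

# Toy data standing in for the real comments (hypothetical labels).
toy = pd.DataFrame({
    "cluster":   [0, 0, 1, 1, 1, 2],
    "sentiment": ["positive", "positive", "negative",
                  "negative", "positive", "neutral"],
})

# Share of each sentiment within each cluster (every row sums to 1).
share = pd.crosstab(toy["cluster"], toy["sentiment"], normalize="index")

# Cluster whose comments are most often negative.
most_negative = share["negative"].idxmax()

print(share.round(2))
print("Cluster with the highest negative share:", most_negative)
```

On the real data, the same call could be applied after attaching `model.labels_` to `final_df` as a `cluster` column, assuming the row order of `final_df` matches the text passed to the vectoriser.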
The majority of social media comments (92%) were positive. The K-Means clustering revealed six distinct conversation topics, allowing the business to prioritise which sentiment clusters required brand response. The pipeline was subsequently adapted for real-time monitoring of campaign mentions.