Overview
MSc NLP final assessment: entity-level sentiment classification on the Twitter Entity Sentiment dataset (positive / negative / neutral, with irrelevant rolled into neutral). I ran six architectures back to back (MLP, BiLSTM, 1-D CNN, DistilBERT, RoBERTa, ALBERT) with two configurations each, for twelve experiments in total, measuring accuracy, macro and weighted F1, precision, recall, and wall-clock training time.
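The label scheme above (irrelevant folded into neutral) can be sketched as a small mapping. This is a minimal illustration, not the project's actual preprocessing code: the raw label strings and the names LABEL_MAP and collapse_label are assumptions, and the dataset's real column values may be spelled differently.

```python
# Assumed raw label strings; the dataset's actual values may differ.
LABEL_MAP = {
    "positive": "positive",
    "negative": "negative",
    "neutral": "neutral",
    "irrelevant": "neutral",  # irrelevant is rolled into neutral
}


def collapse_label(raw_label: str) -> str:
    """Map a raw four-way label onto the three-way scheme (hypothetical helper)."""
    key = raw_label.strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"unexpected label: {raw_label!r}")
    return LABEL_MAP[key]


raw_labels = ["Positive", "Irrelevant", "Negative", "Neutral"]
collapsed = [collapse_label(label) for label in raw_labels]
print(collapsed)  # ['positive', 'neutral', 'negative', 'neutral']
```

Raising on an unknown label, rather than silently defaulting to neutral, keeps a typo in the data from inflating the neutral class.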
The Problem
The interesting question with sentiment classification isn't 'can a transformer do this?' but 'when do you actually need one?'. Transformers are expensive to train and serve, and a lot of NLP tasks don't earn that cost back. I wanted concrete numbers on where the trade-off sits for entity-level sentiment.
The Approach
Every model used the same preprocessing and train/val split. The non-transformer baselines (MLP, BiLSTM, 1-D CNN) used trained embeddings. The three transformers (DistilBERT, RoBERTa, ALBERT) were fine-tuned from Hugging Face checkpoints. I ran two configs per model, varying width, number of layers, and learning rate, and logged accuracy, macro F1, weighted F1, precision, recall, and training time for each run.
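For reference, the logged metrics can be computed as in the dependency-free sketch below. It is an illustration rather than the code used in the experiments: the function names are hypothetical, and macro averaging for precision and recall is an assumption, since the averaging mode for those two is not stated. In practice, scikit-learn's f1_score with average="macro" or average="weighted" gives the same F1 values.

```python
import time
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy, macro precision/recall/F1, and support-weighted F1."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[c] = (precision, recall, f1)

    n = len(y_true)
    k = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / n,
        # Macro: every class counts equally, regardless of size.
        "macro_precision": sum(v[0] for v in per_class.values()) / k,
        "macro_recall": sum(v[1] for v in per_class.values()) / k,
        "macro_f1": sum(v[2] for v in per_class.values()) / k,
        # Weighted: each class's F1 is scaled by its share of true labels.
        "weighted_f1": sum(per_class[c][2] * support[c] for c in classes) / n,
    }


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
metrics, seconds = timed(classification_metrics, y_true, y_pred)
print(metrics)
```

Macro F1 is the more demanding number on an imbalanced three-way split, because a weak minority class drags it down as much as a weak majority class would.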
Outcome
MLP Config A topped the leaderboard at 98.6% accuracy and 0.986 macro F1, and trained in 54 seconds. RoBERTa Config A reached 97.5% accuracy but took over an hour to train. DistilBERT Config B managed 97.1% accuracy in 20 minutes. The headline finding for this dataset: a well-tuned MLP matched or beat every transformer at roughly 1/70th of RoBERTa's training time and about 1/22nd of DistilBERT's, which is a useful concrete data point about when transformer overhead is worth paying for.
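The training-cost comparison follows directly from the reported times. The short calculation below uses only the figures quoted above; RoBERTa's time is given as "over an hour", so 3600 seconds is treated as a lower bound, and the variable names are illustrative.

```python
mlp_seconds = 54
roberta_seconds = 60 * 60     # "over an hour", so this is a lower bound
distilbert_seconds = 20 * 60

roberta_ratio = roberta_seconds / mlp_seconds
distilbert_ratio = distilbert_seconds / mlp_seconds

print(f"RoBERTa vs MLP: at least {roberta_ratio:.0f}x the training time")
print(f"DistilBERT vs MLP: about {distilbert_ratio:.0f}x the training time")
# RoBERTa vs MLP: at least 67x the training time
# DistilBERT vs MLP: about 22x the training time
```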
