The dissemination of hateful content across nearly all social media platforms is an increasingly alarming concern, and it is a heavily studied problem in the research community. The HASOC shared task is one such track: it provides a forum and a data challenge for multilingual research on identifying problematic content, and a platform for developing and optimizing hate speech detection algorithms. It is held as part of the Forum for Information Retrieval Evaluation (FIRE) 2021.
A group of 8 students from PreCog IIIT Hyderabad participated in all 6 subtasks of this challenge.
The Task Description
This year, the tasks centered primarily on three languages: English, Hindi and Marathi. The focus on Indian languages was another motivating factor for us to participate, since it closely relates to several exciting ongoing projects at the lab.
The descriptions for all the six subtasks can be found on the official website here.
The dataset for the challenge consisted of content from Twitter with each subtask having about 2,000 to 4,500 tweets in the training corpus.
Since the challenge intended to encourage participants to build robust hate detection mechanisms even with relatively small datasets, we faced a few difficulties in working with the given data. The size of the dataset kept us from achieving strong scores by simply using existing off-the-shelf pre-trained models, which pushed us to think of smarter methodologies to improve our submissions.
Another point of concern was the class imbalance in the dataset. The dominance of a few classes initially skewed our model predictions; we later managed to overcome this and improve our position on the leaderboard.
To solve these tasks we first identified a few potential problem areas:
- We had a very small dataset for training
- The labels were imbalanced
- The data was code-mixed
Having identified these characteristics of the dataset, we started by fine-tuning transformer-based models (BERT, XLM-R, TwitterBERT, MuRIL, mBERT) on the given corpus. We knew this would lead to some degree of overfitting and decided on steps to mitigate it. For the English task, additional datasets were available, which we incorporated during training. For the Hindi subtask, in contrast, augmenting the dataset with data from past challenges reduced the accuracy of the model because of the out-of-distribution data this introduced.
The imbalance in the label distribution was a difficult problem to overcome. Counterintuitively, using a weighted loss function led to a degradation in the model performance.
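For reference, a weighted loss usually multiplies each sample's loss by a per-class weight that grows as the class gets rarer. The sketch below shows one common "balanced" weighting, weight = N / (K * count), where N is the number of samples and K the number of classes. The function name and the label strings are illustrative, and the weighting scheme we actually tried is not specified here.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Return inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy label distribution (hypothetical): three majority samples, one minority sample.
weights = balanced_class_weights(["majority"] * 3 + ["minority"])
```

With such weights, an error on a rare class costs more than an error on a frequent one, which can over-correct when the training set is small.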
To tackle code-mixing in the given tweets, we experimented with two multilingual models, MuRIL and XLM-R. Of the two, XLM-R captured the multilingual sentences best across all the subtasks and improved our performance by a decent margin.
We also explored the existing literature on hateful content and came across a BERT-based transfer learning approach that uses CNNs for social media text. This methodology fit our use case well, and we fused part of this architecture into our pipeline, which substantially boosted our performance on the Hindi and Marathi subtasks. We also tried to leverage textual features from the different datasets in our end-to-end model architecture.
For Subtask 2, in order to capture the context of the tweet and its ancestors in the hierarchy, we combined the two into a single input using the following format:
[CLS] <tweet text to be classified> [SEP] <context of ancestor tweets> [SEP]
Here [CLS] and [SEP] are special tokens in the vocabulary of these models: the representation at [CLS] is used to classify the input, and [SEP] separates the segments when multiple sentences are passed as one input.
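As a rough illustration of this input format, the sketch below builds the combined string by hand. The function name, the space-joined ancestor context, and the literal BERT-style token strings are assumptions made for the example; in practice the tokenizer normally inserts its own special tokens when given a pair of texts, and some models use different token strings.

```python
from typing import List

# BERT-style special tokens (assumed here; other models use different strings).
CLS, SEP = "[CLS]", "[SEP]"


def build_thread_input(tweet: str, ancestors: List[str]) -> str:
    """Join the target tweet and its ancestor tweets into one two-segment input."""
    # Assumption: ancestors are simply concatenated with spaces, skipping empty ones.
    context = " ".join(a.strip() for a in ancestors if a.strip())
    return f"{CLS} {tweet.strip()} {SEP} {context} {SEP}"


example = build_thread_input("totally agree", ["original post", "first reply"])
```

Putting the tweet to be classified in the first segment keeps it from being cut off if a long thread has to be truncated to the model's maximum length.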
What Worked and What Didn’t
We trained different models for each of the subtasks and compared the results obtained on the train and validation splits. Since we could only make 5 submissions for each task, we could not test all our models on the leaderboard and had to rely on a validation set sampled from the training data.
For the Hindi subtask, which was a multi-class classification setting, we observed that the label distribution was imbalanced and skewed. To combat this, instead of the more popular cross-entropy loss, we used focal loss, which compensates for class imbalance by scaling the cross-entropy term with a factor that down-weights easy samples and so increases the network's sensitivity to misclassified ones. Additionally, we experimented with XLM-R and MiniLM during fine-tuning and found that MiniLM performed better even with fewer parameters.
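To make the difference concrete, here is a minimal single-sample sketch of focal loss in its standard form, FL = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The value gamma = 2.0 and the omission of a per-class alpha term are assumptions for illustration, not the settings of our submission.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Standard cross-entropy for one sample: -log(p_t)."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits: List[float], target: int, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by (1 - p_t)^gamma, which shrinks the loss of easy samples."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [4.0, 0.0, 0.0]  # confident and correct for class 0
hard = [0.0, 4.0, 0.0]  # confident but wrong for class 0
easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

With gamma = 0 the scaling factor is 1 and the loss reduces to plain cross-entropy. Larger gamma values suppress well-classified samples more strongly, so the hard, often minority-class, samples dominate the gradient.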
For the Marathi subtask, including textual features such as the fraction of profane words and the sentiment of the tweet helped increase our F1 scores and moved us a few positions up the leaderboard.
Text preprocessing is a critical step for any NLP problem. We noticed that for some of the tasks (hate detection, for instance), converting the emojis in tweets to text didn't improve performance. However, including emojis while classifying hate (for subtask B) did have a positive impact. For Subtask 2, on the other hand, we expected the model to perform better if the Hindi tokens were in Devanagari script (the representations would be better for classification), so we converted the Romanised Hindi tokens to their Devanagari counterparts.
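A simple way to picture such features is sketched below: compute the fraction of tokens found in a profanity lexicon and append it, together with a sentiment score, to the sentence embedding before the classifier. The lexicon contents, the function names, the externally supplied sentiment score, and the concatenation step are all hypothetical choices for this example.

```python
from typing import List, Set


def profanity_fraction(tokens: List[str], lexicon: Set[str]) -> float:
    """Fraction of tokens that appear in a (hypothetical) profanity lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in lexicon)
    return hits / len(tokens)


def add_text_features(
    embedding: List[float], tokens: List[str], lexicon: Set[str], sentiment: float
) -> List[float]:
    """Append the hand-crafted features to a sentence embedding (assumed design)."""
    return embedding + [profanity_fraction(tokens, lexicon), sentiment]


lexicon = {"badword"}  # placeholder entry, not a real lexicon
tokens = ["this", "is", "a", "BadWord"]
features = add_text_features([0.1, 0.2], tokens, lexicon, sentiment=-0.5)
```

The sentiment value is assumed to come from a separate sentiment model, and the classifier head then sees the two extra dimensions alongside the learned representation.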
Key Learnings & Takeaways
Throughout the challenge, there were several points that we came across which aided us in the process of building solutions for each of the tasks.
For shared tasks / challenges such as this, we learnt that ensembling transformer-based models can help improve performance, primarily because there is no single "right" or "wrong" method, and combining the advantages of different model architectures may perform better than any one of them. Additionally, we observed that while weighted cross-entropy loss doesn't help much with class imbalance, slight modifications of the cross-entropy loss that account for harder examples, like focal loss, can help the model gain extra performance.
Hyperparameter tuning is another aspect worth exploring for transformer-based language models. We did experiment with this for Subtask 2 (detection of hateful speech in conversation threads); however, owing to time constraints, we weren't able to explore it fully.
Using the sentiment of each individual tweet in the hierarchy as a feature in Subtask 2 is something we initially explored but were not able to incorporate due to time constraints. It would be interesting to study how these features can aid the classifier.
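One common way to ensemble classifiers is soft voting, where the class probabilities predicted by each model are averaged before picking the label. The sketch below illustrates that idea only; the function name and the toy probabilities are made up, and the exact combination strategy used in our submissions is not described here.

```python
from typing import List


def soft_vote(model_probs: List[List[float]]) -> int:
    """Average per-class probabilities across models and return the winning class index."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    avg = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


# Two models lean weakly towards class 1, one model is very confident in class 0.
prediction = soft_vote([[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]])
```

Unlike majority voting, soft voting lets a confident model outweigh several uncertain ones, which is why the toy example above resolves to class 0.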
The entire leaderboard can be viewed here. Our final standings were as follows
| Task Type | Leaderboard Position | Difference to Leader |
|---|---|---|
| English A | 23 of 56 | ~0.02 in F1 |
| English B | 21 of 37 | ~0.07 in F1 |
| Hindi A | 8 of 34 | ~0.01 in F1 |
| Hindi B | 8 of 24 | ~0.05 in F1 |
| Marathi | 4 of 25 | ~0.05 in F1 |
| Conversation Thread | 3 of 16 | ~0.04 in F1 |
Overall, it was truly an enriching experience to participate as a team in this challenge. Of the 67 teams that participated, only 6 managed to make submissions for all 6 tasks. Our team stood 3rd in this list. Our submitted version of the report / paper is also available.
Through this task, we explored and learnt many new things that not only helped us in the challenge but, we strongly believe, have also broadened our thinking and will help us in our ongoing research projects at the lab!
Team: Aditya Kadam, Anmol Goel, Jivitesh Jain, Jushaan Singh Kalra, Mallika Subramanian, Manvith Reddy, Prashant Kodalli, T.H. Arjun