The detection of controversial content in political discussions on the Internet is a critical challenge in maintaining healthy digital discourse. Unlike much of the existing literature that relies on synthetically balanced data, our work preserves the natural distribution of controversial and non-controversial posts. This real-world imbalance highlights a core challenge that needs to be addressed for practical deployment. Our study re-evaluates well-established methods for detecting controversial content. We curate our own dataset focusing on the Indian political context that preserves the natural distribution of controversial content, with only 12.9% of the posts in our dataset being controversial. This disparity reflects the true imbalance in real-world political discussions and highlights a critical limitation in the existing evaluation methods. Benchmarking on datasets that model data imbalance is vital for ensuring real-world applicability. Thus, in this work, (i) we release our dataset, with an emphasis on class imbalance, that focuses on the Indian political context, (ii) we evaluate existing methods from this domain on this dataset and demonstrate their limitations in the imbalanced setting, (iii) we introduce an intuitive metric to measure a model’s robustness to class imbalance, (iv) we also incorporate ideas from the domain of Topological Data Analysis, specifically Persistent Homology, to curate features that provide richer representations of the data. Furthermore, we benchmark models trained with topological features against established baselines.
@article{arun2025topo,title={Topo Goes Political: TDA-Based Controversy Detection in Imbalanced Reddit Political Data},author={Arun, Arvindh and Chandra, Karuna K and Sinha, Akshit and Velayutham, Balakumar and Arora, Jashn and Jain, Manish and Kumaraguru, Ponnurangam},year={2025},journal={5th International Workshop on Computational Methods for Online Discourse Analysis (BeyondFacts’25) Collocated with The Web Conference 2025},}
SSI-FM @ ICLR
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, and Ameya Prabhu
Scaling Self-Improving Foundation Models Workshop at ICLR ’25, 2025
There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for fewer than 9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs’ ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.
@article{sinha2025falsify,title={Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation},author={Sinha, Shiven and Goel, Shashwat and Kumaraguru, Ponnurangam and Geiping, Jonas and Bethge, Matthias and Prabhu, Ameya},year={2025},journal={Scaling Self-Improving Foundation Models Workshop at ICLR '25},}
IJDSA
Deep learning and transfer learning to understand emotions: a PoliEMO dataset and multi-label classification in Indian elections
Anuradha Surolia, Shikha Mehta, and Ponnurangam Kumaraguru
International Journal of Data Science and Analytics, 2025
Understanding user emotions to identify user opinion, sentiment, stance, and preferences has become an active research topic in the last few years. Many studies and datasets are designed for user emotion analysis, covering news websites, blogs, and user tweets. However, there is little exploration of political emotions in the Indian context for multi-label emotion detection. This paper presents the PoliEMO dataset, a novel benchmark corpus of political tweets in a multi-label setup for Indian elections, consisting of over 3512 manually annotated tweets. In this work, 6792 labels were generated for six emotion categories: anger, insult, joy, neutral, sadness, and shameful. Next, the PoliEMO dataset is used to understand emotions in a multi-label context using state-of-the-art machine learning algorithms with multi-label classifiers (binary relevance (BR), label powerset (LP), classifier chain (CC), and multi-label k-nearest neighbors (MkNN)) and deep learning models such as convolutional neural network (CNN), long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM), and a transfer learning model, i.e., bidirectional encoder representations from transformers (BERT). Experiments show that Bi-LSTM performs better than state-of-the-art approaches, with a micro-averaged F1 score of 0.81, a macro-averaged F1 score of 0.78, and an accuracy of 0.68.
@article{surolia2025deeplearning,title={Deep learning and transfer learning to understand emotions: a PoliEMO dataset and multi-label classification in Indian elections},author={Surolia, Anuradha and Mehta, Shikha and Kumaraguru, Ponnurangam},year={2025},journal={International Journal of Data Science and Analytics},pages={1--15},}
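The PoliEMO abstract above reports both micro- and macro-averaged F1 for a multi-label task. The following is a minimal sketch of how the two averages differ, using only the standard definitions. The toy labels and the helper names (`label_counts`, `micro_f1`, `macro_f1`) are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true: List[List[int]], y_pred: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Per-label (tp, fp, fn) counts for binary indicator matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Pool counts over all labels, then compute a single F1."""
    counts = label_counts(y_true, y_pred)
    return f1(sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts))


def macro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    counts = label_counts(y_true, y_pred)
    return sum(f1(*c) for c in counts) / len(counts)


# Toy data (hypothetical): 4 tweets, 3 emotion labels as binary indicators.
Y_TRUE = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
Y_PRED = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]

if __name__ == "__main__":
    print("micro F1:", micro_f1(Y_TRUE, Y_PRED))
    print("macro F1:", macro_f1(Y_TRUE, Y_PRED))
```

In this toy case the rare third label is never predicted, which lowers the macro average more than the micro average. That is the usual reason to report both on imbalanced label sets.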
WebSci
COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models
Priyanshul Govil, Hemang Jain, Vamshi Bonagiri, Aman Chadha, Ponnurangam Kumaraguru, Manas Gaur, and Sanorita Dey
In Proceedings of the 17th ACM Web Science Conference 2025, 2025
Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices. Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets. These benchmarks measure bias by observing an LLM’s behavior on biased statements. However, these statements lack contextual considerations of the situations they try to present. To address this, we introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear. We develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to measure a biased statement’s reliability in detecting bias based on the variance in model behavior across different contexts. To evaluate the metric, we augment 2,291 stereotyped statements from two existing benchmark datasets by adding contextual information. We show that COBIAS aligns with human judgment on the contextual reliability of biased statements and can be used to create reliable datasets, which would assist bias mitigation works.
@inproceedings{govil2025cobias,title={COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models},author={Govil, Priyanshul and Jain, Hemang and Bonagiri, Vamshi and Chadha, Aman and Kumaraguru, Ponnurangam and Gaur, Manas and Dey, Sanorita},year={2025},booktitle={Proceedings of the 17th ACM Web Science Conference 2025},}
WebSci
Framing the Fray: Conflict Framing in Indian Election News Coverage
In covering elections, journalists often use conflict frames which depict events and issues as adversarial, often highlighting confrontations between opposing parties. Although conflict frames result in more citizen engagement, they may distract from substantive policy discussion. In this work, we analyze the use of conflict frames in online English-language news articles by seven major news outlets in the 2014 and 2019 Indian general elections. We find that the use of conflict frames is not linked to the news outlets’ ideological biases but is associated with TV-based (rather than print-based) media. Further, the majority of news outlets do not exhibit ideological biases in portraying parties as aggressors or targets in articles with conflict frames. Finally, comparing news articles reporting on political speeches to their original speech transcripts, we find that, on average, news outlets tend to consistently report on attacks on the opposition party in the speeches but under-report on more substantive electoral issues covered in the speeches such as farmers’ issues and infrastructure.
@inproceedings{chebroluframing,title={Framing the Fray: Conflict Framing in Indian Election News Coverage},author={Chebrolu, Tejasvi and Modepalle, Rohan and Vardhan, Harsha and Rajadesingan, Ashwin and Kumaraguru, Ponnurangam},year={2025},booktitle={Proceedings of the 17th ACM Conference on Web Science},}
WebSci
Personal Narratives Empower Politically Disinclined Individuals to Engage in Political Discussions
Tejasvi Chebrolu, Ashwin Rajadesingan, and Ponnurangam Kumaraguru
In Proceedings of the 17th ACM Conference on Web Science, 2025
Engaging in political discussions is crucial in democratic societies, yet many individuals remain politically disinclined due to various factors such as perceived knowledge gaps, conflict avoidance, or a sense of disconnection from the political system. In this paper, we explore the potential of personal narratives—short, first-person accounts emphasizing personal experiences—as a means to empower these individuals to participate in online political discussions. Using a text classifier that identifies personal narratives, we conducted a large-scale computational analysis to evaluate the relationship between the use of personal narratives and participation in political discussions on Reddit. We find that politically disinclined individuals (PDIs) are more likely to use personal narratives than more politically active users. Personal narratives are more likely to attract and retain politically disinclined individuals in political discussions than other comments. Importantly, personal narratives posted by politically disinclined individuals are received more positively than their other comments in political communities. These results emphasize the value of personal narratives in promoting inclusive political discourse.
@inproceedings{chebrolunarrative,title={Personal Narratives Empower Politically Disinclined Individuals to Engage in Political Discussions},author={Chebrolu, Tejasvi and Rajadesingan, Ashwin and Kumaraguru, Ponnurangam},year={2025},booktitle={Proceedings of the 17th ACM Conference on Web Science},}
SIFM@ICLR
Great Models Think Alike and this Undermines AI Oversight
Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, and Jonas Geiping
In ICLR Workshop on Self-Improving Foundation Models, 2025
As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend – model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
@inproceedings{goel2025greatmodelsthinkalike,title={Great Models Think Alike and this Undermines AI Oversight},author={Goel, Shashwat and Struber, Joschka and Auzina, Ilze Amanda and Chandra, Karuna K and Kumaraguru, Ponnurangam and Kiela, Douwe and Prabhu, Ameya and Bethge, Matthias and Geiping, Jonas},year={2025},booktitle={ICLR Workshop on Self-Improving Foundation Models},}
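The abstract above measures model similarity through overlap in model mistakes. As a rough illustration of the idea, the sketch below computes error consistency, a simpler chance-adjusted agreement (Cohen's kappa over per-sample correctness). This is a related quantity and not the paper's probabilistic metric; the function name and the toy inputs are assumptions made for illustration.

```python
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Chance-adjusted agreement on which samples two models get right or wrong.

    1.0 means the models fail on exactly the same samples; 0.0 means the
    overlap is what independent models with these accuracies would show.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    # Observed agreement: both right or both wrong on the same sample.
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Agreement expected by chance given only the two accuracies.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if expected == 1.0:
        return 0.0  # degenerate case: both models always right or always wrong
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    model_a = [True, True, False, False]
    model_b = [True, False, True, False]
    print(error_consistency(model_a, model_a))  # identical mistakes
    print(error_consistency(model_a, model_b))  # chance-level overlap
```

Adjusting for chance matters because two highly accurate models agree on most samples even when their few mistakes are unrelated.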
COLING
KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting
Thilini Wijesiriwardene, Ruwan Wickramarachchi, Sreeram Vennam, Vinija Jain, Aman Chadha, Amitava Das, Ponnurangam Kumaraguru, and Amit Sheth
In The 31st International Conference on Computational Linguistics (COLING 2025), 2025
Making analogies is fundamental to cognition. Proportional analogies, which consist of four terms, are often used to assess linguistic and cognitive abilities. For instance, completing analogies like "Oxygen is to Gas as blank is to blank" requires identifying the semantic relationship (e.g., "type of") between the first pair of terms ("Oxygen" and "Gas") and finding a second pair that shares the same relationship (e.g., "Aluminum" and "Metal"). In this work, we introduce a 15K Multiple-Choice Question Answering (MCQA) dataset for proportional analogy completion and evaluate the performance of contemporary Large Language Models (LLMs) in various knowledge-enhanced prompt settings. Specifically, we augment prompts with three types of knowledge: exemplar, structured, and targeted. Our results show that despite extensive training data, solving proportional analogies remains challenging for current LLMs, with the best model achieving an accuracy of 55%. Notably, we find that providing targeted knowledge can better assist models in completing proportional analogies compared to providing exemplars or collections of structured knowledge.
@inproceedings{wijesiriwardene2024exploring,title={KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting},author={Wijesiriwardene, Thilini and Wickramarachchi, Ruwan and Vennam, Sreeram and Jain, Vinija and Chadha, Aman and Das, Amitava and Kumaraguru, Ponnurangam and Sheth, Amit},year={2025},booktitle={The 31st International Conference on Computational Linguistics (COLING 2025)},}
AAAI
Higher Order Structures For Graph Explanations
Akshit Sinha*, Sreeram Vennam*, Charu Sharma, and Ponnurangam Kumaraguru
In The 39th Annual AAAI Conference on Artificial Intelligence, 2025
Graph Neural Networks (GNNs) have emerged as powerful tools for learning representations of graph-structured data, demonstrating remarkable performance across various tasks. Recognising their importance, there has been extensive research focused on explaining GNN predictions, aiming to enhance their interpretability and trustworthiness. However, GNNs and their explainers face a notable challenge: graphs are primarily designed to model pair-wise relationships between nodes, which can make it tough to capture higher-order, multi-node interactions. This characteristic can pose difficulties for existing explainers in fully representing multi-node relationships. To address this gap, we present Framework For Higher-Order Representations In Graph Explanations (FORGE), a framework that enables graph explainers to capture such interactions by incorporating higher-order structures, resulting in more accurate and faithful explanations. Extensive evaluation shows that, across various graph explainers, FORGE improves average explanation accuracy by 1.9x on real-world datasets from the GraphXAI benchmark and by 2.25x on synthetic datasets. We perform ablation studies to confirm the importance of higher-order relations in improving explanations, while our scalability analysis demonstrates FORGE’s efficacy on large graphs.
@inproceedings{sinha2024higherorderstructuresgraph,title={Higher Order Structures For Graph Explanations},author={Sinha, Akshit and Vennam, Sreeram and Sharma, Charu and Kumaraguru, Ponnurangam},year={2025},booktitle={The 39th Annual AAAI Conference on Artificial Intelligence},}
2024
arXiv
A Cognac shot to forget bad memories: Corrective Unlearning in GNNs
Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. Because graph data does not follow the independently and identically distributed (i.i.d.) assumption, adversarial manipulations or incorrect data can propagate to other data points through message passing, which deteriorates the model’s performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem of Corrective Unlearning. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method, Cognac, which can unlearn the effect of the manipulation set even when only 5% of it is identified.
@misc{kolipaka2024cognacshotforgetbad,title={A Cognac shot to forget bad memories: Corrective Unlearning in GNNs},author={Kolipaka, Varshita and Sinha, Akshit and Mishra, Debangan and Kumar, Sumit and Arun, Arvindh and Goel, Shashwat and Kumaraguru, Ponnurangam},year={2024},eprint={2412.00789},archiveprefix={arXiv},primaryclass={cs.LG},}
ACM
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Current computational approaches for analysing or generating code-mixed sentences do not explicitly model "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT’s zero-shot and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.
@inproceedings{kodali2024humanjudgementspredictivemodels,title={From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences},author={Kodali, Prashant and Goel, Anmol and Asapu, Likhith and Bonagiri, Vamshi Krishna and Govil, Anirudh and Choudhury, Monojit and Shrivastava, Manish and Kumaraguru, Ponnurangam},year={2024},booktitle={},}
UniReps @ NeurIPS
Emergence of Text Semantics in CLIP Image Encoders
Sreeram Vennam, Shashwat Singh, Anirudh Govil, and Ponnurangam Kumaraguru
In UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models, 2024
Certain self-supervised approaches to train image encoders, like CLIP, align images with their text captions. However, these approaches do not have an a priori incentive to learn to associate text inside the image with the semantics of the text. Our work studies the semantics of text rendered in images. We show evidence suggesting that the image representations of CLIP have a subspace for textual semantics that abstracts away fonts. Furthermore, we show that the rendered text representations from the image encoder only slightly lag behind the text representations with respect to preserving semantic relationships.
@inproceedings{vennam2024emergence,title={Emergence of Text Semantics in {CLIP} Image Encoders},author={Vennam, Sreeram and Singh, Shashwat and Govil, Anirudh and Kumaraguru, Ponnurangam},year={2024},booktitle={UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models},}
JURIX
InSaAF: Incorporating Safety Through Accuracy and Fairness - Are LLMs Ready for the Indian Legal Domain?
Recent advancements in language technology and Artificial Intelligence have resulted in numerous Language Models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. In this study, we explore the ability of Large Language Models (LLMs) to perform legal tasks in the Indian landscape when social factors are involved. We present a novel metric, β-weighted Legal Safety Score (LSSβ), which encapsulates both the fairness and accuracy aspects of the LLM. We assess LLMs’ safety by considering their performance in the Binary Statutory Reasoning task and the fairness they exhibit with respect to various axes of disparities in Indian society. Task performance and fairness scores of LLaMA and LLaMA–2 models indicate that the proposed LSSβ metric can effectively determine the readiness of a model for safe usage in the legal sector. We also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. The finetuning procedures on LLaMA and LLaMA–2 models increase the LSSβ, improving their usability in the Indian legal domain. Our code is publicly released.
@inproceedings{Tripathi2024,title={InSaAF: Incorporating Safety Through Accuracy and Fairness - Are LLMs Ready for the Indian Legal Domain?},author={Tripathi, Yogesh and Donakanti, Raghav and Girhepuje, Sahil and Kavathekar, Ishan and Vedula, Bhaskara Hanuma and Krishnan, Gokul S. and Goel, Anmol and Goyal, Shreya and Ravindran, Balaraman and Kumaraguru, Ponnurangam},year={2024},booktitle={Legal Knowledge and Information Systems - JURIX 2024: The Thirty-seventh Annual Conference, Brno, Czech Republic, 11-13 December 2024},}
TMLR
Corrective Machine Unlearning
Shashwat Goel, Ameya Prabhu, Philip Torr, Ponnurangam Kumaraguru, and Amartya Sanyal
In Transactions of Machine Learning Research (TMLR), 2024
Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the Internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects including vulnerability to backdoored samples, systemic biases, and reduced accuracy on certain input domains. Realistically, not all manipulated training samples can be identified, and only a small, representative subset of the affected data can be flagged. We formalize Corrective Machine Unlearning as the problem of mitigating the impact of data affected by unknown manipulations on a trained model, only having identified a subset of the corrupted data. We demonstrate that the problem of corrective unlearning has significantly different requirements from traditional privacy-oriented unlearning. We find most existing unlearning methods, including retraining-from-scratch without the deletion set, require most of the manipulated data to be identified for effective corrective unlearning. However, one approach, Selective Synaptic Dampening, achieves limited success, unlearning adverse effects with just a small portion of the manipulated samples in our setting, which shows encouraging signs for future progress. We hope our work spurs research towards developing better methods for corrective unlearning and offers practitioners a new strategy to handle data integrity challenges arising from web-scale training.
@inproceedings{goel2024corrective,title={Corrective Machine Unlearning},author={Goel, Shashwat and Prabhu, Ameya and Torr, Philip and Kumaraguru, Ponnurangam and Sanyal, Amartya},year={2024},booktitle={Transactions of Machine Learning Research (TMLR)},}
MLC @ NeurIPS
LLM Vocabulary Compression for Low-Compute Environments
Sreeram Vennam, Anish R Joishy, and Ponnurangam Kumaraguru
In Workshop on Machine Learning and Compression, NeurIPS 2024, 2024
We present a method to compress the final linear layer of language models, reducing memory usage by up to 3.4x without significant performance loss. By grouping tokens based on Byte Pair Encoding (BPE) merges, we prevent materialization of the memory-intensive logits tensor. Evaluations on the TinyStories dataset show that our method performs on par with GPT-Neo and GPT2 while significantly improving throughput by up to 3x, making it suitable for low-compute environments.
@inproceedings{vennam2024llm,title={{LLM} Vocabulary Compression for Low-Compute Environments},author={Vennam, Sreeram and Joishy, Anish R and Kumaraguru, Ponnurangam},year={2024},booktitle={Workshop on Machine Learning and Compression, NeurIPS 2024},}
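The abstract above avoids materializing the full logits tensor by grouping tokens. One common way to get that effect is a two-stage factorization, p(token) = p(group) × p(token | group), which is sketched below. The contiguous-ID grouping, the fixed group size, and all names here are illustrative assumptions; the paper groups tokens by BPE merges, and its exact architecture is not reproduced.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_needed(vocab_size: int, group_size: int) -> int:
    """Logits computed per position: one per group plus one per slot in a group."""
    n_groups = -(-vocab_size // group_size)  # ceiling division
    return n_groups + group_size


def token_log_prob(group_logits: List[float], within_logits: List[float],
                   token_id: int, group_size: int) -> float:
    """log p(token) = log p(group) + log p(token | group).

    `within_logits` are the logits for the target token's group only, so the
    full vocabulary-sized logits vector is never built.
    """
    group, offset = divmod(token_id, group_size)
    p_group = softmax(group_logits)[group]
    p_within = softmax(within_logits)[offset]
    return math.log(p_group) + math.log(p_within)


# Hypothetical toy setup: vocabulary of 6 tokens split into 2 groups of 3.
GROUP_SIZE = 3
GROUP_LOGITS = [0.0, 1.0]
WITHIN_LOGITS = [[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]]

if __name__ == "__main__":
    total = sum(
        math.exp(token_log_prob(GROUP_LOGITS, WITHIN_LOGITS[t // GROUP_SIZE], t, GROUP_SIZE))
        for t in range(6)
    )
    print("probabilities sum to", total)
    print("logits per position:", logits_needed(6, GROUP_SIZE), "instead of 6")
```

The factorized probabilities still sum to one over the vocabulary, while the number of logits per position grows with the number of groups plus the group size and not with the vocabulary size.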
NeurIPS
Random Representations Outperform Online Continually Learned Representations
Ameya Prabhu, Shiven Sinha, Ponnurangam Kumaraguru, Philip Torr, Ozan Sener, and Puneet K. Dokania
In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
Continual learning has primarily focused on the issue of catastrophic forgetting and the associated stability-plasticity tradeoffs. However, little attention has been paid to the efficacy of continually learned representations, as representations are learned alongside classifiers throughout the learning process. Our primary contribution is empirically demonstrating that existing online continually trained deep networks produce inferior representations compared to simple pre-defined random transforms. Our approach projects raw pixels using a fixed random transform, approximating an RBF-Kernel initialized before any data is seen. We then train a simple linear classifier on top without storing any exemplars, processing one sample at a time in an online continual learning setting. This method, called RanDumb, significantly outperforms state-of-the-art continually learned representations across all standard online continual learning benchmarks. Our study reveals the significant limitations of representation learning, particularly in low-exemplar and online continual learning scenarios. Extending our investigation to popular exemplar-free scenarios with pretrained models, we find that training only a linear classifier on top of pretrained representations surpasses most continual fine-tuning and prompt-tuning strategies. Overall, our investigation challenges the prevailing assumptions about effective representation learning in online continual learning. Our code is available at https://github.com/drimpossible/RanDumb.
@inproceedings{prabhu2024random,title={Random Representations Outperform Online Continually Learned Representations},author={Prabhu, Ameya and Sinha, Shiven and Kumaraguru, Ponnurangam and Torr, Philip and Sener, Ozan and Dokania, Puneet K.},year={2024},booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},}
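The RanDumb abstract above describes a fixed random transform that approximates an RBF kernel, followed by a simple linear classifier trained one sample at a time without stored exemplars. The sketch below illustrates that recipe with random Fourier features and an online nearest-class-mean classifier. The class names, the default `gamma`, and the choice of nearest class mean are assumptions made for illustration; this is not the authors' implementation.

```python
import math
import random
from typing import Dict, List


class RandomFourierFeatures:
    """Fixed random map z(x) with z(x)·z(y) ≈ exp(-gamma * ||x - y||^2)."""

    def __init__(self, in_dim: int, out_dim: int, gamma: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        std = math.sqrt(2.0 * gamma)
        self.out_dim = out_dim
        # Drawn once, before any data is seen, and never updated.
        self.weights = [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(out_dim)]
        self.offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(out_dim)]

    def transform(self, x: List[float]) -> List[float]:
        scale = math.sqrt(2.0 / self.out_dim)
        return [
            scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.weights, self.offsets)
        ]


class OnlineNearestClassMean:
    """Linear classifier that keeps one running mean per class and no exemplars."""

    def __init__(self) -> None:
        self.means: Dict[int, List[float]] = {}
        self.counts: Dict[int, int] = {}

    def update(self, z: List[float], label: int) -> None:
        if label not in self.means:
            self.means[label] = list(z)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        mean = self.means[label]
        for i, zi in enumerate(z):
            mean[i] += (zi - mean[i]) / n  # incremental mean update

    def predict(self, z: List[float]) -> int:
        def sq_dist(mean: List[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(z, mean))
        return min(self.means, key=lambda label: sq_dist(self.means[label]))


if __name__ == "__main__":
    rff = RandomFourierFeatures(in_dim=2, out_dim=64, gamma=0.5, seed=0)
    clf = OnlineNearestClassMean()
    stream = [([0.0, 0.1], 0), ([0.2, -0.1], 0), ([3.0, 3.1], 1), ([2.9, 3.2], 1)]
    for x, y in stream:  # one pass, one sample at a time
        clf.update(rff.transform(x), y)
    print(clf.predict(rff.transform([0.1, 0.0])))
```

Because the feature map is frozen and only per-class running means are stored, nothing learned earlier in the stream is overwritten by later samples.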
MathAI @ NeurIPS
Wu’s Method Boosts Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry
Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu’s method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu’s method is surprisingly strong. Wu’s method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu’s method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 problems by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 fewer problems than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu’s method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu’s method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.
@inproceedings{sinha2024wus,title={Wu{\textquoteright}s Method Boosts Symbolic {AI} to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at {IMO} Geometry},author={Sinha, Shiven and Prabhu, Ameya and Kumaraguru, Ponnurangam and Bhat, Siddharth and Bethge, Matthias},year={2024},booktitle={The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24},}
ICML
Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, and Ponnurangam Kumaraguru
In Forty-first International Conference on Machine Learning, 2024
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model’s representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model’s representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformation of the neural language model’s representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.
@inproceedings{singhrepresentation,title={Representation Surgery: Theory and Practice of Affine Steering},author={Singh, Shashwat and Ravfogel, Shauli and Herzig, Jonathan and Aharoni, Roee and Cotterell, Ryan and Kumaraguru, Ponnurangam},year={2024},booktitle={Forty-first International Conference on Machine Learning},}
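The abstract above concerns affine steering functions applied to a model's representations. A minimal sketch of one simple affine steering function is a pure translation that shifts representations of the undesirable class so that their mean matches the mean of the desirable class. Whether this corresponds exactly to one of the paper's two least-squares-optimal functions is an assumption, and the helper names and toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_mean_shift(source: List[Vector], target: List[Vector]) -> Vector:
    """Translation that maps the source-class mean onto the target-class mean."""
    mu_source = mean(source)
    mu_target = mean(target)
    return [t - s for s, t in zip(mu_source, mu_target)]


def steer(h: Vector, shift: Vector) -> Vector:
    """Apply the affine map h -> h + shift (identity linear part)."""
    return [hi + si for hi, si in zip(h, shift)]


if __name__ == "__main__":
    # Hypothetical 2-d representations for two classes of text.
    undesirable = [[0.0, 0.0], [2.0, 0.0]]
    desirable = [[4.0, 1.0], [6.0, 3.0]]
    shift = fit_mean_shift(undesirable, desirable)
    steered = [steer(h, shift) for h in undesirable]
    print(shift, mean(steered))
```

A translation changes every representation by the same amount, so relative distances within the steered class are preserved while its mean moves to the target.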
ICML
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-Voss, Cort B Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam Alfred Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Ian Steneker, David Campbell, Brad Jokubaitis, Steven Basart, Stephen Fitz, Ponnurangam Kumaraguru, Kallol Krishna Karmakar, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, and Dan Hendrycks
In Forty-first International Conference on Machine Learning, 2024
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs.
@inproceedings{li2024the,title={The {WMDP} Benchmark: Measuring and Reducing Malicious Use with Unlearning},author={Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Mukobi, Gabriel and Helm-Burger, Nathan and Lababidi, Rassin and Justen, Lennart and Liu, Andrew Bo and Chen, Michael and Barrass, Isabelle and Zhang, Oliver and Zhu, Xiaoyuan and Tamirisa, Rishub and Bharathi, Bhrugu and Herbert-Voss, Ariel and Breuer, Cort B and Zou, Andy and Mazeika, Mantas and Wang, Zifan and Oswal, Palash and Lin, Weiran and Hunt, Adam Alfred and Tienken-Harder, Justin and Shih, Kevin Y. and Talley, Kemper and Guan, John and Steneker, Ian and Campbell, David and Jokubaitis, Brad and Basart, Steven and Fitz, Stephen and Kumaraguru, Ponnurangam and Karmakar, Kallol Krishna and Tupakula, Uday and Varadharajan, Vijay and Shoshitaishvili, Yan and Ba, Jimmy and Esvelt, Kevin M. and Wang, Alexandr and Hendrycks, Dan},year={2024},booktitle={Forty-first International Conference on Machine Learning},}
Graph neural networks (GNNs) are increasingly being used on sensitive graph-structured data, necessitating techniques for handling unlearning requests on the trained models, particularly node unlearning. However, unlearning nodes on GNNs is challenging due to the interdependence between the nodes in a graph. We compare MEGU, a state-of-the-art graph unlearning method, and SCRUB, a general unlearning method for classification, to investigate the efficacy of graph unlearning methods over traditional unlearning methods. Surprisingly, we find that SCRUB performs comparably or better than MEGU on random node removal and on removing an adversarial node injection attack. Our results suggest that 1) graph unlearning studies should incorporate general unlearning methods like SCRUB as baselines, and 2) there is a need for more rigorous behavioral evaluations that reveal the differential advantages of proposed graph unlearning methods. Our work, therefore, motivates future research into more comprehensive evaluations for assessing the true utility of graph unlearning algorithms.
@inproceedings{anonymous2024sanity,title={Sanity Checks for Evaluating Graph Unlearning},author={Kolipaka, Varshita and Sinha, Akshit and Mishra, Debangan and Kumar, Sumit and Arun, Arvindh and Goel, Shashwat and Kumaraguru, Ponnurangam},year={2024},booktitle={Third Conference on Lifelong Learning Agents - Workshop Track},}
KIL @ KDD
Towards Infusing Auxiliary Knowledge for Distracted Driver Detection
Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver’s pose. Specifically, we construct a unified framework that integrates scene graphs and driver pose information with the visual cues in video frames to create a holistic representation of the driver’s actions. Experimental results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information.
@inproceedings{balappanawar2024towards,title={Towards Infusing Auxiliary Knowledge for Distracted Driver Detection},author={Balappanawar, Ishwar and Chamoli, Ashmit and Wickramarachchi, Ruwan and Mishra, Aditya and Kumaraguru, Ponnurangam},year={2024},booktitle={Fourth Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference, Barcelona, Spain},}
LREC-COLING
SaGE: Evaluating Moral Consistency in Large Language Models
Vamshi Krishna Bonagiri, Sreeram Vennam, Priyanshul Govil, Ponnurangam Kumaraguru, and Manas Gaur
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024
Despite recent advancements showcasing the impressive capabilities of Large Language Models (LLMs) in conversational systems, we show that even state-of-the-art LLMs are morally inconsistent in their generations, questioning their reliability (and trustworthiness in general). Prior works in LLM evaluation focus on developing ground-truth data to measure accuracy on specific tasks. However, for moral scenarios that often lack universally agreed-upon answers, consistency in model responses becomes crucial for their reliability. To address this issue, we propose an information-theoretic measure called Semantic Graph Entropy (SaGE), grounded in the concept of “Rules of Thumb” (RoTs) to measure a model’s moral consistency. RoTs are abstract principles learned by a model and can help explain their decision-making strategies effectively. To this end, we construct the Moral Consistency Corpus (MCC), containing 50K moral questions, responses to them by LLMs, and the RoTs that these models followed. Furthermore, to illustrate the generalizability of SaGE, we use it to investigate LLM consistency on two popular datasets – TruthfulQA and HellaSwag. Our results reveal that task accuracy and consistency are independent problems, and there is a dire need to investigate these issues further.
@inproceedings{bonagiri-etal-2024-sage,title={{S}a{GE}: Evaluating Moral Consistency in Large Language Models},author={Bonagiri, Vamshi Krishna and Vennam, Sreeram and Govil, Priyanshul and Kumaraguru, Ponnurangam and Gaur, Manas},year={2024},booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},}
EMNLP Findings
Counter Turing Test (CT^2): Investigating AI-Generated Text Detection for Hindi - Ranking LLMs based on Hindi AI Detectability Index (ADI_hi)
Ishan Kavathekar, Anku Rani, Ashmit Chamoli, Ponnurangam Kumaraguru, Amit P. Sheth, and Amitava Das
In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024
The widespread adoption of Large Language Models (LLMs) and awareness around multilingual LLMs have raised concerns regarding the potential risks and repercussions linked to the misapplication of AI-generated text, necessitating increased vigilance. While these models are primarily trained for English, their extensive training on vast datasets covering almost the entire web equips them with capabilities to perform well in numerous other languages. AI-Generated Text Detection (AGTD) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by the emergence of techniques to bypass detection. In this paper, we report our investigation on AGTD for an Indic language, Hindi. Our major contributions are fourfold: i) we examine 26 LLMs to evaluate their proficiency in generating Hindi text, ii) we introduce the AI-generated news article in Hindi (AGhi) dataset, iii) we evaluate the effectiveness of five recently proposed AGTD techniques: ConDA, J-Guard, RADAR, RAIDAR and Intrinsic Dimension Estimation for detecting AI-generated Hindi text, iv) we propose the Hindi AI Detectability Index (ADI_hi), which shows a spectrum to understand the evolving landscape of eloquence of AI-generated text in Hindi.
@inproceedings{kavathekar-etal-2024-counter,title={Counter {T}uring Test ($CT^2$): Investigating {AI}-Generated Text Detection for {H}indi - Ranking {LLM}s based on {H}indi {AI} Detectability Index ($ADI_{hi}$)},author={Kavathekar, Ishan and Rani, Anku and Chamoli, Ashmit and Kumaraguru, Ponnurangam and Sheth, Amit P. and Das, Amitava},year={2024},booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},}
ICWSM
Put Your Money Where Your Mouth Is: Dataset and Analysis of Real World Habit Building Attempts
Hitkul Jangra, Rajiv Shah, and Ponnurangam Kumaraguru
In Proceedings of the International AAAI Conference on Web and Social Media, 2024
The pursuit of habit building is challenging, and most people struggle with it. Research on successful habit formation is mainly based on small human trials focusing on the same habit for all the participants, as conducting long-term heterogeneous habit studies can be logistically expensive. With the advent of self-help, there has been an increase in online communities and applications that are centered around habit building and logging. Habit building applications can provide large-scale data on real-world habit building attempts and unveil the commonalities among successful ones. We collect public data on stickk.com, which allows users to track progress on habit building attempts called commitments. A commitment can have an external referee, regular check-ins about the progress, and a monetary stake in case of failure. Our data consists of 742,923 users and 397,456 commitments. In addition to the dataset, rooted in theories like Fresh Start Effect, Accountability, and Loss Aversion, we ask questions about how commitment properties like start date, external accountability, monetary stake, and pursuing multiple habits together affect the odds of success. We found that people tend to start habits on temporal landmarks, but that does not affect the probability of their success. Practices like accountability and stakes are not often used but are strong determinants of success. Commitments of 6 to 8 weeks in length, weekly reporting with an external referee, and a monetary amount at stake tend to be most successful. Finally, around 40% of all commitments are attempted simultaneously with other goals. Simultaneous attempts of pursuing commitments may fail early, but if pursued through the initial phase, they are statistically more successful than building one habit at a time.
@inproceedings{article,title={Put Your Money Where Your Mouth Is: Dataset and Analysis of Real World Habit Building Attempts},author={Jangra, Hitkul and Shah, Rajiv and Kumaraguru, Ponnurangam},year={2024},booktitle={Proceedings of the International AAAI Conference on Web and Social Media},}
LREC-COLING
Multilingual Coreference Resolution in Low-resource South Asian Languages
Ritwik Mishra, Pooja Desur, Rajiv Ratn Shah, and Ponnurangam Kumaraguru
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024
Coreference resolution involves the task of identifying text spans within a discourse that pertain to the same real-world entity. While this task has been extensively explored in the English language, there has been a notable scarcity of publicly accessible resources and models for coreference resolution in South Asian languages. We introduce a Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages using off-the-shelf tools for translation and word-alignment. Nearly all of the predicted translations successfully pass a sanity check, and 75% of English references align with their predicted translations. Using multilingual encoders, two off-the-shelf coreference resolution models were trained on a concatenation of TransMuCoRes and a Hindi coreference resolution dataset with manual annotations. The best performing model achieved a score of 64 and 68 for LEA F1 and CoNLL F1, respectively, on our test-split of Hindi golden set. This study is the first to evaluate an end-to-end coreference resolution model on a Hindi golden set. Furthermore, this work underscores the limitations of current coreference evaluation metrics when applied to datasets with split antecedents, advocating for the development of more suitable evaluation metrics.
@inproceedings{mishra-etal-2024-multilingual,title={Multilingual Coreference Resolution in Low-resource {S}outh {A}sian Languages},author={Mishra, Ritwik and Desur, Pooja and Shah, Rajiv Ratn and Kumaraguru, Ponnurangam},year={2024},booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},}
The default approach to deal with the enormous size and limited accessibility of many Web and social media networks is to sample one or more subnetworks from a conceptually unbounded unknown network. Clearly, the extracted subnetworks will crucially depend on the sampling scheme. Motivated by studies of homophily and opinion formation, we propose a variant of snowball sampling designed to prioritize inclusion of entire cohesive communities rather than any kind of representativeness, breadth, or depth of coverage. The method is illustrated on a concrete example, and experiments on synthetic networks suggest that it behaves as desired.
@inproceedings{articlf,title={Tight Sampling in Unbounded Networks},author={Jaglan, Kshitijaa and Pindiprolu, Meher and Sharma, Triansh and Singam, Abhijeeth and Goyal, Nidhi and Kumaraguru, Ponnurangam and Brandes, Ulrik},year={2024},booktitle={Proceedings of the International AAAI Conference on Web and Social Media},}
AFME @ NeurIPS
Improving Bias Metrics in Vision-Language Models by Addressing Inherent Model Disabilities
L. Darur, S.K. Gouravarapu, S. Goel, and P. Kumaraguru
In Workshop on Algorithmic Fairness through the Lens of Metrics and Evaluation, NeurIPS 2024, 2024
The integration of Vision-Language Models (VLMs) into various applications has highlighted the importance of evaluating these models for inherent biases, especially along gender and racial lines. Traditional bias assessment methods in VLMs typically rely on accuracy metrics, assessing disparities in performance across different demographic groups. These methods, however, often overlook the impact of the model’s disabilities, such as a lack of spatial reasoning, which may skew the bias assessment. In this work, we propose an approach that systematically examines how current bias evaluation metrics account for the model’s limitations. We introduce two methods that circumvent these disabilities by integrating spatial guidance from textual and visual modalities. Our experiments aim to refine bias quantification by effectively mitigating the impact of spatial reasoning limitations, offering a more accurate assessment of biases in VLMs.
@inproceedings{darur2024improvingbiasmetrics,title={Improving Bias Metrics in Vision-Language Models by Addressing Inherent Model Disabilities},author={Darur, L. and Gouravarapu, S.K. and Goel, S. and Kumaraguru, P.},year={2024},booktitle={Workshop on Algorithmic Fairness through the Lens of Metrics and Evaluation, NeurIPS 2024},}
Jurix
Incorporating Safety through Accuracy and Fairness - Are LLMs ready for the Indian Legal Domain?
R. Donakanti, I. Kavathekar, A. Hanuma, P. Kumaraguru, Y. Tripathi, S. Girhepuje, G. Krishnan, S. Goyal, and R. Balaraman
Recent advancements in language technology and Artificial Intelligence have resulted in numerous Language Models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. In this study, we explore the ability of Large Language Models (LLMs) to perform legal tasks in the Indian landscape when social factors are involved. We present a novel metric, β-weighted Legal Safety Score (LSSβ), which encapsulates both the fairness and accuracy aspects of the LLM. We assess an LLM’s safety by considering its performance in the Binary Statutory Reasoning task and the fairness it exhibits with respect to various axes of disparity in Indian society. Task performance and fairness scores of LLaMA and LLaMA–2 models indicate that the proposed LSSβ metric can effectively determine the readiness of a model for safe usage in the legal sector. We also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. The finetuning procedures on LLaMA and LLaMA–2 models increase the LSSβ, improving their usability in the Indian legal domain. Our code is publicly released.
@inproceedings{donakanti2024incorporatingsafetythrough,title={Incorporating Safety through Accuracy and Fairness - Are LLMs ready for the Indian Legal Domain?},author={Donakanti, R. and Kavathekar, I. and Hanuma, Goel, A. and Kumaraguru, P. and Tripathi, Y. and Girhepuje, S. and Krishnan, G. and Goyal, S. and Balaraman, R.},year={2024},booktitle={JURIX 2024},}
ICWSM
Understanding Coordinated Communities through the Lens of Protest-Centric Narratives: A Case Study on #CAA Protest
N. Kumari, V. Agrawal, S. Chhatani, R. Sharma, A. B. Buduru, and P. Kumaraguru
In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), 2024
Social media platforms, particularly Twitter, have emerged as vital media for organizing online protests worldwide. During protests, users on social media share different narratives, often coordinated to share collective opinions and obtain widespread reach. In this paper, we focus on the communities formed during a protest and the collective narratives they share, using the protest on the enactment of the Citizenship Amendment Act (#CAA) by the Indian Government as a case study. Since #CAA protest led to divergent discourse in the country, we first classify the users into opposing stances, i.e., protesters (who opposed the Act) and counter-protesters (who supported it) in an unsupervised manner. Next, we identify the coordinated communities in the opposing stances and examine the collective narratives they shared. We use content-based metrics to identify user coordination, including hashtags, mentions, and retweets. Our results suggest mention as the strongest metric for coordination across the opposing stances. Next, we decipher the collective narratives in the opposing stances using an unsupervised narrative detection framework and found call-to-action, on-ground activity, grievances sharing, questioning, and skepticism narratives in the protest tweets. We analyze the strength of the different coordinated communities using network measures, and perform inauthentic activity analysis on the most coordinated communities on both sides. Our findings suggest that coordinated communities, which were highly inauthentic, showed the highest clustering coefficient towards a greater extent of coordination.
@inproceedings{kumari2024understandingcoordinatedcommunities,title={Understanding Coordinated Communities through the Lens of Protest-Centric Narratives: A Case Study on \#CAA Protest},author={Kumari, N. and Agrawal, V. and Chhatani, S. and Sharma, R. and Buduru, A. B. and Kumaraguru, P.},year={2024},booktitle={Proceedings of the International AAAI Conference on Web and Social Media (ICWSM)},}
SIGKDD
Television Discourse Decoded: Comprehensive Multimodal Analytics at Scale
A. Agarwal, P. Priyadarshi, S. Sinha, S. Gupta, K. Hitkul, and P. Kumaraguru
In 2024 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2024
In this paper, we tackle the complex task of analyzing televised debates, with a focus on a prime time news debate show from India. Previous methods, which often relied solely on text, fall short in capturing the multimodal essence of these debates. To address this gap, we introduce a comprehensive automated toolkit that employs advanced computer vision and speech-to-text techniques for large-scale multimedia analysis. Utilizing state-of-the-art computer vision algorithms and speech-to-text methods, we transcribe, diarize, and analyze thousands of YouTube videos of a prime-time television debate show in India. These debates are a central part of Indian media but have been criticized for compromised journalistic integrity and excessive dramatization. Our toolkit provides concrete metrics to assess bias and incivility, capturing a comprehensive multimedia perspective that includes text, audio utterances, and video frames. Our findings reveal significant biases in topic selection and panelist representation, along with alarming levels of incivility. This work offers a scalable, automated approach for future research in multimedia analysis, with profound implications for the quality of public discourse and democratic debate. To catalyze further research in this area, we also release the code, dataset collected and supplemental pdf.
@inproceedings{agarwal2024televisiondiscourse,title={Television Discourse Decoded: Comprehensive Multimodal Analytics at Scale},author={Agarwal, A. and Priyadarshi, P. and Sinha, S. and Gupta, S. and Hitkul, Garimella, K. and Kumaraguru, P.},year={2024},booktitle={2024 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},}
ICML
MiMiC: Minimally Modified Counterfactuals in the Representation Space
Singh S., Ravfogel S., Jonathan H., Aharoni R., Cotterell R., and Kumaraguru P.
In The Forty-first International Conference on Machine Learning, 2024
Language models often exhibit undesirable behaviors, such as gender bias or toxic language. Interventions in the representation space were shown effective in mitigating such issues by altering the LM behavior. We first show that two prominent intervention techniques, Linear Erasure and Steering Vectors, do not enable a high degree of control and are limited in expressivity. We then propose a novel intervention methodology for generating expressive counterfactuals in the representation space, aiming to make representations of a source class (e.g., “toxic”) resemble those of a target class (e.g., “non-toxic”). This approach, generalizing previous linear intervention techniques, utilizes a closed-form solution for the Earth Mover’s problem under Gaussian assumptions and provides theoretical guarantees on the representation space’s geometric organization. We further build on this technique and derive a nonlinear intervention that enables controlled generation. We demonstrate the effectiveness of the proposed approaches in mitigating bias in multiclass classification and in reducing the generation of toxic language, outperforming strong baselines.
@inproceedings{singh2024minimallymodified,title={MiMiC: Minimally Modified Counterfactuals in the Representation Space},author={S., Singh and S., Ravfogel and H., Jonathan and R., Aharoni and R., Cotterell and Kumaraguru P.},year={2024},booktitle={The Forty-first International Conference on Machine Learning},}
SNAM
GAME-ON: Graph Attention Network Based Multimodal Fusion For Fake News Detection
Dhawan M.*, Sharma S.*, Kadam A., Sharma R., and P. Kumaraguru
In Journal of Social Network Analysis and Mining, 2024
Social media in present times has a significant and growing influence. Fake news being spread on these platforms has a disruptive and damaging impact on our lives. Furthermore, as multimedia content improves the visibility of posts more than text data, it has been observed that often multimedia is being used for creating fake content. A plethora of previous multimodal-based work has tried to address the problem of modeling heterogeneous modalities in identifying fake content. However, these works have the following limitations: (1) inefficient encoding of inter-modal relations by utilizing a simple concatenation operator on the modalities at a later stage in a model, which might result in information loss; (2) training very deep neural networks with a disproportionate number of parameters on small but complex real-life multimodal datasets results in higher chances of overfitting. To address these limitations, we propose GAME-ON, a Graph Neural Network based end-to-end trainable framework that allows granular interactions within and across different modalities to learn more robust data representations for multimodal fake news detection. We use two publicly available fake news datasets, Twitter and Weibo, for evaluations. Our model outperforms on Twitter by an average of 11% and keeps competitive performance on Weibo, within a 2.6% margin, while using 65% fewer parameters than the best comparable state-of-the-art baseline.
@inproceedings{dhawan2024graphattention,title={GAME-ON: Graph Attention Network Based Multimodal Fusion For Fake News Detection},author={M., Dhawan and S., Sharma and A., Kadam and R., Sharma and Kumaraguru, P.},year={2024},booktitle={Journal of Social Network Analysis and Mining},}
WOAH @ NAACL
X-posing Free Speech: Examining the Impact of Moderation Relaxation on Online Social Networks
Arvindh Arun, Saurav Chhatani, Jisun An, and Ponnurangam Kumaraguru
In Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024), 2024
We investigate the impact of free speech and the relaxation of moderation on online social media platforms using Elon Musk’s takeover of Twitter as a case study. By curating a dataset of over 10 million tweets, our study employs a novel framework combining content and network analysis. Our findings reveal a significant increase in the distribution of certain forms of hate content, particularly targeting the LGBTQ+ community and liberals. Network analysis reveals the formation of cohesive hate communities facilitated by influential bridge users, with substantial growth in interactions hinting at increased hate production and diffusion. By tracking the temporal evolution of PageRank, we identify key influencers, primarily self-identified far-right supporters disseminating hate against liberals and woke culture. Ironically, embracing free speech principles appears to have enabled hate speech against the very concept of freedom of expression and free speech itself. Our findings underscore the delicate balance platforms must strike between open expression and robust moderation to curb the proliferation of hate online.
@inproceedings{arun-etal-2024-x,title={{X}-posing Free Speech: Examining the Impact of Moderation Relaxation on Online Social Networks},author={Arun, Arvindh and Chhatani, Saurav and An, Jisun and Kumaraguru, Ponnurangam},year={2024},booktitle={Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)},}
Thesis
Improving Content Quality for Online Professional Activities using Domain Specific Learning and Knowledge
@inproceedings{improvingcontentqualityforonlineprofessionalactivitiesusingdomainspecificlearningandknowledge,title={Improving Content Quality for Online Professional Activities using Domain Specific Learning and Knowledge},author={Goyal, N.},year={2024},booktitle={Ph.D. Thesis, IIIT-Delhi},}
Thesis
Sampling cohesive communities in unbounded networks
@inproceedings{modelingonlineuserinteractionsandtheirofflineeffectsonsociotechnicalplatforms,title={Modeling Online User Interactions and their Offline Effects on Socio-Technical Platforms},author={Hitkul},year={2024},booktitle={Ph.D. Thesis, IIIT-Delhi},}
Thesis
New Frontiers in Machine Unlearning
S. Goel
In MS in Computer Science by Research, IIIT Hyderabad, 2024
@inproceedings{newfrontiersinmachineunlearning,title={New Frontiers in Machine Unlearning},author={Goel, S.},year={2024},booktitle={MS in Computer Science by Research, IIIT Hyderabad},}
Thesis
Towards Trustworthy Digital Ecosystem: From Fair Representation Learning to Fraud Detection
A. Arun
In MS in Computer Science by Research at IIIT Hyderabad, 2024
@inproceedings{towardstrustworthydigitalecosystemfromfairrepresentationlearningtofrauddetection,title={Towards Trustworthy Digital Ecosystem: From Fair Representation Learning to Fraud Detection},author={Arun, A.},year={2024},booktitle={MS in Computer Science by Research at IIIT Hyderabad},}
Thesis
Understanding Online Protests: Unveiling Strategies, Collective Narratives, and Harmful Behaviors
Prior work has shown that pretrained language models often make incorrect predictions for negated inputs. The reason for this behaviour has remained unclear. It has been argued that since language models (LMs) don’t change their predictions about factual propositions under negation, they might not detect negation. We show encoder LMs do detect negation as their representations across layers reliably distinguish negated inputs from non-negated inputs, and when negation leads to contradictions. However, probing experiments show that these LMs indeed don’t use negation when evaluating whether a factual statement is true, even when fine-tuned with the objective of changing outputs on negated sentences (Hosseini et al., 2021). We hypothesize about why pretrained LMs are inconsistent under negation: when the statement could refer to multiple ground entities with conflicting properties, negation may not entail a change in output. This means negation minimal pairs in different training samples can have the same completion in pretraining corpora. We argue pretraining may not provide enough signal to learn the distribution of ground referents a token could have, confusing the LM on how to handle negation.
@inproceedings{singhprobing,title={Probing Negation in Language Models},author={Singh, Shashwat and Goel, Shashwat and Vaduguru, Saujas and Kumaraguru, Ponnurangam},year={2023},booktitle={Workshop on Representation Learning for NLP},}
ASONAM
Together Apart: Decoding Support Dynamics in Online COVID-19 Communities
The COVID-19 pandemic that broke out globally in December 2019 put us all in an unprecedented situation. Social media became a vital source of support and information during the pandemic, as physical interactions were limited by people staying at home. This paper investigates support dynamics and user commitment in an online COVID-19 community on Reddit. We define various support classes and observe them along with user behavior and temporal phases for a coherent understanding of the community. We perform survival analysis using Cox Regression to identify factors influencing a user’s commitment to the community. People seeking more emotional and informational support while they are COVID-positive stay longer in the community. Surprisingly, people who give more support in their early phases are less likely to stay. Additionally, contrary to common belief, our findings show that receiving emotional and informational support has little effect on users’ longevity in the community. Our results lead to a better understanding of user dynamics related to community support and can directly impact moderators and platform owners in designing community guidelines and incentive structures.
@inproceedings{10.1145/3625007.3627297,title={Together Apart: Decoding Support Dynamics in Online COVID-19 Communities},author={Jangid, Hitkul and Pandey, Tanisha and Singhal, Sonali and Kandhari, Pranjal and Tomar, Aryamann and Kumaraguru, Ponnurangam},year={2023},booktitle={Proceedings of the 2023 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining},}
JURIX
CiteCaseLAW: Citation Worthiness Detection in Caselaw for Legal Assistive Writing
M. Khatri, R. Sheik, P. Wadhwa, G. Satija, Y. Kumar, R. Shah, and P. Kumaraguru
In legal document writing, one of the key elements is properly citing the case laws and other sources to substantiate claims and arguments. Understanding the legal domain and identifying appropriate citation context or cite-worthy sentences are challenging tasks that demand expensive manual annotation. The presence of jargon, language semantics, and high domain specificity makes legal language complex, making any associated legal task hard for automation. The current work focuses on the problem of citation-worthiness identification. It is designed as the initial step in today’s citation recommendation systems to lighten the burden of extracting an adequate set of citation contexts. To accomplish this, we introduce a labeled dataset of 178M sentences for citation-worthiness detection in the legal domain from the Caselaw Access Project (CAP). The performance of various deep learning models was examined on this novel dataset. The domain-specific pre-trained model tends to outperform other models, with an 88% F1-score for the citation-worthiness detection task.
@inproceedings{khatri2023citationworthiness,title={CiteCaseLAW: Citation Worthiness Detection in Caselaw for Legal Assistive Writing},author={Khatri, M. and Sheik, R. and Wadhwa, P. and Satija, G. and Kumar, Y. and Shah, R. and Kumaraguru, P.},year={2023},booktitle={JURIX 2023},}
KDD @ KIL
Representation Learning for Identifying Depression Causes in Social Media
Govil P., Krishna Bonagiri V., Garg M., and Kumaraguru P.
In Proceedings of the Third ACM SIGKDD Workshop on Knowledge-infused Learning (KDD KiL 2023), 2023
Social media provides a supportive and anonymous environment for discussing mental health issues, including depression. Existing research on identifying the cause of depression focuses primarily on improving classifier models, while neglecting the importance of learning better data representations. To address this gap, we introduce an architecture that enhances the identification of the cause of depression by learning improved data representations. Our work enables a deeper interpretation of the cause of depression in social media contexts, emphasizing the significance of effective representation learning for this task. Our work can act as a foundation for self-help applications in the field of mental health.
@inproceedings{govil2023representationlearningfor,title={Representation Learning for Identifying Depression Causes in Social Media},author={Govil, P. and Krishna Bonagiri, V. and Garg, M. and Kumaraguru, P.},year={2023},booktitle={Proceedings of the Third ACM SIGKDD Workshop on Knowledge-infused Learning (KDD KiL 2023)},}
WWW
Social Re-Identification Assisted RTO Detection for E-Commerce
A. Hatkul, S. Saha, S. Banerjee, M. Chelliah, and P. Kumaraguru
In Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), 2023
E-commerce features like easy cancellations, returns, and refunds can be exploited by bad actors or uninformed customers, leading to revenue loss for the organization. One such problem faced by e-commerce platforms is Return To Origin (RTO), where the user cancels an order while it is in transit for delivery. In such a scenario, the platform faces logistics and opportunity costs. Traditionally, models trained on historical trends are used to predict the propensity of an order becoming RTO. Sociology literature has highlighted clear correlations between socio-economic indicators and users’ tendency to exploit systems to gain financial advantage. Social media profiles have information about location, education, and profession which have been shown to be an estimator of socio-economic condition. We believe combining social media data with e-commerce information can lead to improvements in a variety of tasks like RTO, recommendation, fraud detection, and credit modeling. In our proposed system, we find the public social profile of an e-commerce user and extract socio-economic features. Internal data fused with extracted social features are used to train an RTO order detection model. Our system demonstrates a performance improvement in RTO detection of 3.1% and 19.9% on precision and recall, respectively. Our system directly impacts the bottom-line revenue and shows the applicability of social re-identification in e-commerce.
@inproceedings{hatkul2023socialassisted,title={Social Re-Identification Assisted RTO Detection for E-Commerce},author={Hatkul, K, A. and Saha, S. and Banerjee, S. and Chelliah, M. and Kumaraguru, P.},year={2023},booktitle={Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion)},}
ICWSM
Effect of Feedback on Drug Consumption Disclosures on Social Media
Hitkul, RR. Shah, and P. Kumaraguru
In Proceedings of the 17th International AAAI Conference on Web and Social Media (ICWSM ’23), 2023
Deaths due to drug overdose in the US have doubled in the last decade. Drug-related content on social media has also exploded in the same time frame. The pseudo-anonymous nature of social media platforms enables users to discourse about taboo and sometimes illegal topics like drug consumption. User-generated content (UGC) about drugs on social media can be used as an online proxy to detect offline drug consumption. UGC also gets exposed to the praise and criticism of the community. The law of effect proposes that positive reinforcement on an experience can incentivize the users to engage in the experience repeatedly. Therefore, we hypothesize that positive community feedback on a user’s online drug consumption disclosure will increase the probability of the user doing an online drug consumption disclosure post again. To this end, we collect data from 10 drug-related subreddits. First, we build a deep learning model to classify UGC as indicative of drug consumption offline or not, and analyze the extent of such activities. Further, we use matching-based causal inference techniques to unravel community feedback’s effect on users’ future drug consumption behavior. We discover that 84% of posts and 55% of comments on drug-related subreddits indicate real-life drug consumption. Users who get positive feedback generate up to two times more drug consumption content in the future. Finally, we conducted an anonymous user study on drug-related subreddits to compare members’ opinions with our experimental findings and show that users tend to underestimate the effect community peers can have on their decision to interact with drugs.
@inproceedings{hitkul2023effectoffeedback,title={Effect of Feedback on Drug Consumption Disclosures on Social Media},author={Hitkul and Shah, RR. and Kumaraguru, P.},year={2023},booktitle={Proceedings of the 17th International AAAI Conference on Web and Social Media (ICWSM '23)},}
SAIL
Are Models Trained on Indian Legal Data Fair?
S. Girhepuje, A. Goel, G. Krishnan, S. Goyal, S. Pandey, P. Kumaraguru, and B. Ravindran
In 3rd Symposium on Artificial Intelligence and Law (SAIL), 2023
Recent advances and applications of language technology and artificial intelligence have enabled much success across multiple domains like law, medicine, and mental health. AI-based Language Models, like Judgement Prediction, have recently been proposed for the legal sector. However, these models are rife with encoded social biases picked up from the training data. While bias and fairness have been studied across NLP, most studies primarily locate themselves within a Western context. In this work, we present an initial investigation of fairness from the Indian perspective in the legal domain. We highlight the propagation of learnt algorithmic biases in the bail prediction task for models trained on Hindi legal documents. We evaluate the fairness gap using demographic parity and show that a decision tree model trained for the bail prediction task has an overall fairness disparity of 0.237 between input features associated with Hindus and Muslims. Additionally, we highlight the need for further research and studies in the avenues of fairness/bias in applying AI in the legal sector with a specific focus on the Indian context.
@inproceedings{girhepuje2023aremodelstrained,title={Are Models Trained on Indian Legal Data Fair?},author={Girhepuje, S. and Goel, A. and Krishnan, G. and Goyal, S. and Pandey, S. and Kumaraguru, P. and Ravindran, B.},year={2023},booktitle={3rd Symposium on Artificial Intelligence and Law (SAIL)},}
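The abstract above reports a fairness gap measured with demographic parity. As a rough illustration of what such a gap is, the sketch below computes the difference in positive-prediction rates between groups. It is not the authors' code: the function name `demographic_parity_gap`, the group labels, and the toy predictions are all hypothetical, and the paper's exact definition of the disparity may differ.

```python
from collections import defaultdict


def demographic_parity_gap(predictions, groups):
    """Return (gap, rates) for binary predictions split by group.

    rates maps each group to its share of positive (== 1) predictions;
    gap is the largest rate minus the smallest rate.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = {g: positives[g] / totals[g] for g in totals}
    gap = max(rates.values()) - min(rates.values())
    return gap, rates


# Toy data, invented purely for illustration (1 = positive prediction).
toy_predictions = [1, 1, 0, 1, 0, 0, 1, 0]
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap, rates = demographic_parity_gap(toy_predictions, toy_groups)
print(rates, gap)
```

A gap of 0 would mean every group receives positive predictions at the same rate; larger values indicate a larger disparity.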
EACL
JobXMLC: EXtreme Multi-Label Classification of Job Skills with Graph Neural Networks
N. Goyal, J. Kalra, C. Sharma, R. Mutharaju, N. Sachdeva, and P. Kumaraguru
In Findings of the Association for Computational Linguistics: EACL 2023, 2023
Writing a good job description is an important step in the online recruitment process to hire the best candidates. Most recruiters forget to include some relevant skills in the job description. These missing skills affect the performance of recruitment tasks such as job suggestions, job search, candidate recommendations, etc. Existing approaches are limited to contextual modelling, do not exploit inter-relational structures like job-job and job-skill relationships, and are not scalable. In this paper, we exploit these structural relationships using a graph-based approach. We propose a novel skill prediction framework called JobXMLC, which uses graph neural networks with skill attention to predict missing skills using job descriptions. JobXMLC enables joint learning over a job-skill graph consisting of 22.8K entities (jobs and skills) and 650K relationships. We experiment with real-world recruitment datasets to evaluate our proposed approach. We train JobXMLC on 20,298 job descriptions and 2,548 skills within 30 minutes on a single GPU machine. JobXMLC outperforms the state-of-the-art approaches by 6% in precision and 3% in recall. JobXMLC is 18X faster for the training task and up to 634X faster in skill prediction on benchmark datasets, enabling JobXMLC to scale up on larger datasets.
@inproceedings{goyal2023extreme,title={JobXMLC: EXtreme Multi-Label Classification of Job Skills with Graph Neural Networks},author={Goyal, N. and Kalra, J. and Sharma, C. and Mutharaju, R. and Sachdeva, N. and Kumaraguru, P.},year={2023},booktitle={Findings of the Association for Computational Linguistics: EACL 2023},}
ECIR
Towards Effective Paraphrasing for Information Disguise
A. Agarwal, S. Gupta, V. Bonagiri, M. Gaur, J. Reagle, and P. Kumaraguru
In The 45th European Conference in Information Retrieval (ECIR 2023), 2023
Information Disguise (ID), a part of computational ethics in Natural Language Processing (NLP), is concerned with best practices of textual paraphrasing to prevent the non-consensual use of authors’ posts on the Internet. Research on ID becomes important when authors’ written online communication pertains to sensitive domains, e.g., mental health. Over time, researchers have utilized AI-based automated word spinners (e.g., SpinRewriter, WordAI) for paraphrasing content. However, these tools fail to satisfy the purpose of ID as their paraphrased content still leads to the source when queried on search engines. There is limited prior work on judging the effectiveness of paraphrasing methods for ID on search engines or their proxies, neural retriever (NeurIR) models. We propose a framework where, for a given sentence from an author’s post, we perform iterative perturbation on the sentence in the direction of paraphrasing with an attempt to confuse the search mechanism of a NeurIR system when the sentence is queried on it. Our experiments involve the subreddit ’r/AmItheAsshole’ as the source of public content and Dense Passage Retriever as a NeurIR system-based proxy for search engines. Our work introduces a novel method of phrase-importance rankings using perplexity scores and involves multi-level phrase substitutions via beam search. Our multi-phrase substitution scheme succeeds in disguising sentences 82% of the time and hence takes an essential step towards enabling researchers to disguise sensitive content effectively before making it public. We also release the code of our approach.
@inproceedings{agarwal2023towardseffectiveparaphrasing,title={Towards Effective Paraphrasing for Information Disguise},author={Agarwal, A. and Gupta, S. and Bonagiri, V. and Gaur, M. and Reagle, J. and Kumaraguru, P.},year={2023},booktitle={The 45th European Conference in Information Retrieval (ECIR 2023)},}
COMAD
Warning: It’s a scam!! Towards understanding the Employment Scams using Knowledge Graphs
N. Goyal, R. Mamidi, N. Sachdeva, and P. Kumaraguru
In ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD 2023) YRS track, 2023
Employment scams, such as scapegoat positions, clickbait, and non-existing jobs, are among the top five scams registered over online platforms. Generally, scam complaints contain heterogeneous information (money, location, employment type, organization, email, and phone number), which can provide critical insights for appropriate interventions to avoid scams. Despite substantial efforts to analyze employment scams, integrating relevant scam-related information in structured form remains unexplored. In this work, we extract this information and construct a large-scale Employment Scam Knowledge Graph consisting of 0.1M entities and 0.2M relationships. Our findings include discovering different modes of employment scams, entities, and relationships among entities to alert job seekers. We plan to extend this work by utilizing a knowledge graph to identify and avoid potential scams in the future.
@inproceedings{goyal2023a,title={Warning: It’s a scam!! Towards understanding the Employment Scams using Knowledge Graphs},author={Goyal, N. and Mamidi, R. and Sachdeva, N. and Kumaraguru, P.},year={2023},booktitle={ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD 2023) YRS track},}
WACV
A Suspect Identification Framework using Contrastive Relevance Feedback
D. Gupta, A. Saini, S. Bhagat, S. Uppal, R. Jain, D. Bhasin, P. Kumaraguru, and R. Shah
In Winter Conference on Applications of Computer Vision (WACV) 2023, 2023
Suspect Identification is one of the most pivotal aspects of a forensic and criminal investigation. A significant amount of time and skill is devoted to creating sketches for it and requires a fair amount of recollections from the witness to provide a useful sketch. We devise a method that aims to automate the process of suspect identification and model this problem by iteratively retrieving images from feedback provided by the user. Compared to standard image retrieval tasks, interactive facial image retrieval is specifically more challenging due to the high subjectivity involved in describing a person’s facial attributes and appropriately evolving with the preferences put forward by the user. Our method uses a relatively simpler form of supervision by utilizing the user’s feedback to label images as either similar or dissimilar to their mental image of the suspect based on which we propose a loss function using the contrastive learning paradigm that is optimized in an online fashion. We validate the efficacy of our proposed approach using a carefully designed testbed to simulate user feedback and a large-scale user study. We empirically show that our method iteratively improves personalization, leading to faster convergence and enhanced recommendation relevance, thereby improving user satisfaction. Our proposed framework is being designed for real-time use in the metropolitan crime investigation department, and thus is also equipped with a user-friendly web interface with a real-time experience for suspect retrieval.
@inproceedings{gupta2023asuspectidentification,title={A Suspect Identification Framework using Contrastive Relevance Feedback},author={Gupta, D. and Saini, A. and Bhagat, S. and Uppal, S. and Jain, R. and Bhasin, D. and Kumaraguru, P. and Shah, R.},year={2023},booktitle={Winter Conference on Applications of Computer Vision (WACV) 2023},}
BDA
Explaining Finetuned Transformers on Hate Speech Predictions Using Layerwise Relevance Propagation
Ritwik Mishra, Ajeet Yadav, Rajiv Shah, and Ponnurangam Kumaraguru
In Proceedings of the 11th International Conference on Big Data and Artificial Intelligence, 2023
Explainability of model predictions has become imperative for architectures that involve fine-tuning of a pretrained transformer encoder for a downstream task such as hate speech detection. In this work, we compare the explainability capabilities of three post-hoc methods on the HateXplain benchmark with different encoders. Our research is the first work to evaluate the effectiveness of Layerwise Relevance Propagation (LRP) as a post-hoc method for fine-tuned transformer architectures used in hate speech detection. The analysis revealed that LRP tends to perform less effectively than the other two methods across various explainability metrics. A random rationale generator was found to be providing a better interpretation than the LRP method. Upon further investigation, it was discovered that the LRP method assigns higher relevance scores to the initial tokens of the input text because fine-tuned encoders tend to concentrate the text information in the embeddings corresponding to early tokens of the text. Therefore, our findings demonstrate that LRP relevance values at the input of fine-tuning layers are not a good representative of the rationales behind the predicted score.
@inproceedings{inbook,title={Explaining Finetuned Transformers on Hate Speech Predictions Using Layerwise Relevance Propagation},author={Mishra, Ritwik and Yadav, Ajeet and Shah, Rajiv and Kumaraguru, Ponnurangam},year={2023},booktitle={Proceedings of the 11th International Conference on Big Data and Artificial Intelligence},}
ECAI
CAFIN: Centrality Aware Fairness Inducing IN-Processing for Unsupervised Representation Learning on Graphs
Unsupervised Representation Learning on graphs is gaining traction due to the increasing abundance of unlabelled network data and the compactness, richness, and usefulness of the representations generated. In this context, the need to consider fairness and bias constraints while generating the representations has been well-motivated and studied to some extent in prior works. One major limitation of most of the prior works in this setting is that they do not aim to address the bias generated due to connectivity patterns in the graphs, such as varied node centrality, which leads to a disproportionate performance across nodes. In our work, we aim to address this issue of mitigating bias due to inherent graph structure in an unsupervised setting. To this end, we propose CAFIN, a centrality-aware fairness-inducing framework that leverages the structural information of graphs to tune the representations generated by existing frameworks. We deploy it on GraphSAGE (a popular framework in this domain) and showcase its efficacy on two downstream tasks - Node Classification and Link Prediction. Empirically, CAFIN consistently reduces the performance disparity across popular datasets (varying from 18 to 80% reduction in performance disparity) from various domains while incurring only a minimal cost of fairness.
@inproceedings{cafin,title={{CAFIN}: {C}entrality {A}ware {F}airness Inducing {IN}-Processing for Unsupervised Representation Learning on Graphs},author={Arun, Arvindh and Aanegola, Aakash and Agrawal, Amul and Narayanam, Ramasuri and Kumaraguru, Ponnurangam},year={2023},booktitle={Proceedings of the 26th European Conference on Artificial Intelligence},}
ACL
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents
Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury, Gael de Chalendar, Anmol Goel, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Nasredine Semmar, Sina Semnani, Jiwon Seo, Vivek Seshadri, Manish Shrivastava, Michael Sun, Aditya Yadavalli, Chaobin You, Deyi Xiong, and Monica Lam
In Findings of the Association for Computational Linguistics: ACL 2023, 2023
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
@inproceedings{moradshahi-etal-2023-x,title={{X}-{R}i{SAWOZ}: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents},author={Moradshahi, Mehrad and Shen, Tianhao and Bali, Kalika and Choudhury, Monojit and de Chalendar, Gael and Goel, Anmol and Kim, Sungkyun and Kodali, Prashant and Kumaraguru, Ponnurangam and Semmar, Nasredine and Semnani, Sina and Seo, Jiwon and Seshadri, Vivek and Shrivastava, Manish and Sun, Michael and Yadavalli, Aditya and You, Chaobin and Xiong, Deyi and Lam, Monica},year={2023},booktitle={Findings of the Association for Computational Linguistics: ACL 2023},}
WASSA @ ACL
PrecogIIITH@WASSA2023: Emotion Detection for Urdu-English Code-mixed Text
Bhaskara Hanuma Vedula, Prashant Kodali, Manish Shrivastava, and Ponnurangam Kumaraguru
In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, 2023
Code-mixing refers to the phenomenon of using two or more languages interchangeably within a speech or discourse context. This practice is particularly prevalent on social media platforms, and determining the embedded affects in a code-mixed sentence remains a challenging problem. In this submission, we describe our system for the WASSA 2023 Shared Task on Emotion Detection in English-Urdu code-mixed text. In our system, we implement a multiclass emotion detection model with a label space of 11 emotions. Samples are code-mixed English-Urdu text, where Urdu is written in romanised form. Our submission is limited to one of the subtasks, Multi Class classification, and we leverage transformer-based Multilingual Large Language Models (MLLMs), XLM-RoBERTa and Indic-BERT. We fine-tune MLLMs on the released data splits, with and without pre-processing steps (translation to English), for classifying texts into the appropriate emotion category. Our methods did not surpass the baseline, and our submission is ranked sixth overall.
@inproceedings{vedula-etal-2023-precogiiith,title={{P}recog{IIITH}@{WASSA}2023: Emotion Detection for {U}rdu-{E}nglish Code-mixed Text},author={Vedula, Bhaskara Hanuma and Kodali, Prashant and Shrivastava, Manish and Kumaraguru, Ponnurangam},year={2023},booktitle={Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, {\&} Social Media Analysis},}
Thesis
Identify, Inspect and Intervene Multimodal Fake News
@inproceedings{beyondthesurfaceacomputationalexplorationoflinguisticambiguity,title={Beyond the Surface: A Computational Exploration of Linguistic Ambiguity},author={Goel, A.},year={2023},booktitle={MS in Computer Science by Research at IIIT Hyderabad},}
Thesis
Modeling Online User Interactions and their Offline effects on Socio-Technical Platforms
@inproceedings{modelingonlineuserinteractionsandtheirofflineeffectsonsociotechnicalplatformt,title={Modeling Online User Interactions and their Offline effects on Socio-Technical Platforms},author={Hitkul},year={2023},booktitle={Ph.D. Comprehensive Report},}
2022
ASONAM
The Pursuit of Being Heard: An Unsupervised Approach to Narrative Detection in Online Protest
Protests and mass mobilization are scarce; however, they may lead to dramatic outcomes when they occur. Social media such as Twitter has become a center point for the organization and development of online protests worldwide. It becomes crucial to decipher various narratives shared during an online protest to understand people’s perceptions. In this work, we propose an unsupervised clustering-based framework to understand the narratives present in a given online protest. Through a comparative analysis of tweet clusters in 3 protests around government policy bills, we contribute novel insights about narratives shared during an online protest. Across case studies of government policy-induced online protests in India and the United Kingdom, we found familiar mass mobilization narratives across protests. We found reports of on-ground activities and call-to-action for people’s participation narrative clusters in all three protests under study. We also found protest-centric narratives in different protests, such as skepticism around the topic. The results from our analysis can be used to understand and compare people’s perceptions of future mass mobilizations.
@inproceedings{neha2022thepursuitof,title={The Pursuit of Being Heard: An Unsupervised Approach to Narrative Detection in Online Protest},author={Neha, K. and Agrawal, V. and Buduru, A. and Kumaraguru, P.},year={2022},booktitle={ASONAM 2022},}
ASONAM
Understanding the Impact of Awards on Award Winners and the Community on Reddit
A. Tulasi, M. Mondal, A. Buduru, and P. Kumaraguru
Non-financial incentives in the form of awards often act as a driver of positive reinforcement and elevation of social status in the offline world. The elevated social status results in people becoming more active, aligning to a change in the communities’ expectations. However, the impact in terms of longevity of social influence and community acceptance of leaders of these incentives in the form of awards are not well-understood in the online world. Our work aims to shed light on the impact of these awards on the awardee and the community. We focus on three large subreddits with a snapshot of 219K posts and 5.8 million comments contributed by 88K Reddit users who received 14,146 awards. Our work establishes that the behaviour of awardees changes statistically significantly for a short time after getting an award; however, the change is ephemeral since the awardees return to their pre-award behaviour within days. Additionally, via a user survey, we identified a long-lasting impact of awards: we found that the community’s stance softened towards awardees.
@inproceedings{tulasi2022understandingtheimpact,title={Understanding the Impact of Awards on Award Winners and the Community on Reddit},author={Tulasi, A. and Mondal, M. and Buduru, A. and Kumaraguru, P.},year={2022},booktitle={ASONAM 2022},}
SocInfo
“The Times They Are-a-Changin”: The Effect of the Covid-19 Pandemic on Online Music Sharing in India
Kamble T., Desur P., Krause A., Kumaraguru P., and Alluri V.
In Proceedings of the 13th International Conference on Social Informatics (SocInfo) 2022, 2022
Music sharing trends have been shown to change during times of socio-economic crises. Studies have also shown that music can act as a social surrogate, helping to significantly reduce loneliness by acting as an empathetic friend. We explored these phenomena through a novel study of online music sharing during the Covid-19 pandemic in India. We collected tweets from the popular social media platform Twitter during India’s first and second wave of the pandemic (n = 1,364). We examined the different ways in which music was able to accomplish the role of a social surrogate via analyzing tweet text using Natural Language Processing techniques. Additionally, we analyzed the emotional connotations of the music shared through the acoustic features and lyrical content and compared the results between pandemic and pre-pandemic times. It was observed that the role of music shifted to a more community focused function rather than tending to a more self-serving utility. Results demonstrated that people shared music during the Covid-19 pandemic which had lower valence and shared songs with topics that reflected turbulent times such as Hardship and Exclusion when compared to songs shared during pre-Covid times. The results are further discussed in the context of individualistic versus collectivistic cultures.
@inproceedings{kamble2022timesthey,title={“The Times They Are-a-Changin”: The Effect of the Covid-19 Pandemic on Online Music Sharing in India},author={Kamble, T. and Desur, P. and Krause, A. and Kumaraguru, P. and Alluri, V.},year={2022},booktitle={Proceedings of the 13th International Conference on Social Informatics (SocInfo) 2022},}
NAACL
Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts
Gupta S.*, Agarwal A.*, Gaur M., Roy K., Narayanam V., Kumaraguru P., and Sheth A.
In Eighth Workshop on Computational Linguistics and Clinical Psychology: Mental Health in the Face of Change, NAACL’22, 2022
Conversational Agents (CAs) powered with deep language models (DLMs) have shown tremendous promise in the domain of mental health. Prominently, the CAs have been used to provide informational or therapeutic services (e.g., cognitive behavioral therapy) to patients. However, the utility of CAs to assist in mental health triaging has not been explored in the existing work as it requires a controlled generation of follow-up questions (FQs), which are often initiated and guided by the mental health professionals (MHPs) in clinical settings. In the context of ‘depression’, our experiments show that DLMs coupled with process knowledge in a mental health questionnaire generate 12.54% and 9.37% better FQs based on similarity and longest common subsequence matches to questions in the PHQ-9 dataset respectively, when compared with DLMs without process knowledge support. Despite coupling with process knowledge, we find that DLMs are still prone to hallucination, i.e., generating redundant, irrelevant, and unsafe FQs. We demonstrate the challenge of using existing datasets to train a DLM for generating FQs that adhere to clinical process knowledge. To address this limitation, we prepared an extended PHQ-9 based dataset, PRIMATE, in collaboration with MHPs. PRIMATE contains annotations regarding whether a particular question in the PHQ-9 dataset has already been answered in the user’s initial description of the mental health condition. We used PRIMATE to train a DLM in a supervised setting to identify which of the PHQ-9 questions can be answered directly from the user’s post and which ones would require more information from the user. Using performance analysis based on MCC scores, we show that PRIMATE is appropriate for identifying questions in PHQ-9 that could guide generative DLMs towards controlled FQ generation (with minimal hallucination) suitable for aiding triaging.
The dataset created as a part of this research can be obtained from https://github.com/primate-mh/Primate2022
@inproceedings{gupta2022learningtoautomate,title={Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts},author={Gupta, S. and Agarwal, A. and Gaur, M. and Roy, K. and Narayanam, V. and Kumaraguru, P. and Sheth, A.},year={2022},booktitle={Eighth Workshop on Computational Linguistics and Clinical Psychology: Mental Health in the Face of Change, NAACL'22},}
IJCNN
Ask It Right! Identifying Low-Quality questions on Community Question Answering Services
U.* Arora, N.* Goyal, A. Goel, N. Sachdeva, and P. Kumaraguru
In Proceedings of International Joint Conference on Neural Networks (IJCNN-2022), 2022
Stack Overflow is a Community Question Answering service that attracts millions of users to seek answers to their questions. Maintaining high-quality content is necessary for relevant question retrieval, question recommendation, and enhancing the user experience. Manually removing low-quality content from the platform is time-consuming and challenging for site moderators. Thus, it is imperative to assess the content quality by automatically detecting and ‘closing’ the low-quality questions. Previous works have explored lexical, community-based, vote-based, and style-based features to detect low-quality questions. These approaches are limited to writing styles, textual, and handcrafted features. However, these features fall short in understanding semantic features and capturing the implicit relationships between tags and questions. In contrast, we propose LQuaD (Low-Quality Question Detection), a multi-tier hybrid framework that, a) incorporates semantic information of questions associated with each post using transformers, b) includes the question and tag information that enables learning via a graph convolutional network. LQuaD outperforms the state-of-the-art methods by a 21% higher F1-score on the dataset of 2.8 million questions. Furthermore, we apply survival analysis which acts as a proactive intervention to reduce the number of questions closed by informing users to take appropriate action. We find that the timeframe between the stages from the question’s creation till it gets ‘closed’ varies significantly for tags and different ‘closing’ reasons for these questions.
@inproceedings{arora2022askit,title={Ask It Right! Identifying Low-Quality questions on Community Question Answering Services},author={Arora, U. and Goyal, N. and Goel, A. and Sachdeva, N. and Kumaraguru, P.},year={2022},booktitle={Proceedings of International Joint Conference on Neural Networks (IJCNN-2022)},}
COMPASS
Urbanization and Literacy as factors in Politicians’ Social Media Use in a largely Rural State: Evidence from Uttar Pradesh, India
A. Singh, J. Jain, L. Kameswari, P. Kumaraguru, and J. Pal
In ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS ’22), 2022
With Twitter growing as a preferred channel for outreach among major politicians, there have been focused efforts on online communication, even in election campaigns in primarily rural regions. In this paper, we examine the relationship between politicians’ use of social media and the level of urbanization and literacy by compiling a comprehensive list of Twitter handles of political party functionaries and election candidates in the run-up to the 2022 State Assembly elections in Uttar Pradesh, India. We find statistically significant relationships between political Twitter presence and levels of urbanization and with levels of literacy. We also find a strong correlation between vote share and Twitter presence in the winning party, a relationship that is even stronger in urban districts. This provides empirical evidence that social media is already a central part of electoral outreach processes in the Global South, but that this is still selectively more relevant to voters in, and politicians standing for elections from urban and higher-educated regions.
@inproceedings{singh2022urbanizationandliteracy,title={Urbanization and Literacy as factors in Politicians' Social Media Use in a largely Rural State: Evidence from Uttar Pradesh, India},author={Singh, A. and Jain, J. and Kameswari, L. and Kumaraguru, P. and Pal, J.},year={2022},booktitle={ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS '22)},}
WebSci
A Tale of Two Sides: Study of Protesters and Counter-protesters on #CitizenshipAmendmentAct Campaign on Twitter
K. Neha, V. Agrawal, V. Kumar, T. Mohar, A. Chopra, A. Buduru, R. Sharma, and P. Kumaraguru
Online social media platforms have evolved into a significant place for debate around socio-political phenomena such as government policies and bills. Studying online debates on such topics can help infer people’s perception and acceptance of the happenings. At the same time, various inauthentic users that often pollute the democratic discussion of the subject need to be weeded out from the debate. The characterization of a campaign keeping in mind various forms of involved actors thus becomes very important. On December 12, 2019, the Citizenship Amendment Act (CAA) was enacted by the Indian Government, triggering a debate on whether the act was unfair. In this work, we investigate users’ perception of the #CitizenshipAmendmentAct on Twitter, as the campaign unrolled with divergent discourse in the country. Keeping the campaign participants as the prime focus, we study 9,947,814 tweets produced by 275,111 users during the first 3 months of the protest. Our study includes the analysis of user engagement, content, and network properties with online accounts divided into authentic (genuine users) and inauthentic (bots, suspended, and deleted) users. Our findings show different themes in shared tweets among protesters and counter-protesters. We find the presence of inauthentic users on both sides of the discourse, with counter-protesters having more inauthentic users than protesters. The follow network of the users suggests homophily among users on the same side of the discourse and connections between various inauthentic and authentic users. This work contributes to filling the gap in understanding the role of users (from both sides) in a less studied geo-location, India.
@inproceedings{neha2022ataleof,title={A Tale of Two Sides: Study of Protesters and Counter-protesters on #CitizenshipAmendmentAct Campaign on Twitter},author={Neha, K. and Agrawal, V. and Kumar, V. and Mohar, T. and Chopra, A. and Buduru, A. and Sharma, R. and Kumaraguru, P.},year={2022},booktitle={WebSci-2022},}
JIBM
VacSIM: Learning Effective Strategies for COVID-19 Vaccine Distribution using Reinforcement Learning
R. Awasthi, K. Guliani, S. Khan, A. Vashishtha, M. Gill, A. Bhatt, A. Nagori, A. Gupta, P. Kumaraguru, and P. Sethi
A COVID-19 vaccine is our best bet for mitigating the ongoing onslaught of the pandemic. However, vaccine is also expected to be a limited resource. An optimal allocation strategy, especially in countries with access inequities and temporal separation of hot-spots, might be an effective way of halting the disease spread. We approach this problem by proposing a novel pipeline VacSIM that dovetails Deep Reinforcement Learning models into a Contextual Bandits approach for optimizing the distribution of COVID-19 vaccine. Whereas the Reinforcement Learning models suggest better actions and rewards, Contextual Bandits allow online modifications that may need to be implemented on a day-to-day basis in the real world scenario. We evaluate this framework against a naive allocation approach of distributing vaccine proportional to the incidence of COVID-19 cases in five different States across India (Assam, Delhi, Jharkhand, Maharashtra and Nagaland) and demonstrate up to 9039 potential infections prevented and a significant increase in the efficacy of limiting the spread over a period of 45 days through the VacSIM approach. Our models and the platform are extensible to all states of India and potentially across the globe. We also propose novel evaluation strategies including standard compartmental model-based projections and a causality-preserving evaluation of our model. Since all models carry assumptions that may need to be tested in various contexts, we open source our model VacSIM and contribute a new reinforcement learning environment compatible with OpenAI gym to make it extensible for real-world applications across the globe.
@article{awasthi2022learningeffective,title={VacSIM: Learning Effective Strategies for COVID-19 Vaccine Distribution using Reinforcement Learning},author={Awasthi, R. and Guliani, K. and Khan, S. and Vashishtha, A. and Gill, M. and Bhatt, A. and Nagori, A. and Gupta, A. and Kumaraguru, P. and Sethi, P.},year={2022},journal={Journal of Intelligence-Based Medicine},}
HTSM
Erasing Labor with Labor: Dark Patterns and Lockstep Behaviors on Google Play
A. Singh, A. Arun, P. Malhotra, P. Desur, A. Jain, DH. Chau, and P. Kumaraguru
In Proceedings of the 33rd ACM Conference on Hypertext and Social Media, 2022
Google Play’s policy forbids the use of incentivized installs, ratings, and reviews to manipulate the placement of apps. However, there still exist apps that incentivize installs for other apps on the platform. To understand how install-incentivizing apps affect users, we examine their ecosystem through a socio-technical lens and perform a mixed-methods analysis of their reviews and permissions. Our dataset contains 319K reviews collected daily over five months from 60 such apps that cumulatively account for over 160.5M installs. We perform qualitative analysis of reviews to reveal various types of dark patterns that developers incorporate in install-incentivizing apps, highlighting their normative concerns at both user and platform levels. Permissions requested by these apps validate our discovery of dark patterns, with over 92% apps accessing sensitive user information. We find evidence of fraudulent reviews on install-incentivizing apps, following which we model them as an edge stream in a dynamic bipartite graph of apps and reviewers. Our proposed reconfiguration of a state-of-the-art microcluster anomaly detection algorithm yields promising preliminary results in detecting this fraud. We discover highly significant lockstep behaviors exhibited by reviews that aim to boost the overall rating of an install-incentivizing app. Upon evaluating the 50 most suspicious clusters of boosting reviews detected by the algorithm, we find (i) near-identical pairs of reviews across 94% (47 clusters), and (ii) over 35% (1,687 of 4,717 reviews) present in the same form near-identical pairs within their cluster. Finally, we conclude with a discussion on how fraud is intertwined with labor and poses a threat to the trust and transparency of Google Play.
@inproceedings{singh2022erasinglaborwith,title={Erasing Labor with Labor: Dark Patterns and Lockstep Behaviors on Google Play},author={Singh, A. and Arun, A. and Malhotra, P. and Desur, P. and Jain, A. and Chau, DH. and Kumaraguru, P.},year={2022},booktitle={Proceedings of the 33rd ACM Conference on Hypertext and Social Media},}
ICWSM
Twitter-STMHD: An Extensive User-Level Database of Multiple Mental Health Disorders
A. Suahvi, A. Singh, S. Shrivastava, U. Arora, P. Kumaraguru, and RR Shah
In Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM ’22), 2022
Social Media is equipped with the ability to track and quantify user behavior, establishing it as an appropriate resource for mental health studies. However, previous efforts in the area have been limited by the lack of data and contextually relevant information. There is a need for large-scale, well-labeled mental health datasets with fast reproducible methods to facilitate their heuristic growth. In this paper, we cater to this need by building the Twitter - Self-Reported Temporally-Contextual Mental Health Diagnosis Dataset (Twitter-STMHD), a large scale, user-level dataset grouped into 8 disorder categories and a companion class of control users. The dataset is 60% hand-annotated, which led to the creation of high-precision self-reported diagnosis report patterns, used for the construction of the rest of the dataset. The dataset, instead of being a corpus of tweets, is a collection of user-profiles of those suffering from mental health disorders to provide a holistic view of the problem statement. By leveraging temporal information, the data for a given profile in the dataset has been collected for disease prevalence periods: onset of disorder, diagnosis and progression, along with a fourth period: COVID-19. This is the only and the largest dataset that captures the tweeting activity of users suffering from mental health disorders during the COVID-19 period.
@inproceedings{suahvi2022anextensive,title={Twitter-STMHD: An Extensive User-Level Database of Multiple Mental Health Disorders},author={Suahvi, Singh, A. and Singh, A. and Shrivastava, S. and Arora, U. and Kumaraguru, P. and Shah, RR},year={2022},booktitle={Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM '22)},}
ICWSM
Effect of Popularity Shocks on User Behavior
O. Gurjar, T. Bansal, H. Hitkul, and P. Kumaraguru
In Proceedings of the 16th AAAI International Conference on Web and Social Media (ICWSM ’22), 2022
Users often post on content-sharing platforms in the hope of attracting high engagement from viewers. Some posts receive unusual attention and go "viral", eliciting a significant response (likes, views, shares) to the creator in the form of popularity shocks. Past theories have suggested a sense of reputation as one of the key drivers of online activity and the tendency of users to repeat fruitful behaviors. Based on these, we theorize popularity shocks to be linked with changes in the behavior of users. In this paper, we propose a framework to study the changes in user activity in terms of frequency of posting and content posted around popularity shocks. Further, given the sudden nature of their occurrence, we look into the survival durations of effects associated with these shocks. We observe that popularity shocks lead to an increase in the posting frequency of users, and users alter their content to match with the one which resulted in the shock. Also, it is found that shocks are tough to maintain, with effects fading within a few days for most users. High response from viewers and diversification of content posted is found to be linked with longer survival durations of the shock effects. We believe our work fills the gap related to observing users’ online behavior exposed to sudden popularity and has widespread implications for platforms, users, and brands involved in marketing on such platforms.
@inproceedings{gurjar2022effectofpopularity,title={Effect of Popularity Shocks on User Behavior},author={Gurjar, O. and Bansal, T. and Hitkul, Lamba, H. and Kumaraguru, P.},year={2022},booktitle={Proceedings of the 16th AAAI International Conference on Web and Social Media (ICWSM '22)},}
ICWSM
FactDrill: A Data Repository of Fact-checked Social Media Content to Study Fake News Incidents in India
S. Singhal, RR. Shah, and P. Kumaraguru
In Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), 2022
The production and circulation of fake content in India is a rising problem. There is a dire need to investigate the false claims made in public. This paper presents a dataset containing 22,435 fact-checked social media content to study fake news incidents in India. The dataset comprises news stories from 2013 to 2020, covering 13 different languages spoken in the country. We present a detailed description of the 14 different attributes present in the dataset. We also present the detailed characterisation of three M’s (multi-lingual, multi-media, multi-domain) in the FactDrill dataset. Lastly, we present some potential use cases of the dataset. We expect that the dataset will be a valuable resource to understand the dynamics of fake content in a multi-lingual setting in India.
@inproceedings{singhal2022adata,title={FactDrill: A Data Repository of Fact-checked Social Media Content to Study Fake News Incidents in India},author={Singhal, S. and Shah, RR. and Kumaraguru, P.},year={2022},booktitle={Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM)},}
MUWS
Leveraging Intra and Inter Modality Relationship for Multimodal Fake News Detection
S. Singhal, T. Pandey, S. Mrig, RR. Shah, and P. Kumaraguru
In Proceedings of the 1st International Workshop on Multimodal Understanding for the Web and Social Media (MUWS), co-located with The WebConf (WWW) 2022, 2022
Recent years have witnessed a massive growth in the proliferation of fake news online. User-generated content is a blend of text and visual information leading to producing different variants of fake news. As a result, researchers started targeting multimodal methods for fake news detection. Existing methods capture high-level information from different modalities and jointly model them to decide. Given multiple input modalities, we hypothesize that not all modalities may be equally responsible for decision-making. Hence, this paper presents a novel architecture that effectively identifies and suppresses information from weaker modalities and extracts relevant information from the strong modality on a per-sample basis. We also establish intra-modality relationship by extracting fine-grained image and text features. We conduct extensive experiments on real-world datasets to show that our approach outperforms the state-of-the-art by an average of 3.05% and 4.525% on accuracy and F1-score, respectively. We also release the code, implementation details, and model checkpoints for the community’s interest.
@inproceedings{singhal2022leveragingintraand,title={Leveraging Intra and Inter Modality Relationship for Multimodal Fake News Detection},author={Singhal, S. and Pandey, T. and Mrig, S. and Shah, RR. and Kumaraguru, P.},year={2022},booktitle={Proceedings of the 1st International Workshop on Multimodal Understanding for the Web and Social Media (MUWS), co-located with The WebConf (WWW) 2022},}
FinWeb @ WWW
TweetBoost: Influence of Social Media on NFT Valuation
A. Kapoor, D. Guhathakurta, M. Mathur, Y. Yadav, M. Gupta, and P. Kumaraguru
NFT or Non-Fungible Token is a token that certifies a digital asset to be unique. A wide range of assets, including digital art, music, tweets, and memes, are being sold as NFTs. NFT-related content has been widely shared on social media sites such as Twitter. We aim to understand the dominant factors that influence NFT asset valuation. Towards this objective, we create a first-of-its-kind dataset linking Twitter and OpenSea (the largest NFT marketplace) to capture social media profiles and linked NFT assets. Our dataset contains 245,159 tweets posted by 17,155 unique users, directly linking 62,997 NFT assets on OpenSea worth 19 Million USD. We have made the dataset public. We analyze the growth of NFTs, characterize the Twitter users promoting NFT assets, and gauge the impact of Twitter features on the virality of an NFT. Further, we investigate the effectiveness of different social media and NFT platform features by experimenting with multiple machine learning and deep learning models to predict an asset’s value. Our results show that social media features improve the accuracy by 6% over baseline models that use only NFT platform features. Among social media features, the count of user membership lists and the number of likes and retweets are important.
@inproceedings{kapoor2022influenceof,title={TweetBoost: Influence of Social Media on NFT Valuation},author={Kapoor, A. and Guhathakurta, D. and Mathur, M. and Yadav, Y. and Gupta, M. and Kumaraguru, P.},year={2022},booktitle={FinWeb-2022 (Workshop at WWW '2022)},}
CIN
Stress Classification Using Brain Signals Based on LSTM Network
N. Phutela, D. Relan, G. Gabrani, P. Kumaraguru, and M. Samuel
In Journal of Computational Intelligence and Neuroscience, 2022
The early diagnosis of stress symptoms is essential for preventing various mental disorders such as depression. Electroencephalography (EEG) signals are frequently employed in stress detection research and are an inexpensive and noninvasive modality. This paper proposes a stress classification system that uses EEG signals. EEG signals from thirty-five volunteers were acquired using a commercially available 4-electrode Muse EEG headband and then analysed. Four movie clips were chosen as stress elicitation material. Two clips were selected to induce stress because they contain emotionally inductive scenes. The other two clips were chosen not to induce stress because they contain many comedy scenes. The recorded signals were then used to build the stress classification model. We compared the Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM) for classifying the stress and nonstress groups. The maximum classification accuracy of 93.17% was achieved using a two-layer LSTM architecture.
@article{phutela2022stressclassificationusing,title={Stress Classification Using Brain Signals Based on LSTM Network},author={Phutela, N. and Relan, D. and Gabrani, G. and Kumaraguru, P. and Samuel, M.},year={2022},journal={Journal of Computational Intelligence and Neuroscience},}
ICTD
Diagnosing Data from ICTs to Provide Focused Assistance in Agricultural Adoptions
In International Conference on Information & Communication Technologies and Development (ICTD) 2022, 2022
In the last two decades, ICTs have played a pivotal role in empowering rural populations in India by making knowledge more accessible. Digital Green (DG) is one such ICT that employs a participatory approach with smallholder farmers to produce instructional videos that encompass content specific to them. With help of human mediators, they disseminate these videos using projectors to improve the adoption of agricultural practices. DG’s web-based data tracker stores attendance and adoption logs of millions of farmers, videos screened and their demographic information. We leverage this data for a period of ten years between 2010-2020 across five states in India and use it to conduct a holistic evaluation of the ICT. First, we find disparities in adoption rates of farmers, following which we use statistical tests to identify different factors that lead to these disparities and gender-based inequalities. Second, to provide assistance to farmers facing challenges, we model the adoption of practices from a video as a prediction problem and experiment with different model architectures. Our classifier achieves accuracies ranging from 79% to 90% across the five states, demonstrating its potential for assisting future ethnographic investigations. Third, we use SHAP values in conjunction with our model for explaining the impact of various network, content and demographic features on adoption. Our research finds that farmers greatly benefit from past adopters of a video from their group and village. We also discover that videos with a low content-specificity benefit some farmers more than others. Next, we highlight the implications of our findings by translating them into recommendations for community building, revisiting participatory approach and mitigating inequalities. We conclude with a discussion on how our work can assist future investigations into the lived experiences of farmers.
@inproceedings{ashwin2022diagnosingdatafrom,title={Diagnosing Data from ICTs to Provide Focused Assistance in Agricultural Adoptions},author={},year={2022},booktitle={International Conference on Information {\&} Communication Technologies and Development (ICTD) 2022},}
Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of high-quality corpora that is needed to develop such data-driven systems. The problem gets even more pronounced in the case of low resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use-case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for the same. MTL models use summarization as an auxiliary task along with bail prediction as the main task. Experiments with different models are indicative of the need for further research in this area. We release the corpus and model implementation code with this paper.
@inproceedings{kapoor2022hindilegal,title={HLDC: Hindi Legal Documents Corpus},author={},year={2022},booktitle={Findings of ACL 2022},}
AIST
EEG Based Stress Classification in Response to Stress Stimulus
N. Phutela, D. Relan, G. Gabrani, and P. Kumaraguru
In International Conference on Artificial Intelligence and Speech Technology, 2022
Stress, either physical or mental, is experienced by almost every person at some point in their lifetime. Stress is one of the leading causes of various diseases and burdens society globally. Stress badly affects an individual’s well-being. Thus, stress-related study is an emerging field, and in the past decade, a lot of attention has been given to the detection and classification of stress. The estimation of stress in the individual helps in stress management before it invades the human mind and body. In this paper, we propose a system for the detection and classification of stress. We compared various machine learning algorithms for stress classification using EEG signal recordings. The Interaxon Muse device, which has four dry electrodes, was used for data collection. We collected EEG data from 20 subjects. Stress was induced in these volunteers by showing them stressful videos, and the EEG signal was then acquired. Frequency-domain features such as absolute band powers were extracted from the EEG signals. The data were then classified as stressed or non-stressed using different machine learning methods: Random Forest, Support Vector Machine, Logistic Regression, Naive Bayes, K-Nearest Neighbors, and Gradient Boosting. We performed 10-fold cross-validation, and an average classification accuracy of 95.65% was obtained using the gradient boosting method.
@inproceedings{phutela2022eegbasedstress,title={EEG Based Stress Classification in Response to Stress Stimulus},author={Phutela, N. and Relan, D. and Gabrani, G. and Kumaraguru, P.},year={2022},booktitle={International Conference on Artificial Intelligence and Speech Technology},}
JCC
FakeNewsIndia: A Benchmark Dataset of Fake News Incidents in India, Collection Methodology and Impact Assessment in Social Media
A. Dhawan, M. Bhalla, D. Arora, R. Kaushal, and P. Kumaraguru
Online Social Media platforms (OSMs) have become an essential source of information. The high speed at which OSM users submit data makes moderation extremely hard. Consequently, besides offering online networking to users, the OSMs have also become carriers for spreading fake news. Knowingly or unknowingly, users circulate fake news on OSMs, adversely affecting an individual’s offline activity. To counter fake news, several dedicated websites (referred to as fact-checkers) have sprung up whose sole purpose is to identify and report fake news incidents. There are well-known datasets of fake news; however, not much work has been done regarding credible datasets of fake news in India. Therefore, in this work, we design an automated data collection pipeline to collect fake news incidents reported by fact-checkers. We gather 4,803 fake news incidents from June 2016 to December 2019 reported by six popular fact-checking websites in India and make this dataset (FakeNewsIndia) available to the research community. We find 5,031 tweets on Twitter and 866 videos on YouTube mentioned in these 4,803 fake news incidents. Further, we evaluate the impact of fake news incidents on the two prominent OSM platforms, namely, Twitter and YouTube. We use popularity metrics based on engagement rate and likes ratio to measure impact and categorize impact into three levels: low, medium, and high. Our learning models use features extracted from text, images, and videos present in the fake news incident articles written by fact-checking websites. Experiments show that we can predict the impact (popularity) of videos (appearing on fake news incident articles) on YouTube more accurately (with baseline accuracy ranging from 86% to 92%) as compared to the impact (popularity) of tweets on Twitter (with baseline accuracy of 37% to 41%). As future work, we need to build more intelligent models that predict the impact of tweets appearing in fact-checking incident articles.
@article{dhawan2022abenchmark,title={FakeNewsIndia: A Benchmark Dataset of Fake News Incidents in India, Collection Methodology and Impact Assessment in Social Media},author={Dhawan, A. and Bhalla, M. and Arora, D. and Kaushal, R. and Kumaraguru, P.},year={2022},journal={Journal of Computer Communications},}
ACL
SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing
Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations. Recent work on code-mixing in computational settings has leveraged social media code mixed texts to train NLP models. To capture the variety of code mixing within and across corpora, measures based on Language ID (LID) tags (CMI) have been proposed. The syntactic variety and patterns of code-mixing, and their relationship to computational models’ performance, are underexplored. In this work, we investigate a collection of English(en)-Hindi(hi) code-mixed datasets from a syntactic lens to propose SyMCoM, an indicator of syntactic variety in code-mixed text, with intuitive theoretical bounds. We train a SoTA en-hi PoS tagger, with an accuracy of 93.4%, to reliably compute PoS tags on a corpus, demonstrate the utility of SyMCoM by applying it to various syntactic categories on a collection of datasets, and compare datasets using the measure.
@inproceedings{kodali-etal-2022-symcom,title={{S}y{MC}o{M} - Syntactic Measure of Code Mixing A Study Of {E}nglish-{H}indi Code-Mixing},author={Kodali, Prashant and Goel, Anmol and Choudhury, Monojit and Shrivastava, Manish and Kumaraguru, Ponnurangam},year={2022},booktitle={Findings of the Association for Computational Linguistics: ACL 2022},}
INLG
PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality
Prashant Kodali, Tanmay Sachan, Akshay Goindani, Anmol Goel, Naman Ahuja, Manish Shrivastava, and Ponnurangam Kumaraguru
In Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges, 2022
Code-Mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of Code-Mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine generated code-mixed text is an open problem. In our submission to HinglishEval, a shared-task collocated with INLG2022, we attempt to model factors that impact the quality of synthetically generated code-mix text by predicting ratings for code-mix quality. The HinglishEval Shared Task consists of two sub-tasks: a) Quality rating prediction; b) Disagreement prediction. We leverage popular code-mixed metrics and embeddings of multilingual large language models (MLLMs) as features, and train task specific MLP regression models. Our approach could not beat the baseline results. However, for Subtask-A our team ranked a close second on F-1 and Cohen’s Kappa Score measures and first for Mean Squared Error measure. For Subtask-B our approach ranked third for F1 score, and first for Mean Squared Error measure. Code of our submission can be accessed here.
@inproceedings{kodali-etal-2022-precogiiith,title={{P}re{C}og{IIITH} at {H}inglish{E}val : Leveraging Code-Mixing Metrics {\&} Language Model Embeddings To Estimate Code-Mix Quality},author={Kodali, Prashant and Sachan, Tanmay and Goindani, Akshay and Goel, Anmol and Ahuja, Naman and Shrivastava, Manish and Kumaraguru, Ponnurangam},year={2022},booktitle={Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges},}
LREC
HashSet - A Dataset For Hashtag Segmentation
Prashant Kodali, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava, and Ponnurangam Kumaraguru
In Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways: transliterating and mixing languages, spelling variations, creative named entities. The benchmark datasets used for the hashtag segmentation task, STAN and BOUN, are small and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising: a) a 1.9k manually annotated dataset; b) a 3.3M loosely supervised dataset. The HashSet dataset is sampled from a different set of tweets than existing datasets and provides an alternate distribution of hashtags to build and validate hashtag segmentation models. We analyze the performance of SOTA models for hashtag segmentation, and show that the proposed dataset provides an alternate set of hashtags to train and assess models.
@inproceedings{kodali-etal-2022-hashset,title={{H}ash{S}et - A Dataset For Hashtag Segmentation},author={Kodali, Prashant and Bhatnagar, Akshala and Ahuja, Naman and Shrivastava, Manish and Kumaraguru, Ponnurangam},year={2022},booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},}
ACM HT
Erasing Labor with Labor: Dark Patterns and Lockstep Behaviors on Google Play
Ashwin Singh, Arvindh Arun, Pulak Malhotra, Pooja Desur, Ayushi Jain, Duen Horng Chau, and Ponnurangam Kumaraguru
In Proceedings of the 33rd ACM Conference on Hypertext and Social Media, 2022
Google Play’s policy forbids the use of incentivized installs, ratings, and reviews to manipulate the placement of apps. However, there still exist apps that incentivize installs for other apps on the platform. To understand how install-incentivizing apps affect users, we examine their ecosystem through a socio-technical lens and perform a mixed-methods analysis of their reviews and permissions. Our dataset contains 319K reviews collected daily over five months from 60 such apps that cumulatively account for over 160.5M installs. We perform qualitative analysis of reviews to reveal various types of dark patterns that developers incorporate in install-incentivizing apps, highlighting their normative concerns at both user and platform levels. Permissions requested by these apps validate our discovery of dark patterns, with over 92% apps accessing sensitive user information. We find evidence of fraudulent reviews on install-incentivizing apps, following which we model them as an edge stream in a dynamic bipartite graph of apps and reviewers. Our proposed reconfiguration of a state-of-the-art microcluster anomaly detection algorithm yields promising preliminary results in detecting this fraud. We discover highly significant lockstep behaviors exhibited by reviews that aim to boost the overall rating of an install-incentivizing app. Upon evaluating the 50 most suspicious clusters of boosting reviews detected by the algorithm, we find (i) near-identical pairs of reviews across 94% (47 clusters), and (ii) over 35% (1,687 of 4,717 reviews) present in the same form near-identical pairs within their cluster. Finally, we conclude with a discussion on how fraud is intertwined with labor and poses a threat to the trust and transparency of Google Play.
@inproceedings{acmht-22,title={{E}rasing {L}abor with {L}abor: {D}ark {P}atterns and {L}ockstep {B}ehaviors on {G}oogle {P}lay},author={Singh, Ashwin and Arun, Arvindh and Malhotra, Pulak and Desur, Pooja and Jain, Ayushi and Chau, Duen Horng and Kumaraguru, Ponnurangam},year={2022},booktitle={Proceedings of the 33rd ACM Conference on Hypertext and Social Media},}
Towards adversarial evaluations for inexact machine unlearning
Shashwat Goel, Ameya Prabhu, Amartya Sanyal, Ser-Nam Lim, Philip Torr, and Ponnurangam Kumaraguru
Machine Learning models face increased concerns regarding the storage of personal user data and adverse impacts of corrupted data like backdoors or systematic bias. Machine Unlearning can address these by allowing post-hoc deletion of affected training data from a learned model. Achieving this task exactly is computationally expensive; consequently, recent works have proposed inexact unlearning algorithms to solve this approximately as well as evaluation methods to test the effectiveness of these algorithms. In this work, we first outline some necessary criteria for evaluation methods and show no existing evaluation satisfies them all. Then, we design a stronger black-box evaluation method called the Interclass Confusion (IC) test which adversarially manipulates data during training to detect the insufficiency of unlearning procedures. We also propose two analytically motivated baseline methods (EU-k and CF-k) which outperform several popular inexact unlearning methods. Overall, we demonstrate how adversarial evaluation strategies can help in analyzing various unlearning phenomena which can guide the development of stronger unlearning algorithms.
@inproceedings{goel2022towards,title={Towards adversarial evaluations for inexact machine unlearning},author={Goel, Shashwat and Prabhu, Ameya and Sanyal, Amartya and Lim, Ser-Nam and Torr, Philip and Kumaraguru, Ponnurangam},year={2022},booktitle={},}
EMNLP
An Unsupervised, Geometric and Syntax-aware Quantification of Polysemy
Anmol Goel, Charu Sharma, and Ponnurangam Kumaraguru
In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
Polysemy is the phenomenon where a single word form possesses two or more related senses. It is an extremely ubiquitous part of natural language and analyzing it has sparked rich discussions in the linguistics, psychology and philosophy communities alike. With scarce attention paid to polysemy in computational linguistics, and even scarcer attention toward quantifying polysemy, in this paper, we propose a novel, unsupervised framework to compute and estimate polysemy scores for words in multiple languages. We infuse our proposed quantification with syntactic knowledge in the form of dependency structures. This informs the final polysemy scores of the lexicon motivated by recent linguistic findings that suggest there is an implicit relation between syntax and ambiguity/polysemy. We adopt a graph based approach by computing the discrete Ollivier Ricci curvature on a graph of the contextual nearest neighbors. We test our framework on curated datasets controlling for different sense distributions of words in 3 typologically diverse languages - English, French and Spanish. The effectiveness of our framework is demonstrated by significant correlations of our quantification with expert human annotated language resources like WordNet. We observe a 0.3 point increase in the correlation coefficient as compared to previous quantification studies in English. Our research leverages contextual language models and syntactic structures to empirically support the widely held theoretical linguistic notion that syntax is intricately linked to ambiguity/polysemy.
@inproceedings{goel-etal-2022-unsupervised,title={An Unsupervised, Geometric and Syntax-aware Quantification of Polysemy},author={Goel, Anmol and Sharma, Charu and Kumaraguru, Ponnurangam},year={2022},booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},}
Thesis
Development of Stress Induction and Detection System to Study its Effect on Brain
@inproceedings{developmentofstressinductionanddetectionsystemtostudyitseffectonbrain,title={Development of Stress Induction and Detection System to Study its Effect on Brain},author={Phutela, N.},year={2022},booktitle={Ph.D. Thesis, BML Munjal University},}
Thesis
Leveraging AI to Understand Protests & Foster Secure Societies During Protest
@inproceedings{leveragingaitounderstandprotestsfostersecuresocietiesduringprotest,title={Leveraging AI to Understand Protests {\&} Foster Secure Societies During Protest},author={Neha, K.},year={2022},booktitle={Ph.D. Comprehensive Report},}
Thesis
A Framework For Automatic Question Answering in Indian Languages
@inproceedings{aframeworkforautomaticquestionansweringinindianlanguages,title={A Framework For Automatic Question Answering in Indian Languages},author={Mishra, R.},year={2022},booktitle={Ph.D. Comprehensive Report},}
@inproceedings{deanonymizingpreservinganddemocratizingdataprivacyandownership,title={De-anonymizing, Preserving and Democratizing Data Privacy and Ownership},author={Gupta, S.},year={2022},booktitle={Ph.D. Comprehensive Report},}
2021
NLP4DH
An Exploratory Study on Temporally Evolving Discussions around COVID-19 using Diachronic Word Embeddings
A. Tulasi, A. Kitamoto, A. Buduru, and P. Kumaraguru
In Workshop on Natural Language Processing for Digital Humanities (NLP4DH), ICON 2021, 2021
COVID-19 has seen the world go into lockdown and face unconventional social situations throughout. During this time, the world saw a surge in information sharing around the pandemic, and the topics shared were diverse. People’s sentiments have changed during this period. Given the widespread usage of Online Social Networks (OSN) and support groups, user sentiment is well reflected in online discussions. In this work, we aim to show the topics under discussion, the evolution of discussions, and the change in user sentiment during the pandemic. Alongside this, we also demonstrate the possibility of exploratory analysis to find pressing topics, changes in perception towards the topics, and ways to use the knowledge extracted from online discussions. For our work we employ diachronic word embeddings, which capture the change in word usage over time. With the help of analysis of temporal word usage, we show the change in people’s opinion on COVID-19, from it being a conspiracy to the post-COVID topics that surround vaccination.
@inproceedings{tulasi2021anexploratorystudy,title={An Exploratory Study on Temporally Evolving Discussions around COVID-19 using Diachronic Word Embeddings},author={Tulasi, A. and Kitamoto, A. and Buduru, A. and Kumaraguru, P.},year={2021},booktitle={Workshop on Natural Language Processing for Digital Humanities (NLP4DH), ICON 2021},}
WI-IAT
“I’ll be back”: Examining Restored Accounts On Twitter
A. Kapoor, R.R. Jain, A. Prabhu, T. Karandikar, and P. Kumaraguru
In The 20th IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2021
Online social networks like Twitter actively monitor their platform to identify accounts that go against their rules. Twitter enforces account-level moderation, i.e. suspension of a Twitter account in severe cases of platform abuse. A point of note is that these suspensions are sometimes temporary and even incorrect. Twitter provides a redressal mechanism to ’restore’ suspended accounts. We refer to all suspended accounts that later have their suspension reversed as ’restored accounts’. In this paper, we release the first-ever dataset and methodology to identify restored accounts. We inspect account properties and tweets of these restored accounts to get key insights into the effects of suspension. We build a prediction model to classify an account as normal, suspended or restored. We use SHAP values to interpret this model and identify important features. SHAP (SHapley Additive exPlanations) is a method to explain individual predictions. We show that profile features like date of account creation and the ratio of retweets to total tweets are more important than content-based features like sentiment scores and Ekman emotion scores when it comes to classification of an account as normal, suspended or restored. We investigate restored accounts further in the pre-suspension and post-restoration phases. We see that the number of tweets per account drops by 53.95% in the post-restoration phase, signifying less ’spammy’ behaviour after reversal of suspension. However, there was no substantial difference in the content of the tweets posted in the pre-suspension and post-restoration phases.
@inproceedings{kapoor2021be,title={“I'll be back”: Examining Restored Accounts On Twitter},author={Kapoor, A. and Jain, R.R. and Prabhu, A. and Karandikar, T. and Kumaraguru, P.},year={2021},booktitle={The 20th IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology},}
WI-IAT
Efficient Representation of Interaction Patterns with Hyperbolic Hierarchical Clustering for Classification of Users on Twitter
T. Karandikar, A. Prabhu, A. Tulasi, A.B. Buduru, and P. Kumaraguru
In The 20th IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2021
Social media platforms play an important role in democratic processes. During the 2019 General Elections of India, political parties and politicians widely used Twitter to share their ideals, advocate their agenda and gain popularity. Twitter served as a ground for journalists, politicians and voters to interact. The organic nature of these interactions can be upended by malicious accounts on Twitter, which end up being suspended or deleted from the platform. Such accounts aim to modify the reach of content by inorganically interacting with particular handles. These interactions are a threat to the integrity of the platform, as such activity has the potential to affect entire results of democratic processes. In this work, we design a feature extraction framework which compactly captures potentially insidious interaction patterns. Our proposed features are designed to bring out communities amongst the users that work to boost the content of particular accounts. We use Hyperbolic Hierarchical Clustering (HypHC), which represents the features in the hyperbolic manifold, to further separate such communities. HypHC gives the added benefit of representing these features in a lower dimensional space, thus serving as a dimensionality reduction technique. We use these features to distinguish between different classes of users that emerged in the aftermath of the 2019 General Elections of India. Amongst the users active on Twitter during the elections, 2.8% of the users participating were suspended and 1% of the users were deleted from the platform. We demonstrate the effectiveness of our proposed features in differentiating between regular users (users who were neither suspended nor deleted), suspended users and deleted users. By leveraging HypHC in our pipeline, we obtain F1 scores of up to 93%.
@inproceedings{karandikar2021efficientrepresentationof,title={Efficient Representation of Interaction Patterns with Hyperbolic Hierarchical Clustering for Classification of Users on Twitter},author={Karandikar, T. and Prabhu, A. and Tulasi, A. and Buduru, A.B. and Kumaraguru, P.},year={2021},booktitle={The 20th IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology},}
FOSINT-SI
Truth and Travesty Intertwined: A Case Study of #SSR Counter public Campaign
K. Neha, T. Mohan, A. Buduru, and P. Kumaraguru
In International Symposium on Foundations of Open Source Intelligence and Security Informatics (FOSINT-SI), ASONAM’2021, 2021
Twitter has emerged as a prominent social media platform for activism and counterpublic narratives. The counterpublics leverage hashtags to build a diverse support network and share content on a global platform that counters the dominant narrative. This paper applies the framework of connective action on the counter-narrative campaign over the cause of death of #SushantSinghRajput. We combine descriptive network, modularity, and hashtag based topical analysis to identify three major mechanisms underlying the campaign: generative role taking, hashtag-based narratives and formation of alignment network towards a common cause. Using the case study of #SushantSinghRajput, we highlight how connective action framework can be used to identify different strategies adopted by counterpublics for the emergence of connective action.
@inproceedings{neha2021truthandtravesty,title={Truth and Travesty Intertwined: A Case Study of \#SSR Counter public Campaign},author={Neha, K. and Mohan, T. and Buduru, A. and Kumaraguru, P.},year={2021},booktitle={International Symposium on Foundations of Open Source Intelligence and Security Informatics (FOSINT-SI), ASONAM'2021},}
MMAsia
Inter-modality Discordance for Multimodal Fake News Detection
S. Singhal, M. Dhawan, RR. Shah, and P. Kumaraguru
The paradigm shift in the consumption of news via online platforms has cultivated the growth of digital journalism. Contrary to traditional media, lowering entry barriers and enabling everyone to be part of content creation have disabled the concept of centralized gatekeeping in digital journalism. This in turn has triggered the production of fake news. Current studies have made a significant effort towards multimodal fake news detection, with less emphasis on exploring the discordance between the different multimedia present in a news article. We hypothesize that fabrication of either modality will lead to dissonance between the modalities, resulting in misrepresented, misinterpreted and misleading news. In this paper, we inspect the authenticity of news coming from online media outlets by exploiting the relationship (discordance) between the textual and multiple visual cues. We develop an inter-modality discordance based fake news detection framework to achieve this goal. The modal-specific discriminative features are learned by employing the cross-entropy loss and a modified version of contrastive loss that explores the inter-modality discordance. To the best of our knowledge, this is the first work that leverages information from different components of the news article (i.e., headline, body, and multiple images) for multimodal fake news detection. We conduct extensive experiments on real-world datasets to show that our approach outperforms the state-of-the-art by an average F1-score of 6.3%.
@inproceedings{singhal2021discordancefor,title={Inter-modality Discordance for Multimodal Fake News Detection},author={Singhal, S. and Dhawan, M. and Shah, RR. and Kumaraguru, P.},year={2021},booktitle={ACM Multimedia Asia (MMAsia ’21)},}
ASONAM
What’s Kooking? Characterizing India’s Emerging Social Network, Koo
A. Singh, C. Jain, J. Jain, R. Jain, S. Sehgal, T. Pandey, and P. Kumaraguru
In The 2021 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM’21), 2021
Social media has grown exponentially in a short period, coming to the forefront of communications and online interactions. Despite their rapid growth, social media platforms have been unable to scale to different languages globally and remain inaccessible to many. In this paper, we characterize Koo, a multilingual micro-blogging site that rose in popularity in 2021 as an Indian alternative to Twitter. We collected a dataset of 4.07 million users, 163.12 million follower-following relationships, and their content and activity across 12 languages. We study the user demographic along the lines of language, location, gender, and profession. The prominent presence of Indian languages in the discourse on Koo indicates the platform’s success in promoting regional languages. We observe Koo’s follower-following network to be much denser than Twitter’s, comprising closely-knit linguistic communities. An N-gram analysis of posts on Koo shows a #KooVsTwitter rhetoric, revealing the debate comparing the two platforms. Our characterization highlights the dynamics of the multilingual social network and its diverse Indian user base.
@inproceedings{singh2021characterizing,title={What's Kooking? Characterizing India's Emerging Social Network, Koo},author={Singh, A. and Jain, C. and Jain, J. and Jain, R. and Sehgal, S. and Pandey, T. and Kumaraguru, P.},year={2021},booktitle={The 2021 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM'21)},}
SHC
Psychometric Analysis and Coupling of Emotions Between State Bulletins and Twitter in India during COVID-19 Infodemic
B. Jolly, P. Aggrawal, A. Gulati, A. Sethi, P. Kumaraguru, and T. Sethi
In Frontiers in Communication, Section Health Communication, 2021
The COVID-19 infodemic has been spreading faster than the pandemic itself. The misinformation riding upon the infodemic wave poses a major threat to people’s health and governance systems. Since social media is the largest source of information, managing the infodemic requires not only mitigating misinformation but also an early understanding of the psychological patterns resulting from it. During the COVID-19 crisis, Twitter alone has seen a sharp 45% increase in the usage of its curated events page, and a 30% increase in its direct messaging usage, since March 6th 2020. In this study, we analyze the psychometric impact and coupling of the COVID-19 infodemic with the official bulletins related to COVID-19 at the national and state level in India. We look at these two sources through a psycho-linguistic lens of emotions and quantify the extent of, and coupling between, the two. We modified Empath, a deep skip-gram based open-sourced lexicon builder, for effective capture of health-related emotions. We were then able to capture the time-evolution of health-related emotions in social media and official bulletins. An analysis of lead-lag relationships between the time series of extracted emotions from official bulletins and social media using Granger’s causality showed that state bulletins were leading the social media for some emotions such as Medical Emergency. Further insights that are potentially relevant for policymakers and communicators actively engaged in mitigating misinformation are also discussed. Our paper also introduces CoronaIndiaDataset, the first social media based COVID-19 dataset at national and state levels from India, with over 5.6 million national and 2.6 million state-level tweets. Finally, we present our findings as COVibes, an interactive web application capturing psychometric insights from the CoronaIndiaDataset, both at a national and state level.
@inproceedings{jolly2021psychometricanalysisand,title={Psychometric Analysis and Coupling of Emotions Between State Bulletins and Twitter in India during COVID-19 Infodemic},author={Jolly, B. and Aggrawal, P. and Gulati, A. and Sethi, A. and Kumaraguru, P. and Sethi, T.},year={2021},booktitle={Frontiers in Communication, Section Health Communication},}
JMIR
(Un) Masked COVID-19 Trends from Social Media
A.K. Singh, P. Mehan, D. Sharma, R. Pandey, T. Sethi, and P. Kumaraguru
Wearing masks is a useful protection method against COVID-19, which has caused widespread economic and social impact worldwide. Across the globe, governments have put mandates for the use of face masks, which have received both positive and negative reaction. Online social media provides an exciting platform to study the use of masks and analyze underlying mask-wearing patterns. In this article, we analyze 2.04 million social media images for six US cities. An increase in masks worn in images is seen as the COVID-19 cases rose, particularly when their respective states imposed strict regulations. We also found a decrease in the posting of group pictures as stay-at-home laws were put into place. Furthermore, mask compliance in the Black Lives Matter protest was analyzed, eliciting that 40% of the people in group photos wore masks, and 45% of them wore the masks with a fit score of greater than 80%. We introduce two new datasets, VAriety MAsks - Classification (VAMA-C) and VAriety MAsks - Segmentation (VAMA-S), for mask detection and mask fit analysis tasks, respectively. For the analysis, we create two frameworks, face mask detector (for classifying masked and unmasked faces) and mask fit analyzer (a semantic segmentation based model to calculate a mask-fit score). The face mask detector achieved a classification accuracy of 98%, and the semantic segmentation model for the mask fit analyzer achieved an Intersection Over Union (IOU) score of 98%. We conclude that such a framework can be used to evaluate the effectiveness of such public health strategies using social media platforms in times of pandemic.
@inproceedings{singh2021masked,title={(Un) Masked COVID-19 Trends from Social Media},author={Singh, A.K. and Mehan, P. and Sharma, D. and Pandey, R. and Sethi, T. and Kumaraguru, P.},year={2021},booktitle={Journal for Public Health {\&} Surveillance},}
HTSM
“A Virus Has No Religion”: Analyzing Islamophobia on Twitter During the COVID-19 Outbreak
M. Chandra, M. Reddy, S. Sehgal, S. Gupta, A. Buduru, and P. Kumaraguru
In Proceedings of the 32nd ACM Conference on Hypertext and Social Media (HT ’21), 2021
The COVID-19 pandemic has disrupted people’s lives driving them to act in fear, anxiety, and anger, leading to worldwide racist events in the physical world and online social networks. Though there are works focusing on Sinophobia during the COVID-19 pandemic, less attention has been given to the recent surge in Islamophobia. A large number of positive cases arising out of the religious Tablighi Jamaat gathering has driven people towards forming anti-Muslim communities around hashtags like #coronajihad, #tablighijamaatvirus on Twitter. In addition to the online spaces, the rise in Islamophobia has also resulted in increased hate crimes in the real world. Hence, an investigation is required to create interventions. To the best of our knowledge, we present the first large-scale quantitative study linking Islamophobia with COVID-19. In this paper, we present CoronaBias dataset which focuses on anti-Muslim hate spanning four months, with over 410,990 tweets from 244,229 unique users. We use this dataset to perform longitudinal analysis. We find the relation between the trend on Twitter with the offline events that happened over time, measure the qualitative changes in the context associated with the Muslim community, and perform macro and micro topic analysis to find prevalent topics. We also explore the nature of the content, focusing on the toxicity of the URLs shared within the tweets present in the CoronaBias dataset. Apart from the content-based analysis, we focus on user analysis, revealing that the portrayal of religion as a symbol of patriotism played a crucial role in deciding how the Muslim community was perceived during the pandemic. Through these experiments, we reveal the existence of anti-Muslim rhetoric around COVID-19 in the Indian sub-continent.
@inproceedings{chandra2021virushas,title={“A Virus Has No Religion”: Analyzing Islamophobia on Twitter During the COVID-19 Outbreak},author={Chandra, M. and Reddy, M. and Sehgal, S. and Gupta, S. and Buduru, A. and Kumaraguru, P.},year={2021},booktitle={Proceedings of the 32nd ACM Conference on Hypertext and Social Media (HT ’21)},}
KAIS
Tweet-Scan-Post - A system for Analysis of Sensitive Private Data Disclosure in Online Social Media
R. Geetha, S. Karthika, and P. Kumaraguru
In Journal of Knowledge and Information Systems (KAIS), 2021
Social media technologies are open to users who intend to create a community and publish their opinions on recent incidents. Participants of online social networking sites remain ignorant of the criticality of disclosing personal data to a public audience. The private data of users are at high risk, leading to many adverse effects like cyberbullying, identity theft, and job loss. This research work aims to define user entities or data like phone number, email address, family details, and health-related information as a user’s sensitive private data (SPD) in a social media platform. The proposed system, Tweet-Scan-Post (TSP), is mainly focused on identifying the presence of SPD in users’ posts under the personal, professional, and health domains. The TSP framework is built based on the standards and privacy regulations established by social networking sites and organizations like NIST, DHS, and GDPR. The proposed approach of TSP addresses the prevailing challenges in determining the presence of sensitive PII and user privacy within the bounds of confidentiality and trustworthiness. A novel layered classification approach with various state-of-the-art machine learning models is used by the TSP framework to classify tweets as sensitive or insensitive. The findings of the TSP system include 201 Sensitive Privacy Keywords identified using a boosting strategy, and a sensitivity scaling that measures the degree of sensitivity associated with a tweet. The experimental results revealed that personal tweets were highly related to mother and children, professional tweets to apology, and health tweets to concern over the father’s health condition.
@inproceedings{geetha2021a,title={Tweet-Scan-Post - A system for Analysis of Sensitive Private Data Disclosure in Online Social Media},author={Geetha, R. and Karthika, S. and Kumaraguru, P.},year={2021},booktitle={Journal of Knowledge and Information Systems (KAIS)},}
ICANN
KCNet: Kernel-based Canonicalization Network for entities in Recruitment Domain
N. Goyal, N. Sachdeva, A. Goel, J. Kalra, and P. Kumaraguru
In 30th International Conference on Artificial Neural Networks (ICANN), 2021
Online recruitment platforms have abundant user-generated content in the form of job postings and candidate and company profiles. This content, when ingested into knowledge bases, causes redundant, ambiguous, and noisy entities. These multiple (non-standardized) representations of the entities deteriorate the performance of downstream tasks such as job recommender systems, search systems, and question answering. It is therefore imperative to canonicalize the entities to improve the performance of such tasks. Recent research discusses either statistical similarity measures or deep learning methods like word-embedding or siamese network-based representations for canonicalization. In this paper, we propose a Kernel-based Canonicalization Network (KCNet) that outperforms all the known statistical and deep learning methods. We also show that the use of side information such as industry type, URLs of websites, etc. further enhances the performance of the proposed method. Our experiments on 351,600 entities (companies, institutes, skills, and designations) from a popular online recruitment platform demonstrate that the proposed method improves the overall F1-score by 23% compared to the previous baselines, which results in coherent clusters of unique entities.
@inproceedings{goyal2021canonicalization,title={KCNet: Kernel-based Canonicalization Network for entities in Recruitment Domain},author={Goyal, N. and Sachdeva, N. and Goel, A. and Kalra, J. and Kumaraguru, P.},year={2021},booktitle={30th International Conference on Artificial Neural Networks (ICANN)},}
KSEM
Spy The Lie: Fraudulent Jobs Detection in Recruitment Domain using Knowledge Graphs
N. Goyal, N. Sachdeva, and P. Kumaraguru
In 14th International Conference on Knowledge Science, Engineering and Management (KSEM 2021), 2021
Fraudulent jobs are an emerging threat on online recruitment platforms such as LinkedIn and Glassdoor. Fraudulent job postings affect the platform’s trustworthiness and have a negative impact on user experience. Therefore, these platforms need to detect and remove these fraudulent jobs. Generally, fraudulent job postings contain untenable facts about domain-specific entities, such as mismatches in skills, industries, offered compensation, etc. However, existing approaches focus on studying writing styles, linguistics, and context-based features, and ignore the relationships among domain-specific entities. To bridge this gap, we propose an approach based on the Knowledge Graph (KG) of domain-specific entities to detect fraudulent jobs. In this paper, we present a novel multi-tier end-to-end framework called FRaudulent Jobs Detection (FRJD) Engine, which considers a) a fact validation module using KGs, b) a contextual module using deep neural networks, and c) a meta-data module to capture the semantics of job postings. We conduct our experiments using a fact validation dataset containing 4 million facts extracted from job postings. Extensive evaluation shows that FRJD yields a 0.96 F1-score on the curated dataset of 157,880 job postings. Finally, we provide insights on the performance of different fact-checking algorithms on recruitment domain datasets.
@inproceedings{goyal2021spythe,title={Spy The Lie: Fraudulent Jobs Detection in Recruitment Domain using Knowledge Graphs},author={Goyal, N. and Sachdeva, N. and Kumaraguru, P.},year={2021},booktitle={14th International Conference on Knowledge Science, Engineering and Management (KSEM 2021)},}
WebSci
“Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning
M. Chandra, D. Pailla, H. Bhatia, A. Sanchawala, M. Gupta, M. Shrivastava, and P. Kumaraguru
In 13th ACM Web Science Conference (WebSci) 2021, 2021
The exponential rise of online social media has enabled the creation, distribution, and consumption of information at an unprecedented rate. However, it has also led to the burgeoning of various forms of online abuse. Increasing cases of online antisemitism have become one of the major concerns because of its socio-political consequences. Unlike other major forms of online abuse like racism, sexism, etc., online antisemitism has not been studied much from a machine learning perspective. To the best of our knowledge, we present the first work in the direction of automated multimodal detection of online antisemitism. The task poses multiple challenges that include extracting signals across multiple modalities, contextual references, and handling multiple aspects of antisemitism. Unfortunately, there does not exist any publicly available benchmark corpus for this critical task. Hence, we collect and label two datasets with 3,102 and 3,509 social media posts from Twitter and Gab respectively. Further, we present a multimodal deep learning system that detects the presence of antisemitic content and its specific antisemitism category using text and images from posts. We perform an extensive set of experiments on the two datasets to evaluate the efficacy of the proposed system. Finally, we also present a qualitative analysis of our study.
@inproceedings{chandra2021the,title={“Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning},author={Chandra, M. and Pailla, D. and Bhatia, H. and Sanchawala, A. and Gupta, M. and Shrivastava, M. and Kumaraguru, P.},year={2021},booktitle={13th ACM Web Science Conference (WebSci) 2021},}
POSN
On the Vulnerability of Community Structure in Complex Networks
V. Parimi, A. Pal, S. Ruj, P. Kumaraguru, and T. Chakraborty
In Principles of Social Networking: The New Horizon and Emerging Challenge, Springer, 2021
In this paper, we study the role of nodes and edges in a complex network in dictating the robustness of a community structure toward structural perturbations. Specifically, we attempt to identify all vital nodes, which, when removed, would lead to a large change in the underlying community structure of the network. This problem is critical because the community structure of a network allows us to explore deep underlying insights into how the function and topology of the network affect each other. Moreover, it even provides a way to condense large networks into smaller modules where each community acts as a meta node and aids in more straightforward network analysis. If the community structure were to be compromised by either accidental or intentional perturbations to the network, such analysis would become difficult. Since identifying such vital nodes is computationally intractable, we propose a suite of heuristics that find near-optimal solutions. To show the effectiveness of our approach, we first test these heuristics on small networks and then move to more extensive networks to show that we achieve similar results. Further analysis reveals that the proposed approaches are useful to analyze the vulnerability of communities in networks irrespective of their size and scale. Additionally, we show the performance through an extrinsic evaluation framework: we employ two tasks, i.e., link prediction and information diffusion, and show that the effect of our algorithms on these tasks is higher than that of the other baselines.
@inproceedings{parimi2021onthevulnerability,title={On the Vulnerability of Community Structure in Complex Networks},author={Parimi, V. and Pal, A. and Ruj, S. and Kumaraguru, P. and Chakraborty, T.},year={2021},booktitle={Principles of Social Networking: The New Horizon and Emerging Challenge, Springer},}
Student Abstract @ AAAI
Detecting Lexical Semantic Change across Corpora with Smooth Manifolds
A. Goel and P. Kumaraguru
In Student Abstract, 35th AAAI Conference on Artificial Intelligence 2021, 2021
Comparing two bodies of text and detecting words with significant lexical semantic shift between them is an important part of digital humanities. Traditional approaches have relied on aligning the different embeddings using the Orthogonal Procrustes problem in the Euclidean space. This study presents a geometric framework that leverages smooth Riemannian manifolds for corpus-specific orthogonal rotations and a corpus-independent scaling metric to project the different vector spaces into a shared latent space. This enables us to capture any affine relationship between the embedding spaces while utilising the rich geometry of smooth manifolds.
@inproceedings{goel2021detectinglexicalsemantic,title={Detecting Lexical Semantic Change across Corpora with Smooth Manifolds},author={Goel, A. and Kumaraguru, P.},year={2021},booktitle={Student Abstract, 35th AAAI Conference on Artificial Intelligence 2021},}
CODS-COMAD
A Geometric Measure of Polysemy in Hindi Language
A. Goel and P. Kumaraguru
In ACM India Joint International Conference on Data Science and Management of Data 2021 (Young Researchers’ Symposium), 2021
A word referring to two or more different meanings is called polysemous. In this study, we introduce a geometric method to estimate the polysemy of words using the discrete Ollivier-Ricci curvature of a graph of synonyms in Hindi. We show that this approach can effectively measure the polysemy of words and is strongly correlated with theoretical interpretations of polysemy.
@inproceedings{goel2021ageometricmeasure,title={A Geometric Measure of Polysemy in Hindi Language},author={Goel, A. and Kumaraguru, P.},year={2021},booktitle={ACM India Joint International Conference on Data Science and Management of Data 2021 (Young Researchers’ Symposium)},}
CALCS @ ACL
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences
Code-mixed languages are very popular in multilingual societies around the world, yet the resources needed to enable robust systems for such languages lag behind. A major contributing factor is the informal nature of these languages, which makes it difficult to collect code-mixed data. In this paper, we propose our system for Task 1 of CALCS 2021 to generate a machine translation system for English to Hinglish in a supervised setting. Translating in the given direction can help expand the set of resources for several tasks by translating valuable datasets from high resource languages. We propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and fully utilize the pre-training of the model by transliterating the roman Hindi words in the code-mixed sentences to Devanagari script. We evaluate how expanding the input by concatenating Hindi translations of the English sentences improves mBART’s performance. Our system gives a BLEU score of 12.22 on the test set. Further, we perform a detailed error analysis of our proposed systems and explore the limitations of the provided dataset and metrics.
@inproceedings{gautam-etal-2021-comet,title={{C}o{M}e{T}: Towards Code-Mixed Translation Using Parallel Monolingual Sentences},author={Gautam, Devansh and Kodali, Prashant and Gupta, Kshitij and Goel, Anmol and Shrivastava, Manish and Kumaraguru, Ponnurangam},year={2021},booktitle={Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching},}
GermEval
Precog-LTRC-IIITH at GermEval 2021: Ensembling Pre-Trained Language Models with Feature Engineering
T. H. Arjun, Arvindh A., and Ponnurangam Kumaraguru
In Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, 2021
We describe our participation in all the subtasks of the GermEval 2021 shared task on the identification of Toxic, Engaging, and Fact-Claiming Comments. Our system is an ensemble of state-of-the-art pre-trained models finetuned with carefully engineered features. We show that feature engineering and data augmentation can be helpful when the training data is sparse. We achieve F1 scores of 66.87, 68.93, and 73.91 in the Toxic, Engaging, and Fact-Claiming comment identification subtasks, respectively.
@inproceedings{germeval-21,title={Precog-{LTRC}-{IIITH} at {G}erm{E}val 2021: Ensembling Pre-Trained Language Models with Feature Engineering},author={Arjun, T. H. and A., Arvindh and Kumaraguru, Ponnurangam},year={2021},booktitle={Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments},}
2020
AAAI
SpotFake+: A Multimodal Framework for Fake News Detection via Transfer Learning (Student Abstract)
S. Singhal, A. Kabra, M. Sharma, RR. Shah, T. Chakraborty, and P. Kumaraguru
In Proceedings of Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2020
In recent years, there has been a substantial rise in the consumption of news via online platforms. The ease of publication and lack of editorial rigour in some of these platforms have further led to the proliferation of fake news. In this paper, we study the problem of detecting fake news on the FakeNewsNet repository, a collection of full length articles along with associated images. We present SpotFake+, a multimodal approach that leverages transfer learning to capture semantic and contextual information from the news articles and their associated images and achieves better accuracy for fake news detection. To the best of our knowledge, this is the first work that performs a multimodal approach for fake news detection on a dataset that consists of full length articles. It outperforms both single-modality and multiple-modality models. We also release the pretrained model for the benefit of the community.
@inproceedings{singhal2020amultimodal,title={SpotFake+: A Multimodal Framework for Fake News Detection via Transfer Learning (Student Abstract)},author={Singhal, S. and Kabra, A. and Sharma, M. and Shah, RR. and Chakraborty, T. and Kumaraguru, P.},year={2020},booktitle={Proceedings of Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)},}
DiffGeo4DL @ Neurips
Leveraging Smooth Manifolds for Lexical Semantic Change Detection across Corpora
Comparing two bodies of text and detecting words with significant lexical semantic shift between them is an important part of digital humanities. Traditional approaches have relied on aligning the different embeddings in the Euclidean space using the Orthogonal Procrustes problem. This study presents a geometric framework that leverages optimization on smooth Riemannian manifolds for obtaining corpus-specific orthogonal rotations and a corpus-independent scaling to project the different vector spaces into a shared latent space. This enables us to capture any affine relationship between the embedding spaces while utilising the rich geometry of smooth manifolds.
@inproceedings{goel2020leveragingsmoothmanifolds,title={Leveraging Smooth Manifolds for Lexical Semantic Change Detection across Corpora},author={Goel, A. and Kumaraguru, P.},year={2020},booktitle={DiffGeo4DL Workshop, NeurIPS 2020},}
COLING
AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts
M. Chandra, A. Pathak, E. Dutta, P. Jain, M. Gupta, M. Shrivastava, and P. Kumaraguru
In 28th International Conference on Computational Linguistics (COLING) 2020, 2020
While the extensive popularity of online social media platforms has made information dissemination faster, it has also resulted in widespread online abuse of different types like hate speech, offensive language, sexist and racist opinions, etc. Detection and curtailment of such abusive content is critical for avoiding its psychological impact on victim communities, and thereby preventing hate crimes. Previous works have focused on classifying user posts into various forms of abusive behavior, but there has hardly been any focus on estimating the severity of abuse and the target. In this paper, we present a first-of-its-kind dataset with 7,601 posts from Gab which looks at online abuse from the perspective of presence of abuse, severity and target of abusive behavior. We also propose a system to address these tasks, obtaining an accuracy of ∼80% for abuse presence, ∼82% for abuse target prediction, and ∼65% for abuse severity prediction.
@inproceedings{chandra2020abuse,title={AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts},author={Chandra, M. and Pathak, A. and Dutta, E. and Jain, P. and Gupta, M. and Shrivastava, M. and Kumaraguru, P.},year={2020},booktitle={28th International Conference on Computational Linguistics (COLING) 2020},}
SocInfo
#IVoted to #IGotPwned: Studying Voter Privacy Leaks in Indian Lok Sabha Elections on Twitter
S. Gupta, A. Agarwal, S. Vyalla, A. Buduru, and P. Kumaraguru
In 12th International Conference on Social Informatics (SocInfo) 2020, 2020
Online Social Networks (OSNs) play a crucial role in elections worldwide. Users post their opinions and sentiments on events, candidates, and parties. One of the cardinal principles in elections is to ensure that the party (or candidate) for which a citizen votes remains secret. However, given that citizens are free to express their opinions and views on OSN platforms like Twitter, some of them, in direct and indirect ways, reveal their political inclinations, which we refer to as Voter Privacy Leaks (VPL). In this paper, we cross-link VPL users’ online details with other publicly available information (like electoral rolls, which are lists of all eligible voters), to get access to their personally identifiable information (PII) including their voter ID, age, gender, address, and family details. Finally, to safeguard such users, we develop a browser-plugin-based nudge that leverages machine learning-based classifiers to flag a given post on Twitter as a VPL or a Non-VPL, thereby helping users protect their voter privacy. To validate our approach, we focus on the period of Lok Sabha elections held in 2019 in India, the largest democracy in the world. We collect tweets starting from April 11, 2019, to May 22, 2019. We detect 91,253 instances of VPL and, using a subset of electoral rolls, successfully cross-link 44 Twitter users to their exact PII. Our proposed nudge detects 93% of VPL incidents.
@inproceedings{gupta2020to,title={\#IVoted to \#IGotPwned: Studying Voter Privacy Leaks in Indian Lok Sabha Elections on Twitter},author={Gupta, S. and Agarwal, A. and Vyalla, S. and Buduru, A. and Kumaraguru, P.},year={2020},booktitle={12th International Conference on Social Informatics (SocInfo) 2020},}
TPS-ISA
imdpGAN: Generating Private and Specific Data with Generative Adversarial Networks
S. Gupta, A. Buduru, and P. Kumaraguru
In IEEE International Conference on Trust, Privacy, and Security in Intelligent Systems, and Applications 2020, 2020
Generative Adversarial Network (GAN) and its variants have shown promising results in generating synthetic data. However, the issues with GANs are: (i) the learning happens around the training samples and the model often ends up remembering them, consequently compromising the privacy of individual samples - this becomes a major concern when GANs are applied to training data including personally identifiable information, (ii) the randomness in generated data - there is no control over the specificity of generated samples. To address these issues, we propose imdpGAN, an information maximizing differentially private Generative Adversarial Network. It is an end-to-end framework that simultaneously achieves privacy protection and learns latent representations. With experiments on the MNIST dataset, we show that imdpGAN preserves the privacy of the individual data point, and learns latent codes to control the specificity of the generated samples. We perform binary classification on digit pairs to show the utility versus privacy trade-off. The classification accuracy decreases as we increase privacy levels in the framework. We also experimentally show that the training process of imdpGAN is stable but experiences a 10-fold time increase as compared with other GAN frameworks. Finally, we extend the imdpGAN framework to the CelebA dataset to show how the privacy and learned representations can be used to control the specificity of the output.
@inproceedings{gupta2020generatingprivate,title={imdpGAN: Generating Private and Specific Data with Generative Adversarial Networks},author={Gupta, S. and Buduru, A. and Kumaraguru, P.},year={2020},booktitle={IEEE International Conference on Trust, Privacy, and Security in Intelligent Systems, and Applications 2020},}
BigMM
Analyzing Traffic Violations through e-challan System in Metropolitan Cities
R. Mishra, A. Sadaria, S. Srikanth, K. Gupta, H. Bhatia, P. Jain, RR. Shah, and P. Kumaraguru
In 6th IEEE International Conference on Multimedia Big Data (BigMM) 2020, 2020
Given that India is now moving towards automated solutions to curb traffic violations and road accidents, we focus our efforts on characterizing these violations in Indian cities. In this work, we present our characterization of the traffic violations via an Automated e-challan (electronic traffic-violation receipt) issuance system of Ahmedabad and New Delhi. To explore this, we collected an exhaustive dataset of over 6 million e-challans. Characterizing the fine payment behavior, we find that 57% of unique vehicles in Ahmedabad are involved in repeat offenses. The temporal analysis shows a significant difference in e-challans issued during the festivals. Spatially, different violation types are distributed differently with the existence of certain unique hotspots. Finally, we also demonstrate how e-challans can act as a proxy measure to analyze the efficacy of the Motor Vehicles (Amendment) Act 2019. Our work suggests that high penalties may not have a long term impact on decreasing traffic violations.
@inproceedings{mishra2020analyzingtrafficviolations,title={Analyzing Traffic Violations through e-challan System in Metropolitan Cities},author={Mishra, R. and Sadaria, A. and Srikanth, S. and Gupta, K. and Bhatia, H. and Jain, P. and Shah, RR. and Kumaraguru, P.},year={2020},booktitle={6th IEEE International Conference on Multimedia Big Data (BigMM) 2020},}
BigMM
Are Bot Humans? Analysis of bot accounts in 2019 Indian Lok Sabha election
O. Hitkul, A. Sadaria, K. Gupta, S. Srikanth, RR. Shah, and P. Kumaraguru
In 6th IEEE International Conference on Multimedia Big Data (BigMM) 2020, 2020
Social media platforms have taken political and cultural conversations to an online platform, making them more accessible. The ability to post anonymously has allowed more people to participate fearlessly. However, this has also led to an opportunity to spread misinformation and manipulative content. Political groups around the globe have used Bot accounts to help spread their preferred narrative online during elections. In the midst of the 2019 Indian Lok Sabha Elections, speculations were made about the presence of cyber-troops/IT Cells which operate fake accounts and push propaganda. Our findings suggest that a portion of Bot accounts seems to be operated by humans in the background. These accounts have a very distinct usage pattern on Twitter compared to legitimate human users. Our experiments also point out that only 1.3% of total interactions are directed from Humans to Bots, showing Bot accounts’ inability to gel well in the online social network.
@inproceedings{hitkul2020arebot,title={Are Bot Humans? Analysis of bot accounts in 2019 Indian Lok Sabha election},author={Hitkul, Gurjar, O. and Sadaria, A. and Gupta, K. and Srikanth, S. and Shah, RR. and Kumaraguru, P.},year={2020},booktitle={6th IEEE International Conference on Multimedia Big Data (BigMM) 2020},}
JEI
DeepFakes: temporal sequential analysis to detect face-swapped video clips using convolutional long short-term memory
Deepfake (a portmanteau of “deep learning” and “fake”) is a technique for human image synthesis based on artificial intelligence, i.e., to superimpose the existing (source) images or videos onto destination images or videos using neural networks (NNs). Deepfake enthusiasts have been using NNs to produce convincing face swaps. Deepfakes are a type of video or image forgery developed to spread misinformation, invade privacy, and mask the truth using advanced technologies such as trained algorithms, deep learning applications, and artificial intelligence. They have become a nuisance to social media users by publishing fake videos created by fusing a celebrity’s face over an explicit video. The impact of deepfakes is alarming, with politicians, senior corporate officers, and world leaders being targeted by nefarious actors. An approach to detect deepfake videos of politicians using temporal sequential frames is proposed. The proposed approach uses the forged video to extract the frames at the first level followed by a deep depth-based convolutional long short-term memory model to identify the fake frames at the second level. The proposed model is also evaluated on our newly collected ground truth dataset of forged videos using source and destination video frames of famous politicians. Experimental results demonstrate the effectiveness of our method.
@article{kaur2020temporalsequential,title={DeepFakes: temporal sequential analysis to detect face-swapped video clips using convolutional long short-term memory},author={Kaur, S. and Kumar, P. and Kumaraguru, P.},year={2020},journal={Journal of Electronic Imaging},}
BigMM
Hashtags are (not) judgmental: The untold story of Lok Sabha elections 2019
S. Gupta, A. Singh, A. Buduru, and P. Kumaraguru
In 6th IEEE International Conference on Multimedia Big Data (BigMM) 2020, 2020
Hashtags in online social media have become a way for users to build communities around topics, promote opinions, and categorize messages. In the political context, hashtags on Twitter are used by users to campaign for their parties, spread news, or to get followers and get a general idea by following a discussion built around a hashtag. In the past, researchers have studied certain types and specific properties of hashtags by utilizing a lot of data collected around hashtags. In this paper, we perform a large-scale empirical analysis of elections using only the hashtags shared on Twitter during the 2019 Lok Sabha elections in India. We study the trends and events that unfolded on the ground, the latent topics to uncover representative hashtags, and semantic similarity to discover sentiments during elections. We collect over 24 million hashtags to perform extensive experiments to find the trending hashtags, and cross-reference them with the tweets in our data set to list notable events. We also use semantic similarity based techniques to find related hashtags and latent topics among the hashtags.
@inproceedings{gupta2020hashtagsare,title={Hashtags are (not) judgmental: The untold story of Lok Sabha elections 2019},author={Gupta, S. and Singh, A. and Buduru, A. and Kumaraguru, P.},year={2020},booktitle={6th IEEE International Conference on Multimedia Big Data (BigMM) 2020},}
SMDS
Multi-objective Reinforcement Learning based approach for User-Centric Power Optimization in Smart Home Environments
S. Gupta, S. Bhambri, K. Dhingra, A. Buduru, and P. Kumaraguru
Smart homes require every device inside them to be connected with each other at all times, which leads to a lot of power wastage on a daily basis. As the devices inside a smart home increase, it becomes difficult for the user to control or operate every individual device optimally. Therefore, users generally rely on power management systems for such optimization but often are not satisfied with the results. In this paper, we present a novel multi-objective reinforcement learning framework with two-fold objectives of minimizing power consumption and maximizing user satisfaction. The framework explores the trade-off between the two objectives and converges to a better power management policy when both objectives are considered while finding an optimal policy. We experiment on real-world smart home data, and show that the multi-objective approaches: i) establish trade-off between the two objectives, ii) achieve better combined user satisfaction and power consumption than single-objective approaches. We also show that the devices that are used regularly and have several fluctuations in device modes at regular intervals should be targeted for optimization, and the experiments on data from other smart homes fetch similar results, hence ensuring transfer-ability of the proposed framework.
@inproceedings{gupta2020reinforcementlearning,title={Multi-objective Reinforcement Learning based approach for User-Centric Power Optimization in Smart Home Environments},author={Gupta, S. and Bhambri, S. and Dhingra, K. and Buduru, A. and Kumaraguru, P.},year={2020},booktitle={IEEE SmartDataServices 2020},}
ICWSM
Driving the Last Mile: Characterizing and Understanding Distracted Driving Posts on Social Networks
H. Lamba, S. Srikanth, D. Reddy, S. Singh, K. Juneja, and P. Kumaraguru
In International AAAI Conference on Web and Social Media, 2020
In 2015, 391,000 people were injured due to distracted driving in the US. One of the major reasons behind distracted driving is the use of cell-phones, accounting for 14% of fatal crashes. Social media applications have enabled users to stay connected; however, the use of such applications while driving could have serious repercussions - often leading the user to be distracted from the road and ending up in an accident. In the context of impression management, it has been discovered that individuals often take a risk (such as teens smoking cigarettes, indulging in narcotics, and participating in unsafe sex) to improve their social standing. Therefore, viewing the phenomenon of posting distracted driving posts under the lens of self-presentation, it can be hypothesized that users often indulge in risk-taking behavior on social media to improve their impression among their peers. In this paper, we first try to understand the severity of such social-media-based distractions by analyzing the content posted on a popular social media site where the user is driving and is also simultaneously creating content. To this end, we build a deep learning classifier to identify publicly posted content on social media that involves the user driving. Furthermore, a framework proposed to understand factors behind voluntary risk-taking activity observes that younger individuals are more willing to perform such activities, and men (as opposed to women) are more inclined to take risks. Grounding our observations in this framework, we test these hypotheses on 173 cities across the world. We conduct spatial and temporal analysis on a city-level and understand how distracted driving content posting behavior changes due to varied demographics. We discover that the factors put forth by the framework are significant in estimating the extent of such behavior.
@inproceedings{lamba2020drivingthelast,title={Driving the Last Mile: Characterizing and Understanding Distracted Driving Posts on Social Networks},author={Lamba, H. and Srikanth, S. and Reddy, D. and Singh, S. and Juneja, K. and Kumaraguru, P.},year={2020},booktitle={International AAAI Conference on Web and Social Media},}
SAC
Investigation of Biases in Identity Linkage DataSets
R. Kaushal, S. Gupta, and P. Kumaraguru
In 35th ACM/SIGAPP Symposium on Applied Computing (SAC 2020), 2020
In social networks, the problem of identity linkage is to find whether a pair of user identities on two social networks belong to the same individual or not. Prior works typically first collect ground truth datasets of user identities across social networks belonging to the same individuals and then build a machine learning model driven by features from user identities. User behaviors in different social networks drive the construction of these datasets, and as a consequence, behavioral biases get manifested in them. Our work performs a detailed investigation into these dataset biases, a topic which has mostly remained under-explored in identity linkage research. More specifically, we characterize, detect, and quantify behavioral biases in the dataset that manifest in the form of lexical differences in user-generated content, particularly in usernames and display names configured by users. We study these biases on more than 1 million user identity pairs obtained by leveraging two user behaviors, namely cross-posting and self-disclosure. We find that users who self-disclose their usernames and display names on different social networks show higher lexical similarity than users who cross-post. These behavioral biases lower the performance (precision and recall) of learning models by 5–20%. Inspired by discrimination measurement metrics, we propose and implement a framework to quantify the extent of these biases and find that 15–20% of test data get affected.
@inproceedings{kaushal2020investigationofbiases,title={Investigation of Biases in Identity Linkage DataSets},author={Kaushal, R. and Gupta, S. and Kumaraguru, P.},year={2020},booktitle={35th ACM/SIGAPP Symposium on Applied Computing (SAC 2020)},}
NetSci-X
NeXLink: Node Embedding Framework for Cross-Network Linkages Across Social Networks
R. Kaushal, S. Singh, and P. Kumaraguru
In International School and Conference on Network Science (NetSci-X 2020), 2020
Users create accounts on multiple social networks to get connected to their friends across these networks. We refer to these user accounts as user identities. Since users join multiple social networks, there will be cases where a pair of user identities across two different social networks belong to the same individual. We refer to such pairs as Cross-Network Linkages (CNLs). In this work, we model the social network as a graph to explore whether we can obtain an effective social network graph representation such that node embeddings of users belonging to CNLs are closer in embedding space than other nodes, using only the network information. To this end, we propose a modular and flexible node embedding framework, referred to as NeXLink, which comprises three steps. First, we obtain local node embeddings by preserving the local structure of nodes within the same social network. Second, we learn the global node embeddings by preserving the global structure, which is present in the form of common friendship exhibited by nodes involved in CNLs across social networks. Third, we combine the local and global node embeddings, which preserve local and global structures to facilitate the detection of CNLs across social networks. We evaluate our proposed framework on an augmented (synthetically generated) dataset of 63,713 nodes & 817,090 edges and a real-world dataset of 3,338 Twitter-Foursquare node pairs. Our approach achieves an average Hit@1 rate of 98% for detecting CNLs across social networks and significantly outperforms previous state-of-the-art methods.
@inproceedings{kaushal2020nodeembedding,title={NeXLink: Node Embedding Framework for Cross-Network Linkages Across Social Networks},author={Kaushal, R. and Singh, S. and Kumaraguru, P.},year={2020},booktitle={International School and Conference on Network Science (NetSci-X 2020)},}
JSC
Automating Fake News Detection System using Multi-level Voting Model
The issues of online fake news have attained an increasing eminence in the diffusion of shaping news stories online. Misleading or unreliable information in the form of videos, posts, articles, URLs is extensively disseminated through popular social media platforms such as Facebook and Twitter. As a result, editors and journalists are in need of new tools that can help them to pace up the verification process for the content that has been originated from social media. Motivated by the need for automated detection of fake news, the goal is to find out which classification model identifies phony features accurately using three feature extraction techniques, Term Frequency–Inverse Document Frequency (TF–IDF), Count-Vectorizer (CV) and Hashing-Vectorizer (HV). Also, in this paper, a novel multi-level voting ensemble model is proposed. The proposed system has been tested on three datasets using twelve classifiers. These ML classifiers are combined based on their false prediction ratio. It has been observed that the Passive Aggressive, Logistic Regression and Linear Support Vector Classifier (LinearSVC) individually perform best using TF-IDF, CV and HV feature extraction approaches, respectively, based on their performance metrics, whereas the proposed model outperforms the Passive Aggressive model by 0.8%, Logistic Regression model by 1.3%, LinearSVC model by 0.4% using TF-IDF, CV and HV, respectively. The proposed system can also be used to predict the fake content (textual form) from online social media websites.
@inproceedings{kaur2020automatingfakenews,title={Automating Fake News Detection System using Multi-level Voting Model},author={Kaur, S. and Kumar, P. and Kumaraguru, P.},year={2020},booktitle={Journal of Soft Computing},}
Thesis
User Identity Linkage: Data Collection, DataSet Biases, Method, Control and Application
@inproceedings{useridentitylinkagedatacollectiondatasetbiasesmethodcontrolandapplication,title={User Identity Linkage: Data Collection, DataSet Biases, Method, Control and Application},author={Kaushal, R.},year={2020},booktitle={Ph.D. Thesis, IIIT-Delhi},}
Thesis
Characterizing and Detecting livestreaming Chatbots
S. Jain
In MS by Research in Computer Science and Engineering IIIT-Hyderabad, 2020
@inproceedings{characterizinganddetectinglivestreamingchatbots,title={Characterizing and Detecting livestreaming Chatbots},author={Jain, S.},year={2020},booktitle={MS by Research in Computer Science and Engineering IIIT-Hyderabad},}
This paper presents Con2KG, a large-scale recruitment domain Knowledge Graph that describes 4 million triples as facts from 250 thousands of unstructured data of job postings. We propose a novel framework for Knowledge Graph construction from unstructured text and an unsupervised, dynamically evolving ontology that helps Con2KG to capture hierarchical links between the entities missed by explicit relational facts in the triples. To enrich our graph, we include entity context and its polarity. Towards this end, we discuss Con2KG applications that may benefit the recruitment domain.
@inproceedings{goyal2019,title={Con2KG-A Large-scale Domain-Specific Knowledge Graph},author={Goyal, N. and Sachdeva, N. and Choudhary, V. and Kar, R. and Kumaraguru, P. and Rajput, N.},year={2019},booktitle={Proceedings of the 30th ACM Conference on Hypertext and Social Media},}
RSSCONF
Analysing How the Shift in Discourses on Social Media Affected the Narrative Around the Indian General Election 2019
D. Manu, R. Krishnan, and P. Kumaraguru
In 2nd International Conference on Research in Social Sciences (RSSCONF), 2019
The Lok Sabha Elections 2019 in the world’s largest democracy, India, was the biggest electoral event on the planet. These elections are key in the selection of the Prime Minister, the highest authority in the cabinet. Keeping in pace with the global trend, the Indian elections saw a very prominent use of Online Social Media by political parties to create a major discourse around the event. We focus our study on Twitter, collecting over 45 Million tweets, tracking more than 3500 hashtags and over 2500 political handles while monitoring their network interactions. In this work, we have analysed tweets from all these political handles to see how narratives were shaped and altered over time. We study these narratives formed by the party already in power and how they were supported or challenged by other parties. Spanning over 5 months, January to May 2019, we analysed the monthly changes in the rhetoric created by the leading political parties and leaders. We then discern the impact of these changes on existing narratives during the campaigning and the elections.
@inproceedings{manu2019analysinghowthe,title={Analysing How the Shift in Discourses on Social Media Affected the Narrative Around the Indian General Election 2019},author={Manu, D. and Krishnan, R. and Kumaraguru, P.},year={2019},booktitle={2nd International Conference on Research in Social Sciences (RSSCONF)},}
CACM
The Positive and Negative Effects of Social Media in India
N. Ganguly and P. Kumaraguru
In Communications of Association of Computing Machinery, 2019
There has been a phenomenal increase in the use of online social media (OSM) services in India, including Facebook, Twitter, Instagram, LinkedIn, and YouTube. In addition to these services, one-to-one messaging services like WhatsApp have 200 million users, the highest in the world. India has 462 million users accessing the Internet, among these: Facebook has 250+ million users, LinkedIn 42+ million, and Twitter 23+ million users, and the majority of users access these services through their mobile phones.
@inproceedings{ganguly2019thepositiveand,title={The Positive and Negative Effects of Social Media in India},author={Ganguly, N. and Kumaraguru, P.},year={2019},booktitle={Communications of Association of Computing Machinery},}
MIKE
Detection of Misbehaviors in Clone Identities on Online Social Networks
R. Kaushal, C. Sharma, and P. Kumaraguru
In 7th International Conference on Mining Intelligence and Knowledge Exploration (MIKE 2019), 2019
The account registration steps in Online Social Networks (OSNs) are simple to facilitate users to join the OSN sites. Alongside, Personally Identifiable Information (PII) of users is readily available online. Therefore, it becomes trivial for a malicious user (attacker) to create a spoofed identity of a real user (victim), which we refer to as clone identity. While a victim can be an ordinary or a famous person, we focus our attention on clone identities of famous persons (celebrity clones). These clone identities ride on the credibility and popularity of celebrities to gain engagement and impact. In this work, we analyze celebrity clone identities and extract an exhaustive set of 40 features based on posting behavior, friend network and profile attributes. Accordingly, we characterize their behavior as benign and malicious. On detailed inspection, we find benign behaviors are either to promote the celebrity which they have cloned or seek attention, thereby helping in the popularity of celebrity. However, on the contrary, we also find malicious behaviors (misbehaviors) wherein clone celebrities indulge in spreading indecent content, issuing advisories and opinions on contentious topics. We evaluate our approach on a real social network (Twitter) by constructing a machine learning based model to automatically classify behaviors of clone identities, and achieve accuracies of 86%, 95%, 74%, 92% & 63% for five clone behaviors corresponding to promotion, indecency, attention-seeking, advisory and opinionated.
@inproceedings{kaushal2019detectionofmisbehaviors,title={Detection of Misbehaviors in Clone Identities on Online Social Networks},author={Kaushal, R. and Sharma, C. and Kumaraguru, P.},year={2019},booktitle={7th International Conference on Mining Intelligence and Knowledge Exploration (MIKE 2019)},}
SocialCom
Methods for User Profiling Across Social Networks
R. Kaushal, V. Ghosh, and P. Kumaraguru
In 12th IEEE International Conference on Social Computing (SocialCom 2019), 2019
Users have their accounts on multiple Online Social Networks (OSNs) to access a variety of content and connect to their friends. Consequently, user behaviors get distributed across many OSNs. Collecting comprehensive user information is referred to as user profiling; an essential first step is to link user accounts (identities) belonging to the same individual across OSNs. To this end, we provide a detailed methodology of five methods useful for user profiling, which we refer to as Advanced Search Operator (ASO), Social Aggregator (SA), Cross-Platform Sharing (CPS), Self-Disclosure (SD) and Friend Finding Feature (FFF). Taken together, we collect linked identities of 208,120 individuals distributed across 43 different OSNs. We compare these methods quantitatively based on social network coverage and the number of linked identities obtained per individual. We also perform a qualitative assessment of the linked user data obtained by these methods on the criteria of completeness, validity, consistency, accuracy, and timeliness.
@inproceedings{kaushal2019methodsforuser,title={Methods for User Profiling Across Social Networks},author={Kaushal, R. and Ghosh, V. and Kumaraguru, P.},year={2019},booktitle={12th IEEE International Conference on Social Computing (SocialCom 2019)},}
WebSci
Building Sociality through Sharing: Seniors’ Perspectives on Misinformation
In 11th ACM Conference on Web Science (WebSci ’19), 2019
This paper attempts to understand the perspectives of the seniors (aged 65 years and above) on misinformation in the Indian context. Interviews with 33 seniors who use social media regularly revealed three themes. The seniors viewed and rationalized sharing news irrespective of its veracity as a process of building sociality. Sharing information was also based on the logic of superimposing information with an epistemic ascription to the networks from where they received it. Finally, a kind of normative dualism becomes apparent from an acknowledgment of the role they may play in the spread of misinformation as agents on the one hand and a resounding need to stop it on the other due to its potential social ramifications.
@inproceedings{wason2019buildingsocialitythrough,title={Building Sociality through Sharing: Seniors' Perspectives on Misinformation},author={},year={2019},booktitle={11th ACM Conference on Web Science (WebSci ’19)},}
ASONAM
Finding Your Social Space: Empirical Study of Social Exploration in Multiplayer Online Games
A. Chandra, Z. Borbora, P. Kumaraguru, and J. Srivastava
In 9th Workshop on Social Network Analysis in Applications, International Conference on Advances in Social Networks Analysis and Mining (ASONAM ’19), 2019
Social dynamics are based on human needs for trust, support, resource sharing, irrespective of whether they operate in real life or in a virtual setting. Massively multiplayer online role-playing games (MMORPGS) serve as enablers of leisurely social activity and are important tools for social interactions. Past research has shown that socially dense gaming environments like MMORPGs can be used to study important social phenomena, which may operate in real life, too. We describe the process of social exploration to entail the following components 1) finding the balance between personal and social time 2) making choice between a large number of weak ties or few strong social ties. 3) finding a social group. In general, these are the major determinants of an individual’s social life. This paper looks into the phenomenon of social exploration in an activity based online social environment. We study this process through the lens of the following research questions, 1) What are the different social behavior types? 2) Is there a change in a player’s social behavior over time? 3) Are certain social behaviors more stable than the others? 4) Can longitudinal research of player behavior help shed light on the social dynamics and processes in the network? We use an unsupervised machine learning approach to come up with 4 different social behavior types - Lone Wolf, Pack Wolf of Small Pack, Pack Wolf of a Large Pack and Social Butterfly. The types represent the degree of socialization of players in the game. Our research reveals that social behaviors change with time. While lone wolf and pack wolf of small pack are more stable social behaviors, pack wolf of large pack and social butterflies are more transient. We also observe that players progressively move from large groups with weak social ties to settle in small groups with stronger ties.
@inproceedings{chandra2019findingyoursocial,title={Finding Your Social Space: Empirical Study of Social Exploration in Multiplayer Online Games},author={Chandra, A. and Borbora, Z. and Kumaraguru, P. and Srivastava, J.},year={2019},booktitle={9th Workshop on Social Network Analysis in Applications, International Conference on Advances in Social Networks Analysis and Mining (ASONAM '19)},}
FOSINT-SI
On Churn and Social Contagion
Z. Borbora, A. Chandra, P. Kumaraguru, and J. Srivastava
In International Conference on Advances in Social Networks Analysis and Mining (ASONAM ’19), Foundations of Open Source Intelligence and Security Informatics (FOSINT-SI 2019), 2019
Massively Multiplayer Online Role-Playing Games (MMORPGs) are persistent virtual environments where millions of players interact in an online manner. We study the problem of player churn and social contagion using MMORPG game logs by analyzing the impact of a node’s churn behavior on its immediate neighborhood or group. The two key research questions in this paper are - When an active node, ego, becomes dormant, what is the impact on the activity behavior of ego’s immediate neighbor, alter, 1) based on ego’s characteristics and ego’s relationship with alter and 2) based on the activity behavior of alter’s remaining neighbors. We use a supervised learning framework to study the impact of player churn and social contagion. Experimental results show that the classification models perform substantially better than random for both the research problems. Finally, we use a data-driven approach to propose a player typology based on degree of socialization and analyze churn behavior among these player types. Experimental results show that the loner player type is much more likely to churn than the socializer player types and as the degree of socialization decreases among socializers, the propensity to churn increases.
@inproceedings{borbora2019onchurnand,title={On Churn and Social Contagion},author={Borbora, Z. and Chandra, A. and Kumaraguru, P. and Srivastava, J.},year={2019},booktitle={International Conference on Advances in Social Networks Analysis and Mining (ASONAM '19), Foundations of Open Source Intelligence and Security Informatics (FOSINT-SI 2019)},}
BigMM
SpotFake: A Multi-Modal Framework for Fake News Detection
S. Singhal, R. Shah, T. Chakraborty, P. Kumaraguru, and S. Satoh
In The Fifth IEEE International Conference on Multimedia Big Data, 2019
A rapid growth in the amount of fake news on social media is a very serious concern in our society. It is usually created by manipulating images, text, audio, and videos. This indicates that there is a need for a multimodal system for fake news detection. Although multimodal fake news detection systems exist, they tend to solve the problem of fake news by considering an additional sub-task such as an event discriminator and finding correlations across the modalities. The results of fake news detection are heavily dependent on the subtask, and in the absence of subtask training, the performance of fake news detection degrades by 10% on average. To solve this issue, we introduce SpotFake, a multi-modal framework for fake news detection. Our proposed solution detects fake news without taking into account any other subtasks. It exploits both the textual and visual features of an article. Specifically, we make use of language models (like BERT) to learn text features, and image features are learned from VGG-19 pre-trained on the ImageNet dataset. All the experiments are performed on two publicly available datasets, i.e., Twitter and Weibo. The proposed model performs better than the current state-of-the-art on Twitter and Weibo datasets by 3.27% and 6.83%, respectively.
@inproceedings{singhal2019a,title={SpotFake: A Multi-Modal Framework for Fake News Detection},author={Singhal, S. and Shah, R. and Chakraborty, T. and Kumaraguru, P. and Satoh, S.},year={2019},booktitle={The Fifth IEEE International Conference on Multimedia Big Data},}
BigMM
Detecting Trolling Prone Images on Instagram
Hitant Kul, R. Shah, P. Kumaraguru, and S. Satoh
In The Fifth IEEE International Conference on Multimedia Big Data, 2019
Improvement in network infrastructure and smartphones have made images based social media platforms like Instagram and Flickr popular. The visual medium of communication has also led to an alarming increase in trolling incidents on social media. Though it is crucial to automatically detect trolling incidents on social media, in this paper, we look at the problem from the eye of prevention rather than detection. A system that can recognize trolling prone images can issue a warning to users before the content is posted online and prevent potential trolling incidents. We attempt to make a supervised classifier to detect trolling prone images and discuss why the conventional state-of-the-art image classification method does not work well for this task. We also provide an extensive analysis of trolling patterns in images from Instagram, discuss challenges and possible future paths in detail.
@inproceedings{hitant2019detectingtrollingprone,title={Detecting Trolling Prone Images on Instagram},author={{Hitant Kul} and Shah, R. and Kumaraguru, P. and Satoh, S.},year={2019},booktitle={The Fifth IEEE International Conference on Multimedia Big Data},}
ICDEW
Characterizing the Twitter Verified User Network. Elites Tweet?
I.
Paul, A. Khattar, P. Kumaraguru, M. Gupta, and S. Chopra
In ICDE Workshop on Large Scale Graph Data Analytics, 2019
Social network and publishing platforms, such as Twitter, support the concept of verification. Verified accounts are deemed worthy of platform-wide public interest and are separately authenticated by the platform itself. There have been repeated assertions by these platforms about verification not being tantamount to endorsement. However, a significant body of prior work suggests that possessing a verified status symbolizes enhanced credibility in the eyes of the platform audience. As a result, such a status is highly coveted among public figures and influencers. Hence, we attempt to characterize the network of verified users on Twitter and compare the results to similar analysis performed for the entire Twitter network. We extracted the entire network of verified users on Twitter (as of July 2018) and obtained 231,246 English user profiles and 79,213,811 connections. Subsequently, in the network analysis, we found that the sub-graph of verified users mirrors the full Twitter users graph in some aspects such as possessing a short diameter. However, our findings contrast with earlier findings on multiple aspects, such as the possession of a power law out-degree distribution, slight disassortativity, and a significantly higher reciprocity rate, as elucidated in the paper. Moreover, we attempt to gauge the presence of salient components within this sub-graph and detect the absence of homophily with respect to popularity, which again is in stark contrast to the full Twitter graph. Finally, we demonstrate stationarity in the time series of verified user activity levels. To the best of our knowledge, this work represents the first quantitative attempt at characterizing verified users on Twitter.
@inproceedings{paul2019characterizingthetwitter,title={Characterizing the Twitter Verified User Network. Elites Tweet?},author={Paul, I. and Khattar, A. and Kumaraguru, P. and Gupta, M. and Chopra, S.},year={2019},booktitle={ICDE Workshop on Large Scale Graph Data Analytics},}
ASONAM
Characterizing and Detecting Livestreaming Chatbots
S. Jain, D. Niranjan, H. Lamba, and P. Kumaraguru
In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019
Livestreaming platforms enable content producers, or streamers, to broadcast creative content to a potentially large viewer base. Chatrooms form an integral part of such platforms, enabling viewers to interact both with the streamer, and amongst themselves. Streams with high engagement (many viewers and active chatters) are typically considered engaging, and often promoted to end users by means of recommendation algorithms, and exposed to better monetization opportunities via revenue share from platform advertising, viewer donations, and third-party sponsorships. Given such incentives, some streamers make use of fraudulent means to increase perceived engagement by simulating chatter via fake "chatbots" which can be purchased from shady online marketplaces. This inauthentic engagement can negatively influence recommendation, hurt streamer and viewer trust in the platform, and harm monetization for honest streamers. In this paper, we tackle the novel problem of automating detection of chatbots on livestreaming platforms. To this end, we first formalize the livestreaming chatbot detection problem and characterize differences between botted and genuine chatter behavior observed from a real-world livestreaming chatter dataset collected from Twitch.tv. We then propose SHERLOCK, which posits a two-stage approach of detecting chatbotted streams, and subsequently detecting the constituent chatbots. Finally, we demonstrate effectiveness on both real and synthetic data: to this end, we propose a novel strategy for collecting labeled, synthetic chatter dataset (typically unavailable) from such platforms, enabling evaluation of proposed detection approaches against chatbot behaviors with varying signatures. Our approach achieves .97 precision/recall on the real-world dataset, and .80+ F1 scores across most simulated attack settings.
@inproceedings{jain2019characterizinganddetecting,title={Characterizing and Detecting Livestreaming Chatbots},author={Jain, S. and Niranjan, D. and Lamba, H. and Kumaraguru, P.},year={2019},booktitle={Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining},}
WebSci
Angel or Demon? Characterizing Variations Across Twitter Timeline of Technical Support Campaigners
S. Gupta, G. Bhatia, S. Suri, D. Kuchhal, P. Gupta, M. Ahamad, M. Gupta, and P. Kumaraguru
Technical Support spam, which abuses Web 2.0 and carries out social engineering attacks, has been in existence for a very long time, despite several measures taken to thwart such attacks. Although recent research has looked into unveiling tactics employed by spammers to lure victims, damage done on Online Social Networks is largely unexplored. In this paper, we perform the first large-scale study to understand the behavior of technical support spammers, and compare them with the legitimate technical support offered to OSN users by several brands such as Microsoft, Facebook, Amazon. We analyze the spam and legitimate accounts over a period of 20 months, and provide a taxonomy of the different types of spammers that are active in the Tech Support spam landscape. We develop an automated mechanism to classify spammers from legitimate accounts, achieving precision and recall of 99.8%. Our results shed light on the threats associated with billions of users using OSNs from Tech Support spam, and can help researchers and OSN service providers in developing effective countermeasures to fight them.
@inproceedings{gupta2019angelor,title={Angel or Demon? Characterizing Variations Across Twitter Timeline of Technical Support Campaigners},author={Gupta, S. and Bhatia, G. and Suri, S. and Kuchhal, D. and Gupta, P. and Ahamad, M. and Gupta, M. and Kumaraguru, P.},year={2019},booktitle={Journal of Web Science},}
ACM MM
Towards Increased Accessibility of Meme Images with the help of Rich Face Emotion Captions
In recent years, there has been an explosion in the number of memes being created and circulated in online social networks. Despite their rapidly increasing impact on how we communicate online, meme images are virtually inaccessible to the visually impaired users. Existing automated assistive systems that were primarily devised for natural photos in social media, overlook the specific fine-grained visual details in meme images. In this paper, we concentrate on describing one such prominent visual detail: the meme face emotion. We propose a novel automated method that enables visually impaired social media users to understand and appreciate meme face emotions with the help of rich textual captions. We first collect a challenging dataset of meme face emotion captions to support future research in face emotion understanding. We design a two-stage approach that significantly outperforms baseline approaches across all the standard captioning metrics and also generates richer discriminative captions. By validating our solution with the help of visually impaired social media users, we show that our emotion captions enable them to understand and appreciate one of the most popular classes of meme images encountered on the Internet for the first time. Code, data, and models are publicly available.
@inproceedings{prajwal2019towardsincreasedaccessibility,title={Towards Increased Accessibility of Meme Images with the help of Rich Face Emotion Captions},author={Prajwal, K. R. and Jawahar, C. V. and Kumaraguru, P.},year={2019},booktitle={ACM Multimedia 2019},}
DLSA
Transfer learning for detecting hateful sentiments in code-switched language
K. Rajput, R. Kapoor, P. Mathur, Hitkul, P. Kumaraguru, and R. Shah
In Deep learning based approaches for sentiment analysis, Springer, 2019
With the phenomenal increase in the penetration of social media in linguistically diverse demographic regions, conversations have become more casual and multilingual. The rise of informal code-switched multilingual languages makes it tough for automated systems to monitor instances of hate speech, which are further intelligently disguised through the use of spelling variations, code-mixing, homophones, homonyms, and the absence of sophisticated grammar rules. Machine transliteration can be employed for converting the code-switched text into a singular script but poses the challenge of the semantical breakdown of the text. To overcome this drawback, this chapter investigates the application of transfer learning. The CNN-based neural models are trained on a large dataset of hateful tweets in a chosen primary language, followed by retraining on the small transliterated dataset in the same language. Since transfer learning can act as an effective strategy to reuse already learned features in learning a specialized task through cross-domain knowledge transfer, hate speech classification on a large English corpus can act as source tasks to help in obtaining pre-trained deep learning classifiers for the target task of classifying tweets translated in English from other code-switched languages. Effects of the different types of popular word embeddings and multiple supervised inputs such as the LIWC, the presence of profanities, and sentiment are carefully studied to derive the most representative combination of input settings that can help achieve state-of-the-art hate speech detection from code-switched multilingual short texts on Twitter.
@inproceedings{rajput2019transferlearningfor,title={Transfer learning for detecting hateful sentiments in code-switched language},author={Rajput, K. and Kapoor, R. and Mathur, P. and {Hitkul} and Kumaraguru, P. and Shah, R.},year={2019},booktitle={Deep learning based approaches for sentiment analysis, Springer},}
ICIP
Attentional Road Safety Networks
S. Gupta, D. Srivatsav, A. V. Subramanyam, and P. Kumaraguru
Accepted at the 26th IEEE International Conference on Image Processing (ICIP), 2019
Road safety mapping using satellite images is a cost-effective but a challenging problem for smart city planning. The scarcity of labeled data, misalignment and ambiguity makes it hard to learn efficient embeddings in order to classify between safe and dangerous road segments. In this paper, we address the challenges using a region guided attention network. In our model, we extract global features from a base network and augment it with local features obtained using the region guided attention network. In addition, we perform domain adaptation for unlabeled target data. In order to bridge the gap between safe samples and dangerous samples from source and target respectively, we propose a loss function based on within and between class covariance matrices. We conduct experiments on a public dataset of London to show that the algorithm achieves significant results with the classification accuracy of 86.21%. We obtain an increase of 4% accuracy for NYC using domain adaptation network.
@inproceedings{gupta2019attentionalroadsafety,title={Attentional Road Safety Networks},author={Gupta, S. and Srivatsav, D. and Subramanyam, A. V. and Kumaraguru, P.},year={2019},booktitle={Accepted at 26th IEEE International Conference on Image Processing (ICIP)},}
WWW
Signals Matter: Understanding Popularity and Impact of Users on Stack Overflow
A. Merchant, D. Shah, G. Bhatia, A. Ghosh, and P. Kumaraguru
Stack Overflow, a Q&A site on programming, awards reputation points and badges (game elements) to users on performing various actions. Situating our work in Digital Signaling Theory, we investigate the role of these game elements in characterizing social qualities (specifically, popularity and impact) of its users. We operationalize these attributes using common metrics and apply statistical modeling to empirically quantify and validate the strength of these signals. Our results are based on a rich dataset of 3,831,147 users and their activities spanning nearly a decade since the site’s inception in 2008. We present evidence that certain non-trivial badges, reputation scores and age of the user on the site positively correlate with popularity and impact. Further, we find that the presence of costly to earn and hard to observe signals qualitatively differentiates highly impactful users from highly popular users.
@inproceedings{merchant2019signalsunderstanding,title={Signals Matter: Understanding Popularity and Impact of Users on Stack Overflow},author={Merchant, A. and Shah, D. and Bhatia, G. and Ghosh, A. and Kumaraguru, P.},year={2019},booktitle={WWW '19: The World Wide Web Conference},}
AW4CITY
Travel time estimation accuracy in developing regions: An empirical case study with Uber data in Delhi-NCR
D. Shah, A. Kumaran, R. Sen, and P. Kumaraguru
In AW4CITY 2019: 5th International Smart City Workshop, 2019
Travel time estimates are highly useful in planning urban mobility events. This paper investigates the quality of travel time estimates in the Indian capital city of Delhi and the National Capital Region (NCR). Using Uber mobile and web applications, we collect data about 610 trips from 34 Uber users. We empirically show the unpredictability of travel time estimates for Uber cabs. We also discuss the adverse effects of such unpredictability on passengers waiting for the cabs, leading to a whopping 28.4% of the requested trips being cancelled. Our empirical observations differ significantly from the high accuracies reported in travel time estimation literature. These pessimistic results will hopefully trigger useful investigations in future on why the travel time estimates are mismatching the high accuracy levels reported in literature - (a) is it a lack of training data issue for developing countries or (b) an algorithmic shortcoming that cannot capture the (lack of) historical patterns in developing region travel times or (c) a conscious policy decision by Uber platform or Uber drivers, to mismatch the correctly predicted travel time estimates and increase cab cancellation fees? In the context of smartphone apps extensively generating and utilizing travel time information for urban commute, this paper identifies and discusses the important problem of travel time estimation inaccuracies in developing countries.
@inproceedings{shah2019traveltimeestimation,title={Travel time estimation accuracy in developing regions: An empirical case study with Uber data in Delhi-NCR},author={Shah, D. and Kumaran, A. and Sen, R. and Kumaraguru, P.},year={2019},booktitle={AW4CITY 2019: 5th International Smart City Workshop},}
WebSci
What Sets Verified Users Apart? Insights, Analysis and Prediction of Verified Users on Twitter
I. Paul, A. Khattar, S. Chopra, P. Kumaraguru, and M. Gupta
Social network and publishing platforms, such as Twitter, support the concept of a secret proprietary verification process, for handles they deem worthy of platform-wide public interest. In line with significant prior work which suggests that possessing such a status symbolizes enhanced credibility in the eyes of the platform audience, a verified badge is clearly coveted among public figures and brands. What are less obvious are the inner workings of the verification process and what being verified represents. This lack of clarity, coupled with the flak that Twitter received by extending aforementioned status to political extremists in 2017, backed Twitter into publicly admitting that the process and what the status represented needed to be rethought. With this in mind, we seek to unravel the aspects of a user’s profile which likely engender or preclude verification. The aim of the paper is two-fold: First, we test if discerning the verification status of a handle from profile metadata and content features is feasible. Second, we unravel the features which have the greatest bearing on a handle’s verification status. We collected a dataset consisting of profile metadata of all 231,235 verified English-speaking users (as of July 2018), a control sample of 175,930 non-verified English-speaking users and all their 494 million tweets over a one year collection period. Our proposed models are able to reliably identify verification status (Area under curve AUC > 99%). We show that number of public list memberships, presence of neutral sentiment in tweets and an authoritative language style are the most pertinent predictors of verification status. To the best of our knowledge, this work represents the first attempt at discerning and classifying verification worthy users on Twitter.
@inproceedings{paul2019whatsetsverified,title={What Sets Verified Users Apart? Insights, Analysis and Prediction of Verified Users on Twitter},author={Paul, I. and Khattar, A. and Chopra, S. and Kumaraguru, P. and Gupta, M.},year={2019},booktitle={11th ACM Conference on Web Science},}
CHI4Evil
Evils of Social Media: Case Study of the Blue Whale Challenge
S. Chopra, A. Khattar, K. Dabas, K. Gupta, and P. Kumaraguru
In CHI4EVIL Workshop at CHI 2019, 2019
The Blue Whale Challenge is a deadly challenge propagating on online social media and has claimed multiple lives across the globe [5]. This challenge requires the person to indulge in a series of self-mutilating tasks for a duration of 50 days and ultimately commit suicide. The so-called “administrators” or “curators” of the challenge contact users - who express their willingness to take part on social networking websites - via direct messages. We conducted a study to understand the spread of the challenge on social media websites such as VKontakte, Twitter, and Instagram, identify different types of users involved in the challenge, study their demographics, and identify distinguishing features between the users involved in the challenge and those who are not. Through this position paper, we throw some light upon dangerous social media challenges such as the Blue Whale Challenge which lure, engage, and victimize a spectrum of people. We express our interest in studying the harmful effects of technology and social media and elucidate our positionality with respect to the same.
@inproceedings{chopra2019evilsofsocial,title={Evils of Social Media: Case Study of the Blue Whale Challenge},author={Chopra, S. and Khattar, A. and Dabas, K. and Gupta, K. and Kumaraguru, P.},year={2019},booktitle={Accepted at CHI4EVIL workshop at CHI 2019},}
IJCNN
Hardening Deep Neural Networks via Adversarial Model Cascades
A. Suri, D. Vijaykeerthy, S. Mehta, and P. Kumaraguru
In International Joint Conference on Neural Networks (IJCNN) 2019, 2019
Deep neural networks (DNNs) are vulnerable to malicious inputs crafted by an adversary to produce erroneous outputs. Works on securing neural networks against adversarial examples achieve high empirical robustness on simple datasets such as MNIST. However, these techniques are inadequate when empirically tested on complex datasets such as CIFAR-10 and SVHN. Further, existing techniques are designed to target specific attacks and fail to generalize across attacks. We propose Adversarial Model Cascades (AMC) as a way to tackle the above inadequacies. Our approach trains a cascade of models sequentially where each model is optimized to be robust towards a mixture of multiple attacks. Ultimately, it yields a single model which is secure against a wide range of attacks, namely FGSM, Elastic, Virtual Adversarial Perturbations and Madry. On average, AMC increases the model’s empirical robustness against various attacks simultaneously by a significant margin (6.225% for MNIST, 5.075% for SVHN and 2.65% for CIFAR-10). At the same time, the model’s performance on non-adversarial inputs is comparable to the state-of-the-art models.
@inproceedings{suri2019hardeningdeepneural,title={Hardening Deep Neural Networks via Adversarial Model Cascades},author={Suri, A. and Vijaykeerthy, D. and Mehta, S. and Kumaraguru, P.},year={2019},booktitle={International Joint Conference on Neural Networks (IJCNN) 2019},}
SIGAPP
KidsGuard: A fine-Grained approach for child UnsAfe video Representation and Detection
S. Singh, R. Kaushal, A. Buduru, and P. Kumaraguru
In 34th ACM/SIGAPP Symposium On Applied Computing 2019, 2019
More and more videos are being uploaded on video sharing platforms, and a significant number of viewers on these platforms are children. At times, these videos have violent or sexually explicit scenes (referred to as child unsafe) to catch children’s attention. To evade moderation, malicious video uploaders typically limit the child unsafe content to only a few frames in the video. Hence, a fine-grained approach, referred to as KidsGUARD, to detect sparsely present child unsafe content is required. Prior approaches to content moderation either flag the entire video as inappropriate or use hand-crafted features derived from video frames. In this work, we leverage a Long Short Term Memory (LSTM) based autoencoder to learn effective video representations of video descriptors obtained using a VGG16 Convolutional Neural Network (CNN). Encoded video representations are fed into an LSTM classifier for detection of sparse child unsafe video content. To evaluate this approach, we create a dataset of 109,835 video clips curated specifically for child unsafe content. We find that our deep learning approach (1) detects fine-grained child unsafe video content with a granularity of 1 second, (2) identifies even sparsely located child unsafe video content by achieving a high recall of 81% at a high precision of 80%, and (3) outperforms baseline video encoding approaches such as Fisher Vector (FV) and Vector of Locally Aggregated Descriptors (VLAD).
@inproceedings{singh2019a,title={KidsGuard: A fine-Grained approach for child UnsAfe video Representation and Detection},author={Singh, S. and Kaushal, R. and Buduru, A. and Kumaraguru, P.},year={2019},booktitle={34th ACM/SIGAPP Symposium On Applied Computing 2019},}
CoDS-COMAD
MalReG: Detecting and Analyzing Malicious Retweeter Groups
S. Gupta, P. Kumaraguru, and T. Chakraborty
In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, ACM India Joint (CoDS-COMAD 2019), 2019
Given a retweeter network in Twitter for any event, how can we detect the group of users that collude to retweet together maliciously? A large number of retweets of a post often indicates the virality of the post. It also helps increase the visibility and volume of hashtags, topics or URLs, to promote the event associated with it. Our primary hunch is that there is a synchronization or indicative pattern in the behavior of such users. In this paper, we propose (i) MalReG, a novel algorithm to detect retweeter groups, and (ii) a set of 23 group-based features (entropy-based and temporal-based) to train a supervised model to identify malicious retweeter groups (MRG). We present experiments on three real-world datasets with more than 10 million retweets crawled from Twitter. MalReG identifies 1,017 retweeter groups present in our dataset. We train a supervised learning model to detect MRG which achieves 0.921 ROC AUC using Random Forest, outperforming the baseline by 7.97% in AUC. Additionally, we perform geographical location-based and temporal analysis of these groups. Interestingly, we find the presence of the same group retweeting different political events that took place in different continents at different times. We also discover masquerading techniques used by MRG to evade detection.
@inproceedings{gupta2019detectingand,title={MalReG: Detecting and Analyzing Malicious Retweeter Groups},author={Gupta, S. and Kumaraguru, P. and Chakraborty, T.},year={2019},booktitle={Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, ACM India Joint (CoDS-COMAD 2019)},}
JBI
A Distant Supervision Based Approach to Medical Persona Classification
N. Pattisapu, M. Gupta, P. Kumaraguru, and V. Varma
Identifying medical persona from a social media post is critical for drug marketing, pharmacovigilance and patient recruitment. Medical persona classification aims to computationally model the medical persona associated with a social media post. We present a novel deep learning model for this task which consists of two parts: Convolutional Neural Networks (CNNs), which extract highly relevant features from the sentences of a social media post and average pooling, which aggregates the sentence embeddings to obtain task-specific document embedding. We compare our approach against standard baselines, such as Term Frequency - Inverse Document Frequency (TF-IDF), averaged word embedding based methods and popular neural architectures, such as CNN-Long Short Term Memory (CNN-LSTM) and Hierarchical Attention Networks (HANs). Our model achieves an improvement of 19.7% for classification accuracy and 20.1% for micro F1 measure over the current state-of-the-art. We eliminate the need for manual labeling by employing a distant supervision based method to obtain labeled examples for training the models. We thoroughly analyze our model to discover cues that are indicative of a particular persona. Particularly, we use first derivative saliency to identify the salient words in a particular social media post.
@article{pattisapu2019adistantsupervision,title={A Distant Supervision Based Approach to Medical Persona Classification},author={Pattisapu, N. and Gupta, M. and Kumaraguru, P. and Varma, V.},year={2019},journal={Journal of Biomedical Informatics},}
AAAI
Get IT Scored using AutoSAS - An Automated System for Scoring Short Answers
Y. Kumar, S. Aggarwal, D. Mahata, R. Shah, P. Kumaraguru, and R. Zimmermann
In Proceedings of the AAAI Conference on Artificial Intelligence, 2019
In the era of MOOCs, online exams are taken by millions of candidates, where scoring short answers is an integral part. It becomes intractable for human graders to evaluate them. Thus, a generic automated system capable of grading these responses should be designed and deployed. In this paper, we present a fast, scalable, and accurate approach towards automated Short Answer Scoring (SAS). We propose and explain the design and development of a system for SAS, namely AutoSAS. Given a question along with its graded samples, AutoSAS can learn to grade that prompt successfully. This paper further lays down the features such as lexical diversity, Word2Vec, prompt, and content overlap that play a pivotal role in building our proposed model. We also present a methodology for indicating the factors responsible for scoring an answer. The trained model is evaluated on an extensively used public dataset, namely Automated Student Assessment Prize Short Answer Scoring (ASAP-SAS). AutoSAS shows state-of-the-art performance and achieves better results by over 8% in some of the question prompts as measured by Quadratic Weighted Kappa (QWK), showing performance comparable to humans.
@inproceedings{kumar2019getitscored,title={Get IT Scored using AutoSAS - An Automated System for Scoring Short Answers},author={Kumar, Y. and Aggarwal, S. and Mahata, D. and Shah, R. and Kumaraguru, P. and Zimmermann, R.},year={2019},booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},}
AAAI
Mind Your Language: Abuse and Offense Detection for Code-Switched Languages
R. Kapoor, Y. Kumar, K. Rajput, R. Shah, P. Kumaraguru, and R. Zimmermann
In 32nd AAAI Conference on Artificial Intelligence, 2019
In multilingual societies like the Indian subcontinent, use of code-switched languages is popular and convenient for the users. In this paper, we study offense and abuse detection in the code-switched pair of Hindi and English (i.e. Hinglish), the pair that is the most spoken. The task is made difficult due to the non-fixed grammar, vocabulary, semantics and spellings of the Hinglish language. We apply transfer learning and build an LSTM based model for hate speech classification. This model surpasses the performance shown by the current best models to establish itself as the state-of-the-art in the unexplored domain of Hinglish offensive text classification. We also release our model and the embeddings trained for research purposes.
@inproceedings{kapoor2019mindyour,title={Mind Your Language: Abuse and Offense Detection for Code-Switched Languages},author={Kapoor, R. and Kumar, Y. and Rajput, K. and Shah, R. and Kumaraguru, P. and Zimmermann, R.},year={2019},booktitle={32nd AAAI Conference on Artificial Intelligence},}
2018
BDA
CbI: Improving Credibility of User-Generated Content on Facebook
S. Gupta, S. Sachdeva, P. Dewan, and P. Kumaraguru
In Sixth International Conference on Big Data Analytics, 2018
Online Social Networks (OSNs) have become a popular platform to share information with each other. Fake news often spreads rapidly in OSNs, especially during news-making events, e.g., the Earthquake in Chile (2010) and Hurricane Sandy in the USA (2012). A potential solution is to use machine learning techniques to assess the credibility of a post automatically, i.e., whether a person would consider the post believable or trustworthy. In this paper, we provide a fine-grained definition of credibility. We call a post credible if it is accurate, clear, and timely. Hence, we propose a system which calculates the Accuracy, Clarity, and Timeliness (A-C-T) of a Facebook post, which in turn are used to rank the post for its credibility. We experiment with 1,056 posts created by 107 pages that claim to belong to the news category. We use a set of 152 features to train classification models each for A-C-T using supervised algorithms. We use the best-performing features and models to develop a RESTful API and a Chrome browser extension to rank posts for their credibility in real-time. The random forest algorithm performed the best and achieved ROC AUC of 0.916, 0.875, and 0.851 for A-C-T respectively.
@inproceedings{gupta2018improvingcredibility,title={CbI: Improving Credibility of User-Generated Content on Facebook},author={Gupta, S. and Sachdeva, S. and Dewan, P. and Kumaraguru, P.},year={2018},booktitle={Sixth International Conference on Big Data Analytics},}
ASONAM
Followee Management: Helping users follow the right users on Online Social Media
A. Verma, A. Wadhwa, N. Singh, S. Beniwal, R. Kaushal, and P. Kumaraguru
In ASONAM ’18: Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2018
User timelines in Online Social Media (OSM) remain filled with a significant amount of information received from followees. Given that content posted by a followee is not under the user’s control, this information may not always be relevant. If there is a large presence of less relevant content, then a user may end up overlooking relevant content, which is undesirable. To address this issue, in the first part of our work, we propose suitable metrics to characterize the user-followee relationship. We find that most users choose their followees primarily due to the content that they post (content-conscious behavior, measured by content similarity scores). For a small number of followees, a high degree of social engagement (likes and shares) irrespective of the content posted by them is observed (user-conscious behavior, measured by user affinity scores). We evaluate our proposed approach on 26,516 followees across 100 random users on Twitter who have cumulatively posted 234,403 tweets. We find that on average for 60% of their followees, users exhibit a very low degree of content similarity and social engagement. These findings motivate the second part of our work, where we develop a Followee Management Nudge (FMN) through a browser extension (plugin) that helps users remain more informed about their relationship with each of their followees. In particular, the FMN nudges a user with a list of followees with whom they have least (or never) engaged in the past and who also exhibit very low similarity in terms of content, thereby helping a user to make an informed decision (say by unfollowing some of these followees). Results from a preliminary controlled lab study show that 62.5% of participants find the nudge to be quite useful.
@inproceedings{verma2018followeehelping,title={Followee Management: Helping users follow the right users on Online Social Media},author={Verma, A. and Wadhwa, A. and Singh, N. and Beniwal, S. and Kaushal, R. and Kumaraguru, P.},year={2018},booktitle={ASONAM '18: Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining},}
CMT
Neural Machine Translation for English-Tamil
H. Choudhary, A. Pathak, R. Shah, and P. Kumaraguru
In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 2018
A huge amount of valuable resources is available on the web in English, which are often translated into local languages to facilitate knowledge sharing among local people who are not very familiar with English. However, translating such content manually is a very tedious, costly, and time-consuming process. To this end, machine translation is an efficient approach to translate text without any human involvement. Neural machine translation (NMT) is one of the most recent and effective translation techniques amongst all existing machine translation systems. In this paper, we apply NMT to the English-Tamil language pair. We propose a novel neural machine translation technique using word-embedding along with Byte-Pair-Encoding (BPE) to develop an efficient translation system that overcomes the OOV (Out Of Vocabulary) problem for languages which do not have many translations available online. We use the BLEU score for evaluating the system performance. Experimental results confirm that our proposed MIDAS translator (8.33 BLEU score) outperforms Google translator (3.75 BLEU score).
@inproceedings{choudhary2018neuralmachinetranslation,title={Neural Machine Translation for English-Tamil},author={Choudhary, H. and Pathak, A. and Shah, R. and Kumaraguru, P.},year={2018},booktitle={Proceedings of the Third Conference on Machine Translation: Shared Task Papers},}
ICCC
Empowering First Responders through Automated Multimodal Content Moderation
D. Gupta, I. Sen, N. Sachdeva, P. Kumaraguru, and A. Buduru
In The Second IEEE International Congress on Cognitive Computing, 2018
Social media enables users to spread information and opinions, including in times of crisis events such as riots, protests or uprisings. Sensitive event-related content can lead to repercussions in the real world. Therefore it is crucial for first responders, such as law enforcement agencies, to have ready access to such content and the ability to monitor its propagation. Obstacles to easy access include a lack of automatic moderation tools targeted at first responders. Efforts are further complicated by the multimodal nature of content, which may have both textual and pictorial aspects. In this work, as a means of providing intelligence to first responders, we investigate automatic moderation of sensitive event-related content across the two modalities by exploiting recent advances in Deep Neural Networks (DNN). We use a combination of image classification with Convolutional Neural Networks (CNN) and text classification with Recurrent Neural Networks (RNN). Our multilevel content classifier is obtained by fusing the image classifier and the text classifier. We utilize feature engineering for preprocessing but bypass it during classification due to our use of DNNs while achieving coverage by leveraging community guidelines. Our approach maintains a low false positive rate and high precision by learning first from a weakly labeled dataset and then from an expert annotated dataset. We evaluate our system both quantitatively and qualitatively to gain a deeper understanding of its functioning. Finally, we benchmark our technique against current approaches to combating sensitive content and find that our system outperforms them by 16% in accuracy.
@inproceedings{gupta2018empoweringfirstresponders,title={Empowering First Responders through Automated Multimodal Content Moderation},author={Gupta, D. and Sen, I. and Sachdeva, N. and Kumaraguru, P. and Buduru, A.},year={2018},booktitle={The Second IEEE International Congress on Cognitive Computing},}
ACL
Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets
While growing code-mixed content on Online Social Networks (OSN) provides a fertile ground for studying various aspects of code-mixing, the lack of automated text analysis tools renders such studies challenging. To meet this challenge, a family of tools for analyzing code-mixed data, such as language identifiers, parts-of-speech (POS) taggers, and chunkers, has been developed. Named Entity Recognition (NER) is an important text analysis task which is not only informative by itself, but is also needed for downstream NLP tasks such as semantic role labeling. In this work, we present an exploration of automatic NER of code-mixed data. We compare our method with existing off-the-shelf NER tools for social media content, and find that our system outperforms the best baseline by 33.18% (F1 score).
@inproceedings{singh2018languageidentificationand,title={Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets},author={Singh, K. and Sen, I. and Kumaraguru, P.},year={2018},booktitle={ACL Student Research Workshop 2018},}
SocialNLP
A Twitter Corpus for Hindi English Code Mixed Dataset for POS Tagging
K. Singh, I. Sen, and P. Kumaraguru
In Sixth International Workshop on Natural Language Processing for Social Media (SocialNLP 2018), 2018
Code-mixing is a linguistic phenomenon where multiple languages are used in the same occurrence that is increasingly common in multilingual societies. Code-mixed content on social media is also on the rise, prompting the need for tools to automatically understand such content. Automatic Parts-of-Speech (POS) tagging is an essential step in any Natural Language Processing (NLP) pipeline, but there is a lack of annotated data to train such models. In this work, we present a unique language tagged and POS-tagged dataset of code-mixed English-Hindi tweets related to five incidents in India that led to a lot of Twitter activity. Our dataset is unique in two dimensions: (i) it is larger than previous annotated datasets and (ii) it closely resembles typical real-world tweets. Additionally, we present a POS tagging model that is trained on this dataset to provide an example of how this dataset can be used. The model also shows the efficacy of our dataset in enabling the creation of code-mixed social media POS taggers.
@inproceedings{singh2018atwittercorpus,title={A Twitter Corpus for Hindi English Code Mixed Dataset for POS Tagging},author={Singh, K. and Sen, I. and Kumaraguru, P.},year={2018},booktitle={Sixth International Workshop on Natural Language Processing for Social Media (SocialNLP 2018)},}
WebSci
Worth its Weight in Likes: Towards Detecting Fake Likes on Instagram
I. Sen, A. Aggarwal, S. Mian, S. Singh, P. Kumaraguru, and A. Datta
Instagram is a significant platform for users to share media reflecting their interests. It is used by marketers and brands to reach their potential audience for advertisement. The number of likes on posts serves as a proxy for the social reputation of the users, and in some cases, social media influencers with an extensive reach are compensated by marketers to promote products. This emerging market has led to users artificially bolstering the likes they get to project an inflated social worth. In this study, we enumerate the potential factors which contribute towards a genuine like on Instagram. Based on our analysis of liking behaviour, we build an automated mechanism to detect fake likes on Instagram which achieves a high precision of 83.5%. Our work serves as an important first step in reducing the effect of fake likes on the Instagram influencer market.
@inproceedings{sen2018worthitsweight,title={Worth its Weight in Likes: Towards Detecting Fake Likes on Instagram},author={Sen, I. and Aggarwal, A. and Mian, S. and Singh, S. and Kumaraguru, P. and Datta, A.},year={2018},booktitle={10th ACM Conference on Web Science},}
WebSci
Under the Shadow of Sunshine: Characterizing Spam Campaigns Abusing Phone Numbers Across Online Social Networks
S. Gupta, D. Kuchhal, P. Gupta, M. Ahamad, M. Gupta, and P. Kumaraguru
Cybercriminals abuse Online Social Networks (OSNs) to lure victims into a variety of spam. Among different spam types, a less explored area is OSN abuse that leverages the telephony channel to defraud users. Phone numbers are advertised via OSNs, and users are tricked into calling these numbers. To expand the reach of such scam / spam campaigns, phone numbers are advertised across multiple platforms like Facebook, Twitter, GooglePlus, Flickr, and YouTube. In this paper, we present the first data-driven characterization of cross-platform campaigns that use multiple OSN platforms to reach their victims and use phone numbers for monetization. We collect ~23M posts containing ~1.8M unique phone numbers from Twitter, Facebook, GooglePlus, YouTube, and Flickr over a period of six months. Clustering these posts helps us identify 202 campaigns operating across the globe with Indonesia, United States, India, and United Arab Emirates being the most prominent originators. We find that even though Indonesian campaigns generate the highest volume (~3.2M posts), only 1.6% of the accounts propagating Indonesian campaigns have been suspended so far. By examining campaigns running across multiple OSNs, we discover that Twitter detects and suspends ~93% more accounts than Facebook. Therefore, sharing intelligence about abuse-related user accounts across OSNs can aid in spam detection. According to our dataset, around ~35K victims and ~$8.8M could have been saved if intelligence was shared across the OSNs. By analyzing phone number based spam campaigns running on OSNs, we highlight the unexplored variety of phone-based attacks surfacing on OSNs.
@inproceedings{gupta2018undertheshadow,title={Under the Shadow of Sunshine: Characterizing Spam Campaigns Abusing Phone Numbers Across Online Social Networks},author={Gupta, S. and Kuchhal, D. and Gupta, P. and Ahamad, M. and Gupta, M. and Kumaraguru, P.},year={2018},booktitle={10th ACM Conference on Web Science},}
MSM
Stop the Killfies! Using Deep Learning to Identify Dangerous Selfies
V. Nanda, H. Lamba, D. Agarwal, M. Arora, N. Sachdeva, and P. Kumaraguru
In 9th International Workshop on Modeling Social Media (MSM’2018), 2018
Selfies have become a prominent medium for self-portrayal on social media. Unfortunately, certain social media users go to extreme lengths to click selfies, which puts their lives at risk. Two hundred and sixteen individuals died between March 2014 and January 2018 while trying to click selfies. It is imperative to identify dangerous selfies posted on social media platforms in order to build an intervention for users going to extreme lengths to click such selfies. In this work, we propose a convolutional neural network based classifier to identify dangerous selfies posted on social media using only the image (no metadata). We show that our proposed approach gives an accuracy of 98% and performs better than previous methods.
@inproceedings{nanda2018stopthe,title={Stop the Killfies! Using Deep Learning to Identify Dangerous Selfies},author={Nanda, V. and Lamba, H. and Agarwal, D. and Arora, M. and Sachdeva, N. and Kumaraguru, P.},year={2018},booktitle={9th International Workshop on Modeling Social Media (MSM'2018)},}
EPJ
Collective Aspects of Privacy in the Twitter Social Network
M. Goel, A. Agrawal, D. Garcia, and P. Kumaraguru
In The European Physical Journal Data Science 2018, 2018
Preserving individual control over private information is one of the rising concerns in our digital society. Online social networks exist in application ecosystems that allow them to access data from other services, for example gathering contact lists through mobile phone applications. Such data access might allow social networking sites to create shadow profiles with information about non-users that has been inferred from information shared by the users of the social network. This possibility motivates the shadow profile hypothesis: the data shared by the users of an online service predicts personal information of non-users of the service. We test this hypothesis for the first time on Twitter, constructing a dataset of users that includes profile biographical text, location information, and bidirectional friendship links. We evaluate the predictability of the location of a user by using only information given by friends of the user that joined Twitter before the user did. This way, we audit the historical prediction power of Twitter data for users that had not joined Twitter yet. Our results indicate that information shared by users in Twitter can be predictive of the location of individuals outside Twitter. Furthermore, we observe that the quality of this prediction increases with the tendency of Twitter users to share their mobile phone contacts and is more accurate for individuals with more contacts inside Twitter. We further explore the predictability of biographical information of non-users, finding evidence in line with our results for locations. These findings illustrate that individuals are not in full control of their online privacy and that sharing personal data with a social networking site is a decision that is collectively mediated by the decisions of others.
@article{goel2018collectiveaspectsof,title={Collective Aspects of Privacy in the Twitter Social Network},author={Goel, M. and Agrawal, A. and Garcia, D. and Kumaraguru, P.},year={2018},journal={The European Physical Journal Data Science},}
WWW
Collective Classification of Spam Campaigners on Twitter: A Hierarchical Meta-Path Based Approach
S. Gupta, A. Khattar, A. Gogia, P. Kumaraguru, and T. Chakraborty
In The Web Conf 2018 (Formerly WWW Conference), 2018
Cybercriminals have leveraged the popularity of a large user base available on Online Social Networks to spread spam campaigns by propagating phishing URLs, attaching malicious contents, etc. However, another kind of spam attack using phone numbers has recently become prevalent on OSNs, where spammers advertise phone numbers to attract users’ attention and convince them to make a call to these phone numbers. The dynamics of phone number based spam are different from those of URL-based spam due to an inherent trust associated with a phone number. While previous work has proposed strategies to mitigate URL-based spam attacks, phone number based spam attacks have received less attention. In this paper, we aim to detect spammers that use phone numbers to promote campaigns on Twitter. To this end, we collected information about 3,370 campaigns spread by 670,251 users. We model the Twitter dataset as a heterogeneous network by leveraging various interconnections between different types of nodes present in the dataset. In particular, we make the following contributions: (i) We propose a simple yet effective metric, called Hierarchical Meta-Path Score (HMPS), to measure the proximity of an unknown user to the other known pool of spammers. (ii) We design a feedback-based active learning strategy and show that it significantly outperforms three state-of-the-art baselines for the task of spam detection. Our method achieves 6.9% and 67.3% higher F1-score and AUC, respectively, compared to the best baseline method. (iii) To overcome the problem of fewer training instances for supervised learning, we show that our proposed feedback strategy achieves 25.6% and 46% higher F1-score and AUC, respectively, than other oversampling strategies. Finally, we perform a case study to show how our method is capable of detecting as spammers those users who have not yet been suspended by Twitter (and other baselines).
@inproceedings{gupta2018collectiveclassificationof,title={Collective Classification of Spam Campaigners on Twitter: A Hierarchical Meta-Path Based Approach},author={Gupta, S. and Khattar, A. and Gogia, A. and Kumaraguru, P. and Chakraborty, T.},year={2018},booktitle={The Web Conf 2018 (Formerly WWW Conference)},}
EuroS&PW
I Spy with My Little Eye: Analysis and Detection of Spying Browser Extensions
A.
Aggarwal, B.
Viswanath, L.
Zhang, S.
Kumar, A.
Shah, and P.
Kumaraguru
In 3rd IEEE European Symposium on Security and Privacy, 2018
Several studies have been conducted on understanding third-party user tracking on the web. However, web trackers can only track users on sites where they are embedded by the publisher, thus obtaining a fragmented view of a user’s online footprint. In this work, we investigate a different form of user tracking, where browser extensions are repurposed to capture the complete online activities of a user and communicate the collected sensitive information to a third-party domain. We conduct an empirical study of spying browser extensions on the Chrome Web Store. First, we present an in-depth analysis of the spying behavior of these extensions. We observe that these extensions steal a variety of sensitive user information, such as the complete browsing history (e.g., the sequence of web traversals), online social network (OSN) access tokens, IP address, and user geolocation. Second, we investigate the potential for automatically detecting spying extensions by applying machine learning schemes. We show that using a Recurrent Neural Network (RNN), the sequences of browser API calls can be a robust feature, outperforming hand-crafted features (used in prior work on malicious extensions) to detect spying extensions. Our RNN based detection scheme achieves a high precision (90.02%) and recall (93.31%) in detecting spying extensions.
@inproceedings{aggarwal2018ispywith,title={I Spy with My Little Eye: Analysis and Detection of Spying Browser Extensions},author={Aggarwal, A. and Viswanath, B. and Zhang, L. and Kumar, S. and Shah, A. and Kumaraguru, P.},year={2018},booktitle={3rd IEEE European Symposium on Security and Privacy},}
SIGAPP
The Follower Count Fallacy: Detecting Twitter Users with Manipulated Follower Count
A.
Aggarwal, S.
Kumar, K.
Bhargava, and P.
Kumaraguru
In 33rd ACM / SIGAPP Symposium on Applied Computing, 2018
Online Social Networks (OSNs) are increasingly being used as a platform for effective communication, to engage with other users, and to create social worth via the number of likes, followers and shares. Such metrics and crowd-sourced ratings give the OSN user a sense of social reputation which she tries to maintain and boost to be more influential. Users artificially bolster their social reputation via black-market web services. In this work, we identify users who manipulate their projected follower count using an unsupervised local neighborhood detection method. We identify a neighborhood of the user based on a robust set of features which reflect user similarity in terms of the expected follower count. We show that follower count estimation using our method has 84.2% accuracy with a low error rate. In addition, we estimate the follower count of the user under suspicion by finding its neighborhood drawn from a large random sample of Twitter. We show that our method is highly tolerant to synthetic manipulation of followers. Using the deviation of predicted follower count from the displayed count, we are also able to detect customers with a high precision of 98.62%.
@inproceedings{aggarwal2018thefollowercount,title={The Follower Count Fallacy: Detecting Twitter Users with Manipulated Follower Count},author={Aggarwal, A. and Kumar, S. and Bhargava, K. and Kumaraguru, P.},year={2018},booktitle={33rd ACM / SIGAPP Symposium on Applied Computing},}
2017
IJMCMC
Cultural and Psychological Factors in Cyber-Security
T.
Halevi, N.
Memon, J.
Lewis, P.
Kumaraguru, S.
Arora, N.
Dagar, F.
Aloul, and J.
Chen
In Journal of Mobile Multimedia, Vol. 13, Nov. 1 & 2, 2017
Increasing cyber-security presents an ongoing challenge to security professionals. Research continuously suggests that online users are a weak link in information security. This research explores the relationship between cyber-security and cultural, personality and demographic variables. This study was conducted in four different countries and presents a multi-cultural view of cyber-security. In particular, it looks at how behavior, self-efficacy and privacy attitude are affected by culture compared to other psychological and demographics variables (such as gender and computer expertise). It also examines what kind of data people tend to share online and how culture affects these choices. This work supports the idea of developing personality based UI design to increase users’ cyber-security. Its results show that certain personality traits affect the user cyber-security related behavior across different cultures, which further reinforces their contribution compared to cultural effects.
@inproceedings{halevi2017culturalandpsychological,title={Cultural and Psychological Factors in Cyber-Security},author={Halevi, T. and Memon, N. and Lewis, J. and Kumaraguru, P. and Arora, S. and Dagar, N. and Aloul, F. and Chen, J.},year={2017},booktitle={Journal of Mobile Multimedia, Vol. 13, Nov. 1 & 2},}
ICTD
Leveraging Facebook’s Free Basics Engine for Web Service Deployment in Developing Regions
S.
Singh, V.
Nanda, R.
Sen, S.
Sengupta, P.
Kumaraguru, and K.
Gummadi
In ICTD ’17: Proceedings of the Ninth International Conference on Information and Communication Technologies and Development, 2017
In this paper we analyze Facebook’s Free Basics program, which provides free Internet access to a restricted set of web services. As the program grows to 60+ developing countries, an independent and data-driven audit of its scope and outreach is highly relevant to the ICTD community. We provide the first large scale empirical observations on how content providers are using the Free Basics platform and what kind of user traffic is expected once a Free Basics service goes live. Implementing an Android app for data collection and recruiting participants from 15 countries, we analyze the current set of Free Basics services and their growth over time. We also deploy our own Free Basics services to gather first hand experience about Facebook’s gate-keeping procedure in the program. One of our services Bugle News, an RSS news feed aggregator offered in English, Spanish and French, attracted 95.6K unique visitors from 55+ countries since Sep 2016. This enables us to characterize the nationality, demographics and interests of this Free Basics user population. We specifically deploy an ICTD related Free Basics service called Awaaz: My Voice. Awaaz is a web-service, where citizens can report local issues with location and images. This citizen journalism portal has attracted several hundred users during its short two months deployment in ten cities across South Africa. Visitors have reported concrete issues in categories of road, electricity, water, health and sanitation, school and education, crime and others. Overall our experimental observations allow the ICTD community to understand how Free Basics works and our deployment experiences pave the way for other applications to be launched in future, geared towards important use cases the ICTD community cares about.
@inproceedings{singh2017leveragingfree,title={Leveraging Facebook's Free Basics Engine for Web Service Deployment in Developing Regions},author={Singh, S. and Nanda, V. and Sen, R. and Sengupta, S. and Kumaraguru, P. and Gummadi, K.},year={2017},booktitle={ICTD '17: Proceedings of the Ninth International Conference on Information and Communication Technologies and Development},}
MM ’17
Visual Summarization of Social Media Events Using Mid-Level Visual Elements
S.
Goel, S.
Ahuja, A.
Subramanyam, and P.
Kumaraguru
The data generated on social media sites continues to grow at an increasing rate with more than 36% of tweets containing images making the dominance of multimedia content evidently visible. This massive user-generated content has become a reflection of world events. In order to enhance the ability and effectiveness to consume this plethora of data, summarization of these events is needed. However, very few studies have exploited the images attached with social media events to summarize them using “mid-level visual elements”. These are the entities which are both representative and discriminative to the target dataset besides being human-readable and hence more informative. In this paper, we propose a methodology for visual event summarization by extracting mid-level visual elements from images associated with social media events on Twitter (#VisualHashtags). The key research question is Which elements can visually capture the essence of a viral event?, hence explain its virality, and summarize it. Compared to the existing approaches of visual event summarization on social media data, we aim to discover #VisualHashtags, i.e., meaningful patches that can become the visual analog of a regular text hashtag that Twitter generates. Our algorithm incorporates a multi-stage filtering process and social popularity-based ranking to discover mid-level visual elements, which overcomes the challenges faced by the direct application of the existing methods. We evaluate our approach on a recently collected social media event dataset, comprising of 20,084 images. We evaluate the quality of #VisualHashtags extracted by conducting a user-centered evaluation where users are asked to rate the relevance of the resultant patches w.r.t. the event and the quality of the patch in terms of how meaningful it is. We also do a quantitative evaluation on the results. We show a high search space reduction of 93% in images and 99% in patches after summarization. Further, we get 83% of purity in the resultant patches with a data coverage of 18%.
@inproceedings{goel2017visualsummarizationof,title={Visual Summarization of Social Media Events Using Mid-Level Visual Elements},author={Goel, S. and Ahuja, S. and Subramanyam, A. and Kumaraguru, P.},year={2017},booktitle={25th ACM Conference on MultiMedia 2017},}
SocInfo
Nudging Nemo: Helping Users Control Linkability across Social Networks
R.
Kaushal, S.
Chandok, P.
Jain, P.
Dewan, N.
Gupta, and P.
Kumaraguru
In 9th International Conference on Social Informatics, 2017
The last decade has witnessed a boom in social networking platforms; each new platform is unique in its own ways, and offers a different set of features and services. In order to avail these services, users end up creating multiple virtual identities across these platforms. Researchers have proposed numerous techniques to resolve multiple such identities of a user across different platforms. However, the ability to link different identities poses a threat to the users’ privacy; users may or may not want their identities to be linkable across networks. In this paper, we propose Nudging Nemo, a framework which assists users to control the linkability of their identities across multiple platforms. We model the notion of linkability as the probability of an adversary (who is part of the user’s network) being able to link two profiles across different platforms, to the same real user. Nudging Nemo has two components; a linkability calculator which uses state-of-the-art identity resolution techniques to compute a normalized linkability measure for each pair of social network platforms used by a user, and a soft paternalistic nudge, which alerts the user if any of their activity violates their preferred linkability. We evaluate the effectiveness of the nudge by conducting a controlled user study on privacy conscious users who maintain their accounts on Facebook, Twitter, and Instagram. Outcomes of user study confirmed that the proposed framework helped most of the participants to take informed decisions, thereby preventing inadvertent exposure of their personal information across social network services.
@inproceedings{kaushal2017nudginghelping,title={Nudging Nemo: Helping Users Control Linkability across Social Networks},author={Kaushal, R. and Chandok, S. and Jain, P. and Dewan, P. and Gupta, N. and Kumaraguru, P.},year={2017},booktitle={9th International Conference on Social Informatics},}
LNSN
Hiding in Plain Sight: The Anatomy of Malicious Pages on Facebook
P.
Dewan, S.
Bagroy, and P.
Kumaraguru
In Lecture Notes in Social Networks, Springer, 2017
Facebook is the world’s largest Online Social Network, having more than 1 billion users. Like most other social networks, Facebook is home to various categories of hostile entities who abuse the platform by posting malicious content. In this paper, we identify and characterize Facebook pages that engage in spreading URLs pointing to malicious domains. We used the Web of Trust API to determine domain reputations of URLs published by pages, and identified 627 pages publishing untrustworthy information, misleading content, adult and child unsafe content, scams, etc. which are deemed as "Page Spam" by Facebook, and do not comply with Facebook’s community standards. Our findings revealed dominant presence of politically polarized entities engaging in spreading content from untrustworthy web domains. Anger and religion were the most prominent topics in the textual content published by these pages. We found that at least 8% of all malicious pages were dedicated to promote a single malicious domain. Studying the temporal posting activity of pages revealed that malicious pages were more active than benign pages. We further identified collusive behavior within a set of malicious pages spreading adult and pornographic content. We believe our findings will enable technologists to devise efficient automated solutions to identify and curb the spread of malicious content through such pages. To the best of our knowledge, this is the first attempt in literature, focused exclusively on characterizing malicious Facebook pages.
@inproceedings{dewan2017hidinginplain,title={Hiding in Plain Sight: The Anatomy of Malicious Pages on Facebook},author={Dewan, P. and Bagroy, S. and Kumaraguru, P.},year={2017},booktitle={Lecture Notes in Social Networks, Springer},}
ASONAM
Understanding Psycho-Sociological Vulnerability of ISIS Patronizers in Twitter
A.N.
Reganti, T.
Maheshwari, A.
Das, T.
Chakraborthy, and P.
Kumaraguru
In IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM 2017), 2017
The Islamic State of Iraq and Syria (ISIS) is a Salafi jihadist militant group that has made extensive use of online social media platforms to promulgate its ideologies and evoke many individuals to support the organization. The psycho-sociological background of an individual plays a crucial role in determining his/her vulnerability of being lured into joining the organisation and indulge in terrorist activities, since his/her behavior largely depends on the society s/he was brought up in. Here, we analyse five sociological aspects - personality, values & ethics, optimism/pessimism, age and gender to understand the psycho-sociological vulnerability of individuals over Twitter. Experimental results suggest that psycho-sociological aspects indeed act as foundation to discover and differentiate between prominent and unobtrusive users in Twitter.
@inproceedings{reganti2017understandingvulnerability,title={Understanding Psycho-Sociological Vulnerability of ISIS Patronizers in Twitter},author={Reganti, A.N. and Maheshwari, T. and Das, A. and Chakraborthy, T. and Kumaraguru, P.},year={2017},booktitle={IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM 2017)},}
ASONAM
Towards Understanding Crisis Events On Online Social Networks Through Pictures
P.
Dewan, A.
Suri, V.
Bharadhwaj, A.
Mithal, and P.
Kumaraguru
In IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM 2017), 2017
Extensive research has been conducted to identify, analyze and measure popular topics and public sentiment on Online Social Networks (OSNs) through text, especially during crisis events. However, little work has been done to understand such events through pictures posted on these networks. Given the potential of visual content for influencing users’ thoughts and emotions, we perform a large-scale analysis to study and compare popular themes and sentiment across images and textual content posted on Facebook during the terror attacks that took place in Paris in 2015. We propose a generalizable and highly automated 3-tier pipeline which utilizes state-of-the-art computer vision techniques to extract high-level human understandable image descriptors. We used these descriptors to associate themes and sentiment with images, and analyzed over 57,000 images related to the Paris Attacks. We discovered multiple visual themes which were popular in images, but were not identifiable through text. We also uncovered instances of misinformation and false flag (conspiracy) theories among popular image themes, which were not prominent in user-generated textual content. Further, our analysis revealed that while textual content posted after the attacks reflected negative sentiment, images inspired positive sentiment. These findings suggest that large-scale mining of images posted on OSNs during crisis, and other news-making events can significantly augment textual content to understand such events.
@inproceedings{dewan2017towardsunderstandingcrisis,title={Towards Understanding Crisis Events On Online Social Networks Through Pictures},author={Dewan, P. and Suri, A. and Bharadhwaj, V. and Mithal, A. and Kumaraguru, P.},year={2017},booktitle={IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM 2017)},}
ASONAM
Medical Persona Classification in Social Media
N.
Pattisapu, M.
Gupta, P.
Kumaraguru, and V.
Varma
In IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM 2017), 2017
Identifying medical persona from a social media post is of paramount importance for drug marketing and pharmacovigilance. In this work, we propose multiple approaches to infer the medical persona associated with a social media post. We pose this as a supervised multi-label text classification problem. The main challenge is to identify the hidden cues in a post that are indicative of a particular persona. We first propose a large set of manually engineered features for this task. Further, we propose multiple neural network based architectures to extract useful features from these posts using pre-trained word embeddings. Our experiments on thousands of blogs and tweets show that the proposed approach results in 7% and 5% gain in F-measure over manual feature engineering based approach for blogs and tweets respectively.
@inproceedings{pattisapu2017medicalpersonaclassification,title={Medical Persona Classification in Social Media},author={Pattisapu, N. and Gupta, M. and Kumaraguru, P. and Varma, V.},year={2017},booktitle={IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM 2017)},}
SIGMETRICS
An Empirical Analysis of Facebook’s Free Basics
S.
Singh, V.
Nanda, R.
Sen, S.
Ahmada, S.
Sengupta, A.
Phokeer, Z.A.
Farooq, T.A.
Khan, P.
Kumaraguru, I.A.
Qazi, D.
Choffnes, and K.
Gummadi
In this paper, we develop a suite of measurement techniques to improve the transparency of Free Basics and inform policy debates with empirical evidence. While our study necessarily focuses on Free Basics, our approach can be applied to any similar zero-rated and proxied services that arise. Our analysis answers the following key questions covering different aspects of the program: ∙ Free Basics services: What services constitute the current walled garden of Free Basics? Are these services same across countries? Are these services growing over time? ∙ Free Basics users: How many visitors does a typical service get, and from which countries, demographic and economic backgrounds? ∙ Free Basics architecture and Internet providers: What network quality are the services given, as a tradeoff for free access? Which party is primarily responsible for the quality: Facebook or the participating cellular providers?
@inproceedings{singh2017anempiricalanalysis,title={An Empirical Analysis of Facebook's Free Basics},author={Singh, S. and Nanda, V. and Sen, R. and Ahmada, S. and Sengupta, S. and Phokeer, A. and Farooq, Z.A. and Khan, T.A. and Kumaraguru, P. and Qazi, I.A. and Choffnes, D. and Gummadi, K.},year={2017},booktitle={ACM SIGMETRICS 2017},}
JCS
On the Security and Usability of Dynamic Cognitive Game CAPTCHAs
M.
Mohamed, S.
Gao, N.
Sachdeva, S.
Saxena, C.
Zhang, P.
Kumaraguru, and J.
Oorschot
Existing CAPTCHA solutions are a major source of user frustration on the Internet today, frequently forcing companies to lose customers and business. Game CAPTCHAs are a promising approach which may make CAPTCHA solving a fun activity for the user. One category of such CAPTCHAs – called Dynamic Cognitive Game (DCG) CAPTCHA – challenges the user to perform a game-like cognitive (or recognition) task interacting with a series of dynamic images. Specifically, it takes the form of many objects floating around within the images, and the user’s task is to match the objects corresponding to specific target(s), and drag/drop them to the target region(s). In this paper, we pursue a comprehensive analysis of DCG CAPTCHAs. We design and implement such CAPTCHAs, and dissect them across four broad but overlapping dimensions: (1) usability, (2) fully automated attacks, (3) human-solving relay attacks, and (4) hybrid attacks that combine the strengths of automated and relay attacks. Our study shows that DCG CAPTCHAs are highly usable, even on mobile devices and offer some resilience to relay attacks, but they are vulnerable to our proposed automated and hybrid attacks.
@inproceedings{mohamed2017onthesecurity,title={On the Security and Usability of Dynamic Cognitive Game CAPTCHAs},author={Mohamed, M. and Gao, S. and Sachdeva, N. and Saxena, S. and Zhang, C. and Kumaraguru, P. and van Oorschot, J.},year={2017},booktitle={Journal of Computer Security (JCS)},}
SNAM
Facebook Inspector (FbI): Towards Automatic Real Time Detection of Malicious Content on Facebook
P.
Dewan, and P.
Kumaraguru
In Journal of Social Network Analysis and Mining (SNAM), Volume 7, Issue 1, 2017
Online Social Networks witness a rise in user activity whenever a major event makes news. Cyber criminals exploit this spur in user engagement levels to spread malicious content that compromises system reputation, causes financial losses and degrades user experience. In this paper, we collect and characterize a dataset of 4.4 million public posts generated on Facebook during 17 news-making events (natural calamities, sports, terror attacks, etc.) over a 16-month time period. From this dataset, we filter out two sets of malicious posts, one using URL blacklists and another using human annotations. Our observations reveal some characteristic differences between malicious posts obtained from the two methodologies, thus demanding a twofold filtering process for a more complete and robust filtering system. We empirically confirm the need for this twofold filtering approach by cross-validating supervised learning models obtained from the two sets of malicious posts. These supervised learning models include Naive Bayesian, Decision Trees, Random Forest, and Support Vector Machine-based models. Based on this learning, we implement Facebook Inspector, a REST API-based browser plug-in for identifying malicious Facebook posts in real time. Facebook Inspector uses class probabilities obtained from two independent supervised learning models based on a Random Forest classifier to identify malicious posts in real time. These supervised learning models are based on a feature set comprising 44 features and achieve an accuracy of over 80% each, using only publicly available features. During the first 9 months of its public deployment (August 2015–May 2016), Facebook Inspector processed 0.97 million posts at an average response time of 2.6 s per post and was downloaded over 2500 times. We also evaluate Facebook Inspector in terms of performance and usability to identify further scope for improvement.
@inproceedings{dewan2017facebookinspector,title={Facebook Inspector (FbI): Towards Automatic Real Time Detection of Malicious Content on Facebook},author={Dewan, P. and Kumaraguru, P.},year={2017},booktitle={Journal of Social Network Analysis and Mining (SNAM), Volume 7, Issue 1},}
ICWSM
From Camera to Deathbed: Understanding Dangerous Selfies on Social Media
H.
Lamba, V.
Bharadhwaj, M.
Vachher, D.
Agarwal, M.
Arora, N.
Sachdeva, and P.
Kumaraguru
In 11th International Conference on Web and Social Media (ICWSM), 2017
Selfie culture has emerged as a ubiquitous instrument for self portrayal in recent years. To portray themselves differently and attractive to others, individuals may risk their life by clicking selfies in dangerous situations. Consequently, selfies claimed 137 lives around the world between March 2014 and December 2016. In this work, we perform a comprehensive analysis of the reported selfie-casualties and note various reasons behind these deaths. We perform an in-depth analysis of such selfies posted on social media to identify dangerous selfies and explore a series of statistical models to predict dangerous posts. We find that our multimodal classifier using a combination of text-based, image-based and location-based features performs the best in spotting dangerous selfies. Our classifier is trained on 6K annotated selfies collected on Twitter and gives 82% accuracy for identifying whether a selfie posted on Twitter is dangerous or not.
@inproceedings{lamba2017fromcamerato,title={From Camera to Deathbed: Understanding Dangerous Selfies on Social Media},author={Lamba, H. and Bharadhwaj, V. and Vachher, M. and Agarwal, D. and Arora, M. and Sachdeva, N. and Kumaraguru, P.},year={2017},booktitle={11th International Conference on Web and Social Media (ICWSM)},}
CHI ’17
A Social Media Based Index of Mental Well-Being in College Campuses
S.
Bagroy, P.
Kumaraguru, and M.
De Choudhury
In Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems (CHI), 2017
Psychological distress in the form of depression, anxiety and other mental health challenges among college students is a growing health concern. Dearth of accurate, continuous, and multi-campus data on mental well-being presents significant challenges to intervention and mitigation efforts in college campuses. We examine the potential of social media as a new “barometer” for quantifying the mental well-being of college populations. Utilizing student-contributed data in Reddit communities of over 100 universities, we first build and evaluate a transfer learning based classification approach that can detect mental health expressions with 97% accuracy. Thereafter, we propose a robust campus-specific Mental Well-being Index: MWI. We find that MWI is able to reveal meaningful temporal patterns of mental well-being in campuses, and to assess how their expressions relate to university attributes like size, academic prestige, and student demographics. We discuss the implications of our work for improving counselor efforts, and in the design of tools that can enable better assessment of the mental health climate of college campuses.
@inproceedings{bagroy2017asocialmedia,title={A Social Media Based Index of Mental Well-Being in College Campuses},author={Bagroy, S. and Kumaraguru, P. and De Choudhury, M.},year={2017},booktitle={Proceedings of the 34th Annual ACM Conference on Human Factors in Computing Systems (CHI)},}
CSCW
Call for Service: Characterizing and Modeling Police Response to Serviceable Requests on Facebook
N.
Sachdeva, and P.
Kumaraguru
In ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), 2017
Social media platforms have attracted substantial interest from police as a way to connect with residents. This has encouraged residents to report day-to-day law and order concerns such as traffic congestion, missing people, and harassment by cops on these platforms. In this paper, we study day-to-day concerns shared by residents on social media and police response to such concerns. Based on the input of police experts, we define concerns that require police response and attention as serviceable requests. We provide insights on six textual attributes that can identify serviceable posts. We find such posts are marked by high negative emotions and by more factual and objective content, such as the location and time of incidents. We show that police response time varies depending upon the kind of serviceable request. Our work explores a series of statistical models to predict serviceable posts and their different types. We conclude the paper by discussing the implications of our findings for police practices and the design needs for possible technological interventions. These technological interventions will help increase the interactions between police and residents, thereby increasing the well-being and safety of society.
@inproceedings{sachdeva2017callfor,title={Call for Service: Characterizing and Modeling Police Response to Serviceable Requests on Facebook},author={Sachdeva, N. and Kumaraguru, P.},year={2017},booktitle={ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW)},}
2016
iiWAS2016
Cultural and Psychological Factors in Cyber-Security
T.
Halevi, N. D.
Memon, J.
Lewis, P.
Kumaraguru, S.
Arora, N.
Dagar, F. A.
Aloul, and J.
Chen
In 18th International Conference on Information Integration and Web-based Applications & Services (iiWAS2016), 2016
Increasing cyber-security presents an ongoing challenge to security professionals. Research continuously suggests that online users are a weak link in information security. This research explores the relationship between cyber-security and cultural, personality and demographic variables. This study was conducted in four different countries and presents a multi-cultural view of cyber-security. In particular, it looks at how behavior, self-efficacy and privacy attitude are affected by culture compared to other psychological and demographics variables (such as gender and computer expertise). It also examines what kind of data people tend to share online and how culture affects these choices. This work supports the idea of developing personality based UI design to increase users’ cyber-security. Its results show that certain personality traits affect the user cyber-security related behavior across different cultures, which further reinforces their contribution compared to cultural effects.
@inproceedings{halevi2016culturalandpsychological,title={Cultural and Psychological Factors in Cyber-Security},author={Halevi, T. and Memon, N. D. and Lewis, J. and Kumaraguru, P. and Arora, S. and Dagar, N. and Aloul, F. A. and Chen, J.},year={2016},booktitle={18th International Conference on Information Integration and Web-based Applications & Services (iiWAS2016)},}
SPSM
Exploiting Phone Numbers and Cross-Application Features in Targeted Mobile Attacks
S.
Gupta, P.
Gupta, M.
Ahamad, and P.
Kumaraguru
In 6th Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM), 2016
Smartphones have fueled a shift in the way we communicate with each other via Instant Messaging. With the convergence of Internet and telephony, new Over-The-Top (OTT) messaging applications (e.g., WhatsApp, Viber, WeChat, etc.) have emerged as an important means of communication for millions of users. These applications use phone numbers as the only means of authentication and are becoming an attractive medium for attackers to deliver spam and carry out more targeted attacks. The universal reach of telephony along with its past trusted nature makes phone numbers attractive identifiers for reaching potential attack targets. In this paper, we explore the feasibility, automation, and scalability of a variety of targeted attacks that can be carried out by abusing phone numbers. These attacks can be carried out on different channels viz. OTT messaging applications, voice, e-mail, or SMS. We demonstrate a novel system that takes a phone number as an input, leverages information from applications like Truecaller and Facebook about the victim and his / her social network, checks the presence of phone number’s owner (victim) on the attack channel (OTT messaging applications, voice, e-mail, or SMS), and finally targets the victim on the chosen attack channel. As a proof of concept, we enumerated through a random pool of 1.16 million phone numbers and demonstrated that targeted attacks could be crafted against the owners of 255,873 phone numbers by exploiting cross-application features. Due to the significantly increased user engagement via new mediums of communication like OTT messaging applications and ease with which phone numbers allow collection of pertinent information, there is a clear need for better protection of applications that rely on phone numbers.
@inproceedings{gupta2016exploitingphonenumbers,title={Exploiting Phone Numbers and Cross-Application Features in Targeted Mobile Attacks},author={Gupta, S. and Gupta, P. and Ahamad, M. and Kumaraguru, P.},year={2016},booktitle={6th Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM)},}
PST
Detection, Characterization and Analysis of Child Unsafe Content & Promoters on YouTube
YouTube draws a large number of users who contribute actively by uploading videos or commenting on existing videos. However, because content is crowd-sourced and pushed onto the platform in large volumes, there is limited control over it. This lets malicious users push content (videos and comments) which is inappropriate (unsafe), particularly when such content is placed around cartoon videos which are typically watched by kids. In this paper, we focus on the presence of unsafe content for children and the users who promote it. For detection of child unsafe content and its promoters, we use two approaches, one based on supervised classification which uses an extensive set of video-level, user-level and comment-level features and another based on a Convolutional Neural Network using video frames. Detection accuracy of 85.7% is achieved which can be leveraged to build a system to provide a safe YouTube experience for kids. Through detailed characterization studies, we conclude that unsafe content promoters are less popular and engage less as compared with other users. Finally, using a network of unsafe content promoters and other users based on their engagements (likes, subscription and playlist addition) and other factors, we find that unsafe content is present very close to safe content and unsafe content promoters form very close knit communities with other users, thereby further increasing the likelihood of a child getting exposed to unsafe content.
@inproceedings{kaushal2016characterizationand,title={Detection, Characterization and Analysis of Child Unsafe Content & Promoters on YouTube},author={Kaushal, R. and Saha, S. and Bajaj, P. and Kumaraguru, P.},year={2016},booktitle={Privacy Security and Trust (PST), 2016},}
SNAM
Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks
P. Jain, P. Kumaraguru, and A. Joshi
In Journal of Social Network Analysis and Mining (SNAM), 2016
Profile linking is the ability to connect profiles of a user on different social networks. Linked profiles can help companies like Disney to build psychographics of potential customers and segment them for targeted marketing in a cost-effective way. Existing methods link profiles by observing high similarity between most recent (current) values of the attributes like name and username. However, for a section of users observed to evolve their attributes over time and choose dissimilar values across their profiles, these current values have low similarity. Existing methods then falsely conclude that profiles refer to different users. To reduce such false conclusions, we suggest gathering a rich history of values assigned to an attribute over time and comparing attribute histories to link user profiles across networks. We believe that attribute history highlights user preferences for creating attribute values on a social network. Co-existence of these preferences across profiles on different social networks results in similar attribute histories, which suggests that the profiles potentially refer to a single user. Through a focused study on username, we quantify the importance of username history for profile linking on a dataset of real-world users with profiles on Twitter, Facebook, Instagram and Tumblr. We show that username history correctly links 44% more profile pairs with non-matching current values that are incorrectly unlinked by existing methods. We further explore whether factors such as longevity and availability of username history on either profile affect linking performance. To the best of our knowledge, this is the first study that explores the viability of using an attribute history to link profiles on social networks.
@inproceedings{jain2016otherother,title={Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks},author={Jain, P. and Kumaraguru, P. and Joshi, A.},year={2016},booktitle={Journal of Social Network Analysis and Mining (SNAM)},}
SocInfo
PicHunt: Social Media Image Retrieval for Improved Law Enforcement
S. Goel, N. Sachdeva, P. Kumaraguru, A. Subramanyam, and D. Gupta
In 8th International Conference on Social Informatics, 2016
First responders are increasingly using social media to identify and reduce crime for the well-being and safety of society. Images shared on social media that hurt religious, political, communal and other sentiments of people often instigate violence and create law & order situations in society. This results in the need for first responders to inspect the spread of such images and users propagating them on social media. In this paper, we present a comparison between different hand-crafted features and a Convolutional Neural Network (CNN) model to retrieve similar images; the CNN model outperforms state-of-the-art hand-crafted features. We propose an Open-Source-Intelligent (OSINT) real-time image search system, robust in retrieving modified images, that allows first responders to analyze the current spread of images, the sentiments expressed, and details of users propagating such content. The system also saves officials the time of manually analyzing the content by reducing the search space by 67% on average.
@inproceedings{goel2016socialmedia,title={PicHunt: Social Media Image Retrieval for Improved Law Enforcement},author={Goel, S. and Sachdeva, N. and Kumaraguru, P. and Subramanyam, A. and Gupta, D.},year={2016},booktitle={8th International Conference on Social Informatics},}
MM ’16
Disinformation in Multimedia Annotation: Misleading Metadata Detection on YouTube
P. Bajaj, M. Kavidayal, P. Srivastava, M. Akthar, and P. Kumaraguru
In ACM Multimedia 2016 Workshop: Vision and Language Integration Meets Multimedia Fusion, 2016
The popularity of online videos is increasing at a rapid rate. Not only can users access these videos online, but they can also upload video content on platforms like YouTube and Myspace. These videos are indexed by user generated multimedia annotation, also known as metadata, which is usually rich contextual information added by users about the content of the videos to facilitate access to their videos. Metadata plays a crucial role in techniques for video search and retrieval. However, this freedom of choosing annotation causes some uploaders to provide additional tags which are not even related to the content of the videos. Therefore, it is essential to verify the relevance of user-generated tags with the content of the video. Given the sheer volume of video content uploaded every day, manual tag validation can be a highly labor intensive task. In this paper, we propose a method to automatically analyze user generated tags against video content to identify relevance of these tags and to detect irrelevant and misleading metadata for online videos. Our contributions are three-fold: First, we study the nature of user-assigned tags and characterize them in two categories: generic and specific tags. Second, we propose a novel hierarchical graph based approach to identify tags which are relevant to content of the video. Third, we present a way to use user-generated comments for multimedia annotation verification. We demonstrate results of our method and evaluation on 300 YouTube videos for three different categories. The results show that we are able to identify relevant tags with average recall of 0.813 and average precision of 0.97.
@inproceedings{bajaj2016disinformationinmultimedia,title={Disinformation in Multimedia Annotation: Misleading Metadata Detection on YouTube},author={Bajaj, P. and Kavidayal, M. and Srivastava, P. and Akthar, M. and Kumaraguru, P.},year={2016},booktitle={ACM Multimedia 2016 Workshop: Vision and Language Integration Meets Multimedia Fusion},}
ASONAM
Emerging Threats Abusing Phone Numbers Exploiting Cross-Platform Features
S. Gupta
In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Ph.D. Forum, 2016
The phone number, a unique identifier, has emerged as an important piece of Personally Identifiable Information (PII) in the last few years. Other PII like e-mail and online identity have been exploited in the past to launch phishing and spam attacks against them. The reach and security of a phone number provide a genuine advantage over e-mail or online identity, making it the most vulnerable attack vector. In this work, we explore the emerging threats that abuse phone numbers by exploiting cross-platform features. Given that the phone number space has not been extensively studied in the past, there is a dire need to understand the threat landscape and develop solutions to prevent its abuse.
@inproceedings{gupta2016emergingthreatsabusing,title={Emerging Threats Abusing Phone Numbers Exploiting Cross-Platform Features},author={Gupta, S.},year={2016},booktitle={2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Ph.D. Forum},}
ASONAM
Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages
P. Dewan, S. Bagroy, and P. Kumaraguru
In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2016
Facebook is the world’s largest Online Social Network, having more than 1 billion users. Like most other social networks, Facebook is home to various categories of hostile entities who abuse the platform by posting malicious content. In this paper, we identify and characterize Facebook pages that engage in spreading URLs pointing to malicious domains. We used the Web of Trust API to determine domain reputations of URLs published by pages, and identified 627 pages publishing untrustworthy information, misleading content, adult and child unsafe content, scams, etc. which are deemed as "Page Spam" by Facebook, and do not comply with Facebook’s community standards. Our findings revealed dominant presence of politically polarized entities engaging in spreading content from untrustworthy web domains. Anger and religion were the most prominent topics in the textual content published by these pages. We found that at least 8% of all malicious pages were dedicated to promote a single malicious domain. Studying the temporal posting activity of pages revealed that malicious pages were more active than benign pages. We further identified collusive behavior within a set of malicious pages spreading adult and pornographic content. We believe our findings will enable technologists to devise efficient automated solutions to identify and curb the spread of malicious content through such pages. To the best of our knowledge, this is the first attempt in literature, focused exclusively on characterizing malicious Facebook pages.
@inproceedings{dewan2016hidinginplain,title={Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages},author={Dewan, P. and Bagroy, S. and Kumaraguru, P.},year={2016},booktitle={IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)},}
BHCI
Social Media for Safety: Characterizing Online Interactions between Citizens and Police
N. Sachdeva, P. Kumaraguru, and M. Choudhury
In 30th British Human Computer Interaction Conference (BHCI), 2016
Social media has emerged as a promising resource for police to connect with citizens for collective action. However, the attributes of police-citizen interactions on social media remain under-explored. In this paper, we utilise official and public Facebook pages of several police departments in India to study the patterns of engagement, emotions, and social processes between citizens and police in the context of day-to-day policing. We examine two prominent types of discussion threads: police initiated and citizen initiated. We find that topics exchanged in police initiated discussions are more focussed than those in citizen initiated threads; police focused on topics concerning safety awareness programs, action reports, and information regarding policing activities. Compared to police initiated discussions, citizen initiated discussions show lower engagement. Further, discussions involving both police and citizens show higher negative emotions, anger and arousal than citizen only discussions; these interventions involving both reveal a stronger notion of a collective identity. We discuss the implications of our work in designing technological support for improved policing and to help understand citizen opinions, safety concerns and well-being via social media.
@inproceedings{sachdeva2016socialmediafor,title={Social Media for Safety: Characterizing Online Interactions between Citizens and Police},author={Sachdeva, N. and Kumaraguru, P. and Choudhury, M.},year={2016},booktitle={30th British Human Computer Interaction Conference (BHCI) 2016},}
HCI
Online Social Media - New face of policing? A Survey Exploring Perceptions, Behavior, Challenges for Police Field Officers and Residents
N. Sachdeva and P. Kumaraguru
In 18th International Conference on Human-Computer Interaction, 2016
Online social media (OSM) has become a preferred choice of police to communicate and collaborate with citizens for improved safety. Various studies investigate the perceptions and opinions of high-ranked police officers on the use of OSM in policing; however, the understanding and perceptions of field-level police personnel are largely unexplored. We collected survey responses from 445 police personnel and 204 citizens in India to understand perceptions of online social network (OSN) use for policing. Further, we analyzed posts from Facebook pages of Indian police organizations to study the behavior of police and citizens as they pursue social and safety goals on OSN. We find that success of OSN for policing demands effective communication between the stakeholders (citizens and police). Our results show preliminary evidence that OSN use for policing can help (1) increase participation in the problem-solving process and (2) increase community engagement by providing a unique channel for both feedback and anonymity. However, such a system will need appropriate acknowledgment and trustworthiness channels to be successful. We also identify challenges in adopting OSN and outline design opportunities for HCI researchers and practitioners to design tools supporting social interactions for policing.
@inproceedings{sachdeva2016onlinesocialmedia,title={Online Social Media - New face of policing? A Survey Exploring Perceptions, Behavior, Challenges for Police Field Officers and Residents},author={Sachdeva, N. and Kumaraguru, P.},year={2016},booktitle={18th International Conference on Human-Computer Interaction},}
CoDS
On the Dynamics of Username Changing Behavior on Twitter
P. Jain and P. Kumaraguru
In 3rd IKDD Conference on Data Science, 2016
People extensively use usernames to look up users, their profiles, and tweets that mention them via the Twitter search engine. Often, the searched username is outdated due to a recent username change and no longer refers to the user of interest. Search by the user’s old username results in a failed attempt to reach the user’s profile, thereby making others falsely believe that the user account has been deactivated. Such a search can also redirect to a different user who later picks the old username, thereby reaching a different person altogether. Past studies show that a substantial section of Twitter users change their username over time. We observe similar trends when tracking 8.7 million users on Twitter for a duration of two months. To this point, little is known about how and why these users change their username, given the consequences of unreachability. To answer this, we analyze the username changing behavior of carefully selected users on Twitter and find that users change usernames frequently within short time intervals (a day) and choose new usernames unrelated to the old ones. A few favor a username by choosing it repeatedly. We explore a few of the many reasons that may have caused username changes. We believe that studying username changing behavior can help correctly find the user of interest in addition to learning username creation strategies and uncovering plausible malicious intentions for the username change.
@inproceedings{jain2016onthedynamics,title={On the Dynamics of Username Changing Behavior on Twitter},author={Jain, P. and Kumaraguru, P.},year={2016},booktitle={3rd IKDD Conference on Data Science, 2016},}
ICWSM
Emotions, Demographics and Sociability in Online Interactions
K. Lerman, M. Arora, L. Marin, P. Kumaraguru, and D. Garcia
The social connections people form online affect the quality of information they receive and their online experience. Although a host of socioeconomic and cognitive factors were implicated in the formation of offline social ties, few of them have been empirically validated, particularly in an on-line setting. In this study, we analyze a large corpus of geo-referenced messages, or tweets, posted by social media users from a major US metropolitan area. We linked these tweets to US Census data through their locations. This allowed us to measure emotions expressed in the tweets posted from an area, the structure of social connections, and also use that area’s socioeconomic characteristics in analysis. We find that at an aggregate level, places where social media users engage more deeply with less diverse social contacts are those where they express more negative emotions, like sadness and anger. Demographics also has an impact: these places have residents with lower household income and education levels. Conversely, places where people engage less frequently but with diverse contacts have happier, more positive messages posted from them and also have better educated, younger, more affluent residents. Results suggest that cognitive factors and offline characteristics affect the quality of online interactions. Our work highlights the value of linking social media data to traditional data sources, such as US Census, to drive novel analysis of online behavior.
@inproceedings{lerman2016demographicsand,title={Emotions, Demographics and Sociability in Online Interactions},author={Lerman, K. and Arora, M. and Marin, L. and Kumaraguru, P. and Garcia, D.},year={2016},booktitle={ICWSM 2016},}
2015
HTSM
Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks
P. Jain, P. Kumaraguru, and A. Joshi
In HT ’15: Proceedings of the 26th ACM Conference on Hypertext & Social Media, 2015
Profile linking is the ability to connect profiles of a user on different social networks. Linked profiles can help companies like Disney to build psychographics of potential customers and segment them for targeted marketing in a cost-effective way. Existing methods link profiles by observing high similarity between most recent (current) values of the attributes like name and username. However, for a section of users observed to evolve their attributes over time and choose dissimilar values across their profiles, these current values have low similarity. Existing methods then falsely conclude that profiles refer to different users. To reduce such false conclusions, we suggest gathering a rich history of values assigned to an attribute over time and comparing attribute histories to link user profiles across networks. We believe that attribute history highlights user preferences for creating attribute values on a social network. Co-existence of these preferences across profiles on different social networks results in similar attribute histories, which suggests that the profiles potentially refer to a single user. Through a focused study on username, we quantify the importance of username history for profile linking on a dataset of real-world users with profiles on Twitter, Facebook, Instagram and Tumblr. We show that username history correctly links 44% more profile pairs with non-matching current values that are incorrectly unlinked by existing methods. We further explore whether factors such as longevity and availability of username history on either profile affect linking performance. To the best of our knowledge, this is the first study that explores the viability of using an attribute history to link profiles on social networks.
@inproceedings{jain2015otherother,title={Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks},author={Jain, P. and Kumaraguru, P. and Joshi, A.},year={2015},booktitle={HT '15: Proceedings of the 26th ACM Conference on Hypertext & Social Media},}
PST
What They Do in Shadows: Twitter Underground Follower Market
A. Aggarwal and P. Kumaraguru
In 13th Annual Conference on Privacy, Security and Trust (PST), 2015
Internet users and businesses are increasingly using online social networks (OSN) to drive audience traffic and increase their popularity. In order to boost social presence, OSN users need to increase the visibility and reach of their online profile through metrics such as Facebook likes, Twitter followers, Instagram comments and Yelp reviews. For example, an increase in Twitter followers not only improves the audience reach of the user but also boosts the perceived social reputation and popularity. This has created scope for an underground market that provides followers, likes, comments, etc. via a network of fraudulent and compromised accounts and various collusion techniques. In this paper, we map the landscape of the underground markets that provide Twitter followers by studying their basic building blocks: merchants, customers and phony followers. We characterize the services provided by merchants to understand their operational structure and market hierarchy. Twitter underground markets can operate using a premium monetary scheme or other incentivized freemium schemes. We find that the freemium market has an oligopoly structure with a few merchants being the market leaders. We also show that merchant popularity does not have any correlation with the quality of service provided by the merchant to its customers. Our findings also shed light on the characteristics and quality of market customers and the phony followers provided. We draw a comparison between legitimate users and phony followers, and identify key features that separate such users. With the help of these differentiating features, we build a supervised learning model to predict suspicious following behaviour with an accuracy of 89.2%.
@inproceedings{aggarwal2015whattheydo,title={What They Do in Shadows: Twitter Underground Follower Market},author={Aggarwal, A. and Kumaraguru, P.},year={2015},booktitle={13th Annual Conference on Privacy, Security and Trust (PST) 2015},}
PST
Towards Automatic Real Time Identification of Malicious Posts on Facebook
P. Dewan and P. Kumaraguru
In 13th Annual Conference on Privacy, Security and Trust (PST), 2015
Online Social Networks (OSNs) witness a rise in user activity whenever a news-making event takes place. Cyber criminals exploit this spur in user-engagement levels to spread malicious content that compromises system reputation, causes financial losses and degrades user experience. In this paper, we characterized a dataset of 4.4 million public posts generated on Facebook during 17 news-making events (natural calamities, terror attacks, etc.) and identified 11,217 malicious posts containing URLs. We found that most of the malicious content which is currently evading Facebook’s detection techniques originated from third party and web applications, while more than half of all legitimate content originated from mobile applications. We also observed greater participation of Facebook pages in generating malicious content as compared to legitimate content. We proposed an extensive feature set based on entity profile, textual content, metadata, and URL features to automatically identify malicious content on Facebook in real time. This feature set was used to train multiple machine learning models and achieved an accuracy of 86.9%. We performed experiments to show that past techniques for spam campaign detection identified less than half the number of malicious posts as compared to our model. This model was used to create a REST API and a browser plug-in to identify malicious Facebook posts in real time.
@inproceedings{dewan2015towardsautomaticreal,title={Towards Automatic Real Time Identification of Malicious Posts on Facebook},author={Dewan, P. and Kumaraguru, P.},year={2015},booktitle={13th Annual Conference on Privacy, Security and Trust (PST) 2015},}
ECSCW
Online Social Networks and Police in India - Understanding the Perceptions, Behavior, Challenges
N. Sachdeva and P. Kumaraguru
In European Conference on Computer-Supported Cooperative Work (ECSCW), 2015
Safety is a concern for most urban communities; police departments bear the majority of responsibility to maintain law and order and prevent crime. Police agencies across the globe are increasingly using Online Social Networks (OSN) (such as Facebook and Twitter) to acquire intelligence and connect with citizens. Developing nations like India are, however, still exploring OSN for policing. We interviewed 20 IPS officers and 21 citizens to understand perceptions, and explored challenges experienced while using OSN for policing. Interview analysis highlights how citizens and police think about information shared on OSN, handling offensive comments, and acknowledgment overload, as they pursue social and safety goals. We found that success of OSN for policing demands effective communication between the stakeholders (citizens and police). Our study shows that OSN offers community-policing opportunities, enabling police to identify crime with the help of citizens. It can reduce the communication gap and improve coordination between police and citizens. We also discuss design opportunities for tools to support social interactions between stakeholders.
@inproceedings{sachdeva2015onlinesocialnetworks,title={Online Social Networks and Police in India - Understanding the Perceptions, Behavior, Challenges},author={Sachdeva, N. and Kumaraguru, P.},year={2015},booktitle={European conference on Computer-Supported Cooperative Work (ECSCW) 2015},}
DGO
Social Networks for Police and Residents in India: Exploring Online Communication for Crime Prevention
N. Sachdeva and P. Kumaraguru
In 16th Annual International Conference on Digital Government Research (dg.o), 2015
Safety is a concern for most urban communities; residents interact in multiple ways with the police to address their safety concerns. Positive interactions with police help residents to feel safe. In developing countries, residents have started to use Online Social Networks (OSN) such as Facebook to share concerns and seek solutions. In this study, we investigate whether residents’ posts on OSN contain actionable information that police can use to address safety concerns and how residents use OSN to communicate with police. For this, we analyze residents’ posts and comments on the Facebook page of Bangalore City Police, India, over a period of one month. Our results show that residents post information (including location) about various crimes such as neighborhood issues (drunkards, illegal construction), financial frauds, property crime, and thefts. In addition to crime, the Facebook page gives information on residents’ satisfaction and police performance. The majority of residents use the police Facebook page to appreciate the good work of police. Police responses to residents’ posts range from ignoring and acknowledging to replying and following up. We find that police respond to most residents’ posts and help residents to reach the authority who can help solve the issue. Police adopt a formal communication style to interact with residents. We find that in addition to actionable information, OSN can help understand fear of crime among residents and develop mutual accountability between police and residents.
@inproceedings{sachdeva2015socialnetworksfor,title={Social Networks for Police and Residents in India: Exploring Online Communication for Crime Prevention},author={Sachdeva, N. and Kumaraguru, P.},year={2015},booktitle={16th Annual International Conference on Digital Government Research (dg.o) 2015},}
2014
SocInfo
TweetCred: A Real-time Web-based System for Assessing Credibility of Content on Twitter
A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier
In 6th International Conference on Social Informatics (SocInfo), 2014
During large scale events, a large volume of content is posted on Twitter, but not all of this content is trustworthy. The presence of spam, advertisements, rumors and fake images reduces the value of information collected from Twitter, especially during sudden-onset crisis events where information from other sources is scarce. In this research work, we describe various facets of assessing the credibility of user-generated content on Twitter during large scale events, and develop a novel real-time system to assess the credibility of tweets. Firstly, we develop a semi-supervised ranking model using SVM-rank for assessing credibility, based on training data obtained from six high-impact crisis events of 2013. An extensive set of forty-five features is used to determine the credibility score for each of the tweets. Secondly, we develop and deploy a system–TweetCred–in the form of a browser extension, a web application and an API at the link: http://twitdigest.iiitd.edu.in/TweetCred/. To the best of our knowledge, this is the first research work to develop a practical system for credibility on Twitter and evaluate it with real users. TweetCred was installed and used by 717 Twitter users within a span of three weeks. During this period, a credibility score was computed for more than 1.1 million unique tweets. Thirdly, we evaluated the real-time performance of TweetCred, observing that 84% of the credibility scores were displayed within 6 seconds. We report on the positive feedback that we received from the system’s users and the insights we gained into improving the system for future iterations.
@inproceedings{gupta2014a,title={TweetCred: A Real-time Web-based System for Assessing Credibility of Content on Twitter},author={Gupta, A. and Kumaraguru, P. and Castillo, C. and Meier, P.},year={2014},booktitle={6th International Conference on Social Informatics (SocInfo) 2014},}
eCRS
bit.ly/malicious: Deep Dive into Short URL based e-Crime Detection
N. Gupta, A. Aggarwal, and P. Kumaraguru
In 9th APWG eCrime Research Symposium (eCRS), 2014
The existence of spam URLs in emails and Online Social Media (OSM) has become a massive e-crime. To counter the dissemination of long complex URLs in emails and the character limit imposed on various OSM (like Twitter), the concept of URL shortening has gained a lot of traction. URL shorteners take a long URL as input and return a short URL with the same landing page. With their immense popularity over time, URL shorteners have become a prime target for attackers, giving them an advantage in concealing malicious content. Bitly, a leading service among all shortening services, is being exploited heavily to carry out phishing attacks, work-from-home scams, pornographic content propagation, etc. This imposes additional performance pressure on Bitly and other URL shorteners to be able to detect and take timely action against the illegitimate content. In this study, we analyzed a dataset of 763,160 short URLs marked suspicious by Bitly in the month of October 2013. Our results reveal that Bitly is not using its claimed spam detection services very effectively. We also show how a suspicious Bitly account goes unnoticed despite prolonged, recurrent illegitimate activity. Bitly displays a warning page on identification of suspicious links, but we observed this approach to be weak in controlling the overall propagation of spam. We also identified some short URL based features and coupled them with two domain specific features to classify a Bitly URL as malicious or benign and achieved an accuracy of 86.41%. The feature set identified can be generalized to other URL shortening services as well. To the best of our knowledge, this is the first large scale study to highlight the issues with the implementation of Bitly’s spam detection policies and to propose suitable countermeasures.
@inproceedings{gupta2014deepdive,title={bit.ly/malicious: Deep Dive into Short URL based e-Crime Detection},author={Gupta, N. and Aggarwal, A. and Kumaraguru, P.},year={2014},booktitle={9th APWG eCrime Research Symposium (eCRS) 2014},}
eCRS
Emerging Phishing Trends and Effectiveness of the Anti-Phishing Landing Page
S.
Gupta, and P.
Kumaraguru
In 9th APWG eCrime Research Symposium (eCRS) 2014, 2014
Each month, more attacks are launched with the aim of making web users believe that they are communicating with a trusted entity which compels them to share their personal, financial information. Phishing costs Internet users billions of dollars every year. Researchers at Carnegie Mellon University (CMU) created an anti-phishing landing page supported by Anti-Phishing Working Group (APWG) with the aim to train users on how to protect themselves from phishing attacks. It is used by financial institutions, phish site take down vendors, government organizations, and online merchants. When a potential victim clicks on a phishing link that has been taken down, he / she is redirected to the landing page. In this paper, we present a comparative analysis of two datasets that we obtained from APWG’s landing page log files: one from September 7, 2008 to November 11, 2009, and the other from January 1, 2014 to April 30, 2014. We found that the landing page has been successful in training users against phishing. Forty-six percent of users clicked fewer phishing URLs from January 2014 to April 2014, which shows that training from the landing page helped users not to fall for phishing attacks. Our analysis shows that phishers have started to modify their techniques by creating more legitimate looking URLs and buying a large number of domains to increase their activity. We observed that phishers are exploiting ICANN accredited registrars to launch their attacks even after strict surveillance. We saw that phishers are trying to exploit free subdomain registration services to carry out attacks. In this paper, we also compared the phishing e-mails used by phishers to lure victims in 2008 and 2014. We found that the phishing e-mails have changed considerably over time. Phishers have adopted new techniques like sending promotional e-mails and emotionally targeting users into clicking phishing URLs.
@inproceedings{gupta2014emergingphishingtrends,title={Emerging Phishing Trends and Effectiveness of the Anti-Phishing Landing Page},author={Gupta, S. and Kumaraguru, P.},year={2014},booktitle={9th APWG eCrime Research Symposium (eCRS) 2014},}
eCRS
Analyzing Social and Stylometric Features to Identify Spear phishing Emails
P.
Dewan, A.
Kashyap, and P.
Kumaraguru
In 9th APWG eCrime Research Symposium (eCRS) 2014, 2014
Spear phishing is a complex targeted attack in which, an attacker harvests information about the victim prior to the attack. This information is then used to create sophisticated, genuine-looking attack vectors, drawing the victim to compromise confidential information. What makes spear phishing different, and more powerful than normal phishing, is this contextual information about the victim. Online social media services can be one such source for gathering vital information about an individual. In this paper, we characterize and examine a true positive dataset of spear phishing, spam, and normal phishing emails from Symantec’s enterprise email scanning service. We then present a model to detect spear phishing emails sent to employees of 14 international organizations, by using social features extracted from LinkedIn. Our dataset consists of 4,742 targeted attack emails sent to 2,434 victims, and 9,353 non targeted attack emails sent to 5,912 non victims; and publicly available information from their LinkedIn profiles. We applied various machine learning algorithms to this labeled data, and achieved an overall maximum accuracy of 97.76% in identifying spear phishing emails. We used a combination of social features from LinkedIn profiles, and stylometric features extracted from email subjects, bodies, and attachments. However, we achieved a slightly better accuracy of 98.28% without the social features. Our analysis revealed that social features extracted from LinkedIn do not help in identifying spear phishing emails. To the best of our knowledge, this is one of the first attempts to make use of a combination of stylometric features extracted from emails, and social features extracted from an online social network to detect targeted spear phishing emails.
@inproceedings{dewan2014analyzingsocialand,title={Analyzing Social and Stylometric Features to Identify Spear phishing Emails},author={Dewan, P. and Kashyap, A. and Kumaraguru, P.},year={2014},booktitle={9th APWG eCrime Research Symposium (eCRS) 2014},}
CoDS
Pinned it! A large scale study of the Pinterest network
S.
Mittal, N.
Gupta, P.
Dewan, and P.
Kumaraguru
In 1st ACM IKDD Conference on Data Sciences (CoDS) 2014, 2014
Pinterest is an image-based online social network, which was launched in the year 2010 and has gained a lot of traction ever since. Within 3 years, Pinterest has attained 48.7 million unique users. This stupendous growth makes it interesting to study Pinterest, and gives rise to multiple questions about its users and content. We characterized Pinterest on the basis of large scale crawls of 3.3 million user profiles, and 58.8 million pins. In particular, we explored various attributes of users, pins, boards, pin sources, and user locations in detail, and performed topical analysis of user generated textual content. The characterization revealed most prominent topics among users and pins, top image sources, and geographical distribution of users on Pinterest. We then tried to predict gender of American users based on a set of profile, network, and content features, and achieved an accuracy of 73.17% with a J48 Decision Tree classifier. We then exploited the users’ names by comparing them to a corpus of top male and female names in the U.S.A. and achieved an accuracy of 86.18%. To the best of our knowledge, this is the first attempt to predict gender on Pinterest.
@inproceedings{mittal2014pinneda,title={Pinned it! A large scale study of the Pinterest network},author={Mittal, S. and Gupta, N. and Dewan, P. and Kumaraguru, P.},year={2014},booktitle={1st ACM IKDD Conference on Data Sciences (CoDS) 2014},}
ASIACCS
A Three-Way Investigation of a Game-CAPTCHA: Automated Attacks, Relay Attacks and Usability
M.
Mohamed, N.
Sachdeva, M.
Georgescu, S.
Gao, N.
Saxena, C.
Zhang, P.
Kumaraguru, P.
Van Oorschot, and W.
Chen
In 9th ACM Symposium on Information, Computer and Communications Security (ASIACCS) 2014, 2014
Existing captcha solutions on the Internet are a major source of user frustration. Game captchas are an interesting and, to date, little-studied approach claiming to make captcha solving a fun activity for the users. One broad form of such captchas – called Dynamic Cognitive Game (DCG) captchas – challenge the user to perform a game-like cognitive task interacting with a series of dynamic images. We pursue a comprehensive analysis of a representative category of DCG captchas. We formalize, design and implement such captchas, and dissect them across: (1) fully automated attacks, (2) human-solver relay attacks, and (3) usability. Our results suggest that the studied DCG captchas exhibit high usability and, unlike other known captchas, offer some resistance to relay attacks, but they are also vulnerable to our novel dictionary-based automated attack.
@inproceedings{mohamed2014ainvestigation,title={A Three-Way Investigation of a Game-CAPTCHA: Automated Attacks, Relay Attacks and Usability},author={Mohamed, M. and Sachdeva, N. and Georgescu, M. and Gao, S. and Saxena, N. and Zhang, C. and Kumaraguru, P. and Van Oorschot, P. and Chen, W.},year={2014},booktitle={9th ACM Symposium on Information, Computer and Communications Security (ASIACCS) 2014},}
2013
APCHI
On the Viability of CAPTCHAs for Use in Telephony Systems: A Usability Field Study
N.
Sachdeva, N.
Saxena, and P.
Kumaraguru
In APCHI ’13: Proceedings of the 11th Asia Pacific Conference on Computer Human Interaction, 2013
The usability of security solutions has always been a keen area of interest for researchers. CAPTCHA is one such security solution which presents various usability challenges for users. However, it has successfully reduced the abuse of Internet resources, such as spam. Similar to the Internet, audio-based CAPTCHAs have been proposed as a solution to curb voice spam over telephony. Voice spam is often encountered on telephony in various forms, such as an automated telemarketing call asking to call a number to win millions of dollars. A large percentage of voice spam is generated through automated systems, which introduces the classical challenge of distinguishing machines from humans on telephony. We present a large scale evaluation of audio CAPTCHA from the human perspective over telephony through a field study with 90 participants. We study two primary research questions: how much inconvenience audio CAPTCHA causes to users on telephony, and how different features of the CAPTCHA, e.g., duration and size, influence the usability of audio CAPTCHA on telephony. We found that CAPTCHA could be a viable solution for telephony with improved features, such as better voice and accent. We found that users were relatively close to the expected correct answers, which does suggest the possibility of deploying audio CAPTCHA on telephony platforms in the future. However, we did not find a strong influence of CAPTCHA size and duration on solving accuracy.
@inproceedings{sachdeva2013ontheviability,title={On the Viability of CAPTCHAs for Use in Telephony Systems: A Usability Field Study},author={Sachdeva, N. and Saxena, N. and Kumaraguru, P.},year={2013},booktitle={APCHI '13: Proceedings of the 11th Asia Pacific Conference on Computer Human Interaction},}
I-CARE
MultiOSN: Realtime Monitoring of Real World Events on Multiple Online Social Media
The flow of information in online social media during events has been widely studied in the computer science community. It has also been shown how information picked from online social media can help to eventually aid eventful, especially, crisis situations in real life. However, most of the work has focused on utilizing a single social network for monitoring such events, mostly Twitter. Given the immense popularity and diversity of various online social networks across the globe, studying multiple online social networks during an event can reveal much more information about the event, than a single online social network. In this work, we present MultiOSN, a framework which collects data from five different online social networks viz. Facebook, Twitter, Google+, YouTube, and Flickr, and presents real-time analytics and visualizations. MultiOSN can be particularly helpful to users and organizations which are directly or indirectly connected to law and order. Organizations can utilize MultiOSN to uncover the general sentiment of social media users about an event, and trace public gatherings for example, which are usually discussed and planned publicly on social networking platforms.
@inproceedings{dewan2013realtimemonitoring,title={MultiOSN: Realtime Monitoring of Real World Events on Multiple Online Social Media},author={Dewan, P. and Gupta, M. and Goyal, K. and Kumaraguru, P.},year={2013},booktitle={I-CARE 2013},}
eCRS
$1.00 per RT #BostonMarathon #PrayForBoston: Analyzing Fake Content on Twitter
A.
Gupta, H.
Lamba, and P.
Kumaraguru
In IEEE APWG eCrime Research Summit (eCRS) 2013, 2013
Online social media has emerged as one of the prominent channels for dissemination of information during real world events. Malicious content is posted online during events, which can result in damage, chaos and monetary losses in the real world. We analyzed one such medium, Twitter, for content generated during the Boston Marathon blasts, which occurred on April 15, 2013. A lot of fake content and malicious profiles originated on the Twitter network during this event. The aim of this work is to perform an in-depth characterization of the factors that influenced malicious content and profiles becoming viral. Our results showed that 29% of the most viral content on Twitter during the Boston crisis was rumors and fake content, while 51% was generic opinions and comments, and the rest was true information. We found that a large number of users with high social reputation and verified accounts were responsible for spreading the fake content. Next, we used a regression prediction model to verify that the overall impact of all users who propagate the fake content at a given time can be used to estimate the growth of that content in the future. Many malicious accounts were created on Twitter during the Boston event that were later suspended by Twitter. We identified over six thousand such user profiles and observed that the creation of such profiles surged considerably right after the blasts occurred. We identified closed community structure and star formation in the interaction network of these suspended profiles amongst themselves.
@inproceedings{gupta2013perrt,title={\$1.00 per RT \#BostonMarathon \#PrayForBoston: Analyzing Fake Content on Twitter},author={Gupta, A. and Lamba, H. and Kumaraguru, P.},year={2013},booktitle={IEEE APWG eCrime Research Summit (eCRS) 2013},}
COSN
Call Me MayBe: Understanding Nature and Risks of Sharing Mobile Numbers on Online Social Networks
P.
Jain, P.
Jain, and P.
Kumaraguru
In Conference on Online Social Networks (COSN) 2013, 2013
There is great concern about the potential for people to leak private information on OSNs, but there are few quantitative studies on this. This research explores the activity of sharing mobile numbers on OSNs, via public profiles and posts. We attempt to understand the characteristics and risks of mobile number sharing behaviour on OSNs and focus on Indian mobile numbers. We collected 76,347 unique mobile numbers posted by 85,905 users on Twitter and Facebook and analysed 2,997 numbers prefixed with +91. We observed that most users shared their own mobile numbers to spread urgent information, and to market products and escort business. Fewer female users shared mobile numbers on OSNs. Users utilized other OSN platforms and third party applications like Twitterfeed to post mobile numbers on multiple OSNs. In contrast to the user’s perception of numbers spreading quickly on OSN, we observed that except for emergency, most numbers did not diffuse deep. To assess risks associated with mobile numbers exposed on OSNs, we used numbers to gain sensitive information about their owners (e.g. name, Voter ID) by collating publicly available data from OSNs, Truecaller, OCEAN. On using the numbers on WhatsApp, we obtained a myriad of sensitive details (relationship status, BBM pins) of the number owner. We communicated the observed risks to the owners by calling. Few users were surprised to know about the online presence of their number, while a few others intentionally posted it online for business purposes. We observed that 38.3% of users who were unaware of the online presence of their number had posted their number themselves on the social network. With these observations, we highlight that there is a need to monitor leakage of mobile numbers via profile and public posts. To the best of our knowledge, this is the first exploratory study to critically investigate the exposure of Indian mobile numbers on OSNs.
@inproceedings{jain2013callme,title={Call Me MayBe: Understanding Nature and Risks of Sharing Mobile Numbers on Online Social Networks},author={Jain, P. and Jain, P. and Kumaraguru, P.},year={2013},booktitle={Conference on Online Social Networks (COSN) 2013},}
SNAKDD
Network Flows and the Link Prediction Problem
K.
Narang, K.
Lerman, and P.
Kumaraguru
In 7th Workshop on Social Network Mining and Analysis (SNAKDD) 2013, 2013
Link prediction is used by many applications to recommend new products or social connections to people. Link prediction leverages information in network structure to identify missing links or predict which new ones will form in the future. Recent research has provided a theoretical justification for the success of some popular link prediction heuristics, such as the number of common neighbors and the Adamic-Adar score, by showing that they estimate the distance between nodes in some latent feature space. In this paper we examine the link prediction task from the novel perspective of network flows. We show that how easily two nodes can interact with or influence each other depends not only on their position in the network, but also on the nature of the flow that mediates interactions between them. We show that different types of flows lead to different notions of network proximity, some of which are mathematically equivalent to existing link prediction heuristics. We measure the performance of different heuristics on the missing link prediction task in a variety of real-world social, technological and biological networks. We show that heuristics based on random walk-type processes outperform the popular Adamic-Adar and the number of common neighbors heuristics in many networks.
@inproceedings{narang2013networkflowsand,title={Network Flows and the Link Prediction Problem},author={Narang, K. and Lerman, K. and Kumaraguru, P.},year={2013},booktitle={7th Workshop on Social Network Mining and Analysis (SNAKDD) 2013},}
Mobile HCI
The Paper Slip Should be There: Perceptions of Transaction Receipts in Branchless Banking
S.
Panjwani, M.
Ghosh, S.
Singh, and P.
Kumaraguru
Mobile-based branchless banking has become a key mechanism for enabling financial inclusion in the developing world. A key component of all branchless banking systems is a mechanism to provide receipts to users after each transaction as evidence for successful transaction completion. In this paper, we present results from a field study that explores user perceptions of different receipt delivery mechanisms in the context of a branchless banking system in India. Our study shows that users have an affinity for paper receipts: despite the provision of an SMS receipt functionality by the system developers and their discouragement of the use of paper, users have pro-actively initiated a practice of issuing and accepting paper receipts. Several users are aware of the security limitations of paper receipts but continue to use them because of their usability benefits. We conclude with design recommendations for receipt delivery systems in branchless banking.
@inproceedings{panjwani2013thepaperslip,title={The Paper Slip Should be There: Perceptions of Transaction Receipts in Branchless Banking},author={Panjwani, S. and Ghosh, M. and Singh, S. and Kumaraguru, P.},year={2013},booktitle={Mobile HCI 2013},}
IITI
Limited Attention and Centrality in Social Networks
K.
Lerman, P.
Jain, R.
Ghosh, J.
Kang, and P.
Kumaraguru
In Proceedings of Intelligence and Technology, 2013
How does one find important or influential people in an online social network? Researchers have proposed a variety of centrality measures to identify individuals that are, for example, often visited by a random walk, infected in an epidemic, or receive many messages from friends. Recent research suggests that social media users’ capacity to respond to an incoming message is constrained by their finite attention, which they divide over all incoming information, i.e., information sent by users they follow. We propose a new measure of centrality, a limited-attention version of Bonacich’s Alpha-centrality, that models the effect of limited attention on epidemic diffusion. The new measure describes a process in which nodes broadcast messages to their out-neighbors, but the neighbors’ ability to receive the message depends on the number of in-neighbors they have. We evaluate the proposed measure on real-world online social networks and show that it can better reproduce an empirical influence ranking of users than other popular centrality measures.
@inproceedings{lerman2013limitedattentionand,title={Limited Attention and Centrality in Social Networks},author={Lerman, K. and Jain, P. and Ghosh, R. and Kang, J. and Kumaraguru, P.},year={2013},booktitle={Proceedings of Intelligence and Technology},}
WWW
uTrack: Track Yourself! Monitoring Information on Online Social Media
T.
Magalhães, P.
Dewan, P.
Kumaraguru, R.
Melo-Minardi, and V.
Almeida
In 22nd International World Wide Web Conference (WWW), 2013
The past decade has witnessed an astounding outburst in the number of online social media (OSM) services, and a lot of these services have enthralled millions of users across the globe. With such a tremendous number of users, the amount of content being generated and shared on OSM services is also enormous. As a result, trying to visualize this overwhelming amount of content and gain useful insights from it has become a challenge. In this work, we present uTrack, a personalized web service to analyze and visualize the diffusion of content shared by users across multiple OSM platforms. To the best of our knowledge, there exists no work which concentrates on monitoring information diffusion for personal accounts. Currently, uTrack monitors and supports logging in from Facebook, Twitter, and Google+. Once granted permissions by the user, uTrack monitors all URLs (like videos, photos, news articles) the user has shared in all OSM services supported, and generates useful visualizations and statistics from the collected data.
@inproceedings{magalhães2013track,title={uTrack: Track Yourself! Monitoring Information on Online Social Media},author={Magalhães, T. and Dewan, P. and Kumaraguru, P. and Melo-Minardi, R. and Almeida, V.},year={2013},booktitle={22nd International World Wide Web Conference (WWW)},}
PSOSM
Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy
A.
Gupta, H.
Lamba, P.
Kumaraguru, and A.
Joshi
In 2nd International Workshop on Privacy and Security in Online Social Media (PSOSM), in conjunction with the 22nd International World Wide Web Conference (WWW), 2013
In today’s world, online social media plays a vital role during real world events, especially crisis events. There are both positive and negative effects of social media coverage of events: it can be used by authorities for effective disaster management or by malicious entities to spread rumors and fake news. The aim of this paper is to highlight the role of Twitter during Hurricane Sandy (2012) in spreading fake images about the disaster. We identified 10,350 unique tweets containing fake images that were circulated on Twitter during Hurricane Sandy. We performed a characterization analysis to understand the temporal, social reputation and influence patterns for the spread of fake images. Eighty-six percent of tweets spreading the fake images were retweets, hence very few were original tweets. Our results showed that the top thirty users out of 10,215 users (0.3%) resulted in 90% of the retweets of fake images; also, network links such as follower relationships on Twitter contributed very little (only 11%) to the spread of these fake photo URLs. Next, we used classification models to distinguish fake images from real images of Hurricane Sandy. Best results were obtained from a Decision Tree classifier, with which we got 97% accuracy in predicting fake images from real. Also, tweet based features were very effective in distinguishing fake image tweets from real, while the performance of user based features was very poor. Our results showed that automated techniques can be used in identifying real images from fake images posted on Twitter.
@inproceedings{gupta2013fakingcharacterizing,title={Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy},author={Gupta, A. and Lamba, H. and Kumaraguru, P. and Joshi, A.},year={2013},booktitle={2nd International Workshop on Privacy and Security in Online Social Media (PSOSM), in conjunction with the 22nd International World Wide Web Conference (WWW)},}
WoLE
I seek ’fb.me’: Identifying Users across Multiple Online Social Networks
P.
Jain, P.
Kumaraguru, and A.
Joshi
In 2nd International Workshop on Web of Linked Entities (WoLE), in conjunction with the 22nd International World Wide Web Conference (WWW), 2013
An online user joins multiple social networks in order to enjoy different services. On each joined social network, she creates an identity and constitutes its three major dimensions, namely profile, content and connection network. She largely governs her identity formulation on any social network and therefore can manipulate multiple aspects of it. With no global identifier to mark her presence uniquely in the online domain, her online identities remain unlinked, isolated and difficult to search. Literature has proposed identity search methods on the basis of profile attributes, but has left the other identity dimensions, e.g. content and network, unexplored. In this work, we introduce two novel identity search algorithms based on content and network attributes and improve on the traditional identity search algorithm based on profile attributes of a user. We apply the proposed identity search algorithms to find a user’s identity on Facebook, given her identity on Twitter. We report that a combination of the proposed identity search algorithms found the Facebook identity for 39% of Twitter users searched, while the traditional method based on profile attributes found the Facebook identity for only 27.4%. Each proposed identity search algorithm accesses publicly accessible attributes of a user on any social network. We deploy an identity resolution system, Finding Nemo, which uses the proposed identity search methods to find a Twitter user’s identity on Facebook. We conclude that inclusion of more than one identity search algorithm, each exploiting distinct dimensional attributes of an identity, helps in improving the accuracy of an identity resolution process.
@inproceedings{jain2013iseek,title={I seek 'fb.me': Identifying Users across Multiple Online Social Networks},author={Jain, P. and Kumaraguru, P. and Joshi, A.},year={2013},booktitle={2nd International Workshop on Web of Linked Entities (WoLE), in conjunction with the 22nd International World Wide Web Conference (WWW)},}
MSND
Detection of Spam Tipping Behaviour on Foursquare
A.
Aggarwal, J.
Almeida, and P.
Kumaraguru
In 2nd International Workshop on Mining Social Network Dynamics (MSND), in conjunction with the 22nd International World Wide Web Conference (WWW), 2013
In Foursquare, one of the currently most popular online location based social networking sites (LBSNs), users may not only check-in at specific venues but also post comments (or tips), sharing their opinions and previous experiences at the corresponding physical places. Foursquare tips, which are visible to everyone, provide venue owners with valuable user feedback besides helping other users to make an opinion about the specific venue. However, they have been the target of spamming activity by users who exploit this feature to spread tips with unrelated content. In this paper, we present what, to our knowledge, is the first effort to identify and analyze different patterns of tip spamming activity in Foursquare, with the goal of developing automatic tools to detect users who post spam tips - tip spammers. A manual investigation of a real dataset collected from Foursquare led us to identify four categories of spamming behavior, viz. Advertising/Spam, Self-promotion, Abusive and Malicious. We then applied machine learning techniques, jointly with a selected set of user, social and tip’s content features associated with each user, to develop automatic detection tools. Our experimental results indicate that we are able to not only correctly distinguish legitimate users from tip spammers with high accuracy (89.76%) but also correctly identify a large fraction (at least 78.88%) of spammers in each identified category.
@inproceedings{aggarwal2013detectionofspam,title={Detection of Spam Tipping Behaviour on Foursquare},author={Aggarwal, A. and Almeida, J. and Kumaraguru, P.},year={2013},booktitle={2nd International Workshop on Mining Social Network Dynamics (MSND), in conjunction with the 22nd International World Wide Web Conference (WWW)},}
ICWSM
Ladies First: Analyzing Gender Roles and Behaviors in Pinterest
R.
Ottoni, J.
Pesce, D.
Casas, G.
Franciscani, W.
Meira, P.
Kumaraguru, and V.
Almeida
In The International AAAI Conference on Weblogs and Social Media (ICWSM) 2013, 2013
Online social networks (OSNs) have become popular platforms for people to connect and interact with each other. Among those networks, Pinterest has recently become noteworthy for its growth and promotion of visual over textual content. The purpose of this study is to analyze this image-based network in a gender-sensitive fashion, in order to understand (i) user motivation and usage pattern in the network, (ii) how communications and social interactions happen and (iii) how users describe themselves to others. This work is based on more than 220 million items generated by 683,273 users. We were able to find significant differences w.r.t. all mentioned aspects. We observed that, although the network does not encourage direct social communication, females make more use of lightweight interactions than males. Moreover, females invest more effort in reciprocating social links, are more active and generalist in content generation, and describe themselves using words of affection and positive emotions. Males, on the other hand, are more likely to be specialists and tend to describe themselves in an assertive way. We also observed that each gender has different interests in the network: females tend to make more use of the network’s commercial capabilities, while males are more prone to the role of curators of items that reflect their personal taste. It is important to understand gender differences in online social networks, so one can design services and applications that leverage human social interactions and provide more targeted and relevant user experiences.
@inproceedings{ottoni2013ladiesanalyzing,title={Ladies First: Analyzing Gender Roles and Behaviors in Pinterest},author={Ottoni, R. and Pesce, J. and Casas, D. and Franciscani, G. and Meira, W. and Kumaraguru, P. and Almeida, V.},year={2013},booktitle={The International AAAI Conference on Weblogs and Social Media (ICWSM) 2013},}
2012
ESNAM
Misinformation on Twitter during Crisis Events
A.
Gupta, and P.
Kumaraguru
In Encyclopedia of Social Network Analysis and Mining (ESNAM), 2012
During crises, the proliferation of misinformation, often termed "infodemics," can severely compromise Shared Situational Awareness (SSA) and impede effective response. With the advent of technology, social media platforms have become crucial tools for response agencies to counteract misinformation and promote SSA. Yet, the intricate dynamics between information dissemination, communication strategies, and trust, especially in the digital realm, remain underexplored. This research looks at the utilisation of technology, specifically social media platforms like Facebook, by response agencies to navigate the challenges of infodemics. Drawing from Seppänen et al. (2013) SSA model, we identified potential risks in digital crisis communication strategies that might undermine public trust and SSA. We used a netnographic analysis of the response agencies’ social media pages, supplemented by field interviews with agency representatives. Our findings contribute to the fields of Information Systems (IS) and communication by 1) highlighting the potential of technology, particularly social media, in crisis communication and misinformation mitigation, 2) identifying the risks and pitfalls of leveraging digital platforms during crises, and 3) underlying the consequences of diminishing public trust in official digital information channels, offering insights into mitigating misinformation and improving crisis response.
@inproceedings{gupta2012misinformationontwitter,title={Misinformation on Twitter during Crisis Events},author={Gupta, A. and Kumaraguru, P.},year={2012},booktitle={Encyclopedia of Social Network Analysis and Mining (ESNAM)},}
MDM
Take Control of Your SMSes: Designing an Usable Spam SMS Filtering System
K. Yadav, S. Saha, P. Kumaraguru, and R. Kumra
In 13th International Conference on Mobile Data Management (MDM), 2012
Short Message Service (SMS) is one of the most frequently used services on mobile phones, next to calls. In developing countries like India, SMS is the cheapest mode of communication. Advertising companies exploit this fact to reach the masses. Unsolicited SMS messages (a.k.a. spam SMS) generate notifications, thus consuming precious user attention. To formulate the spam SMS problem and understand users’ needs and perceptions, we conducted an online survey with 458 participants in different cities of India. Most of the survey participants admitted that they are quite annoyed with bursts of SMS spam and the ineffectiveness of regulatory solutions. However, some participants reported that they do sometimes get useful information from spam SMSes (e.g. discounts at a popular food joint). In this paper, we present the design and implementation of a user-centric spam SMS filtering application, SMSAssassin, that uses content-based machine learning techniques with user-generated features to filter unwanted SMSes and reduce the burden of notifications for a mobile user.
@inproceedings{yadav2012takecontrolof,title={Take Control of Your SMSes: Designing an Usable Spam SMS Filtering System},author={Yadav, K. and Saha, S. and Kumaraguru, P. and Kumra, R.},year={2012},booktitle={13th International Conference on Mobile Data Management (MDM)},}
eCRS
PhishAri: Automatic Realtime Phishing Detection on Twitter
A. Aggarwal, A. Rajadesingan, and P. Kumaraguru
In Seventh IEEE APWG eCrime Research Summit (eCRS), 2012
With the advent of online social media, phishers have started using social networks like Twitter, Facebook, and Foursquare to spread phishing scams. Twitter is an immensely popular micro-blogging network where people post short messages of 140 characters called tweets. It has over 100 million active users who post about 200 million tweets every day. Phishers have started using Twitter as a medium to spread phishing because of this vast information dissemination. Further, unlike in email, it is difficult to detect phishing on Twitter because of the quick spread of phishing links in the network, the short size of the content, and the use of URL obfuscation to shorten the URL. Our technique, PhishAri, detects phishing on Twitter in realtime. We use Twitter specific features along with URL features to detect whether a tweet posted with a URL is phishing or not. Some of the Twitter specific features we use are tweet content and its characteristics like length, hashtags, and mentions. Other Twitter features used are the characteristics of the Twitter user posting the tweet such as age of the account, number of tweets, and the follower-followee ratio. These Twitter specific features coupled with URL based features prove to be a strong mechanism to detect phishing tweets. We use machine learning classification techniques and detect phishing tweets with an accuracy of 92.52%. We have deployed our system for end-users by providing an easy to use Chrome browser extension. The extension works in realtime and classifies a tweet as phishing or safe. In this research, we show that we are able to detect phishing tweets at zero hour with high accuracy, which is much faster than public blacklists as well as Twitter’s own defense mechanism to detect malicious content. We also performed a quick user evaluation of PhishAri in a laboratory study to evaluate the usability and effectiveness of PhishAri and showed that users like PhishAri and find it convenient to use in the real world.
To the best of o...
@inproceedings{aggarwal2012automaticrealtime,title={PhishAri: Automatic Realtime Phishing Detection on Twitter},author={Aggarwal, A. and Rajadesingan, A. and Kumaraguru, P.},year={2012},booktitle={Seventh IEEE APWG eCrime Research Summit (eCRS)},}
PinSoda
Beware of What You Share: Inferring Home Location in Social Networks
T. Pontes, G. Magno, M. Vasconcelos, A. Gupta, J. Almeida, P. Kumaraguru, and V. Almeida
In Privacy in Social Data (PinSoda), in conjunction with International Conference on Data Mining (ICDM), 2012
In recent years, social media users are voluntarily making large volume of personal data available on the social networks. Such data (e.g. and professional associations) can create opportunities for users to strengthen their social and professional ties. However, the same data can also be used against the user for viral marketing and other unsolicited purposes. The invasion of privacy occurs due to privacy unawareness and carelessness of making information publicly available. In this paper, we perform a large-scale inference study in three of the currently most popular social networks: Foursquare, Google+ and Twitter. Our work focuses on inferring a user’s home location, which may be a private attribute, for many users. We analyze whether a simple method can be used to infer the user home location using publicly available attributes and also the geographic information associated with locatable friends. We find that it is possible to infer the user home city with a high accuracy, around 67%, 72% and 82% of the cases in Foursquare, Google+ and Twitter, respectively. We also apply a finer-grained inference that reveals the geographic coordinates of the residence of a selected group of users in our datasets, achieving approximately up to 60% of accuracy within a radius of six kilometers.
@inproceedings{pontes2012bewareofwhat,title={Beware of What You Share: Inferring Home Location in Social Networks},author={Pontes, T. and Magno, G. and Vasconcelos, M. and Gupta, A. and Almeida, J. and Kumaraguru, P. and Almeida, V.},year={2012},booktitle={Privacy in Social Data (PinSoda), in conjunction with International Conference on Data Mining (ICDM)},}
UBM @ CIKM
Identifying and Characterizing User Communities on Twitter during Crisis Events
A. Gupta, A. Joshi, and P. Kumaraguru
In Workshop on Data-driven User Behavioral Modelling and Mining from Social Media, Co-located with CIKM, 2012
Twitter is a prominent online social media platform used to share information and opinions. Previous research has shown that current real world news topics and events dominate the discussions on Twitter. In this paper, we present a preliminary study to identify and characterize communities from a set of users who post messages on Twitter during crisis events. We present our work in progress by analyzing three major crisis events of 2011 as case studies (Hurricane Irene, Riots in England, and Earthquake in Virginia). Hurricane Irene alone caused damage of about 7-10 billion USD and claimed 56 lives. The aim of this paper is to identify the different user communities, and characterize them by the top central users. First, we defined a similarity metric between users based on their links, content posted and meta-data. Second, we applied spectral clustering to obtain communities of users formed during three different crisis events. Third, we evaluated the mechanism to identify top central users using degree centrality; we showed that the top users represent the topics and opinions of all the users in the community with 81% accuracy on average. The top central people identified represent what the entire community shares. Therefore, to understand a community, we need to monitor and analyze only these top users rather than all the users in a community.
@inproceedings{gupta2012identifyingandcharacterizing,title={Identifying and Characterizing User Communities on Twitter during Crisis Events},author={Gupta, A. and Joshi, A. and Kumaraguru, P.},year={2012},booktitle={Workshop on Data-driven User Behavioral Modelling and Mining from Social Media, Co-located with CIKM},}
CSOSN
Studying user footprints in different online social networks
A. Malhotra, L. Totti, W. Meira, P. Kumaraguru, and V. Almeida
In International Workshop on Cybersecurity of Online Social Network (CSOSN), 2012
With the growing popularity and usage of online social media services, people now have accounts (sometimes several) on multiple and diverse services like Facebook, LinkedIn, Twitter and YouTube. Publicly available information can be used to create a digital footprint of any user using these social media services. Generating such digital footprints can be very useful for personalization, profile management, and detecting malicious behavior of users. A very important application of analyzing users’ online digital footprints is to protect users from potential privacy and security risks arising from the huge publicly available user information. We extracted information about user identities on different social networks through Social Graph API, FriendFeed, and Profilactic; we collated our own dataset to create the digital footprints of the users. We used username, display name, description, location, profile image, and number of connections to generate the digital footprints of the user. We applied context specific techniques (e.g. Jaro Winkler similarity, Wordnet based ontologies) to measure the similarity of the user profiles on different social networks. We specifically focused on Twitter and LinkedIn. In this paper, we present the analysis and results from applying automated classifiers for disambiguating profiles belonging to the same user from different social networks. UserID and Name were found to be the most discriminative features for disambiguating user profiles. Using the most promising set of features and similarity metrics, we achieved accuracy, precision and recall of 98%, 99%, and 96%, respectively.
@inproceedings{malhotra2012studyinguserfootprints,title={Studying user footprints in different online social networks},author={Malhotra, A. and Totti, L. and Meira, W. and Kumaraguru, P. and Almeida, V.},year={2012},booktitle={International Workshop on Cybersecurity of Online Social Network (CSOSN)},}
LBSN
We Know Where You Live: Privacy Characterization of Foursquare Behavior
In the last few years, the increasing interest in location-based services (LBS) has favored the introduction of geo-referenced information in various Web 2.0 applications, as well as the rise of location-based social networks (LBSN). Foursquare, one of the most popular LBSNs, gives incentives to users who visit (check in) specific places (venues) by means of, for instance, mayorships to frequent visitors. Moreover, users may leave tips at specific venues as well as mark previous tips as done as a sign of agreement. Unlike check ins, which are shared only with friends, the lists of mayorships, tips and dones of a user are publicly available to everyone, thus raising concerns about disclosure of the user’s movement patterns and interests. We analyze how users explore these publicly available features, and their potential as sources of information leakage. Specifically, we characterize the use of mayorships, tips and dones in Foursquare based on a dataset with around 13 million users. We also analyze whether it is possible to easily infer the home city (state and country) of a user from this publicly available information. Our results indicate that one can easily infer the home city of around 78% of the analyzed users within 50 kilometers.
@inproceedings{tatiana:we-know-where-you-live:2012:yuqfj,title={We Know Where You Live: Privacy Characterization of Foursquare Behavior},author={Pontes, Tatiana and Vasconcelos, Marisa and Almeida, Jussara and Kumaraguru, Ponnurangam and Almeida, Virgilio},year={2012},booktitle={Proceedings of the 2012 ACM Conference on Ubiquitous Computing},}
Tring! Tring! - An Exploration and Analysis of Interactive Voice Response Systems
In developing regions like India, voice based telecommunication services are one of the most appropriate media for information dissemination as they overcome the prevalent low literacy rate. However, voice based Interactive Voice Response (IVR) systems are still not exploited to their full potential and are commonly considered frustrating to use. We did a real world experiment to investigate the usability issues of a voice based system. In this paper, we report an analysis of our experimental IVR and the interface difficulties experienced by the user. We also highlight user behavior towards accessing critical and non-critical information over multiple information media, namely IVR, web and talking to a human on the phone. The findings suggest that an IVR which can adapt its behavior will prove to be more efficient and provide a better user experience. We believe that our results can be used for efficient development of next-generation adaptable IVR systems.
@inproceedings{asthana:tring-tring---an-explorat:2012:nrtys,title={Tring! Tring! - An Exploration and Analysis of Interactive Voice Response Systems},author={Asthana, Siddharth and Singh, Pushpendra and Kumaraguru, Ponnurangam and Singh, Amarjeet and Naik, Vinayak},year={2012},booktitle={},}
PSOSM
Credibility Ranking of Tweets during High Impact Events
Aditi Gupta and Ponnurangam Kumaraguru
In Proceedings of the 1st Workshop on Privacy and Security in Online Social Media, 2012
Twitter has evolved from being a conversation or opinion sharing medium among friends into a platform to share and disseminate information about current events. Events in the real world create a corresponding spur of posts (tweets) on Twitter. Not all content posted on Twitter is trustworthy or useful in providing information about the event. In this paper, we analyzed the credibility of information in tweets corresponding to fourteen high impact news events of 2011 around the globe. From the data we analyzed, on average 30% of total tweets posted about an event contained situational information about the event while 14% was spam. Only 17% of the total tweets posted about the event contained situational awareness information that was credible. Using regression analysis, we identified the important content and source based features, which can predict the credibility of information in a tweet. Prominent content based features were the number of unique characters, swear words, pronouns, and emoticons in a tweet; prominent user based features were the number of followers and the length of the username. We adopted a supervised machine learning and relevance feedback approach using the above features, to rank tweets according to their credibility score. The performance of our ranking algorithm was significantly enhanced when we applied a re-ranking strategy. Results show that extraction of credible information from Twitter can be automated with high confidence.
@inproceedings{rengamani:the-unique-identification-num:2010,title={Credibility Ranking of Tweets during High Impact Events},author={Gupta, Aditi and Kumaraguru, Ponnurangam},year={2012},booktitle={Proceedings of the 1st Workshop on Privacy and Security in Online Social Media},}
2011
WebMedia
Privacy Albeit Late
G. Rauber, V. Almeida, and P. Kumaraguru
In Seventeenth edition of WebMedia, Brazilian Symposium on Multimedia and the Web, 2011
Online Social Networks (OSNs) such as Facebook and Twitter have experienced exponential growth in recent years. Users are spending more time on OSNs than on any other sites and services on the Internet. Users post and share a lot of personal information on these sites without being aware of privacy implications or simply not caring much about them, which turns out to be a treasure for marketing companies and cyber criminals. Characterizing the privacy awareness of users is important to design technologies and policy solutions. Users expect the OSN to provide good privacy protection or controls so they can make informed decisions about their privacy. This paper investigates the privacy awareness of users on Facebook using real-world data (not self-reported). The main findings are: only a low percentage of users change the default privacy settings; a large percentage of users expose their gender publicly; women are more concerned about disclosing personal information online; many users share their photo albums and links (content) with everyone; users exercise more control over content that is potentially more dangerous to their reputation. The present study is one of the first to characterize privacy awareness on OSNs through a real world experiment. Implications of the study are discussed.
@inproceedings{rauber2011privacyalbeitlate,title={Privacy Albeit Late},author={Rauber, G. and Almeida, V. and Kumaraguru, P.},year={2011},booktitle={Seventeenth edition of WebMedia, Brazilian Symposium on Multimedia and the Web},}
HotMobile
User Controllable Security and Privacy for Mobile Mashups
A new paradigm in the domain of mobile applications is ’mobile mashups’, where Web content rendered on a mobile browser is amalgamated with data and features available on the device, such as user location, calendar information and camera. Although a number of frameworks exist that enable creation and execution of mobile mashups, they fail to address a very important issue of handling security and privacy considerations of a mobile user. In this paper, we characterize the nature of access control required for utilizing device features in a mashup setting; design a security and privacy middleware based on the well known XACML policy language; and describe how the middleware enables a user to easily control usage of device features. Implementation-wise, we realize our middleware on Android platform (but easily generalizable to other platforms), integrate it with an existing mashup framework, and demonstrate its utility through an e-commerce mobile mashup.
@inproceedings{adappa:user-controllable-securit:2011:kxyqv,title={{User Controllable Security and Privacy for Mobile Mashups}},author={Adappa, Shruthi and Agarwal, Vikas and Goyal, Sunil and Kumaraguru, Ponnurangam and Mittal, Sumit},year={2011},booktitle={Proceedings of the 12th Workshop on Mobile Computing Systems and Applications, Hotmobile 2011},}
IEEE SOLI
Enhancing the Rural Self Help Group – Bank Linkage Program
Empowerment of Self Help Groups (SHGs) is a dominating aspect as the micro-finance industry ushers into an era of maturity. Today SHGs are widely recognized as the hubs for information dissemination within villages and entry points for financial institutions as well as consumer goods organizations, though less has been done to deal with this highly illiterate population in terms of upgrading their skill sets or making them competent enough to soak the deluge of knowledge intensive programs aligned for them. In this paper, we observe that mobile penetration, the ease with which rural population uses the voice interface, and acceptability of mobile related technologies, all bring us to the confluence of mobility and innovative interaction technologies that can help in designing a system for the next billion population. We propose a system that uses voice as a medium to percolate knowledge through the thick layers of illiteracy, thereby serving as an effective mechanism to bring about a paradigm shift in the way SHGs are formed, operate and interact with the Micro-finance Institution (MFI). This system enables low cost financial services to be comprehended and adopted by the SHGs while empowering them to raise concerns and undertake active participation. This kind of empowerment of SHGs is unseen till date and can lead to, especially in case of women, better representation in elections of local panchayats, dowry upliftment and other social advancements, not understating the success of MFIs. Our system is designed and realized using IBM’s Spoken Web technology that employs an easy-to-use voice interface to create dynamic content in local vernacular language, based on the concept of ’Voice Sites’, interconnected by ’Voice Links’.
@inproceedings{agarwal:enhancing-the-rural-self-:2011:yuqfj,title={Enhancing the Rural Self Help Group -- Bank Linkage Program},author={Agarwal, Vikas and Desai, Vikram and Kapoor, Shalini and Kumaraguru, Ponnurangam and Mittal, Sumit},year={2011},booktitle={Published in 2011 Annual SRII Global Conference},}
CEAS
Phi.sh/$oCiaL: The Phishing Landscape through Short URLs
Sidharth Chhabra, Anupama Aggarwal, Fabricio Benevenuto, and Ponnurangam Kumaraguru
In The 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, CEAS 2011, 2011
The size, accessibility, and rate of growth of Online Social Media (OSM) have attracted cyber crime. One form of cyber crime that has been increasing steadily is phishing, where the goal (for the phishers) is to steal personal information from users which can be used for fraudulent purposes. Although the research community and industry have been developing techniques to identify phishing attacks through emails and instant messaging (IM), there is very little research that provides a deeper understanding of phishing in online social media. Due to constraints of limited text space in social systems like Twitter, phishers have begun to use URL shortener services. In this study, we provide an overview of phishing attacks for this new scenario. One of our main conclusions is that phishers are using URL shorteners not only for reducing space but also to hide their identity. We observe that social media websites like Facebook, Habbo, Orkut are competing with e-commerce services like PayPal, eBay in terms of traffic and focus of phishers. Orkut, Habbo, and Facebook are amongst the top 5 brands targeted by phishers. We study the referrals from Twitter to understand the evolving phishing strategy. A staggering 89% of references from Twitter (users) are inorganic accounts which are sparsely connected amongst themselves, but have a large number of followers and followees. We observe that most of the phishing tweets spread by extensive use of attractive words and multiple hashtags. To the best of our knowledge, this is the first study to connect the phishing landscape using blacklisted phishing URLs from PhishTank, URL statistics from bit.ly and cues from Twitter to track the impact of phishing in online social media.
@inproceedings{chhabra:phi.sh/ocial:-the-phishin:2011:yuqfj,title={{Phi.sh/\$oCiaL: The Phishing Landscape through Short URLs}},author={Chhabra, Sidharth and Aggarwal, Anupama and Benevenuto, Fabricio and Kumaraguru, Ponnurangam},year={2011},booktitle={The 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, CEAS 2011},}
IIWeb
Integrating Linked Open Data with Unstructured Text for Intelligence Gathering Tasks
We present techniques for uncovering links between terror incidents, organizations, and people involved with these incidents. Our methods involve performing shallow NLP tasks to extract entities of interest from documents and using linguistic pattern matching and filtering techniques to assign specific relations to the entities discovered. We also gather more information about these entities from the Linked Open Data Cloud, and further allow human analysts to add intelligent inference rules appropriate to the domain. All this information is integrated in a knowledge base in the form of a graph that maintains the semantics between different types of nodes involved in the graph. This knowledge base can then be queried by the analysts to create actionable intelligence.
@inproceedings{gupta:twitter-credibility-ranki:2011:yuqfj,title={Integrating Linked Open Data with Unstructured Text for Intelligence Gathering Tasks},author={Gupta, Archit and Viswanathan, Krishnamurthy Koduvayur and Joshi, Anupam and Finin, Timothy and Kumaraguru, Ponnurangam},year={2011},booktitle={Proceedings of the 8th International Workshop on Information Integration on the Web},}
PSOSM
@Twitter Credibility Ranking of Tweets on Events #breakingnews
Aditi Gupta and Ponnurangam Kumaraguru
In Proceedings of the 1st Workshop on Privacy and Security in Online Social Media, 2011
Twitter has evolved from being a conversation or opinion sharing medium among friends into a platform to share and disseminate information about current events. Events in the real world create a corresponding spur of posts (tweets) on Twitter. Not all content posted on Twitter is trustworthy or useful in providing information about the event. In this paper, we analyzed the credibility of information in tweets corresponding to fourteen high impact news events of 2011 around the globe. From the data we analyzed, on average 30% of total tweets posted about an event contained situational information about the event while 14% was spam. Only 17% of the total tweets posted about the event contained situational awareness information that was credible. Using regression analysis, we identified the important content and source based features, which can predict the credibility of information in a tweet. Prominent content based features were the number of unique characters, swear words, pronouns, and emoticons in a tweet; prominent user based features were the number of followers and the length of the username. We adopted a supervised machine learning and relevance feedback approach using the above features, to rank tweets according to their credibility score. The performance of our ranking algorithm was significantly enhanced when we applied a re-ranking strategy. Results show that extraction of credible information from Twitter can be automated with high confidence.
@inproceedings{gupta:twitter-explodes-with-act:2011:yuqfj,title={{\@Twitter Credibility Ranking of Tweets on Events \#breakingnews}},author={Gupta, Aditi and Kumaraguru, Ponnurangam},year={2011},booktitle={Proceedings of the 1st Workshop on Privacy and Security in Online Social Media},}
Twitter Explodes with Activity in Mumbai Blasts! A Lifeline or an Unmonitored Daemon in the Lurking?
Online social media has become an integral part of every Internet user’s life. It has given common people a platform and forum to share information, post their opinions and promote campaigns. The threat of exploitation of social media like Facebook, Twitter, etc. by malicious entities becomes crucial during a crisis situation, like bomb blasts or natural calamities such as earthquakes and floods. In this report, we attempt to characterize and extract patterns of activity of general users on Twitter during a crisis situation. This is the first attempt to study an India-centric crisis event such as the triple bomb blasts in Mumbai (India), using online social media. In this research, we perform content and activity analysis of content posted on Twitter after the bomb blasts. Through our analysis, we conclude that the number of URLs and @-mentions in tweets increases during the time of the crisis in comparison to what researchers have exhibited for normal circumstances. In addition to the above, we empirically show that the number of tweets or updates by authority users (those with a large number of followers) is very low, i.e. the majority of content generated on Twitter during the crisis comes from non authority users. In the end, we discuss certain case scenarios during the Mumbai blasts, where rumors were spread through the network of Twitter.
@inproceedings{ion:home-is-safer-than-the-cl:2011:nrtys,title={{Twitter Explodes with Activity in Mumbai Blasts! A Lifeline or an Unmonitored Daemon in the Lurking?}},author={Gupta, Aditi and Kumaraguru, Ponnurangam},year={2011},booktitle={},}
SOUPS ’11
Home is Safer than the Cloud! Privacy Concerns for Consumer Cloud Storage
Iulia Ion, Niharika Sachdeva, Ponnurangam Kumaraguru, and Srdjan Capkun
In Symposium on Usable Privacy and Security (SOUPS), 2011
Several studies ranked security and privacy to be major areas of concern and impediments to cloud adoption for companies, but none have looked into end-users’ attitudes and practices. Not much is known about consumers’ privacy beliefs and expectations for cloud storage, such as web-mail, document and photo sharing platforms, or about users’ awareness of contractual terms and conditions. We conducted 36 in-depth interviews in Switzerland and India (two countries with different privacy perceptions and expectations); and followed up with an online survey with 402 participants in both countries. We study users’ privacy attitudes and beliefs regarding their use of cloud storage systems. Our results show that privacy requirements for consumer cloud storage differ from those of companies. Users are less concerned about some issues, such as guaranteed deletion of data, country of storage and storage outsourcing, but are uncertain about using cloud storage. Our results further show that end-users consider the Internet intrinsically insecure and prefer local storage for sensitive data over cloud storage. However, users desire better security and are ready to pay for services that provide strong privacy guarantees. Participants had misconceptions about the rights and guarantees their cloud storage providers offer. For example, users believed that their provider is liable in case of data loss, does not have the right to view and modify user data, and cannot disable user accounts. Finally, our results show that cultural differences greatly influence user attitudes and beliefs, such as their willingness to store sensitive data in the cloud and their acceptance that law enforcement agencies monitor user accounts. We believe that these observations can help in improving users’ privacy in cloud storage systems.
@inproceedings{jain:cross-pollination-of-info:2011:nrtys,title={Home is Safer than the Cloud! Privacy Concerns for Consumer Cloud Storage},author={Ion, Iulia and Sachdeva, Niharika and Kumaraguru, Ponnurangam and Capkun, Srdjan},year={2011},booktitle={Symposium on Usable Privacy and Security (SOUPS)},}
PASSAT
Cross-Pollination of Information in Online Social Media: A Case Study on Popular Social Networks
Paridhi Jain, Tiago Rodrigues, Gabriel Magno, Ponnurangam Kumaraguru, and Virgilio Almeida
In SocialCom PASSAT 2011 (six-page short paper), 2011
Owing to the popularity of Online Social Media (OSM), Internet users share a lot of information (including personal) on and across OSM services every day. For example, it is common to find a YouTube video embedded in a blog post with an option to share the link on Facebook. Users recommend, comment, and forward information they receive from friends, contributing to spreading the information in and across OSM services. We term this information diffusion process from one OSM service to another as Cross-Pollination, and the network formed by users who participate in Cross-Pollination and content produced in the network as Cross-Pollinated network. Research has been done about information diffusion within one OSM service, but little is known about Cross-Pollination. In this paper, we aim at filling this gap by studying how information (video, photo, location) from three popular OSM services (YouTube, Flickr and Foursquare) diffuses on Twitter, the most popular microblogging service. Our results show that Cross-Pollinated networks follow temporal and topological characteristics of the diffusion OSM (Twitter in our study). Furthermore, popularity of information on source OSM (YouTube, Flickr and Foursquare) does not imply its popularity on Twitter. Our results also show that Cross-Pollination helps Twitter in terms of traffic generation and user involvement, but only a small fraction of videos and photos gain a significant number of views from Twitter. We believe this is the first research work which explicitly characterizes the diffusion of information across different OSM services.
@inproceedings{khot:marasim:-a-novel-jigsaw-b:2011:nrtys,title={Cross-Pollination of Information in Online Social Media: A Case Study on Popular Social Networks},author={Jain, Paridhi and Rodrigues, Tiago and Magno, Gabriel and Kumaraguru, Ponnurangam and Almeida, Virgilio},year={2011},booktitle={SocialCom PASSAT 2011 (six-page short paper)},}
CHI ’11
Marasim: A Novel Jigsaw Based Authentication Scheme using Tagging
Rohit
Khot, Srinathan
Kannan, and Ponnurangam
Kumaraguru
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2011
In this paper we propose and evaluate Marasim, a novel Jigsaw based graphical authentication mechanism using tagging. Marasim is aimed at achieving the security of random images with the memorability of personal images. Our scheme relies on the human ability to remember a personal image and later recognize the alternate visual representations (images) of the concepts that occurred in the image. These concepts are retrieved from the tags assigned to the image. We illustrate how a Jigsaw based approach helps to create a portfolio of system-chosen random images to be used for authentication. The paper describes the complete design of Marasim along with the empirical studies of Marasim that provide evidence of increased memorability. Results show that 93% of all participants succeeded in the authentication tests using Marasim after three months while 71% succeeded in authentication tests using Marasim after nine months. Our findings indicate that Marasim has potential applications, especially where text input is hard (e.g., PDAs or ATMs), or in situations where passwords are infrequently used (e.g., web site passwords).
@inproceedings{kuldeep-yadav:smsassassin-:-crowdsourci:2011:yuqfj,title={Marasim: A Novel Jigsaw Based Authentication Scheme using Tagging},author={Khot, Rohit and Kannan, Srinathan and Kumaraguru, Ponnurangam},year={2011},booktitle={Proceedings of the SIGCHI Conference on Human Factors in Computing Systems},}
HotMobile ’11
SMSAssassin : Crowdsourcing Driven Mobile-based System for SMS Spam Filtering
Due to the increase in the use of Short Message Service (SMS) over mobile phones in developing countries, there has been a burst of spam SMSes. Content-based machine learning approaches were effective in filtering email spam. Researchers have used topical and stylistic features of the SMS to classify spam and ham. SMS spam filtering can be largely influenced by the presence of regional words, abbreviations and idioms. We have tested the feasibility of applying Bayesian learning and Support Vector Machine (SVM) based machine learning techniques, which were reported to be most effective in email spam filtering, on an India-centric dataset. In our ongoing research, as an exploratory step, we have developed a mobile-based system, SMSAssassin, that can filter SMS spam messages based on Bayesian learning and a sender blacklisting mechanism. Since the spam SMS keywords and patterns keep on changing, SMSAssassin uses crowdsourcing to keep itself updated. Using a dataset that we are collecting from users in the real world, we evaluated our approaches and found some interesting results.
@inproceedings{kumaraguru:a-survey-of-privacy-polic:2007:lrfkq,title={SMSAssassin : Crowdsourcing Driven Mobile-based System for SMS Spam Filtering},author={Yadav, Kuldeep and Kumaraguru, Ponnurangam and Goyal, Atul and Gupta, Ashish and Naik, Vinayak},year={2011},booktitle={Proceedings of the 12th Workshop on Mobile Computing Systems and Applications},}
2010
AIRS
Mining YouTube to Discover Hate Videos, Users and Hidden Communities
A.
Sureka, P.
Kumaraguru, A.
Goyal, and S.
Chhabra
In Sixth Asia Information Retrieval Societies Conference, 2010
We describe a semi-automated system to assist law enforcement and intelligence agencies dealing with cyber-crime related to promotion of hate and radicalization on the Internet. The focus of this work is on mining YouTube to discover hate videos, users and virtual hidden communities. Finding precise information on YouTube is a challenging task because of the huge size of the YouTube repository and a large subscriber base. We present a solution based on data mining and social network analysis (using a variety of relationships such as friends, subscriptions, favorites and related videos) to aid an analyst in discovering insightful and actionable information. Furthermore, we performed a systematic study of the features and properties of the data and hidden social networks which has implications in understanding extremism on Internet. We take a case study based approach and perform empirical validation of the proposed hypothesis. Our approach succeeded in finding hate videos which were validated manually.
@inproceedings{sureka2010miningyoutubeto,title={Mining YouTube to Discover Hate Videos, Users and Hidden Communities},author={Sureka, A. and Kumaraguru, P. and Goyal, A. and Chhabra, S.},year={2010},booktitle={Sixth Asia Information Retrieval Societies Conference},}
NSDR
Challenges and Novelties while using Mobile Phones as ICT Devices for Indian Masses
K.
Yadav, V.
Naik, A.
Singh, P.
Singh, P.
Kumaraguru, and U.
Chandra
In ACM Workshop on Networked Systems for Developing Regions, 2010
Mobile phones have emerged as a truly pervasive and affordable Information and Communication Technology (ICT) platform in the last decade. Large penetration of cellular networks and availability of advanced hardware platforms have inspired multiple innovative research opportunities in the mobile computing domain. However, most of the research challenges have focused on typical scenarios existing in the developed economies. In this paper, we present research challenges and novelties in the mobile computing domain that take into account differences between developing economies, in particular India, and developed economies. Our research is based on commonly available mobile platforms, communication cost, differences in user behavior and acceptable societal norms, among others.
@inproceedings{yadav2010challengesandnovelties,title={Challenges and Novelties while using Mobile Phones as ICT Devices for Indian Masses},author={Yadav, K. and Naik, V. and Singh, A. and Singh, P. and Kumaraguru, P. and Chandra, U.},year={2010},booktitle={ACM Workshop on Networked Systems for Developing Regions},}
SOUPS
Influence of User Perception, Security Needs, and Social Factors on Device Pairing Method Choices
I.
Ion, M.
Langheinrich, P.
Kumaraguru, and S.
Capkun
In Symposium On Usable Privacy and Security (SOUPS), 2010
Recent years have seen a proliferation of secure device pairing methods that try to improve both the usability and security of today’s de-facto standard – PIN-based authentication. Evaluating such improvements is difficult. Most comparative laboratory studies have so far mainly focused on completeness, trying to find the single best method among the dozens of proposed approaches – one that is both rated the most usable by test subjects, and which provides the most robust security guarantees. This search for the "best" pairing method, however, fails to take into account the variety of situations in which such pairing protocols may be used in real life. The comparative study reported here, therefore, explicitly situates pairing tasks in a number of more realistic situations. Our results indicate that people do not always use the easiest or most popular method – they instead prefer different methods in different situations, based on the sensitivity of data involved, their time constraints, and the social conventions appropriate for a particular place and setting. Our study also provides qualitative data on factors influencing the perceived security of a particular method, the users’ mental models surrounding security of a method, and their security needs.
@inproceedings{ion2010influenceofuser,title={Influence of User Perception, Security Needs, and Social Factors on Device Pairing Method Choices},author={Ion, I. and Langheinrich, M. and Kumaraguru, P. and Capkun, S.},year={2010},booktitle={Symposium On Usable Privacy and Security (SOUPS)},}
CHI
Who Falls for Phish? A Demographic Analysis of Phishing Susceptibility and Effectiveness of Interventions
S.
Sheng, M.
Holbrook, P.
Kumaraguru, L.
Cranor, and J.
Downs
In this paper we present the results of a roleplay survey instrument administered to 1001 online survey respondents to study both the relationship between demographics and phishing susceptibility and the effectiveness of several anti-phishing educational materials. Our results suggest that women are more susceptible than men to phishing and participants between the ages of 18 and 25 are more susceptible to phishing than other age groups. We explain these demographic factors through a mediation analysis. Educational materials reduced users’ tendency to enter information into phishing webpages by 40%; however, some of the educational materials we tested also slightly decreased participants’ tendency to click on legitimate links.
@inproceedings{sheng2010whofallsfor,title={Who Falls for Phish? A Demographic Analysis of Phishing Susceptibility and Effectiveness of Interventions},author={Sheng, S. and Holbrook, M. and Kumaraguru, P. and Cranor, L. and Downs, J.},year={2010},booktitle={CHI},}
SafeConfig
Cue : A Framework for Generating Meaningful Feedback in XACML
Sunil Kumar
Ghai, Prateek
Nigam, and Ponnurangam
Kumaraguru
In Proceedings of the 3rd ACM workshop on Assurable and usable security configuration, 2010
With a number of access rules at play along with contexts in which they may or may not apply, it is not always obvious to the legitimate user what caused an authorization server to deny a request, neither is it possible for the administrator to specify a complete fail proof policy. It then becomes the responsibility of the system to act in a user friendly manner by providing feedback suggesting the requester about possible alternatives. The system should also cover any unhandled request that it may encounter due to an incomplete system policy. At the same time, it is essential for feedback to not reveal the entire policy to any user. In this paper we propose a framework Cue, for generating feedback in XACML using logic programming in Prolog. Feedback content is protected by the use of meta policy which itself is specified in XACML. We first translate XACML policies into logic based functors. Second, we execute a query using parameters in the denied XACML request, to identify conditions that failed. Third, the failed condition is notified as feedback if a meta policy allows the system to reveal it. Cue is capable of generating appropriate feedback while ensuring that a desired degree of confidentiality is maintained.
@inproceedings{ghai:cue-:-a-framework-for-gen:2010:kxyqv,title={{Cue : A Framework for Generating Meaningful Feedback in XACML}},author={Ghai, Sunil Kumar and Nigam, Prateek and Kumaraguru, Ponnurangam},year={2010},booktitle={Proceedings of the 3rd ACM workshop on Assurable and usable security configuration},}
ICEB
The Unique Identification Number Project: Challenges and Recommendations
Haricharan
Rengamani, Ponnurangam
Kumaraguru, Rajarishi
Chakraborty, and H. Raghav
Rao
In Third International Conference on Ethics and Policy of Biometrics, 2010
This paper elucidates the social, ethical, cultural, technical, and legal implications / challenges around the implementation of a biometric based unique identification (UID) number project. The Indian government has undertaken a huge effort to issue UID numbers to its residents. Apart from possible challenges that are expected in the implementation of UID, the paper also draws parallels from Social Security Number system in the US. We discuss the setbacks of using the Social Security Number as a unique identifier and how to avoid them with the system being proposed in India. We discuss the various biometric techniques used and a few recommendations associated with the use of biometrics.
@inproceedings{gupta:integrating-linked-open-d:2011:yuqfj,title={The Unique Identification Number Project: Challenges and Recommendations},author={Rengamani, Haricharan and Kumaraguru, Ponnurangam and Chakraborty, Rajarishi and Rao, H. Raghav},year={2010},booktitle={Third International Conference on Ethics and Policy of Biometrics},}
AIRS ’10
Mining YouTube to Discover Extremist Videos, Users and Hidden Communities
Ponnurangam
Kumaraguru, and Ashish
Sureka
In Proceedings of Asia Information Retrieval Societies Conference, 2010
We describe a semi-automated system to assist law enforcement and intelligence agencies dealing with cyber-crime related to promotion of hate and radicalization on the Internet. The focus of this work is on mining YouTube to discover hate videos, users and virtual hidden communities. Finding precise information on YouTube is a challenging task because of the huge size of the YouTube repository and a large subscriber base. We present a solution based on data mining and social network analysis (using a variety of relationships such as friends, subscriptions, favorites and related videos) to aid an analyst in discovering insightful and actionable information. Furthermore, we performed a systematic study of the features and properties of the data and hidden social networks which has implications in understanding extremism on Internet. We take a case study based approach and perform empirical validation of the proposed hypothesis. Our approach succeeded in finding hate videos which were validated manually.
@inproceedings{kumaraguru:anti-phishing-landing-page:2009:yuqfj,title={Mining YouTube to Discover Extremist Videos, Users and Hidden Communities},author={Kumaraguru, Ponnurangam and Sureka, Ashish},year={2010},booktitle={Proceedings of Asia Information Retrieval Societies Conference, 2010},}
2009
eCRS
Improving Phishing Countermeasures: An Analysis of Expert Interviews
S.
Sheng, P.
Kumaraguru, A.
Acquisti, L.
Cranor, and J.
Hong
In e-Crime Researchers Summit, Anti-Phishing Working Group, 2009
In this paper, we present data from 31 semi-structured interviews with anti-phishing experts from academia, law enforcement, and industry. Our analysis led to eight key findings and 18 recommendations to improve phishing countermeasures. Our findings describe the evolving phishing threat, stakeholder incentives to devote resources to anti-phishing efforts, what stakeholders should do to most effectively address the problem, and the role of education and law enforcement.
@inproceedings{sheng2009improvingphishing,title={Improving Phishing Countermeasures: An Analysis of Expert Interviews},author={Sheng, S. and Kumaraguru, P. and Acquisti, A. and Cranor, L. and Hong, J.},year={2009},booktitle={e-Crime Researchers Summit, Anti-Phishing Working Group},}
JRD
A Policy Framework for Security and Privacy Management
J.
Karat, E.
Bertino, N.
Li, Q.
Ni, C.
Brodie, J.
Lobo, S. B.
Calo, L. F.
Cranor, P.
Kumaraguru, and R. W.
Reeder
In IBM Journal of Research and Development, Harmonizing Security and Privacy, Volume 53, Number 2, 2009
Policies that address security and privacy are pervasive parts of both technical and social systems, and technology that enables both organizations and individuals to create and manage such policies is a critical need in information technology (IT). This paper describes the notion of end-to-end policy management and advances a framework that can be useful in understanding the commonality in IT security and privacy policy management.
@inproceedings{karat2009apolicyframework,title={A Policy Framework for Security and Privacy Management},author={Karat, J. and Bertino, E. and Li, N. and Ni, Q. and Brodie, C. and Lobo, J. and Calo, S. B. and Cranor, L. F. and Kumaraguru, P. and Reeder, R. W.},year={2009},booktitle={IBM Journal of Research and Development, Harmonizing Security and Privacy, Volume 53, Number 2},}
IBM
Policy framework for security and privacy management
J.
Karat, C.-M.
Karat, E.
Bertino, N.
Li, Q.
Ni, C.
Brodie, J.
Lobo, S. B.
Calo, L. F.
Cranor, P.
Kumaraguru, and R. W.
Reeder
In IBM Journal of Research and Development, 2009
Policies that address security and privacy are pervasive parts of both technical and social systems, and technology that enables both organizations and individuals to create and manage such policies is a critical need in information technology (IT). This paper describes the notion of end-to-end policy management and advances a framework that can be useful in understanding the commonality in IT security and privacy policy management.
@inproceedings{Karat:2009:PFS:1850636.1850640,title={Policy framework for security and privacy management},author={Karat, J. and Karat, C.-M. and Bertino, E. and Li, N. and Ni, Q. and Brodie, C. and Lobo, J. and Calo, S. B. and Cranor, L. F. and Kumaraguru, P. and Reeder, R. W.},year={2009},booktitle={Published in IBM Journal of Research and Development},}
Anti-phishing landing page: Turning a 404 into a teachable moment for end users
Ponnurangam
Kumaraguru, Lorrie Faith
Cranor, and Laura
Mather
This paper describes the design and implementation of the Anti-Phishing Working Group (APWG) anti-phishing landing page, a web page with a succinct anti-phishing training message designed to be displayed in place of a phishing website that has been taken down. The landing page is currently being used by financial institutions, phish site take-down vendors, government organizations and online merchants. When would-be phishing victims try to visit a phishing web site that has been taken down, they are redirected to the landing page, hosted on the APWG website. In this paper, we discuss the iterative user-centered design process we used to develop the landing page content. We present the data we collected from the landing page log files from October 1, 2008 through March 31, 2009, during the first six months of the landing page program. Our analysis suggests that approximately 70,000 Internet users have been educated by the landing page during this period. We identified 3,917 unique phishing URLs that had been redirected to the landing page. We found 81 URLs that appeared in our log files in email messages archived in the APWG phishing email repository. We present our analysis of the features of these emails.
@inproceedings{kumaraguru:getting-users-to-pay-atte:2007:yuqfj,title={Anti-phishing landing page: Turning a 404 into a teachable moment for end users},author={Kumaraguru, Ponnurangam and Cranor, Lorrie Faith and Mather, Laura},year={2009},booktitle={},}
SOUPS ’09
School of phish: a real-world evaluation of anti-phishing training
Ponnurangam
Kumaraguru, Justin
Cranshaw, Alessandro
Acquisti, Lorrie
Cranor, Jason
Hong, Mary Ann
Blair, and Theodore
Pham
In Proceedings of the 5th Symposium on Usable Privacy and Security, 2009
PhishGuru is an embedded training system that teaches users to avoid falling for phishing attacks by delivering a training message when the user clicks on the URL in a simulated phishing email. In previous lab and real-world experiments, we validated the effectiveness of this approach. Here, we extend our previous work with a 515-participant, real-world study in which we focus on long-term retention and the effect of two training messages. We also investigate demographic factors that influence training and general phishing susceptibility. Results of this study show that (1) users trained with PhishGuru retain knowledge even after 28 days; (2) adding a second training message to reinforce the original training decreases the likelihood of people giving information to phishing websites; and (3) training does not decrease users’ willingness to click on links in legitimate messages. We found no significant difference between males and females in the tendency to fall for phishing emails both before and after the training. We found that participants in the 18–25 age group were consistently more vulnerable to phishing attacks on all days of the study than older participants. Finally, our exit survey results indicate that most participants enjoyed receiving training during their normal use of email.
@inproceedings{kumaraguru:lessons-from-a-real-world:2008:lrfkq,title={School of phish: a real-world evaluation of anti-phishing training},author={Kumaraguru, Ponnurangam and Cranshaw, Justin and Acquisti, Alessandro and Cranor, Lorrie and Hong, Jason and Blair, Mary Ann and Pham, Theodore},year={2009},booktitle={Proceedings of the 5th Symposium on Usable Privacy and Security},}
Thesis
PhishGuru: A System for Educating Users about Semantic Attacks
The goal of this thesis is to show that computer users trained with an embedded training system - one grounded in the principles of learning science - are able to make more accurate online trust decisions than users who read traditional security training materials, which are distributed via email or posted online. To achieve this goal, we focus on "phishing," a type of semantic attack. We have developed a system called "PhishGuru" based on embedded training methodology and learning science principles. Embedded training is a methodology in which training materials are integrated into the primary tasks users perform in their day-to-day lives. In contrast to existing training methodologies, the PhishGuru shows training materials to users through emails at the moment ("teachable moment") users actually fall for phishing attacks.
@inproceedings{kumaraguru:privacy-in-india:-attitud:2030:lrfkq,title={PhishGuru: A System for Educating Users about Semantic Attacks},author={Kumaraguru, Ponnurangam},year={2009},booktitle={Research Thesis},}
2008
ICELW
Anti-Phishing Education
P.
Kumaraguru, S.
Sheng, A.
Acquisti, L.F.
Cranor, and J.I.
Hong
In The Proceedings of The International Conference on E-Learning in the Workplace (ICELW 2008), 2008
Prior laboratory studies have shown that PhishGuru, an embedded training system, is an effective way to teach users to identify phishing scams. PhishGuru users are sent simulated phishing attacks and trained after they fall for the attacks. In this current study, we extend the PhishGuru methodology to train users about spear phishing and test it in a real world setting with employees of a Portuguese company. Our results demonstrate that the findings of PhishGuru laboratory studies do indeed hold up in a real world deployment. Specifically, the results from the field study showed that a large percentage of people who clicked on links in simulated emails proceeded to give some form of personal information to fake phishing websites, and that participants who received PhishGuru training were significantly less likely to fall for subsequent simulated phishing attacks one week later. This paper also presents some additional new findings. First, people trained with spear phishing training material did not make better decisions in identifying spear phishing emails compared to people trained with generic training material. Second, we observed that PhishGuru training could be effective in training other people in the organization who did not receive training messages directly from the system. Third, we also observed that employees in technical jobs were not different from employees with non-technical jobs in identifying phishing emails before and after the training. We conclude with some lessons that we learned in conducting the real world study.
@inproceedings{kumaraguru2008education,title={Anti-Phishing Education},author={Kumaraguru, P. and Sheng, S. and Acquisti, A. and Cranor, L.F. and Hong, J.I.},year={2008},booktitle={The Proceedings of The International Conference on E-Learning in the Workplace (ICELW 2008)},}
IDMAN
A Contextual Method for Evaluating Privacy Preferences
Caroline
Sheedy, and Ponnurangam
Kumaraguru
In Policies and Research in Identity Management (IDMAN), 2008
Identity management is a relevant issue at a national and international level. Any approach to identity management is incomplete unless privacy is also a consideration. Existing research on evaluating an individual’s privacy preferences has shown discrepancies between the standards users state they require and their corresponding observed behaviour. We take a contextual approach to surveying privacy, using the framework proposed by contextual integrity, with the aim of further understanding users’ self-reported views on privacy at a national level.
@inproceedings{gupta:credibility-ranking-of-tw:2012:yuqfj,title={A Contextual Method for Evaluating Privacy Preferences},author={Sheedy, Caroline and Kumaraguru, Ponnurangam},year={2008},booktitle={Policies and Research in Identity Management (IDMAN)},}
IEEE
Lessons From a Real World Evaluation of Anti-Phishing Training
Ponnurangam
Kumaraguru, Steve
Sheng, Alessandro
Acquisti, Lorrie Faith
Cranor, and Jason
Hong
In 2008 eCrime Researchers Summit, 2008
Prior laboratory studies have shown that PhishGuru, an embedded training system, is an effective way to teach users to identify phishing scams. PhishGuru users are sent simulated phishing attacks and trained after they fall for the attacks. In this current study, we extend the PhishGuru methodology to train users about spear phishing and test it in a real world setting with employees of a Portuguese company. Our results demonstrate that the findings of PhishGuru laboratory studies do indeed hold up in a real world deployment. Specifically, the results from the field study showed that a large percentage of people who clicked on links in simulated emails proceeded to give some form of personal information to fake phishing websites, and that participants who received PhishGuru training were significantly less likely to fall for subsequent simulated phishing attacks one week later. This paper also presents some additional new findings. First, people trained with spear phishing training material did not make better decisions in identifying spear phishing emails compared to people trained with generic training material. Second, we observed that PhishGuru training could be effective in training other people in the organization who did not receive training messages directly from the system. Third, we also observed that employees in technical jobs were not different from employees with non-technical jobs in identifying phishing emails before and after the training. We conclude with some lessons that we learned in conducting the real world study.
@inproceedings{kumaraguru:phishguru:-a-system-for-e:2009:rcrwd,title={Lessons From a Real World Evaluation of Anti-Phishing Training},author={Kumaraguru, Ponnurangam and Sheng, Steve and Acquisti, Alessandro and Cranor, Lorrie Faith and Hong, Jason},year={2008},booktitle={Published in 2008 eCrime Researchers Summit},}
2007
IDMAN
A Contextual Method for Evaluating Privacy Preferences
C.
Sheedy, and P.
Kumaraguru
In Policies and Research in Identity Management (IDMAN), Rotterdam, The Netherlands, 2007
Identity management is a relevant issue at a national and international level. Any approach to identity management is incomplete unless privacy is also a consideration. Existing research on evaluating an individual’s privacy preferences has shown discrepancies between the standards users state they require and their corresponding observed behaviour. We take a contextual approach to surveying privacy, using the framework proposed by contextual integrity, with the aim of further understanding users’ self-reported views on privacy at a national level.
@inproceedings{sheedy2007acontextualmethod,title={A Contextual Method for Evaluating Privacy Preferences},author={Sheedy, C. and Kumaraguru, P.},year={2007},booktitle={Policies and Research in Identity Management (IDMAN), Rotterdam, The Netherlands},}
SOUPS
Anti-phishing phil: The design and evaluation of a game that teaches people not to fall for phish
S.
Sheng, B.
Magnien, P.
Kumaraguru, A.
Acquisti, L.F.
Cranor, J.
Hong, and E.
Nunge
In Proceedings of the 3rd symposium on usable privacy and security, 2007
In this paper we describe the design and evaluation of Anti-Phishing Phil, an online game that teaches users good habits to help them avoid phishing attacks. We used learning science principles to design and iteratively refine the game. We evaluated the game through a user study: participants were tested on their ability to identify fraudulent web sites before and after spending 15 minutes engaged in one of three anti-phishing training activities (playing the game, reading an anti-phishing tutorial we created based on the game, or reading existing online training materials). We found that the participants who played the game were better able to identify fraudulent web sites compared to the participants in other conditions. We attribute these effects to both the content of the training messages presented in the game as well as the presentation of these materials in an interactive game format. Our results confirm that games can be an effective way of educating people about phishing and other security attacks.
@inproceedings{sheng2007the,title={Anti-phishing phil: The design and evaluation of a game that teaches people not to fall for phish},author={Sheng, S. and Magnien, B. and Kumaraguru, P. and Acquisti, A. and Cranor, L.F. and Hong, J. and Nunge, E.},year={2007},booktitle={Proceedings of the 3rd symposium on usable privacy and security},}
SOUPS
A survey of privacy policy languages
P.
Kumaraguru, L.
Cranor, J.
Lobo, and S.
Calo
In Proceedings of the 3rd symposium on Usable privacy and security, SOUPS ’07, 2007
Most consumers are sensitive to privacy issues when conducting business online. Protecting information by enforcing security and privacy practices internally is a way for organizations to increase business by building trust with such consumers. They can express their privacy practices as policies in a human readable format to help consumers make informed decisions. Many privacy languages are available for representing policies, but they tend to use formats convenient to their implementations, and there is no single framework or metric to analyze and evaluate the effectiveness of these languages. In this research, we are interested in succinctly summarizing the literature available on privacy policy languages; providing an account of the features, characteristics and requirements of the languages; and, describing a comprehensive framework for analysis. We expect our results to aid implementers in choosing an existing language and to provide guidelines for building languages in the future. We expect this research to be a starting point towards developing frameworks and metrics for analyzing privacy policy languages.
@inproceedings{kumaraguru2007asurveyof,title={A survey of privacy policy languages},author={Kumaraguru, P. and Cranor, L. and Lobo, J. and Calo, S.},year={2007},booktitle={Proceedings of the 3rd symposium on Usable privacy and security, SOUPS '07},}
SOUPS
A Survey of Privacy Policy Languages
Ponnurangam
Kumaraguru, Lorrie
Cranor, Jorge
Lobo, and Seraphin
Calo
In Symposium on Usable Privacy and Security (SOUPS), 2007
Most consumers are sensitive to privacy issues when conducting business online. Protecting information by enforcing security and privacy practices internally is a way for organizations to increase business by building trust with such consumers. They can express their privacy practices as policies in a human readable format to help consumers make informed decisions. Many privacy languages are available for representing policies, but they tend to use formats convenient to their implementations, and there is no single framework or metric to analyze and evaluate the effectiveness of these languages. In this research, we are interested in succinctly summarizing the literature available on privacy policy languages; providing an account of the features, characteristics and requirements of the languages; and, describing a comprehensive framework for analysis. We expect our results to aid implementers in choosing an existing language and to provide guidelines for building languages in the future. We expect this research to be a starting point towards developing frameworks and metrics for analyzing privacy policy languages.
@inproceedings{kumaraguru:anti---terrorism-in-india:2010:lrfkq,title={A Survey of Privacy Policy Languages},author={Kumaraguru, Ponnurangam and Cranor, Lorrie and Lobo, Jorge and Calo, Seraphin},year={2007},booktitle={Symposium on Usable Privacy and Security (SOUPS)},}
eCrime
Getting Users to Pay Attention to Anti-Phishing Education: Evaluation of Retention and Transfer
Ponnurangam Kumaraguru, Yong Rhee, Steve Sheng, Sharique Hasan, Alessandro Acquisti, Lorrie Faith Cranor, and Jason Hong
In Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, 2007
Educational materials designed to teach users not to fall for phishing attacks are widely available but are often ignored by users. In this paper, we extend an embedded training methodology using learning science principles in which phishing education is made part of a primary task for users. The goal is to motivate users to pay attention to the training materials. In embedded training, users are sent simulated phishing attacks and trained after they fall for the attacks. Prior studies tested users immediately after training and demonstrated that embedded training improved users’ ability to identify phishing emails and websites. In the present study, we tested users to determine how well they retained knowledge gained through embedded training and how well they transferred this knowledge to identify other types of phishing emails. We also compared the effectiveness of the same training materials delivered via embedded training and delivered as regular email messages. In our experiments, we found that: (a) users learn more effectively when the training materials are presented after users fall for the attack (embedded) than when the same training materials are sent by email (non-embedded); (b) users retain and transfer more knowledge after embedded training than after non-embedded training; and (c) users with higher Cognitive Reflection Test (CRT) scores are more likely than users with lower CRT scores to click on the links in the phishing emails from companies with which they have no account.
@inproceedings{Kumaraguru:2009:SPR:1572532.1572536,title={Getting Users to Pay Attention to Anti-Phishing Education: Evaluation of Retention and Transfer},author={Kumaraguru, Ponnurangam and Rhee, Yong and Sheng, Steve and Hasan, Sharique and Acquisti, Alessandro and Cranor, Lorrie Faith and Hong, Jason},year={2007},booktitle={Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit},}
2006
PST
Trust modeling for online transactions: A phishing scenario
Trust is an important component of online transactions. The increasing amount and sophistication of spam, phishing, and other semantic attacks increase users’ uncertainty about the consequences of their actions and their distrust towards other online parties. In this paper, we highlight some key characteristics of a model that we are developing to represent and compare the online trust decision processes of "expert" and "non-expert" computer users. We also report on preliminary data we are gathering to validate, refine, and apply our model. This research is part of a broader project that aims at developing tools and training modules to help online users make good trust decisions.
@inproceedings{kumaraguru2006trustmodelingfor,title={Trust modeling for online transactions: A phishing scenario},author={Kumaraguru, P. and Acquisti, A. and Cranor, L.},year={2006},booktitle={Proceedings of Privacy Security Trust},}
2005
TPRC
Privacy Perceptions in India and the United States: An Interview Study
P. Kumaraguru, L. Cranor, and E. Newton
In The 33rd Research Conference on Communication, Information and Internet Policy (TPRC), 2005
As members of the educated population in India are increasingly using the Internet, adopting new technologies such as camera phones, and acquiring credit cards, Indians are increasingly becoming exposed to many of the same privacy risks that have raised concerns in other parts of the world. Although some have interpreted the Indian constitution and some Indian laws as providing some privacy protections, there are no Indian laws that explicitly address data privacy. However, with the recent growth of the Indian business process outsourcing industry, there has been considerable interest in adopting laws that would provide legal safeguards for personal data handled by businesses. While many privacy studies have been done in the United States, Europe, Canada, and Australia, little research has been done to investigate attitudes about privacy in India. We conducted an exploratory study to gain an initial understanding of perceptions about privacy among Indians. We conducted 29 one-on-one "mental model" interviews that asked people 16 open-ended questions related to privacy. A mental model is the symbolic representation of an idea that an individual uses to interact with the real world and to represent social relationships. The questions were organized into several categories: general understanding of privacy and security, security and privacy of computerized data, knowledge of risks and protection against privacy risks, knowledge and awareness about laws regarding privacy, knowledge of data sharing and selling in organizations and government, and
@inproceedings{kumaraguru2005privacyperceptionsin,title={Privacy Perceptions in India and the United States: An Interview Study},author={Kumaraguru, P. and Cranor, L. and Newton, E.},year={2005},booktitle={The 33rd Research Conference on Communication, Information and Internet Policy (TPRC)},}
ICA
Mental Models of Data Privacy and Security Extracted from Interviews with Indians
J. Diesner, P. Kumaraguru, and K. Carley
In 55th Annual Conference of the International Communication Association (ICA), 2005
The Indian software and services market continues to gain momentum, with offshore outsourcing from the US, Europe and other countries becoming mainstream. As jobs that involve processing of personal data are increasingly outsourced to India, concerns are being raised about the protection of this data. While a large number of studies have been conducted to assess people’s attitudes about data privacy and security in the US, Australia, Canada and Europe, little information is available on this topic in India. The research we present seeks to gain an empirical and exploratory understanding of Indians’ attitudes about data privacy and security. We study these attitudes by analyzing the mental models that are reflected in interviews which we conducted among Indians. We will report on a methodology for extracting, analyzing and comparing mental models from texts and on the knowledge we gained about the perception of data privacy and security among the subjects.
@inproceedings{diesner2005mentalmodelsof,title={Mental Models of Data Privacy and Security Extracted from Interviews with Indians},author={Diesner, J. and Kumaraguru, P. and Carley, K.},year={2005},booktitle={55th Annual Conference of the International Communication Association (ICA)},}
PET ’05
Privacy in India: Attitudes and Awareness
Ponnurangam Kumaraguru and Lorrie Cranor
In Proceedings of the 2005 Workshop on Privacy Enhancing Technologies (PET2005), 2005
In recent years, numerous surveys have been conducted to assess attitudes about privacy in the United States, Australia, Canada, and the European Union. Very little information has been published about privacy attitudes in India. As India is becoming a leader in business process outsourcing, increasing amounts of personal information from other countries are flowing into India. Questions have been raised about the ability of Indian companies to adequately protect this information. We conducted an exploratory study to gain an initial understanding of attitudes about privacy among the Indian high tech workforce. We carried out a written survey and one-on-one interviews to assess the level of awareness about privacy-related issues and concern about privacy among a sample of educated people in India. Our results demonstrate an overall lack of awareness of privacy issues and less concern about privacy in India than has been found in similar studies conducted in the United States.
@inproceedings{kumaraguru:protecting-people-from-ph:2007:lrfkq,title={{Privacy in India: Attitudes and Awareness}},author={Kumaraguru, Ponnurangam and Cranor, Lorrie},year={2005},booktitle={Proceedings of the 2005 Workshop on Privacy Enhancing Technologies (PET2005)},}