We are a group of researchers who study, analyze, and build various aspects of AI and social systems. Our work spans several areas: Applied Machine Learning, Responsible and Safe AI, Natural Language Processing, and Social Network Analysis. By understanding and measuring AI systems, we aim to develop solutions that contribute to the greater good of society.
Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
Prashant Kodali, Vaishnavi Shivkumar, Swarang Joshi, Monojit Choudhary, Ponnurangam Kumaraguru, and Manish Shrivastava
In 13th International Conference on Data Science, 2025
We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge the adapted checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach on sentence classification tasks (sentiment and hate speech) in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2-5 F1 points over full fine-tuning and 1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring the limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs. 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled + unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
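The merging step in (ii) can be pictured with the minimal sketch below, assuming Hugging Face checkpoints that share the base architecture. The checkpoint paths and the interpolation weight `alpha` are placeholders, and the sketch shows only plain weight averaging, not the exact TV/TIES procedures evaluated in the paper.

```python
from transformers import AutoModel

# Hypothetical checkpoint paths; substitute the actual base and CPT-adapted models.
BASE = "xlm-roberta-base"
ADAPTED = "path/to/cpt-adapted-checkpoint"

base = AutoModel.from_pretrained(BASE)
adapted = AutoModel.from_pretrained(ADAPTED)
adapted_state = adapted.state_dict()

alpha = 0.5  # interpolation weight between base and adapted parameters
merged_state = {}
for name, param in base.state_dict().items():
    if param.dtype.is_floating_point:
        # Plain linear interpolation of weights; the TV/TIES variants additionally
        # trim small task-vector entries and resolve sign conflicts before merging.
        merged_state[name] = (1 - alpha) * param + alpha * adapted_state[name]
    else:
        merged_state[name] = param  # keep non-float buffers (e.g., ids) from the base

base.load_state_dict(merged_state)
base.save_pretrained("merged-checkpoint")  # then fine-tune this on labeled task data
```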
@inproceedings{kodali2025adaptingmultilingualmodelscodemixed,title={Adapting Multilingual Models to Code-Mixed Tasks via Model Merging},author={Kodali, Prashant and Shivkumar, Vaishnavi and Joshi, Swarang and Choudhary, Monojit and Kumaraguru, Ponnurangam and Shrivastava, Manish},year={2025},eprint={2510.19782},archiveprefix={arXiv},primaryclass={cs.CL},booktitle={13th International Conference on Data Science},url={https://arxiv.org/abs/2510.19782},saral={https://www.youtube.com/watch?v=zaYBL54XY5k}}
EMNLP
SEMMA: A Semantic Aware Knowledge Graph Foundation Model
Arvindh Arun, Sumit Kumar, Mojtaba Nayyeri, Bo Xiong, Ponnurangam Kumaraguru, Antonio Vergari, and Steffen Staab
In The 2025 Conference on Empirical Methods in Natural Language Processing, 2025
Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.
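As a rough illustration of the textual side of such a pipeline (not SEMMA's actual implementation), the sketch below embeds LLM-enriched relation descriptions and links relations whose embeddings are similar; the relation texts, encoder name, and threshold are all hypothetical.

```python
from sentence_transformers import SentenceTransformer

# Hypothetical LLM-enriched relation descriptions (the paper generates these with an LLM).
relation_texts = {
    "founded_by": "the organisation was established by this person",
    "ceo_of": "this person leads the company as chief executive",
    "born_in": "the person was born in this location",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf text encoder
names = list(relation_texts)
emb = encoder.encode([relation_texts[r] for r in names], normalize_embeddings=True)

# Cosine similarities between relation embeddings; thresholding them yields a
# textual relation graph that can be fused with the structural component.
sim = emb @ emb.T
edges = [(names[i], names[j], float(sim[i, j]))
         for i in range(len(names)) for j in range(i + 1, len(names))
         if sim[i, j] > 0.3]
print(edges)
```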
@inproceedings{arun2025semmasemanticawareknowledge,title={SEMMA: A Semantic Aware Knowledge Graph Foundation Model},booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},author={Arun, Arvindh and Kumar, Sumit and Nayyeri, Mojtaba and Xiong, Bo and Kumaraguru, Ponnurangam and Vergari, Antonio and Staab, Steffen},year={2025},eprint={2505.20422},archiveprefix={arXiv},primaryclass={cs.CL},url={https://arxiv.org/abs/2505.20422},saral={https://www.youtube.com/watch?v=nh_YKOLULpo}}
EMNLP
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru, Francis Ferraro, and Manas Gaur
In The 2025 Conference on Empirical Methods in Natural Language Processing, 2025
Do LLMs genuinely incorporate external definitions, or do they primarily rely on their parametric knowledge? To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Our results reveal that while explicit label definitions can enhance accuracy and explainability, their integration into an LLM’s task-solving processes is neither guaranteed nor consistent, suggesting reliance on internalized representations in many cases. Models often default to their internal representations, particularly in general tasks, whereas domain-specific tasks benefit more from explicit definitions. These findings underscore the need for a deeper understanding of how LLMs process external knowledge alongside their pre-existing capabilities.
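For concreteness, here is a minimal sketch of how such definition conditions can be assembled into prompts; the labels, definitions, and template are hypothetical stand-ins, not the benchmarks or prompts used in the paper.

```python
# Illustrative prompt construction for the definition conditions described above;
# the labels, definitions, and template are hypothetical stand-ins.
LABELS = ["entailment", "contradiction"]
expert_defs = {
    "entailment": "the hypothesis must be true if the premise is true",
    "contradiction": "the hypothesis cannot be true if the premise is true",
}

def swapped(defs):
    """Swapped condition: attach each label to another label's definition."""
    texts = list(defs.values())
    return dict(zip(defs, texts[1:] + texts[:1]))

def build_prompt(example, defs=None):
    lines = ["Classify the example as one of: " + ", ".join(LABELS)]
    if defs:  # with-definition conditions; omit for the parametric-only baseline
        lines += [f"Definition of {label}: {text}" for label, text in defs.items()]
    lines.append(f"Example: {example}\nLabel:")
    return "\n".join(lines)

print(build_prompt("A man is sleeping. / A man is awake.", defs=swapped(expert_defs)))
```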
@inproceedings{mohammadi2025do,title={Do {LLM}s Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions},author={Mohammadi, Seyedali and Vedula, Bhaskara Hanuma and Lamba, Hemank and Raff, Edward and Kumaraguru, Ponnurangam and Ferraro, Francis and Gaur, Manas},booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},year={2025},url={https://openreview.net/forum?id=gJqW0wwloH},saral={https://www.youtube.com/watch?v=RXtUBVyQ1Zw&t=1s&pp=0gcJCQMKAYcqIYzv}}
COLM
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, and Ameya Prabhu
In Conference on Language Modelling, 2025
There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only <9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs’ ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.
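The automatic validation the benchmark relies on can be sketched roughly as below, assuming the correct reference solution and the incorrect submission are standalone Python programs reading stdin; the file names are placeholders, and a full checker would also verify that the candidate input satisfies the problem's constraints.

```python
import subprocess

def run(solution_path, candidate_input, timeout=5):
    """Run a standalone Python solution on the candidate input and capture stdout."""
    proc = subprocess.run(
        ["python", solution_path], input=candidate_input,
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout.strip()

def is_counterexample(candidate_input, correct_path="correct.py", buggy_path="buggy.py"):
    # A candidate input falsifies the incorrect submission if the two programs disagree.
    return run(correct_path, candidate_input) != run(buggy_path, candidate_input)

if __name__ == "__main__":
    print(is_counterexample("3\n1 2 3\n"))
```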
@article{sinha2025falsify,title={Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation},author={Sinha, Shiven and Goel, Shashwat and Kumaraguru, Ponnurangam and Geiping, Jonas and Bethge, Matthias and Prabhu, Ameya},year={2025},journal={Conference on Language Modelling},saral={https://www.youtube.com/watch?v=vbEO6tTm4f8},}
ICML
Great Models Think Alike and this Undermines AI Oversight
Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, and Jonas Geiping
In Forty-Second International Conference on Machine Learning, 2025
As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend – model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
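As a simplified illustration of measuring similarity through overlapping mistakes (not the paper's exact probabilistic metric, which uses model output distributions), a chance-adjusted agreement score over correctness labels could look like this:

```python
import numpy as np

def error_overlap_similarity(correct_a, correct_b):
    """Chance-adjusted agreement on mistakes between two models.

    correct_a, correct_b: boolean arrays, True where each model answered correctly.
    This is an illustrative Cohen's-kappa-style score on correctness labels only.
    """
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    observed = np.mean(a == b)                    # agree on correct vs. incorrect
    p_a, p_b = a.mean(), b.mean()
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return 1.0  # degenerate case: both models are always right or always wrong
    return (observed - expected) / (1 - expected)

# Two models whose errors overlap score above chance.
m1 = np.array([1, 1, 0, 0, 1, 0], dtype=bool)
m2 = np.array([1, 0, 0, 0, 1, 0], dtype=bool)
print(error_overlap_similarity(m1, m2))
```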
@article{goel2025greatmodelsthinkalike,title={Great Models Think Alike and this Undermines AI Oversight},author={Goel, Shashwat and Struber, Joschka and Auzina, Ilze Amanda and Chandra, Karuna K and Kumaraguru, Ponnurangam and Kiela, Douwe and Prabhu, Ameya and Bethge, Matthias and Geiping, Jonas},year={2025},journal={Forty Second International Conference on Machine Learning},}
ICML
A Cognac shot to forget bad memories: Corrective Unlearning in GNNs
Varshita Kolipaka, Akshit Sinha, Debangan Mishra, Sumit Kumar, Arvindh Arun, Shashwat Goel, and Ponnurangam Kumaraguru
In Forty-Second International Conference on Machine Learning, 2025
Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. Because graph data does not follow the independently and identically distributed (i.i.d.) assumption, adversarial manipulations or incorrect data can propagate to other data points through message passing, which deteriorates the model’s performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem of Corrective Unlearning. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method, Cognac, which can unlearn the effect of the manipulation set even when only 5% of it is identified.
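For intuition only, the toy sketch below applies a generic corrective-unlearning baseline (gradient ascent on the identified manipulated nodes while retaining performance on the rest) to a small GCN on random data. This is not the paper's Cognac method; the graph, labels, and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# Toy graph with random features and labels; 'manipulated' marks the identified subset.
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))
y = torch.randint(0, 2, (100,))
manipulated = torch.zeros(100, dtype=torch.bool)
manipulated[:5] = True  # only a small fraction of the manipulated set is identified

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1, self.conv2 = GCNConv(16, 32), GCNConv(32, 2)
    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

model = GCN()
opt = torch.optim.Adam(model.parameters(), lr=0.01)

# Generic corrective-unlearning step (not Cognac): ascend the loss on identified
# manipulated nodes while keeping the loss on the remaining nodes low.
for _ in range(50):
    opt.zero_grad()
    out = model(x, edge_index)
    retain_loss = F.cross_entropy(out[~manipulated], y[~manipulated])
    forget_loss = F.cross_entropy(out[manipulated], y[manipulated])
    (retain_loss - 0.1 * forget_loss).backward()
    opt.step()
```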
@article{kolipaka2024cognacshotforgetbad,title={A Cognac shot to forget bad memories: Corrective Unlearning in GNNs},author={Kolipaka, Varshita and Sinha, Akshit and Mishra, Debangan and Kumar, Sumit and Arun, Arvindh and Goel, Shashwat and Kumaraguru, Ponnurangam},journal={Forty-Second International Conference on Machine Learning},year={2025},saral={https://www.youtube.com/watch?v=9f0cp7mHLLg},}
EASE
Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling
Ishan Kavathekar, Raghav Donakanti, Ponnurangam Kumaraguru, and Karthik Vaidhyanathan
International Conference on Evaluation and Assessment in Software Engineering (EASE), 2025
Function calling is a complex task with widespread applications in domains such as information retrieval, software engineering and automation. For example, a query to book the shortest flight from New York to London on January 15 requires identifying the correct parameters to generate accurate function calls. Large Language Models (LLMs) can automate this process but are computationally expensive and impractical in resource-constrained settings. In contrast, Small Language Models (SLMs) can operate efficiently, offering faster response times, and lower computational demands, making them potential candidates for function calling on edge devices. In this exploratory empirical study, we evaluate the efficacy of SLMs in generating function calls across diverse domains using zero-shot, few-shot, and fine-tuning approaches, both with and without prompt injection, while also providing the finetuned models to facilitate future applications. Furthermore, we analyze the model responses across a range of metrics, capturing various aspects of function call generation. Additionally, we perform experiments on an edge device to evaluate their performance in terms of latency and memory usage, providing useful insights into their practical applicability. Our findings show that while SLMs improve from zero-shot to few-shot and perform best with fine-tuning, they struggle significantly with adhering to the given output format. Prompt injection experiments further indicate that the models are generally robust and exhibit only a slight decline in performance. While SLMs demonstrate potential for the function call generation task, our results also highlight areas that need further refinement for real-time functioning.
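As an illustration of the evaluation loop (the function schema, query, and prompt template are hypothetical, and the model call is stubbed out), the sketch below checks whether a raw model response is a well-formed call against a single schema, which is exactly the output-format adherence the study reports SLMs struggling with.

```python
import json

# Hypothetical function schema and query; the study's benchmarks and models differ.
schema = {"name": "book_flight", "parameters": ["origin", "destination", "date"]}

prompt = (
    "You can call the function below. Respond with ONLY a JSON object of the form "
    '{"name": ..., "arguments": {...}}.\n'
    f"Function: {json.dumps(schema)}\n"
    "Query: Book the shortest flight from New York to London on January 15."
)

def validate_call(raw_output):
    """Check that the raw model text parses to a call with no unexpected arguments."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == schema["name"]
            and set(call.get("arguments", {})) <= set(schema["parameters"]))

# Replace this stub with the small language model's completion for `prompt`.
raw_output = ('{"name": "book_flight", "arguments": '
              '{"origin": "New York", "destination": "London", "date": "2025-01-15"}}')
print(validate_call(raw_output))
```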
@article{kavathekar2025smallmodelsbigtasks,title={Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling},author={Kavathekar, Ishan and Donakanti, Raghav and Kumaraguru, Ponnurangam and Vaidhyanathan, Karthik},year={2025},journal={International Conference on Evaluation and Assessment in Software Engineering (EASE)},eprint={2504.19277},archiveprefix={arXiv},primaryclass={cs.AI},url={https://arxiv.org/abs/2504.19277}}
Thank you for your interest in joining our team! We are always looking for talented and motivated individuals. If you are interested in working with us, please apply here.