Avik Dutta | Publications

2026

An Empirical Investigation of Robustness in Large Language Models under Tabular Distortions Avik Dutta, Harshit Nigam, Hosein Hasanbeig, Arjun Radhakrishna, Sumit Gulwani Preprint [Paper]

2025

ConDABench: Interactive Evaluation of Language Models for Data Analysis Avik Dutta, Priyanshu Gupta, Hosein Hasanbeig, Rahul Pratap Singh, Harshit Nigam, Sumit Gulwani, Arjun Radhakrishna, Gustavo Soares, Ashish Tiwari ACM SIGMOD 2026 [Abs] [Paper]
Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models is better at solving more instances, it is not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.

2024

RAR: Retrieval-augmented retrieval for code generation in low resource languages Avik Dutta, Mukul Singh, Gust Verbruggen, Sumit Gulwani, Vu Le EMNLP 2024 [Abs] [Paper] [Presentation]
Language models struggle to generate code for low-resource programming languages, since these are underrepresented in training data. Either examples or documentation are commonly used to improve code generation. We propose to use both types of information together and present retrieval-augmented retrieval (RAR) as a two-step method for selecting relevant examples and documentation. Experiments on three low-resource languages (Power Query M, OfficeScript and Excel formulas) show that RAR outperforms independent example and grammar retrieval (by 2.81% to 26.14%). Interestingly, we show that two-step retrieval selects better examples and documentation when used independently as well.
Context Matters: Pushing the Boundaries of Open-Ended Answer Generation with Graph-Structured Knowledge Context Somnath Banerjee, Amruit Sahoo, Sayan Layek, Avik Dutta, Rima Hazra, Animesh Mukherjee EMNLP 2024 [Abs] [Paper]
This paper introduces a novel framework that combines graph-driven context retrieval with knowledge-graph-based enhancement, honing the proficiency of LLMs, especially in domain-specific community question answering platforms like AskUbuntu, Unix, and ServerFault. We conduct experiments on various LLMs with different parameter sizes to evaluate their ability to ground knowledge and determine factual accuracy in answers to open-ended questions. Our methodology, GraphContextGen, consistently outperforms dominant text-based retrieval systems, demonstrating its robustness and adaptability to a larger number of use cases. This advancement highlights the importance of pairing context-rich data retrieval with LLMs, offering a renewed approach to knowledge sourcing and generation in AI systems. We also show that, due to rich contextual data retrieval, the crucial entities, along with the generated answer, remain factually coherent with the gold answer.
DistALANER: Distantly Supervised Active Learning Augmented Named Entity Recognition in the Open Source Software Ecosystem Somnath Banerjee, Avik Dutta, Aaditya Agrawal, Rima Hazra, Animesh Mukherjee ECML-PKDD 2024 [Abs] [Paper] [Code]
As the AI revolution unfolds, the push toward automating support systems in diverse professional fields, ranging from open-source software to healthcare and from banking to transportation, has become more pronounced. Central to the automation of these systems is the early detection of named entities, a task that is foundational yet fraught with challenges due to the need for domain-specific expert annotations amid a backdrop of specialized terminologies, making the process both costly and complex. In response to this challenge, our paper presents an innovative named entity recognition (NER) framework (https://github.com/NeuralSentinel/DistALANER) tailored for the open-source software domain. Our method stands out by employing a distantly supervised, two-step annotation process that cleverly exploits language heuristics, bespoke lookup tables, external knowledge bases, and an active learning model. This multifaceted strategy not only elevates model performance but also addresses the critical hurdles of high costs and the dearth of expert annotators. A notable achievement of our approach is its capability to enable pre-large language models (pre-LLMs) to significantly outperform specially designed generic/domain-specific LLMs for NER tasks. We also show the effectiveness of NER in the downstream task of relation extraction.
Redefining Developer Assistance: Through Large Language Models in Software Ecosystem Somnath Banerjee, Avik Dutta, Sayan Layek, Amruit Sahoo, Sam Conrad Joyce, Rima Hazra Preprint [Abs] [Paper]
In this paper, we delve into the advancement of domain-specific Large Language Models (LLMs) with a focus on their application in software development. We introduce DevAssistLlama, a model developed through instruction tuning, to assist developers in processing software-related natural language queries. This model, a variant of an instruction-tuned LLM, is particularly adept at handling intricate technical documentation, enhancing developer capability in software-specific tasks. The creation of DevAssistLlama involved constructing an extensive instruction dataset from various software systems, enabling effective handling of Named Entity Recognition (NER), Relation Extraction (RE), and Link Prediction (LP). Our results demonstrate DevAssistLlama's superior capabilities in these tasks, in comparison with other models including ChatGPT. This research not only highlights the potential of specialized LLMs in software development but also presents a pioneering LLM for this domain.