Research Agenda
My research sits at the intersection of machine learning and the empirical study of structured systems — both social and biological. I am most interested in questions where statistical learning methods need to engage with substantive domain theory: not just whether a model performs well, but what its representations reveal about the underlying system, and what counts as rigorous evidence of generalization. The three threads below organize my current work.
Building multi-agent simulations of coordination, norm emergence, and AI-mediated interventions under heterogeneous incentives.
Studying protein representation learning for membrane protein and signal peptide prediction using ESM, ProtBERT, LoRA, and biological priors.
Benchmarking transformer and LLM behavior on Hindi-English named entity recognition, with attention to script variation, domain shift, and evaluation cost.
Theme 1
How do artificial agents reshape cooperation in groups where responsibility is distributed across many actors? Most human–AI collaboration research studies one human interacting with one AI system, but many real-world coordination problems are multi-party: several actors share responsibility for a common outcome, yet each faces a private incentive to free-ride. I am interested in how AI agents, acting as advisors, coordinators, monitors, or commitment-support systems, can change the dynamics of cooperation in these settings.
My current work develops a simulation-based research platform for this question, using global supply chains as the empirical domain. Supply chains are a useful site for studying multi-party coordination because they combine several mechanisms in one environment: public-goods support for ethical production, strategic ignorance under noisy monitoring, threshold coordination on remediation, and exit-versus-engagement after misconduct. Together, these create a setting where individually rational behavior reliably produces collectively worse outcomes — and where AI intervention has plausible levers.
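The tension between individual and collective rationality can be seen in a textbook linear public goods game. The sketch below is a generic illustration, not the calibrated supply-chain model: the function name `public_goods_payoffs`, the endowment, and the multiplier are all assumptions chosen for clarity, and only the count of five players mirrors the five buyers described in this section.

```python
def public_goods_payoffs(contributions, endowment=1.0, multiplier=2.0):
    """Payoffs in a linear public goods game (illustrative parameters).

    Each player keeps whatever they do not contribute and receives an
    equal share of the multiplied common pool.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Five players; every numeric value here is hypothetical.
all_contribute = public_goods_payoffs([1.0] * 5)
one_free_rider = public_goods_payoffs([0.0, 1.0, 1.0, 1.0, 1.0])
none_contribute = public_goods_payoffs([0.0] * 5)

# Payoff of player 0 in each scenario.
print(all_contribute[0], one_free_rider[0], none_contribute[0])
```

With these numbers the lone free rider earns more than under full contribution, while total welfare falls, and universal free-riding leaves everyone worse off than universal contribution. That is the pattern the supply-chain setting combines with monitoring, thresholds, and exit decisions.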
Built a calibrated game-theoretic simulation modeling five buyers and three suppliers across repeated rounds, integrating public goods support, strategic ignorance, threshold remediation, exit-versus-engagement, and auditing into a single domain-grounded model. Buyer policy uses quantal response equilibrium with a Stage 2 belief solver that jointly fixes support, monitoring, and action probabilities. The simulation reproduces the target inefficient-equilibrium pattern (~22–26% welfare loss versus the cooperative ceiling, ~99% supplier defection rate, ~7% remediation rate) across a 12-cell calibration sweep with 50 simulations per cell. The backend has 101 passing tests and has been through three audit rounds.
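The quantal response idea behind the buyer policy can be illustrated with the standard logit choice rule, in which better actions are chosen more often but not with certainty. This is a minimal sketch and not the platform's Stage 2 belief solver: the name `logit_response`, the precision parameter `lam`, and the payoff values are hypothetical.

```python
import math


def logit_response(payoffs, lam):
    """Logit quantal response: P(a) is proportional to exp(lam * payoff[a]).

    lam = 0 gives uniform random choice; large lam approaches best response.
    """
    top = max(payoffs)  # subtract the max for numerical stability
    weights = [math.exp(lam * (u - top)) for u in payoffs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical expected payoffs for one buyer: [support, free-ride].
expected_payoffs = [1.0, 1.5]
noisy = logit_response(expected_payoffs, lam=0.0)
sharp = logit_response(expected_payoffs, lam=10.0)
print(noisy, sharp)
```

An equilibrium solver would iterate a rule like this against beliefs about the other players until the choice probabilities are mutually consistent; the sketch shows only the single-agent response step.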
The next phase turns this simulation into an experimental platform comparing the algorithmic baseline against human-participant sessions in which AI agents act as advisors, coordinators, monitors, or commitment-support systems. The longer-term question is whether AI mediation reduces welfare loss in genuinely multi-party settings — and through what mechanism: better information sharing, stronger commitments, clearer coordination, or redesigned incentives.
Capstone seminar paper available on request.
Theme 2
Large pretrained models are increasingly used as general-purpose feature extractors in scientific domains, but their adaptation to specific downstream tasks remains expensive and often biologically uninformed. I am interested in how domain-specific structure — substitution patterns, physicochemical constraints, evolutionary priors — can be injected into parameter-efficient fine-tuning, and how we evaluate whether such inductive biases actually help.
Co-developed a parameter-efficient fine-tuning method that injects residue-level biological priors (BLOSUM62, hydrophobicity, Grantham distance) into LoRA adapters for ESM-2 protein language models. On SignalP6 benchmarks, BLOSUM62-guided LoRA matches or exceeds full fine-tuning across SP MCC, cleavage-site exactness, and residue-level F1 — while training only ~3.6% of parameters and using ~43% less peak GPU memory at the 3B-parameter scale.
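The parameter savings come from LoRA's low-rank structure: instead of updating a full weight matrix, it trains two thin factors whose product is added to the frozen weights. The sketch below only counts parameters for a single adapted matrix. The dimensions and rank are illustrative assumptions, the helper `lora_param_counts` is hypothetical, and the result is not meant to reproduce the ~3.6% figure, which depends on which layers carry adapters and on the rest of the model.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare a full weight update with a rank-`rank` LoRA update.

    Full fine-tuning updates a d_out x d_in matrix. LoRA trains
    B (d_out x rank) and A (rank x d_in) and adds B @ A to the frozen weight.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora, lora / full


# Illustrative square projection matrix and rank; both are assumptions.
full, lora, fraction = lora_param_counts(d_in=2560, d_out=2560, rank=16)
print(f"full={full:,} lora={lora:,} fraction={fraction:.2%}")
```

The sketch says nothing about how the biological priors enter the adapters; it shows only why the trainable footprint stays small.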
A key methodological finding: single-seed improvements frequently fail to replicate under multi-seed confirmation. We document a hyperparameter screen in which the two strongest single-seed alternatives both lost to the original baseline under three-seed validation, which motivates multi-seed evaluation as a default in this setting.
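The failure mode is easy to reproduce with toy numbers. In the sketch below, the configuration names and every score are invented for illustration and are not results from the manuscript; the helpers `single_seed_winner` and `multi_seed_winner` are hypothetical as well.

```python
from statistics import mean, stdev


def single_seed_winner(runs, seed_index=0):
    """Pick the configuration with the best score on one seed only."""
    return max(runs, key=lambda name: runs[name][seed_index])


def multi_seed_winner(runs):
    """Pick the configuration with the best mean score across seeds."""
    return max(runs, key=lambda name: mean(runs[name]))


# Invented scores, one per seed. NOT results from the manuscript.
runs = {
    "baseline": [0.900, 0.905, 0.902],
    "variant_a": [0.912, 0.885, 0.890],
}

for name, scores in runs.items():
    print(f"{name}: mean={mean(scores):.4f} sd={stdev(scores):.4f}")
print("single-seed pick:", single_seed_winner(runs))
print("multi-seed pick:", multi_seed_winner(runs))
```

Here the variant looks best on the first seed but has a lower mean and a larger spread once all three seeds are considered, so the two selection rules disagree.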
Manuscript under review.
Theme 3
Large multilingual models often perform well on resource-rich languages but degrade in code-mixed and low-resource settings, a problem with both practical and methodological dimensions. I am interested in how task-specific fine-tuning compares to large-model in-context approaches when the underlying linguistic structure is genuinely mixed.
Fine-tuned multilingual transformers (mBERT, XLM-RoBERTa) on the COMI-LINGUA dataset for Hindi–English code-mixed NER, achieving 78% entity-level F1 — outperforming zero-shot GPT-4o and LLaMA 3.1 baselines under controlled evaluation. Analyzed trade-offs between model scale, task-specific fine-tuning, generalization, and computational cost.
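Entity-level F1 credits a prediction only when both the span boundaries and the entity type match exactly, which makes it stricter than token-level accuracy. The sketch below assumes BIO-style tags, which may not match the dataset's actual annotation scheme; the function names, the tag labels, and the example sequences are hypothetical, and a real evaluation would pool counts over the whole corpus instead of scoring one sentence.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence.

    Stray I- tags with no open span of the same type are ignored here;
    other scorers may treat them as the start of a new entity.
    """
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 for a single tagged sequence."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical five-token sentence with one person and one organization.
gold = ["B-PER", "O", "B-ORG", "I-ORG", "O"]
pred = ["B-PER", "O", "B-ORG", "O", "O"]  # organization span cut short
score = entity_f1(gold, pred)
print(score)
```

The truncated organization span counts as both a missed entity and a spurious one, so a single boundary error costs precision and recall at once.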