Publications

(2025). VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures. ACM CHI Human-centered Evaluation and Auditing of Language Models (HEAL@CHI Workshop 2025).

arXiv

(2025). Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness. Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025). Main.

arXiv

(2025). Granular Benchmark for Evaluating Model Calibration against Human Calibration.

arXiv

(2024). You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions. Empirical Methods in Natural Language Processing (EMNLP 2024). Main.

PDF Code

(2023). Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines. Empirical Methods in Natural Language Processing (EMNLP 2023). Main.

PDF Dataset Code Video

(2020). Topical Keyphrase Extraction with Hierarchical Semantic Networks. Decision Support Systems (DSS 2020).

PDF arXiv