Adversarial datasets should ensure AI robustness that matches human performance. However, as models evolve, datasets can become obsolete. Thus, adversarial datasets should be periodically updated based on their degradation in adversarialness. Given the lack of a standardized metric for measuring adversarialness, we propose AdvScore, a human-grounded evaluation metric. AdvScore assesses a dataset’s true adversarialness by capturing models’ and humans’ varying abilities, while also identifying poor examples. AdvScore then motivates a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore using 9,347 human responses and ten language model predictions to track the models’ improvement over five years (from 2020 to 2024). AdvScore assesses whether adversarial datasets remain suitable for model evaluation, measures model improvements, and provides guidance for better alignment with human capabilities.