ArchEHR-QA 2025 (pronounced "Archer"):
Shared Task on Grounded Electronic Health Record Question Answering

BioNLP @ACL 2025 Shared Task on grounded question answering (QA) from electronic health records (EHRs)





Introduction

Responding to patients’ inbox messages through patient portals is a major contributor to growing clinician burden. Automatically generating answers to patients’ questions in light of their medical records could help reduce this burden. The objective of this shared task is to automatically answer patients’ questions given key clinical evidence from their electronic health records (EHRs).

Important Dates

(Tentative)

All deadlines are 11:59 PM (“Anywhere on Earth”).

Join our Google Group at https://groups.google.com/g/archehr-qa to receive important updates! For any questions related to the shared task, please reach out through the Google Group or by email at sarvesh.soni@nih.gov.

Note: The Google Group emails may end up in your spam folder. Please add archehr-qa@googlegroups.com and noreply@groups.google.com to your address book to ensure delivery of these emails.

Task Overview

We propose the task of automatically generating answers to patients’ health-related questions that are grounded in evidence from the patients’ clinical notes. The dataset will consist of hand-curated realistic patient questions (submitted through a patient portal) and their corresponding clinician-rewritten versions (crafted to assist in formulating responses). The task is to generate coherent answers to the input questions that use, and are grounded in, the provided clinical note excerpts.

Example #1

Patient Question

Took my 59 yo father to ER ultrasound discovered he had an aortic aneurysm. He had a salvage repair (tube graft). Long surgery / recovery for couple hours then removed packs. why did they do this surgery????? After this time he spent 1 month in hospital now sent home.

Clinician Question

Why did they perform the emergency salvage repair on him?

Clinical Note (sentences numbered for grounding)

1: He was transferred to the hospital on 2025-1-20 for emergent repair of his ruptured thoracoabdominal aortic aneurysm. 2: He was immediately taken to the operating room where he underwent an emergent salvage repair of ruptured thoracoabdominal aortic aneurysm with a 34-mm Dacron tube graft using deep hypothermic circulatory arrest. 3: Please see operative note for details which included cardiac arrest x2. 4: Postoperatively he was taken to the intensive care unit for monitoring with an open chest. 5: He remained intubated and sedated on pressors and inotropes. 6: On 2025-1-22, he returned to the operating room where he underwent exploration and chest closure. 7: On 1-25 he returned to the OR for abd closure JP/ drain placement/ feeding jejunostomy placed at that time for nutritional support.

8: Thoracoabdominal wound healing well with exception of very small open area mid wound that is @1cm around and 1/2cm deep, no surrounding erythema. 9: Packed with dry gauze and covered w/DSD.

Answer (citations refer to the numbered note sentences above)

His aortic aneurysm was caused by the rupture of a thoracoabdominal aortic aneurysm, which required emergent surgical intervention (1). He underwent a complex salvage repair using a 34-mm Dacron tube graft and deep hypothermic circulatory arrest to address the rupture (2). The extended recovery time and hospital stay were necessary due to the severity of the rupture and the complexity of the surgery, though his wound is now healing well with only a small open area noted (8).

Example #2

Patient Question

I had severe abdomen pain and was hospitalised for 15 days in ICU, diagnoised with CBD sludge. Doctor advised for ERCP. My question is if the sludge was there does not any medication help in flushing it out? Whether ERCP was the only cure?

Clinician Question

Why was ERCP recommended over a medication-based treatment for CBD sludge?

Clinical Note (sentences numbered for grounding)

1: During the ERCP a pancreatic stent was required to facilitate access to the biliary system (removed at the end of the procedure), and a common bile duct stent was placed to allow drainage of the biliary obstruction caused by stones and sludge. 2: However, due to the patient’s elevated INR, no sphincterotomy or stone removal was performed. 3: Frank pus was noted to be draining from the common bile duct, and post-ERCP it was recommended that the patient remain on IV Zosyn for at least a week. 4: The Vancomycin was discontinued.

5: On hospital day 4 (post-procedure day 3) the patient returned to ERCP for re-evaluation of her biliary stent as her LFTs and bilirubin continued an upward trend. 6: On ERCP the previous biliary stent was noted to be acutely obstructed by biliary sludge and stones. 7: As the patient’s INR was normalized to 1.2, a sphincterotomy was safely performed, with removal of several biliary stones in addition to the common bile duct stent. 8: At the conclusion of the procedure, retrograde cholangiogram was negative for filling defects.

Answer (citations refer to the numbered note sentences above)

Medications can sometimes help in managing bile duct sludge, but in this case, ERCP was necessary due to the severity of the obstruction and its complications. The initial ERCP revealed significant biliary obstruction caused by sludge and stones, requiring the placement of a stent to restore bile drainage (1). However, even after this intervention, the liver function tests and bilirubin levels continued to rise, indicating that the obstruction was not fully resolved (5). A follow-up ERCP confirmed that the stent itself had become acutely obstructed by sludge and stones, necessitating further intervention (6). During this procedure, a sphincterotomy was performed, and several stones were physically removed, which medications alone could not have achieved (7). These findings confirm that ERCP was essential in addressing her condition and preventing further complications.

Task Details

Inputs: We do not enforce the use of any specific version of questions (patient or clinician) to generate answers; the participants can use one or both. Note that the entire clinical note excerpt provided in the dataset may not be required in order to generate an answer to the question. Thus, using all sentences from the clinical notes for grounding is not mandatory. Further, the questions may not be answered in full by the clinical notes. This is a natural scenario and a step toward answering questions using the whole EHR. The model must only make use of the provided inputs (questions and note excerpts) to generate an answer. The questions may also require additional world knowledge to answer. However, no external knowledge should be explicitly given to the model (see the note below).

Note: Participants must submit at least one run (out of a maximum of three) that follows the guidelines prohibiting external knowledge. Additionally, they may use extra data, but we ask that they describe their approach, including any public or non-public datasets used, when preparing the submission.
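As one illustration only (not a required or recommended approach), a system might concatenate the provided inputs into a single prompt. The function and prompt wording below are hypothetical, and any method that respects the grounding and external-knowledge rules above is acceptable.

    # Illustrative only: assemble the provided inputs into one prompt string.
    # The function name, argument names, and wording are hypothetical, not a required format.
    def build_prompt(patient_question, clinician_question, note_sentences):
        numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(note_sentences, start=1))
        return (
            "Patient question: " + patient_question + "\n"
            "Clinician question: " + clinician_question + "\n"
            "Clinical note sentences:\n" + numbered + "\n"
            "Answer in at most 75 words, citing supporting sentence numbers "
            "in parentheses, e.g., (1)."
        )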

Outputs: Each sentence in the generated answer may be supported by one, multiple, or none (unsupported) of the sentences from the clinical note. Unsupported answer sentences may be ignored during the quantitative evaluation. Answers should be written in a professional register to better match the contents of the clinical notes; simplification of answers into lay language is assumed to be performed later and is not the focus of this task. The generated answer should be limited to 75 words, which roughly corresponds to 5 sentences. This is based on our observations from the baseline experiments and existing literature supporting that a paragraph-long answer is preferred by users [1,2]. There is no limit on the number of note sentences cited.
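As a rough sanity check of these output constraints, the sketch below counts words and extracts parenthesized sentence-number citations in the style of the examples above. The exact word-counting rules and the official submission format on Codabench may differ.

    import re

    # Illustrative check against the constraints described above; the official
    # submission format and counting rules may impose additional requirements.
    def check_answer(answer, num_note_sentences, word_limit=75):
        words = len(answer.split())  # simple whitespace word count
        # Citations written as single parenthesized sentence numbers, e.g., (1),
        # following the style of the examples above.
        cited = {int(n) for n in re.findall(r"\((\d+)\)", answer)}
        assert words <= word_limit, f"answer has {words} words (limit {word_limit})"
        assert all(1 <= n <= num_note_sentences for n in cited), "citation out of range"
        return cited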

Data

The dataset consists of questions (inspired by real patient questions) and associated EHR data (derived from the MIMIC database [3]) containing important clinical evidence for answering these questions. Each question-note pair is referred to as a “case”. Clinical note excerpts come pre-annotated with sentence numbers, which must be used to cite the clinical evidence sentences in system responses. Each sentence is manually annotated with a “relevance” label marking its importance in answering the given question as "essential", "supplementary", or "not-relevant".

The development set comes with the relevance keys. For the test set cases, the submissions should return a natural language answer with citations to the clinical note sentence numbers.
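Purely for orientation, here is a hypothetical sketch of how one case could be represented in code. The field names are illustrative and the actual PhysioNet release may use a different file format; the note text is abridged from Example #1 and the relevance labels shown are hypothetical.

    # Hypothetical sketch of one "case" as described above; the actual PhysioNet
    # release may use different file formats and field names.
    case = {
        "case_id": "example",
        "patient_question": "Took my 59 yo father to ER ... why did they do this surgery?",
        "clinician_question": "Why did they perform the emergency salvage repair on him?",
        "note_sentences": [
            # (sentence number, text, relevance label -- labels are provided in
            #  the development set only; the labels below are hypothetical)
            (1, "He was transferred to the hospital on 2025-1-20 ...", "essential"),
            (2, "He was immediately taken to the operating room ...", "essential"),
            (8, "Thoracoabdominal wound healing well ...", "supplementary"),
        ],
    }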

Access

The dataset is available on PhysioNet. Please sign up for PhysioNet [4] and complete the required training to access the dataset.

Evaluation

Submissions will be evaluated based on their use of clinical evidence for grounding (“Factuality”) and the relevance of the generated answers (“Relevance”).

Factuality will be assessed by calculating precision, recall, and F1 scores between the evidence sentences cited in the generated answers and the manually annotated ground-truth set of evidence sentences. Each note sentence is manually annotated with a 'relevance' label marking its importance in answering the given question as 'essential', 'supplementary', or 'not-relevant'. Two variations of the Citation F1 score will be calculated: in the “strict” variation, only sentences labeled 'essential' count as ground truth; in the “lenient” variation, sentences labeled 'essential' or 'supplementary' count as ground truth.
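The sketch below illustrates the citation scoring described above. It is not the official script; see the GitHub repository linked further down for the authoritative implementation.

    # Sketch of citation precision/recall/F1 over cited sentence numbers.
    def citation_prf(cited, essential, supplementary, lenient=False):
        gold = set(essential) | (set(supplementary) if lenient else set())
        cited = set(cited)
        tp = len(cited & gold)                                   # correctly cited sentences
        precision = tp / len(cited) if cited else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

For instance, the answer in Example #1 cites sentences 1, 2, and 8; if sentences 1 and 2 were labeled 'essential' and sentence 8 'supplementary', the strict scores would use {1, 2} as ground truth while the lenient scores would use {1, 2, 8}.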

Relevance will be evaluated by comparing the generated answer text with the ground-truth 'essential' note sentences and the question using the following evaluation metrics: BLEU [5], ROUGE [6], SARI [7], BERTScore [8], AlignScore [9], and MEDCON [10].

The overall leaderboard score will be the mean of the Factuality score (strict Citation F1) and the Relevance score (a combination of all the normalized relevance metrics). The scoring script is available on GitHub at https://github.com/soni-sarvesh/archehr-qa/tree/main/evaluation.
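How the relevance metrics are normalized and combined is defined by the official scoring script; purely as a sketch, assuming the relevance metrics have already been normalized to [0, 1] and are combined by simple averaging, the aggregation could look like this.

    # Illustration of the leaderboard aggregation described above, assuming the
    # relevance metrics are already normalized to [0, 1]; the official scoring
    # script on GitHub is authoritative.
    def overall_score(strict_citation_f1, normalized_relevance_scores):
        relevance = sum(normalized_relevance_scores) / len(normalized_relevance_scores)
        return (strict_citation_f1 + relevance) / 2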

System Submission

Please visit the competition on Codabench at https://www.codabench.org/competitions/5302/ to submit system responses. Each team may make up to three successful submissions to Codabench [11] in total. Note that the automatically computed scores on Codabench are not final; however, they should be fairly close to the final scores, which will be computed after reconciliation of double annotations.

Paper Submission

All shared task participants are invited to submit a paper describing their systems to the Proceedings of the 24th Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2025. Only short papers will be accepted from shared task participants, and these papers will go through an expedited review process. All submissions must be made through START at https://softconf.com/acl2025/BioNLP2025-ST. Regardless of whether they submit a paper, participants must send a short one-paragraph summary of their best system to sarvesh.soni@nih.gov for inclusion in the shared task overview paper.

Organizers

Sarvesh Soni
NLM, NIH

Dina Demner-Fushman
NLM, NIH

Program Committee

We are looking for people to join the program committee, where the responsibilities will include reviewing papers. If you are interested, please send an email to sarvesh.soni@nih.gov.

References

  1. Lin, J., Quan, D., Sinha, V., Bakshi, K., Huynh, D., Katz, B., & Karger, D. R. (2003, September). What Makes a Good Answer? The Role of Context in Question Answering. In INTERACT.

  2. Jeon, J., Croft, W. B., Lee, J. H., & Park, S. (2006, August). A framework to predict the quality of answers with non-textual features. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 228-235). https://doi.org/10.1145/1148170.1148212 

  3. Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):160035. https://doi.org/10.1038/sdata.2016.35 

  4. PhysioNet. https://physionet.org/ Accessed Dec 26, 2024 

  5. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318). https://doi.org/10.3115/1073083.1073135 

  6. Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics. https://aclanthology.org/W04-1013/ 

  7. Xu, W., Napoles, C., Pavlick, E., Chen, Q., & Callison-Burch, C. (2016). Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4, 401-415. https://aclanthology.org/Q16-1029/ 

  8. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations. https://openreview.net/forum?id=SkeHuCVFDr 

  9. Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.634 

  10. Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. 2023. Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Scientific Data, 10(1):586. https://doi.org/10.1038/s41597-023-02487-3 

  11. Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, and Isabelle Guyon. 2022. Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Patterns, 3(7):100543. https://doi.org/10.1016/j.patter.2022.100543