Responding to patients’ inbox messages through patient portals is one of the main contributors to increasing clinician burden. Automatically generating answers to patients’ questions based on their medical records can help address this burden. The objective of this shared task is to automatically answer patients’ questions given relevant clinical evidence from their electronic health records (EHRs).
(Tentative)
All deadlines are 11:59 PM (“Anywhere on Earth”).
Join our Google Group at https://groups.google.com/g/archehr-qa to receive important updates! For any questions related to the shared task, please reach out through the Google Group or by email at sarvesh.soni@nih.gov.
Note: The Google Group emails may end up in your spam folder. Please add archehr-qa@googlegroups.com and noreply@groups.google.com to your address book to ensure delivery of these emails.
We propose the task of automatically generating answers to patients’ health-related questions that are grounded in evidence from the patients’ clinical notes. The dataset will consist of hand-curated, realistic patient questions (submitted through a patient portal) and their corresponding clinician-rewritten versions (crafted to assist in formulating responses). The task is to construct coherent answers to the input questions that are grounded in the provided clinical note excerpts.
Example #1
Patient Question (underlined is the main question): Took my 59 yo father to ER ultrasound discovered he had an aortic aneurysm. He had a salvage repair (tube graft). Long surgery / recovery for couple hours then removed packs. why did they do this surgery????? After this time he spent 1 month in hospital now sent home.
Clinician Question: Why did they perform the emergency salvage repair on him?
Clinical Note (sentences numbered for grounding): 1: He was transferred to the hospital on 2025-1-20 for emergent repair of his ruptured thoracoabdominal aortic aneurysm. 2: He was immediately taken to the operating room where he underwent an emergent salvage repair of ruptured thoracoabdominal aortic aneurysm with a 34-mm Dacron tube graft using deep hypothermic circulatory arrest. 3: Please see operative note for details which included cardiac arrest x2. 4: Postoperatively he was taken to the intensive care unit for monitoring with an open chest. 5: He remained intubated and sedated on pressors and inotropes. 6: On 2025-1-22, he returned to the operating room where he underwent exploration and chest closure. 7: On 1-25 he returned to the OR for abd closure JP/ drain placement/ feeding jejunostomy placed at that time for nutritional support.
8: Thoracoabdominal wound healing well with exception of very small open area mid wound that is @1cm around and 1/2cm deep, no surrounding erythema. 9: Packed with dry gauze and covered w/DSD.
Answer (citations in parentheses refer to the numbered note sentences): His aortic aneurysm was caused by the rupture of a thoracoabdominal aortic aneurysm, which required emergent surgical intervention (1). He underwent a complex salvage repair using a 34-mm Dacron tube graft and deep hypothermic circulatory arrest to address the rupture (2). The extended recovery time and hospital stay were necessary due to the severity of the rupture and the complexity of the surgery, though his wound is now healing well with only a small open area noted (8).
Example #2
Patient Question (underlined is the main question): I had severe abdomen pain and was hospitalised for 15 days in ICU, diagnoised with CBD sludge. Doctor advised for ERCP. My question is if the sludge was there does not any medication help in flushing it out? Whether ERCP was the only cure?
Clinician Question: Why was ERCP recommended over a medication-based treatment for CBD sludge?
Clinical Note (sentences numbered for grounding): 1: During the ERCP a pancreatic stent was required to facilitate access to the biliary system (removed at the end of the procedure), and a common bile duct stent was placed to allow drainage of the biliary obstruction caused by stones and sludge. 2: However, due to the patient’s elevated INR, no sphincterotomy or stone removal was performed. 3: Frank pus was noted to be draining from the common bile duct, and post-ERCP it was recommended that the patient remain on IV Zosyn for at least a week. 4: The Vancomycin was discontinued.
5: On hospital day 4 (post-procedure day 3) the patient returned to ERCP for re-evaluation of her biliary stent as her LFTs and bilirubin continued an upward trend. 6: On ERCP the previous biliary stent was noted to be acutely obstructed by biliary sludge and stones. 7: As the patient’s INR was normalized to 1.2, a sphincterotomy was safely performed, with removal of several biliary stones in addition to the common bile duct stent. 8: At the conclusion of the procedure, retrograde cholangiogram was negative for filling defects.
Answer (citations in parentheses refer to the numbered note sentences): Medications can sometimes help in managing bile duct sludge, but in this case, ERCP was necessary due to the severity of the obstruction and its complications. The initial ERCP revealed significant biliary obstruction caused by sludge and stones, requiring the placement of a stent to restore bile drainage (1). However, even after this intervention, the liver function tests and bilirubin levels continued to rise, indicating that the obstruction was not fully resolved (5). A follow-up ERCP confirmed that the stent itself had become acutely obstructed by sludge and stones, necessitating further intervention (6). During this procedure, a sphincterotomy was performed, and several stones were physically removed, which medications alone could not have achieved (7). These findings confirm that ERCP was essential in addressing her condition and preventing further complications.
We do not enforce the use of any specific version of the questions (patient or clinician) to generate answers; participants can use one or both. Note that the entire clinical note excerpt provided in the dataset may not be required to generate a correct answer to the question, so using all of the sentences from the provided clinical note for grounding is not mandatory. Further, a sentence in the generated answer may be supported by one, multiple, or none (i.e., unsupported) of the sentences from the clinical note. Unsupported sentences in the answer may be ignored during the quantitative evaluation. The answers should be written in a professional register to better match the contents of the clinical notes. Simplification of answers into lay language is assumed to be performed later and is not the focus of this task.
The questions in the dataset are inspired by queries submitted by consumers to the National Library of Medicine (NLM) through MedlinePlus [1], CHiQA [2], and customer services. The clinical note excerpts used in the dataset are inspired by MIMIC [3]. The development and test sets will consist of a sample of questions with corresponding clinical note excerpts. Each question in the development set will also include a plausible answer grounded in the corresponding clinical note. For test questions, the submissions should return a natural language answer with citations to the sentence numbers from the corresponding clinical note excerpt. Clinical note excerpts come pre-annotated with sentence numbers, which must be used to cite the clinical evidence sentences in system responses.
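To make the citation convention concrete, below is a minimal Python sketch (an illustration, not an official tool) that extracts the cited sentence numbers from a generated answer, assuming citations appear as parenthesized sentence numbers such as "(1)" or "(2, 5)", as in the examples above; the exact input and submission formats will be specified with the dataset and the Codabench instructions.

import re

# Minimal sketch: collect the clinical note sentence numbers cited in an answer.
# Assumes citations appear as parenthesized sentence numbers, e.g., "(1)" or "(2, 5)".
def cited_sentences(answer):
    cited = set()
    for group in re.findall(r"\(([\d,\s]+)\)", answer):  # captures "2" or "2, 5"
        for num in group.split(","):
            num = num.strip()
            if num.isdigit():
                cited.add(int(num))
    return cited

example = ("He underwent a complex salvage repair using a 34-mm Dacron tube graft "
           "and deep hypothermic circulatory arrest to address the rupture (2).")
print(cited_sentences(example))  # {2}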
The development and test datasets will be made available in February and March (tentatively), respectively, through PhysioNet [4]. To ensure timely access to the datasets upon release, please sign up for a PhysioNet account and complete the required training to access the MIMIC-III Clinical Database.
The submissions will be evaluated for both the quality of the generated answers and the use of clinical evidence for grounding. The evidence sentences cited in the generated answers will be evaluated using Precision, Recall, and F1 scores against a manually annotated ground-truth set of evidence sentences. The alignment of the sentences in the generated answer with the cited evidence sentence(s) from the clinical notes will be assessed using ROUGE [5], BERTScore [6], AlignScore [7], and MEDCON [8].
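As an illustration of the citation-level metrics, the following minimal Python sketch computes Precision, Recall, and F1 for a system's cited sentence numbers against a manually annotated gold evidence set; this is only an assumption about how the scores are computed, not the official scoring script.

# Minimal sketch: Precision, Recall, and F1 of cited evidence sentence numbers
# against a gold (manually annotated) set of evidence sentences.
def citation_prf(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    overlap = len(predicted & gold)
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: the system cited sentences {1, 2, 8}; annotators marked {1, 2} as essential.
print(citation_prf({1, 2, 8}, {1, 2}))  # approximately (0.667, 1.0, 0.8)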
Submissions of system responses will be made through Codabench [9]. Detailed instructions will be added soon.
All shared task participants are invited to submit a paper describing their systems to the Proceedings of the 24th Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2025. Only short papers will be accepted from shared task participants, and shared task papers will go through a faster review process. All submissions will be made through START at https://softconf.com/acl2025/BioNLP2025-ST. Regardless of whether they decide to submit a paper, participants must send a short one-paragraph summary of their best system to sarvesh.soni@nih.gov for inclusion in the shared task overview paper.
Sarvesh Soni, NLM, NIH
Dina Demner-Fushman, NLM, NIH
We are looking for people to join the program committee; responsibilities will include reviewing papers. If you are interested, please send an email to sarvesh.soni@nih.gov.
[1] MedlinePlus. https://medlineplus.gov/. Accessed December 26, 2024.
[2] Dina Demner-Fushman, Yassine Mrabet, and Asma Ben Abacha. 2020. Consumer health information and question answering: helping consumers find answers to their health-related information needs. Journal of the American Medical Informatics Association, 27(2):194–201. https://doi.org/10.1093/jamia/ocz152
[3] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):160035. https://doi.org/10.1038/sdata.2016.35
[4] PhysioNet. https://physionet.org/. Accessed December 26, 2024.
[5] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics. https://aclanthology.org/W04-1013/
[6] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations. https://openreview.net/forum?id=SkeHuCVFDr
[7] Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.634
[8] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. 2023. Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Scientific Data, 10(1):586. https://doi.org/10.1038/s41597-023-02487-3
[9] Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, and Isabelle Guyon. 2022. Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Patterns, 3(7):100543. https://doi.org/10.1016/j.patter.2022.100543