ArchEHR-QA 2026 ("Archer")


CL4Health @LREC 2026 Shared Task on grounded question answering (QA) from electronic health records (EHRs)

News

Introduction

The ArchEHR-QA (“Archer”) shared task focuses on answering patients’ health-related questions using their own electronic health records (EHRs). While prior work has explored general health question answering, far less attention has been paid to leveraging patient-specific records and to grounding model outputs in explicit clinical evidence, i.e., linking answers to specific supporting content in the clinical notes. ArchEHR-QA addresses this gap by targeting the problem of producing answers to patient questions that are supported by and explicitly linked to the underlying clinical notes.

Important Dates

(Tentative)

All deadlines are 11:59 PM (“Anywhere on Earth”).

Join our Google Group at https://groups.google.com/g/archehr-qa to receive important updates! For any questions related to the shared task, please reach out via the Google Group or email sarvesh.soni@nih.gov.

Note: The Google Group emails may end up in your spam folder. Please add archehr-qa@googlegroups.com and noreply@groups.google.com to your address book to ensure delivery of these emails.

Task Overview

This second iteration builds on the 2025 challenge (which was co-located with the ACL 2025 BioNLP Workshop) by expanding the dataset and introducing four complementary subtasks spanning patient question interpretation, clinical evidence identification, answer generation, and answer–evidence alignment. This year’s shared task will follow a staged data release schedule, with separate deadlines for Subtask 1, Subtasks 2–3, and Subtask 4. Teams may participate in any subset of subtasks and will be invited to submit system description papers detailing their approaches and results.

Example Cases from the Dataset

The dataset consists of patient-authored questions, corresponding clinician-interpreted counterparts, clinical note excerpts with sentence-level relevance annotations, and reference clinician-authored answers grounded in the notes.

Example Case #1

Patient Question

I had severe abdomen pain and was hospitalised for 15 days in ICU, diagnoised with CBD sludge. Doctor advised for ERCP. My question is if the sludge was there does not any medication help in flushing it out? Whether ERCP was the only cure?

Clinician-Interpreted Question

Why was ERCP recommended over a medication-based treatment for CBD sludge?

Clinical Note Excerpt (sentences numbered for grounding)

1: During the ERCP a pancreatic stent was required to facilitate access to the biliary system (removed at the end of the procedure), and a common bile duct stent was placed to allow drainage of the biliary obstruction caused by stones and sludge. 2: However, due to the patient’s elevated INR, no sphincterotomy or stone removal was performed. 3: Frank pus was noted to be draining from the common bile duct, and post-ERCP it was recommended that the patient remain on IV Zosyn for at least a week. 4: The Vancomycin was discontinued.

5: On hospital day 4 (post-procedure day 3) the patient returned to ERCP for re-evaluation of her biliary stent as her LFTs and bilirubin continued an upward trend. 6: On ERCP the previous biliary stent was noted to be acutely obstructed by biliary sludge and stones. 7: As the patient’s INR was normalized to 1.2, a sphincterotomy was safely performed, with removal of several biliary stones in addition to the common bile duct stent. 8: At the conclusion of the procedure, retrograde cholangiogram was negative for filling defects.

Clinical Specialty

Gastroenterology

Answer (citations refer to the numbered note sentences)

An endoscopic retrograde cholangiopancreatography, ERCP, was recommended to place a common bile duct stent [1]. This stent was placed to allow drainage of the biliary obstruction which was caused by stones and sludge [1]. Due to no improvement in liver function, the patient needed a repeat ERCP [5]. The repeat ERCP showed that the biliary stent placed in the first ERCP was obstructed by stones and sludge [6]. The stones and stent were successfully removed during this procedure by performing a sphincterotomy [7].

Example Case #2

Patient Question

I just wrote about my dad given multiple shots of lasciks after he was already so swelled his shin looked like it would burst open. Why would they give him so much. He was on oxygen and they took him off of the higher flow rate.

Clinician-Interpreted Question

Why was he given lasix and his oxygen flow rate was reduced?

Clinical Note Excerpt (sentences numbered for grounding)

1: Acute diastolic heart failure: Pt developed signs and symptoms of volume overload with shortness of breath, increased oxygen requirement and lower extremity edema. 2: Echo showed preserved EF, no WMA and worsening AI. 3: CHF most likely secondary to worsening valvular disease. 4: He was diuresed with lasix IV, intermittently on lasix gtt then transitioned to PO torsemide with improvement in symptoms, although remained on a small amount of supplemental oxygen for comfort.

5: Respiratory failure: The patient was intubated for lethargy and acidosis initially and was given 8 L on his presentation to help maintain his BP’s. 6: This undoubtedly contributed to his continued hypoxemic respiratory failure. 7: He was advanced to pressure support with stable ventilation and oxygenation. 8: On transfer to the CCU patient was still intubated but off pressors. 9: Patient was extubated successfully. 10: He was reintubated transiently for 48 hours for urgent TEE and subsequently extubated without adverse effect or complication.

Clinical Specialty

Cardiology

Answer (citations refer to the numbered note sentences)

The patient was given Lasix for acute diastolic heart failure with symptoms including shortness of breath and lower extremity edema [1,4]. The patient was given 8 liters of fluid to help maintain his blood pressure which contributed to his respiratory failure [5,6]. After the patient’s heart failure was treated with Lasix, he showed improvement in shortness of breath and his oxygen requirement, and he only needed to remain on a small amount of oxygen for comfort [4].

Subtask 1: Question Interpretation

Patient questions are often long and verbose, making it important to quickly identify the underlying information need. This subtask evaluates a system’s ability to transform a free-text, patient-authored question into a clear and concise clinician-interpreted question that reflects how a clinician would query a smart electronic health record (EHR) system when preparing a response to the patient.

Example #1

Input – Patient Question

Took my 59 yo father to ER ultrasound discovered he had an aortic aneurysm. He had a salvage repair (tube graft). Long surgery / recovery for couple hours then removed packs. why did they do this surgery????? After this time he spent 1 month in hospital now sent home.

Target output – Clinician-Interpreted Question

Why did they perform the emergency salvage repair on him?

Sample System Response (Good)

What was the indication for emergent surgical repair for his aortic aneurysm?

Sample System Response (Bad – generic)

What is the purpose of salvage repair for aortic aneurysm treatment?

Example #2

Patient Question

I just wrote about my dad given multiple shots of lasciks after he was already so swelled his shin looked like it would burst open. Why would they give him so much. He was on oxygen and they took him off of the higher flow rate.

Target output – Clinician-Interpreted Question

Why was he given lasix and his oxygen flow rate was reduced?

Sample System Response (Good)

Why were multiple doses of lasix administered despite decreased oxygen requirements and swelling?

Sample System Response (Ok – generic/malformed)

Lasix dosage protocol for severe edema with oxygen therapy?


Subtask 2: Evidence Identification

Clinical notes are lengthy and provide rich context across multiple problems and events. This subtask evaluates a system’s ability to identify the minimal set of note sentences that provide the clinical evidence needed to answer a patient’s question.

Example #1

Input – Patient Question

I had severe abdomen pain and was hospitalised for 15 days in ICU, diagnoised with CBD sludge. Doctor advised for ERCP. My question is if the sludge was there does not any medication help in flushing it out? Whether ERCP was the only cure?

Input – Clinician-Interpreted Question

Why was ERCP recommended over a medication-based treatment for CBD sludge?

Input – Clinical Note Excerpt (with sentences numbered)

1: During the ERCP a pancreatic stent was required to facilitate access to the biliary system (removed at the end of the procedure), and a common bile duct stent was placed to allow drainage of the biliary obstruction caused by stones and sludge. 2: However, due to the patient’s elevated INR, no sphincterotomy or stone removal was performed. 3: Frank pus was noted to be draining from the common bile duct, and post-ERCP it was recommended that the patient remain on IV Zosyn for at least a week. 4: The Vancomycin was discontinued.

5: On hospital day 4 (post-procedure day 3) the patient returned to ERCP for re-evaluation of her biliary stent as her LFTs and bilirubin continued an upward trend. 6: On ERCP the previous biliary stent was noted to be acutely obstructed by biliary sludge and stones. 7: As the patient’s INR was normalized to 1.2, a sphincterotomy was safely performed, with removal of several biliary stones in addition to the common bile duct stent. 8: At the conclusion of the procedure, retrograde cholangiogram was negative for filling defects.

Target output – Relevant Evidence Sentences

[1, 5, 6, 7]

Sample System Response (missing key evidence – Precision 1.00, Recall 0.50, F1 0.67)

[5, 6]

Sample System Response (over-inclusive – Precision 0.50, Recall 1.00, F1 0.67)

[1, 2, 3, 4, 5, 6, 7, 8]
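To make the scoring shown above concrete, here is a minimal Python sketch (not the official scorer) that computes set-based precision, recall, and F1 over predicted versus gold evidence sentence IDs for a single case; it reproduces the figures reported for the two sample responses.

# Minimal sketch (not the official scorer): set-based precision, recall, and F1
# over predicted vs. gold evidence sentence IDs for a single case.

def evidence_prf(gold_ids, pred_ids):
    gold, pred = set(gold_ids), set(pred_ids)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["1", "5", "6", "7"]
print(evidence_prf(gold, ["5", "6"]))                     # (1.0, 0.5, 0.67) – missing key evidence
print(evidence_prf(gold, [str(i) for i in range(1, 9)]))  # (0.5, 1.0, 0.67) – over-inclusive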


Subtask 3: Answer Generation

This subtask evaluates a system’s ability to generate an answer grounded in the provided clinical note excerpt.

Example #1

Input – Patient Question

I had severe abdomen pain and was hospitalised for 15 days in ICU, diagnoised with CBD sludge. Doctor advised for ERCP. My question is if the sludge was there does not any medication help in flushing it out? Whether ERCP was the only cure?

Input – Clinician-Interpreted Question

Why was ERCP recommended over a medication-based treatment for CBD sludge?

Input – Clinical Note Excerpt (with sentences numbered)

(See Example Case #1 above.)

Target output – Answer (without citations)

An endoscopic retrograde cholangiopancreatography, ERCP, was recommended to place a common bile duct stent. This stent was placed to allow drainage of the biliary obstruction which was caused by stones and sludge. Due to no improvement in liver function, the patient needed a repeat ERCP. The repeat ERCP showed that the biliary stent placed in the first ERCP was obstructed by stones and sludge. The stones and stent were successfully removed during this procedure by performing a sphincterotomy.

Sample System Response (bad – generic/not grounded)

Sometimes bile duct sludge is treated with observation or medications, but procedures like ERCP are commonly recommended when symptoms are severe or there is concern for blockage. ERCP lets clinicians look at the ducts and place a stent to help bile flow, and they may repeat the procedure if tests do not improve. Decisions depend on overall risk, lab trends, and how the patient responds to initial treatment.

Sample System Response (ok – some grounding but mostly generic)

ERCP is often used when sludge or stones are blocking the common bile duct, because it can relieve obstruction and improve drainage. In the note, a bile duct stent was placed to help drainage, and a repeat ERCP was performed when labs continued to worsen. If medications are unlikely to clear a mechanical blockage, endoscopic treatment can allow direct removal of stones/sludge once it is safe to do so.


Subtask 4: Evidence Alignment

Grounded EHR QA requires not only producing an answer, but also explicitly showing where in the EHR the answer comes from. This subtask evaluates a system’s ability to align each answer sentence to the specific supporting sentence(s) in the clinical note excerpt.

Example #1

Input – Patient Question

I had severe abdomen pain and was hospitalised for 15 days in ICU, diagnoised with CBD sludge. Doctor advised for ERCP. My question is if the sludge was there does not any medication help in flushing it out? Whether ERCP was the only cure?

Input – Clinician-Interpreted Question

Why was ERCP recommended over a medication-based treatment for CBD sludge?

Input – Clinical Note Excerpt (with sentences numbered)

1: During the ERCP a pancreatic stent was required to facilitate access to the biliary system (removed at the end of the procedure), and a common bile duct stent was placed to allow drainage of the biliary obstruction caused by stones and sludge. 2: However, due to the patient’s elevated INR, no sphincterotomy or stone removal was performed. 3: Frank pus was noted to be draining from the common bile duct, and post-ERCP it was recommended that the patient remain on IV Zosyn for at least a week. 4: The Vancomycin was discontinued.

5: On hospital day 4 (post-procedure day 3) the patient returned to ERCP for re-evaluation of her biliary stent as her LFTs and bilirubin continued an upward trend. 6: On ERCP the previous biliary stent was noted to be acutely obstructed by biliary sludge and stones. 7: As the patient’s INR was normalized to 1.2, a sphincterotomy was safely performed, with removal of several biliary stones in addition to the common bile duct stent. 8: At the conclusion of the procedure, retrograde cholangiogram was negative for filling defects.

Input – Answer (with sentences numbered and no citations)

1: An endoscopic retrograde cholangiopancreatography, ERCP, was recommended to place a common bile duct stent. 2: This stent was placed to allow drainage of the biliary obstruction which was caused by stones and sludge. 3: Due to no improvement in liver function, the patient needed a repeat ERCP. 4: The repeat ERCP showed that the biliary stent placed in the first ERCP was obstructed by stones and sludge. 5: The stones and stent were successfully removed during this procedure by performing a sphincterotomy.

Target Output – Answer (with citations)

1: An endoscopic retrograde cholangiopancreatography, ERCP, was recommended to place a common bile duct stent [1]. 2: This stent was placed to allow drainage of the biliary obstruction which was caused by stones and sludge [1]. 3: Due to no improvement in liver function, the patient needed a repeat ERCP [5]. 4: The repeat ERCP showed that the biliary stent placed in the first ERCP was obstructed by stones and sludge [6]. 5: The stones and stent were successfully removed during this procedure by performing a sphincterotomy [7].

Sample System Response (under-citing – Micro Precision 0.75, Recall 0.6, F1 0.67)

1: An endoscopic […] bile duct stent [1]. 2: This stent was placed […] caused by stones and sludge. 3: Due to no improvement […] repeat ERCP [5,6]. 4: The repeat ERCP […] obstructed by stones and sludge [6]. 5: The stones and stent […] sphincterotomy.

Sample System Response (over-citing – Micro Precision 0.50, Recall 1.00, F1 0.67)

1: An endoscopic […] bile duct stent [1]. 2: This stent was placed […] caused by stones and sludge [1,6]. 3: Due to no improvement […] repeat ERCP [5]. 4: The repeat ERCP […] obstructed by stones and sludge [1,5,6]. 5: The stones and stent […] sphincterotomy [6,7,8].
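The micro-averaged scores in the two sample responses above can be reproduced by treating every answer-sentence-to-note-sentence citation as a single link, as in the following minimal sketch (again, not the official scorer).

# Minimal sketch (not the official scorer): micro-averaged precision, recall,
# and F1 over (answer sentence, note sentence) citation links for one case.
# Each mapping gives, for every answer sentence ID, the cited note sentence IDs.

def micro_alignment_prf(gold, pred):
    gold_links = {(a, e) for a, cited in gold.items() for e in cited}
    pred_links = {(a, e) for a, cited in pred.items() for e in cited}
    tp = len(gold_links & pred_links)
    precision = tp / len(pred_links) if pred_links else 0.0
    recall = tp / len(gold_links) if gold_links else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {"1": ["1"], "2": ["1"], "3": ["5"], "4": ["6"], "5": ["7"]}
under_citing = {"1": ["1"], "2": [], "3": ["5", "6"], "4": ["6"], "5": []}
over_citing = {"1": ["1"], "2": ["1", "6"], "3": ["5"], "4": ["1", "5", "6"], "5": ["6", "7", "8"]}
print(micro_alignment_prf(gold, under_citing))  # (0.75, 0.60, 0.67)
print(micro_alignment_prf(gold, over_citing))   # (0.50, 1.00, 0.67)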


Data

The dataset consists of patient-authored questions (inspired by real patient questions) and associated clinical note excerpts (derived from the MIMIC database [3]). Each data instance is referred to as a case and contains:

ID format. Case IDs and sentence IDs are integers stored as strings (e.g., "1", "5").

Development set

The development set includes the gold outputs needed to develop and/or validate systems for the subtasks (e.g., clinician-interpreted questions, sentence-level evidence labels, reference answers, and answer–evidence alignments). These are case IDs 1–20 under the “dev” directory at the PhysioNet repository.

Test set

The test set provides the inputs (questions, note excerpts, and any required auxiliary fields) while withholding gold outputs. Participants submit predictions for any subset of subtasks they choose to attempt, following the formats defined in System Submission.

Case usage by subtask (official evaluation)

Staged release plan for the test data

To support the sequential schedule of the shared task, test inputs will be released in stages:

Access

The dataset is available on PhysioNet at https://doi.org/10.13026/zzax-sy62. Please sign up for PhysioNet [4] and complete the required training to access the dataset.

Evaluation

Each subtask is evaluated independently using automatic metrics. Teams may participate in any subset of subtasks. There will be separate leaderboards for each of the subtasks.

Subtask 1: Question Interpretation

Subtask 1 is evaluated by comparing the system-generated clinician-interpreted question to the reference clinician-interpreted question. Outputs that violate the length constraint (15 words) will be truncated to the first 15 words before scoring. We report the following automatic text generation metrics:
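The truncation step can be sketched as follows; splitting on whitespace to count words is an assumption here, as the exact tokenization is defined by the official scoring script (see Scoring scripts below).

# Sketch of the length handling described above: predictions longer than the
# 15-word limit are truncated to their first 15 words before scoring.
# Whitespace tokenization is an assumption; the official scoring script
# defines the exact behavior.

def truncate_prediction(text, max_words=15):
    return " ".join(text.split()[:max_words])

prediction = ("Why was an emergency salvage repair with a tube graft performed for "
              "his aortic aneurysm instead of other treatment options?")
print(truncate_prediction(prediction))  # only the first 15 words are scored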

Subtask 2: Evidence Identification

Subtask 2 is evaluated by comparing the set of evidence sentence IDs predicted by the system to the ground truth evidence set. We report Precision, Recall, and F1 over the predicted versus gold evidence sentences.

Two variants are reported:

Subtask 3: Answer Generation

Subtask 3 evaluates the quality of the generated answer text relative to the reference answer. We report the following automatic text generation metrics:

Subtask 4: Answer–Evidence Alignment

Subtask 4 is evaluated by comparing the system’s predicted alignments between answer sentences and note sentences to the ground truth alignments. We report micro-averaged Precision, Recall, and F1 over predicted alignment links.

Scoring scripts

The Subtask 1 scoring script is available on GitHub at https://github.com/soni-sarvesh/archehr-qa-2026/tree/main/evaluation.

System Submission

Please visit the respective Codabench competitions (one per subtask) to submit system outputs. Teams may participate in any subset of subtasks (1–4).

Before submitting, please register your team using our participant registration form: https://forms.gle/JkkE2g3rvyXMsTp99. We will use this information to help match team registrations to Codabench submissions and for participation-related shared task communications.

Links: Subtask 1 (Question Interpretation): https://www.codabench.org/competitions/12865/. Codabench links for the remaining subtasks will be released soon.

The required submission formats for each subtask are as follows.

Submission package

All subtasks use results-only submissions (no code execution). You will upload a single .zip archive containing your predictions.

Your zip file must be structured as:

submission.zip
└── submission.json

submission.json must be a valid JSON file containing a single array of objects, with one object per case in the evaluation split for that subtask:

[
  {
    "case_id": "1",
    "prediction": "..."
  }
]

Each object must include:

Because each Codabench competition is subtask-specific, the validator will expect the prediction field to match the format for that subtask.
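For reference, the package can be assembled with a few lines of Python. This is only a sketch of the layout described above (a submission.zip containing a single top-level submission.json), not an official packaging tool, and the prediction values shown are placeholders.

# Sketch: write predictions to submission.json and package them as submission.zip
# in the layout shown above. The prediction values are placeholders; use the
# format required by the subtask you are submitting to.
import json
import zipfile

predictions = [
    {"case_id": "1", "prediction": "Why was ERCP recommended over medication-based treatment for CBD sludge?"},
    # ... one object per case in the evaluation split
]

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("submission.json")  # stored at the top level of the archive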

Required formats by subtask

Subtask 1: Question Interpretation

For Subtask 1, prediction must be a string containing the clinician-interpreted question (≤ 15 words).

Example submission.json

[
  {
    "case_id": "1",
    "prediction": "Why was ERCP recommended over medication-based treatment for CBD sludge?"
  }
]

Subtask 2: Evidence Identification

For Subtask 2, prediction must be an array of strings, where each string is a sentence ID from the provided clinical note excerpt (e.g., "1", "5").

Example submission.json

[
  {
    "case_id": "1",
    "prediction": ["1", "5", "6", "7"]
  }
]

Subtask 3: Answer Generation

For Subtask 3, prediction must be a string containing the generated answer (≤ 75 words).

Example submission.json

[
  {
    "case_id": "1",
    "prediction": "ERCP was used to relieve a bile duct obstruction caused by stones and sludge by placing a common bile duct stent. Because liver tests and bilirubin continued to worsen, the patient required a repeat ERCP, which found the stent obstructed by sludge and stones. Once the INR normalized, a sphincterotomy was performed and stones were removed, improving drainage."
  }
]

Subtask 4: Answer–Evidence Alignment

For Subtask 4, prediction must be an array of objects representing sentence-level alignments between the input answer sentences and the note excerpt sentences.

Example submission.json

[
  {
    "case_id": "1",
    "prediction": [
      {
        "answer_id": "1",
        "evidence_id": ["1"]
      },
      {
        "answer_id": "2",
        "evidence_id": ["5", "6"]
      }
    ]
  }
]

Each element in prediction must contain:

Paper Submission

All shared task participants are invited to submit a paper describing their systems to the Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) at LREC 2026. Shared task papers will go through a faster review process.

Paper length (based on participation)

Format

Unless otherwise specified below, please follow the official LREC 2026 paper guidelines. For details, refer to the LREC Author’s Kit.

Submissions

All submissions will go through the Softconf/START Conference Manager at https://softconf.com/lrec2026/CL4Health/. Regardless of whether they submit a paper, all participants must email a short one-paragraph summary (about 75 words) of their best system for each subtask they participated in to sarvesh.soni@nih.gov, so it can be included in the shared task overview paper.

References

We ask participants to ensure that their paper includes proper citations to both the shared task overview paper and the dataset paper.

Overview (tentative):

@inproceedings{soni-etal-2026-archehr-qa,
  title = "Overview of the ArchEHR-QA 2026 Shared Task on Grounded Question Answering from Electronic Health Records",
  author = "Soni, Sarvesh and Demner-Fushman, Dina",
  booktitle = "Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health)",
  year = "2026",
  address = "Palma, Mallorca (Spain)",
  publisher = "ELRA",
}

Dataset:

@article{soni-demner-fushman-2025-dataset,
  title = {A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization},
  author = {Soni, Sarvesh and Demner-Fushman, Dina},
  year = {2025},
  journal = {arXiv preprint arXiv:2506.04156},
  doi = {10.48550/arXiv.2506.04156},
  publisher = {arXiv},
}

Organizers

Sarvesh Soni

NLM, NIH

Dina Demner-Fushman

NLM, NIH

Program Committee

We are looking for people to join the program committee, where the responsibilities will include reviewing papers. If you are interested, please send an email to sarvesh.soni@nih.gov.

References

  1. Lin, J., Quan, D., Sinha, V., Bakshi, K., Huynh, D., Katz, B., & Karger, D. R. (2003, September). What Makes a Good Answer? The Role of Context in Question Answering. In INTERACT.

  2. Jeon, J., Croft, W. B., Lee, J. H., & Park, S. (2006, August). A framework to predict the quality of answers with non-textual features. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 228-235). https://doi.org/10.1145/1148170.1148212 

  3. Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):160035. https://doi.org/10.1038/sdata.2016.35 

  4. PhysioNet. https://physionet.org/ Accessed Dec 26, 2024 

  5. Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics. https://aclanthology.org/W04-1013/

  6. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations. https://openreview.net/forum?id=SkeHuCVFDr

  7. Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.634

  8. Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. 2023. Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Scientific Data, 10(1):586. https://doi.org/10.1038/s41597-023-02487-3

  9. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318). https://doi.org/10.3115/1073083.1073135 

  10. Xu, W., Napoles, C., Pavlick, E., Chen, Q., & Callison-Burch, C. (2016). Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4, 401-415. https://aclanthology.org/Q16-1029/