Toward Generative Contract Understanding Tools

Authors
  • Mike Gartner, PhD

Introduction

Health insurance claims denials in the U.S. pose serious problems for patients. They cause delays, forgone care, and detrimental health outcomes. Sometimes patients receive services despite denials, incurring bills, debt, and collections lawsuits. For the most vulnerable, there are calamitous effects (Pollitz et al., 2023), (Collins et al., 2023), (Lopes et al., 2022).

Evaluating claims requires precise understanding of a complex web of case-specific laws, contracts, and medical literature. Specialists such as medical coders, pharmacists, doctors, and attorneys perform the work, which is manual, time-consuming, and rate-limited by research.

Advances in natural language understanding (NLU) present an opportunity to support case workers and improve patient outcomes. In particular, generative question answering (QA) models can increase the efficiency with which specialists understand cases. However, there are barriers preventing responsible and effective use of such tools.

Background

Claims denials are consequential and far-reaching. Conservatively, hundreds of millions of claims worth billions of dollars are denied each year (Pollitz, Lo, et al., 2023), (Gartner, 2023). One estimate suggests 30% of Americans carry healthcare debt related to bills they expected insurance to cover (Lopes et al., 2022).

While some denials are warranted, there is a growing body of evidence that suggests inappropriate denials are pervasive, systemic, and inequitably distributed (Levinson, 2018), (Grimm, 2022), (Grimm, 2023), (Armstrong et al., 2023), (Ross & Herman, 2023), (Gartner, 2023). Mechanisms exist to support patients seeking recourse, but they are underutilized and overburdened (Pollitz, Lo, et al., 2023), (Gartner, 2023). Tools that support efficient and accurate contract understanding stand to serve patients and case workers alike.

Barriers to Deployment

Existing language models (e.g., GPT-4 (OpenAI et al., 2023) or Llama 2 (Touvron et al., 2023)) can be deployed as generative support tools. Coupled with inference-time retrieval (Lewis et al., 2020), this approach can be effective (cf. Figure 1).

Figure 1.
Bing Chat's response to a question about an Affordable Care Act compliant NY health plan includes relevant, authoritative supporting documentation backing up its claims. The incorporation of information from retrieved documents, and attribution of individual claims to those documents, are desirable features that allow the output to be systematically audited for quality as part of responsible use by a case worker. Date accessed: 1/9/24. Conversation style set to 'precise'.
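The retrieval-and-attribution pattern behind such responses can be sketched in miniature. The snippet below is a hypothetical illustration, not any production pipeline: the corpus entries, the overlap-based scoring, and the prompt template are all invented for this example. Passages are ranked by lexical overlap with the question and prepended to the prompt with ids, so each claim in the eventual answer can be traced back to a source.

```python
# Minimal sketch of inference-time retrieval: rank a small corpus of
# authoritative documents by lexical overlap with the question, then
# build a prompt that asks the model to cite passage ids. The corpus
# entries and prompt template are illustrative placeholders.

def tokenize(text):
    return set(text.lower().split())

def retrieve(question, corpus, k=2):
    """Return the k documents sharing the most tokens with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda d: len(q & tokenize(d["text"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    cited = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (f"Answer using only the passages below, citing passage ids.\n"
            f"{cited}\nQuestion: {question}\nAnswer:")

corpus = [
    {"id": "plan-doc", "text": "ACA-compliant plans must cover emergency "
                               "services without prior authorization."},
    {"id": "ncd-50-3", "text": "Cochlear implants are covered when specific "
                               "audiometric criteria are met."},
]

question = "Does my ACA plan cover emergency services?"
passages = retrieve(question, corpus)
prompt = build_prompt(question, passages)
# A real pipeline would now send `prompt` to a generative model; the
# cited ids let a case worker audit each claim against its source.
```

A production system would replace the naive overlap scorer with a dense retriever; the attribution structure, which makes the output auditable, is the point of the sketch.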

However, there are at least two barriers to responsibly deploying this approach:

  1. Hallucinations are common (Ji et al., 2022), (Huang et al., 2023).
  2. There is a dearth of domain-specific datasets sufficient for use as knowledge bases.

The first barrier diminishes the utility of generative support tools lacking built-in verification mechanisms. Such tools place an additional burden on case workers engaged in responsible use: while correct claims may streamline research, incorrect claims require extra work to debunk. More concerningly, users who forgo verification jeopardize case outcomes. Figure 2 shows a relevant example.

Figure 2.
Anthropic's Claude responds to a question about the Medicare National Coverage Determination for cochlear implants. The response provides an accurate summary of key points, but omits criteria stated in the NCD necessary to determine coverage in individual cases. As a result, applying the unverified response to the facts of a case has a nontrivial likelihood of resulting in an incorrect conclusion. Date accessed: 1/9/24.
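Omissions of this kind can be surfaced mechanically as part of responsible use. The sketch below is a toy audit under invented assumptions (the criteria list, keyword map, and summary text are placeholders, not the actual NCD): it flags any coverage criterion from the source document that the generated summary never mentions.

```python
# Toy audit: flag coverage criteria from a source document that a
# generated summary fails to mention. The criteria, keyword map, and
# summary are invented placeholders, not actual Medicare NCD text.

def missing_criteria(summary, criteria, keywords):
    """Return criteria none of whose keywords appear in the summary."""
    text = summary.lower()
    return [c for c in criteria
            if not any(k in text for k in keywords[c])]

criteria = ["bilateral hearing loss", "limited benefit from amplification"]
keywords = {
    "bilateral hearing loss": ["bilateral", "both ears"],
    "limited benefit from amplification": ["amplification", "hearing aid"],
}
summary = "Covered for patients with bilateral sensorineural hearing loss."

flagged = missing_criteria(summary, criteria, keywords)
# One criterion is flagged as missing: the summary omits the
# amplification requirement, so an unverified reading of it could
# lead to an incorrect coverage conclusion.
```

In practice the keyword map would come from structured annotation of the policy document rather than hand-written lists, but even a crude check of this form turns silent omissions into visible flags.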

Retrieval-supplemented language models are only as reliable as their indexed corpora. Lacking a domain-specific dataset, existing models often rely inappropriately on non-authoritative documents, or lack the requisite knowledge altogether, as seen in Figure 3.

Figure 3.
Bing Chat's response to a question about coverage rules in Maine's Medicaid program, MaineCare. The chat pipeline's knowledge base apparently either lacks access to relevant authoritative sources (such as the MaineCare Member Handbook) that could address the question, or the model simply fails to retrieve them in this case. The response is too generic to be of any use to a skilled case worker. Date accessed: 1/9/24. Conversation style set to 'precise'.

These deficiencies, coupled with the dire implications of negative case outcomes for patients, make systematic use of existing systems modestly effective at best, and precarious at worst.

Addressing Deficiencies

Addressing hallucinations is a highly active area of research (Tonmoy et al., 2024). Retrieval Augmented Generation (Lewis et al., 2020) produces systematically verifiable outputs, which we hypothesize would sufficiently mitigate hallucinations when paired with the right knowledge base. Models tailored to span-selection QA (Rajpurkar et al., 2018), (Ram et al., 2021) are compelling in that they sidestep the hallucination problem altogether: their outputs are constrained to verbatim spans of source text. Span-selection methods that incorporate language model representations suggest a promising alternative.
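To make the span-selection idea concrete, the sketch below returns only verbatim sentences of a source passage, so it cannot assert content absent from the document. The passage is an invented placeholder, and the naive lexical-overlap scoring stands in for a learned span-selection model.

```python
# Naive span-selection QA: split the source passage into candidate
# sentence spans, score each by word overlap with the question, and
# return the best span verbatim. Because the answer is always a
# literal span of the input, fabricated content is impossible by
# construction. The passage is an invented placeholder, and the
# overlap scoring stands in for a trained model.
import re

def select_span(question, passage):
    q = set(question.lower().split())
    spans = [s.strip() for s in re.split(r"(?<=[.;])\s+", passage)
             if s.strip()]
    return max(spans, key=lambda s: len(q & set(s.lower().split())))

passage = ("Prior authorization is required for imaging. "
           "Appeals must be filed within 180 days of the denial notice.")
answer = select_span("How many days to file an appeal?", passage)
# `answer` is a verbatim sentence of `passage`, never generated text,
# so verifying it reduces to locating it in the source document.
```

A learned model would score spans with contextual representations rather than token overlap, but the guarantee illustrated here is the same: verification reduces to checking that the span exists in the source.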

There is a pressing need for a comprehensive dataset of authoritative documents to enable fine-tuning and knowledge base construction. While much high-quality data is nominally public, a good deal of it cannot be legally scraped, and other key data is never publicly disseminated. High-quality corpora have contributed to the rise of many successful legal NLU applications (Elwany et al., 2019), (Hendrycks et al., 2021), (Wang et al., 2023), (Koreeda & Manning, 2021), and we expect the same will hold in this domain. We are actively curating such a dataset, and hope to release it and use it to train prototype case support tools in the near future.

References

Armstrong, D., Rucker, P., & Miller, M. (2023). UnitedHealthcare Tried to Deny Coverage to a Chronically Ill Patient. He Fought Back, Exposing the Insurer’s Inner Workings. ProPublica. https://www.propublica.org/article/unitedhealth-healthcare-insurance-denial-ulcerative-colitis
Collins, S., Roy, S., & Masitha, R. (2023). Paying for It: How Health Care Costs and Medical Debt Are Making Americans Sicker and Poorer. https://www.commonwealthfund.org/publications/surveys/2023/oct/paying-for-it-costs-debt-americans-sicker-poorer-2023-affordability-survey
Elwany, E., Moore, D., & Oberoi, G. (2019). BERT Goes to Law School: Quantifying the Competitive Advantage of Access to Large Legal Corpora in Contract Understanding. arXiv E-Prints, arXiv:1911.00473. https://doi.org/10.48550/arXiv.1911.00473
Gartner, M. (2023). Claims Denials In U.S. Health Insurance. https://blog.persius.org/investigations/claims_denials
Grimm, C. (2022). Some Medicare Advantage Organization Denials of Prior Authorization Requests Raise Concerns About Beneficiary Access to Medically Necessary Care. HHS OIG. https://oig.hhs.gov/oei/reports/OEI-09-18-00260.pdf
Grimm, C. (2023). Department of Health and Human Services Office of Inspector General High Rates of Prior Authorization Denials by Some Plans and Limited State Oversight Raise Concerns About Access to Care in Medicaid Managed Care. HHS OIG. https://oig.hhs.gov/oei/reports/OEI-09-19-00350.pdf
Hendrycks, D., Burns, C., Chen, A., & Ball, S. (2021). CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review. arXiv E-Prints, arXiv:2103.06268. https://doi.org/10.48550/arXiv.2103.06268
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv E-Prints, arXiv:2311.05232. https://doi.org/10.48550/arXiv.2311.05232
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Dai, W., Madotto, A., & Fung, P. (2022). Survey of Hallucination in Natural Language Generation. arXiv E-Prints, arXiv:2202.03629. https://doi.org/10.48550/arXiv.2202.03629
Koreeda, Y., & Manning, C. D. (2021). ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts. arXiv E-Prints, arXiv:2110.01799. https://doi.org/10.48550/arXiv.2110.01799
Levinson, D. (2018). Medicare Advantage Appeal Outcomes and Audit Findings Raise Concerns About Service and Payment Denials. HHS OIG. https://oig.hhs.gov/oei/reports/oei-09-16-00410.pdf
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv E-Prints, arXiv:2005.11401. https://doi.org/10.48550/arXiv.2005.11401
Lopes, L., Kearney, A., Montero, A., Hamel, L., & Brodie, M. (2022). Health Care Debt In The U.S.: The Broad Consequences Of Medical And Dental Bills. https://www.kff.org/report-section/kff-health-care-debt-survey-main-findings/
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Leoni Aleman, F., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., … Zoph, B. (2023). GPT-4 Technical Report. arXiv E-Prints, arXiv:2303.08774. https://doi.org/10.48550/arXiv.2303.08774
Pollitz, K., Lo, J., & Mengitsu, S. (2023). Claims Denials and Appeals in ACA Marketplace Plans in 2021. https://www.kff.org/private-insurance/issue-brief/claims-denials-and-appeals-in-aca-marketplace-plans/
Pollitz, K., Pestaina, K., Lopes, L., Wallace, R., & Lo, J. (2023). Consumer Survey Highlights Problems with Denied Health Insurance Claims. https://www.kff.org/health-reform/issue-brief/consumer-survey-highlights-problems-with-denied-health-insurance-claims/
Rajpurkar, P., Jia, R., & Liang, P. (2018). Know What You Don’t Know: Unanswerable Questions for SQuAD. arXiv E-Prints, arXiv:1806.03822. https://doi.org/10.48550/arXiv.1806.03822
Ram, O., Kirstain, Y., Berant, J., Globerson, A., & Levy, O. (2021). Few-Shot Question Answering by Pretraining Span Selection. arXiv E-Prints, arXiv:2101.00438. https://doi.org/10.48550/arXiv.2101.00438
Ross, C., & Herman, B. (2023). UnitedHealth pushed employees to follow an algorithm to cut off Medicare patients’ rehab care. STAT. https://www.statnews.com/2023/11/14/unitedhealth-algorithm-medicare-advantage-investigation/
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., … Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv E-Prints, arXiv:2307.09288. https://doi.org/10.48550/arXiv.2307.09288
Tonmoy, S. M. T. I., Zaman, S. M. M., Jain, V., Rani, A., Rawte, V., Chadha, A., & Das, A. (2024). A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv E-Prints, arXiv:2401.01313. https://doi.org/10.48550/arXiv.2401.01313
Wang, S. H., Scardigli, A., Tang, L., Chen, W., Levkin, D., Chen, A., Ball, S., Woodside, T., Zhang, O., & Hendrycks, D. (2023). MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding. arXiv E-Prints, earXiv:2301.00876. https://doi.org/10.48550/arXiv.2301.00876