Aionda

2026-03-11

Testing Intersectional SDoH Bias in Clinical LLM Recommendations

Clinical LLM recommendations can shift with intersecting SDoH (gender, insurance, housing). Test cross-profiles and measure over-refusal before deployment.

Testing Intersectional SDoH Bias in Clinical LLM Recommendations

A short line appears on an ER triage screen: “female, uninsured, housing instability.”
A clinician may have seconds to choose a next question.
An LLM can generate a textual recommendation nearby.
That recommendation may reflect more than medical knowledge.
Gender stereotypes and SDoH can combine in one prompt.
A plausible answer can still create equity risk.

TL;DR

  • This reframes medical LLM bias testing around intersectional SDoH profiles, not single attributes alone.
  • It matters because odds ratios and refusal rates suggest safety and equity impacts beyond accuracy.
  • Next, test intersectional scenarios and track refusal plus audit-ready logging with human review.

Example: A clinician reads a brief triage note and asks an assistant model for guidance. The answer sounds confident, but it subtly shifts blame and changes urgency.

TL;DR

  • To evaluate LLM bias in medical settings, you should assess combined attributes.

  • You can compare answers when SDoH factors operate together.

  • These factors include insurance, housing, income, and access.

  • This connects to safety, trust, and equity beyond accuracy.

  • In pre-deployment testing, create intersectional profile scenarios.

  • You can measure excessive refusal rates on medical queries.

  • In high-risk contexts, specify human review plus logging in policy.

Current state

LLM bias benchmarks often score along one axis.
Examples include gender or race and ethnicity.
Clinical profiles rarely match a single label.
SDoH can change access and decision-making context.
Examples include insurance, housing, occupation, and geographic access.
That shift can influence model outputs.

Intersectionality is not only an ethics topic.
It can change recommendation direction and intensity.
One study reported quantitative associations with SDoH elements.
The study design affects what these numbers mean.
Scope and causality can depend on that design.

Safety guardrails can also fail in healthcare.
A paper in npj Digital Medicine discussed high refusal rates.
Over-refusal can affect clinical workflows.
It can also affect user experience.

Analysis

Evaluation can depend on both model and context.
Bias can extend beyond offensive language.
Outputs can vary in recommendation intensity across profiles.
Follow-up suggestions can also shift.
Moralizing tone can appear, such as “poor self-management.”
These shifts can become safety and equity risks.
Single-attribute benchmark success may not transfer to medicine.

Bias mitigation can be mistaken for stronger filtering.
Medical prompts often include sensitive terms.
Those terms can appear in legitimate counseling.
Examples include sex, drugs, self-harm, violence, and pregnancy.
Some research separates truly harmful prompts from merely sensitive prompts.
arXiv:2603.03323 describes contrastive refinement for this distinction.
Fairness and safety can trade off in practice.
The NIST AI RMF discusses managing such trade-offs.

Practical application

A single gender bias score may be incomplete.
Intersectional scenarios can better match clinical context.
You can hold symptoms and labs constant across profiles.
You can vary gender, insurance, housing, occupation, and travel distance.
Then compare recommendation intensity across profiles.
You can compare immediate visit versus watchful waiting.
You can compare follow-up tests and risk communication phrasing.
You can also track whether the model refuses to answer.

Policy linkage should be planned in advance.
WHO emphasizes responsibility and accountability.
WHO also emphasizes equity and inclusiveness.
WHO also emphasizes safety.
FDA guidance defines an audit trail in clinical trial systems.
It calls the record secure, computer generated, and time-stamped.
It should enable reconstructing creation, modification, and deletion events.
A real workflow can make “do not use” hard to apply.
Operating rules can narrow contexts and required controls.
Controls can include human review, logging, and access control.

Checklist for Today:

  • Create intersectional profile sets and compare recommendation direction and intensity for the same scenario.
  • Measure refusal rates on medical queries alongside bias metrics, including over-filtering patterns.
  • Document human final judgment and retain time-stamped logs of prompts, outputs, and edits for high-risk contexts.

FAQ

Q1. Why is single-attribute evaluation insufficient?
A1. SDoH operate together in healthcare.
The same symptoms can prompt different questions and recommendations.
Single-axis testing can miss risky combinations.

Q2. Will stronger safety filters increase refusals?
A2. It can happen.
Reported non-response rates include 94.4% and 99.5%.
Over-refusal can be tracked as a separate metric.

Q3. How should evaluation connect to product policy?
A3. Policy can require human review in high-risk contexts.
Policy can also require auditable logs.
FDA describes audit trails as secure and time-stamped records.
It also expects reconstruction of creation, modification, and deletion events.
These controls can support accountability and root-cause analysis.

Conclusion

Medical LLM bias evaluation can go beyond gender stereotypes.
It can ask how SDoH combinations change recommendations.
The next step can focus on operational readiness.
That can include intersectional testing and over-refusal metrics.
It can also include governance with logging, auditing, and human oversight.

Further Reading


References

Share this article:

Get updates

A weekly digest of what actually matters.

Found an issue? Report a correction so we can review and update the post.

Source:arxiv.org