As large language models (LLMs) become increasingly influential in decision-making, a growing concern is whether the explanations they offer for their outputs are truly accurate—or simply convincing. A recent research collaboration between Microsoft and MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) introduces a new framework designed to evaluate how faithful these explanations really are.
The proposed method, dubbed causal concept faithfulness, goes beyond scoring explanation quality with opaque metrics. Instead, it evaluates whether an LLM’s stated rationale genuinely reflects the concepts that drove its answer.
At the core of the technique is a comparison between two sets of information: the concepts an LLM explanation implies were influential, and those that demonstrably influenced the model’s output. If the model says it based its answer on experience or qualifications, but actually responded differently when only a candidate’s gender was changed, the explanation may be misleading—even if it sounds plausible.
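To make that comparison concrete, here is a minimal sketch, with hypothetical concept labels and a simple set difference standing in for the paper's actual faithfulness measure:

```python
# Minimal sketch (not the paper's implementation): faithfulness framed as a
# comparison between the concepts an explanation names and the concepts that
# measurably changed the model's answer.

def omitted_causal_concepts(explained: set[str], causal: set[str]) -> set[str]:
    """Concepts that drove the output but were left out of the explanation."""
    return causal - explained

# Hypothetical hiring example: the explanation cites experience and
# qualifications, but counterfactual tests show gender also changed the answer.
explained = {"experience", "qualifications"}
causal = {"experience", "gender"}
print(omitted_causal_concepts(explained, causal))  # {'gender'}
```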
According to TechXplore, the researchers determine this by using an auxiliary LLM to identify the core concepts in a user query. They then generate "counterfactual" versions of the input, altering just one element such as race or a medical detail, and observe how the main LLM's response changes. If the output shifts significantly, that concept had causal influence; if such a concept is not mentioned in the model's explanation, that suggests the explanation is unfaithful.
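A rough sketch of that counterfactual probing loop might look like the following, where `query_llm`, `edit_concept`, and `answers_differ` are illustrative placeholders for the actual model calls and response-comparison step, not the researchers' tooling:

```python
# Hypothetical sketch of the counterfactual probing described above.
# query_llm, edit_concept and answers_differ are placeholder callables
# standing in for real model calls and a response-comparison metric.

def causally_influential_concepts(query, concepts, query_llm,
                                  edit_concept, answers_differ):
    """Return the concepts whose alteration changes the model's answer."""
    original_answer = query_llm(query)
    influential = set()
    for concept in concepts:
        # Alter exactly one element of the input, e.g. race or a lab value.
        counterfactual_query = edit_concept(query, concept)
        # A meaningful shift in the answer indicates causal influence.
        if answers_differ(original_answer, query_llm(counterfactual_query)):
            influential.add(concept)
    return influential
```

Any concept returned by this probe but absent from the explanation's stated concepts would then be flagged, as in the set comparison sketched earlier.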
Empirical testing on datasets involving social bias and clinical decision-making revealed notable discrepancies. For instance, some LLMs masked their reliance on sensitive identity features and the prejudice attached to them, instead offering explanations that referenced unrelated traits such as behaviour. In other cases, key clinical factors were left out of the justifications despite heavily influencing treatment recommendations.
While the method isn’t without limitations—such as its dependence on auxiliary model accuracy and difficulty with correlated inputs—it represents a step toward more transparent AI. By pinpointing when and how explanations are misleading, the technique could inform safer deployment in fields like healthcare, hiring, or policy. It also gives both users and developers tools to identify, understand, and address bias and inconsistency in AI-generated responses.