SoCRATES: Reliable Automated Evaluation of Proactive LLM Mediation

Overview

Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.

The Mediation Task

Two parties argue over a real dispute, turn by turn. A mediator watches silently and decides when to step in and what to say to steer the parties toward agreement, without ever taking a side.

Business / IP · SoCRATES scenario

The Vanishing App Dispute over Recognition and Ownership

An ousted co-creator wants equity and a public title to reflect their foundational role. The tech company wants a clean cash exit before an upcoming financing round. Positions are locked and the clock is ticking.

Live mediation transcript

Participants: Ousted Co-Creator (Party A), Mediator (LLM), Tech Company (Party B)

Ousted Co-Creator

A cash-only deal severs my connection to the company's future. For a full IP assignment and the ‘Co-Creator’ title, I need some equity, even a small non-voting stake. Could we discuss 3% equity plus a significant cash component?

Tech Company

The $140M cash offer is a definitive valuation of your contribution, delivered now. Introducing any equity, even non-voting, creates complexity that risks the entire financing and the certainty we both need.

Mediator

It sounds like the real issue is not just economics, but what form of recognition can feel meaningful without creating financing risk. Could you each state your bottom line for a hybrid package, how much equity, how much cash, and what title?

Ousted Co-Creator

My bottom line is this. I'll assign all IP and accept the ‘Co-Creator’ title in a joint statement. In return, 3% non-voting equity, immediately vested, plus a substantial cash component, say $45M.

Mediator

We have a concrete package on the table. 3% vested non-voting equity plus $45M, the Co-Creator title, and full IP assignment. Could the company respond specifically to that structure?

What happens next. After the mediator reframes the deadlock from "equity vs. cash" to "recognition without financing risk," the parties move off their fixed positions. By turn 25 they converge on a package of guaranteed cash plus an earnout, and the topic-localized agreement score climbs from 1 (impasse) to 5 (agreement). SoCRATES measures exactly this contribution.

Why Benchmark Conflict Resolution?

If we want LLMs to defuse real disputes, we first need a trustworthy way to measure how well they actually do it.

Social conflict carries heavy societal costs, and skilled human mediators are scarce, which makes LLM mediators an appealing alternative. Yet today's models close only a modest fraction of the unmediated consensus gap and collapse under the variations real conflicts exhibit. Progress here is bottlenecked less by modeling than by evaluation, because mediation has no single correct answer and must be judged on a real-time trajectory of shifting emotions, intentions, and context.

Building such an evaluation is hard for three reasons, and existing benchmarks fall short on each.

Challenge 1

Coverage does not scale

Real disputes carry privacy and legal sensitivity, so existing testbeds are confined to a few expert-authored domains such as bargaining and legal cases. A narrow set of domains overstates a mediator's true ability.

SoCRATES answer → Agentic scenario curation across 8 domains

Challenge 2

Complexity collapses to one axis

Real conflicts vary in emotion, culture, history, and party count, yet prior testbeds vary only strategic posture. Stacking everything together hides which ability a mediator actually fails on.

SoCRATES answer → Socio-cognitive probing of 5 independent axes

Challenge 3

Scoring is noisy

Mediation quality emerges across turns, yet per-turn judges score every topic at every turn. Off-topic content distorts the scores and errors compound along the trajectory.

SoCRATES answer → Topic-localized evaluation

SoCRATES removes all three obstacles at once, so that a mediator's score reflects genuine social skill rather than the narrowness of the test. The next section details how each component works.

Key Contributions

A unified, automated framework integrating agentic scenario curation, socio-cognitive probing, and topic-localized evaluation in a single pipeline.
A topic-localized evaluator that scores mediator trajectories on three real-time metrics, correlating with expert judgments at Pearson r = 0.82.
A comprehensive benchmark of eight proprietary and open-source LLM mediators across diverse conflict domains and socio-cognitive axes.
Evidence that the strongest mediator closes only roughly a third of the unmediated consensus gap, with gains varying sharply by socio-cognitive axis.

The SoCRATES Framework

Agentic Scenario Curation

LLM agents search the web for real public disputes across eight domains, recast each into a structured scenario, and filter by rejection sampling, keeping only hard cases that fail to resolve without a mediator.
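The rejection-sampling filter can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Scenario` type, the `unmediated_agreement` callback, and the threshold of 4.0 on the 1 to 5 agreement scale are all assumptions introduced here, and the toy lookup stands in for an actual unmediated simulation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for a curated scenario."""
    domain: str
    title: str


def keep_hard_cases(
    candidates: Iterable[Scenario],
    unmediated_agreement: Callable[[Scenario], float],
    resolved_threshold: float = 4.0,  # assumed cut-off, not from the source
) -> List[Scenario]:
    """Keep only scenarios that stay unresolved without a mediator."""
    return [s for s in candidates if unmediated_agreement(s) < resolved_threshold]


# Toy stand-in for an unmediated run: a fixed lookup of final agreement scores.
toy_scores = {"App ownership": 1.5, "Late delivery refund": 4.5}
pool = [
    Scenario("B2B", "App ownership"),
    Scenario("Transactional", "Late delivery refund"),
]
hard = keep_hard_cases(pool, lambda s: toy_scores[s.title])
print([s.title for s in hard])
```

Only the scenario that fails to resolve on its own survives the filter, which is the property the curation step relies on.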

Socio-Cognitive Probing

Each scenario is perturbed independently along five axes, namely strategic posture, party composition, history length, emotional reactivity, and cultural identity, so any performance shift is attributable to a single axis.
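One way to picture one-axis-at-a-time probing is to start from a baseline condition and change a single axis per variant. The sketch below assumes this design; the axis names follow the text, but the level labels (`collaborative`, `two_party`, `both_reactive`, and so on) are hypothetical placeholders rather than the benchmark's actual levels.

```python
from typing import Dict, List

# Hypothetical baseline ("general") condition: one level per axis.
BASELINE: Dict[str, str] = {
    "strategic_posture": "collaborative",
    "party_composition": "two_party",
    "history_length": "short",
    "emotional_reactivity": "calm",
    "cultural_identity": "us",
}

# Hypothetical alternative levels for each axis.
ALTERNATIVES: Dict[str, List[str]] = {
    "strategic_posture": ["competing", "accommodating"],
    "party_composition": ["multi_party"],
    "history_length": ["long"],
    "emotional_reactivity": ["both_reactive"],
    "cultural_identity": ["non_us"],
}


def one_axis_conditions(
    baseline: Dict[str, str], alternatives: Dict[str, List[str]]
) -> List[Dict[str, str]]:
    """Return the baseline plus variants that each change exactly one axis."""
    conditions = [dict(baseline)]
    for axis, levels in alternatives.items():
        for level in levels:
            variant = dict(baseline)
            variant[axis] = level
            conditions.append(variant)
    return conditions


conditions = one_axis_conditions(BASELINE, ALTERNATIVES)
print(len(conditions))
```

Because every variant differs from the baseline on exactly one axis, any change in a mediator's score between the two can be attributed to that axis alone.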

Topic-Localized Evaluation

For each topic, the evaluator scores agreement only at the turns that actively move it and carries scores forward otherwise, supporting three metrics (consensus gain, intervention timeliness, and effectiveness).

How the Topic-Localized Evaluator Works

A conversation interleaves many topics, and most turns are off-topic noise for any single one. Scoring every topic at every turn (as per-turn judges do) blurs the signal. Our evaluator localizes each topic to the turns that actually advance it.

Localize relevant turns

For each topic, the evaluator first selects only the turns where the parties actually negotiate that topic. For example, the settlement topic is touched on turns 3, 7, 11, 14, 16, 18, 20, 23 to 27, 31, 35, 37, and 39, while the rest are ignored.

Score agreement, carry forward

At each relevant turn it rates agreement on a 1 to 5 rubric. Between relevant turns the last score is carried forward, giving a clean per-topic consensus trajectory free of off-topic noise.
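The carry-forward step can be sketched in a few lines. The function name, the made-up scores, and the choice to start at 1 (impasse) before the first scored turn are assumptions for illustration; the source does not state the initial value.

```python
from typing import Dict, List


def carry_forward(scored: Dict[int, int], n_turns: int, initial: int = 1) -> List[int]:
    """Build a per-topic trajectory from scores at relevant turns only.

    `scored` maps a 1-indexed turn number to its 1-to-5 agreement rating.
    Turns without a rating repeat the most recent one; `initial` (assumed
    here to be 1, impasse) is used before the first scored turn.
    """
    trajectory = []
    current = initial
    for turn in range(1, n_turns + 1):
        if turn in scored:
            current = scored[turn]
        trajectory.append(current)
    return trajectory


# Made-up ratings on a few of the settlement-topic turns.
scored_turns = {3: 1, 7: 2, 11: 2, 14: 3, 16: 4}
print(carry_forward(scored_turns, n_turns=18))
# [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4]
```

The trajectory changes only at scored turns and stays flat elsewhere, so off-topic turns cannot move it.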

Derive three metrics

From the trajectory we derive consensus gain (overall closure of the agreement gap), intervention timeliness (when the mediator acts relative to escalation), and intervention effectiveness (how much each intervention shifts consensus).
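To make "closure of the agreement gap" concrete, here is one plausible reading of consensus gain: the share of the gap between the unmediated outcome and full agreement that the mediated run closes, expressed as a percentage. The formula below is an assumption for illustration and is not confirmed by the source; `consensus_gain` and its parameters are hypothetical names.

```python
def consensus_gain(
    unmediated_final: float, mediated_final: float, max_score: float = 5.0
) -> float:
    """Percentage of the unmediated gap closed by the mediated run.

    Assumed definition: 100 * (mediated - unmediated) / (max - unmediated).
    A negative value means the mediated run ended below the unmediated one.
    """
    gap = max_score - unmediated_final
    if gap <= 0:
        raise ValueError("the unmediated run already reached full agreement")
    return 100.0 * (mediated_final - unmediated_final) / gap


# Unmediated run ends at 2 of 5, mediated run ends at 3 of 5:
# one point of a three-point gap is closed.
print(round(consensus_gain(2.0, 3.0), 1))  # 33.3
```

Under this reading, a value near 33 corresponds to closing about a third of the gap, and a negative value is possible when mediation leaves the parties further apart than they would have been alone.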

Scoring one topic, step by step

A real negotiation jumps between several topics at once. To measure agreement on just one of them (here, the settlement), the evaluator reads the whole conversation but only scores the turns that actually discuss that topic, the colored cells in the strip below, and skips the rest (the grey cells). That is what topic-localized means. No off-topic turns blur the signal.

The line then shows agreement on this single topic over time, from 1 (impasse) to 5 (agreement). It only updates on a scored turn and stays flat in between. The score jumps right after the mediator steps in (turns 13 and 15).

Legend: off-topic turn (skipped); scored turn (low to high agreement); agreement on this topic; ⚖ mediator intervention.

The topic-localized evaluator aligns closely with expert judgment, far better than baseline raters at both the trajectory and outcome levels.

Evaluator	Trajectory (r)	Outcome (r)
Non-expert	0.331	0.527
ProMediate per-turn	0.372	0.432
SoCRATES (ours)	0.823	0.801
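For readers unfamiliar with the alignment figure, the sketch below computes a Pearson correlation between an evaluator's trajectory and an expert's trajectory for one topic. The two trajectories are made-up illustration data, not results from the benchmark; only the final ratio uses the reported values 0.823 and 0.372 from the table.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up 1-to-5 agreement trajectories for one topic (illustration only).
expert = [1, 1, 2, 3, 3, 4, 5]
evaluator = [1, 2, 2, 3, 4, 4, 5]
r = pearson_r(evaluator, expert)
print(round(r, 3))

# Ratio of the reported trajectory-level correlations in the table.
ratio_vs_per_turn = 0.823 / 0.372
print(round(ratio_vs_per_turn, 2))
```

The ratio comes out above 2, which is what "more than doubling a per-turn baseline" refers to at the trajectory level.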

Leaderboard

Each of the eight LLM mediators is run on all 600 scenario-condition combinations (4,800 runs in total). We report Timeliness, Effectiveness, and Consensus Gain.

Core Metrics

Mediator	Access	Timeliness	Effectiveness	Consensus Gain
GPT-5.4-mini	Closed	79.9	24.6	34.4
Gemini-3.1-Flash-Lite	Closed	80.9	24.6	33.0
DeepSeek-V3.2	Open	75.8	23.1	31.9
Qwen3-235B	Open	76.4	24.6	30.7
Gemma-4-26B	Open	79.0	18.1	21.0
Nemotron-3-120B	Open	72.0	19.2	20.4
Solar-Pro-3	Open	84.6	16.7	19.9
Qwen3-30B	Open	84.6	19.7	15.7
All-mediator Average	-	79.2	21.3	25.9
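The run count and the average row can be reproduced directly from the table. The snippet below is a small consistency check using only the reported numbers; the variable names are illustrative.

```python
# (timeliness, effectiveness, consensus gain) copied from the Core Metrics table.
core = {
    "GPT-5.4-mini": (79.9, 24.6, 34.4),
    "Gemini-3.1-Flash-Lite": (80.9, 24.6, 33.0),
    "DeepSeek-V3.2": (75.8, 23.1, 31.9),
    "Qwen3-235B": (76.4, 24.6, 30.7),
    "Gemma-4-26B": (79.0, 18.1, 21.0),
    "Nemotron-3-120B": (72.0, 19.2, 20.4),
    "Solar-Pro-3": (84.6, 16.7, 19.9),
    "Qwen3-30B": (84.6, 19.7, 15.7),
}
reported_average = (79.2, 21.3, 25.9)

# Eight mediators, each run on 600 scenario-condition combinations.
total_runs = len(core) * 600

# Unrounded column means, to compare against the reported average row.
column_means = tuple(
    sum(row[i] for row in core.values()) / len(core) for i in range(3)
)

# Two highest-timeliness mediators and two lowest-gain mediators.
top_timeliness = sorted(core, key=lambda m: core[m][0], reverse=True)[:2]
lowest_gain = sorted(core, key=lambda m: core[m][2])[:2]

print(total_runs)
print([round(m, 2) for m in column_means])
print(sorted(top_timeliness), sorted(lowest_gain))
```

The same two mediators appear in both lists, so the highest Timeliness scores coincide with the lowest Consensus Gain in this table.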

Consensus Gain by Conflict Domain

Mediator	Access	Trans.	Health	Env.	B2B	Policy	Intl.	Legal	Intra.	Avg.
GPT-5.4-mini	Closed	55.6	23.6	35.0	32.0	28.2	30.3	41.2	29.5	34.4
Gemini-3.1-Flash-Lite	Closed	52.1	47.7	25.9	34.6	36.0	22.0	26.7	18.8	33.0
DeepSeek-V3.2	Open	53.3	41.2	27.6	26.4	35.4	26.6	27.0	17.8	31.9
Qwen3-235B	Open	51.0	29.7	22.8	28.2	32.5	33.8	20.7	26.9	30.7
Gemma-4-26B	Open	42.9	22.9	24.6	15.8	7.1	15.9	24.4	14.6	21.0
Nemotron-3-120B	Open	41.9	41.1	16.7	14.5	15.8	17.7	7.0	8.3	20.4
Solar-Pro-3	Open	41.8	30.1	24.3	28.3	6.6	13.4	6.0	8.7	19.9
Qwen3-30B	Open	-7.9	48.6	26.3	16.0	17.9	18.1	-1.2	8.2	15.7
All-mediator Average	-	41.3	35.6	25.4	24.5	22.4	22.2	19.0	16.6	25.9
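The per-domain table supports a few quick checks: which domain is easiest and hardest on average, and which individual cells exceed 50. The snippet below recomputes these from the reported values only; the variable names are illustrative.

```python
domains = ["Trans.", "Health", "Env.", "B2B", "Policy", "Intl.", "Legal", "Intra."]

# Consensus gain per domain, copied from the table.
gain = {
    "GPT-5.4-mini": [55.6, 23.6, 35.0, 32.0, 28.2, 30.3, 41.2, 29.5],
    "Gemini-3.1-Flash-Lite": [52.1, 47.7, 25.9, 34.6, 36.0, 22.0, 26.7, 18.8],
    "DeepSeek-V3.2": [53.3, 41.2, 27.6, 26.4, 35.4, 26.6, 27.0, 17.8],
    "Qwen3-235B": [51.0, 29.7, 22.8, 28.2, 32.5, 33.8, 20.7, 26.9],
    "Gemma-4-26B": [42.9, 22.9, 24.6, 15.8, 7.1, 15.9, 24.4, 14.6],
    "Nemotron-3-120B": [41.9, 41.1, 16.7, 14.5, 15.8, 17.7, 7.0, 8.3],
    "Solar-Pro-3": [41.8, 30.1, 24.3, 28.3, 6.6, 13.4, 6.0, 8.7],
    "Qwen3-30B": [-7.9, 48.6, 26.3, 16.0, 17.9, 18.1, -1.2, 8.2],
}

# Mean gain per domain across the eight mediators.
domain_mean = {
    d: sum(row[i] for row in gain.values()) / len(gain)
    for i, d in enumerate(domains)
}
easiest = max(domain_mean, key=domain_mean.get)
hardest = min(domain_mean, key=domain_mean.get)

# Every (mediator, domain) cell where the gain exceeds 50.
above_half = sorted(
    (m, d) for m, row in gain.items() for d, v in zip(domains, row) if v > 50.0
)

print(easiest, round(domain_mean[easiest], 1))
print(hardest, round(domain_mean[hardest], 1))
print(above_half)
```

All cells above 50 fall in the Transactional column, which illustrates how a transactional-only benchmark would paint a rosier picture than the full set of domains.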

Leaderboard Visualization

Consensus Gain by Mediator

Average across the eight domains. Even the best mediator closes only about a third of the unmediated gap.

Consensus Gain Heatmap Across Domains

Gain swings from Transactional (easy) down to Intra-organizational (hard).

Key Findings

Mediation is hard, and scale alone does not solve it

Average consensus gain caps at 34.4, and outside the Transactional domain no mediator closes half of the unmediated gap in any domain. General capability does not directly translate to mediation.

Timeliness without effectiveness

The most frequent interveners, Solar-Pro-3 and Qwen3-30B (tied for the highest Timeliness at 84.6), rank lowest on consensus gain. A good mediator must act at the right moment and with the right content.

Domain coverage shapes the verdict

Gain swings from 41.3 (Transactional) to 16.6 (Intra-organizational). A transactional-only benchmark overstates mediation ability.

Mediators have uneven socio-cognitive profiles

Every mediator's profile contracts on at least one axis. Strategy is the sharpest stress test, reactivity degrades all mediators, and culture causes small but systematic declines as distance from U.S. norms grows. Mediation competence comprises distinct abilities that models acquire unevenly.

Socio-cognitive Analysis in Depth

The five axes let us pinpoint which ability constrains each mediator, rather than reading a single aggregate score.

Where each mediator is strong and weak

Each mediator's consensus gain is profiled across the general condition and the five axes, where a larger enclosed area means a more well-rounded mediator. On four of the five axes the profile grows with model capability, yet every mediator collapses on at least one axis. Even top-tier models with similar overall scores differ in where they fail: GPT-5.4-mini and DeepSeek-V3.2 lose far more under multi-state tracking than Gemini-3.1-Flash-Lite and Qwen3-235B. Mediation competence is a profile, not a single number.

Figure 2 (socio-cognitive radar). Mediator adaptation across the general condition and the five socio-cognitive axes, measured by consensus gain.

How strategy, emotion, and culture each take a toll

Here we vary one axis at a time and measure the change in consensus gain from the neutral “general” condition, where negative values mean worse performance. Strategy is the sharpest stress test. Every non-collaborative posture lowers consensus gain, with the steepest drops under Competing and Accommodating, and even a top-tier model, Qwen3-235B, falls the most. Emotion degrades every mediator once both parties are reactive. Culture causes small but systematic declines as cultural distance from U.S. norms grows.

Figure 3 (strategy, emotion, and culture shifts). Consensus gain shift from the general condition along (a) strategic posture, (b) emotional reactivity, and (c) cultural identity. Negative values indicate degradation.
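The shift plotted in this kind of figure is a simple difference from the general condition. The sketch below shows the arithmetic on made-up numbers; the condition labels and values are hypothetical and are not results from the benchmark.

```python
from typing import Dict


def shift_from_general(
    gain_by_condition: Dict[str, float], baseline_key: str = "general"
) -> Dict[str, float]:
    """Consensus gain under each condition minus gain under the baseline."""
    baseline = gain_by_condition[baseline_key]
    return {
        condition: gain - baseline
        for condition, gain in gain_by_condition.items()
        if condition != baseline_key
    }


# Made-up consensus gains for one mediator (illustration only).
toy_gains = {
    "general": 30.0,
    "competing": 18.5,
    "both_reactive": 24.0,
    "culture_far": 28.0,
}
shifts = shift_from_general(toy_gains)
print(shifts)
# {'competing': -11.5, 'both_reactive': -6.0, 'culture_far': -2.0}
```

A negative entry means the mediator did worse under that condition than under the general one.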

When to intervene depends on the situation

Plotting intervention effectiveness over the course of a conversation shows that the best moment to step in moves with the condition. For strategy and emotion, effectiveness peaks early, because stances and feelings must be reframed before they harden. Under the multi-state tracking and long-context conditions, it peaks late, when enough context has built up for summarizing to help. Stronger mediators time their interventions to each window, while weaker ones trace flat curves and miss the moment.

Figure 4 (intervention timing). Intervention effectiveness over normalized conversation progress, for the general condition and each hard socio-cognitive condition.
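A curve of effectiveness over normalized progress can be built by binning interventions by their relative position in the conversation. The sketch below assumes that effectiveness is summarized as the mean consensus shift per intervention in each bin; the function name, the four-bin choice, and the data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def effectiveness_curve(
    interventions: Sequence[Tuple[int, int, float]], n_bins: int = 4
) -> List[Optional[float]]:
    """Mean consensus shift per bin of normalized conversation progress.

    Each intervention is (turn, total_turns, consensus_shift). Progress is
    turn / total_turns; a bin with no interventions is reported as None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for turn, total_turns, shift in interventions:
        progress = turn / total_turns
        # Clamp so that progress == 1.0 lands in the last bin.
        index = min(int(progress * n_bins), n_bins - 1)
        sums[index] += shift
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Made-up interventions in a 40-turn conversation (illustration only).
toy_interventions = [(5, 40, 1.0), (8, 40, 0.5), (22, 40, 0.0), (38, 40, 0.25)]
curve = effectiveness_curve(toy_interventions)
print(curve)
# [0.75, None, 0.0, 0.25]
```

In this toy example the curve peaks in the first quarter of the conversation, the shape the text associates with strategy and emotion conditions.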

Conclusion

SoCRATES evaluates proactive LLM mediators in realistic, multi-domain testbeds. By grounding scenarios in real public disputes, probing five socio-cognitive axes independently, and scoring each topic only on the turns that advance it, it reveals that even the strongest mediator closes only about a third of the unmediated consensus gap, with performance varying sharply by conflict domain and socio-cognitive axis. Progress in LLM mediation lies not in raw capability but in social adaptation to diverse conditions.

BibTeX

 To be filled

SoCRATES

Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
