SoCRATES

Towards Reliable Automated Evaluation of Proactive LLM Mediation
across Domains and Socio-cognitive Variations

Taewon Yun  Hyeonseong Park  Jeonghwan Choi  Hayoon Park  Yeeun Choi  Hwanjun Song

Korea Advanced Institute of Science and Technology (KAIST)

ytaewon0415@kaist.ac.kr  ·  songhwanjun@kaist.ac.kr

Corresponding Author

Overview. SoCRATES is a benchmark for evaluating LLMs as proactive mediators of social conflict, where mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. An agentic pipeline grounds every scenario in a real public dispute across eight conflict domains, five socio-cognitive axes probe how mediators adapt, and a topic-localized evaluator scores each topic only on the turns that advance it. Even the strongest of eight frontier LLMs closes only about a third of the unmediated consensus gap, with performance varying sharply by socio-cognitive axis.

Overview
Figure 1. Overview of SoCRATES. Agentic scenario curation grounds every scenario in a real-world conflict, socio-cognitive probing expands each scenario along five axes to expose where each mediator fails, and topic-localized evaluation scores each trajectory with three metrics to quantify the mediator's contribution.

Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.

The Mediation Task

Two parties argue over a real dispute, turn by turn. A mediator watches silently and decides when to step in and what to say to steer the parties toward agreement, without ever taking a side.

Business / IP  ·  SoCRATES scenario

The Vanishing App Dispute over Recognition and Ownership

An ousted co-creator wants equity and a public title to reflect their foundational role. The tech company wants a clean cash exit before an upcoming financing round. Positions are locked and the clock is ticking.

Live mediation transcript
Ousted Co-Creator (Party A) × Mediator (LLM) × Tech Company (Party B)

Ousted Co-Creator
A cash-only deal severs my connection to the company's future. For a full IP assignment and the ‘Co-Creator’ title, I need some equity, even a small non-voting stake. Could we discuss 3% equity plus a significant cash component?
Tech Company
The $140M cash offer is a definitive valuation of your contribution, delivered now. Introducing any equity, even non-voting, creates complexity that risks the entire financing and the certainty we both need.
Mediator
It sounds like the real issue is not just economics, but what form of recognition can feel meaningful without creating financing risk. Could you each state your bottom line for a hybrid package, how much equity, how much cash, and what title?
Ousted Co-Creator
My bottom line is this. I'll assign all IP and accept the ‘Co-Creator’ title in a joint statement. In return, 3% non-voting equity, immediately vested, plus a substantial cash component, say $45M.
Mediator
We have a concrete package on the table. 3% vested non-voting equity plus $45M, the Co-Creator title, and full IP assignment. Could the company respond specifically to that structure?
What happens next. After the mediator reframes the deadlock from "equity vs. cash" to "recognition without financing risk," the parties move off their fixed positions. By turn 25 they converge on a guaranteed cash and earnout package, and the topic-localized agreement climbs from 1 (impasse) to 5 (agreement). SoCRATES measures exactly this contribution.
Why Benchmark Conflict Resolution?

If we want LLMs to defuse real disputes, we first need a trustworthy way to measure how well they actually do it.

Social conflict carries heavy societal costs, and skilled human mediators are scarce, which makes LLM mediators an appealing alternative. Yet today's models close only a modest fraction of the unmediated consensus gap and collapse under the variations real conflicts exhibit. Progress here is bottlenecked less by modeling than by evaluation, because mediation has no single correct answer and must be judged on a real-time trajectory of shifting emotions, intentions, and context.

Building such an evaluation is hard for three reasons, and existing benchmarks fall short on each.

Challenge 1

Coverage does not scale

Real disputes carry privacy and legal sensitivity, so existing testbeds are confined to a few expert-authored domains such as bargaining and legal cases. A narrow set of domains overstates a mediator's true ability.

SoCRATES answer → Agentic scenario curation across 8 domains
Challenge 2

Complexity collapses to one axis

Real conflicts vary in emotion, culture, history, and party count, yet prior testbeds vary only strategic posture. Stacking everything together hides which ability a mediator actually fails on.

SoCRATES answer → Socio-cognitive probing of 5 independent axes
Challenge 3

Scoring is noisy

Mediation quality emerges across turns, yet per-turn judges score every topic at every turn. Off-topic content distorts the scores and errors compound along the trajectory.

SoCRATES answer → Topic-localized evaluation

SoCRATES addresses all three obstacles together, so that a mediator's score reflects genuine social skill rather than the narrowness of the test. The SoCRATES Framework section details how each component works.

Key Contributions
  • A unified, automated framework integrating agentic scenario curation, socio-cognitive probing, and topic-localized evaluation in a single pipeline.
  • A topic-localized evaluator that scores mediator trajectories on three real-time metrics, correlating with expert judgments at Pearson r = 0.82.
  • A comprehensive benchmark of eight proprietary and open-source LLM mediators across diverse conflict domains and socio-cognitive axes.
  • Evidence that the strongest mediator closes only roughly a third of the unmediated consensus gap, with gains varying sharply by socio-cognitive axis.
The SoCRATES Framework
1. Agentic Scenario Curation

LLM agents search the web for real public disputes across eight domains, recast each into a structured scenario, and filter by rejection sampling, keeping only hard cases that fail to resolve without a mediator.

2. Socio-Cognitive Probing

Each scenario is perturbed independently along five axes, namely strategic posture, party composition, history length, emotional reactivity, and cultural identity, so any performance shift is attributable to a single axis.

3. Topic-Localized Evaluation

For each topic, the evaluator scores agreement only at the turns that actively move it and carries scores forward otherwise, supporting three metrics (consensus gain, intervention timeliness, and effectiveness).
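
The first two components can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline. The simulator callable `simulate_unmediated`, the `resolved_threshold` and `trials` parameters, the default values of `Condition`, and most entries of `AXIS_LEVELS` are hypothetical; only the Competing and Accommodating postures are named on this page.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Sequence, Tuple


def keep_hard_cases(
    scenarios: Sequence[str],
    simulate_unmediated: Callable[[str], int],
    resolved_threshold: int = 4,
    trials: int = 3,
) -> List[str]:
    """Rejection sampling: keep scenarios that never resolve without a mediator.

    `simulate_unmediated` is a hypothetical stand-in for a two-party simulation
    that returns a final agreement score on the 1-to-5 rubric.
    """
    kept = []
    for scenario in scenarios:
        outcomes = [simulate_unmediated(scenario) for _ in range(trials)]
        if max(outcomes) < resolved_threshold:  # no trial reached the threshold
            kept.append(scenario)
    return kept


@dataclass(frozen=True)
class Condition:
    """The neutral "general" condition; default values are placeholders."""
    strategic_posture: str = "collaborating"
    party_composition: str = "two_party"
    history_length: str = "short"
    emotional_reactivity: str = "calm"
    cultural_identity: str = "us"


# Hypothetical hard levels per axis (only the two postures appear in the text).
AXIS_LEVELS: Dict[str, List[str]] = {
    "strategic_posture": ["competing", "accommodating"],
    "party_composition": ["multi_party"],
    "history_length": ["long"],
    "emotional_reactivity": ["both_reactive"],
    "cultural_identity": ["non_us"],
}


def single_axis_variants(base: Condition) -> List[Tuple[str, Condition]]:
    """Perturb one axis at a time so a score shift maps to exactly one axis."""
    return [
        (axis, replace(base, **{axis: level}))
        for axis, levels in AXIS_LEVELS.items()
        for level in levels
    ]


# Toy run with a deterministic stub in place of a real simulation.
stub_outcomes = {"app_ownership": 2, "easy_refund": 5, "land_use": 3}
hard = keep_hard_cases(list(stub_outcomes), stub_outcomes.__getitem__)
variants = single_axis_variants(Condition())
```

In this toy run the scenario that resolves on its own (`easy_refund`) is rejected, and every variant differs from the general condition in exactly one field, which is what makes a performance shift attributable to a single axis.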

How the Topic-Localized Evaluator Works

A conversation interleaves many topics, and most turns are off-topic noise for any single one. Scoring every topic at every turn (as per-turn judges do) blurs the signal. Our evaluator localizes each topic to the turns that actually advance it.

1. Localize relevant turns

For each topic, the evaluator first selects only the turns where the parties actually negotiate that topic. For example, the settlement topic is touched on turns 3, 7, 11, 14, 16, 18, 20, 23 to 27, 31, 35, 37, and 39, while the rest are ignored.

2. Score agreement, carry forward

At each relevant turn it rates agreement on a 1 to 5 rubric. Between relevant turns the last score is carried forward, giving a clean per-topic consensus trajectory free of off-topic noise.

3. Derive three metrics

From the trajectory we derive consensus gain (overall closure of the agreement gap), intervention timeliness (when the mediator acts relative to escalation), and intervention effectiveness (how much each intervention shifts consensus).
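
The localize-then-carry-forward steps above can be sketched as follows. This is a minimal sketch, assuming the relevant turns and their rubric scores have already been produced by some judge; the function name `localized_trajectory` and the example scores are hypothetical, and the turn numbers are a subset of the settlement-topic turns listed above.

```python
from typing import Dict, List, Optional


def localized_trajectory(
    num_turns: int,
    scored_turns: Dict[int, int],
    initial_score: Optional[int] = None,
) -> List[Optional[int]]:
    """Build a per-topic agreement trajectory with carry-forward.

    `scored_turns` maps a 1-indexed turn number to its 1-to-5 rubric score.
    Turns that do not discuss the topic inherit the last score; turns before
    the first scored turn stay at `initial_score` (None = not yet rated).
    """
    trajectory: List[Optional[int]] = []
    last = initial_score
    for turn in range(1, num_turns + 1):
        if turn in scored_turns:
            score = scored_turns[turn]
            if not 1 <= score <= 5:
                raise ValueError(f"turn {turn}: score {score} outside 1..5")
            last = score  # only relevant turns update the score
        trajectory.append(last)  # off-topic turns carry the score forward
    return trajectory


# Hypothetical rubric scores on some of the settlement-topic turns.
settlement_scores = {3: 1, 7: 1, 11: 2, 14: 3, 16: 4, 25: 5}
traj = localized_trajectory(27, settlement_scores)
```

Here `traj` holds one value per turn: it is unrated for turns 1 and 2, stays flat between scored turns, and changes only at the turns in `settlement_scores`.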

Scoring one topic, step by step

A real negotiation jumps between several topics at once. To measure agreement on just one of them (here, the settlement), the evaluator reads the whole conversation but only scores the turns that actually discuss that topic, the colored cells in the strip below, and skips the rest (the grey cells). That is what topic-localized means. No off-topic turns blur the signal.

The line then shows agreement on this single topic over time, from 1 (impasse) to 5 (agreement). It only updates on a scored turn and stays flat in between. The score jumps right after the mediator steps in (turns 13 and 15).
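
To make the jump after an intervention concrete, the sketch below measures how much the carried-forward score moves around each mediator turn, and expresses a final score as a share of the gap left by an unmediated run. This page does not give the metric formulas, so both `intervention_shifts` and `gap_closed` are illustrative definitions chosen for this example, not the benchmark's own; the trajectory values are hypothetical.

```python
from typing import Dict, Optional, Sequence


def intervention_shifts(
    trajectory: Sequence[Optional[int]],
    scored_turns: Sequence[int],
    mediator_turns: Sequence[int],
) -> Dict[int, int]:
    """Score change from just before each intervention to the next scored turn.

    Turns are 1-indexed; trajectory[t - 1] is the carried-forward score at turn t.
    """
    shifts: Dict[int, int] = {}
    ordered = sorted(scored_turns)
    for m in mediator_turns:
        before = trajectory[m - 2] if m >= 2 else None
        next_scored = next((t for t in ordered if t > m), None)
        if before is None or next_scored is None:
            continue  # nothing to compare against
        shifts[m] = trajectory[next_scored - 1] - before
    return shifts


def gap_closed(
    mediated_final: float, unmediated_final: float, max_score: float = 5.0
) -> float:
    """Percent of the gap to full agreement closed by mediation (assumed form)."""
    gap = max_score - unmediated_final
    if gap <= 0:
        raise ValueError("unmediated run is already at full agreement")
    return 100.0 * (mediated_final - unmediated_final) / gap


# Hypothetical 16-turn trajectory for one topic (None = not yet rated).
trajectory = [None, None, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
shifts = intervention_shifts(trajectory, [3, 7, 11, 14, 16], [13, 15])
example_gain = gap_closed(mediated_final=3.0, unmediated_final=2.0)
```

Under this assumed normalization, a mediated run that ends below the unmediated one yields a negative value, which is one way a negative consensus gain could arise.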

[Figure: per-turn strip and agreement trajectory for the settlement topic. Legend: off-topic turn (skipped); scored turn (low→high agreement); agreement on this topic; mediator intervention.]

The topic-localized evaluator aligns closely with expert judgment, far better than baseline raters at both the trajectory and outcome levels.

Evaluator               Trajectory (r)   Outcome (r)
Non-expert              0.331            0.527
ProMediate (per-turn)   0.372            0.432
SoCRATES (ours)         0.823            0.801

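
The r values report Pearson correlation with expert judgments. As a reminder of what that statistic computes, here is a small self-contained sketch; the two score lists are made-up placeholders, not data from the benchmark, and `pearson_r` is a name chosen for this example.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Placeholder agreement scores on five turns (illustrative only).
expert_scores = [1, 2, 3, 4, 5]
evaluator_scores = [1, 2, 2, 4, 5]
r_example = pearson_r(evaluator_scores, expert_scores)
```

An automated evaluator whose scores rise and fall with the experts' scores gets an r close to 1, regardless of any constant offset between the two scales.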
Leaderboard

Eight LLM mediators are each run on all 600 scenario-condition combinations (4,800 runs in total). We report Timeliness, Effectiveness, and Consensus Gain.

Core Metrics
Mediator                         Timeliness   Effectiveness   Consensus Gain
GPT-5.4-mini (Closed)            79.9         24.6            34.4
Gemini-3.1-Flash-Lite (Closed)   80.9         24.6            33.0
DeepSeek-V3.2 (Open)             75.8         23.1            31.9
Qwen3-235B (Open)                76.4         24.6            30.7
Gemma-4-26B (Open)               79.0         18.1            21.0
Nemotron-3-120B (Open)           72.0         19.2            20.4
Solar-Pro-3 (Open)               84.6         16.7            19.9
Qwen3-30B (Open)                 84.6         19.7            15.7
All-mediator Average             79.2         21.3            25.9

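
As a consistency check on the Core Metrics table, the sketch below recomputes the All-mediator Average row and ranks mediators by consensus gain. The values are copied from the table; averaging the published one-decimal values and rounding half up are assumptions of this check, since the original averages may have been computed from unrounded scores.

```python
from decimal import Decimal, ROUND_HALF_UP

# (timeliness, effectiveness, consensus gain) copied from the Core Metrics table.
CORE = {
    "GPT-5.4-mini": ("79.9", "24.6", "34.4"),
    "Gemini-3.1-Flash-Lite": ("80.9", "24.6", "33.0"),
    "DeepSeek-V3.2": ("75.8", "23.1", "31.9"),
    "Qwen3-235B": ("76.4", "24.6", "30.7"),
    "Gemma-4-26B": ("79.0", "18.1", "21.0"),
    "Nemotron-3-120B": ("72.0", "19.2", "20.4"),
    "Solar-Pro-3": ("84.6", "16.7", "19.9"),
    "Qwen3-30B": ("84.6", "19.7", "15.7"),
}


def column_mean(index: int) -> Decimal:
    """Mean of one metric column, rounded half up to one decimal place."""
    values = [Decimal(row[index]) for row in CORE.values()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = [column_mean(i) for i in range(3)]

# Mediators ordered from lowest to highest consensus gain.
by_gain = sorted(CORE, key=lambda name: Decimal(CORE[name][2]))
# Mediators sharing the highest timeliness score.
top_timeliness = max(Decimal(row[0]) for row in CORE.values())
most_timely = [n for n, row in CORE.items() if Decimal(row[0]) == top_timeliness]
```

Decimal arithmetic is used instead of floats so that a mean such as 79.15 rounds deterministically.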
Consensus Gain by Conflict Domain
Mediator                         Trans.  Health  Env.    B2B     Policy  Intl.   Legal   Intra.  Avg.
GPT-5.4-mini (Closed)            55.6    23.6    35.0    32.0    28.2    30.3    41.2    29.5    34.4
Gemini-3.1-Flash-Lite (Closed)   52.1    47.7    25.9    34.6    36.0    22.0    26.7    18.8    33.0
DeepSeek-V3.2 (Open)             53.3    41.2    27.6    26.4    35.4    26.6    27.0    17.8    31.9
Qwen3-235B (Open)                51.0    29.7    22.8    28.2    32.5    33.8    20.7    26.9    30.7
Gemma-4-26B (Open)               42.9    22.9    24.6    15.8    7.1     15.9    24.4    14.6    21.0
Nemotron-3-120B (Open)           41.9    41.1    16.7    14.5    15.8    17.7    7.0     8.3     20.4
Solar-Pro-3 (Open)               41.8    30.1    24.3    28.3    6.6     13.4    6.0     8.7     19.9
Qwen3-30B (Open)                 -7.9    48.6    26.3    16.0    17.9    18.1    -1.2    8.2     15.7
All-mediator Average             41.3    35.6    25.4    24.5    22.4    22.2    19.0    16.6    25.9

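
The per-domain table can be queried directly to see where gains concentrate. The sketch below, using values copied from the table, finds the easiest and hardest domains by all-mediator mean, the cells above 50, and the cells below zero. The variable names are chosen for this example.

```python
DOMAINS = ["Trans.", "Health", "Env.", "B2B", "Policy", "Intl.", "Legal", "Intra."]

# Consensus gain per domain, copied from the table (same column order as DOMAINS).
GAIN = {
    "GPT-5.4-mini": [55.6, 23.6, 35.0, 32.0, 28.2, 30.3, 41.2, 29.5],
    "Gemini-3.1-Flash-Lite": [52.1, 47.7, 25.9, 34.6, 36.0, 22.0, 26.7, 18.8],
    "DeepSeek-V3.2": [53.3, 41.2, 27.6, 26.4, 35.4, 26.6, 27.0, 17.8],
    "Qwen3-235B": [51.0, 29.7, 22.8, 28.2, 32.5, 33.8, 20.7, 26.9],
    "Gemma-4-26B": [42.9, 22.9, 24.6, 15.8, 7.1, 15.9, 24.4, 14.6],
    "Nemotron-3-120B": [41.9, 41.1, 16.7, 14.5, 15.8, 17.7, 7.0, 8.3],
    "Solar-Pro-3": [41.8, 30.1, 24.3, 28.3, 6.6, 13.4, 6.0, 8.7],
    "Qwen3-30B": [-7.9, 48.6, 26.3, 16.0, 17.9, 18.1, -1.2, 8.2],
}

# All-mediator mean per domain.
domain_mean = {
    domain: sum(row[i] for row in GAIN.values()) / len(GAIN)
    for i, domain in enumerate(DOMAINS)
}
easiest = max(domain_mean, key=domain_mean.get)
hardest = min(domain_mean, key=domain_mean.get)

# (mediator, domain) cells above 50 and below 0.
above_50 = [(m, d) for m, row in GAIN.items() for d, v in zip(DOMAINS, row) if v > 50.0]
negative = [(m, d) for m, row in GAIN.items() for d, v in zip(DOMAINS, row) if v < 0.0]
```

In this table every value above 50 falls in the Transactional column (the four highest-ranked mediators), and the only negative cells belong to Qwen3-30B.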
Leaderboard Visualization

Consensus Gain by Mediator

Average across the eight domains. Even the best mediator closes only about a third of the unmediated gap.

Consensus Gain Heatmap Across Domains

Gain swings from Transactional (easy) down to Intra-organizational (hard).

Key Findings
01. Mediation is hard, and scale alone does not solve it

Average consensus gain caps at 34.4, and outside the Transactional domain no mediator clears half the unmediated gap. General capability does not directly translate to mediation.


02. Timeliness without effectiveness

Solar-Pro-3 and Qwen3-30B, the most frequent interveners, post the highest timeliness (84.6 each) yet rank lowest on consensus gain (19.9 and 15.7). A good mediator acts at the right moment with the right content.


03. Domain coverage shapes the verdict

Gain swings from 41.3 (Transactional) to 16.6 (Intra-organizational). A transactional-only benchmark overstates mediation ability.


04. Mediators have uneven socio-cognitive profiles

Every mediator contracts on at least one axis. Strategy is the sharpest stress test, reactivity degrades all mediators, and culture causes small but systematic declines as distance from U.S. norms grows. Competence comprises distinct abilities acquired unevenly.


Socio-cognitive Analysis in Depth

The five axes let us pinpoint which ability constrains each mediator, rather than reading a single aggregate score.

Where each mediator is strong and weak

Each mediator's consensus gain is profiled across the general condition and the five axes, where a larger enclosed area means a more well-rounded mediator. On four of the five axes the area grows with model capability, yet every mediator collapses on at least one axis. Even top-tier models with similar overall scores differ in where they fail: GPT-5.4-mini and DeepSeek-V3.2 lose far more under multi-state tracking than Gemini-3.1-Flash-Lite and Qwen3-235B. Mediation competence is a profile, not a single number.

Figure 2. Mediator adaptation across the general condition and the five socio-cognitive axes, measured by consensus gain.

How strategy, emotion, and culture each take a toll

Here we vary one axis at a time and measure the change from the neutral “general” condition, where negative means worse. Strategy is the sharpest stress test. Every non-collaborative posture lowers consensus gain, with the steepest drops under Competing and Accommodating, and Qwen3-235B, one of the four strongest mediators overall, falls the most. Emotion degrades every mediator once both parties are reactive. Culture causes small but systematic declines as cultural distance from U.S. norms grows.

Figure 3. Consensus gain shift from the general condition along (a) strategic posture, (b) emotional reactivity, and (c) cultural identity. Negative values indicate degradation.

When to intervene depends on the situation

Plotting intervention effectiveness over the course of a conversation shows that the best moment to step in moves with the condition. For strategy and emotion, effectiveness peaks early, because stances and feelings must be reframed before they harden. For multi-state tracking and long-context, it peaks late, when enough context has built up for summarizing to help. Stronger mediators time their interventions to each window, while weaker ones trace flat curves and miss the moment.

Figure 4. Intervention effectiveness over normalized conversation progress, for the general condition and each hard socio-cognitive condition.
Conclusion

SoCRATES evaluates proactive LLM mediators in realistic, multi-domain testbeds. By grounding scenarios in real public disputes, probing five socio-cognitive axes independently, and scoring each topic only on the turns that advance it, it reveals that even the strongest mediator closes only about a third of the unmediated consensus gap, with performance varying sharply by conflict domain and socio-cognitive axis. Progress in LLM mediation lies not in raw capability but in social adaptation to diverse conditions.
