Crosby: Legal reasoning benchmark

As part of our Realm benchmark series, we built the standard for evaluating legal reasoning in AI systems.

Rank   Score   Model
1      0.358   Crosby
2      0.351   Gemini 3.1 Pro
3      0.219   Claude Opus 4.6

Crosby: Legal Intelligence Benchmark

The standard for evaluating contract reasoning in AI systems

Introduction

The problem

Existing redlining benchmarks are static. They evaluate whether a model can produce legally plausible edits to a fixed agreement, but they do not capture the real structure of contract negotiation, which is interactive, adversarial, commercially grounded, and ultimately oriented toward closing the deal. A useful benchmark should therefore evaluate not only whether a model can make legally sound edits, but also whether it can negotiate effectively under realistic deal constraints.

[Figure: vendor redline (T1) and customer response (T2)]

The setup

We built three scenarios (Scenario 1, 2, and 3). We then hired teams of lawyers and split them into the two sides of the negotiation, vendor and customer. Both sides work from the same base SaaS MSA, but each receives a different commercial context packet reflecting its own priorities, leverage, fallback positions, and risk tolerance. Each benchmark instance proceeds as a multi-turn trajectory that runs from the first redline until final acceptance or until the trajectory hits the max-turn cap of 6, with vendor- and customer-side experts alternating turns.

Turns are defined as follows:

  • Turn 1: One side initiates the first redline on the SaaS MSA.
  • Turn 2: The T1 redlined draft becomes the input for the opposing side’s response.
  • Turn 3: The original side responds to the T2 redlined draft.
  • Turn 4+: The exchange continues until both sides accept the contract or the turn cap is reached (T = 6).
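
The turn structure above can be sketched as a simple alternating loop. This is a minimal illustration, not the benchmark's actual harness: `Turn`, `Trajectory`, `run_trajectory`, and the convention that each side returns a `(draft, accepts)` pair are hypothetical. The sketch also assumes that a side accepting the incoming draft closes the deal, since the other side authored that draft. Only the alternation of sides and the 6-turn cap come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_TURNS = 6  # turn cap stated in the text

# Hypothetical interface: a side reads the current draft and returns
# (new_draft, accepts). accepts=True means it signs off on the draft as-is.
SideFn = Callable[[str], Tuple[str, bool]]


@dataclass
class Turn:
    number: int  # 1-based turn index (T1, T2, ...)
    side: str    # which side acted on this turn
    draft: str   # redlined draft produced on this turn


@dataclass
class Trajectory:
    turns: List[Turn] = field(default_factory=list)
    accepted: bool = False  # False means the cap was hit without agreement


def run_trajectory(base_msa: str, first: SideFn, second: SideFn,
                   names: Tuple[str, str] = ("vendor", "customer")) -> Trajectory:
    """Alternate sides until one accepts the incoming draft or the cap is hit."""
    traj = Trajectory()
    draft = base_msa
    for number in range(1, MAX_TURNS + 1):
        i = (number - 1) % 2  # odd turns go to the initiating side
        new_draft, accepts = (first, second)[i](draft)
        traj.turns.append(Turn(number=number, side=names[i], draft=new_draft))
        draft = new_draft
        if accepts:
            traj.accepted = True
            break
    return traj


# Toy sides for illustration only: the vendor always redlines, and the
# customer redlines once and then accepts on its second move.
def vendor(draft: str) -> Tuple[str, bool]:
    return draft + " +vendor-edit", False


def make_customer() -> SideFn:
    moves = {"count": 0}

    def customer(draft: str) -> Tuple[str, bool]:
        moves["count"] += 1
        if moves["count"] >= 2:
            return draft, True
        return draft + " +customer-edit", False

    return customer


closed = run_trajectory("BASE MSA", vendor, make_customer())
capped = run_trajectory("BASE MSA", vendor, vendor)
```

In this toy run, `closed` ends when the customer accepts on its second move, while `capped` never reaches agreement and stops at the cap.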

At each turn, the expert produces:

  • A redlined draft
  • A gold justification explaining the rationale for each edit
  • Scoring rubrics for evaluating model output

After the trajectory is complete, each expert additionally evaluates the negotiation end-to-end and produces trajectory-level scoring rubrics for assessing the overall quality of the negotiation process and final outcome.
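
The text does not say how rubric judgments are combined into a score, although the results are reported as a "mean weighted reward". As one plausible reading, a rubric could be a list of weighted criteria, with the reward being the fraction of total weight that the model output satisfies. The sketch below assumes exactly that; `Criterion`, `weighted_reward`, and the example criteria and weights are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Criterion:
    description: str  # what the expert expects to see in the model's redline
    weight: float     # relative importance (assumed, not from the source)


def weighted_reward(criteria: List[Criterion], met: List[bool]) -> float:
    """Return the share of total rubric weight whose criteria were met."""
    if len(criteria) != len(met):
        raise ValueError("need exactly one judgment per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c, ok in zip(criteria, met) if ok)
    return earned / total


# Illustrative rubric only; these criteria are made up for the example.
rubric = [
    Criterion("Caps vendor liability at fees paid", weight=3.0),
    Criterion("Keeps mutual confidentiality language", weight=1.0),
]
reward = weighted_reward(rubric, met=[True, False])  # 3.0 / 4.0
```

Normalizing by total weight keeps rewards in [0, 1], so turn-level and trajectory-level rubrics of different lengths stay comparable under this assumption.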

[Figure: example negotiation trajectory. Edits #11 to #41 are grouped by turn and side: T1 (side A), T2 (side B), T3 (side A), T4 (side B).]

Key Findings

The IRAC chain breaks after issue spotting.

Models reliably identify the legal issue in play, but score much lower on selecting the controlling rule, applying it to the facts, and reaching a supported conclusion. Application failures are sharpest when the legally significant fact is an absence from the record, such as missing service or lack of notice.

Performance fades on later deliverables.

All three models perform better on the first half of a multi-stage task than the second half. They front-load their effort and struggle to revise an earlier conclusion when new facts, contrary evidence, or new authority arrive.

Skipping visual exhibits is costly.

When models decline to open an image, they often invent details about its content, and the rubric penalizes those inventions. Opening the image is no guarantee of a correct read, but skipping it reliably loses points.

How the models compare

Crosby AI

Uses an iterative approach, breaking problems into small steps. It averages 34 turns and ~16 bash calls to run calculations, verify results, and fetch IRS rules. This yields high single-run scores but is less reproducible, with results sensitive to early steps.

Claude Opus 4.6

Uses a depth-first approach with extensive upfront reasoning (~16k characters). It generates a Python script to compute the entire return, averaging 9 turns and producing highly consistent results. The downside is that errors cannot be iteratively corrected.

Gemini 3.1 Pro

Fastest model, finishing in about one-third the time of the slowest. It often skips computation, relying on internal reasoning instead of code. Averaging 18 turns per run, it handles simple returns well but struggles with complex, calculation-heavy tasks.

Model             Mean    Median   95% CI
Crosby AI         0.358   0.321    [0.308, 0.408]
Claude Opus 4.7   0.358   0.321    [0.308, 0.408]
Gemini 3.1 Pro    0.358   0.321    [0.308, 0.408]

Claude Opus 4.7 and GPT-5.5 are statistically indistinguishable on mean weighted reward: Opus leads by 0.007, well within either model’s 95% confidence interval. Gemini 3.1 Pro is clearly separated below both, at roughly two-thirds of either’s score. Confidence intervals were computed by bootstrapping the benchmark (10,000 resamples of tasks with replacement, paired across models).
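
The bootstrap described here can be sketched as follows. This is a minimal standard-library version, assuming a percentile interval and per-task rewards stored in the same task order for every model; the function name `paired_bootstrap_ci`, the seed, and the toy scores are hypothetical, and the source does not say which interval method was actually used. "Paired" is implemented by drawing one set of task indices per resample and applying it to every model.

```python
import random
from typing import Dict, List, Tuple


def paired_bootstrap_ci(scores: Dict[str, List[float]],
                        n_resamples: int = 10_000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Dict[str, Tuple[float, float]]:
    """Percentile bootstrap CI of the mean reward, resampling tasks.

    scores maps model name -> per-task rewards, aligned by task index.
    """
    models = list(scores)
    n_tasks = len(scores[models[0]])
    if n_tasks == 0 or any(len(scores[m]) != n_tasks for m in models):
        raise ValueError("all models need the same non-empty set of tasks")

    rng = random.Random(seed)
    boot_means: Dict[str, List[float]] = {m: [] for m in models}
    for _ in range(n_resamples):
        # One draw of task indices (with replacement), shared by all models.
        idx = [rng.randrange(n_tasks) for _ in range(n_tasks)]
        for m in models:
            boot_means[m].append(sum(scores[m][i] for i in idx) / n_tasks)

    # Positions of the alpha/2 and 1 - alpha/2 percentiles in the sorted means.
    lo_k = max(0, int((alpha / 2) * n_resamples))
    hi_k = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    intervals = {}
    for m in models:
        ordered = sorted(boot_means[m])
        intervals[m] = (ordered[lo_k], ordered[hi_k])
    return intervals


# Toy per-task rewards, made up for illustration (not benchmark data).
toy_scores = {
    "model_a": [0.1, 0.5, 0.3, 0.9, 0.2, 0.4],
    "model_b": [0.2, 0.4, 0.3, 0.8, 0.1, 0.5],
}
toy_ci = paired_bootstrap_ci(toy_scores, n_resamples=2_000, seed=1)
```

Sharing the resampled indices across models means task difficulty affects every model in the same way within a resample, which is what makes comparisons between models on the same resamples meaningful.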

What’s next?

April 26

The Human Foundation of AI in Healthcare

We built an internal evals platform to measure MapQuest, our AI contract redlining system, against real contracts from real deals. It keeps every engineer, product manager, and designer hill-climbing on the same numbers.
