Research

How we measure AI contract redlining at Crosby

Sharan

We built an internal evals platform to measure MapQuest, our AI contract redlining system, against real contracts from real deals. It keeps every engineer, product manager, and designer hill-climbing on the same numbers.

Legal AI has a measurement problem. A redline is not a classification label or a numeric answer; it is a set of tracked changes and comments inside a Microsoft Word document, negotiated between two parties with conflicting incentives, interpreted by lawyers with decades of judgment. There is no single “correct” output.

The same counterparty paragraph can be redlined three different ways by three different attorneys at the same firm, and all three can be “right.” At Crosby, we use MapQuest to accelerate contract redlining for clients who depend on it to close real deals, so we need to rigorously measure whether the system is improving. To teach AI to take the first pass on contracts, we first have to define what makes a redline “right” or “wrong.”

To do this, we built rubrics with our lawyers, tuned LLM-as-a-judge scorers to match their labels, and built a three-part internal platform (datasets, experiments, and logs) that turned those scorers into an experimentation flywheel. We’ve made exciting progress in a few months, but we are still working through hard problems: how often to refresh benchmarks, how to keep judges calibrated as the production data distribution shifts, and what it takes to run any of this in a domain where the ground truth is a lawyer.

Why measuring contract redlining is hard

MapQuest accelerates contract negotiations for our clients. A counterparty sends a draft (an MSA, an NDA, a DPA, etc.), and MapQuest reads it, applies redlines and inline comments against the client’s playbook, and hands it to a barred attorney at Crosby to review and send back. Most of our customers see MapQuest as a senior associate that does the first pass in minutes instead of hours.

AI improves quickly; so does MapQuest. New models ship every few weeks, agentic frameworks are rewritten every few months, and design principles we took for granted in the winter are obsolete by the spring. Several pods of engineers at Crosby work on different layers of the system in parallel: context engineering, OOXML editing, agent infrastructure, etc. Any one of them can cause the system to regress in a way that is invisible to the team shipping the change.

Without a measurement layer, “is this better?” becomes a vibes-based question. Two reasonable engineers could look at five redlined paragraphs, disagree, and ship whichever prompt they prefer.

The problem is that typical evaluation techniques do not transfer cleanly. Exact-match scoring is useless because two correct redlines can differ by a word. String similarity scores reward paraphrases of the wrong clause. Public benchmarks do not exist for this task, and even if they did, they would not reflect the actual distribution of our clients’ deals. We had to build the rubric, the data, and the harness ourselves.
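To make the failure modes concrete, here is a minimal sketch using Python's standard-library difflib. The clauses are hypothetical examples written for illustration, not text from a Crosby deal, and character-level similarity stands in for whatever string metric one might reach for first.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Strict equality after trimming surrounding whitespace."""
    return a.strip() == b.strip()


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical clauses, invented for this example.
attorney = "Liability is capped at the fees paid in the twelve months preceding the claim."
# Same legal substance, different surface form.
equivalent = "Liability is capped at the fees paid in the 12 months preceding the claim."
# Opposite legal substance, nearly identical surface form.
wrong = "Liability is not capped at the fees paid in the twelve months preceding the claim."

print("exact match on equivalent clause:", exact_match(attorney, equivalent))
print("similarity to equivalent clause: ", round(similarity(attorney, equivalent), 3))
print("similarity to wrong clause:      ", round(similarity(attorney, wrong), 3))
```

Exact match rejects the equivalent clause, and the similarity score ranks the clause that reverses the cap above the one that preserves it. Neither metric tracks legal substance.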

Working hand-in-hand with lawyers to build the rubric

Crosby is a hybrid law firm. Engineers sit next to lawyers, and the lawyers use MapQuest every day to close deals for the clients they advise. This matters more than it sounds. Most AI-for-legal companies have to hire outside counsel to label data or evaluate outputs. We walk across the office.

The first problem was writing down what “good” looks like. A redline has two surfaces we care about: the tracked change (the edit applied to the document) and the comment (the inline explanation, if any). Each can be correct, incorrect, missing, or superfluous. We iterated on a rubric with our attorneys that scores redlines and comments along the same axes they reason about — did MapQuest catch the right issue, did it edit the right span, is the legal substance correct, does the tone match what the lawyer would have sent, is the rationale in the comment defensible, etc.
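One way to picture the two surfaces and their four outcomes is as a small data structure. This is a sketch under assumptions: the names Verdict, classify, and ParagraphLabel are hypothetical, and the decision rule is one plausible reading of "correct, incorrect, missing, or superfluous", not the rubric our attorneys actually use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    MISSING = "missing"
    SUPERFLUOUS = "superfluous"


def classify(attorney_made_one: bool, system_made_one: bool,
             substance_matches: bool) -> Optional[Verdict]:
    """Map one surface (tracked change or comment) to a verdict.

    Assumed rule: compare what the attorney did with what the system did.
    Returns None when neither side touched the paragraph.
    """
    if attorney_made_one and not system_made_one:
        return Verdict.MISSING
    if system_made_one and not attorney_made_one:
        return Verdict.SUPERFLUOUS
    if attorney_made_one and system_made_one:
        return Verdict.CORRECT if substance_matches else Verdict.INCORRECT
    return None


@dataclass(frozen=True)
class ParagraphLabel:
    """Hypothetical label for one paragraph: one verdict per surface."""
    paragraph_id: str
    tracked_change: Optional[Verdict]
    comment: Optional[Verdict]


# The system made the right edit but added a comment the attorney did not.
label = ParagraphLabel(
    paragraph_id="p-12",
    tracked_change=classify(True, True, True),
    comment=classify(False, True, False),
)
print(label)
```

Scoring the two surfaces separately means a good edit with an unnecessary comment is not collapsed into a single pass or fail.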

Once the rubric was created, we labeled redlines and comments from past deals against it. These were not synthetic contracts. They were real MSAs, NDAs, DPAs, and bespoke agreements from real negotiations Crosby closed for our clients, redlined by the same attorneys who handled the live deal. The dataset spans the full range of variables that matter — deal size, contract type, template origin (client paper versus counterparty paper), turn number, business line, transaction side. That breadth is the reason we trust the numbers to generalize.

With a labeled set in hand, we built LLM-as-a-judge scorers to approximate the lawyers’ labels at scale. For each scorer, we measured agreement between the judge and the attorney labels on a held-out set. We iterated on the judge prompt until agreement was high enough that a change detected by the judge meaningfully predicted a change the lawyer would flag, and only then did we trust the judge’s measure of the system’s performance. Our two main production scorers, comment_similarity and redline_similarity, are the result of roughly three rounds of this loop.
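The agreement check on a held-out set can be sketched as follows. The choice of statistics here is an assumption: the loop described above does not depend on a particular one, and raw agreement plus Cohen's kappa (which corrects for agreement expected by chance) are simply common choices. The labels are made up.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], attorney: Sequence[str]) -> float:
    """Fraction of held-out items where judge and attorney give the same label."""
    if len(judge) != len(attorney) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == a for j, a in zip(judge, attorney)) / len(judge)


def cohens_kappa(judge: Sequence[str], attorney: Sequence[str]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement_rate(judge, attorney)
    judge_counts, attorney_counts = Counter(judge), Counter(attorney)
    labels = set(judge_counts) | set(attorney_counts)
    # Agreement expected if both raters labeled independently at their own base rates.
    p_e = sum(judge_counts[l] * attorney_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Made-up held-out labels for illustration.
judge_labels = ["pass", "pass", "fail", "fail"]
attorney_labels = ["pass", "pass", "fail", "pass"]

print("agreement:", agreement_rate(judge_labels, attorney_labels))
print("kappa:    ", cohens_kappa(judge_labels, attorney_labels))
```

Tracking a chance-corrected number alongside raw agreement guards against a judge that looks accurate only because most labels fall in one class.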

Two things make this process work at Crosby. First, the dogfooding loop is genuinely fast: an engineer can notice a case where the judge is wrong, walk over to the lawyer on the deal, get the correct label, and push a new judge prompt the same afternoon. Second, because our lawyers see the system they are evaluating as their own tool, labeling doesn’t feel like a chore performed for the ML team; it is how they improve the product they use every day.

The evals platform combines live client data and well-structured experiments to improve redlines

Once the rubric and scorers were stable, we wrapped them in a platform designed to be used by the entire engineering, product, and design org — not just the MapQuest team. The platform has three surfaces, and everyone who ships anything that touches MapQuest uses all three.

Precisely filterable datasets of real-world contracts

The datasets page is where the corpus of labeled, real-world contracts lives. Every deal Crosby has closed contributes documents, which are indexed with rich metadata — client business, counterparty, contract type (MSA, NDA, DPA, and so on), template origin, transaction side, turn number, redlining lawyer, deal date, and more.

Fig 1. Judge score over time for Redline Similarity and Comment Similarity (y-axis: 0% to 100%; x-axis: Mar 19 to Apr 19).

Anyone on the team can create a dataset by slicing the corpus along any of these axes. If a pod is working on improving MapQuest’s handling of counterparty-paper DPAs for a specific business line, they can build a benchmark of exactly those contracts in a few clicks. If another pod wants to stress-test later-turn negotiations (turn 3 and up, where the easy issues have already been flagged and the remaining ones are harder), they can slice for that. Today we have several datasets drawn from a pool of 9k+ labeled contract documents, each tagged along a dozen dimensions.
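Slicing the corpus along metadata axes amounts to composing filters. The sketch below is hypothetical throughout: ContractDoc, build_dataset, and the field values are invented for illustration and cover only a few of the dimensions mentioned above; the real platform is a UI, not this code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ContractDoc:
    """Hypothetical record holding a subset of the metadata dimensions."""
    doc_id: str
    contract_type: str      # e.g. "MSA", "NDA", "DPA"
    template_origin: str    # "client" or "counterparty" paper
    business_line: str
    turn_number: int


Predicate = Callable[[ContractDoc], bool]


def build_dataset(corpus: Iterable[ContractDoc], *predicates: Predicate) -> List[ContractDoc]:
    """Keep the documents that satisfy every predicate."""
    return [doc for doc in corpus if all(p(doc) for p in predicates)]


# Made-up corpus for illustration.
corpus = [
    ContractDoc("d1", "DPA", "counterparty", "payments", 1),
    ContractDoc("d2", "DPA", "client", "payments", 2),
    ContractDoc("d3", "MSA", "counterparty", "payments", 4),
    ContractDoc("d4", "DPA", "counterparty", "lending", 3),
]

# Counterparty-paper DPAs for one business line.
dpa_slice = build_dataset(
    corpus,
    lambda d: d.contract_type == "DPA",
    lambda d: d.template_origin == "counterparty",
    lambda d: d.business_line == "payments",
)

# Later-turn negotiations: turn 3 and up.
late_turns = build_dataset(corpus, lambda d: d.turn_number >= 3)

print([d.doc_id for d in dpa_slice])
print([d.doc_id for d in late_turns])
```

Because each slice is just a conjunction of predicates, two pods can build narrow benchmarks from the same labeled pool without copying or relabeling documents.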
