Crosby is a hybrid law firm. Engineers sit next to lawyers, and the lawyers use MapQuest every day to close deals for the clients they advise. This matters more than it sounds. Most AI-for-legal companies have to hire outside counsel to label data or evaluate outputs. We walk across the office.
The first problem was writing down what “good” looks like. A redline has two surfaces we care about: the tracked change (the edit applied to the document) and the comment (the inline explanation, if any). Each can be correct, incorrect, missing, or superfluous. We iterated on a rubric with our attorneys that scores redlines and comments along the same axes they reason about — did MapQuest catch the right issue, did it edit the right span, is the legal substance correct, does the tone match what the lawyer would have sent, is the rationale in the comment defensible, etc.
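To make the rubric concrete, here is a minimal sketch of how one label might be represented. It is an illustration, not our actual schema: the class names, the axis identifiers, and the pass/fail scoring of each axis are all assumptions chosen to mirror the two surfaces, the four outcomes, and the questions listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Status(Enum):
    """The four outcomes each surface of a redline can have."""
    CORRECT = "correct"
    INCORRECT = "incorrect"
    MISSING = "missing"
    SUPERFLUOUS = "superfluous"


# Hypothetical axis names paraphrasing the rubric questions in the text.
AXES = ("right_issue", "right_span", "legal_substance", "tone", "comment_rationale")


@dataclass
class RedlineLabel:
    """One attorney label: a status per surface plus per-axis judgments."""
    tracked_change: Status              # the edit applied to the document
    comment: Status                     # the inline explanation, if any
    axes: Dict[str, bool] = field(default_factory=dict)  # assumed pass/fail per axis

    def __post_init__(self):
        # Reject axis names that are not part of the (assumed) rubric.
        unknown = set(self.axes) - set(AXES)
        if unknown:
            raise ValueError(f"unknown rubric axes: {sorted(unknown)}")

    def axis_pass_rate(self) -> float:
        """Fraction of the scored axes that passed (0.0 if none were scored)."""
        if not self.axes:
            return 0.0
        return sum(self.axes.values()) / len(self.axes)


label = RedlineLabel(
    tracked_change=Status.CORRECT,
    comment=Status.MISSING,
    axes={"right_issue": True, "right_span": True, "legal_substance": True, "tone": False},
)
print(label.tracked_change.value, label.comment.value, label.axis_pass_rate())
```

Keeping the tracked change and the comment as separate fields means a redline with a correct edit but a missing comment is recorded as exactly that, rather than collapsed into a single score.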
Once the rubric was created, we labeled redlines and comments from past deals against it. These were not synthetic contracts. They were real MSAs, NDAs, DPAs, and bespoke agreements from real negotiations Crosby closed for our clients, redlined by the same attorneys who handled the live deal. The dataset spans the full range of variables that matter — deal size, contract type, template origin (client paper versus counterparty paper), turn number, business line, transaction side. That breadth is the reason we trust the numbers to generalize.
With a labeled set in hand, we built LLM-as-a-judge scorers to approximate the lawyers’ labels at scale. For each scorer, we measured agreement between the judge and the attorney labels on a held-out set. We iterated on the judge prompt until agreement was high enough that a change detected by the judge meaningfully predicted a change the lawyer would flag, and only then did we trust the judge’s measure of the system’s performance. Our two main production scorers, comment_similarity and redline_similarity, are the result of roughly three rounds of this loop.
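The agreement check can be sketched in a few lines. This is a rough illustration under stated assumptions: the post does not say which agreement statistic we used, so raw agreement and Cohen's kappa (which corrects for agreement expected by chance) stand in here, and the function names and sample labels are made up for the example.

```python
from collections import Counter


def raw_agreement(judge, attorney):
    """Fraction of held-out items where judge and attorney give the same label."""
    if len(judge) != len(attorney) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == a for j, a in zip(judge, attorney)) / len(judge)


def cohens_kappa(judge, attorney):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(judge)
    observed = raw_agreement(judge, attorney)
    judge_counts = Counter(judge)
    attorney_counts = Counter(attorney)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[k] * attorney_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up held-out labels, for illustration only.
judge_labels = ["correct", "correct", "incorrect", "missing", "correct", "incorrect"]
attorney_labels = ["correct", "correct", "incorrect", "correct", "correct", "missing"]

print(round(raw_agreement(judge_labels, attorney_labels), 3))
print(round(cohens_kappa(judge_labels, attorney_labels), 3))
```

A chance-corrected statistic is worth the extra lines when one label dominates: if most redlines are correct, a judge that always answers "correct" scores high on raw agreement while telling you nothing about the cases lawyers would flag.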
Two things make this process work at Crosby. First, the dogfooding loop is genuinely fast: an engineer can notice a case where the judge is wrong, walk over to the lawyer on the deal, get the correct label, and push a new judge prompt the same afternoon. Second, because our lawyers see the system they are evaluating as their own tool, labeling doesn’t feel like a chore performed for the ML team; it is how they improve the product they use every day.