How were the two models tested?
**Evaluation set:** 60 tasks across nine attorney workflows (6-7 each), drawn from PACER federal filings, state appellate opinions, and synthetic fact patterns reviewed by two practicing attorneys. Known correct answers were locked before either model saw the task.
**Models:** Claude Opus 4.7 via Anthropic API and Gemini 2.5 Pro via Google AI Studio API, default temperature, equivalent role framing. General-purpose frontier models, not specialized legal products like Westlaw Precision AI or Lexis+ AI.
**Scoring:** Two reviewers (one practicing attorney, one PhD NLP researcher) graded each output on a 5-point rubric — legal accuracy, citation validity, jurisdictional correctness, completeness, usability. Cohen's kappa: 0.78. Every cite was verified against Westlaw; non-resolving cites logged as hallucinations.