How were the two models tested?
**Evaluation set:** 63 tasks across nine workflows (7 each), drawn from real briefs three teams shipped in Q1 2026 — a B2B SaaS in fintech, a DTC skincare brand, and a developer-tools startup. Topics, audiences, and brand voices were locked before either model saw the prompt.
**Models:** Claude Opus 4.7 via Anthropic API and Gemini 2.5 Pro via Google AI Studio API. Default temperature. Identical role framing, brand-voice guide, and reference examples passed to each.
**Scoring:** Three content marketers graded each output on a 6-point rubric — voice fidelity, structural quality, factual accuracy, originality, usability, and time-to-publish. Inter-rater agreement (Krippendorff's alpha) was 0.74. Anything flagged factual was verified against primary sources.