How were the two models tested for newsletter writing?
**Corpus:** 200 archived issues across four niches (personal finance, ops/SaaS, devtools, lifestyle), pulled from Beehiiv and Kit public sample sends. List sizes ranged from 1k to 250k subscribers. Formats spanned breaking-news, explainers, founder letters, and link roundups so no single template biased the result.
**Models:** Claude Opus 4.7 via the Anthropic API and ChatGPT (GPT-5) via the OpenAI API. Both ran with identical role framing, default temperature, and a 120-word operator brief.
**Scoring:** Two human reviewers (one 90k-sub newsletter operator, one Substack/Beehiiv editor) graded outputs on a 5-point rubric per row. Inter-rater agreement (Cohen's kappa) was 0.81. Subject lines were also run through a holdout open-rate predictor trained on the Kit benchmark dataset; voice match was scored against a 30-issue style guide per author.
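
For readers who want to check agreement on their own reviews, here is a minimal sketch of how Cohen's kappa can be computed for two raters. It is not the scoring script used for this comparison: the function name `cohens_kappa` and the sample ratings are illustrative placeholders, and it implements the unweighted form of kappa. The article does not say whether a weighted variant was used, which is a common choice for ordinal rubrics.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each score, the product of the two raters'
    # marginal frequencies, summed over all scores.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[score] * counts_b[score] for score in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical score for everything.
        return 1.0

    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up 5-point rubric scores, for illustration only.
    reviewer_1 = [5, 4, 4, 3, 5, 2, 4, 3]
    reviewer_2 = [5, 4, 3, 3, 5, 2, 4, 4]
    print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, so it is 1 when they always match and 0 when they match no more often than chance would predict.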