Methodology

How Storybench scores AI storytelling

The benchmark is designed to be pleasant for listeners and legible to researchers: blind ratings first, transparent aggregation second, no visible arena mechanics.

What The Audience Score Measures

Storybench is Rotten Tomatoes for AI storytelling: ordinary listeners rate individual stories they chose to hear, and the aggregate Audience Score shows which models produce stories people enjoy. It measures audience enjoyment, not an objective craft exam.

Blind Rating Flow

A work's model identity stays hidden from each listener until that listener submits a rating for the work. The app rejects reveal calls for unrated works, and dashboard inputs come from ordinary single-story ratings. There is no visible arena, no versus screen, no side-by-side battle, and no human-or-AI guessing prompt.
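The reveal rule above can be sketched as a small server-side guard. This is a minimal illustration, not the app's actual code: the names RatingStore, submit_rating, reveal_model, and RevealNotAllowed are hypothetical, and the in-memory dictionaries stand in for whatever storage the app really uses.

```python
from dataclasses import dataclass, field


class RevealNotAllowed(Exception):
    """Raised when a listener asks for model identity before rating the work."""


@dataclass
class RatingStore:
    # (listener_id, work_id) -> star rating. Hypothetical in-memory store.
    ratings: dict = field(default_factory=dict)
    # work_id -> model identity, kept server-side until a reveal is allowed.
    authors: dict = field(default_factory=dict)

    def submit_rating(self, listener_id: str, work_id: str, stars: int) -> None:
        # Assumes a 1-5 star scale; the source does not state the range.
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        self.ratings[(listener_id, work_id)] = stars

    def reveal_model(self, listener_id: str, work_id: str) -> str:
        # Identity is released only to a listener who has already rated this work.
        if (listener_id, work_id) not in self.ratings:
            raise RevealNotAllowed(f"{listener_id} has not rated {work_id}")
        return self.authors[work_id]
```

Keying the check on the (listener, work) pair means one listener's rating never unlocks the identity for anyone else.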

Rating Criteria

Optional criteria use concrete listener statements: Engagement, Clarity, Originality (reverse-scored from predictability), Emotional impact, and Ending payoff. Tags and reviews explain why a story worked or lost the listener, but the star rating remains the headline input.
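Reverse-scoring can be illustrated with one line of arithmetic. The sketch below assumes the predictability statement is answered on a 1 to 5 agreement scale, which the policy does not specify; reverse_score is a hypothetical helper name.

```python
def reverse_score(value: int, low: int = 1, high: int = 5) -> int:
    """Flip a response on a low..high scale so that the top answer becomes the bottom one."""
    if not low <= value <= high:
        raise ValueError(f"value must be between {low} and {high}")
    return low + high - value


# A listener who strongly agrees the story was predictable (5)
# contributes the lowest originality score (1), and vice versa.
originality = reverse_score(5)
```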

Exposure Policy

Top shelves reserve exposure for promising low-sample works, and Discover surfaces provide randomized listening opportunities. Every impression is logged with its surface and position, and the randomized exposure is the cleanest slice for debiased leaderboard analysis.
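A rough sketch of how randomized exposure and impression logging could fit together is shown below. The Impression fields beyond surface and position, the "discover" label, and the function names are assumptions made for illustration, not the app's schema.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Impression:
    listener_id: str
    work_id: str
    surface: str      # where the work was shown, e.g. "discover"
    position: int     # zero-based slot on that surface
    randomized: bool  # True when the slot was filled by uniform random choice


def build_discover_shelf(work_ids, listener_id, size, rng, log):
    """Pick works uniformly at random and log one impression per slot."""
    pool = list(work_ids)
    shelf = rng.sample(pool, k=min(size, len(pool)))
    for position, work_id in enumerate(shelf):
        log.append(Impression(listener_id, work_id, "discover", position, True))
    return shelf


def randomized_slice(log):
    """Keep only impressions from randomized slots for debiased analysis."""
    return [imp for imp in log if imp.randomized]
```

Recording the randomized flag at impression time lets later analysis filter to that slice without reconstructing how each shelf was built.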

Statistics

Audience Score uses Bayesian-adjusted star ratings with visible vote counts, confidence intervals, and clustered bootstrap checks. Same-listener same-premise ratings plus capped premise_preferences feed a backend Bradley-Terry consistency check once the comparison graph is connected enough to publish.
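The policy does not state which Bayesian adjustment is used, so the sketch below assumes the common shrinkage form, in which a model's mean star rating is pulled toward a prior mean with a fixed pseudo-count. The values of prior_mean and prior_weight are illustrative assumptions. The bootstrap resamples whole listeners rather than individual ratings, which is what "clustered" is taken to mean here.

```python
import random
from collections import defaultdict


def bayesian_average(stars, prior_mean=3.0, prior_weight=10.0):
    """Shrink the mean rating toward prior_mean; prior_weight acts as pseudo-votes."""
    return (prior_weight * prior_mean + sum(stars)) / (prior_weight + len(stars))


def clustered_bootstrap_ci(ratings, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for the adjusted score.

    ratings: list of (listener_id, stars). Listeners are resampled with
    replacement so that repeated ratings by one person stay together.
    """
    by_listener = defaultdict(list)
    for listener_id, stars in ratings:
        by_listener[listener_id].append(stars)
    clusters = list(by_listener.values())
    if not clusters:
        raise ValueError("at least one rating is required")

    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(clusters, k=len(clusters))
        pooled = [s for cluster in sample for s in cluster]
        estimates.append(bayesian_average(pooled))
    estimates.sort()

    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[max(int((1 - alpha / 2) * n_boot) - 1, 0)]
    return lower, upper
```

With few votes the adjusted score stays close to the prior, which is why the vote count is worth showing next to it.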

Prompt And Season Policy

Public prompt slots are published with each leaderboard season, while a private prompt slice remains held back to reduce overfitting. Seasons rotate prompt slots, date-stamp model versions, and keep historical results comparable instead of silently moving the target.

Narrator Normalization

Voice production happens outside the app, but the policy is documented here: the same narrator is used across all model takes within a genre or slot whenever possible, and loudness/duration checks run during ingestion so the score is about the story rather than audio polish.

Covers, Titles, And Selection Effects

Unique covers and model-written titles influence what listeners pick, as book jackets do. That tradeoff is allowed, but covers follow one house art direction, prompts never identify the authoring model, and identity-leak QC rejects titles, summaries, text, or artwork that reveal the model.

Retention And Drop-Off

Storybench publishes retention beside stars: stream rate, five-minute reach, length-normalized completion, and drop-off curves from maximum listen position. Star ratings alone reflect only the listeners who finished and rated, so a model should not be rewarded on that group alone.
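The four retention figures could be computed along these lines. The exact definitions are not given in the policy, so every formula here is an assumption: stream rate as streams per impression, five-minute reach as the share of streams whose maximum position passed 300 seconds, completion as the mean fraction of the story reached, and the drop-off curve as the share of streams still present at evenly spaced fractions of the story.

```python
def retention_summary(impressions, sessions, grid=10):
    """Summarize retention for one work.

    impressions: number of times the work was shown.
    sessions: list of (max_position_s, duration_s), one per started stream.
    grid: number of equal steps used for the drop-off curve.
    """
    n = len(sessions)
    if impressions <= 0 or n == 0:
        raise ValueError("need at least one impression and one stream")

    # Fraction of the story each stream reached, capped at 1.0.
    fractions = [min(pos / dur, 1.0) for pos, dur in sessions]

    return {
        "stream_rate": n / impressions,
        "five_minute_reach": sum(pos >= 300 for pos, _ in sessions) / n,
        "completion": sum(fractions) / n,
        # Share of streams whose maximum position reached each grid point.
        "drop_off_curve": [
            sum(f >= k / grid for f in fractions) / n for k in range(grid + 1)
        ],
    }
```

Dividing by story length keeps a long story from being penalized simply for having more minutes to drop out of.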

Quality Weighting And Anomaly Screens

Ratings with a very low listen percentage are flagged or down-weighted. The statistics layer also screens for repetitive scoring, implausibly fast rating bursts, source-surface bias, and other anomalies before a leaderboard season is treated as mature.
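One way to picture the down-weighting and two of the screens is sketched below. The thresholds (floor, full_credit, min_ratings, min_gap_s), the linear ramp, and the function names are illustrative assumptions; the policy does not publish its actual cutoffs or weighting curve.

```python
def rating_weight(listen_fraction, floor=0.1, full_credit=0.5):
    """Weight a rating by how much of the story was heard.

    Below `floor` the rating counts for nothing, above `full_credit` it
    counts fully, and in between the weight rises linearly.
    """
    if listen_fraction <= floor:
        return 0.0
    if listen_fraction >= full_credit:
        return 1.0
    return (listen_fraction - floor) / (full_credit - floor)


def flag_listener(events, min_ratings=5, min_gap_s=10.0):
    """Return anomaly flags for one listener.

    events: list of (timestamp_s, stars) in any order.
    """
    flags = set()
    if len(events) < min_ratings:
        return flags  # too little history to judge

    # Repetitive scoring: every rating is the same star value.
    if len({stars for _, stars in events}) == 1:
        flags.add("repetitive_scoring")

    # Fast burst: every consecutive pair of ratings is closer than min_gap_s.
    times = sorted(t for t, _ in events)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if gaps and max(gaps) < min_gap_s:
        flags.add("fast_burst")

    return flags
```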

Benchmark Tracks

Listeners and Audience Score remain primary. AI Critics and Expert Panel tracks are ingested snapshot rows computed outside the listener app; the app never calls an LLM, TTS, or image-generation API to create scores, audio, text, or covers.

Lab-facing integrity policy

Submitted model results are permanent within a leaderboard season. Providers cannot retract weak results after seeing scores.

Model versions are immutable. New variants get new version identifiers and enter the exposure system under the same rules as other models.

Deprecated models remain visible in the season where they were evaluated and are retired forward only. Later model releases enter as new immutable versions rather than replacing earlier scores.
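The permanence and immutability rules above can be pictured as an append-only registry. This is a minimal sketch under assumed names (ModelVersion, SeasonRegistry, enter, deprecate); it is not a description of the real ingestion system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    # frozen=True makes instances immutable: assigning to a field raises an error.
    provider: str
    model: str
    version_id: str
    release_date: str  # date stamp, e.g. "2025-01-15"


class SeasonRegistry:
    """Append-only list of model versions entered in one leaderboard season."""

    def __init__(self, season: str):
        self.season = season
        self._entries = {}
        self._deprecated = set()

    def enter(self, version: ModelVersion) -> None:
        # A version identifier can be used once; a new variant needs a new id.
        if version.version_id in self._entries:
            raise ValueError(f"{version.version_id} is already entered in {self.season}")
        self._entries[version.version_id] = version

    def deprecate(self, version_id: str) -> None:
        # Deprecation only marks the entry; it stays visible in this season.
        if version_id not in self._entries:
            raise KeyError(version_id)
        self._deprecated.add(version_id)

    def visible_entries(self):
        # There is deliberately no remove method, so nothing can be retracted.
        return list(self._entries.values())
```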

Seasons rotate public prompt slots and reserve a private prompt slice so the benchmark remains harder to overfit.

AI Critics and Expert Panel tracks are parallel snapshots. They never authorize in-app LLM calls and never displace the Listener Audience Score as the headline.