LLM Stories Repeat More Than Human Narratives

When 10 representative LLMs were compared with human-written stories, one pattern stood out: LLM narratives were more similar to one another. Using r/WritingPrompts prompts and human-authored stories, the study examines how repetitive stories are, rather than overall writing quality alone. That distinction matters because a model can write plausible sentences and still repeat the same narrative mold.

TL;DR

This article covers a study of 10 LLMs and asks whether their stories become too similar.
It matters because repeated plots can lower variety in creative tools, marketing, games, and education.
Readers should test multiple samples per prompt and compare similarity with both automated and human review.

Example: A team uses a story generator for a creative product. The writing looks polished, but many outputs share the same plot shape, motives, and ending mood.

Current State

The starting point of this study is simple: natural-sounding text does not necessarily mean varied stories. According to the paper’s abstract and summary, researchers examined narrative similarity with an r/WritingPrompts-based dataset and a contrastive learning framework. They also combined human evaluation with three automated annotation methods across 10 LLMs.

The researchers described a “consistent trend”: LLM-generated narratives were more similar to one another than human-written stories. This tendency was reported across model families, scales, and post-training methods. However, public search results do not fully verify exact model rankings or all effect sizes.

The findings also note limits in mitigation. Available summaries say negative prompting and temperature scaling did not meaningfully reduce narrative homogeneity. Separate studies describe a trade-off in sampling-based decoding. Temperature tuning and tail truncation can change quality and diversity. In other words, settings can matter. Still, the view that a few setting changes will solve repetition does not appear well supported.
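To make the two decoding knobs mentioned above concrete, here is a minimal sketch of temperature scaling and tail truncation on a toy distribution. The logits, function names, and the choice of nucleus (top-p) truncation as the tail-truncation method are illustrative assumptions, not details taken from the paper. Entropy is used only as a rough proxy for token-level spread.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_truncate(probs, top_p=0.9):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    truncated = [0.0] * len(probs)
    for i in kept:
        truncated[i] = probs[i] / mass
    return truncated

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits, for illustration only.
toy_logits = [4.0, 2.0, 1.0, 0.5, 0.1]

for t in (0.5, 1.0, 1.5):
    full = softmax_with_temperature(toy_logits, t)
    cut = top_p_truncate(full, top_p=0.9)
    print(f"T={t}: entropy={entropy(full):.3f}, after top-p={entropy(cut):.3f}")
```

Both settings reshape the choice of the next token. Neither acts on plot structure directly, which is one plausible reading of why the reported effect on narrative homogeneity was limited.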

This pattern is not limited to one paper. Other studies also report narrower imaginative range in AI storytelling than in human storytelling. That said, public search results did not confirm correlation coefficients between similarity scores and human judgments of creativity.

Analysis

This study matters because it changes the evaluation axis. Many LLM evaluations focus on accuracy, fluency, and safety. In some products, novelty is part of the value. That applies to stories, advertising copy, character dialogue, and branded content. In those cases, similarity can become a direct cost. Outputs can look polished and still lose value when plots repeat. For teams, that can affect product differentiation.

From a decision-making view, the trade-off is fairly clear. Some products value consistency. Factual summaries and policy documents are examples. In those cases, some repetition may help. Creative assistance and entertainment are different. There, similar stories may look more like a defect. This is where the decoding trade-off appears. More aggressive decoding may raise novelty. It may also lower quality or increase risk. Separate studies frame sampling and parameter choice as balancing “diversity and risk.”

The limitations also matter. First, publicly available information does not support a detailed model-by-model comparison, even though 10 models were studied. Second, human judgments of creativity combine surprise, novelty, completeness, and coherence, which makes reduction to a single score difficult. Third, an r/WritingPrompts-based benchmark is useful, but limited. It does not cover every genre or product setting. Enterprise copy, game quests, and educational case narratives may show different patterns.

Practical Application

The practical question is not only, “Does this model write well?” Another question is, “How often does it retell the same story?” Teams should keep a similarity evaluation sheet beside a quality evaluation sheet. If only one answer is generated per prompt, the issue is hard to spot. Teams should generate multiple outputs from the same prompt. Then they can review overlap in plot structure, character relationships, and ending patterns.

For an interactive story app, teams can compare samples from the same prompt. They can check whether the reversal method repeats. They can review whether the protagonist’s motivation stays similar. They can also inspect whether the ending’s emotional arc repeats. A marketing team should check whether the core narrative frame is duplicated. Sentence style alone is not enough. An educational content team should review whether case narratives keep converging on the same moral structure.
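A first-pass version of this cross-sample check can be sketched in a few lines. The example below scores a batch of outputs by mean pairwise Jaccard overlap of word bigrams. This is a crude surface-level stand-in, not the contrastive narrative-similarity method described in the study, and the sample stories and function names are made up for illustration. Lexical overlap will miss plots that repeat under different wording, so it supplements rather than replaces the structural review described above.

```python
import re
from itertools import combinations

def word_ngrams(text, n=2):
    """Return the set of lowercase word n-grams in a text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_pairwise_similarity(samples, n=2):
    """Average n-gram Jaccard overlap over all pairs of samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    grams = [word_ngrams(s, n) for s in samples]
    scores = [jaccard(x, y) for x, y in combinations(grams, 2)]
    return sum(scores) / len(scores)

# Hypothetical outputs for one prompt (invented for this sketch).
repetitive = [
    "A lonely lighthouse keeper finds a hidden door and learns to trust again.",
    "A lonely lighthouse keeper finds a hidden letter and learns to trust again.",
    "A lonely clockmaker finds a hidden door and learns to trust again.",
]
varied = [
    "A lonely lighthouse keeper finds a hidden door and learns to trust again.",
    "Two rival chefs accidentally swap recipes before the royal banquet.",
    "The last train out of the city never reaches its final station.",
]

print("repetitive batch:", round(mean_pairwise_similarity(repetitive), 3))
print("varied batch:", round(mean_pairwise_similarity(varied), 3))
```

In practice, a team might log this score per prompt and flag batches that exceed a threshold chosen from their own data for closer human review.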

Checklist for Today:

Save multiple outputs for each prompt and include cross-sample similarity in the review criteria.
If decoding settings change, measure narrative duplication separately from perceived quality.
Add human evaluation alongside automated scoring to compare similarity with perceived creativity.
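The last checklist item can be made measurable on a team's own data. The sketch below computes a Spearman rank correlation between automated similarity scores and human "sameness" ratings for the same set of prompts. The numbers are placeholders rather than results from any study, and the function names are hypothetical. Spearman is one reasonable choice here because human ratings are usually ordinal; it is an assumption of this sketch, not something the paper is confirmed to have used.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder data: one automated score and one human rating per prompt.
automated_similarity = [0.82, 0.41, 0.67, 0.30, 0.55]
human_sameness = [5, 2, 4, 2, 3]  # e.g. a 1-5 "these feel the same" rating

print("Spearman rho:", round(spearman(automated_similarity, human_sameness), 3))
```

A low or unstable correlation on real data would suggest the automated score is not capturing what reviewers perceive as sameness, which is a reason to keep both kinds of evaluation in the loop.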

FAQ

Q. Did this study conclude that LLMs are not creative?
It is hard to state that categorically. The reported finding is narrower. LLM-generated narratives showed a consistent tendency toward greater similarity than human-authored narratives. The study is better read as highlighting repetition as an evaluation axis.

Q. Can good prompting solve this problem?
Public search results suggest the issue is not that simple. Available summaries say negative prompting and temperature scaling did not meaningfully reduce narrative homogeneity. Prompt revision can help in some cases. It does not look sufficient on its own.

Q. Is it acceptable to make product decisions using only automated similarity scores?
That approach seems incomplete. The study used human evaluation together with automated methods. In creative products, users may perceive sameness even when automated scores are low. The reverse can also happen. Both forms of evaluation should be considered together.

Conclusion

The paper’s message is fairly direct. LLM evaluation is not only about “how well does it write.” It is also about “how differently does it write stories.” Teams building creative products should consider a narrative repetition benchmark alongside a quality benchmark.

Aionda