In my last post I introduced TinyFabulist, a pipeline for generating structured synthetic fables. Since then, I have been scaling the system from prototype to production. The target: three million fables generated by multiple open-weight language models, each from a fully specified YAML story template.

The Generation Architecture

The pipeline has three main stages:

1. Story element generation. A combinatorial engine produces YAML specifications by sampling from pools of characters, settings, conflicts, and morals. Each combination is unique, and the structured format means every generated fable has a known specification to evaluate against.

2. Prompt construction. Each YAML spec is transformed into a natural language prompt with explicit constraints. The prompt includes the characters, their roles, the setting, the conflict type, and the intended moral. This is more prescriptive than typical creative writing prompts, which is the point – controlled generation requires controlled inputs. (A sketch of these first two stages follows the list.)

3. Model inference. The prompts are dispatched to multiple open-weight models. I am using models from several families – Llama, Qwen, Mistral, Phi, and others – to study cross-family variation. Each model generates independently from the same prompt, so the resulting dataset contains multiple realizations of every story specification.
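
To make the first two stages concrete, here is a minimal sketch in Python. The element pools, function names, and prompt wording are illustrative stand-ins, not the actual pipeline code:

import random
import yaml  # PyYAML

# Hypothetical element pools – the real pipeline samples from much larger,
# curated lists of characters, settings, conflicts, and morals.
CHARACTERS = ["a patient tortoise", "a boastful hare", "a clever fox"]
SETTINGS = ["a dense mountain forest", "a sunlit meadow"]
CONFLICTS = ["pride versus humility", "greed versus generosity"]
MORALS = ["slow and steady wins the race", "kindness is repaid in kind"]

def sample_spec(rng: random.Random) -> dict:
    # Stage 1: sample one structured story specification.
    return {
        "characters": rng.sample(CHARACTERS, 2),
        "setting": rng.choice(SETTINGS),
        "conflict": rng.choice(CONFLICTS),
        "moral": rng.choice(MORALS),
    }

def build_prompt(spec: dict) -> str:
    # Stage 2: turn the spec into a constrained natural language prompt.
    return (
        "Write a short fable with the following constraints.\n"
        f"Characters: {', '.join(spec['characters'])}\n"
        f"Setting: {spec['setting']}\n"
        f"Conflict: {spec['conflict']}\n"
        f"End with this moral, stated explicitly: {spec['moral']}"
    )

rng = random.Random(42)
spec = sample_spec(rng)
print(yaml.safe_dump(spec, sort_keys=False))  # the YAML view of the spec
print(build_prompt(spec))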

Multi-Model Strategy

Using multiple model families is central to the design. A dataset generated by a single model inherits that model’s biases, stylistic preferences, and failure modes. By generating from diverse families, the aggregate dataset has richer variation and is more useful for downstream training.

In practice, different models handle the structured prompts differently. Some follow moral constraints faithfully but produce formulaic prose. Others write more creatively but occasionally drift from the specified moral. This variation is a feature, not a bug – it gives me a natural axis for studying generation quality.

Infrastructure Challenges

Generating three million fables is computationally non-trivial. Some lessons from scaling up:

Batching matters. Naive sequential generation is far too slow at this scale. I batch requests and use asynchronous processing to keep GPU utilization high. The pipeline runs in Docker containers, which makes deployment consistent across machines.
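
The core pattern is a semaphore-bounded fan-out. This is a simplified sketch – call_model is a stand-in for the real inference client, and the concurrency limit is illustrative:

import asyncio

MAX_CONCURRENT = 32  # illustrative – tune until GPU utilization stays high

async def call_model(prompt: str) -> str:
    # Stand-in for a real async request to the inference server.
    await asyncio.sleep(0.01)
    return "Once upon a time..."

async def generate_all(prompts: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(prompt: str) -> str:
        async with semaphore:
            return await call_model(prompt)

    # Fan out all prompts at once; the semaphore caps in-flight requests.
    return await asyncio.gather(*(bounded(p) for p in prompts))

outputs = asyncio.run(generate_all(["prompt one", "prompt two"]))

Bounding concurrency with a semaphore rather than fixed-size batches keeps requests flowing even when individual generations finish at very different speeds.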

Error handling at scale. At three million items, even a 0.1% failure rate means 3,000 errors. The pipeline includes retry logic, output validation (does the response contain a recognizable fable?), and logging that lets me trace any generated fable back to its YAML spec, model, and generation parameters.
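
Schematically, the retry loop looks like this. Both call_model_sync and the validation thresholds are placeholders; the real validator is stricter:

import logging
import time

logger = logging.getLogger("tinyfabulist")

def call_model_sync(prompt: str) -> str:
    # Placeholder for a real blocking call to the inference server.
    return "Once upon a time... The moral: slow and steady wins the race."

def looks_like_fable(text: str) -> bool:
    # Cheap structural check with illustrative thresholds.
    return len(text.split()) >= 10 and "moral" in text.lower()

def generate_with_retries(prompt: str, story_id: str, max_attempts: int = 3) -> str | None:
    for attempt in range(1, max_attempts + 1):
        try:
            output = call_model_sync(prompt)
            if looks_like_fable(output):
                return output
            logger.warning("validation failed for %s (attempt %d)", story_id, attempt)
        except Exception:
            logger.exception("generation error for %s (attempt %d)", story_id, attempt)
        if attempt < max_attempts:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    return None  # surfaced to the pipeline as a permanent failure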

Storage and indexing. The raw dataset is stored as JSONL, with each record containing the YAML spec, the prompt, the model identifier, generation parameters, and the output text. This makes downstream filtering and analysis straightforward. A representative record looks like this:

{
    "story_id": "tf1-00142857",
    "spec": {"characters": [...], "setting": "...", "moral": "..."},
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "temperature": 0.7,
    "output": "Once upon a time, in a dense mountain forest..."
}
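
One JSON object per line keeps both writing and filtering simple. A minimal sketch of each side, with fables.jsonl as a placeholder path:

import json

def append_record(path: str, record: dict) -> None:
    # Appends are cheap, and a partial run can resume by
    # counting the lines already written.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def iter_records(path: str):
    # Stream records back for filtering without loading everything in memory.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# e.g. keep only the fables generated by one model family
llama_fables = [r for r in iter_records("fables.jsonl") if r["model"].startswith("meta-llama/")]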

Early Quality Observations

Even before formal evaluation, some patterns are visible in the generated data:

  • Moral adherence varies significantly across models. Some models embed the moral naturally in the narrative; others tack it on as a final sentence regardless of the story content.
  • Length distributions differ by model family. Qwen models tend to produce longer, more elaborate fables, while Phi models are more concise (a quick way to check this is sketched after the list).
  • Structural compliance is generally high – most outputs are recognizable fables with the requested elements – but creative reinterpretation of characters or settings happens in roughly 5-10% of outputs.
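
The length differences, for example, can be checked directly against the stored records. A minimal sketch, again with fables.jsonl as a placeholder path:

import json
from collections import defaultdict
from statistics import mean, median

# Word-length distribution per model, computed straight from the JSONL dump.
lengths = defaultdict(list)
with open("fables.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        lengths[record["model"]].append(len(record["output"].split()))

for model, words in sorted(lengths.items()):
    print(f"{model}: n={len(words)}, mean={mean(words):.0f}, median={median(words):.0f} words")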

These observations will be quantified formally through a multi-dimensional evaluation rubric that I am developing in parallel. More on that in a future post.

Next Steps

The generation run is ongoing. Once complete, I plan to release the full dataset on HuggingFace and submit a paper describing the methodology. The evaluation framework is the next major piece to build – assessing three million fables requires automated evaluation that goes beyond surface-level metrics.