Running LLM Evaluation Locally with Ollama on Apple Silicon
A recurring practical challenge in my research has been running LLM-based evaluation at scale without depending on cloud APIs. In this post, I want to share the setup I have converged on: Ollama running on a Mac Studio for local judge inference with structured JSON output.
Why Local Inference
There are three reasons I moved evaluation to local inference:
Cost. Evaluating three million fables across four dimensions with a proprietary API is expensive. At typical API pricing, a single evaluation pass would cost hundreds of dollars. Local inference has a one-time hardware cost and then runs at the marginal cost of electricity.
Reproducibility. Cloud API models change without notice. A model version that produces certain scores today might produce different scores next month. Local models are versioned and deterministic (at temperature 0), which makes experiments reproducible.
Speed of iteration. When developing evaluation rubrics, I need to test prompts quickly and iterate. Waiting for API rate limits and network latency slows down the development loop. Local inference on a fast machine gives me sub-second response times for models up to 8B parameters.
The Ollama Setup
Ollama is a tool for running open-weight language models locally. It handles model downloading, quantization, and serving through a simple API. My setup:
- Hardware: Mac Studio with M-series chip, 64GB unified memory
- Models: Granite 4.1 30B (primary judge), EXAONE 3.5 32B (primary judge), Granite 3.3 8B (arbiter)
- Serving: Ollama running as a background service, accessible via HTTP API
The unified memory architecture of Apple Silicon is particularly well-suited for this. Models up to 30B parameters fit comfortably in memory, and the memory bandwidth is high enough for reasonable inference speeds.
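Before a long evaluation run I do a quick sanity check that the service is up and the judge models have been pulled. A minimal sketch using Ollama's /api/tags endpoint, which lists locally available models (the model tags below are illustrative, not necessarily the exact tags I run):

import requests

# Ask the local Ollama service which models are currently pulled.
tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
local_models = {m["name"] for m in tags.get("models", [])}

# Judge models the pipeline expects (tags are illustrative).
required = {"granite3.3:8b", "exaone3.5:32b"}
missing = required - local_models
if missing:
    raise RuntimeError(f"missing judge models, pull them with `ollama pull`: {missing}")
print(f"{len(local_models)} models available locally")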
Structured JSON Output
One of the most important features for evaluation is structured output. When a judge model scores a fable, I need the response in a predictable format that I can parse programmatically. Ollama supports JSON schema enforcement through the format parameter of its chat API:
import json
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "granite3.3:8b",
        "messages": [
            {"role": "system", "content": rubric_prompt},
            {"role": "user", "content": fable_text},
        ],
        # Passing a JSON schema as "format" constrains the output to match it.
        "format": {
            "type": "object",
            "properties": {
                "grammar_score": {"type": "integer", "minimum": 1, "maximum": 10},
                "grammar_justification": {"type": "string"},
                "creativity_score": {"type": "integer", "minimum": 1, "maximum": 10},
                "creativity_justification": {"type": "string"},
            },
            "required": ["grammar_score", "grammar_justification",
                         "creativity_score", "creativity_justification"],
        },
        "options": {"temperature": 0},
        # Disable streaming so the response body is a single JSON object.
        "stream": False,
    },
    timeout=120,
)

# The message content is a JSON string conforming to the schema above.
scores = json.loads(response.json()["message"]["content"])
The JSON schema enforcement ensures that every response is valid, parseable JSON with the expected fields. No more regex-based output parsing or retries on malformed responses.
Multi-Judge Panel Design
Using multiple judge models from different families is central to my evaluation approach. The three-model panel I use draws from two families (IBM Granite and LG EXAONE), with the smaller Granite model serving as an arbiter when the two primary judges disagree.
A critical constraint: the judge models must not come from the same family as the models being evaluated. If I am evaluating fables generated by Llama, Qwen, and Mistral, I cannot use judges from those families. This avoids the self-preference bias documented in the LLM-as-judge literature.
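To make the arbiter logic concrete, here is a minimal sketch of how a per-dimension score could flow through the panel. The judge() helper, the model tags, and the two-point disagreement threshold are illustrative stand-ins rather than my exact pipeline; judge() is assumed to wrap the /api/chat call shown earlier and return the parsed score dict for one model.

def panel_score(fable_text, dimension="creativity"):
    # judge() is a hypothetical wrapper around the /api/chat request above.
    key = f"{dimension}_score"
    # Two primary judges from different model families (tags illustrative).
    a = judge("granite-judge:30b", fable_text)[key]
    b = judge("exaone3.5:32b", fable_text)[key]
    if abs(a - b) <= 2:
        # Close enough: average the primary judges.
        return (a + b) / 2
    # Disagreement: the smaller arbiter model breaks the tie by deciding
    # which primary score it sits closer to.
    c = judge("granite3.3:8b", fable_text)[key]
    return a if abs(a - c) <= abs(b - c) else b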
Practical Tips
A few things I learned through trial and error:
Temperature 0 is essential for reproducibility. Even small temperature values introduce sampling variation, so repeated runs of the same prompt no longer produce identical scores. For evaluation, determinism matters.
Model selection requires empirical testing. Not all models follow structured output schemas reliably. I tested roughly a dozen models before settling on the current panel. Some models produced valid JSON but poor-quality evaluations; others had good judgment but unreliable output formatting.
Batch processing needs error handling. Even with local inference, occasional failures happen (out-of-memory on long inputs, model hangs on adversarial text). The evaluation pipeline includes timeouts, retries, and fallback logic.
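A stripped-down version of that wrapping looks roughly like this; score_fable() is a hypothetical helper around the /api/chat request shown earlier, and the retry count and timeout are illustrative defaults rather than my exact settings.

import time
import requests

def score_with_retries(fable_text, retries=3, timeout=120):
    # Retry transient failures; the request-level timeout catches hangs.
    for attempt in range(retries):
        try:
            return score_fable(fable_text, timeout=timeout)  # hypothetical helper
        except (requests.Timeout, requests.ConnectionError) as exc:
            wait = 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    # Fallback: return None so the batch can continue and the fable
    # can be re-queued later.
    return None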
Monitor memory pressure. Running a 30B model saturates the unified memory on a 64GB machine. I avoid running multiple large models simultaneously and schedule evaluation jobs sequentially by model.
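The scheduling itself is simple; the sketch below just makes the one-large-model-at-a-time policy explicit. Sending an empty messages list with keep_alive set to 0 asks Ollama to unload the model before the next one is loaded (the tags and the score_fable() helper are again illustrative).

import requests

models = ["granite-judge:30b", "exaone3.5:32b"]  # tags illustrative
results = {m: [] for m in models}

for model in models:
    # Run every fable through this judge before loading the next model,
    # so only one large model occupies unified memory at a time.
    for fable in fables:  # fables: the texts to evaluate
        results[model].append(score_fable(fable, model=model))  # hypothetical helper
    # Empty messages plus keep_alive 0 asks Ollama to unload the model.
    requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": [], "keep_alive": 0},
        timeout=30,
    )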
This setup has been running reliably for several months now and has processed hundreds of thousands of evaluations. The combination of Ollama, Apple Silicon, and structured JSON output has made local LLM evaluation practical and affordable.