LoRA Fine-Tuning for Literary Translation
One of the more practical contributions of the TF2 project is demonstrating that lightweight fine-tuning can meaningfully improve literary translation quality with open-weight models. In this post, I want to discuss the approach and some observations from the process.
The Case for Fine-Tuning
General-purpose language models can translate text out of the box, but literary translation has specific requirements that zero-shot prompting does not always meet. Models tend to produce literal translations that are accurate but stylistically flat, or they take creative liberties that sacrifice accuracy for fluency.
Fine-tuning on a curated set of high-quality literary translations can shift the model’s behavior toward the balance we want: faithful to the source meaning while natural and stylistically appropriate in the target language.
The challenge is doing this efficiently. Full fine-tuning of a 7B+ parameter model requires significant compute. LoRA (Low-Rank Adaptation) offers an alternative: freeze the base model weights and train small low-rank adapter matrices that modify the model’s behavior. The adapter typically adds less than 1% of the base model’s parameters.
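To make the "less than 1%" figure concrete, here is a back-of-the-envelope calculation. The hidden size, layer count, and targeted modules are illustrative assumptions for a typical 7B-class model, not measurements from a specific checkpoint:

```python
# Rough LoRA parameter count for a hypothetical 7B-class model.
# Assumed shape: 32 transformer layers, hidden size 4096,
# LoRA rank 16 applied to the four attention projections.
hidden = 4096
layers = 32
rank = 16
targets_per_layer = 4  # q_proj, k_proj, v_proj, o_proj

# Each adapted (hidden x hidden) matrix gets two low-rank factors:
# A with shape (rank x hidden) and B with shape (hidden x rank).
lora_params = layers * targets_per_layer * 2 * rank * hidden
base_params = 7_000_000_000

print(f"adapter params: {lora_params:,}")                    # 16,777,216
print(f"fraction of base: {lora_params / base_params:.2%}")  # 0.24%
```

Even with all four attention projections adapted, the trainable parameters amount to roughly a quarter of a percent of the base model.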
The Training Setup
For TF2, my LoRA fine-tuning setup looks like this:
- Base models: Multiple open-weight models in the 7B-12B range
- Training data: A curated subset of English-Romanian literary translations, including fables, short stories, and folk tales
- LoRA configuration: Rank 16, alpha 32, targeting attention projection layers
- Training: 3 epochs, learning rate 2e-4 with cosine schedule, gradient accumulation for effective batch size 32
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                  # adapter rank
    lora_alpha=32,         # scaling factor (alpha / r = 2)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # base weights stay frozen
```
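The training hyperparameters listed above map onto something like the following, assuming the Hugging Face Trainer. The output directory and the per-device batch size / accumulation split are illustrative assumptions; only their product (effective batch size 32) comes from the setup described here:

```python
from transformers import TrainingArguments

# Sketch of the run configuration described above.
training_args = TrainingArguments(
    output_dir="lora-literary-translation",  # hypothetical path
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,   # assumption
    gradient_accumulation_steps=8,   # 4 x 8 = effective batch size 32
)
```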
The adapter files are small (tens of megabytes) and can be swapped in and out of the base model, which means I can maintain multiple specialized adapters without duplicating the full model weights.
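Swapping adapters looks roughly like this with peft. The adapter names and paths below are hypothetical:

```python
from peft import PeftModel

# Attach a first adapter to the frozen base model.
model = PeftModel.from_pretrained(
    base_model, "adapters/en-ro-fables",  # hypothetical adapter path
    adapter_name="fables",
)

# Load a second adapter alongside it and switch between the two;
# the base weights are never duplicated.
model.load_adapter("adapters/en-ro-folk-tales", adapter_name="folk_tales")  # hypothetical path
model.set_adapter("folk_tales")
```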
Quantization for Deployment
Beyond training, I also experimented with quantization to make the fine-tuned models more practical to deploy. Two approaches:
GGUF quantization via llama.cpp. This converts the model to a format optimized for CPU inference, making it possible to run translation on consumer hardware. The LoRA adapter is merged into the base model before quantization.
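The merge step before conversion is a one-liner with peft; the conversion itself happens on the llama.cpp side afterwards. Paths are hypothetical, and the llama.cpp script names reflect current tooling and may differ by version:

```python
# Merge the LoRA adapter into the base weights, then save a plain
# Hugging Face checkpoint that llama.cpp's converter can read.
# `model` is assumed to be a PeftModel with the adapter loaded.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")     # hypothetical output path
tokenizer.save_pretrained("merged-model")

# Then, from a llama.cpp checkout (shell, not Python):
#   python convert_hf_to_gguf.py merged-model --outfile model-f16.gguf
#   ./llama-quantize model-f16.gguf model-q8_0.gguf Q8_0
```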
W8A8 quantization (8-bit weights, 8-bit activations). This preserves more precision than aggressive 4-bit quantization while still reducing memory requirements substantially. For literary translation, where subtle word choices matter, I found that 8-bit quantization preserves quality better than 4-bit.
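The intuition for why 8-bit holds up better than 4-bit is visible in a toy symmetric quantizer. This is a conceptual sketch of round-trip quantization error, not the actual W8A8 kernel:

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantization: map weights to signed ints, then back."""
    qmax = 2 ** (bits - 1) - 1           # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=10_000)     # toy weight tensor

err8 = np.abs(w - quantize_dequantize(w, 8)).mean()
err4 = np.abs(w - quantize_dequantize(w, 4)).mean()
print(err8 < err4)  # 8-bit round-trip error is much smaller
```

With the same dynamic range, the int4 grid has about sixteen times fewer levels than int8, so each rounded weight lands further from its original value, which is exactly the kind of drift that blurs fine-grained word choices.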
Observations
Some patterns from the fine-tuning experiments:
LoRA training recipes transfer across model sizes. An adapter trained on a 7B model does not directly transfer to a 12B model, but the training recipe does: the same hyperparameters and data produce good results across model sizes, which simplifies experimentation.
Small training sets work. Literary translation fine-tuning benefits from quality over quantity. A few thousand carefully curated translation pairs outperform a larger but noisier dataset. This matters because high-quality literary translations are expensive to produce.
Style is the hardest dimension to improve. Fine-tuning consistently improves accuracy and fluency scores, but style improvement is more variable. Some models respond well to style-focused training examples; others seem to have strong stylistic priors that resist adaptation.
The cost is minimal. A LoRA fine-tuning run on a single GPU takes hours, not days. This makes iteration fast and accessible to researchers without large compute budgets. For my setup, a single run on a Mac Studio with an M-series chip completes overnight.
Next Steps
The TF2 paper will include a systematic comparison of zero-shot, few-shot, and LoRA-fine-tuned models across all five evaluation dimensions. The fine-tuned models are also being used to generate the large-scale parallel corpus, so the fine-tuning work feeds directly into the dataset release.