Building Large-Scale EN-RO Translation Resources
The TF2 paper is now on arXiv, and both datasets are available on HuggingFace. This post summarizes what we built and why it matters for Romanian NLP.
The preprint is available at arXiv:2509.07829. The datasets are klusai/ds-tf2-en-ro-3m (the full 3M parallel corpus) and klusai/ds-tf2-en-ro-15k (a curated 15K subset).
What TF2 Contains
The TF2 project produced three main artifacts:
TF2-12B: A LoRA-fine-tuned translation model based on a 12B-parameter open-weight model, specialized for English-to-Romanian literary translation.
DS-TF2-EN-RO-3M: A parallel corpus of three million English-Romanian fable pairs. Each pair includes the English source (from TF1), the Romanian translation, the model that produced the translation, and five-dimensional quality scores.
DS-TF2-EN-RO-15K: A curated subset of 15,000 high-quality translation pairs, selected based on evaluation scores across all five dimensions. This subset is intended for fine-tuning and benchmarking.
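Concretely, each row in the corpus carries the source text, the translation, provenance, and the per-dimension scores. Below is a minimal sketch of what one record might look like; the field names are my guesses for illustration, not the published schema, and the commented `load_dataset` line assumes the HuggingFace `datasets` package:

```python
# ds = load_dataset("klusai/ds-tf2-en-ro-3m")  # requires the `datasets` package and network

# Hypothetical record layout for DS-TF2-EN-RO-3M; the actual column
# names on HuggingFace may differ -- treat this as an illustration only.
record = {
    "en": "The fox praised the crow's voice to steal its cheese.",
    "ro": "Vulpea a lăudat vocea ciorii ca să-i fure brânza.",
    "model": "tf2-12b",          # which model produced the translation
    "scores": {                  # five-dimensional quality rubric from the paper
        "accuracy": 4.5,
        "fluency": 4.8,
        "coherence": 4.6,
        "style": 4.0,
        "cultural_pragmatic": 4.2,
    },
}

# The five rubric dimensions named in the paper:
DIMENSIONS = ["accuracy", "fluency", "coherence", "style", "cultural_pragmatic"]
```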
Why Two Datasets
The full 3M corpus is useful for pre-training and large-scale experiments, but not every translation in it is high quality. The 15K curated subset applies quality thresholds on all five evaluation dimensions (accuracy, fluency, coherence, style, cultural/pragmatic adaptation) to provide a clean, high-quality resource.
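The curation rule is, in effect, a conjunction of per-dimension thresholds. Here is a sketch of that filter, assuming scores on a 1-5 scale and a uniform cutoff of 4.0; the paper's actual thresholds may differ per dimension:

```python
DIMENSIONS = ("accuracy", "fluency", "coherence", "style", "cultural_pragmatic")

def passes_quality_bar(scores: dict, threshold: float = 4.0) -> bool:
    """Keep a translation pair only if every rubric dimension clears the threshold."""
    return all(scores.get(dim, 0.0) >= threshold for dim in DIMENSIONS)

# Toy example: one pair clears the bar on all five dimensions, one fails on style.
good = {"accuracy": 4.5, "fluency": 4.7, "coherence": 4.4, "style": 4.1, "cultural_pragmatic": 4.3}
weak = {"accuracy": 4.6, "fluency": 4.5, "coherence": 4.2, "style": 3.2, "cultural_pragmatic": 4.0}
```

Requiring all five dimensions to pass, rather than a single averaged score, prevents a fluent but culturally tone-deaf translation from slipping into the curated subset.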
This two-tier approach is deliberate. Researchers working on data filtering or quality estimation benefit from having the full range of quality levels. Researchers who just need good parallel data can use the 15K subset directly.
Evaluation Highlights
The paper includes a systematic comparison of zero-shot, few-shot, and LoRA-fine-tuned models across the five-dimensional rubric. Key findings:
- Fine-tuned models consistently outperform zero-shot models on accuracy and fluency
- Style scores show the most variation across models and the least improvement from fine-tuning
- The cultural/pragmatic dimension is where Romanian-language expertise matters most: models often produce technically correct translations that sound unnatural to native speakers
Budget-Aware Translation
One practical contribution of TF2 is a cost analysis of large-scale translation with open models. Running translation inference on consumer hardware (Apple Silicon) is feasible but slow. The paper includes timing benchmarks and cost estimates for different hardware configurations, which I hope will be useful for other researchers planning similar projects on a budget.
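The kind of estimate involved is simple arithmetic over corpus size and decode throughput. A back-of-the-envelope sketch follows; the throughput and token counts are placeholders I picked for illustration, not the paper's measured benchmarks:

```python
def estimate_hours(n_pairs: int, tokens_per_pair: int, tokens_per_second: float) -> float:
    """Rough wall-clock estimate for a translation run at a fixed decode rate."""
    return n_pairs * tokens_per_pair / tokens_per_second / 3600

# Placeholder numbers: 400 output tokens per fable and 30 tok/s on consumer
# hardware are assumptions, not measured TF2 figures.
hours = estimate_hours(n_pairs=3_000_000, tokens_per_pair=400, tokens_per_second=30.0)
```

Even rough numbers like these make the case for the paper's hardware comparison: at single-machine throughput, a 3M-pair run is measured in months, so throughput differences dominate the budget.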
What Comes Next
TF2 gives us a large parallel corpus. The natural next question is: can we use this corpus to train a compact Romanian language model from scratch? That is the premise of TF3, which I have already started working on.