Helle v. Bargen, May 2025

Document-Based LLM Translation: Seeing the Bigger Picture | 2025


Document-based LLM translation marks a paradigm shift in machine translation within translation management systems (TMS) such as memoQ or Trados. That is the conclusion of our interview with translation technology expert Jourik Ciesielski. Traditional MT engines process text segment by segment within a TMS, often losing valuable contextual information. To address this, Ciesielski developed an approach that processes the entire document within the TMS workflow, enabling more accurate, coherent, and natural translations.

This article, based on Ciesielski's expertise, explores the technical foundations, specific advantages, and future prospects of this promising technology.

Jourik Ciesielski: Pioneer in LLM-based Translation Technology

As CTO at Yamagata Europe and founder of C-Jay International, Jourik Ciesielski has developed a groundbreaking prototype for document-based LLM translation. His solution integrates advanced AI technologies with the practical needs of the localization industry.

With his unique background as a trained translator and technology expert, Ciesielski understands both the linguistic nuances and technical challenges of machine translation – a rare combination that makes him a leading innovator in this field.


The Technical Architecture of Document-Based LLM Translation

To understand what document-based LLM translation changes, we first need to examine the limitations of machine translation within traditional TMS environments.

The Traditional Segmentation Approach and Its Limitations

In conventional Translation Management Systems (TMS), documents are broken down into individual segments. The goal is to facilitate bilingual processing and to match content efficiently against previous translations stored in translation memories and glossaries. However, this approach limits how MT engines can handle the content. The core problem is that machine translation in a TMS usually happens sentence by sentence. Segmentation strips away the context needed to correctly interpret and translate linguistic elements such as pronouns and references. Without this context, the translation model is forced to "guess" what certain words refer to, which can lead to inconsistencies and errors.

The Document-Based Workflow

Ciesielski’s prototype for document-based LLM translation employs a fundamentally different approach. Instead of viewing text as isolated segments, it treats the entire document as a coherent unit, preserving full context. The workflow includes several key steps, as described by Ciesielski:

XLIFF Processing: The prototype reads XLIFF files (XML Localization Interchange File Format) – the standard format for exchanging translation content in the localization industry.
Smart Content Extraction: The system identifies translatable content and filters out fully matched segments from the translation memory.
Text Reconstruction: The segments to be translated are recombined into a coherent text.
Context-Aware Translation: The reconstructed text is sent as a whole to an LLM, which can leverage the entire context for translation.
Accurate Resegmentation: The translated text is split back into its original segment structure and reintegrated into the XLIFF file.

This process overcomes the limitations of traditional segment-based MT by enabling the LLM to understand and use the full document context, from the first line to the last.
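The five steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prototype's actual code. It assumes an XLIFF 1.2 file, treats `approved="yes"` and `translate="no"` as stand-ins for "already covered by the translation memory", and uses numbered `[[n]]` markers (a hypothetical convention) to map the translated text back to segments. The function names and the `translate_document` callable are also hypothetical. Inline tags are ignored here for brevity.

```python
import re
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}
MARKER = re.compile(r"\[\[(\d+)\]\]")  # assumed segment marker, e.g. [[3]]


def needs_translation(unit):
    """Assumption: approved or non-translatable units are left untouched."""
    return unit.get("translate") != "no" and unit.get("approved") != "yes"


def reconstruct(sources):
    """Join segment texts into one document, one numbered marker per segment."""
    return "\n".join(f"[[{i}]] {text}" for i, text in enumerate(sources, start=1))


def resegment(translated, expected):
    """Split the translated document back into segments using the markers."""
    parts = MARKER.split(translated)  # [prefix, "1", text1, "2", text2, ...]
    found = {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
    if sorted(found) != list(range(1, expected + 1)):
        raise ValueError("marker mismatch: cannot map translation back to segments")
    return [found[i] for i in range(1, expected + 1)]


def translate_xliff(xliff_text, translate_document):
    """Read XLIFF, translate the open segments as one text, write targets back."""
    ET.register_namespace("", XLIFF_NS)  # keep the default namespace on output
    root = ET.fromstring(xliff_text)
    units = [u for u in root.iterfind(".//x:trans-unit", NS) if needs_translation(u)]
    if not units:
        return ET.tostring(root, encoding="unicode")
    # itertext() flattens inline markup; a real tool would protect tags first.
    sources = ["".join(u.find("x:source", NS).itertext()) for u in units]
    translated = translate_document(reconstruct(sources))
    for unit, text in zip(units, resegment(translated, len(units))):
        target = unit.find("x:target", NS)
        if target is None:
            target = ET.SubElement(unit, f"{{{XLIFF_NS}}}target")
        target.text = text
    return ET.tostring(root, encoding="unicode")


SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.txt" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>The company announced its new policy.</source></trans-unit>
      <trans-unit id="2" approved="yes"><source>Thank you.</source><target>Vielen Dank.</target></trans-unit>
      <trans-unit id="3"><source>It will be implemented next month.</source></trans-unit>
    </body>
  </file>
</xliff>"""


def fake_llm(document):
    """Stand-in for a real LLM call: upper-cases the text, leaves markers intact."""
    return document.upper()


if __name__ == "__main__":
    print(translate_xliff(SAMPLE, fake_llm))
```

The marker check in `resegment` matters in practice: if the model drops or invents a marker, the mapping back to segments is no longer reliable, so the sketch raises an error rather than writing misaligned targets.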

"The model UNDERSTANDS the relationship between sentences (whereas with limited context, it only PREDICTS). As a result, it makes correct grammatical and semantic decisions, even with low-context prepositions, references, etc."

Jourik Ciesielski on the advantages of the document-based approach

Detailed Analysis of Specific Improvements

Document-based LLM translation brings significant improvements to the automatic translation of content within a TMS such as Trados or memoQ. Below are three major challenges and how the document-based approach addresses them.

1. Pronoun Translation and Coreferential Relationships

Correctly translating pronouns is one of the biggest challenges for machine translation systems within TMS environments – especially in gendered languages like German, French, or Spanish.

The Pronoun Challenge

In many languages, pronouns must match the grammatical gender of their antecedents. With traditional segment-based MT in a TMS, this information is often lost when the antecedent sits in a different segment.

Example: When translating from English to German, pronouns like "it" can be especially challenging. Consider: "The company announced its new policy. It will be implemented next month." Here, "it" refers to "policy". In German, "policy" becomes "Richtlinie" (feminine), so the pronoun must be "sie". Traditional MT engines in a TMS often fail to recognize this relationship, particularly when the two sentences sit in separate segments. Seeing only the second segment, an engine may translate "it" as "es" instead of "sie", which is grammatically incorrect once "policy" has been translated as "Richtlinie".

Document-Based Solution: Because the LLM processes the text as a whole, it can identify the coreferential relationship between "policy" and "it" and translate the passage as "Die Firma hat ihre neue Richtlinie angekündigt. Sie wird nächsten Monat umgesetzt."

Internal tests with Ciesielski’s prototype showed a significant improvement in pronoun translation. As Ciesielski emphasizes, the document-based approach leads to a substantial reduction in errors related to pronoun references.
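The difference between the two modes comes down to what the engine receives in each request. The sketch below is only an illustration of that point, not the prototype's code: `build_prompts`, its parameters, and the prompt wording are all hypothetical.

```python
def build_prompts(segments, source_lang, target_lang, document_level):
    """Return the list of prompts that would be sent to the translation engine."""
    instruction = f"Translate from {source_lang} to {target_lang}."
    if document_level:
        # One request: every sentence is visible, including pronoun antecedents.
        return [instruction + "\n\n" + "\n".join(segments)]
    # One request per segment: each sentence is translated in isolation.
    return [instruction + "\n\n" + segment for segment in segments]


segments = [
    "The company announced its new policy.",
    "It will be implemented next month.",
]

per_segment = build_prompts(segments, "English", "German", document_level=False)
per_document = build_prompts(segments, "English", "German", document_level=True)
```

In the segment-level case, the second prompt contains "It will be implemented next month." and nothing else, so the noun that "It" points to is simply not part of the input. In the document-level case, a single prompt carries both sentences.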

2. Terminological Consistency

Maintaining consistent use of terminology is essential in technical, medical, and legal texts. Traditional MT systems struggle to consistently translate specialized terms across long documents.

Advanced Terminology Management

Ciesielski’s prototype includes a custom parser that can read glossaries in CSV format and inject the extracted entries directly into the translation prompt.

Implementation details:

The glossary parser converts the glossary into an LLM-friendly format
The glossary entries are passed to the LLM as part of the prompt
Using a “few-shot” approach, the LLM receives not only instructions but also clear examples of how to handle the text

This advanced terminology handling results in significantly improved terminological consistency. According to Ciesielski’s experience, document-based LLM translation achieves a much higher consistency rate than traditional neural MT systems – a major advantage when working with complex subject matter.
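A glossary-to-prompt step along these lines might look as follows. This is a sketch under assumptions, not Ciesielski's parser: the CSV column names (`source`, `target`), the prompt wording, and the function names are hypothetical. Limiting the injected entries to terms that actually occur in the document is an added design choice to keep the prompt short; the source does not say whether the prototype does this.

```python
import csv
import io


def load_glossary(csv_text):
    """Parse a CSV with 'source' and 'target' columns into (source, target) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["source"].strip(), row["target"].strip()) for row in reader]


def relevant_entries(glossary, document):
    """Keep only entries whose source term occurs in the document (case-insensitive)."""
    lowered = document.lower()
    return [(src, tgt) for src, tgt in glossary if src.lower() in lowered]


def build_prompt(document, glossary, examples):
    """Assemble instructions, glossary entries, few-shot examples, and the text."""
    lines = [
        "Translate the document from English to German.",
        "Use the glossary translation whenever a glossary source term appears.",
        "",
        "Glossary:",
    ]
    lines += [f"- {src} => {tgt}" for src, tgt in relevant_entries(glossary, document)]
    lines += ["", "Examples:"]
    for src, tgt in examples:  # few-shot pairs showing correct term usage
        lines += [f"Source: {src}", f"Target: {tgt}"]
    lines += ["", "Document:", document]
    return "\n".join(lines)


GLOSSARY_CSV = "source,target\ntorque wrench,Drehmomentschlüssel\nsafety valve,Sicherheitsventil\n"
glossary = load_glossary(GLOSSARY_CSV)
prompt = build_prompt(
    "Tighten the bolt with a torque wrench.",
    glossary,
    examples=[("Use the torque wrench.", "Verwenden Sie den Drehmomentschlüssel.")],
)
```

Here the prompt lists the "torque wrench" entry but leaves out "safety valve", because that term does not appear in the document.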

"I’ve invested a lot of work into improving terminology. The prototype features a dedicated parser that reads glossaries in CSV format and injects the parsed entries directly into the translation prompt. The prompt also includes tailored terminology instructions along with a few targeted few-shot examples."

Jourik Ciesielski on terminology management in his prototype

3. Tag Handling and Structural Integrity

In professional translation, proper handling of tags (markup elements) is critical for preserving formatting and structural information.

Smart Tag Management

Ciesielski’s prototype implements a comprehensive tag-handling algorithm that preserves the structural integrity of the document throughout the translation process.

The tag-handling process includes several steps:

Tag Detection: The system can identify inline tags used in various TMS platforms (memoQ, Phrase, Trados, XTM)
Tag Conversion: The tags are converted into a simplified, LLM-readable format while retaining all original tag content
Prompt Optimization: The translation prompt includes specific instructions for tag handling, supported by targeted few-shot examples
Tag Restoration: After translation, the tags are automatically restored to their original form
Automatic Correction: If issues arise, an AI-driven automated tag correction is triggered

This sophisticated tag-handling prevents tag-related errors – especially important in highly formatted content such as technical documentation, user manuals, and HTML content. As Ciesielski notes, this can significantly reduce the workload for post-editors by minimizing the need for manual tag corrections.
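The conversion and restoration steps can be illustrated with a simple placeholder scheme. This is a minimal sketch, not the prototype's algorithm: the `{n}` placeholder format, the regular expression, and the function names are assumptions, and the sketch presumes the segment text contains no literal `{1}`-style strings of its own.

```python
import re

# Matches opening, closing, and self-closing inline elements,
# e.g. <g id="1">, </g>, <x id="2"/>.
TAG = re.compile(r"</?[A-Za-z][^<>]*?/?>")
PLACEHOLDER = re.compile(r"\{(\d+)\}")


def protect_tags(segment):
    """Replace each inline tag with a numbered placeholder; return text and tag list."""
    tags = []

    def to_placeholder(match):
        tags.append(match.group(0))
        return f"{{{len(tags)}}}"

    return TAG.sub(to_placeholder, segment), tags


def tags_intact(translated, tags):
    """True if every placeholder appears exactly once (any order is accepted)."""
    found = [int(n) for n in PLACEHOLDER.findall(translated)]
    return sorted(found) == list(range(1, len(tags) + 1))


def restore_tags(translated, tags):
    """Put the original tags back; fail loudly if the model lost or added one."""
    if not tags_intact(translated, tags):
        raise ValueError("placeholders missing or duplicated; correction needed")
    return PLACEHOLDER.sub(lambda m: tags[int(m.group(1)) - 1], translated)


source = 'Press <g id="1">Start</g> to continue.<x id="2"/>'
simplified, tags = protect_tags(source)
# A translation as an LLM might return it, with placeholders preserved:
translated = "Drücken Sie {1}Start{2}, um fortzufahren.{3}"
restored = restore_tags(translated, tags)
```

The `tags_intact` check is the point where a correction step, such as the automatic tag correction described above, would be triggered instead of raising an error.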

"Inline tags are crucial but complex in machine translation – not only must the tags themselves be preserved, but their order and surrounding spaces must also be handled correctly or adjusted as needed."

Jourik Ciesielski on the importance of precise tag handling

Process and Technical Advantages

In addition to specific linguistic improvements, document-based LLM translation offers significant process and technical advantages for businesses and language service providers.

Better Handling of Segmentation Issues in TMS

One major benefit of the document-based approach is its improved handling of segmentation issues. Poor segmentation in a TMS usually stems from badly formatted or complex source files and poses a serious challenge for traditional engines.

Segmentation errors can result in incomplete or grammatically incorrect sentences, loss of coherence between related text parts, and inefficient use of translation memories. In such cases, sentences are not cleanly separated into segments, and traditional MT engines analyze and translate each segment individually.

The document-based approach sidesteps these problems: the LLM sees entire sentences, even when they span several segments or when the source files are poorly structured. Translation quality therefore depends far less on how cleanly the file was segmented.

TMS Compatibility and Workflow Integration

Practical usability of any new translation technology depends on its compatibility with existing Translation Management Systems and workflows. Ciesielski’s tool was designed with seamless integration and usability in mind:

It is fully compatible with TMS platforms such as memoQ, Trados, and Phrase
It is easy to use, with a very short and intuitive training phase
The tool is adaptable to specific use cases and user needs (e.g., incorporating style guides as prompts, using it for translation or QA only, respecting 100% TM matches and locked segments, etc.)
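How such options could shape a run is sketched below. All names here (`Segment`, `Options`, `select_segments`, `system_prompt`) are hypothetical, and the data model is an assumption: each TMS exposes match rates and lock status differently in its bilingual files.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    target: str = ""
    match_percent: int = 0  # best translation memory match, 0 to 100
    locked: bool = False


@dataclass
class Options:
    respect_full_matches: bool = True
    respect_locked: bool = True
    style_guide: str = ""
    mode: str = "translate"  # or "qa"


def select_segments(segments, options):
    """Return the segments that should be sent to the LLM for translation."""
    selected = []
    for segment in segments:
        if options.respect_locked and segment.locked:
            continue
        if options.respect_full_matches and segment.match_percent >= 100:
            continue
        selected.append(segment)
    return selected


def system_prompt(options):
    """Build the instruction text, optionally including a style guide."""
    if options.mode == "qa":
        task = "Review the existing translations and list any errors."
    else:
        task = "Translate the document."
    if options.style_guide:
        return f"{task}\n\nFollow this style guide:\n{options.style_guide}"
    return task


segments = [
    Segment("Welcome.", "Willkommen.", match_percent=100),
    Segment("Legal notice.", "Rechtlicher Hinweis.", locked=True),
    Segment("It will be implemented next month."),
]
to_translate = select_segments(segments, Options())
```

With the default options, only the third segment is selected; the 100% match and the locked segment keep their existing targets.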

Current State and Future Prospects

Current Industry Trends

Document-based LLM translation aligns with broader industry trends emphasizing AI integration and hybrid approaches. Many experts believe that LLMs and traditional neural MT engines will increasingly be used in combination rather than in competition. Major providers like DeepL have already integrated next-gen LLMs into their translation services.

Hybrid solutions that combine LLMs with traditional MT have shown notable quality improvements – some implementations have achieved very high MQM (Multidimensional Quality Metrics) scores without human intervention. This underscores the enormous potential of document-based approaches.

Document-based LLM translation is part of a broader paradigm shift in translation technology. Ciesielski predicts that LLMs will soon replace traditional neural MT engines as the standard in machine translation.

This transition will likely be accompanied by several parallel developments:

Multimodal Translation Systems: Integration of text, image, and possibly audio in a unified translation workflow
Agent-Based Translation Systems: Autonomous AI agents that coordinate complex localization projects with minimal human input
Advanced Retrieval-Augmented Generation (RAG): Dynamic retrieval and integration of external knowledge sources during the translation process

"I expect LLMs to soon replace traditional neural MT engines as the standard for machine translation. With this prototype, I aim to prepare users for this paradigm shift by equipping them with all the necessary enhancements and optimizations."

Jourik Ciesielski on the future of translation technology

Glossary of Key Terms

Document-Based LLM Translation

A translation approach that treats the entire document as a coherent unit rather than processing it segment by segment as in traditional MT systems.

XLIFF (XML Localization Interchange File Format)

An XML-based format for exchanging localizable data between different translation tools.

Few-Shot Learning

A prompting method in which an LLM is shown a few targeted examples of a task directly in the prompt and then performs that task without any retraining.

Tag Handling

The process of correctly identifying, processing, and restoring formatting tags in translation documents.

Coreferential Relationships

Relationships in a text where different expressions (such as a noun and a later pronoun) refer to the same entity.

Translation as a Feature (TaaF)

The integration of translation functionality directly into applications, websites, or platforms, allowing translations to be delivered seamlessly and "invisibly."