Document-based LLM translation marks a paradigm shift in machine translation within translation management systems (TMS) such as memoQ or Trados. That is the conclusion of our interview with translation technology expert Jourik Ciesielski. Traditional MT engines process text segment by segment inside the TMS and often lose valuable contextual information along the way. To address this, Ciesielski developed an approach that processes the entire document within the TMS, enabling more accurate, coherent, and natural translations.
This article, based on Ciesielski's expertise, explores the technical foundations, specific advantages, and future prospects of this promising technology.
As CTO at Yamagata Europe and founder of C-Jay International, Jourik Ciesielski has developed a groundbreaking prototype for document-based LLM translation. His solution integrates advanced AI technologies with the practical needs of the localization industry.
With his unique background as a trained translator and technology expert, Ciesielski understands both the linguistic nuances and technical challenges of machine translation – a rare combination that makes him a leading innovator in this field.
To understand the revolutionary nature of document-based LLM translation, we first need to examine the limitations of machine translation within traditional TMS environments.
In conventional Translation Management Systems (TMS), documents are broken down into individual segments. The goal is to facilitate bilingual processing and to match content efficiently against previous translations stored in translation memories and glossaries. However, this approach limits how MT engines handle the content, because machine translation in a TMS usually runs sentence by sentence. Segmentation strips away the context needed to interpret and translate linguistic elements such as pronouns and references correctly. Without that context, the translation model is forced to "guess" what certain words refer to, which can lead to inconsistencies and errors.
Ciesielski’s prototype for document-based LLM translation employs a fundamentally different approach. Instead of viewing text as isolated segments, it treats the entire document as a coherent unit, preserving full context. The workflow includes several key steps, as described by Ciesielski:
This process overcomes the limitations of traditional segment-based MT by enabling the LLM to understand and use the full document context, from the first line to the last.
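As a rough illustration of the general idea (not Ciesielski's actual implementation, whose details are not reproduced here), the sketch below packs all segments of a document into one prompt and maps the model's reply back to segment positions. The function names, the `[n]` marker convention, and the prompt wording are assumptions made for this example, and it assumes no segment contains a line break.

```python
import re


def build_document_prompt(segments, source_lang, target_lang):
    """Pack every segment into a single prompt, each tagged with its index."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(segments, start=1)
    )
    return (
        f"Translate the following document from {source_lang} to {target_lang}.\n"
        "Keep the [n] markers and return exactly one line per marker.\n\n"
        f"{numbered}"
    )


def split_response(response, expected_count):
    """Map the model's numbered lines back to their segment positions."""
    found = {}
    for line in response.splitlines():
        match = re.match(r"\[(\d+)\]\s*(.*)", line.strip())
        if match:
            found[int(match.group(1))] = match.group(2)
    # Refuse to write anything back to the TMS if a segment went missing.
    missing = [i for i in range(1, expected_count + 1) if i not in found]
    if missing:
        raise ValueError(f"Response is missing segments: {missing}")
    return [found[i] for i in range(1, expected_count + 1)]


segments = [
    "The company announced its new policy.",
    "It will be implemented next month.",
]
prompt = build_document_prompt(segments, "English", "German")
```

The model sees both sentences at once, while the numbered markers let the translations flow back into the original bilingual segment structure. The completeness check matters because a reply with a dropped or merged segment would otherwise shift every translation after it.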
"The model UNDERSTANDS the relationship between sentences (whereas with limited context, it only PREDICTS). As a result, it makes correct grammatical and semantic decisions, even with low-context prepositions, references, etc."
Document-based LLM translation brings significant improvements to the automatic translation of content within TMS platforms like Trados and memoQ. Below are three major challenges and how the document-based approach addresses them.
Correctly translating pronouns is one of the biggest challenges for machine translation systems within TMS environments – especially in gendered languages like German, French, or Spanish.
In many languages, pronouns must match the grammatical gender of their antecedents. With traditional segment-based MT in a TMS, this information is often lost when the antecedent is in a different segment.
Example: When translating from English to German, pronouns like "it" can be especially challenging. Consider: "The company announced its new policy. It will be implemented next month." Here, "it" refers to "policy". In German, "policy" becomes "Richtlinie" (feminine), so the pronoun must be "sie". Traditional MT engines in a TMS often fail to recognize this relationship, particularly when the two sentences sit in separate segments. An engine that sees only the second segment has no antecedent to work with and may default to the neuter "es", which is grammatically incorrect once "policy" has been translated as "Richtlinie".
Document-Based Solution: Because the LLM processes the text as a whole, it can identify the coreferential relationship between "policy" and "it" and accurately translate the passage as "Die Firma hat ihre neue Richtlinie angekündigt. Sie wird nächsten Monat umgesetzt."
In internal tests with the prototype, Ciesielski reports a substantial reduction in errors related to pronoun references.
Maintaining consistent use of terminology is essential in technical, medical, and legal texts. Traditional MT systems struggle to consistently translate specialized terms across long documents.
Ciesielski’s prototype includes a custom parser that can read glossaries in CSV format and inject the extracted entries directly into the translation prompt.
Implementation details:
This terminology handling results in significantly improved terminological consistency. In Ciesielski's experience, document-based LLM translation achieves a much higher consistency rate than traditional neural MT systems, a major advantage when working with complex subject matter.
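The following minimal sketch shows what reading a CSV glossary and injecting its entries into a prompt could look like. It is an illustration under assumptions, not the prototype's parser: the column names `source` and `target`, the filtering of entries to terms that actually occur in the document, and the instruction wording are all hypothetical.

```python
import csv
import io


def parse_glossary(csv_text):
    """Read (source, target) term pairs from CSV text with a header row.

    Assumes columns named 'source' and 'target'; rows missing either
    value are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["source"].strip(), row["target"].strip())
        for row in reader
        if row.get("source") and row.get("target")
    ]


def relevant_entries(entries, document_text):
    """Keep only terms that occur in the document (case-insensitive)."""
    lowered = document_text.lower()
    return [(src, tgt) for src, tgt in entries if src.lower() in lowered]


def terminology_block(entries):
    """Format glossary entries as an instruction block for the prompt."""
    lines = [f'- "{src}" => "{tgt}"' for src, tgt in entries]
    return "Use these glossary translations consistently:\n" + "\n".join(lines)


glossary_csv = "source,target\ntorque wrench,Drehmomentschlüssel\nflange,Flansch\n"
entries = parse_glossary(glossary_csv)
block = terminology_block(
    relevant_entries(entries, "Tighten the bolts with a Torque Wrench.")
)
```

Filtering to terms that appear in the document keeps the prompt short for large glossaries. A simple substring match like this one would miss inflected forms, so a production parser would likely need something more robust.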
"I’ve invested a lot of work into improving terminology. The prototype features a dedicated parser that reads glossaries in CSV format and injects the parsed entries directly into the translation prompt. The prompt also includes tailored terminology instructions along with a few targeted few-shot examples."
In professional translation, proper handling of tags (markup elements) is critical for preserving formatting and structural information.
Ciesielski’s prototype implements a comprehensive tag-handling algorithm that preserves the structural integrity of the document throughout the translation process.
The tag-handling process includes several steps:
This tag handling prevents tag-related errors, which is especially important in highly formatted content such as technical documentation, user manuals, and HTML content. As Ciesielski notes, this can significantly reduce the workload for post-editors by minimizing the need for manual tag corrections.
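One common way to protect inline tags, shown here as a hedged sketch rather than the prototype's algorithm, is to replace them with numbered placeholders before translation and restore them afterwards. The `{n}` placeholder format, the simple angle-bracket tag pattern, and the function names are assumptions for this example. Whitespace around tags is not adjusted here.

```python
import re

# Assumption: inline tags look like <...>; real TMS exports may use other markup.
TAG_PATTERN = re.compile(r"<[^<>]+>")


def mask_tags(segment):
    """Replace each inline tag with a numbered placeholder such as {1}."""
    tags = []

    def _swap(match):
        tags.append(match.group(0))
        return f"{{{len(tags)}}}"

    return TAG_PATTERN.sub(_swap, segment), tags


def restore_tags(translated, tags):
    """Put the original tags back; fail if a placeholder is lost or repeated."""
    for index in range(1, len(tags) + 1):
        placeholder = f"{{{index}}}"
        if translated.count(placeholder) != 1:
            raise ValueError(f"Placeholder {placeholder} must appear exactly once")

    def _unswap(match):
        index = int(match.group(1))
        # Leave unknown placeholders untouched instead of crashing.
        return tags[index - 1] if 1 <= index <= len(tags) else match.group(0)

    return re.sub(r"\{(\d+)\}", _unswap, translated)


masked, tags = mask_tags("Press <b>Start</b> to begin.")
restored = restore_tags("Drücken Sie {1}Start{2}, um zu beginnen.", tags)
```

Because placeholders are matched by number rather than by position, the model is free to reorder them when the target language requires a different word order, while the count check catches dropped or duplicated tags before they reach the post-editor.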
"Inline tags are crucial but complex in machine translation – not only must the tags themselves be preserved, but their order and surrounding spaces must also be handled correctly or adjusted as needed."
In addition to specific linguistic improvements, document-based LLM translation offers significant process and technical advantages for businesses and language service providers.
One major benefit of the document-based approach is its improved handling of segmentation issues. Poor segmentation in a TMS often stems from poorly formatted or complex source files and poses a serious challenge for traditional MT.
Segmentation errors can result in incomplete or grammatically incorrect sentences, loss of coherence between related text parts, and inefficient use of translation memories. In such cases, sentences are not cleanly separated into segments, and traditional MT engines analyze and translate each segment individually.
The document-based approach bypasses these problems by allowing the LLM to consider entire sentences, even when they span several segments or when the source files are poorly structured. This ensures that translation quality is not compromised by segmentation issues.
The practical usability of any new translation technology depends on its compatibility with existing Translation Management Systems and workflows. Ciesielski's tool was designed with seamless integration and usability in mind:
Document-based LLM translation aligns with broader industry trends emphasizing AI integration and hybrid approaches. Many experts believe that LLMs and traditional neural MT engines will increasingly be used in combination rather than in competition. Major providers like DeepL have already integrated next-gen LLMs into their translation services.
Hybrid solutions that combine LLMs with traditional MT have shown notable quality improvements – some implementations have achieved very high MQM (Multidimensional Quality Metrics) scores without human intervention. This underscores the enormous potential of document-based approaches.
Document-based LLM translation is part of a broader paradigm shift in translation technology. Ciesielski predicts that LLMs will soon replace traditional neural MT engines as the standard in machine translation.
This transition will likely be accompanied by several parallel developments:
"I expect LLMs to soon replace traditional neural MT engines as the standard for machine translation. With this prototype, I aim to prepare users for this paradigm shift by equipping them with all the necessary enhancements and optimizations."