Editing TMs for training an LLM: How to swap source and target segments (Post-editing & Machine Translation)

Thread poster: trans-agrar

trans-agrar

Germany
Local time: 12:39
English to German

Nov 28

Hello you smart people out there.

I'm currently cleansing my TMs so that I can use them to train an LLM. My CAT tool is memoQ.

The problem is that there are a lot of segments in which source and target are reversed: not just a single segment here and there, but long runs of consecutive segments. So I simply must be able to swap them in batches.

memoQ doesn't seem to offer a function for this.
1. Python scripts didn't work, because Python keeps having PATH issues on my machine and I'm unable to delete the two Python files that seem to cause the issue.
2. Notepad: how do I tell Notepad to detect German words in target segments?
3. I tried Olifant, but I can't get the Swap function to work based on what I learned from the Help desk. Besides, it doesn't seem to provide a batch function (e.g. swap all marked segments).
4. ChatGPT tells me all kinds of bla bla bla.
5. Can anybody say whether OmegaT is a real option for this?
6. Any other experience out there?

There must be some solution...
Appreciate your thoughts
Thanks
Barbara

Dan Lucas

United Kingdom
Local time: 11:39
Member (2014)
Japanese to English

Complex

Nov 28

I don't use memoQ, but I can imagine exporting the TM as a TMX file, converting that to an Excel file or maybe a "Wordfast TM file" as described in this thread, then opening it in Excel with the source and target languages in two columns. Then I would write some VBA to step through each row and swap the two cells when required (putting the output in columns four and five or something).

But how would you indicate which segments need to be reversed? Unless you can come up with a foolproof way of determining whether the text in the source column is English or German, you'd have to make a decision for every unit in the TM. For dozens of units, that may not be a problem, but for thousands of units...? (This might work in theory, but in practice it may not.)

With all due respect for your ambition, this task would require some skill in coding. On the other hand, ChatGPT could probably write you some useful VBA code provided that you can describe exactly what you want and how to get it. You'd also have to be familiar with using the VBA editor in Excel.

Regards,
Dan

[Edited at 2024-11-28 21:18 GMT]
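
One rough way to approach the question of telling English from German automatically is to count common function words in each segment. Below is a minimal Python sketch of that idea. The word lists and the names `guess_language` and `is_reversed` are illustrative assumptions, not part of any tool mentioned in this thread, and short or mixed segments would still need a manual check.

```python
import re

# Small illustrative word lists (assumptions; extend them for real data).
ENGLISH_WORDS = {"the", "and", "of", "to", "is", "in", "that", "for", "with", "are"}
GERMAN_WORDS = {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "von", "ein"}


def guess_language(text):
    """Return 'en', 'de', or None when the evidence is tied or absent."""
    lowered = text.lower()
    words = re.findall(r"[a-zäöüß]+", lowered)
    en_hits = sum(word in ENGLISH_WORDS for word in words)
    de_hits = sum(word in GERMAN_WORDS for word in words)
    # Umlauts and ß are strong hints for German, so count each as an extra hit.
    de_hits += len(re.findall(r"[äöüß]", lowered))
    if en_hits == de_hits:
        return None
    return "en" if en_hits > de_hits else "de"


def is_reversed(source, target):
    """True when an EN>DE unit appears to hold German in source and English in target."""
    return guess_language(source) == "de" and guess_language(target) == "en"
```

Units for which `guess_language` returns `None` on either side are deliberately left alone, so only clear-cut cases are flagged for swapping.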

Stepan Konev

Russian Federation
Local time: 14:39
English to Russian

Swap single segment

Nov 29

I don't know how to do that in batch mode, but you can try the AHK script below to swap single segments in memoQ's TM editing mode:

f11::
Send, ^a
Sleep 30
Send, ^x
Sleep 30
Send, {tab}
Sleep 100
Send, ^{PgUp}
Sleep 100
Send, ^v
Sleep 30
Send, ^+{PgDn}
Sleep 100
Send, ^x
Sleep 30
Send, {tab}
Sleep 100
Send, ^v
Return

How to use:
1) Run the script.
2...

Hans Lenting
Netherlands
Member (2006)
German to Dutch

CafeTran Espresso

Nov 29

Open the TMX in TMX edit mode. In the Find field of the Find dialogue, enter an expression with typical words for one of the languages, e.g.: the|left|at|…

Filter the segments. Export the filtered segments. Then remove them.

Open the exported segments in a new project. Save as tab-delimited. Open the file with Modern CSV or any other TSV handler. Swap the columns.

Import the swapped columns into CafeTran Espresso and save as TMX.

Note: In the first step you ...

[Edited at 2024-11-29 07:04 GMT]
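
The column swap in the tab-delimited step can also be scripted instead of done by hand in a TSV editor. Here is a minimal Python sketch, assuming the export holds the source text in the first column and the target text in the second. The function names `swap_line` and `swap_file` are hypothetical.

```python
def swap_line(line):
    """Exchange the first two tab-separated fields of one line, keeping its line ending."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    fields = body.split("\t")
    if len(fields) >= 2:
        fields[0], fields[1] = fields[1], fields[0]
    return "\t".join(fields) + ending


def swap_file(in_path, out_path):
    """Copy a tab-delimited file, swapping the first two columns of every row."""
    # newline="" leaves the original line endings untouched on read and write.
    with open(in_path, encoding="utf-8", newline="") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(swap_line(line))
```

Plain string splitting is used rather than a CSV parser so that quotation marks inside segments pass through unchanged. Any columns after the second are kept as they are.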

Kyriaki Ailamaki
Greece
Local time: 13:39
Member (2006)
English to Greek

Here is a potential solution

Nov 30

Hi Barbara,

See below a potential solution for a similar situation I had:

1. Assuming that your languages are English and German, use Olifant to mark all segments where German text wrongly sits on the English side by searching for German-specific characters, e.g. lower- and upper-case letters with umlauts etc.
2. Export the marked segments to a TMX file.
3. Delete the marked segments from the original TMX file.
4. Open the exported (wrong) TMX file in Notepad++ or EmEditor and swap the language codes via a third language, i.e. change English to Greek, German to English and Greek to German, in the segments section only.
5. Fix the source language in the TMX header.
6. Import the extracted and fixed TMX into the TMX from which you deleted the wrong segments.

Good luck!
Kyriaki
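
For anyone who does get Python running, the marking and label-swapping steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the TMX uses `xml:lang` attributes on its `<tuv>` elements (TMX 1.4 style), the language codes start with `en` and `de`, and the function name `swap_reversed_units` is hypothetical. The umlaut test misses German segments without such characters and can misfire on English segments that contain German names, so the result should be reviewed.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the standard XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
GERMAN_CHARS = re.compile(r"[äöüÄÖÜß]")


def swap_reversed_units(root, en_prefix="en", de_prefix="de"):
    """Swap the language labels of units whose English-labelled text
    contains German-specific characters. Returns the number of units changed."""
    changed = 0
    for tu in root.iter("tu"):
        # Index the variants of this unit by the first two letters of their language code.
        tuvs = {}
        for tuv in tu.findall("tuv"):
            tuvs[tuv.get(XML_LANG, "").lower()[:2]] = tuv
        en_tuv, de_tuv = tuvs.get(en_prefix), tuvs.get(de_prefix)
        if en_tuv is None or de_tuv is None:
            continue
        seg = en_tuv.find("seg")
        # itertext() also collects text that sits inside inline tags.
        text = "".join(seg.itertext()) if seg is not None else ""
        if GERMAN_CHARS.search(text):
            en_code, de_code = en_tuv.get(XML_LANG), de_tuv.get(XML_LANG)
            en_tuv.set(XML_LANG, de_code)
            de_tuv.set(XML_LANG, en_code)
            changed += 1
    return changed
```

To apply it to a file, parse the file with `ET.parse(...)`, call the function on `tree.getroot()`, and save with `tree.write(..., encoding="utf-8", xml_declaration=True)`. The detour through a third language is only needed for search-and-replace in a text editor. A script can exchange the two codes directly. Note that ElementTree drops any DOCTYPE line when writing, so it is safer to work on a copy.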

trans-agrar

Germany
Local time: 12:39
English to German

Topic starter

Thank you everybody

Dec 2

Dear colleagues,
thank you all for putting your heads into this, and special thanks to Kyriaki for your advice. This eventually worked for me, at last, after so much trial and error.
So I successfully marked a selection, exported it, swapped the languages and carried out the delete/import steps as described. Thanks for this.

There is still a minor hitch though: I can't manage to flag an Olifant selection in a batch procedure. This time I marked 64 entries one...

Hans Lenting
Netherlands
Member (2006)
German to Dutch

Which MT system?

Dec 3

It is good to hear that you have found a solution. Can you tell us which MT system you are training?

trans-agrar

Germany
Local time: 12:39
English to German

Topic starter

MT system

Dec 3

Hello Hans,

I did the first trial with Kantan. But at the time my memories had not been cleansed yet. Accordingly, the result was really discouraging: there wasn't a single coherent sentence. I might try Kantan again once I've sorted out the relevant memories. Another idea was to try Globalese, but it has now been purchased by memoQ and has become unaffordable for a freelance translator. Still, it's one step at a time right now. Sorting my memories successfully is the first stopover in th...

Stepan Konev

Russian Federation
Local time: 14:39
English to Russian

OPUS CAT MT

Dec 3

Have you tried OPUS CAT MT? It is free, offline and it uses your TMs.
https://helsinki-nlp.github.io/OPUS-CAT/install

trans-agrar

Germany
Local time: 12:39
English to German

Topic starter

Scripting in Olifant

Dec 4

Thanks, Stepan, for the OPUS-CAT tip. I'll look at it nearer the time.

Meanwhile I've been trying to use Olifant scripting to speed up marking my selections.

The following script should list all segments with keys between 1500 and 2000. But something is wrong, because my list does start at 1500 but then runs on to the end instead of stopping at 2000. Can anybody find out what is wrong?

[Key]

Philippe Locquet

Portugal
Local time: 11:39
Member (2013)
English to French

Size, Plus Tools, Wordfast Anywhere

09:05

trans-agrar wrote:

Hello you smart people out there.

I'm currently cleansing my TMs so that I can use them to train an LLM. My CAT tool is memoQ.

Barbara

Hello, I came to this very late. I agree that Kyriaki's method is the simplest: once the segments are listed, export them, remove them, reverse them and feed them back in.

Regarding the set of tools:
If your TM is small, Wordfast Anywhere has an immense set of tools for working on TMs.
If your TM is big, I recommend Plus Tools. It's a free tool actively developed by Yves Champollion. It can handle TMs bigger than 1 million lines. There's no need for a beefy computer: it's coded to work directly on disk, so you don't need 64 GB of RAM.
As Yves welcomes feedback, if you are performing actions that other professionals are likely to need too, he might add the functionality to the tool (he ultimately decides what goes in or not). Try it here: https://www.wordfast.net/PlusTools/ (accept the browser warnings, the site is legit).

I tried Kantan back in the day (I have a video on it), nice tool!

My bests,

Philippe
