Editing TMs for training an LLM: How to swap source and target segments
Thread poster: trans-agrar
trans-agrar  Identity Verified
Germany
Local time: 12:39
English to German
Nov 28

Hello you smart people out there.

I'm currently cleaning up my TMs so that I can use them to train an LLM. My CAT tool is memoQ.

The problem is that in a lot of segments source and target are reversed: not just a single segment here and there, but long runs of segments. So I simply must be able to swap them in batches.

memoQ doesn't seem to offer a function for this.
1. Python scripts didn't work, because Python continues to have PATH issues on my machine and I'm unable to delete the two Python files that seem to cause the issue.
2. Notepad: how do I tell Notepad to detect German words in target segments?
3. I tried Olifant, but I can't get the Swap function to work based on what I learned from the Help desk. Besides, it doesn't seem to provide a batch function (e.g. swap all marked segments).
4. ChatGPT tells me all kinds of blah blah blah.
5. Can anybody say whether OmegaT is a real option for this?
6. Any other experience out there?

There must be some solution...
I'd appreciate your thoughts.
Thanks
Barbara


 
Dan Lucas  Identity Verified
United Kingdom
Local time: 11:39
Member (2014)
Japanese to English
Complex Nov 28

I don't use MemoQ, but I can imagine exporting this as a TMX, exporting that as an Excel file or maybe as "Wordfast TM file" as described in this thread, then opening it in Excel and putting source and target languages in two columns. Then I would write some VBA to step through each row, and swap them (putting the output in columns four and five or something) when required.

But how would you indicate which segments need to be reversed? Unless you can come up with a foolproof way of determining whether the text in the source column is English or German, you'd have to make a decision for every unit in the TM. For dozens of units, that may not be a problem, but for thousands of units...? (This might work in theory, but in practice it may not.)
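
One rough way to make that decision automatically is to count typical function words on each side. The Python below is only a sketch under stated assumptions: the two word lists are tiny and hypothetical, the names `guess_language` and `needs_swap` are made up for illustration, and the heuristic is far from foolproof (short segments, numbers and product names will come back as "unknown" or be misjudged).

```python
import re

# Hypothetical, deliberately tiny hint lists; extend them for real data.
GERMAN_HINTS = {"der", "die", "das", "und", "ist", "nicht", "mit", "für",
                "von", "ein", "eine", "den", "zu", "auf", "wird"}
ENGLISH_HINTS = {"the", "and", "is", "not", "with", "for", "of", "to",
                 "on", "are", "this", "that", "be", "will"}

WORD_RE = re.compile(r"[^\W\d_]+")       # runs of Unicode letters
GERMAN_CHARS = re.compile(r"[äöüÄÖÜß]")  # strong, but not infallible, German hint


def guess_language(text):
    """Return 'de', 'en' or 'unknown' from hint-word counts and umlauts."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    de_score = sum(w in GERMAN_HINTS for w in words)
    en_score = sum(w in ENGLISH_HINTS for w in words)
    de_score += 2 * len(GERMAN_CHARS.findall(text))
    if de_score > en_score:
        return "de"
    if en_score > de_score:
        return "en"
    return "unknown"


def needs_swap(source, target, src_lang="en", tgt_lang="de"):
    """True only when BOTH sides clearly look like the opposite language."""
    return guess_language(source) == tgt_lang and guess_language(target) == src_lang
```

Requiring both sides to look wrong keeps the number of false swaps down; anything that comes back as "unknown" is left alone and can be reviewed by hand.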

With all due respect for your ambition, this task would require some skill in coding. On the other hand, ChatGPT could probably write you some useful VBA code provided that you can describe exactly what you want and how to get it. You'd also have to be familiar with using the VBA editor in Excel.

Regards,
Dan


[Edited at 2024-11-28 21:18 GMT]


 
Stepan Konev  Identity Verified
Russian Federation
Local time: 14:39
English to Russian
Swap single segment Nov 29

I don't know how to do that in batch mode, but you can try the AHK script below to swap single segments in memoQ's TM editing mode:

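; Swaps source and target of the current TM row (AutoHotkey v1 syntax).
; The comments give the apparent intent of each keystroke.
f11::
Send, ^a          ; select all text in the current cell
Sleep 30
Send, ^x          ; cut it to the clipboard
Sleep 30
Send, {tab}       ; move to the other cell of the row
Sleep 100
Send, ^{PgUp}     ; put the cursor at the start of that cell
Sleep 100
Send, ^v          ; paste the cut text in front of the existing text
Sleep 30
Send, ^+{PgDn}    ; select from the cursor to the end (the original text)
Sleep 100
Send, ^x          ; cut the original text
Sleep 30
Send, {tab}       ; return to the first cell, which is now empty
Sleep 100
Send, ^v          ; paste the original text there
Return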

How to use:
1) Run the script.
2) Put the cursor anywhere in the source or target part of the segment you want to swap.
3) Press F11.
(I assume you know how to use AHK and create scripts.)


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
CafeTran Espresso Nov 29

Open the tmx in tmx edit mode. Enter an expression with typical words for one of the languages in the Find field of the Find dialogue. E.g.: the|left|at|…

Filter the segments. Export the filtered segments. Then remove them.

Open the exported segments in a new project. Save as tab-delimited. Open with Modern CSV or any other TSV handler. Swap the columns.

Import swapped columns in CafeTran Espresso, save as tmx.
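
If no TSV editor is at hand, the column swap itself can be sketched in a few lines of Python. This is a sketch under assumptions, not a CafeTran feature: it assumes a plain tab-delimited export with source in the first column and target in the second, no quoting, and the function name `swap_tsv_columns` is made up. The pattern reuses only the three words from the example expression above; the word boundaries are an addition.

```python
import re

# The three words come from the example expression in the post. The word
# boundaries (\b) are an added assumption so that "at" does not match
# inside German words such as "Rat" or "Datei".
TYPICAL_ENGLISH = re.compile(r"\b(?:the|left|at)\b", re.IGNORECASE)


def swap_tsv_columns(tsv_text, only_if=None):
    """Swap the first two tab-separated columns of each line.

    only_if: optional compiled regex. When given, a row is swapped only if
    the pattern is found in its second column (the one that should hold
    the target language). Line endings are written back as "\n".
    """
    out = []
    for line in tsv_text.split("\n"):
        cells = line.rstrip("\r").split("\t")
        if len(cells) >= 2 and (only_if is None or only_if.search(cells[1])):
            cells[0], cells[1] = cells[1], cells[0]
        out.append("\t".join(cells))
    return "\n".join(out)
```

Called without `only_if`, it swaps every row, which matches the workflow above where the export already contains only the reversed segments. Passing `only_if=TYPICAL_ENGLISH` would instead swap only rows whose target column contains one of the listed English words.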

Note: In the first step you can also filter on spelling errors, but then you will have to add a lot of words from the correct target language to user.dic, unless you already have such a list from memoQ.

Filtering on German source segments:
[Screenshot: CleanShot 2024-11-29 at 08.03.45]


[Edited at 2024-11-29 07:04 GMT]


 
Kyriaki Ailamaki
Greece
Local time: 13:39
Member (2006)
English to Greek
Here is a potential solution Nov 30

Hi Barbara,

See below a potential solution for a similar situation I had:

1. Assuming that your languages are English and German, use Olifant to mark all German segments that wrongly sit on the English side by searching for German-specific characters, e.g. lower- and upper-case letters with umlauts etc.
2. Export the marked segments to a TMX file.
3. Delete the marked segments from the original TMX file.
4. Open the exported TMX file containing the wrong segments in Notepad++ or EmEditor and swap the languages by way of a third language, i.e. change English to Greek, German to English and Greek to German, in the segments section only.
5. Fix the source language in the TMX header.
6. Import the extracted and fixed TMX into the TMX from which you deleted the wrong segments.
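
For anyone who gets Python running, steps 1 to 4 can also be sketched as one script. This is a minimal sketch, not the method as posted: it assumes a TMX 1.4-style file whose `<tuv>` elements carry `xml:lang`, English as the intended source and German as the intended target, and it flips the language labels directly instead of going through a third language. The function name `swap_reversed_units` is made up. The umlaut test misses German segments without umlauts or ß and wrongly flags English segments that contain them, so the result still needs a check.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
GERMAN_CHARS = re.compile(r"[äöüÄÖÜß]")


def tuv_text(tuv):
    """Plain text of a <tuv>, including text inside inline tags."""
    seg = tuv.find("seg")
    return "".join(seg.itertext()) if seg is not None else ""


def swap_reversed_units(tmx_xml, src="en", tgt="de"):
    """Flip the language labels of units whose 'src' side looks German.

    Returns (new_xml, number_of_units_swapped).
    """
    root = ET.fromstring(tmx_xml)
    swapped = 0
    for tu in root.iter("tu"):
        # Index the variants by primary language subtag ("en-US" -> "en").
        by_lang = {}
        for tuv in tu.findall("tuv"):
            by_lang[tuv.get(XML_LANG, "").lower().split("-")[0]] = tuv
        if src not in by_lang or tgt not in by_lang:
            continue
        if GERMAN_CHARS.search(tuv_text(by_lang[src])):
            src_code = by_lang[src].get(XML_LANG)
            tgt_code = by_lang[tgt].get(XML_LANG)
            by_lang[src].set(XML_LANG, tgt_code)
            by_lang[tgt].set(XML_LANG, src_code)
            swapped += 1
    return ET.tostring(root, encoding="unicode"), swapped
```

Because the labels are swapped inside the full file, nothing has to be exported, deleted and re-imported. Work on a copy of the TMX, and compare the returned count with the number of segments you expected to be reversed.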

Good luck!
Kyriaki


 
trans-agrar  Identity Verified
Germany
Local time: 12:39
English to German
Topic starter
Thank you everybody Dec 2

Dear colleagues,
thank you all for putting your heads together on this, and special thanks to Kyriaki for your advice. It finally worked for me, after so much trial and error.
I successfully marked a selection, exported it, swapped the languages and carried out the delete/import steps as described. Thanks for this.

There is still a minor hitch, though: I can't find a way to flag an Olifant selection in one batch operation. This time I marked 64 entries one by one, as I can't seem to find a "batch flag" function. I'm using Olifant 3.0.8.

Thanks again to everybody. I'm really grateful for your ideas.

Barbara


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Which MT system? Dec 3

It is good to hear that you have found a solution. Can you tell us which MT system you are training?

 
trans-agrar  Identity Verified
Germany
Local time: 12:39
English to German
Topic starter
MT system Dec 3

Hello Hans,

I did a first trial with Kantan, but at the time my memories had not been cleaned yet. Accordingly, the result was really discouraging: there wasn't a single coherent sentence. I might try Kantan again once I've sorted out the relevant memories. Another idea was to try Globalese, but it has now been acquired by memoQ and has become unaffordable for a freelance translator. Still, it's one step at a time right now. Sorting out my memories is the first stop on this journey. The next is to complete the metadata in my termbases. After that I'll embark on the training adventure.

Cheers
Barbara


 
Stepan Konev  Identity Verified
Russian Federation
Local time: 14:39
English to Russian
OPUS CAT MT Dec 3

Have you tried OPUS CAT MT? It is free, offline and it uses your TMs.
https://helsinki-nlp.github.io/OPUS-CAT/install


 
trans-agrar  Identity Verified
Germany
Local time: 12:39
English to German
Topic starter
Scripting in Olifant Dec 4

Thanks Stepan for the Opus Cat tool. I'll look at it nearer the time.

Meanwhile I've been trying to use Olifant coding to speed up my selection marking processes.

The following script should list all segments with keys between 1500 and 2000. But something is wrong, because my list does start at 1500 but then runs on to the end. Can anybody spot what is wrong?

[Key]
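
For comparison, a range selection outside Olifant needs both bounds in the same condition. The Python below is a minimal sketch and says nothing about Olifant's own filter syntax: it assumes the key corresponds to the 1-based position of the unit in an exported TMX (Olifant's keys may differ), and the function name `units_in_range` is made up.

```python
import xml.etree.ElementTree as ET


def units_in_range(tmx_xml, first, last):
    """Return (position, <tu> element) pairs with first <= position <= last."""
    root = ET.fromstring(tmx_xml)
    selected = []
    for position, tu in enumerate(root.iter("tu"), start=1):
        # Both bounds sit in one condition. With only `position >= first`
        # the selection would run on to the end of the file.
        if first <= position <= last:
            selected.append((position, tu))
    return selected
```

`units_in_range(tmx_text, 1500, 2000)` would return 501 units if the file has at least 2000 of them.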


 
Philippe Locquet  Identity Verified
Portugal
Local time: 11:39
Member (2013)
English to French
Size, Plus Tools, Wordfast Anywhere 09:05

trans-agrar wrote:

Hello you smart people out there.

I'm currently cleansing my TM's for using them to train an LLM. My Cat Tool is MemoQ.

Barbara


Hello, I came to this very late. I agree that Kyriaki's method is the simplest: once the reversed segments are listed, export them, remove them, reverse them and feed them back in.

Regarding the set of tools:
If your TM is small, Wordfast Anywhere has an immense set of tools for working on TMs.
If your TM is big, I recommend Plus Tools. It's a free tool actively developed by Yves Champollion. It can handle TMs bigger than 1 million lines. No need for a beefy computer: it's coded to work directly on disk, so you don't need 64 GB of RAM.
As Yves welcomes feedback, if you are performing actions that other professionals are likely to need too, he might add that functionality to the tool (he ultimately decides what goes in). Try it here: https://www.wordfast.net/PlusTools/ (accept the browser warnings, the site is legit).

I tried Kantan back in the day (I have a video on it), nice tool!

All the best,

Philippe


 

