Java tool + automatic translations of PDFs (Google Translate etc)

posted on Mar, 12 2016 @ 10:30 AM

link

Okay, some of you know from my threads on ATS over the last few years that I like to share searchable PDFs of UFO documents and publications.

I'm currently writing a thread about a collection of official Swedish documents kindly freely shared online by members of UFO-Sweden (including Clas Svahn, a member of the AFU that I've worked with on a few things) and the producers of a new documentary.

I've already downloaded the the document collection and converted it into searchable PDFs, but one problem is that the documents are (unsurprisingly...) in Swedish.

I'd like to generate English translations of the relevant 1,600 or so pages of PDF using Google Translate or similar automated software.

I'm well aware that automated translations can be imperfect (and sometime downright silly). The problems with such automated may be particular serious given that quite a bit of the relevant official documentation is handwritten (since most of the documents are from 1946) and Google Translate seems less accurate with Swedish than, say, French. However, I'd like to give automated translations a go - not least because it may be possible to apply the same techniques to other collections of foreign UFO documents I've obtained, e.g. from the French government. (I've obtained good results with smaller French documents...).

BUT Google Translate has a 1MB file size limit. The relevant PDFs from Sweden total a few hundred MBs. I'd prefer to avoid splitting the PDF files into several hundred smaller files and doing individual translations.

It seems that there WAS a useful way around the Google Translate file size limit using the Onlinedoctranslator.com website at the link below:
www.onlinedoctranslator.com

That Java tool used the Google Translate engine but with a few extra useful features (mainly overcoming the 1MB file size limit but also retaining the format of the original PDF document).

Sadly, that tool appears to have been written several years ago and does not appear to work on my Windows 10 computer using the current version of Java.

After installing the current version of Java and adding the Onlinedoctranslator.com website to the list of trusted websites, the relevant tool appears to run (since various pop-ups appear and I can select the PDF to be translated and enter the name for saving it) BUT I get various error messages stating "Illegalname" and if I ignore that then I get another error message about

Here is a single page PDF file extract from one of the relevant PDFs:
we.tl...

A bit bigger (at 42MB) here is a larger extract from the relevant PDFs, including a couple of hundred pages:
we.tl...

So, I wonder:

(1) Does the relevant Onlinedoctranslator.com Java translation tool currently work for other people?

(2) If not, is there any other useful tool for producing automated translations of PDF which does not suffer from Google Translate's 1MB file size limit?

Any input would be most appreciated...

(If the translation tool works, the translations could be added to the existing website put together by UFO Sweden and the documentary producers, or - if I can get their permission - simply to accompany a searchable PDF versions of the collection of Swedish documents)

If I can get this translation tool to work (or find a substitute), there are several other large collections of official UFO documents that I could upload from around the world.

BeefNoMeat

posted on Mar, 12 2016 @ 10:43 AM

link

a reply to: IsaacKoi

I wish I could provide some assistance but I can't. However, your UFO-related threads are some of the best, if not the best, and I hope someone chimes in with the technical assistance you're looking/asking for. Good luck and looking forward to reading about Swedish UFO sightings/encounters.

Cheers.

IsaacKoi

posted on Mar, 12 2016 @ 12:08 PM

link

originally posted by: BeefNoMeat
I hope someone chimes in with the technical assistance you're looking/asking for. Good luck

Thanks. We'll see if anyone comes forward to help or with any suggestions.

I'm happy to have a go a technical things, but Java and things like that are a bit beyond my (rather dated) comfort zone...

edit on 12-3-2016 by IsaacKoi because: (no reason given)

ForteanOrg

posted on Mar, 12 2016 @ 02:33 PM

link

a reply to: IsaacKoi

You'll have significant problems, no matter how you do it, as it is quite difficult to convert handwritten texts, regardless which type of software you use. For typed texts you could use a chain of tools to convert the PDF's into a format that is suitable for OCR scanning software. Tesseract is by far the best solution if you use it in tandem with a 'language pack', in this case the Swedish language pack (and I checked: they have one).

To get you started some links:

This one
This one

I will try to fiddle a bit with your example, convert, unpaper and tesseract. Don't get your hopes up..

edit on 12-3-2016 by ForteanOrg because: he found a bug

ForteanOrg

posted on Mar, 12 2016 @ 02:52 PM

link

a reply to: ForteanOrg

Just tested a bit with your example, but haven't achieved good results. Do you have a high resolution version of that scan?

IsaacKoi

posted on Mar, 12 2016 @ 02:55 PM

link

originally posted by: ForteanOrg
You'll have significant problems, no matter how you do it, as it is quite difficult to convert handwritten texts,

I don't expect to succeed with the handwritten text - but there is fair amount of typed text. (Not as much as in collections of modern official documents, but still about - say - 50% of the material).

I've used FineReader for OCR and got acceptable results already on the typed text (with its Swedish language recognition pack) - and some of the handwritten material.

I'm now focusing just on the translation.

edit on 12-3-2016 by IsaacKoi because: (no reason given)

IsaacKoi

posted on Mar, 12 2016 @ 02:57 PM

link

originally posted by: ForteanOrg
Just tested a bit with your example, but haven't achieved good results. Do you have a high resolution version of that scan?

The document images were sourced from the website at the link below, but I might be able to obtain the original scans (which may or may not be higher resolution...) from the researchers involved (particularly Clas Svahn). However, as I mentioned above, I have acceptable OCR results already ("acceptable" being a relative term, given the age/nature of the source documents). Although I'd always welcome better OCR results, I am focusing more on the translation issue.
ghostrockets.portaplay.dk...

I'll paste below one of the English language newspaper articles in the collection with the existing FineReader results. If I could have similar quality translations of the Swedish material by quickly using an automated translation tool then I'd probably just move on to my next project rather than wanting to spend much of my time (or asking others to spend much of their time...) improving the OCR results (unless such an improvement would be relatively quick and easy). But if I can't translate the Swedish documents, I'm loathe to spend much time on improving the OCR results for them...

files.abovetopsecret.com...

Stockholm . Sept. 26 (INS).— Rus
sian rockets w hich have been ap
pearing over Sw eden since early
last June are of an astonishing
new type which
S w e d i s h air
f o r c e inform
ants assert are
far superior to
the German V-I,
V-2 and sim ilar
Am erican pro
jectiles.
An air force (
officer, one of
the few Sw edish
officials who w ill
C onsidine talk about the
rockets, said today they perform
with a "partly human, partly
fiendish intelligence.'’
The rockets have been com ing
over in several shapes, all suffi
ciently large to contain an atom ic
bomb warhead. The m ost efficient
design appears to resem ble the V-l
buzz bomb, som ewhat enlarged, but
with a startling versatility.
Though dozens of these bombs
have streaked over Sw eden by
conservative estim ates, none has
left a crater. Those which landed
sim ply disappeared.
This spooky evanescence is the
source of chief worry, especially be
cause no explosions have been re
ported. Nor has there been any
trace of debris or bomb pits at the
spots over which the rockets were
last seen.
“The only explanation for this
phenom enom ,” said the officer, "is
that the rockets may be able to con
sum e them selves com pletely at a
controlled mom ent.

edit on 12-3-2016 by IsaacKoi because: (no reason given)

ForteanOrg

posted on Mar, 12 2016 @ 03:07 PM

link

a reply to: IsaacKoi

This any good?

Or perhaps Cafetran, which is Java based.

edit on 12-3-2016 by ForteanOrg because: he added some espresso

IsaacKoi

posted on Mar, 12 2016 @ 03:47 PM

link

originally posted by: ForteanOrg
This any good?

Or perhaps Cafetran, which is Java based.

Many thanks. Both look interesting. I'll try each one.

(The real challenge is avoiding a file size limit such as the 1MB limit that applies on Google Translate. I'll check what limit, if any, applies to Apertium and Cafetran at the links you helpfully provided).

ForteanOrg

posted on Mar, 12 2016 @ 03:53 PM

link

a reply to: IsaacKoi

Cafetran has some limits built in, which can be removed if you pay them some money. But you only need to do that after you evaluated the quality of the product. They offer a one year license for € 80 or an unlimited license for € 200. You'll need Java 8.

IsaacKoi

posted on Mar, 12 2016 @ 04:09 PM

link

originally posted by: ForteanOrg

This any good?

Mmm. It looks like that website (Apertium) can currently only translate Swedish into Danish and Icelandic at the moment.

I'll have a look at Cafetran next.

edit on 12-3-2016 by IsaacKoi because: (no reason given)

IsaacKoi

posted on Mar, 12 2016 @ 05:02 PM

link

originally posted by: ForteanOrg
Or perhaps Cafetran, which is Java based.

Sadly, it looks like Cafetran is a tool for managing translation projects rather than an automatic translation tool.

I still haven't found a working alternative to the document translation tool at onlinedoctranslator.com (nor have I managed to get that Java tool to work on a Windows 10 computer, whether using Google Chrome, Microsoft Edge or Firefox).

ArMaP

posted on Mar, 12 2016 @ 06:08 PM

link

To do automatic translations of large amounts of text (thousands of records from historical archives) I have been using for some time Moses for Mere Mortals.

The problem with MMM is that it only works in Linux and it takes some time to get good results.

The biggest advantages are that it doesn't have any size limit for the text and it can translate any language (I think only language that are written from left to right, but I'm not sure) for which you give it training text.

As the system I use is only trained for translating from Portuguese to Spanish and English I cannot try some of your text.

DexterRiley

posted on Mar, 12 2016 @ 06:29 PM

link

Hello Barrister.

I was able to get onlinedoctranslator.com to run on my Windows 7 computer. I used version 43.0.4 of Firefox with the latest version of Java. The only other trick was to change a configuration option. On the security tab there is a setting to configure exceptions. Click that link and add onlinedoctranslator.com. When you run the translation app make sure to press the allow button. In my case it popped up about 4 times.

But it didn't like the PDF files, it just seemed to hang.

I was able to export text from the small example file via Adobe PDF reader. It appeared to go through the translation applet. But the result was trash. Then I looked at the input text as exported from the reader, but there was basically nothing there.

I then did a copy and paste of some text data from the 42 MB file into a standard text file. Then I compared it to the text from the original image. But, it appears that the OCR of the Swedish text is mostly garbage. It only caught about 1/2 - 2/3 of the actual letters. It completely missed all of the umlaut characters; it mostly recognized them as the letter "d". And then there is a smattering of numbers and punctuation in odd places.

In the end I don't think you'll be able to use your current data without cleaning it up a lot more.

Sorry...
- dex

ArMaP

posted on Mar, 12 2016 @ 06:47 PM

link

I forgot to say that Moses works only with text files, not PDFs.

IsaacKoi

posted on Mar, 14 2016 @ 04:21 PM

link

originally posted by: ArMaP
To do automatic translations of large amounts of text (thousands of records from historical archives) I have been using for some time Moses for Mere Mortals.

The problem with MMM is that it only works in Linux and it takes some time to get good results.

As always, thanks for your input ArMaP. However, Linux is a bit outside my comfort zone.

I'm a Windows man. Always have been.

DexterRiley

posted on Mar, 14 2016 @ 04:38 PM

link

a reply to: IsaacKoi

I played around with some OCR stuff last night. No joy yet.

What's the possibility of getting some of the raw scans? What I found on the Swedish website wasn't usable in my experiments.

-dex

IsaacKoi

posted on Mar, 14 2016 @ 04:40 PM

link

originally posted by: DexterRiley
I then did a copy and paste of some text data from the 42 MB file into a standard text file.

...

In the end I don't think you'll be able to use your current data without cleaning it up a lot more.

Thanks Dex.

I had tried converting the larger (30-50MB) PDF files into text files, but Adobe Acrobat crashed when trying to save them as Microsoft Word doc files or a Rich Text Format rtf text file.

However, your trick of cutting and pasting text data from the PDF files allowed me to extract the OCR results without Adobe Acrobat crashing. When pasted into a Microsoft Word doc file or an RTF file, the file size is below 1MB (i.e. below the file size limit for uploading to Google Translate).

So, I'm now able to run the OCR results from each large PDF file through Google Translate.

Unfortunately, as you found, the results of the Google Translate translations of the official Swedish documents are currently largely gibberish.

In case anyone is interested, I've uploaded the results of the automated translation of the first 200 pages of the files to the link below. Before anyone bothers downloading that file, have a glance at the sample results below.
we.tl...

Sample 1 (typical poor result):

Sample 2 (one of the better results, relatively speaking)

Sample 3 (one of the better results, relatively speaking)

Now that I know a way of uploading hundreds of pages in a single operation, I'm happy to try to get better OCR results to see if the translations improve.

I think that I was too optimistic about how forgiving Google Translate may be of OCR results. I thought that since the English results are fairly readable, Google Translate may be able to cope with similar quality Swedish OCR text. I now think that relatively minor problems with the English OCR results (in particular, the random spaces in the middle of words which the human eye/brain can almost ignore when looking at the OCR results of an English document) cause Google Translate severe problems when uploaded to it without any improvement/correction.

If I can get better resolution copies of the scanned images, I may try the above (relatively quick and painless) cut and paste method again.

So, thanks for the input guys.

I'll contact the some of the members of the relevant Swedish team and ask for better scans (if available). I've had contact with several of them in the past and found them helpful and cooperative - but very busy and hence sometimes tending to give delayed responses.

edit on 14-3-2016 by IsaacKoi because: (no reason given)

IsaacKoi

posted on Mar, 14 2016 @ 04:56 PM

link

originally posted by: DexterRiley
What's the possibility of getting some of the raw scans? What I found on the Swedish website wasn't usable in my experiments.

I'll see what I can get.

(I've been working on an unrelated UFO project during the last few months with some of the Swedish team. They are friendly but rather busy people).

DexterRiley

posted on Mar, 14 2016 @ 05:33 PM

link

a reply to: IsaacKoi

I'm glad you were able to make some headway with the translation effort. It strikes me that having a reliable tool set to scan, OCR, and translate foreign language UFO documents could make a big difference in global UFO research efforts.

When you are able to get some better scans let me know. I'd like to work on this a bit more to see if we can establish a process to streamline and enhance the digitization effort.

It also occurred to me that even an imperfect translation could be useful as a means of indexing the location of pertinent data for researchers. A researcher may be able to locate an applicable document using either a unique search term, or a dense grouping of related, but less unique, terms. Once the specific document is found, a more concerted attempt at an accurate translation could then be mounted. For instance, the researcher could forward the scanned document to the Swedish Embassy.

Unfortunately the current quality of OCR probably won't provide a detailed enough set of searchable terms. The biggest problem is with the OCR system's inability to recognize diacritical marks. Maybe there's a way to clean up the scan images themselves to get better results.

-dex

Java tool + automatic translations of PDFs (Google Translate etc) - Any help? Foreign UFO documents