Gothic Text Problem, page 1

posted on Jun, 11 2016 @ 09:32 AM

link

Recently I downloaded the text of a book from archive.org. It is an old German language book about politics from the 1930s. Archive.org offers downloads in several formats including plain text.

At first I downloaded the .pdf version of the book.

However, I couldn't copy and paste words and sentences into Google translate because the .pdf was made up of page "images".

I then downloaded in the .pdf format with "text". This gave be the ability to copy and paste but because the book had been type set in Gothic or Fraktur type, when I tried to paste a copied paragraph, the result looked like this:

t^enispanasiatischerBewegvng * Potlrfsche Brennpunkt^ la Unruhegablete ^ tiommunistisch -panasjatlsma Aufstandsraui-ne 5 Unnihaherde das Islam Utt.6 2)te SWonfunianBct bt-ett ftanbe, »on bem ant bie .Scaftjlcome teilweife gelenEt, minbeflen^ ber ub*
tigen SBelt begreiflic^ gemacT;* werben fonnen; baju gepct in etflet Sinie geopo:= litifc^ci, in jttieitet S)ol£gpolitifc^e ©c^ulung unb

Downloading in plain text produced the same result.

Does anybody know how to convert Gothic or Fraktur text into modern text suitable for use in Google translate? As usual, all help is appreciated.

edit on 11-6-2016 by ipsedixit because: (no reason given)

InTheLight

posted on Jun, 11 2016 @ 09:46 AM

link

Why don't you find a PDF conversion software that converts it to the text you require?

mysterioustranger

posted on Jun, 11 2016 @ 10:07 AM

link

a reply to: ipsedixit

***Deleted by poster* I see the issue. Sorry.

edit on 11-6-2016 by mysterioustranger because: delete original

ipsedixit

posted on Jun, 11 2016 @ 11:24 AM

link

a reply to: InTheLight

The issue is the code used to render Gothic type on the computer. If you convert to text you just get the code in modern English type. I'm looking for a way to bulk convert this code to regular modern type.

The "work around" is simply to learn to read the Gothic type and then manually type it into Google translate, converting to modern type as you go. I may have to do this because I couldn't find a conversion program even on a German blog where the question came up.

It's just a lot of work, which I probably shouldn't shirk since it is a good way to thoroughly familiarize myself with this kind of text (Nazi stuff is all in it) and to learn German.

edit on 11-6-2016 by ipsedixit because: (no reason given)

BigBrotherDarkness

posted on Jun, 11 2016 @ 12:09 PM

link

a reply to: ipsedixit

Are you set to unicode on your machine and browser or what? Encoding is why youre seeing well coding and not translation.

edit on 11-6-2016 by BigBrotherDarkness because: (no reason given)

ipsedixit

posted on Jun, 11 2016 @ 03:28 PM

link

a reply to: BigBrotherDarkness

How do you set the machine and browser to unicode? I've never heard of doing this. I don't think it is the problem but I'm no computer whiz.

BigBrotherDarkness

posted on Jun, 11 2016 @ 04:05 PM

link

a reply to: ipsedixit

Well cut and paste is the default text youve chosen... for the comp. Then you ave browser customization with yet another text typical... then you have the text you wanna snatch for translate then the PDF text. My advice is find out the text standard of the translator with an F12 on the keyboard if its an online sort... then go into your text make things custom and pretty for the computer by right clicking under windows and looking for whatever Its called cant recall im on an a limited ajax machine im just going to assume youre not on Linux as comp savy are its typical users not saying you arent a savy user but tats my lining of excuse for not recalling linux either... dang ive been hacked off reg machines for too long at tis point do be brain farting but... basically find the translator text with the F12 then change the computer settings via custom pretty where screen saver and crap is, then go into browser setting and look for font there and match it too, then of course match the translator so out put wont be all codey and off... because each ting has to code and uncode from cut and paste to whatever comp is then into translator whatever that is and into the translator whatever that is and finally hope it comes out right... which of course likely wont... last ditch if it doesnt work that way set everything to the text you are cutting and pasting.

Of course you can also use boolean operators do a search for that and use file extensions and hopefully someone has already done it... so basically going in a search bar putting quotes around the title with _ between the spaces sometimes and things like .pdf or .txt etc as extension please note a lot of times the title is coded and man or man do I wish * was a boolean operator like in dos to be a wildcard for anything after it.

Ceeker63

posted on Jun, 11 2016 @ 05:09 PM

link

You could also try www.babelfish.com... for translation.

ipsedixit

posted on Jun, 11 2016 @ 06:38 PM

link

a reply to: Ceeker63

Babelfish doesn't work. It behaves the same way that Google translate does. It can't interpret the ASCII code for the Gothic type face.

BigBrotherDarkness: You are way over my head, but thanks for the response.

ArMaP

posted on Jun, 11 2016 @ 06:50 PM

link

It does sound like an encoding problem. It would be easier if you said what book it is, so other people could try it.

ipsedixit

posted on Jun, 11 2016 @ 06:58 PM

link

a reply to: ArMaP

The title of the book is Weltpolitik Von Heute by Karl Haushofer. Roughly translated: "World Politics Today". According to Webster Tarpley it is typical Nazi style political thinking on the subject of lebensraum, if memory serves.

Here is a link to the text:

archive.org...

ArMaP

posted on Jun, 12 2016 @ 05:42 AM

link

a reply to: ipsedixit

After taking a look at those files it looks like the text was recognised with the wrong language, so it's natural that it's full of recognition errors.
If that's the case then it's a bad work and useless.

ipsedixit

posted on Jun, 12 2016 @ 07:38 AM

link

a reply to: ArMaP

I think the only way this problem can be solved is with proprietorial optical character recognition software made by ABBYY.

www.frakturschrift.com...:why_gothic_ocr

Without the special OCR the text looks like this:

ABBYY's product makes it look like this:

BigBrotherDarkness

posted on Jun, 12 2016 @ 08:07 AM

link

a reply to: ipsedixit

Well the encoding being out and its the same letters... heres the free work around... do a simple find and replace from the character thats wrong into the one you know it should be or to the group of letters it wants to chop into something else.

Looks like a style sheet problem the more I look at it as the copyright symbol being seen as that instead of the styling of the letter... or art of it. Where AI goes by imaging a lot these days searching for pictures... im sure there are vast swaths of scanned books like this that they cant read or misinterpret all the time when looking at them.

So they obviously dont have access to all knowledge even though chatbot is designed to grow and explore the entire web using google it makes many of its books using te picture text kinda off limits by default or accident.

Ill look and see if I can source the title and write you out the link if it has already been done... ill let you know if I dont find any as well so youre not left hanging.

BigBrotherDarkness

posted on Jun, 12 2016 @ 08:26 AM

link

Found it on scribed or at least a version of it with added comentary by a fellow named Burt Chapman... e canged the title of course. I cant cut and paste from my machine and hand writing a link out when encrypted takes forever... but someone has already done the work just added their own commentary and projection forward as per usual in out of print or date books.

ArMaP

posted on Jun, 12 2016 @ 08:34 AM

link

originally posted by: BigBrotherDarkness
Well the encoding being out and its the same letters... heres the free work around... do a simple find and replace from the character thats wrong into the one you know it should be or to the group of letters it wants to chop into something else.

As someone that has done this type of work before, I can tell you that, although that may work in 80% of the cases, it will create new problems with words that become harder to understand and will leave many errors that cannot be corrected automatically.

Looks like a style sheet problem the more I look at it as the copyright symbol being seen as that instead of the styling of the letter... or art of it. Where AI goes by imaging a lot these days searching for pictures... im sure there are vast swaths of scanned books like this that they cant read or misinterpret all the time when looking at them.

It's an OCR problem. Most modern OCR programs work with dictionaries, so they can try to recognise a word, see if that word exists and, if it doesn't exist in the dictionary they will try a similar looking one that does. When they do not have a dictionary or are working with the wrong dictionary (as one of the XML files on the download page appear to indicate), the program doesn't know what to do with the recognized words or letters.

For working with specific types of letters, like in this case, the programs usually need to be trained for them, as the specific way of writing those letters is unknown to them.

So they obviously dont have access to all knowledge even though chatbot is designed to grow and explore the entire web using google it makes many of its books using te picture text kinda off limits by default or accident.

If you look at the books available on Google Books you will find may cases like this, even in English, as the automatic systems are not good enough to recreate a text. Yes, they may be 99.9% accurate on a good image, but on a bad scan of an old book that falls to something like 80% or 70%.

BigBrotherDarkness

posted on Jun, 12 2016 @ 09:00 AM

link

a reply to: ArMaP

I have too its super tedious and a labor of love as a passion or else it like screw this... so yeah I know what youre talking about... with that and the bad scans that impact of the press created a shadow effect as many of these old books were type set and pressed in tooling as each letter was set in by hand or a common word set and when imaged then that shadow was picked up if off a photo copier as it was a reversible negative with that opps tons of black toner without something covering the excess glass that over exposed it...

Of course these days Ive seen it done more of a square setup with a tripod over the top to digitially image two pages at once with a very good digital camera.

Of course a secluded reading room would be a good place to set such up in many libraries might catch fornication going on in there at times but hey college experience eh? I used to pour through all the old Eastern Religion stuff way back in the day and some of the old handwritten musical sheets. The age was not only in sight but often smell of these things too but not like old decay but old material smells like inks and pulps...

In such a way books truely are alive. I cant read a book in digital to save my life... its obnoxious actually even trying to.

edit on 12-6-2016 by BigBrotherDarkness because: (no reason given)

ArMaP

posted on Jun, 12 2016 @ 02:50 PM

link

I just tried with an old version of Abbyy Fine Reader I have but it didn't recognize those letters, resulting in many strange characters.

ipsedixit

posted on Jun, 13 2016 @ 07:23 AM

link

a reply to: ArMaP

I guess it would take the latest version, which I think is an on line paid "service". The text has to be acceptable for look up in Google translator for it to be useful to me. I think I am going to have to simply type the entire thing out for myself, interpreting the Gothic type as I go.

BigBrotherDarkness

posted on Jun, 13 2016 @ 08:39 AM

link

Good luck and I hope you enjoy the journey of it, trying to not keep the goal in mind while on the journey should help cease procrastination towards completion page by page, step by step and itll be done before you know it with a set goal of a certain amount of pages for that day.

I used to get writers block by having a full outline or framework already lain out when writing novella trying to flesh it in and guide it piece by piece until letting the story just flow or unfold on its own and try to catch up as things try to run ahead of typing speed.

Of course this is a lot different but the so many pages is basically how many writers get through it unless its so compelling chasing the story along they have to force themselves to stop, I suppose thats why some writers get termed as prolific though...

Have a nice day.