Enhanced OCR for Japanese

  • Status: Closed
  • Prize: £300
  • Entries Received: 1
  • Winner: WangJing0612

Contest Brief

I have a printed Japanese language intonation dictionary that I would like to convert into a machine readable file format.

One difficulty that prevents simple OCR is the special intonation 'guides' printed above the readings. I've attached a scan of some pages from the dictionary as examples.

Here are the first four entries from the dictionary, including the marks printed above the text that indicate intonation.

 ─
アー (~言う, ~した) →61

─┐
アー (~は行かない, ~だこうだ) →76, 86a 【感】(~驚いた) →66

 ─────
アークトー arc 灯 →15

 ───┐ ─┐
アーケード,アーケード arcade →9

And one more (from page 1) to highlight the use of handakuten (カ゚キ゚ク゚ケ゚コ゚):

アイキョーケ゚ン (間狂言) →15
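
A note on counting characters in readings like the one above: the sketch below assumes the handakuten is stored as the combining character U+309A following its base kana (OCR output may encode it differently). Kana such as ケ゚ have no single precomposed Unicode character, so Unicode normalization will not merge them, and a plain length count sees one more code point than there are printed characters. The helper name `kana_units` is hypothetical.

```python
import unicodedata


def kana_units(reading: str) -> list[str]:
    """Split a reading into printed characters, keeping any combining
    mark (such as U+309A, the combining handakuten) with its base kana."""
    units: list[str] = []
    for ch in reading:
        if units and unicodedata.combining(ch):
            units[-1] += ch  # attach the mark to the preceding kana
        else:
            units.append(ch)
    return units


# The handakuten example, with the combining mark written as an escape.
reading = "アイキョーケ\u309aン"
print(len(reading))              # 8 code points
print(len(kana_units(reading)))  # 7 printed characters
```

Counting printed characters this way keeps the reading aligned with a per-character H/L pattern of the same length.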

Each entry in the dictionary has the following pattern:

1- one or more readings (with intonation above) separated by commas
2- a space
3- one or more word data sections

Word data is:

a- the word
b- descriptive data about the word
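
The entry pattern above can be sketched as a small parser. This is a minimal illustration, assuming the OCR step has already produced one text line per entry and that the first run of whitespace separates the readings from the word data; the function name `split_entry` is hypothetical, and separating the word from its descriptive data is deliberately left out.

```python
def split_entry(line: str) -> tuple[list[str], str]:
    """Split one entry line into its readings and the raw word data.

    Assumes readings are comma-separated and contain no whitespace,
    and that everything after the first whitespace run is word data.
    """
    parts = line.strip().split(None, 1)  # split on the first whitespace run
    if not parts:
        return [], ""
    readings = [r for r in parts[0].split(",") if r]
    word_data = parts[1].strip() if len(parts) > 1 else ""
    return readings, word_data


readings, word_data = split_entry("アーケード,アーケード arcade →9")
print(readings)
print(word_data)
```

Splitting only on the first whitespace run keeps the word data intact, including any spaces and commas inside it.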

I need the above turned into a file with three columns:

1- the reading with intonation
2- the intonation pattern (H: High, L: Low)
3- the word(s) only (no descriptive data)

For the five entries above, the result would be:

アー {LH} (~言う, ~した)
アー {HL} (~は行かない, ~だこうだ)
アークトー {LHHHH} arc 灯
アーケード,アーケード {HHHLL,HLLLL} arcade
アイキョーケ゚ン {HHHHLLL} (間狂言)
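
As an illustration of the target format, the sketch below assembles one output line and checks that each pattern has one H or L per printed character of its reading, which holds for all five examples above. It assumes cross-references always look like "→61" or "→76, 86a" and that a handakuten is stored as the combining character U+309A; the names `strip_refs`, `unit_count` and `format_row` are hypothetical. Deciding which word data sections to keep when an entry has several is not handled here.

```python
import re
import unicodedata

# Assumption: a cross-reference is an arrow followed by one or more
# comma-separated numbers, each with an optional letter suffix.
REF = re.compile(r"\s*→\s*\d+[a-z]?(?:\s*,\s*\d+[a-z]?)*")


def strip_refs(word_data: str) -> str:
    """Remove '→NN' style cross-references from the word data."""
    return REF.sub("", word_data).strip()


def unit_count(reading: str) -> int:
    """Count printed characters, ignoring combining marks such as U+309A."""
    return sum(1 for ch in reading if not unicodedata.combining(ch))


def format_row(readings: list[str], patterns: list[str], words: str) -> str:
    """Build one three-column output line, validating the patterns first."""
    if len(readings) != len(patterns):
        raise ValueError("need exactly one pattern per reading")
    for reading, pattern in zip(readings, patterns):
        if set(pattern) - {"H", "L"}:
            raise ValueError(f"pattern {pattern!r} may only contain H and L")
        if len(pattern) != unit_count(reading):
            raise ValueError(f"pattern {pattern!r} does not fit {reading!r}")
    pattern_column = "{" + ",".join(patterns) + "}"
    return " ".join([",".join(readings), pattern_column, words])


row = format_row(
    ["アーケード", "アーケード"],
    ["HHHLL", "HLLLL"],
    strip_refs("arcade →9"),
)
print(row)
```

The length check is a cheap way to flag entries where the intonation marks or the reading were misrecognized.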


I've tried to keep the above simple, but I'm sure I am missing some edge cases.

Please don't hesitate to ask if anything is not clear or you have questions.

I don't know much about OCR, so I am hoping you will be able to help me figure out the following:

1- Is what I am asking for possible?
2- How long will it take (or how much will it cost)?

I have two main questions for this project:

Firstly, I understand the basics of OCR, but I don't see how OCR can get the intonation information from the scans. Can you describe in a bit more detail how you would achieve this?

Secondly, I'm worried about extracting data for the third column.

From the samples you can see that there is a lot of extra information in the dictionary's third 'column'. For me it's garbage since I just want the word(s).

Will it be possible to extract the words from all that 'garbage', and if so how do you propose to do it?

Employer Feedback

“He is very nice and super quick”

anhanh1122, Vietnam
