Japanese: word count
Posted: Sun Oct 12, 2008 1:28 am
by kurochyan
Hi, I really need your help. I have a problem counting words in Japanese text. For small files it is easy to count with MS Word, but my data is too big: around 2 GB in text format. I am very new to TextPipe and feel like a "newbie", so I really need help. The problem is that Japanese text does not separate words with spaces as English does, so a sentence looks like this: "日本語の文には区切りがありません". Can anybody help me?
Re: Japanese: word count
Posted: Mon Oct 13, 2008 8:34 am
by DataMystic Support
Sorry, TextPipe's word count is designed for text with spaces between words, stored in ANSI or UTF-8.
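To illustrate why a space-based count cannot work here, the following is a minimal Python sketch of whitespace word counting. It is not TextPipe's implementation; the function name `whitespace_word_count` is made up for this example.

```python
def whitespace_word_count(text: str) -> int:
    """Count words by splitting on runs of whitespace (hypothetical helper)."""
    return len(text.split())


english = "This sentence has five words"
japanese = "日本語の文には区切りがありません"

# English words are separated by spaces, so the count is meaningful.
english_count = whitespace_word_count(english)

# The Japanese sentence contains no spaces, so the whole sentence
# is treated as a single "word".
japanese_count = whitespace_word_count(japanese)

print(english_count, japanese_count)
```

Any counter built on this idea will report one word per unbroken run of Japanese text, regardless of how many words it actually contains.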
Re: Japanese: word count
Posted: Sat Nov 22, 2008 12:45 am
by tahoar
Counting words requires that words be identified by boundaries, or "segmented." European languages use spaces. Arabic scripts change the form of the characters. East Asian languages such as Chinese, Japanese and Korean don't have a consistent method, and Thai doesn't do it at all. Microsoft developed some rudimentary segmentation technology in MS Word for these languages, but it is very inaccurate. Even the best computational linguists are still struggling to do with a CPU what educated humans do naturally. Just search Google for "japanese word segmentation algorithm".
There is no simple solution.
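For a rough idea of what a segmentation algorithm looks like, here is a minimal Python sketch of one of the simplest approaches, greedy longest-match against a word list (often called forward maximum matching). Everything in it is an assumption made for illustration: the names `segment`, `count_words` and `TOY_VOCAB` are hypothetical, and the tiny vocabulary only covers the example sentence from the first post. It is not how MS Word or TextPipe works, and it will be inaccurate on real text.

```python
def segment(text, vocabulary, max_len=None):
    """Greedy longest-match segmentation against a known word list."""
    if max_len is None:
        max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            # Whitespace is never a word; skip it.
            i += 1
            continue
        # Try the longest candidate first and shrink until the vocabulary
        # matches; an unknown single character becomes its own token.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def count_words(lines, vocabulary):
    """Sum token counts line by line, so a large file never sits in memory."""
    return sum(len(segment(line, vocabulary)) for line in lines)


# Toy vocabulary (an assumption): just enough words for the example sentence.
TOY_VOCAB = {"日本語", "の", "文", "に", "は", "区切り", "が", "あり", "ません"}

tokens = segment("日本語の文には区切りがありません", TOY_VOCAB)
print(tokens, len(tokens))
```

Because `count_words` accepts any iterable of lines, an open file handle (for example `open(path, encoding="utf-8")`) can be passed in directly and read one line at a time, which matters for data in the 2 GB range. The result is only as good as the word list: unknown words fall apart into single characters, and the greedy choice can split ambiguous strings in the wrong place. Dedicated morphological analyzers such as MeCab exist for exactly this reason.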
Re: Japanese: word count
Posted: Mon Nov 24, 2008 3:10 pm
by DataMystic Support
Whew! Thanks for getting us off the hook.