Japanese: word count
Moderators: DataMystic Support, Moderators
Japanese: word count
Hi, I really need your help. I have a problem counting words in Japanese text. For a small file it is easy to count with MS Word, but my data is too big: around 2 GB in text format. I am very new to TextPipe and feel like a "newbie", so I really need help. The problem is that Japanese text has no spaces between words as English does, so a sentence looks like this: "日本語の文には区切りがありません". Can anybody help me?
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
Re: Japanese: word count
Sorry, TextPipe's word count is designed for text in which words are separated by spaces, stored in ANSI or UTF-8.
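To illustrate why a space-based count breaks down here, below is a minimal Python sketch. It is not TextPipe's actual implementation; `count_space_delimited_words` is a hypothetical helper that simply splits on whitespace, applied to an English sentence and to the Japanese sentence from the question.

```python
def count_space_delimited_words(text: str) -> int:
    """Count words the way a whitespace-based counter would (hypothetical helper)."""
    # str.split() with no arguments splits on runs of any whitespace.
    return len(text.split())


english = "Japanese sentences have no separators"
japanese = "日本語の文には区切りがありません"

print(count_space_delimited_words(english))   # 5
print(count_space_delimited_words(japanese))  # 1: the whole sentence is one "word"
```

With no spaces to split on, the entire Japanese sentence is reported as a single word, however many words it actually contains.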
Re: Japanese: word count
Counting words requires that word boundaries be identified, that is, that the text be "segmented." European languages use spaces. Arabic scripts change the form of the characters. East Asian languages such as Chinese, Japanese and Korean don't have a consistent method, and Thai doesn't do it at all. Microsoft developed some rudimentary segmentation technology in MS Word for these languages, but it is very inaccurate. Even the best computational linguists are still struggling to do with a CPU what educated humans do naturally. Just search Google for "japanese word segmentation algorithm".
There is no simple solution.
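For a very rough estimate only, one crude heuristic is to count runs of characters belonging to the same script (kanji, hiragana, katakana, other letters and digits). The Python sketch below is an assumption-laden illustration, not a real segmenter: the function names are hypothetical, the Unicode ranges cover only the basic Hiragana, Katakana and CJK Unified Ideographs blocks, and the result can differ considerably from a true word count. It reads line by line, so a multi-gigabyte file would not need to fit in memory.

```python
from itertools import groupby
from typing import Iterable


def script_class(ch: str) -> str:
    """Classify a character by script (basic Unicode blocks only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isalnum():
        return "other_alnum"
    return "separator"  # whitespace, punctuation, symbols


def count_script_runs(text: str) -> int:
    """Count maximal runs of same-script characters, ignoring separators."""
    return sum(1 for cls, _ in groupby(text, key=script_class) if cls != "separator")


def count_runs_in_lines(lines: Iterable[str]) -> int:
    """Sum run counts over an iterable of lines, e.g. an open text file."""
    # A newline is a separator, so no run spans two lines.
    return sum(count_script_runs(line) for line in lines)


# Usage sketch (the file name and encoding are assumptions):
# with open("corpus.txt", encoding="utf-8") as f:
#     print(count_runs_in_lines(f))

print(count_script_runs("日本語の文には区切りがありません"))  # 6
```

The sample sentence yields 6 runs (日本語 / の / 文 / には / 区切 / りがありません). That already splits the word 区切り in the wrong place and lumps several words together in the final hiragana run, which is why this gives only a ballpark figure and why proper counts need a dedicated morphological analyzer.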
- DataMystic Support
Re: Japanese: word count
Whew! Thanks for getting us off the hook.