Japanese: word count


Moderators: DataMystic Support, Moderators

kurochyan
Posts: 1
Joined: Sun Oct 12, 2008 1:06 am

Japanese: word count

Post by kurochyan »

Hi, I really need your help. I have a problem counting words in Japanese text. It is easy to count them with MS Word when the file is small, but my data are too big: around 2 GB of text files. I am very new to TextPipe and feel like a newbie, so I really need help. The problem is that Japanese text has no spaces between words as English does, so a sentence looks like this: "日本語の文には区切りがありません" ("Japanese sentences have no delimiters"). Can anybody help me? :cry:
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Japanese: word count

Post by DataMystic Support »

Sorry, TextPipe's word count is designed for text that has spaces between words and is stored in ANSI or UTF-8.
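
To show why a space-based count cannot work here, this is a minimal Python sketch of generic whitespace counting. It is not TextPipe's actual implementation; the function name `count_space_separated_words` and the English sample sentence are made up for illustration.

```python
def count_space_separated_words(text: str) -> int:
    """Count words the naive way: split on runs of whitespace."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


english = "Japanese sentences have no delimiters"
japanese = "日本語の文には区切りがありません"

print(count_space_separated_words(english))   # 5
print(count_space_separated_words(japanese))  # 1
```

The whole Japanese sentence is reported as a single "word" because it contains no whitespace at all.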
tahoar
Posts: 8
Joined: Tue Sep 23, 2008 10:35 am

Re: Japanese: word count

Post by tahoar »

Counting words requires that words be identified by boundaries, or "segmented." European languages use spaces. Arabic scripts change the form of the characters. East Asian languages such as Chinese, Japanese and Korean don't have a consistent method, and Thai doesn't mark boundaries at all. Microsoft developed some rudimentary segmentation technology in MS Word for these languages, but it is very inaccurate. Even the best computational linguists are still struggling to do with a CPU what educated humans do naturally. Just search Google for "japanese word segmentation algorithm".

There is no simple solution.
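
If a rough estimate is good enough, one crude heuristic is to count runs of the same script (kanji, hiragana, katakana, other letters and digits) and treat each run as one unit. The Python sketch below is only that: a stand-in heuristic, not real word segmentation, and it will over- and under-count real words. All function names are hypothetical, the file is assumed to be UTF-8, and only the basic kanji block is treated as kanji. It reads the file line by line, so a 2 GB file does not have to fit in memory.

```python
from typing import Optional


def script_class(ch: str) -> Optional[str]:
    """Return a coarse script label, or None for spaces and punctuation."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # basic CJK Unified Ideographs block only
        return "kanji"
    if ch.isalnum():
        return "other"
    return None


def count_script_runs(line: str) -> int:
    """Count maximal runs of characters that share one script class."""
    count = 0
    previous = None
    for ch in line:
        current = script_class(ch)
        # A new run starts whenever the class changes to a non-None class.
        if current is not None and current != previous:
            count += 1
        previous = current
    return count


def count_runs_in_file(path: str, encoding: str = "utf-8") -> int:
    """Stream a text file line by line and sum the per-line run counts."""
    total = 0
    with open(path, encoding=encoding) as handle:
        for line in handle:
            total += count_script_runs(line)
    return total


print(count_script_runs("日本語の文には区切りがありません"))  # 6
```

For the sample sentence this gives 6 runs (日本語 / の / 文 / には / 区切 / りがありません), which already shows the weakness: the last run lumps several words together, and 区切り is split in two. For an accurate count you would need a proper morphological analyzer.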
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Japanese: word count

Post by DataMystic Support »

Whew! Thanks for getting us off the hook :-)