split text file per character
Posted: Mon Dec 03, 2007 8:11 pm
by YST
How can I split a text file by character count?
The file is in Chinese, so the resulting files must still be readable after splitting.
How can I do that?
Thanks in advance.
Posted: Tue Dec 04, 2007 8:17 am
by DataMystic Support
Convert the file to UTF-8 with TextPipe first. Then only break on characters \x00-\x7f, as all other bytes belong to multi-byte characters.
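To illustrate the advice above outside TextPipe, here is a minimal Python sketch. It assumes the data is already UTF-8; the helper name cut_point and the sample string are made up for the example. It shows that a blind cut at a byte offset can land inside a Chinese character, while backing up to the nearest \x00-\x7f byte always gives a clean cut.

```python
def cut_point(data: bytes, limit: int) -> int:
    """Return the largest index <= limit whose byte is ASCII (0x00-0x7F).

    Cutting just before an ASCII byte can never split a multi-byte
    UTF-8 character. Returns len(data) if limit is past the end, and 0
    if the search reaches the start of the data.
    """
    if limit >= len(data):
        return len(data)
    i = limit
    while i > 0 and data[i] > 0x7F:
        i -= 1
    return i


# Hypothetical sample: 20 bytes, 3 per Chinese character plus "," and "\n".
data = "你好,世界\n再见".encode("utf-8")

# A blind cut at byte 11 lands inside a character and cannot be decoded.
try:
    data[:11].decode("utf-8")
    blind_cut_ok = True
except UnicodeDecodeError:
    blind_cut_ok = False

# Backing up to the nearest ASCII byte (the comma) gives a clean cut.
pos = cut_point(data, 11)
head, tail = data[:pos].decode("utf-8"), data[pos:].decode("utf-8")
print(blind_cut_ok, pos, head, repr(tail))
```

Note that this only guarantees readable pieces. It does not give pieces of equal length, because the ASCII bytes may be far apart in Chinese text.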
Posted: Tue Dec 04, 2007 11:41 am
by YST
How about counting the Chinese characters and splitting after every 1000 characters? \x00-\x7f means looking for some specific characters, but that is not what I mean.
I have converted the file to UTF-8 and then split it at 2100 bytes, but the files become unreadable because some characters turn into gibberish!
Posted: Tue Dec 04, 2007 12:43 pm
by DataMystic Support
Try splitting using this pattern:
([\x00-\x7f][^\x00-\x7f]*){1000}
Posted: Thu Dec 06, 2007 9:31 pm
by YST
I tried to split at that pattern, but an error occurred: "regular expression is too big"! Then I split at ([\x00-\x7f][^\x00-\x7f]*) with "after this many: 1000". This time the garbage characters disappeared (but a few files still show ??? when converted back from UTF-8 to BIG5).
After splitting, the files have about 50000 characters each, which is not what I want. After counting, the file sizes seem irregular: some have only 1 line, some are blank, and some have around 1000 characters. Only with "after this many: 30" do I get files of around 1000 characters, but not exactly: some are blank, some have more or fewer characters than others, some have only 100 characters, and some have 50 to 60. With "after this many: 1000", each file contains over 50000 characters. Why is that?
Thank you very much for your reply.
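For comparison, an exact split of 1000 characters per piece can be sketched in Python by decoding the text first and slicing by characters instead of bytes. This is not a TextPipe feature, just an illustration; split_by_chars is a made-up name and the sample text is generated for the example.

```python
def split_by_chars(text: str, size: int = 1000) -> list[str]:
    """Split decoded text into pieces of exactly `size` characters.

    Only the last piece may be shorter. Slicing a decoded string works
    on whole characters, so no piece can end in the middle of one.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample: 2500 Chinese characters.
text = "中文" * 1250
chunks = split_by_chars(text, 1000)
sizes = [len(c) for c in chunks]
print(sizes)

# Every piece encodes to valid UTF-8 on its own.
encoded = [c.encode("utf-8") for c in chunks]
```

For a real file, the text would first be read with the correct encoding, for example open(path, encoding="utf-8").read(), and each piece written back with the same encoding.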
Posted: Fri Dec 07, 2007 9:04 am
by DataMystic Support
Please email a sample file to our support address
Posted: Sun Dec 16, 2007 9:17 pm
by YST
Hi,
When a text is split every 1000 characters, how can I keep a blank text file from being written to the output path?
For example: 1.txt, 2.txt, 3.txt
Each is split at 1000 characters using the regular expression pattern you mentioned above (before or after), but 3.txt has only blank content. How can I split so that a blank text is not output, leaving only 1.txt and 2.txt? (I want to compile them into a CHM, but my application has problems when it encounters a text file with blank content.)
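One way to get this effect outside TextPipe is to check each piece before writing it. The following is a rough Python sketch under that assumption; write_chunks, the output folder and the sample pieces are all made up for the example. A piece counts as blank if it is empty or contains only whitespace, and the output files are numbered without gaps.

```python
import tempfile
from pathlib import Path


def write_chunks(chunks: list[str], out_dir: str) -> list[Path]:
    """Write each non-blank chunk to 1.txt, 2.txt, ... inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for chunk in chunks:
        # strip() also removes line breaks and the full-width space U+3000.
        if not chunk.strip():
            continue
        path = out / f"{len(written) + 1}.txt"
        path.write_text(chunk, encoding="utf-8")
        written.append(path)
    return written


# Hypothetical pieces: the last one holds only whitespace.
pieces = ["第一段", "第二段", " \n\u3000"]
with tempfile.TemporaryDirectory() as folder:
    names = [p.name for p in write_chunks(pieces, folder)]
print(names)
```

Here the third piece is skipped, so only 1.txt and 2.txt are created.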