split text file per character

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

YST
Posts: 4
Joined: Mon Dec 03, 2007 7:55 pm


Post by YST »

How can I split a text file by character count?
The file is in Chinese, so the resulting files must still be readable after splitting.
How can I do that?

Thanks in advance :)
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

Convert the file to UTF-8 with TextPipe first. Then break only on bytes in the range \x00-\x7f, since all other bytes belong to multi-byte characters.
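
To illustrate why this rule is safe, here is a minimal Python sketch (this is not TextPipe itself, and the helper name is_safe_break is made up for illustration). In UTF-8, every byte of a multi-byte character has its high bit set, so a cut placed immediately before a byte in the range \x00-\x7f never lands inside a character.

```python
def is_safe_break(data: bytes, pos: int) -> bool:
    """Return True if cutting `data` just before index `pos` cannot split a character.

    Uses the conservative rule above: the byte at `pos` must be ASCII (0x00-0x7F),
    or `pos` must be the end of the data.
    """
    return pos == len(data) or data[pos] <= 0x7F


sample = "中文,text".encode("utf-8")

# The two Chinese characters occupy the first six bytes, all of them >= 0x80.
chinese_bytes = sample[:6]
print(all(b >= 0x80 for b in chinese_bytes))

print(is_safe_break(sample, 6))  # just before the ASCII comma
print(is_safe_break(sample, 1))  # inside the first Chinese character
```

The rule is stricter than necessary (a cut before the first byte of a Chinese character would also be safe), but it can be expressed with a simple character class.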
YST
Posts: 4
Joined: Mon Dec 03, 2007 7:55 pm

Post by YST »

How about counting 1000 Chinese characters and then splitting after every 1000 characters? \x00-\x7f means looking for some specific characters, but that is not what I mean.

I converted the file to UTF-8 and then split at 2100 bytes, but the files become unreadable because some characters turn into gibberish! :cry:
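
The gibberish is what one would expect from a cut at a fixed byte offset: a Chinese character usually takes three bytes in UTF-8, so a cut at byte 2100 can land in the middle of one. Below is a minimal Python sketch of a byte-limited split that backs up to a character boundary. It runs outside TextPipe, and split_utf8_by_bytes is a hypothetical helper name.

```python
def split_utf8_by_bytes(data: bytes, max_bytes: int) -> list[bytes]:
    """Split UTF-8 bytes into chunks of at most max_bytes without cutting a character."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (the longest UTF-8 character)")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Continuation bytes have the form 0b10xxxxxx; back up until the chunk
        # would end just before a byte that starts a character.
        while start < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:  # malformed input: fall back to a hard cut
            end = min(start + max_bytes, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks


# One ASCII letter shifts the Chinese characters off the 3-byte grid,
# so a plain cut at byte 2100 would fall inside a character.
data = ("a" + "中" * 1000).encode("utf-8")
chunks = split_utf8_by_bytes(data, 2100)
print([len(c) for c in chunks])
print(all(c.decode("utf-8") for c in chunks))  # every chunk decodes cleanly
```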
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

Try splitting using this pattern:
([\x00-\x7f][^\x00-\x7f]*){1000}
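
For comparison, the same goal (a fixed number of characters per file) can be sketched outside TextPipe in Python by decoding the text first and counting characters rather than bytes. The function name split_by_characters and the sample data are assumptions for illustration only.

```python
def split_by_characters(text: str, chars_per_chunk: int = 1000) -> list[str]:
    """Split decoded text into pieces of chars_per_chunk characters (the last may be shorter)."""
    if chars_per_chunk <= 0:
        raise ValueError("chars_per_chunk must be positive")
    return [text[i:i + chars_per_chunk] for i in range(0, len(text), chars_per_chunk)]


# Decode first so that one Chinese character counts as one unit, not three bytes.
raw = ("中文" * 1250).encode("utf-8")      # 2500 characters, 7500 bytes
pieces = split_by_characters(raw.decode("utf-8"))
print([len(p) for p in pieces])            # [1000, 1000, 500]
```

For a Big5 source file, the decode step would use the file's actual encoding (for example "big5") instead of "utf-8".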
YST
Posts: 4
Joined: Mon Dec 03, 2007 7:55 pm

Post by YST »

I tried to split at that pattern, but an error occurred: "regular expression is too big!" Then I split at ([\x00-\x7f][^\x00-\x7f]*) with "after this many: 1000". This time the garbage characters disappeared (but a few files still show ??? when converted back from UTF-8 to BIG5).

After splitting, each file has about 50000 characters, which is not what I want. When I count the characters in the files, the sizes are irregular. Only with "after this many: 30" do I get files of around 1000 characters, and even then not exactly: some files are blank, some have only one line, some have only 100 characters, and some have 50 to 60. With "after this many: 1000", each file contains over 50000 characters. Why is that? :(
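
One plausible explanation, assuming the pattern behaves like an ordinary byte-oriented regular expression, is that each repetition of the group consumes one ASCII byte plus every non-ASCII byte that follows it. In text that is almost entirely Chinese, the only ASCII bytes may be line breaks, so 1000 repetitions would cover roughly 1000 lines rather than 1000 characters. The Python sketch below uses the re module on made-up data to show the effect; it is not TextPipe's own engine.

```python
import re

# One repetition of the suggested group: an ASCII byte, then any run of non-ASCII bytes.
group = re.compile(rb"[\x00-\x7f][^\x00-\x7f]*")

line = ("中" * 50 + "\n").encode("utf-8")   # 50 Chinese characters, then a newline
data = line * 3

matches = group.findall(data)
print(len(matches))                          # repetitions found
print(len(matches[0].decode("utf-8")))       # characters covered by one repetition
```

Under the same assumption, runs of ASCII such as spaces, punctuation, or blank lines would each count as a repetition while contributing almost no text, which could account for the very short and blank files.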

Thanks very much for your reply :D
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

Please email a sample file to our support address.
YST
Posts: 4
Joined: Mon Dec 03, 2007 7:55 pm

Post by YST »

Hi,

When a text is split into 1000-character files, how can I avoid a blank file being written to the output path?
For example, with 1.txt, 2.txt and 3.txt,
each file should have 1000 characters using the regular expression pattern you mentioned above (splitting before or after), but 3.txt has only blank content. How can I split so that a blank file is not output at all, leaving only 1.txt and 2.txt? (I want to compile the files into a CHM, and my application has problems when it meets a file with blank content.)
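
As a rough illustration of the behaviour asked for here, the following Python sketch writes only the chunks that contain something other than whitespace and numbers the output files consecutively. It runs outside TextPipe; the function name write_nonblank_chunks and the output folder are hypothetical.

```python
from pathlib import Path


def write_nonblank_chunks(chunks, out_dir):
    """Write each non-blank chunk to 1.txt, 2.txt, ... and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for chunk in chunks:
        # str.strip() removes all whitespace; a chunk that ends up empty is skipped.
        if not chunk.strip():
            continue
        path = out / f"{len(written) + 1}.txt"
        path.write_text(chunk, encoding="utf-8")
        written.append(path)
    return written
```

Called with three chunks of which the last is blank, this would produce only 1.txt and 2.txt, so the CHM compiler never sees an empty file.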