Microsoft Wordpad and Unicode
Many users do not fully appreciate that WordPad fully supports Unicode. Files saved from WordPad as Unicode text files are stored as UTF-16 Little Endian.
Full Unicode support applies to the WordPad shipped with Windows XP (and Vista); by contrast, the WordPad in Windows 9x had only limited support.
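To see this for yourself, check the first two bytes of a file saved from WordPad as a Unicode text file; UTF-16 LE files written by Windows editors normally begin with the byte order mark FF FE. A minimal Python sketch (the file name example.txt is hypothetical):
Code: Select all
# Sketch: confirm a file saved as "Unicode text" is UTF-16 Little Endian.
# "example.txt" is a hypothetical name for a file saved from WordPad.
with open("example.txt", "rb") as f:
    head = f.read(2)
if head == b"\xff\xfe":
    print("UTF-16 LE BOM found")
elif head == b"\xfe\xff":
    print("UTF-16 BE BOM found")
else:
    print("no UTF-16 BOM; some other encoding")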
Notepad and Unicode
According to some Unicode experts, Notepad does not provide full support for it, and using Notepad may sometimes give unexpected results.
No matter what encoding you saved a text file with, Notepad will always 'try to guess' when you open the file.
On a WinXP machine, try typing the following line in Notepad, then save the file as ANSI and reopen it with Notepad:
Code: Select all
Bush hid the facts
Notes:
- The bug illustrated above may have been fixed in the Notepad shipped with Vista.
- The versions of Notepad supplied with Windows 95, Windows 98 and Windows ME do not support Unicode.
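For the curious, the misreading can be reproduced outside Notepad. The sentence is 18 bytes long when saved as ANSI, and those bytes pair up into nine valid UTF-16 LE code units, all of which land in the CJK range. A small Python sketch of the effect:
Code: Select all
# Sketch: the ANSI bytes of "Bush hid the facts" pair up into
# nine UTF-16 LE code units, all of which are CJK characters.
data = b"Bush hid the facts"      # the bytes as saved in an ANSI file
print(data.decode("utf-16-le"))   # prints CJK text, not English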
Windows API call IsTextUnicode()
http://msdn2.microsoft.com/en-us/library/ms776445.aspx
explains this behaviour: Notepad makes this call when it opens a file.
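On Windows you can call the API directly and watch it make the same guess. A sketch using ctypes (IsTextUnicode is exported from Advapi32.dll; passing NULL for the result pointer runs all of its heuristic tests, and on XP-era systems it reportedly classifies these bytes as Unicode):
Code: Select all
# Windows-only sketch: ask IsTextUnicode() about the ANSI bytes.
import ctypes

data = b"Bush hid the facts"
looks_unicode = ctypes.windll.Advapi32.IsTextUnicode(data, len(data), None)
print(bool(looks_unicode))  # reportedly True on XP-era systems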
Re: Notepad and Unicode
Yes, this is because all text editors try to guess the text encoding of a file, unless it has a BOM (byte order mark).
They all use different algorithms to calculate this, and they don't read the whole file: they read only a small portion to use as a test sample, and the size of that sample also differs from application to application. I'm not exactly sure of a typical size.
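Something like the following Python sketch shows the general shape of this sniffing: read a small sample, check for BOMs first, then fall back to trial decoding (a real sniffer would also have to handle a multi-byte sequence cut off at the sample boundary):
Code: Select all
# Sketch of the kind of encoding sniffing editors do.
def sniff(path, sample_size=4096):
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if sample.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (BOM)"
    if sample.startswith(b"\xff\xfe"):
        return "utf-16-le (BOM)"
    if sample.startswith(b"\xfe\xff"):
        return "utf-16-be (BOM)"
    try:
        sample.decode("utf-8")     # strict UTF-8 decode as a heuristic
        return "utf-8 (guess)"
    except UnicodeDecodeError:
        return "ansi/codepage (guess)"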
For example, suppose you have a very large text file (tens of thousands of lines) that is all English text, with some French text containing accented characters at the very bottom, and you save it as UTF-8 with no BOM (you don't get this option in Notepad; its BOM is always on by default). The file is saved as UTF-8, with the French accented characters encoded as multi-byte UTF-8 sequences.
If you then open this file, most text editors will think it is plain ASCII and display it using the current OS codepage, so the French text at the bottom will appear corrupted unless you force a reload of the file as UTF-8.
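You can see why the guess is so easy to get wrong: UTF-8 leaves plain ASCII bytes completely unchanged, so the English bulk of the file looks identical in both encodings. A quick Python illustration:
Code: Select all
# UTF-8 encodes ASCII text to the very same bytes as ASCII;
# only the accented characters become multi-byte sequences.
print("Bush hid the facts".encode("utf-8"))  # b'Bush hid the facts'
print("déjà vu".encode("utf-8"))             # b'd\xc3\xa9j\xc3\xa0 vu'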
To prevent this incorrect guesswork, you need to use a BOM.
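In Python, for instance, the "utf-8-sig" codec writes the BOM for you (the file name sample.txt is made up):
Code: Select all
# Sketch: write a file with a UTF-8 BOM so editors don't have to guess.
with open("sample.txt", "w", encoding="utf-8-sig") as f:
    f.write("Bush hid the facts\n")
with open("sample.txt", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf' - the UTF-8 BOM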
The BOM is optional for UTF-8 files; I'm not sure why, but probably because there is only one form of UTF-8, whereas UTF-16 comes in two forms, Little Endian (LE) and Big Endian (BE). Note also that Windows uses UTF-16 LE as its standard when saving files as Unicode, whereas the Mac uses UTF-16 BE, just to be different!
DFH wrote: No matter what encoding you saved a text file with, Notepad will always 'try to guess' when you open the file.
This is not exactly true. If you type in this text and save the file as Unicode (UTF-16 LE) or UTF-8, you will not see garbage when you open the file again, because Notepad (and all Microsoft apps) adds a BOM to the file by default when saving in Unicode formats.
When saved as UTF-8, the actual encoding of the characters in this example does not change; the file is byte-for-byte the same because the text contains only English characters. The only change is the addition of the UTF-8 BOM header \xEF\xBB\xBF.
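For reference, these are the three BOM signatures discussed above, as defined by Python's codecs module:
Code: Select all
import codecs

print(codecs.BOM_UTF8.hex())      # 'efbbbf'
print(codecs.BOM_UTF16_LE.hex())  # 'fffe'
print(codecs.BOM_UTF16_BE.hex())  # 'feff'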
This BOM tells Notepad to read the file as UTF-8, so it knows how to display the characters correctly rather than trying to read the file as UTF-16.
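Reading the file back with the same codec strips the BOM again (continuing the hypothetical sample.txt sketch from above):
Code: Select all
# "utf-8-sig" consumes the BOM on read, so the text comes back clean.
with open("sample.txt", encoding="utf-8-sig") as f:
    print(f.read())  # Bush hid the facts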
In the Bush example above, Notepad tries to guess the encoding by scanning the characters and decides the file is UTF-16 LE. I think it is pure fluke that this particular combination of English characters happens to line up with valid UTF-16 LE encoded characters. It is still a mistake: Notepad could expect any UTF-16 file to carry a BOM and so never try to guess UTF-16 at all, guessing only between UTF-8 and ASCII, since the BOM is optional for UTF-8. Other text editors I've used do not have this error.
Hope this sheds some light on this strange Notepad behaviour.
Anthony.
http://www.TheLocalizer.com/
High Quality, Low Cost, Localization Engineering, Testing, Audio & Video Services