[HanoiLUG] vietnamese language and Unicode

Jean Christophe André jean-christophe.andre at auf.org
Thu Jul 5 12:20:38 ICT 2007


Phan Thái Trung wrote:
> I mean TCVN3 is not an ISO standard
Agreed.

> and is not supported in modern operating systems like WinXP or Linux.
I don't know much about Windows XP, but with Linux, yes it is!!

Linux has supported TCVN3 for a few years already, in console (text)
mode and in X Window (graphical) mode, as well as in the recoding
functions of the GNU libc. And when I arrived in Vietnam, I had little
difficulty finding command-line tools for Vietnamese recoding on
Linux.

> In fact, before Unicode, North VN ppl used to use TCVN3 (ABC) while
> South VN ppl like "VNI like" encoding.
Absolutely.

But in fact the really big problem with these encodings is that they
have been used in a way that made them useless for interoperability.

I mean, since Vietnamese people had no choice but to use proprietary
software at the time these encodings were created, they had to resort to
a very bad trick to be able to use them with that software.

More precisely, since it was not possible to add the official TCVN3
encoding to Windows (and it still isn't, thanks to proprietary
software), they had to use the default ISO-8859-1 encoding instead to
store the real data. And to be able to display that data correctly, they
had to create Vietnamese fonts whose glyphs sit at TCVN3 code positions
but which are declared as ISO-8859-1 fonts too.
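
To make the trick concrete, here is a minimal Python sketch of what
happens to the stored bytes. The seven byte values in the table are
recalled from the TCVN3 (ABC) font layout and cover only a tiny part of
it; they, and the helper name decode_tcvn3_partial, are assumptions to
check against a real conversion table before relying on them.

```python
# Partial, illustrative byte-to-Unicode table for TCVN3 (ABC) fonts.
# ASSUMPTION: these seven positions are from memory, not from the source.
TCVN3_PARTIAL = {
    0xA8: "\u0103",  # a with breve
    0xA9: "\u00e2",  # a with circumflex
    0xAA: "\u00ea",  # e with circumflex
    0xAB: "\u00f4",  # o with circumflex
    0xAC: "\u01a1",  # o with horn
    0xAD: "\u01b0",  # u with horn
    0xAE: "\u0111",  # d with stroke
}


def decode_tcvn3_partial(data: bytes) -> str:
    """Decode with the partial table; any other byte falls back to ISO-8859-1."""
    return "".join(TCVN3_PARTIAL.get(b, chr(b)) for b in data)


raw = b"th\xad"                         # three bytes as stored on disk
as_declared = raw.decode("iso-8859-1")  # what the file claims to contain
as_intended = decode_tcvn3_partial(raw) # what the author meant to write

print(repr(as_declared))
print(repr(as_intended))
```

The same three bytes give two different strings, and nothing in the file
itself says which reading is the right one.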

Now interoperability is dead! You have no way to know the real encoding
of a document except by guessing it (from the font name or from the
properties of Vietnamese words)...
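
As a rough illustration of the "guess from the font name" approach, here
is a small Python sketch. The function name and the prefix rules are
hypothetical: they only encode the two font families named in this
thread (".vn..." for TCVN3, "VNI..." for VNI) and will certainly misjudge
other fonts.

```python
def guess_legacy_encoding(font_name: str) -> str:
    """Guess a legacy Vietnamese encoding from a font name (heuristic only)."""
    name = font_name.strip().lower()
    if name.startswith(".vn"):
        return "TCVN3"    # e.g. ".vnTimes"
    if name.startswith("vni"):
        return "VNI"      # e.g. "VNITimes"
    return "unknown"      # nothing in the name gives a hint


for font in (".vnTimes", "VNITimes", "Times New Roman"):
    print(font, "->", guess_legacy_encoding(font))
```

A document that carries no font information (plain text, for instance)
gets no help from this at all, which is exactly the problem.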

It has consequences for usage on Linux too (since that's the main topic
here, isn't it? ;-)): when you install TCVN3 (.vnTimes) or VNI
(VNITimes) TrueType fonts, they are recognised as ISO-8859-1 fonts and
give totally incorrect results when you try to use a document really
encoded in TCVN3 (not through ISO-8859-1) in a Unicode environment. And
of course, these fonts become totally unusable for real ISO-8859-1
documents: just try to type some French text (or anything with accents)
with these fonts...

Even worse, it has consequences for usage with OpenOffice.org too, even
on Windows! For example, [ư] (u with horn) sits at the code position
that ISO-8859-1 assigns to the soft hyphen (0xAD), a dash used for
hyphenation (cutting a word when it is too long), so OpenOffice.org
treats it specially and does not allow it to be recoded from TCVN3 to
Unicode with the Unikey toolkit...
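
Assuming the dash in question is the ISO-8859-1 soft hyphen at byte 0xAD
(U+00AD in Unicode), the following Python sketch shows why software may
treat it differently from a letter. It only inspects Unicode character
properties; it says nothing about how OpenOffice.org is implemented, and
the helper name describe is made up for the example.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the Unicode name and general category of one character."""
    return unicodedata.name(ch), unicodedata.category(ch)


misread = b"\xad".decode("iso-8859-1")  # byte 0xAD read as ISO-8859-1
intended = "\u01b0"                     # the letter the author meant

print(describe(misread))    # a format control character, category "Cf"
print(describe(intended))   # a lowercase letter, category "Ll"
```

Once the byte has been taken for a formatting character rather than a
letter, any tool that handles hyphenation specially can hide or drop it
before a converter ever sees it.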


Now we have plenty of old Vietnamese documents all declared as being
encoded in ISO-8859-1 although they are really TCVN3 encoded, so it
won't be easy to create auto-conversion tools for all these "still
legally needed" legacy documents of the Vietnamese administration...

Think about it: it makes it harder to migrate to a modern operating
system (say Linux) which won't support these bad hacks by default... We,
the Linux community in Vietnam, have some research work to do here!

> When we meet a document encoded by these old character sets, we can
> use the free Unikey toolkit to quickly convert them to Unicode
> content, even in rich-text format, without losing the old text formatting.
Some (like David) use VietPad for this, a Vietnamese editor written in Java.

Others (like me) use "recode" (or now also "iconv") for this:
command-line recoding tools that support a *lot* of encodings (not only
Vietnamese ones).
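
For scripted conversions, iconv can be driven from Python. This is only
a sketch: the function names are made up, and the charset name
"TCVN5712-1" is an assumption about what the local iconv offers (check
with "iconv -l"). Whether that charset matches the byte layout of a
given document typed with .vn fonts is not guaranteed either.

```python
import subprocess


def build_iconv_command(path, from_charset="TCVN5712-1", to_charset="UTF-8"):
    """Build the iconv command line for converting one file."""
    # ASSUMPTION: the default charset names exist in the local iconv.
    return ["iconv", "-f", from_charset, "-t", to_charset, path]


def convert_file(path, **charsets) -> bytes:
    """Run iconv on one file and return the converted bytes."""
    result = subprocess.run(
        build_iconv_command(path, **charsets),
        check=True,           # raise if iconv reports an error
        capture_output=True,  # collect stdout instead of printing it
    )
    return result.stdout


print(" ".join(build_iconv_command("document.txt")))
```

Building the command separately from running it makes it easy to print
and review what will be executed before touching a batch of files.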

-- 
Jean Christophe "プログフ" ANDRÉ — http://asie-pacifique.auf.org/
Regional technical manager
Agence universitaire de la Francophonie (AuF) — Bureau Asie-Pacifique (BAP)
Postal address: AUF, 21 Lê Thánh Tông, T.T. Hoàn Kiếm, Hà Nội, Việt Nam
Tel.: +84 4 9331108   Fax: +84 4 8247383   Mobile: +84 91 3248747
⎧ Personal note: please avoid sending me PowerPoint or Word files,       ⎫
⎩ see http://www.gnu.org/philosophy/no-word-attachments.fr.html

