[HanoiLUG] vietnamese language and Unicode

Jean Christophe André jean-christophe.andre at auf.org
Thu Jul 5 14:13:18 ICT 2007


Phan Thái Trung wrote:
> I don't think Linux has built-in TCVN3 installed, so one must search,
> download/copy & install TCVN3 fonts for use.
Supporting an encoding is not only about fonts. Just try "locate tcvn"
in any modern GNU/Linux distribution, you may be surprised... ;-)

> I think you are talking about storing data in the web - like
> environment, or database
No, I was just talking about choosing an encoding to store data, in any
application that cares about encoding. And here I was talking more
specifically about Microsoft's closed products, which do care about
encoding but, because they are closed, do not allow new encodings to be
added easily.

> which only supports ANSI characters. In those environments, apps need
> encode non-ANSI characters like TCVN3 into encoding character sets
> such as ISO-8859-1 or UTF-x.
I guess you are talking about ASCII = ISO-646, the 7-bit American
Standard Code for Information Interchange (cf. "man ascii"), and not
ANSI, the American National Standards Institute (the American
equivalent of TCVN).

If you are really talking about ASCII-only compatible applications, then
the only reasonable choice is to use UTF-7, which allows storing the
full Unicode charset in a 7-bit space.

TCVN 5712, as well as ISO-8859-1 (aka Latin-1), needs an 8-bit space.
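
To make the 7-bit versus 8-bit point concrete, here is a minimal Python
sketch. The sample strings and variable names are my own choices for
illustration, and it relies only on the "utf-7" and "latin-1" codecs
shipped with Python:

```python
# Sketch: check which encodings stay inside a 7-bit space.
text = "Vi\u1ec7t Nam"  # "Việt Nam", written with an escape to avoid editor issues

# UTF-7 represents any Unicode text using only ASCII-range bytes.
utf7_bytes = text.encode("utf-7")
utf7_is_7bit = all(b < 0x80 for b in utf7_bytes)

# ISO-8859-1 (Latin-1) uses the high half of the byte for accented letters.
latin1_bytes = "caf\u00e9".encode("latin-1")  # "café"
latin1_needs_8bit = any(b >= 0x80 for b in latin1_bytes)

# The UTF-7 form is lossless: decoding gives back the original text.
utf7_roundtrip = utf7_bytes.decode("utf-7")

print(utf7_bytes, utf7_is_7bit)
print(latin1_bytes, latin1_needs_8bit)
```

The same check applied to TCVN3 data would also report bytes above 0x7F,
since its Vietnamese letters live in the upper half of the byte.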

> But in desktop apps, not web apps, TCVN3 strings usually do not need
> to be encoded, because TCVN3 is 1-byte characters and they can easily
> stored in the normal string in the developing environment without
> losing. This is one of reasons why TCVN3 is commonly used in VNmese
> apps til now.
Well... There is a clear misunderstanding about what "encoding" means
here. I was not talking about "recoding" data to store it, but about
storing data in its "original encoding" (here TCVN3).

How do you define "normal string" here? I can guess that you meant
"8-bit strings", but in reality it's very architecture and system
dependent! And that's why we need standards to define encodings
independently of the architecture they will be used with.

But, to illustrate what I'm talking about, just check the encoding of
any .vn* font file. You will find it's declared as an ISO-8859-1 encoded
font file, which is plain wrong. That's precisely what I'm talking about
here.
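
Here is a small Python sketch of what a wrong encoding declaration does
to the data. Python ships no TCVN3 codec, so the one-entry table below
is a hypothetical excerpt that I am assuming for illustration (byte 0xD6
standing for "ệ"); it is not a complete or authoritative TCVN3 mapping:

```python
# ASSUMPTION: partial, illustrative excerpt of a TCVN3 byte table.
TCVN3_EXCERPT = {0xD6: "\u1ec7"}  # assumed: byte 0xD6 -> "ệ"

def decode_tcvn3_excerpt(data: bytes) -> str:
    """Decode bytes with the excerpt table; unknown high bytes become U+FFFD."""
    chars = []
    for b in data:
        if b < 0x80:
            chars.append(chr(b))            # ASCII range is shared
        else:
            chars.append(TCVN3_EXCERPT.get(b, "\ufffd"))
    return "".join(chars)

raw = b"Vi\xd6t Nam"  # the same stored bytes in both cases

# Declared as ISO-8859-1: software believes byte 0xD6 is "Ö".
as_latin1 = raw.decode("latin-1")

# Interpreted with the (assumed) TCVN3 table: byte 0xD6 is "ệ".
as_tcvn3 = decode_tcvn3_excerpt(raw)

print(as_latin1)
print(as_tcvn3)
```

The bytes never change; only the declared encoding does, and that
declaration is what every other program relies on to interpret them.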

> The 2nd reason is about how to develop/apply true Unicode apps, in
> Windows (currently widely used in VN). It is really a difficult
> problem even at this moment. I'm currently an I.T instructor in an I.T
> department, where my students are developing VNmese-softwares yearly.
> They, students, have difficulty to develop Unicode supported apps in
> common developing IDE. Their products mostly still using TCVN3-like
> fonts (badly!) and rarely very little support Unicode. The same
> problem in many I.T VN software product for now. The best way is using
> modern frameworks such as MS .NET or Java, but they are not Native
> machine language apps.
I'm not sure I understand the problem here... I'm a developer myself
and, even if I never developed under recent Windows versions, I can
tell that they *do* have very good Unicode support since Windows 2000.
Windows Unicode support even started with Windows 98 but was very
limited at that time. The best proof of this is OpenOffice.org, which
makes very good use of the Unicode support.

In my opinion, the biggest problem is about understanding Unicode. There
is not only one encoding for Unicode and, as always, Microsoft made an
incompatible strategic choice in using UCS-2 instead of UTF-8, the most
frequently used encoding in the Internet world. As a result, Windows
Unicode-encoded text is totally different from its Unix counterpart;
here begins the need for recoding, and so start the difficulties for
young, inexperienced students.
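
A short Python sketch of that difference follows. Python has no codec
named "ucs-2", so I am assuming "utf-16-le" as a stand-in (the two give
the same bytes for characters in the Basic Multilingual Plane, which
covers Vietnamese); the recode() helper is a hypothetical name of mine:

```python
# Sketch: the same Unicode text stored the "Windows way" and the "Unix way".
text = "Vi\u1ec7t"  # "Việt"

windows_bytes = text.encode("utf-16-le")  # stand-in for UCS-2, 2 bytes per char
unix_bytes = text.encode("utf-8")         # 1 to 3 bytes per char for this text

def recode(data: bytes, source: str = "utf-16-le", target: str = "utf-8") -> bytes:
    """Convert stored bytes from one Unicode encoding to another."""
    return data.decode(source).encode(target)

print(windows_bytes.hex(" "))
print(unix_bytes.hex(" "))
print(recode(windows_bytes) == unix_bytes)
```

Both byte sequences describe exactly the same characters, so recoding
loses nothing as long as the source encoding is known.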

> This is only one reason for ppl who are not technical at that time.
> They don't mind about Unicode, ISO or Thu tuong Chinh phu blah blah...
> but they only want his/her web browser must display Vnmese content
> correctly, that's all.
You are totally right on this point. And it's important to keep this in
mind when developing applications.

> There are some patch of Dang Minh Tuan (author of Vietkey) to fix dash
> sign [-] problem for IE, but it is not a good choice. The good choice
> is switch to Unicode, mostly in the web environment at that time.
Exactly! I'm glad to see that, even if we have long discussions about
details, we totally agree about the objectives! :-)

-- 
Jean Christophe "プログフ" ANDRÉ — http://asie-pacifique.auf.org/
Regional technical manager
Agence universitaire de la Francophonie (AuF) — Bureau Asie-Pacifique (BAP)
Postal address: AUF, 21 Lê Thánh Tông, T.T. Hoàn Kiếm, Hà Nội, Việt Nam
Tel.: +84 4 9331108   Fax: +84 4 8247383   Mobile: +84 91 3248747
⎧ Personal note: please avoid sending me PowerPoint or Word files,  ⎫
⎩ see http://www.gnu.org/philosophy/no-word-attachments.fr.html

