How to change the locale from iso8859-1 to utf-8 on a Linux workstation

(or why this is not as simple as some programmers think)


In 2008 I wrote an internal memo (really a written-down brainstorming session) about this. It is about six pages long and written in German.

A few points, briefly extracted and translated:

Continuous business vs. version releases

In our company (engineering consulting) we are running a network of Windows and Linux servers and workstations.

Over the years we have accumulated an ever-changing pool of projects; some are active, some are retired. We have a few hundred thousand files on different servers (not to mention old CD-ROMs and DDS tapes somewhere...).

The point is that the data pool as a whole has no distinct "versions" (or any other clear boundaries in time) at which we could make a cut. This is a big difference from some software projects (Linux distributions, for example), which start a new version and make changes that are not backward compatible.

We cannot simply say that from day X on we switch to another character set and more or less drop all the old data.

Changing the locale in file names vs. file contents

You might consider shutting down the company for a weekend and converting "everything" to UTF-8. Of course, this cannot be done by hand, file by file. With tools like iconv, all (data) filesystems on all Linux machines would have to be converted.

Converting file and directory names is not a problem. But what to do with the file contents? Many files are binary (most prominently CAD drawings and OpenOffice documents!). As a consequence you would need a specialized converter for every binary file format. That is not feasible, so the file contents must be left as they are.
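Why the file names are the easy part can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure from the memo: the helper names `latin1_name_to_utf8` and `plan_renames` are made up, and every existing name is assumed to be valid ISO 8859-1. The sketch works on byte strings, because that is what a Linux filesystem actually stores.

```python
import os

def latin1_name_to_utf8(name: bytes) -> bytes:
    """Re-encode one file name from ISO 8859-1 to UTF-8 (hypothetical helper)."""
    # Pure ASCII names are identical in both encodings and come back unchanged.
    # Caveat: any byte sequence decodes as Latin-1, so a name that is already
    # UTF-8 would be double-encoded here; real tooling has to check for that.
    return name.decode("iso-8859-1").encode("utf-8")

def plan_renames(root: bytes):
    """Yield (old_path, new_path) pairs below root (hypothetical helper)."""
    # topdown=False visits the contents of a directory before the directory
    # itself is offered for renaming in its parent, so the yielded paths are
    # still valid if the caller renames each pair immediately.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new_name = latin1_name_to_utf8(name)
            if new_name != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, new_name)
```

Calling `os.rename(old, new)` for each pair yielded by `plan_renames(b"/data")` would perform the conversion; `/data` is a placeholder path, and printing the pairs first gives a dry run.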

This leads to the next problem: many (binary) files contain paths and file names themselves, for example references to other CAD files and linked images.

So the file names in the filesystem have to be kept as well.
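A small Python sketch of this problem, using made-up data: the file name and the surrounding bytes are hypothetical and do not reflect any real CAD format. It shows that a reference stored in ISO 8859-1 inside a binary file no longer matches the name on disk once only the file name has been converted to UTF-8.

```python
# Hypothetical file name: U-umlaut followed by "bersicht.dwg".
name = "\u00dcbersicht.dwg"

name_latin1 = name.encode("iso-8859-1")  # name bytes before the switch
name_utf8 = name.encode("utf-8")         # name bytes on disk after renaming

# Made-up binary blob that embeds a reference to the other file in Latin-1.
blob = b"\x00\x01XREF:" + name_latin1 + b"\x00"

def reference_resolves(blob: bytes, name_on_disk: bytes) -> bool:
    """True if the blob contains exactly the bytes of the on-disk name."""
    return name_on_disk in blob

print(reference_resolves(blob, name_latin1))  # True
print(reference_resolves(blob, name_utf8))    # False
print(len(name_latin1), len(name_utf8))       # 13 14
```

The two encodings differ not only in byte values but also in length, so patching such references in place would generally shift offsets inside the binary file.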

Splitting up namespaces

There is a way out of this, of course. You need a facility that allows different encodings for the file names and the file contents. This is what Windows did. In its early days most console applications used the DOS "OEM" codepage (from IBM), while the Windows GUI always used the ANSI codepage.

To make this work, each individual program on Windows can select its codepages at runtime (at least one for the filesystem API, perhaps another for the console).

In this way file names and contents are decoupled. Of course, each application must still handle its own file format (AutoCAD, for example, stores the dwgcodepage variable in its drawing header).

On Linux the whole software stack, from the application down to the filesystem, shares the same locale. There are some exceptions when using Samba or special FUSE filesystems, but those do not operate at the application level.
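What sharing one locale means in practice can be illustrated with a short Python sketch. The file name is a made-up example. The filesystem only stores bytes, and the same bytes read differently depending on which character set the application assumes.

```python
# File name bytes as written by a program running in an ISO 8859-1 locale
# (hypothetical example; 0xFC is u-umlaut in Latin-1).
raw_name = b"Pr\xfcfung.txt"

# Interpreted with the character set it was written in:
as_latin1 = raw_name.decode("iso-8859-1")

# Interpreted by a program that assumes UTF-8: 0xFC is not valid UTF-8,
# so it is replaced by U+FFFD (the replacement character).
as_utf8 = raw_name.decode("utf-8", errors="replace")

print(as_latin1 == "Pr\u00fcfung.txt")  # True
print(as_utf8 == "Pr\ufffdfung.txt")    # True
```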

As long as there is nothing like "SetFileApisToANSI" on Linux, we are stuck with the ISO8859-1 (or -15) codepage for everything.

It will be interesting to see whether the use of the ISO8859-* locales will be declared obsolete one day. As long as it isn't, software should respect the setting of the LANG variable.
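What respecting the setting could look like is sketched below in Python, for POSIX systems only. The function names `locale_codeset` and `decode_name` are made up for illustration; `locale.setlocale` and `locale.nl_langinfo` are standard library calls. The idea is to ask the locale for its character set instead of hard-coding UTF-8.

```python
import locale

def locale_codeset() -> str:
    """Return the character set of the locale chosen via the environment."""
    try:
        # "" means: take the locale from the environment (LC_ALL, LC_CTYPE, LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The requested locale is not installed; the current one stays in effect.
        pass
    return locale.nl_langinfo(locale.CODESET)

def decode_name(raw: bytes) -> str:
    """Decode file name bytes with the locale's codeset, not a fixed UTF-8."""
    return raw.decode(locale_codeset(), errors="replace")

print(locale_codeset())
print(decode_name(b"plan.dwg"))
```

The printed codeset depends on the environment, for example a name such as ISO-8859-1 or UTF-8, so no fixed output is shown here.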