Information Concerning Character Encodings
The new default character encoding at the Mathematical Institute is UTF-8.
What are the advantages of Unicode compared to 8-bit encodings (e.g. Latin1)?
An 8-bit encoding can represent at most 256 different characters. Each region of the world therefore has its own encoding that covers the characters of that region. This leads to the following problems:
- In text files, only the special characters of one region (e.g. German umlauts) can be used.
- Exchanging files can be problematic if it is unclear which character set was used. Strictly speaking, this also applies to UTF-8, but there are ways to avoid this problem.
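The second problem can be illustrated with a single byte. The following is a minimal sketch, assuming that iconv is installed and that the terminal uses UTF-8; the function name decode_byte_as is made up for this illustration. It decodes the byte 0xE4 once as ISO-8859-1 (Latin1) and once as ISO-8859-5 (Cyrillic):

```shell
#!/bin/sh
# Illustration only: decode the single byte 0xE4 (octal 344) under a
# given 8-bit encoding and print the result as UTF-8.
# decode_byte_as is a hypothetical helper name, not an existing tool.
decode_byte_as() {
    printf '\344' | iconv -f "$1" -t UTF-8
}

decode_byte_as ISO-8859-1; echo   # prints ä (Latin1)
decode_byte_as ISO-8859-5; echo   # prints ф (Cyrillic)
```

The file content is identical in both cases; only the assumed encoding differs.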
What are the advantages of UTF-8?
UTF-8 avoids the problems of 8-bit encodings described above. It also has further advantages:
- Mac OS X (and newer Linux versions) use UTF-8 as their default encoding. Switching our standard encoding to UTF-8 makes it easier to exchange data between the Institute and your personal device(s). In the past, it was sometimes necessary to explicitly recode files so that they would be correctly displayed.
- Some programs only work with Unicode: they neither display Latin1 text correctly nor process Latin1-encoded program files correctly. The number of such programs is expected to increase in the future.
For which types of files does the character encoding matter?
In general, for text files. These include LaTeX documents, HTML documents and program files (e.g. C code). OpenOffice, LibreOffice and MS Office documents and PDF files are not text files.
Commands for manual conversion of files
Show the current encoding:
file -i myfile
Example outputs are:
UTF-8: myfile: text/plain; charset=utf-8
Latin1: myfile: text/plain; charset=iso-8859-1
If the output contains charset=us-ascii, no conversion is necessary: the file contains only ASCII characters, which are encoded identically in Latin1 and UTF-8.
If the output contains charset=unknown-8bit, it is very likely that the file contains a mixture of Latin1 and UTF-8 characters. If you are not able to solve this problem on your own, please do not hesitate to contact us.
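The file command guesses the encoding from the bytes it sees. As a complementary check, the following sketch tests whether a file is valid UTF-8 by converting it from UTF-8 to UTF-8 and looking only at the exit status of iconv. The helper name is_valid_utf8 is an assumption made for this example and not an existing tool.

```shell
#!/bin/sh
# Sketch: iconv exits with a non-zero status when the input is not valid
# in the source encoding, so converting UTF-8 to UTF-8 acts as a check.
# is_valid_utf8 is a hypothetical helper name.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Usage: report each file given on the command line.
for f in "$@"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8 (this includes plain ASCII)"
    else
        echo "$f: not valid UTF-8"
    fi
done
```

A file that fails this check is either in an 8-bit encoding such as Latin1 or contains a mixture of encodings.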
Conversion of file content
Latin1 to UTF-8:
iconv -f latin1 -t utf8 file > newfile
UTF-8 to Latin1:
iconv -f utf8 -t latin1 file > newfile
Please make sure that the input (file) and output (newfile) file names are different. If both names are the same, the shell empties the file before iconv reads it, and its content is lost.
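If the converted file should keep its original name, one option is to convert from a backup copy. The following is a minimal sketch under stated assumptions: the function name convert_with_backup and the suffix .latin1.bak are invented for this example, and ISO-8859-1 and UTF-8 are used as the standard names for latin1 and utf8. Because every byte sequence is valid Latin1, converting a file that is already UTF-8 would not fail but would garble it, so the sketch first skips files that already are valid UTF-8 (a heuristic, not a proof).

```shell
#!/bin/sh
# Sketch: convert a Latin1 file to UTF-8 under its original name while
# keeping the original as a backup.
# convert_with_backup and the .latin1.bak suffix are hypothetical.
convert_with_backup() {
    src=$1
    bak=$src.latin1.bak

    # Heuristic guard: a file that is already valid UTF-8 is left alone.
    if iconv -f UTF-8 -t UTF-8 "$src" > /dev/null 2>&1; then
        echo "$src: already valid UTF-8, skipped" >&2
        return 0
    fi

    cp -p "$src" "$bak" || return 1
    # Input ($bak) and output ($src) are different files.
    if ! iconv -f ISO-8859-1 -t UTF-8 "$bak" > "$src"; then
        cp -p "$bak" "$src"    # restore the original on failure
        return 1
    fi
}

# Example call (file name is a placeholder):
# convert_with_backup notes.txt
```

The backup can be deleted once the converted file has been checked.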
Conversion of file names and folders:
Latin1 to UTF-8:
convmv -f latin1 -t utf8 --notest file1 [file2 ...]
UTF-8 to Latin1:
convmv -f utf8 -t latin1 --notest file1 [file2 ...]
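Before renaming, it can help to see which names are affected. convmv only renames when --notest is given; without that option it prints what it would do. Independently of convmv, the following sketch lists the entries of one directory whose names are not valid UTF-8. The function name list_non_utf8_names is an assumption for this example; the sketch is not recursive and ignores hidden files.

```shell
#!/bin/sh
# Sketch: print the entries of a directory whose names are not valid
# UTF-8 and are therefore candidates for convmv.
# list_non_utf8_names is a hypothetical helper name.
list_non_utf8_names() {
    for path in "${1:-.}"/*; do
        [ -e "$path" ] || continue      # the directory may be empty
        name=${path##*/}                # strip the directory part
        if ! printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1; then
            printf '%s\n' "$path"
        fi
    done
}

# Example call (directory name is a placeholder):
# list_non_utf8_names ~/documents
```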
Conversion of LaTeX files:
First, please convert the file content as explained above.
The next step is to replace
If this is not sufficient, please leave us a note.
Umlauts in a file are not displayed correctly. What should I do?
Please do not edit the file, and above all do not save it, because saving it with the wrong encoding could corrupt it.
Many editors use heuristics to detect the encoding, but these do not always work, and some editors have no such detection at all. In that case you have to set the encoding manually. In most cases it is sufficient to select UTF-8 or Latin1 (ISO-8859-1, or the closely related ISO-8859-15). If you are not able to solve this problem on your own, please do not hesitate to contact us.
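A candidate encoding can also be tried on the command line without touching the file. The following sketch assumes a UTF-8 terminal; the function name preview_as and the file name notes.txt are placeholders invented for this example. It prints the first lines of a file as they would read under the given encoding.

```shell
#!/bin/sh
# Sketch: show the first lines of a file as they would look under a
# candidate encoding. The file itself is only read, never modified.
# preview_as is a hypothetical helper name.
preview_as() {
    encoding=$1
    file=$2
    iconv -f "$encoding" -t UTF-8 "$file" | head -n 5
}

# Example: compare candidates and keep the one that reads correctly.
# preview_as ISO-8859-1  notes.txt
# preview_as ISO-8859-15 notes.txt
```

ISO-8859-1 and ISO-8859-15 differ in only a few positions. For example, the byte 0xA4 is the currency sign ¤ in ISO-8859-1 and the euro sign € in ISO-8859-15, so for most German text both previews look the same.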
Umlauts are not displayed correctly in the file manager Dolphin, and the file cannot be opened or edited. What should I do?
Unfortunately, a bug in many KDE applications makes it impossible to open, edit or rename files whose names use a different character encoding.
On the command line, you can rename such files automatically with convmv, as described above. After that, it should be possible to work with these files again. Alternatively, the graphical file manager thunar is installed, which lets you manually rename files whose names use a different encoding.
The KDE bug also affects unpacking archives that contain file names in a different encoding. Please use the program file-roller to extract files from such archives.
If you are not able to solve this problem on your own, please do not hesitate to contact us.