Edgar Huckert

Convert texts in Codepage 850 to UTF-8 using D

D's standard library Phobos has a number of modules that deal with strings and with conversions between encodings. To name just some of them:

std.string
std.uni
std.utf
std.encoding

I found all that very confusing and academic (and I am not alone in that: see the alternative solutions by Adam Ruppe in his arsd package). My confusion probably comes from the fact that I also have to write programs in C++, where very different conversion functions exist in Boost and the STL. wxWidgets, my favorite GUI solution, uses yet another set of conversion functions, and the Windows API has its own set as well.

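Formally, a D string holds UTF-8 text: string is an array of immutable char, and the language defines char as a UTF-8 code unit. In practice nothing stops a program from reading bytes in any other encoding into a string, because the contents are not validated automatically. In the Western world ISO-Latin-1 (a 1-byte encoding) is probably the most frequent such encoding. The other string types are not classes but array types as well: wstring holds 2-byte code units (UTF-16) and dstring holds 4-byte code units (UTF-32). In the DOS world (pre-Windows) codepage 850 was a frequently used encoding. If you work with old texts (some of mine date from the pre-DOS world) or if you use VIM at the console level, then codepage 850 is probably the base encoding. It uses 1 byte per character: the codes 0-127 represent ASCII, and the codes above 127 cover all the usual French accents and German umlauts. In the Windows 7 world other codepages such as 1252 (Western European) or 1250 (Central European) are used for external files; internally Windows uses UTF-16, which is built on 2-byte units.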
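A string variable can hold arbitrary bytes at run time, but string literals in D source code are stored as UTF-8, UTF-16 or UTF-32 depending on the string type. The following minimal sketch (my own illustration, not part of the conversion program) shows the unit sizes and how the length of the same word differs between the three types:

```d
void main()
{
    // string, wstring and dstring are aliases for arrays of
    // immutable char, wchar and dchar.
    static assert(is(string == immutable(char)[]));
    static assert(char.sizeof == 1);   // UTF-8 code unit
    static assert(wchar.sizeof == 2);  // UTF-16 code unit
    static assert(dchar.sizeof == 4);  // UTF-32 code unit

    // "Käse", with the umlaut written as an escape so that the
    // encoding of the source file does not matter.
    string  s = "K\u00E4se";
    wstring w = "K\u00E4se"w;
    dstring d = "K\u00E4se"d;

    // .length counts code units, not characters.
    assert(s.length == 5);  // the umlaut takes two bytes in UTF-8
    assert(w.length == 4);
    assert(d.length == 4);
}
```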

The normal external encoding for text files is now either UTF-8 (a multibyte encoding) or ISO-Latin-1 (a single-byte encoding). UTF-8 encodes the ASCII characters (codes 0-127) as single bytes. All other characters, including German umlauts and French accents, require 2 to 4 bytes. The usual German umlauts and French accents are two-byte sequences starting with 0xc3.
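The byte counts can be checked with a few lines of D. This is a minimal sketch that uses encode from std.utf; the helper name utf8Of is my own invention for this illustration and is not part of Phobos:

```d
import std.utf : encode;

/// Hypothetical helper: returns the UTF-8 bytes of one code point
/// as a string.
string utf8Of(dchar c)
{
    char[4] buf;                   // UTF-8 needs at most 4 bytes
    immutable n = encode(buf, c);  // number of bytes written
    return buf[0 .. n].idup;
}

void main()
{
    assert(utf8Of('A') == "A");                  // ASCII: 1 byte
    assert(utf8Of('\u00E4') == "\xC3\xA4");      // a umlaut: 0xc3 0xa4
    assert(utf8Of('\u00E9') == "\xC3\xA9");      // e acute: 0xc3 0xa9
    assert(utf8Of('\u20AC').length == 3);        // Euro sign: 3 bytes
}
```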

I have written a simple conversion program in D that transcodes such old texts into UTF-8. It is only a partial solution, not a complete conversion: I concentrate on German umlauts and French accents. Here is my conversion program written in D. Hints for compilation and usage are given in the source code. You may test it with this short text file that uses codepage 850; the browser will probably show it with strange, wrong characters. You can check the conversion result with SciTE or any other modern programming editor.
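For readers who only want the idea, here is a minimal sketch of such a transcoder. It is not the linked program: the function names cp850ToUnicode and cp850ToUtf8 are assumptions made for this illustration, the table covers only a handful of German and French letters, and every other code above 127 is replaced by U+FFFD.

```d
import std.utf : encode;

/// Hypothetical helper: maps one codepage 850 byte to a Unicode
/// code point. Partial table, German and French letters only.
dchar cp850ToUnicode(ubyte b)
{
    if (b < 0x80)
        return cast(dchar) b;        // ASCII is unchanged
    switch (b)
    {
        case 0x81: return '\u00FC';  // u umlaut
        case 0x84: return '\u00E4';  // a umlaut
        case 0x94: return '\u00F6';  // o umlaut
        case 0x8E: return '\u00C4';  // A umlaut
        case 0x99: return '\u00D6';  // O umlaut
        case 0x9A: return '\u00DC';  // U umlaut
        case 0xE1: return '\u00DF';  // sharp s
        case 0x82: return '\u00E9';  // e acute
        case 0x8A: return '\u00E8';  // e grave
        case 0x88: return '\u00EA';  // e circumflex
        case 0x85: return '\u00E0';  // a grave
        case 0x83: return '\u00E2';  // a circumflex
        case 0x87: return '\u00E7';  // c cedilla
        case 0x93: return '\u00F4';  // o circumflex
        case 0x97: return '\u00F9';  // u grave
        default:   return '\uFFFD';  // not in this partial table
    }
}

/// Hypothetical helper: transcodes a codepage 850 buffer to UTF-8.
string cp850ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
        encode(result, cp850ToUnicode(b));  // appends the UTF-8 bytes
    return result.idup;
}

void main()
{
    // "Käse" as it is stored in a codepage 850 file.
    immutable ubyte[] old = [0x4B, 0x84, 0x73, 0x65];
    assert(cp850ToUtf8(old) == "K\u00E4se");
}
```

For a real file the input bytes could be obtained with std.file.read (cast to const(ubyte)[]) and the result written back with std.file.write. A complete solution would need the full 128-entry table for the codes above 127.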

This conversion module is also used in a more complex program called dictionary.d, a rather simple dictionary program based on an associative array (the dictionary). It reads its data from a file (which may be codepage 850 encoded) and accepts keys (for testing) from the Windows console.

A similar program written in standard C is here.

Contact

If you want to contact me: this is my mail address