Aaron Andrew Hunt

Respecting All the Languages of the World

January 17, 2016

When I launched my new websites in April 2015, I confronted the problem of multilingual support. The main site at Zentral.zone, my personal site, and my new publisher site, Zwillinge, were all designed in both English and German, and I also wanted to support all languages for user accounts on my business sites at H-Pi Instruments and Mather Point Software. As a programmer I knew that the answer to this problem was to use Unicode UTF-8 character encoding system-wide, and I had coded my web pages accordingly.

The Problem

However, I found that no matter what I did, my database tables would not cooperate whenever non-ASCII characters were involved. Needless to say, this was rather frustrating, and for someone who considers himself a decent programmer, a bit humiliating. I searched far and wide for answers, but found no way to solve the problem.

With a lot of other work on my hands and a lack of infinite patience, I had to find a temporary solution. This was twofold: first, to use HTML entities for non-ASCII characters in the German text stored in the database, and second, to instruct visitors wherever necessary that all submitted data had to be in Latin letters. Though I did not like these temporary workarounds, they worked well enough.
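
The entity workaround can be illustrated with a minimal sketch. It is written in Python purely for illustration (it is not the code used on the sites), and the helper name to_entities is hypothetical. Every character outside ASCII is replaced with a numeric character reference, so the stored text contains only ASCII bytes and a browser restores the original characters on display.

```python
import html

def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    # 'xmlcharrefreplace' turns each unencodable character into &#NNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

stored = to_entities("Grüße aus Köln")
print(stored)                 # Gr&#252;&#223;e aus K&#246;ln
print(html.unescape(stored))  # Grüße aus Köln
```

The stored form is safe for a table that mishandles anything beyond ASCII, but it is awkward to search or sort, which is one reason it only makes sense as a stopgap.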

Network payment notifications received from PayPal originating from places like Japan or Russia, however, were doomed. Whenever a transaction of this sort was processed, things would become a bit confused. Safeguards within the system ensured that original data was never lost, but the database would not record information properly in these cases.

The problem was compounded by the mechanism I had designed to generate email responses and software licenses by processing the data received from PayPal notifications. Everything would run smoothly as long as only plain ASCII letters were involved. The moment any other character appeared, even something as ostensibly harmless as an umlaut in a German name, the system became confused: response emails would be sent with greetings to garbled or completely blank names ("Hello .") and software licenses would likewise be issued to garbled or blank names. Handling these cases manually was my only option, which was not fun. After a few months running the system, I added a patch which replaced umlaut letters with their two-letter ASCII equivalents (oe instead of ö), which helped in some of the more trivial cases. I also coded my own lookup table to Latinize Russian characters, but I found that it did not work. By the end of the year, I knew it was high time to find the proper solution.
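
A patch of the umlaut kind might look roughly like the following sketch. It is in Python for illustration only, not the original code; the table UMLAUT_MAP and the function latinize_umlauts are hypothetical names, and including ß in the table is an assumption.

```python
# Hypothetical lookup table: each umlaut maps to a two-letter ASCII spelling.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",  # assumption: ß is handled the same way
})

def latinize_umlauts(name: str) -> str:
    """Replace German umlauts and ß with ASCII equivalents."""
    return name.translate(UMLAUT_MAP)

print(latinize_umlauts("Jörg Müller"))  # Joerg Mueller
```

A table like this only covers the characters someone thought to list, so it helps with German names and nothing else.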

The Solution

What I discovered about all this was quite interesting. The solution is a bit complicated, and involves tweaking Apache, MySQL, PHP, and HTML so that they all work together to support UTF-8. The good news is that once these changes are made, everything works together like a well-oiled machine. I believe we would all be better off if all of these tweaks needed for UTF-8 were instead the standards implemented on all systems by default. Optimization for other uses would still be possible, after all.

Of all the resources I found online about these issues, these two were the most helpful:

Though these pages may seem a little old, the information is still relevant, although not entirely error-free. The mistakes are minor and easy to spot, and in many cases have to do with outdated coding practices in PHP. The complete list of changes which need to be made is given on these pages, and anyone wanting to support UTF-8 would do well to study these references thoroughly.

The main points as I see them:

Apache: Headers sent from Apache can conflict with meta-charset header tags in HTML documents
MySQL: there are several character-set and collation settings which must be changed to ‘utf8’
PHP: is encoding agnostic and has native string functions that may corrupt UTF-8 data
HTML: meta-charset header tags may not be enough; use the accept-charset attribute on database-related forms
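
Why every layer has to agree on the encoding can be seen in a small sketch. It is written in Python for illustration only and is not the code running on the sites. When UTF-8 bytes are interpreted as latin1 at any one step, each non-ASCII character turns into two or more wrong characters, which is one common way for names to end up garbled.

```python
name = "Jörg"

# One layer writes the text as UTF-8 bytes...
utf8_bytes = name.encode("utf-8")

# ...and another layer wrongly reads those bytes as latin1.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # JÃ¶rg

# The damage is reversible only if nothing else altered the bytes.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # Jörg

# Non-Latin scripts cannot be represented in latin1 at all.
try:
    "Юрий".encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)  # False
```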

Apache

Add this line to all .htaccess files so that the HTTP Content-Type header sent with your pages declares UTF-8: AddDefaultCharset UTF-8

MySQL

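Change all character sets to utf8. (MySQL's utf8 stores at most three bytes per character; utf8mb4, available since MySQL 5.5.3, is needed for four-byte characters such as emoji.) You can see what your settings are using the SQL command: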

show variables like 'char%';

character_set_client utf8
character_set_connection utf8
character_set_filesystem binary
character_set_results utf8
character_set_system utf8

The filesystem setting remains binary, of course. If you have shared hosting, you will not have access to your server or database connection settings. I thought this would cause issues, but it turns out that leaving the following two as latin1 poses no problems.

character_set_database latin1
character_set_server latin1

To change these settings, execute SQL commands such as the following: set global character_set_client = utf8;

Use the phpMyAdmin Operations tab to set all collation settings (server, database, all tables, and all columns within your tables) to either utf8_general_ci or utf8_unicode_ci. I made all my changes manually to be safe. This was extremely time-consuming. There are also scripts available to make these changes via code, which in my opinion you should only use if you read through the code and understand it all. And, of course, back up your database first.

PHP

Add this line directly after your database connection is made: mysqli_set_charset($con, 'utf8');
Where appropriate, replace native string functions with UTF-8-safe functions. Use the mbstring extension (https://php.net/manual/en/book.mbstring.php) or the UTF-8-safe functions supplied on the pages linked above.
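
The difference between byte-oriented and character-aware string handling can be sketched as follows. The example is in Python rather than PHP, purely for illustration: working on the encoded bytes stands in for a byte-oriented function such as PHP's strlen, and working on the decoded text stands in for a counterpart such as mb_strlen.

```python
name = "Müller"
raw = name.encode("utf-8")  # the bytes a byte-oriented function sees

print(len(name))  # 6 characters
print(len(raw))   # 7 bytes, because ü occupies two bytes in UTF-8

# Truncating by bytes can cut a multi-byte character in half.
cut = raw[:2]
damaged = cut.decode("utf-8", errors="replace")
print(damaged)  # M followed by the replacement character U+FFFD

# Truncating by characters keeps the text valid.
print(name[:2])  # Mü
```

Any function that counts, truncates, pads, or changes case is a candidate for this kind of damage, so those are the calls worth checking first.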

HTML

Add the attribute accept-charset="UTF-8" to all database-related HTML form elements. Although this is not strictly required when the meta-charset header tag is used, some browsers may not respect the meta tag properly. See W3C

Conclusion

What I have briefly outlined here may seem like a mere technical matter, but the result is something quite human, and personal. I'm pleased that after tweaking the system, someone from another culture who speaks another language can create a user account through my sites using the correct characters for their own name in their own language. To me this was not just a practical issue of record keeping. It was also an important matter of respecting others.

There are some use cases requiring further qualifications; again, a thorough reading of the cited resources is a good idea if you are working through similar issues.

In Other News...

Besides today’s organ practice, during which I decided BWV 536 will be the first fugue I will learn in 2016, the rest of my week was occupied with software development and preparing keyboard hardware for DIY orders. MIDI Tapper beta 55 will be released early, since a major problem was solved concerning MIDI File export, and a Custom Scale Editor update will be issued as well, since a new feature has been added at customer request.

Regards,
Aaron