Those who follow the Gemini mailing list may have noticed a message or two about IDNs and IRIs. This is the first time I'm taking a deeper look at this stuff, so here is what I've learned.
When it comes to Internationalized Domain Names, I have been blissfully unaware that they basically rely on a kludge: a complicated, special encoding (Punycode, RFC 3492) has to be applied to convert Unicode domains to a small-ish ASCII representation. Well, RFC 3492 is 17 years old, so surely this is something that happens under the hood, a minor implementation detail in the OS? Alas, internationalization has been left to the application layer to worry about, so it needs to be handled manually.
Since Gemini allows UTF-8 encoded URLs, implementing RFC 3492 is virtually a requirement. Otherwise, one cannot make DNS lookups if the domain name contains non-ASCII characters.
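To make the encoding step concrete, here is a minimal sketch in Python rather than in Lagrange's own C code. It relies on the `punycode` and `idna` codecs from Python's standard library; the `idna` codec follows the older IDNA 2003 rules, which is an assumption that may not match what a particular client needs. The function names `punycode_label` and `to_ascii_hostname` are made up for illustration.

```python
def punycode_label(label: str) -> str:
    """Encode a single non-ASCII label with RFC 3492 Punycode.

    The "xn--" prefix marks the label as an encoded one. No case folding
    or other preparation is done here; this is only the raw encoding.
    """
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_hostname(hostname: str) -> str:
    """Convert a whole hostname to the ASCII form used for DNS lookups.

    Uses the standard library's "idna" codec (IDNA 2003), which prepares
    each label and leaves labels that are already ASCII unchanged.
    """
    return hostname.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(punycode_label("bücher"))               # xn--bcher-kva
    print(to_ascii_hostname("bücher.example"))    # xn--bcher-kva.example
```

The ASCII result is what gets passed to the resolver, while the user still sees the Unicode form of the domain.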
As to the rest of the URL, the story is a bit simpler: normalization and escaping reserved characters. The former is needed because Unicode has multiple ways to represent the same character. Applications that deal with UTF-8 already need to use some sort of Unicode library to actually conform to the standard. Such a library should have routines for normalization, so that's one problem that's easy to deal with. (Lagrange uses GNU libunistring.) The other issue is handled by percent-encoding reserved characters, which is also straightforward.
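As a rough illustration of these two steps, the Python sketch below normalizes a path and then percent-encodes it. It is not how Lagrange does it (Lagrange is written in C and uses libunistring). The choice of the NFC normalization form is an assumption, as is the hypothetical function name `normalize_and_escape_path`. The sketch also assumes the input has not been percent-encoded already; otherwise existing `%` characters would be escaped a second time.

```python
import unicodedata
from urllib.parse import quote


def normalize_and_escape_path(path: str) -> str:
    """Normalize a URL path to NFC and percent-encode it.

    NFC composes sequences such as "e" + U+0301 (combining acute accent)
    into the single code point U+00E9, so both spellings of the same
    character end up as the same bytes. quote() then escapes everything
    outside its safe set as UTF-8 percent-encoded bytes; "/" is kept so
    that path segments stay intact.
    """
    normalized = unicodedata.normalize("NFC", path)
    return quote(normalized, safe="/")


if __name__ == "__main__":
    decomposed = "/cafe\u0301 menu"   # "e" followed by a combining accent
    precomposed = "/caf\u00e9 menu"   # single precomposed "é"
    print(normalize_and_escape_path(decomposed))    # /caf%C3%A9%20menu
    print(normalize_and_escape_path(precomposed))   # /caf%C3%A9%20menu
```

Both spellings of the path produce the same escaped URL, which is the whole point of normalizing before escaping.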
All these encodings and translations should happen automatically and transparently.
Lagrange v0.13 embraces Unicode in both domain names and URL paths:
Speaking of Unicode, actually rendering it on screen is not straightforward at all. Lagrange uses custom text rendering routines that currently only support left-to-right text. A small number of special Unicode codepoints (such as soft hyphens) are recognized and handled, but many others, such as variation selectors, are simply ignored.
Version 0.13 has a bunch of improvements for text rendering: