Comparing the Usage of Non-Latin Character Sets in Top Level Domains
According to Wikipedia
Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the Letter-Digit-Hyphen (LDH) subset. For example, München (German name for Munich) is encoded as Mnchen-3ya.
Domains using this Punycode system start with xn– and is the only way for domains to include characters that are not A-z, 0-9, or a hyphen (-). For example the TLD “商城.” becomes “xn–czru2d.” when converted to Punycode.
Recently I filed a Centralized Zone Data Service request for the zone files for all Top Level Domains (TLDs) that used Punycode to encode characters. This request got me access to the zone files for 69 TLDs that used Punycode. Using these zone files one can find out the names of all Second Level Domains (SLDs) for each TLD, which allowed me to analyze the character sets used in the domains as well as the popularity of TLDs.
The first bit of data I collected was the breakdown of what character sets the TLDs I had collected were using. Below is a doughnut chart showing the distribution.
As you can see, the majority of Punycode based TLDs are using Chinese, Japanese, and Korean characters. In Unicode, these are all mixed into one category of CJK. Note that the CJK character is used in just over half of the TLDs. This next graph was created by finding the character set used in the TLD for all 168 thousand SLDs from the 69 zone files I used.
This graph tells a completely different story than the last. It shows that Punycode based domains are mostly if not all using CJK, with 90.4% being CJK, 2.3% being Katakana (a Japanese script not included in the CJK block in Unicode), and 1.1% being Hiragana (another Japanese script). This means that out of all of the SLDs that use Punycode TLDs 94% use a TLD with CJK, Katakana, or Hiragana even though only 65% of TLDs are CJK, Katakana, or Hiragana. Another notable point is that Arabic TLDs while making up 14.5% of TLDs are only used on 0.5% of SLDs with Punycode TLDs.
Looking at the chart of the most common Punycode TLDs shows a similar story.
All of the top 10 Punycode TLDs are CJK, Katakana, which makes sense considering their previously mentioned 94% monopoly over Punycode TLD usage. Next I decided to look at the character used in SLDs.
As with the previous examples CJK makes up a vast majority, but this time latin is the next most common. This makes sense because there were a considerable amount of company domains registered as well as these weird domains that looked like “48e41vbg8k4cq64po86561pkwo7l0p9g4b”. After some research and consulting with colleagues, it was found out that these domains were likely for DNSSEC. Since they are not used by a person, but just by the DNS resolver, I decided to filter those out. Below is the SLD character set graph without these DNSSEC domains.
The Latin category went down as would be expected with the CJK category also going down a slight bit due to an issue with my regex.
Next, is a graph showing how many SLDs had the same character set as their TLD. For instance “グリーンズ.コム” or “xn–qckua7llb5b.xn–tckwe.” has the same character set of Katakana in both its TLD and SLD.
Like previous graphs, it pretty much just shows CJK with a slight hint of Hangul. This is once again due to the high usage of CJK TLDs in comparison with others which has been the main thing found in these graphs. CJK based TLDs in comparison with other character sets are the most widely adopted and used. Future research in this area could involve checking WHOIS records for the TLDs and SLDs to find the countries these domains are registered in and having a headless browser attempt to load up each site.