Let me tell you a bit about my latest little project, written in Python 3.
The problem
As part of the team behind Ultimate Host Blacklist and other similar projects, I often encounter domains that PyFunceble flags as INVALID.
So how can we convert each domain in the INVALID.txt file generated by PyFunceble to IDNA format so that we can reintroduce them for testing?
My answer: domain2idna.
Understanding Punycode and IDNA
Before continuing, I invite you to read the following excerpt from charset.com, which explains Punycode/IDNA:
Punycode is an encoding syntax by which a Unicode (UTF-8) string of characters can be translated into the basic ASCII-characters permitted in network host names. Punycode is used for internationalized domain names, in short IDN or IDNA (Internationalizing Domain Names in Applications).
For example, when you would type café.com in your browser, your browser (which is the IDNA-enabled application) first converts the string to Punycode "xn--caf-dma.com", because the character 'é' is not allowed in regular domain names. Punycode domains won't work in very old browsers (Internet Explorer 6 and earlier).
Find more detailed info in the specification RFC 3492.
As another example, a domain like lifehacĸer.com (note the ĸ in place of the k) is actually translated to xn--lifehacer-1rb.com. You may not encounter this kind of domain in your daily browsing, but when working with hosts files, we encounter them almost everywhere.
Indeed, IDNA-formatted domains are nowadays frequently used for phishing, as described in this Hacker News article, which explains the dangers of IDNA in more depth.
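To make the café.com example from the quote above concrete, here is a minimal sketch that uses only Python's built-in "idna" codec. It is not domain2idna itself: the helper names to_ascii and to_unicode are made up for this illustration, and the built-in codec follows the older IDNA 2003 rules, so results for some edge cases may differ from other tools.

```python
def to_ascii(domain: str) -> str:
    """Encode a Unicode domain name into its ASCII (Punycode) form."""
    # Python's built-in "idna" codec encodes each label separately and
    # adds the "xn--" prefix to labels that contain non-ASCII characters.
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Decode an ASCII (Punycode) domain name back into Unicode."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("café.com"))  # xn--caf-dma.com
print(to_unicode("xn--caf-dma.com"))  # café.com
```

Plain ASCII domains pass through unchanged, which is why the same function can be applied to a whole list of mixed domains.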
About domain2idna
domain2idna can be found on GitHub and is ready to use!
It can be used in two different ways: as an imported module or as a command-line tool.
As an imported module
Since Python allows an installed module to be imported, here is an example of how to use domain2idna in existing code or infrastructure.
#!/usr/bin/env python3
"""
This module uses domain2idna to convert a given domain.

Author:
    Nissar Chababy, @funilrys, contactTATAfunilrysTODTODcom

Contributors:
    Let's contribute to this example!!

Repository:
    https://github.com/funilrys/domain2idna
"""

from colorama import Style
from colorama import init as initiate
from domain2idna.core import Core

DOMAINS = [
    "bittréẋ.com",
    "bịllogram.com",
    "coinbȧse.com",
    "cryptopiạ.com",
    "cṙyptopia.com",
]

# We activate the automatic reset of string formatting after each print.
initiate(autoreset=True)

# A list as input: we print the list of all converted domains.
print(
    "%sList of converted domains:%s %s"
    % (Style.BRIGHT, Style.RESET_ALL, Core(DOMAINS).to_idna())
)

# A string as input: we print the single converted domain.
print(
    "%sString representing a converted domain:%s %s"
    % (Style.BRIGHT, Style.RESET_ALL, Core(DOMAINS[-1]).to_idna())
)
That is a simple example to understand how domain2idna works.
As you can see, domain2idna can return two types: a list or a str. Because I mostly use domain2idna to convert big lists, I wrote it so that it can take a list and return a list of the converted domains. On the other hand, since most people will only want the IDNA format of a single domain, domain2idna also returns a str if a string is given as input.
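The "list in, list out; str in, str out" behaviour described above can be sketched in a few lines. This is only an illustration of the idea, not the actual code of domain2idna: the function name convert is hypothetical, and the conversion itself is delegated to Python's built-in "idna" codec as a stand-in.

```python
from typing import List, Union


def convert(subject: Union[str, List[str]]) -> Union[str, List[str]]:
    """Return a list for a list input and a str for a str input."""
    if isinstance(subject, list):
        # A list is converted element by element, keeping the order.
        return [convert(item) for item in subject]
    # Stand-in conversion based on the standard library,
    # not domain2idna's own implementation.
    return subject.encode("idna").decode("ascii")


print(convert("café.com"))
print(convert(["café.com", "example.com"]))
```

Keeping the return type aligned with the input type means the caller never has to unwrap a one-element list when converting a single domain.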
As a command-line tool
This part is less "interesting", but the following usage message explains quite well how it works.
usage: domain2idna [-h] [-d DOMAIN] [-f FILE] [-o OUTPUT]

domain2idna - A tool to convert a domain or a file with a list of domain to
the famous IDNA format.

optional arguments:
  -h, --help            show this help message and exit
  -d DOMAIN, --domain DOMAIN
                        Set the domain to convert.
  -f FILE, --file FILE  Set the domain to convert.
  -o OUTPUT, --output OUTPUT
                        Set the file where we write the converted domain(s).

Crafted with ♥ by Nissar Chababy (Funilrys)
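For readers curious how such a command line could be wired up, here is a rough sketch with argparse that mirrors the three flags shown above. It is an assumption-laden illustration and not the real domain2idna entry point: the program name idna-sketch, the functions to_idna and main, and the "one domain per line" file format are all hypothetical, and the conversion again relies on Python's built-in "idna" codec.

```python
import argparse
from typing import List, Optional


def to_idna(domain: str) -> str:
    # Standard-library stand-in, not domain2idna's own conversion code.
    return domain.encode("idna").decode("ascii")


def main(argv: Optional[List[str]] = None) -> List[str]:
    """Parse the arguments, convert the domains and return them."""
    parser = argparse.ArgumentParser(prog="idna-sketch")
    parser.add_argument("-d", "--domain", help="A single domain to convert.")
    parser.add_argument("-f", "--file", help="A file with one domain per line.")
    parser.add_argument("-o", "--output", help="The file to write the result to.")
    args = parser.parse_args(argv)

    domains: List[str] = []
    if args.domain:
        domains.append(args.domain)
    if args.file:
        # Assumption: one domain per line; blank lines are skipped.
        with open(args.file, encoding="utf-8") as handle:
            domains.extend(line.strip() for line in handle if line.strip())

    converted = [to_idna(domain) for domain in domains]

    if args.output:
        with open(args.output, "w", encoding="utf-8") as handle:
            handle.write("\n".join(converted) + "\n")
    else:
        print("\n".join(converted))
    return converted


# Equivalent to running: idna-sketch -d café.com
main(["-d", "café.com"])
```

Passing the argument list explicitly to main, instead of reading sys.argv directly, keeps the sketch easy to call from other code or from a test.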
In conclusion, it was fun to write this little project, and I hope it will help the open-source community!
That's it for the presentation of the project! A detailed walkthrough of the code may follow soon in the programming section.
Thanks for reading.