Let me talk a bit about my latest little project, written in Python 3.

The problem

I'm part of the team behind Ultimate Host Blacklist and other similar projects, where we often encounter domains that PyFunceble flags as INVALID.

So how can we convert each domain of the INVALID.txt generated by PyFunceble to IDNA format so we can reintroduce them for testing? My answer: domain2idna.

Understanding Punycode and IDNA

Before reading on, I invite you to read the following from charset.com, which explains Punycode/IDNA:

Punycode is an encoding syntax by which a Unicode (UTF-8) string of characters can be translated into the basic ASCII-characters permitted in network host names. Punycode is used for internationalized domain names, in short IDN or IDNA (Internationalizing Domain Names in Applications).

For example, when you would type café.com in your browser, your browser (which is the IDNA-enabled application) first converts the string to Punycode "xn--caf-dma.com", because the character 'é' is not allowed in regular domain names. Punycode domains won't work in very old browsers (Internet Explorer 6 and earlier).

Find more detailed info in the specification RFC 3492.

As another example, a domain like lifehacĸer.com (note the ĸ in place of the k) is actually translated to xn--lifehacer-1rb.com. You may not encounter this kind of domain in your daily browsing, but when it comes to hosts files, we encounter them almost everywhere.
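To make the conversion concrete, here is a minimal sketch that reproduces the café.com example with Python's standard library only. It does not use domain2idna; it relies on the built-in "idna" and "punycode" codecs, and the built-in "idna" codec follows the older IDNA 2003 rules, so results for some exotic characters may differ from other tools.

```python
# Standard-library sketch; domain2idna itself is not used here.
unicode_domain = "café.com"

# The "idna" codec converts each label and adds the "xn--" prefix where needed.
ascii_domain = unicode_domain.encode("idna").decode("ascii")
print(ascii_domain)  # xn--caf-dma.com

# The "punycode" codec encodes a single label, without the "xn--" prefix.
label = "café".encode("punycode").decode("ascii")
print(label)  # caf-dma

# Decoding with the "idna" codec goes back to the Unicode form.
round_trip = ascii_domain.encode("ascii").decode("idna")
print(round_trip)  # café.com
```

The "xn--" prefix is what marks a label as Punycode-encoded, which is why the raw "punycode" codec output lacks it.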

Indeed, today IDNA-formatted domains are mostly used for phishing, as described in this Hacker News article, which explains the danger of IDNA in more depth.

About domain2idna

Domain2idna can be found on GitHub and is ready to use!

It can be used in two different ways: as an imported module or as a command-line tool.

As an imported module

Because Python allows an installed module to be imported, here is an example of how to use domain2idna in existing code or infrastructure.

#!/usr/bin/env python3

"""
This module uses domain2idna to convert a given domain.

Author:
    Nissar Chababy, @funilrys, contactTATAfunilrysTODTODcom

Contributors:
    Let's contribute to this example!!

Repository:
    https://github.com/funilrys/domain2idna
"""

from colorama import Style
from colorama import init as initiate

from domain2idna.core import Core

DOMAINS = [
    "bittréẋ.com", "bịllogram.com", "coinbȧse.com", "cryptopiạ.com", "cṙyptopia.com"
]

# We activate the automatic reset of string formatting
initiate(autoreset=True)

# The following returns the converted form of the whole list.
print(
    "%sList of converted domains:%s %s"
    % (Style.BRIGHT, Style.RESET_ALL, Core(DOMAINS).to_idna())
)

# The following returns the converted form of a single element.
print(
    "%sString representing a converted domain:%s %s"
    % (Style.BRIGHT, Style.RESET_ALL, Core(DOMAINS[-1]).to_idna())
)

That is a simple example to understand how domain2idna works.

As you can see, domain2idna can return two types: a list or a str. Because I'll mostly use domain2idna to convert big lists, I wrote it so that it can take a list and return a list of the converted domains. On the other hand, as most people will want the IDNA format of a single domain, domain2idna also returns a str if a string is given as input.
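The list-or-string behaviour can be illustrated with a small sketch. This is not domain2idna's actual implementation: the function name convert_to_idna is hypothetical, and the conversion itself is delegated to Python's built-in "idna" codec as a stand-in.

```python
def convert_to_idna(subject):
    """Hypothetical helper: return a list for a list input, a str for a str input."""

    def convert(domain):
        # Stand-in conversion based on the standard-library "idna" codec.
        return domain.encode("idna").decode("ascii")

    if isinstance(subject, list):
        # A list goes in, a list of converted domains comes out.
        return [convert(domain) for domain in subject]

    # Anything else is treated as a single domain string.
    return convert(subject)


print(convert_to_idna("café.com"))
print(convert_to_idna(["café.com", "example.org"]))
```

Dispatching on the input type keeps the calling code simple: the same call works for a single domain and for a whole list, at the cost of a return type that depends on the argument.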

As a command-line tool

This part is less "interesting", but the following usage message explains well how it works.

usage: domain2idna [-h] [-d DOMAIN] [-f FILE] [-o OUTPUT]

domain2idna - A tool to convert a domain or a file with a list of domain to
the famous IDNA format.

optional arguments:
-h, --help            show this help message and exit
-d DOMAIN, --domain DOMAIN
                    Set the domain to convert.
-f FILE, --file FILE  Set the domain to convert.
-o OUTPUT, --output OUTPUT
                    Set the file where we write the converted domain(s).

Crafted with ♥ by Nissar Chababy (Funilrys)
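The file-to-file workflow suggested by the -f and -o options can be approximated in a few lines. This is only a sketch under assumptions: the function name convert_stream is hypothetical, the input is assumed to hold one bare domain per line, blank lines are skipped, and the conversion uses Python's built-in "idna" codec rather than domain2idna itself. In-memory streams stand in for the input and output files.

```python
import io


def convert_stream(source, destination):
    """Hypothetical helper: read one domain per line, write its IDNA form."""
    for line in source:
        domain = line.strip()
        if not domain:
            # Skip empty lines instead of trying to convert them.
            continue
        destination.write(domain.encode("idna").decode("ascii") + "\n")


# In-memory stand-ins for an input file such as INVALID.txt and an output file.
source = io.StringIO("café.com\n\nexample.org\n")
destination = io.StringIO()

convert_stream(source, destination)
print(destination.getvalue(), end="")
```

With real files, the two StringIO objects would be replaced by files opened with open(..., encoding="utf-8").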

In conclusion, it was fun to write this little project, and I hope it will help the open-source community!

That's it for the presentation of the project! A detailed code walkthrough may come soon in the programming section.

Thanks for reading.