blog.bejarano.io

Domain name validation with regular expressions

I like regular expressions.

I have a love-hate relationship with regular expressions: they are great for efficient pattern matching, but their black magic syntax makes them hard to read and write.

Once you've crafted a bunch, you realize they are not that hard to write, but it's easy to miss edge cases if you didn't set your rules beforehand, especially when dealing with something as complex as domain name validation.

Domain names are one of the most common input fields online. In fact, the browser you are reading this with surely has one big input field at the top that accepts domain names, among other things.

The rules

The Internet Engineering Task Force defines domain names as follows:

  1. Domain names are made with labels, separated by dots (.)
  2. Labels are sequences of 1 to 63 octets
  3. Total domain name length is limited to 255 characters
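
To make those three rules concrete, here is a minimal Python sketch that checks them and nothing else. The function name satisfies_ietf_rules is made up for illustration, and I'm assuming a trailing dot should be rejected, since it would leave an empty last label.

```python
def satisfies_ietf_rules(name: str) -> bool:
    """Check only the three structural rules listed above (hypothetical helper)."""
    # Count octets rather than characters; for ASCII names they are the same.
    octets = name.encode("utf-8")

    # Rule 3: total length is limited to 255.
    if len(octets) > 255:
        return False

    # Rule 1: labels are separated by dots.
    labels = octets.split(b".")

    # Rule 2: every label is 1 to 63 octets long.
    # An empty name, a doubled dot, or a trailing dot yields an empty label.
    return all(1 <= len(label) <= 63 for label in labels)
```

Note that this accepts any octets inside a label, which is exactly why the definition on its own is looser than what we usually mean by a domain name.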

This definition is great, but it's not what we generally use. We generally use a mixture of "domain name" and "hostname", a more restrictive syntax for labels that limits characters to lowercase/uppercase letters (a-z, A-Z), digits (0-9) and hyphens (-), where a label can contain hyphens but not start or end with one. Over time, underscores (_) have also made their way into domain names.

Internet domain names also end in a set of IANA-issued top-level domains, which are subject to the same rules as labels except that none is shorter than 2 characters and that not a single one has an underscore in it.
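
Before reaching for a regular expression, the label rules can be spelled out in plain Python. This is only a sketch of the rules as I've described them; is_valid_label and its is_tld flag are names I picked for the example.

```python
import string

# Characters allowed in a label: letters, digits, underscore and hyphen.
LABEL_CHARS = set(string.ascii_letters + string.digits + "_-")


def is_valid_label(label: str, is_tld: bool = False) -> bool:
    """Check one label against the hostname-style rules (hypothetical helper)."""
    # Labels are 1 to 63 characters long.
    if not 1 <= len(label) <= 63:
        return False
    # Only the allowed characters may appear.
    if not set(label) <= LABEL_CHARS:
        return False
    # Hyphens may appear inside a label, but not at either end.
    if label[0] == "-" or label[-1] == "-":
        return False
    # Top-level domains are at least 2 characters and never have underscores.
    if is_tld:
        return len(label) >= 2 and "_" not in label
    return True
```

For example, is_valid_label("_dmarc") passes, while is_valid_label("c", is_tld=True) does not.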

The regular expression

^(?=.{4,255}$)([a-zA-Z0-9_]([a-zA-Z0-9_-]{0,61}[a-zA-Z0-9_])?\.){1,126}[a-zA-Z0-9][a-zA-Z0-9-]{0,61}[a-zA-Z0-9]$

The above regular expression matches domain names of up to 255 characters, made of one or more labels of up to 63 characters each, separated by dots, and ending in a top-level domain of 2 to 63 characters. Labels can contain lowercase letters, uppercase letters, digits, underscores and hyphens, although a hyphen cannot be the first or last character of a label.
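
Here is a minimal sketch of how the expression could be used from Python. The names DOMAIN_RE and is_valid_domain are mine, and the label separator is written as an escaped dot (\.) so that it only matches a literal dot rather than any character.

```python
import re

# The pattern is split across lines for readability only.
DOMAIN_RE = re.compile(
    r"^(?=.{4,255}$)"                                             # total length
    r"([a-zA-Z0-9_]([a-zA-Z0-9_-]{0,61}[a-zA-Z0-9_])?\.){1,126}"  # labels
    r"[a-zA-Z0-9][a-zA-Z0-9-]{0,61}[a-zA-Z0-9]$"                  # top-level domain
)


def is_valid_domain(name: str) -> bool:
    """Return True if the name matches the expression (hypothetical helper)."""
    # fullmatch is used because Python's "$" also matches just before a
    # trailing newline, which match() or search() would let through.
    return DOMAIN_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("blog.bejarano.io", "_dmarc.example.com",
                      "-bad.example.com", "example.c", "localhost"):
        print(candidate, is_valid_domain(candidate))
```

The first two candidates should be accepted; the other three should be rejected for a leading hyphen, a one-character top-level domain, and a missing dot respectively.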

That should be good enough. I'm 99.9% sure it matches everything out there, but again, regular expressions make it easy to miss edge cases, so if you find one, don't hesitate to let me know!