Regex for Matching Protocol and Host of a URL-string

A JavaScript (JS) Regex that matches the protocol and host of a URL.

Post meta:

I needed to keep only the pathname of several URL/location-strings in a Node.js app. Just in case, I made the Regex more general so that it matches the protocol and host substring of any URL. I’m sure I’m not alone in forgetting Regex syntax between the few times a year I have to use it, so maybe this will help someone like me.

As far as I know, the protocol and host of a URL are identified by:

  • Always being at the start of a URL-string.
  • Always ending at the first occurrence of a “?”, a “#”, a single “/”, or end of input.
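
To make these two rules concrete before turning them into a Regex, here is a minimal sketch that applies them with a plain loop. The function name protocolAndHostPrefix and the example URLs are my own assumptions for illustration, not part of the original code.

```javascript
// Hypothetical helper: returns the protocol-and-host prefix of a URL-string
// by scanning for the first "?", "#", or single "/" (or end of input).
function protocolAndHostPrefix(urlString) {
  for (let i = 0; i < urlString.length; i++) {
    const ch = urlString[i]
    if (ch === '?' || ch === '#') return urlString.slice(0, i)
    // A "/" only ends the prefix when it has no "/" neighbour,
    // so the "//" after the protocol is skipped.
    if (ch === '/' && urlString[i - 1] !== '/' && urlString[i + 1] !== '/') {
      return urlString.slice(0, i)
    }
  }
  return urlString // no terminator found: the whole string is the prefix
}

console.log(protocolAndHostPrefix('https://example.com/docs?page=2')) // "https://example.com"
console.log(protocolAndHostPrefix('https://example.com'))             // "https://example.com"
```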

The Regex

Here’s my Regex-variable, though I’m not sure it’s optimal:

const matchProtocolAndHost =
    /.*?(?=(\?|#|(?<!\/)\/(?!\/)|$))/

It consists of two parts: .*? and (\?|#|(?<!\/)\/(?!\/)|$). The latter is ‘wrapped’ in (?=…), which I’ll get back to.
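
As a rough usage sketch, this is how the Regex could be applied with match() and replace(). The example URL is made up, and note that replace() leaves the query and fragment in place along with the pathname.

```javascript
const matchProtocolAndHost =
    /.*?(?=(\?|#|(?<!\/)\/(?!\/)|$))/

const url = 'https://example.com/blog/post?id=42#comments'

// match() returns the protocol-and-host prefix as element 0.
const prefix = url.match(matchProtocolAndHost)[0]
console.log(prefix) // "https://example.com"

// replace() without the g flag removes only that first match,
// leaving the pathname, query and fragment.
const rest = url.replace(matchProtocolAndHost, '')
console.log(rest) // "/blog/post?id=42#comments"

// A URL-string with no protocol or host yields an empty match.
const emptyPrefix = '/blog/post'.match(matchProtocolAndHost)[0]
console.log(emptyPrefix) // ""
```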

The First Part

.*? matches everything up to the second part. Let’s break it apart:

  • . matches any single character except line terminators: \n, \r, \u2028 or \u2029.
  • * matches the preceding item (. in this case) 0 or more times.
  • ? (non-greedy): if used immediately after any of the quantifiers *, +, ?, or {}, it makes the quantifier non-greedy (matching the minimum number of times). In other words, the Regex matches only up to the first position where the subsequent part (part two in this instance) can match.
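
To see what the non-greedy ? changes, here is a small comparison of the Regex with and without it. The variable names and the example URL are assumptions for illustration.

```javascript
const url = 'https://example.com/a/b'

// Lazy .*? (as in the Regex above): stops at the first position
// where the lookahead succeeds, i.e. the first single "/".
const lazy = /.*?(?=(\?|#|(?<!\/)\/(?!\/)|$))/
const lazyResult = url.match(lazy)[0]
console.log(lazyResult) // "https://example.com"

// Greedy .* (same Regex without the "?"): grabs the whole line first,
// and the lookahead already succeeds there because of "$".
const greedy = /.*(?=(\?|#|(?<!\/)\/(?!\/)|$))/
const greedyResult = url.match(greedy)[0]
console.log(greedyResult) // "https://example.com/a/b"
```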

The Second Part

(\?|#|(?<!\/)\/(?!\/)|$) matches the four possible endings:

  • \? (“?”)
  • # (“#”)
  • a single “/”, via the more complex (?<!\/)\/(?!\/), which matches a “/” only if it is neither preceded nor followed by another “/”, so the “//” after the protocol is skipped
  • $ for end of input
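
The single-“/” alternative can be tried on its own. This sketch (variable names and URLs are made up) compares it with a plain indexOf, and also shows a side effect: a doubled “/” inside a path is skipped too. The lookbehind (?<!…) was added to JavaScript in ES2018, so I’m assuming a reasonably recent Node.js.

```javascript
const singleSlash = /(?<!\/)\/(?!\/)/

const url = 'https://example.com/docs'

// indexOf finds the first "/" of the "//" after the protocol.
const firstSlash = url.indexOf('/')
console.log(firstSlash) // 6

// The lookbehind and lookahead reject both slashes of "//",
// so the first match is the "/" before "docs".
const firstSingleSlash = url.search(singleSlash)
console.log(firstSingleSlash) // 19

// Side effect: a doubled "/" is never treated as a single "/".
const doubled = 'a//b'.search(singleSlash)
console.log(doubled) // -1
```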

The ‘wrapping’ (?=…) is a lookahead: it checks that the string that follows matches the expression, without consuming it. This is used so the character that ends the host-string isn’t included in the match.
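
A quick way to see the effect of the lookahead is to compare the Regex with a variant where the second part is an ordinary group. The variant and the variable names are my own, added only for comparison.

```javascript
const url = 'https://example.com/docs'

// With the lookahead, the terminating "/" is checked but not consumed.
const withLookahead = /.*?(?=(\?|#|(?<!\/)\/(?!\/)|$))/
const withResult = url.match(withLookahead)[0]
console.log(withResult) // "https://example.com"

// Without it, the terminating "/" becomes part of the match.
const withoutLookahead = /.*?(\?|#|(?<!\/)\/(?!\/)|$)/
const withoutResult = url.match(withoutLookahead)[0]
console.log(withoutResult) // "https://example.com/"
```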
