It is incorrect to "normalize" // in HTTP URL paths

Collapsing double slashes in HTTP URL paths is often mistaken for normalization. According to RFC 3986 and RFC 9110, an empty path segment is valid syntax, so removing it changes the resource identifier rather than normalizing it.

(See discussion on Lobsters.)

Collapsing // to / inside an HTTP URL path is not normalization.

The URI syntax permits empty path segments

RFC 3986 defines the path component and the segment grammar in a way that allows for empty segments. A double slash is therefore syntactically meaningful. It represents a zero-length segment between two separators.

3.3. Path

The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component (Section 3.4), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any). The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI.

If a URI contains an authority component, then the path component must either be empty or begin with a slash ("/") character. If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//"). In addition, a URI reference (Section 4.1) may be a relative-path reference, in which case the first path segment cannot contain a colon (":") character. The ABNF requires five separate rules to disambiguate these cases, only one of which will match the path substring within a given URI reference. We use the generic term “path component” to describe the URI substring matched by the parser to one of these rules.

    path          = path-abempty    ; begins with "/" or is empty
                  / path-absolute   ; begins with "/" but not "//"
                  / path-noscheme   ; begins with a non-colon segment
                  / path-rootless   ; begins with a segment
                  / path-empty      ; zero characters

    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    path-empty    = 0<pchar>

    segment       = *pchar
    segment-nz    = 1*pchar
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                  ; non-zero-length segment without any colon ":"

    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

A path consists of a sequence of path segments separated by a slash ("/") character. A path is always defined for a URI, though the defined path may be empty (zero length). Use of the slash character to indicate hierarchy is only required when a URI will be used as the context for relative references. For example, the URI mailto:fred@example.com has a path of “fred@example.com”, whereas the URI foo://info.example.com?fred has an empty path.

The path segments “.” and “..”, also known as dot-segments, are defined for relative reference within the path name hierarchy. They are intended for use at the beginning of a relative-path reference (Section 4.2) to indicate relative position within the hierarchical tree of names. This is similar to their role within some operating systems’ file directory structures to indicate the current directory and parent directory, respectively. However, unlike in a file system, these dot-segments are only interpreted within the URI path hierarchy and are removed as part of the resolution process (Section 5.2).

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax.

Because segment = *pchar, the empty string is a valid segment. Therefore, path-abempty = *( "/" segment ) allows a slash followed by an empty segment. Any transformation that collapses // to / removes a syntactically valid segment and thus changes the parsed sequence of segments.
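To make the grammar argument concrete, here is a minimal Python sketch. The names `PATH_ABEMPTY`, `is_path_abempty`, and `segments` are illustrative helpers, not part of any library; the regular expression is a hand transcription of the `path-abempty` and `pchar` rules quoted above.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
# path-abempty = *( "/" segment ), with segment = *pchar
PATH_ABEMPTY = re.compile(rf"(?:/{PCHAR}*)*")


def is_path_abempty(path: str) -> bool:
    """Return True if the whole string matches the path-abempty rule."""
    return PATH_ABEMPTY.fullmatch(path) is not None


def segments(path: str) -> list[str]:
    """Split a path-abempty string into its segments.

    Every "/" introduces one segment, so the text before the first
    slash (always empty for path-abempty) is dropped.
    """
    return path.split("/")[1:]


print(is_path_abempty("/a//b"))  # the double slash is valid syntax
print(segments("/a//b"))         # three segments, the middle one empty
print(segments("/a/b"))          # two segments
```

The two paths parse to different segment sequences, which is the sense in which collapsing `//` changes what was parsed.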

HTTP uses RFC 3986 path grammar

HTTP (RFC 9110) uses the RFC 3986 path grammar for request targets.

4.1. URI References

URI references are used to target requests, indicate redirects, and define relationships.

The definitions of “URI-reference”, “absolute-URI”, “relative-part”, “authority”, “port”, “host”, “path-abempty”, “segment”, and “query” are adopted from the URI generic syntax. An “absolute-path” rule is defined for protocol elements that can contain a non-empty path component. (This rule differs slightly from the path-abempty rule of RFC 3986, which allows for an empty path, and path-absolute rule, which does not allow paths that begin with “//”.) A “partial-URI” rule is defined for protocol elements that can contain a relative URI but not a fragment component.

    URI-reference = <URI-reference, see [URI], Section 4.1>
    absolute-URI  = <absolute-URI, see [URI], Section 4.3>
    relative-part = <relative-part, see [URI], Section 4.2>
    authority     = <authority, see [URI], Section 3.2>
    uri-host      = <host, see [URI], Section 3.2.2>
    port          = <port, see [URI], Section 3.2.3>
    path-abempty  = <path-abempty, see [URI], Section 3.3>
    segment       = <segment, see [URI], Section 3.3>
    query         = <query, see [URI], Section 3.4>

    absolute-path = 1*( "/" segment )
    partial-URI   = relative-part [ "?" query ]

4.2.1. http URI Scheme

http-URI = "http" "://" authority path-abempty [ "?" query ]

The origin server for an “http” URI is identified by the authority component, which includes a host identifier ([URI], Section 3.2.2) and optional port number ([URI], Section 3.2.3). If the port subcomponent is empty or not given, TCP port 80 (the reserved port for WWW services) is the default.

The hierarchical path component and optional query component identify the target resource within that origin server’s namespace.

Collapsing // alters the sequence of segments and therefore alters the identifier. Unless the origin explicitly defines those two identifiers as equivalent, a generic normalizer has no authority to do so. Only the origin may decide that two distinct URIs in its own namespace name the same resource.

URL normalization rules do not include collapsing `//`

RFC 3986 is quite explicit about what syntax-based normalization is: case normalization, percent-encoding normalization, and dot-segment removal. It does not list any rule that removes empty segments or collapses multiple slashes.

6.2.2. Syntax-Based Normalization

Implementations may use logic based on the definitions provided by this specification to reduce the probability of false negatives. This processing is moderately higher in cost than character-for-character string comparison. For example, an application using this approach could reasonably consider the following two URIs equivalent:

    example://a/b/c/%7Bfoo%7D
    eXAMPLE://a/./b/../b/%63/%7bfoo%7d

Web user agents, such as browsers, typically apply this type of URI normalization when determining whether a cached response is available. Syntax-based normalization includes such techniques as case normalization, percent-encoding normalization, and removal of dot-segments.
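As an illustration (this is not part of the RFC text), the first two techniques can be sketched in a few lines of Python. The function names are hypothetical; the splitting regular expression is the one given in RFC 3986, Appendix B. Dot-segment removal is deliberately left out here, and the sketch ignores details such as IPv6 literals in the host.

```python
import re

# Generic URI splitter from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def normalize_pct(text: str) -> str:
    """Decode percent-encoded unreserved characters; uppercase other hex."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def normalize_case_and_pct(uri: str) -> str:
    """Case and percent-encoding normalization only; the path is not reshaped."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(2, 4, 5, 7, 9)
    parts = []
    if scheme is not None:
        parts.append(scheme.lower() + ":")
    if authority is not None:
        # Only the host is case-insensitive; userinfo is left alone.
        userinfo, at, hostport = authority.rpartition("@")
        parts.append("//" + userinfo + at + hostport.lower())
    parts.append(normalize_pct(path))
    if query is not None:
        parts.append("?" + normalize_pct(query))
    if fragment is not None:
        parts.append("#" + normalize_pct(fragment))
    return "".join(parts)


print(normalize_case_and_pct("eXAMPLE://a/b/%63/%7bfoo%7d"))
print(normalize_case_and_pct("HTTP://Example.COM/a//b"))
```

Neither step has any reason to look at slashes, so a path such as `/a//b` passes through with its empty segment intact.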

Path normalization is quite narrowly specified too: it is about . and .. in relative references, not empty segments.

6.2.2.3. Path Segment Normalization

The complete path segments “.” and “..” are intended only for use within relative references (Section 4.1) and are removed as part of the reference resolution process (Section 5.2). However, some deployed implementations incorrectly assume that reference resolution is not necessary when the reference is already a URI and thus fail to remove dot-segments when they occur in non-relative paths. URI normalizers should remove dot-segments by applying the remove_dot_segments algorithm to the path, as described in Section 5.2.4.

Notice what is not present: there is no rule permitting removal of empty segments, nor any directive to coalesce repeated separators.
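For reference, here is a straightforward Python transcription of the remove_dot_segments algorithm from RFC 3986, Section 5.2.4 (steps 2A to 2E). It is a sketch written for this article, not a library function, but it shows that the algorithm only ever reacts to "." and ".." and copies empty segments through unchanged.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments as described in RFC 3986, Section 5.2.4."""
    inp = path
    out: list[str] = []  # each item is one segment with its leading "/", if any
    while inp:
        if inp.startswith("../"):          # 2A
            inp = inp[3:]
        elif inp.startswith("./"):         # 2A
            inp = inp[2:]
        elif inp.startswith("/./"):        # 2B: "/./" becomes "/"
            inp = inp[2:]
        elif inp == "/.":                  # 2B
            inp = "/"
        elif inp.startswith("/../"):       # 2C: "/../" becomes "/", drop last segment
            inp = inp[3:]
            if out:
                out.pop()
        elif inp == "/..":                 # 2C
            inp = "/"
            if out:
                out.pop()
        elif inp in (".", ".."):           # 2D
            inp = ""
        else:                              # 2E: move the first segment to the output
            start = 1 if inp.startswith("/") else 0
            nxt = inp.find("/", start)
            if nxt == -1:
                out.append(inp)
                inp = ""
            else:
                out.append(inp[:nxt])
                inp = inp[nxt:]
    return "".join(out)


print(remove_dot_segments("/a/b/c/./../../g"))  # example from the RFC
print(remove_dot_segments("/a//b"))             # the empty segment survives
```

In step 2E an empty segment is moved to the output like any other segment, so `/a//b` comes out exactly as it went in.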

HTTP scheme-based normalization still does not collapse `//`

HTTP adds a few scheme-based normalization rules, and they are quite narrow still. The only rule that touches the path concerns the empty path component (not empty segments inside a path):
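As a rough sketch of that rule, assuming the reading of RFC 9110, Section 4.2.3 in which an empty path component is equivalent to "/" (outside of OPTIONS requests), a hypothetical helper `normalize_http_path` would touch only the zero-length path and leave every other path alone:

```python
def normalize_http_path(path: str) -> str:
    """Apply the one path-related http(s) scheme-based normalization.

    Assumption: `path` is the already-extracted path component of an
    http or https URI that is not the target of an OPTIONS request.
    """
    # An empty path component is normalized to "/".
    if path == "":
        return "/"
    # Everything else, including paths with empty segments, is returned as-is.
    return path


print(normalize_http_path(""))       # the empty path component
print(normalize_http_path("/a//b"))  # empty segment inside a path
```

The rule is about a path of zero characters, not about zero-length segments inside a non-empty path.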
