caddyhttp: Smarter path matching and rewriting #4948

Merged
merged 15 commits into from
Aug 16, 2022

Conversation

mholt
Member

@mholt mholt commented Aug 10, 2022

This branch resolves several inconsistencies across Caddy's HTTP facilities regarding URI encodings in paths.

I am not entirely sure, but breaking changes are possible for users who relied on the buggy behavior that was only recently identified and is being fixed here.

This PR mainly affects the path matcher and the rewrite middleware (including both the rewrite and uri Caddyfile directives). These are extremely commonly-used Caddy features.

Background

URIs (essentially the part of the URL after the scheme and authority/host, e.g. /foo/bar?a=b#frag -- though servers don't really deal with #fragment components) are famous for being inconsistently encoded and parsed. Differences in parsing/handling between servers, proxies, and applications often lead to bugs and security vulnerabilities. For example, a path of //foo/bar might be considered equivalent to /foo/bar by one piece of infrastructure, and different to another. Similarly, /foo%2Fbar might or might not be the same as /foo/bar. To a router, they could be different. To an application, they could be the same.

A web server like Caddy is between a rock and a hard place, because it finds itself between untrusted clients who send all manner of inconsistent requests, and other servers or applications who expect the request URI to be just right. Caddy is often expected to route requests of all varieties and rewrite/transform them into something the backend application (even if that's just the built-in static file server) can use without confusion. The problem is the requirements and expectations vary widely!

Caddy has had several issues over the years where some users expect a URI like /foo%2Fbar to be transformed into /foo/bar before being proxied. Some want /foo/bar to match /foo%2Fbar, while others don't. Some want a matcher like /secret/* to match URIs like //secret/* or /secret//* because they put it behind authentication, and if it doesn't match, auth could be bypassed! Windows treats /file.php . .. the same as /file.php -- even though they technically have different suffixes and file extensions, causing routing blunders. Then imagine a path prefix like /bands/*/*/ that should match /bands/Pink/Try/ as well as /bands/AC%2FDC/T.N.T -- but if the path matcher normalizes (decodes) URIs before matching, the first URI would work but the second would become /bands/AC/DC/T.N.T which doesn't match the pattern anymore. To make matters worse, any given URI has multiple valid encoded forms. %2F%66%6F%6F%2F%62%61%72 can be decoded to /foo/bar just as well as /foo%2Fbar can, and everything in between can, too. If routers matched on non-normalized URIs, there would be plenty more security bugs to deal with: a pattern of /foo/*, which is expected to be authenticated, would no longer match /foo%2Fbar even though they are, according to ratified RFCs, equivalent.

In other words, encodings are significant to applications, but normalizing URIs to a consistent form is critical for maintaining security.
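To make the "multiple valid encoded forms" point concrete, here is a minimal sketch using only Go's standard `net/url` package. The helper name `sameAfterDecoding` is made up for this illustration and is not part of Caddy.

```go
package main

import (
	"fmt"
	"net/url"
)

// sameAfterDecoding (hypothetical helper) reports whether two escaped
// paths decode to the same string. Invalid escapes count as "not equal".
func sameAfterDecoding(a, b string) bool {
	da, errA := url.PathUnescape(a)
	db, errB := url.PathUnescape(b)
	return errA == nil && errB == nil && da == db
}

func main() {
	// Three differently encoded spellings of the same normalized path.
	for _, p := range []string{"/foo/bar", "/foo%2Fbar", "%2F%66%6F%6F%2F%62%61%72"} {
		fmt.Println(p, sameAfterDecoding(p, "/foo/bar")) // true for all three
	}
}
```

A router that compares only the raw strings would treat these as three different paths, which is the security concern described above.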

Let me restate here what I wrote for the Laravel community when I started working on this (with minor changes to make sense out of context):


RFC 9110, "HTTP Semantics," has a section on HTTP URI normalization, which says:

Two HTTP URIs that are equivalent after normalization (using any method) can be assumed to identify the same resource, and any HTTP component MAY perform normalization. As a result, distinct resources SHOULD NOT be identified by HTTP URIs that are equivalent after normalization.

In other words, /foo%2Fbar and /foo/bar are equivalent after normalizing, and thus they SHOULD NOT be used for distinct resources. So if you are encoding application data into the path, and that data could possibly have reserved characters / delimiters (like /), consider redesigning your API: it is not robust in the harsh HTTP environment.

Note that several RFCs, notably RFCs 3986 and 9110, continually repeat that URI parsing is dependent upon scheme. That's one other problem: we all use the http:// or https:// scheme and yet expect applications to handle URIs differently. So of course there's going to be head-butting: we're fighting the design.

To clarify, it is definitely possible for a URI path such as /band/AC%2fDC/T.N.T to be "properly" handled by a server application. For this case, simply write a server that decodes everything after /band/ except %2f. 🤷‍♂️ The problem is that this is difficult in general. Depending on the situations in which you do this, you may be opening yourself up to bugs and security holes. This is why Caddy currently handles URIs solely in the unescaped space: it's the "one true" representation of a URI, and normalized HTTP URIs are more or less clearly defined nowadays.

Others might propose a solution to double-encode application data in the path; in other words, have the client send a URI with a path of /bands/AC%252fDC/T.N.T. This will probably work, but it's a hack and it violates spec:

Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

Beware of the non-conforming behavior and highlight it very prominently in documentation so you can avoid bugs.
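As a small illustration of why double-encoding is fragile, the sketch below decodes the same path once and then twice with Go's `url.PathUnescape`. The helper `decodeTimes` is hypothetical; the point is only that any component which decodes one time too many turns the data slash back into a delimiter.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeTimes (hypothetical helper) percent-decodes p exactly n times.
func decodeTimes(p string, n int) (string, error) {
	var err error
	for i := 0; i < n; i++ {
		if p, err = url.PathUnescape(p); err != nil {
			return "", err
		}
	}
	return p, nil
}

func main() {
	const sent = "/bands/AC%252fDC/T.N.T" // double-encoded by the client

	once, _ := decodeTimes(sent, 1)
	twice, _ := decodeTimes(sent, 2)
	fmt.Println(once)  // /bands/AC%2fDC/T.N.T
	fmt.Println(twice) // /bands/AC/DC/T.N.T
}
```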

Laravel user @alcaitiff made a comment that some of you may be thinking:

The router should resolve the route and after that decode parameters, but it does decode the url parameters before resolving the route.

I can't speak for Laravel or what it's doing, but the Go standard library (what Caddy uses), for example, does do URL parsing correctly and still has this problem. Go does exactly what you and the spec recommend: it splits the URI into its components and then decodes reserved characters after parsing. It preserves the original, "raw" path in the RawPath field and offers the decoded path in the Path field. Its EscapedPath() method uses RawPath if it is a valid encoding of Path, which is interesting because any given path has multiple valid encodings as I noted above. So if I want to truly "normalize" the URI in Go, I have to re-escape req.URL.Path myself, e.g. with (&url.URL{Path: req.URL.Path}).EscapedPath(), and ignore RawPath entirely (AFAIK; url.PathEscape is segment-oriented and would escape the slashes too). And guess what, this converts /foo/bar to... /foo/bar. In other words, decoding /foo%2Fbar is not reversible without loss of precision. (Unless the HTTP server knows your business logic, more on that in a moment.)

We can write our own logic, though, that uses RawPath as a "hint" (as the Go docs say) to maybe replace / with %2f, but if we've manipulated/rewritten the URI at all, this becomes infeasible because we don't know where or if that instance still exists in the string.
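The behavior of `Path`, `RawPath`, and `EscapedPath()` described above can be observed with a few lines of standard-library Go. This is only a sketch: `parsePaths` is a made-up helper, and "re-encoding" here means building a fresh `url.URL` from the decoded `Path` so that the `RawPath` hint is discarded.

```go
package main

import (
	"fmt"
	"net/url"
)

// parsePaths (hypothetical helper) returns the four views of a URL's path
// discussed above: decoded, raw hint, EscapedPath, and a re-encoding of
// the decoded path that ignores the RawPath hint.
func parsePaths(raw string) (decoded, hint, escaped, reencoded string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", "", err
	}
	reencoded = (&url.URL{Path: u.Path}).EscapedPath()
	return u.Path, u.RawPath, u.EscapedPath(), reencoded, nil
}

func main() {
	decoded, hint, escaped, reencoded, err := parsePaths("http://example.com/foo%2Fbar")
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded)   // /foo/bar
	fmt.Println(hint)      // /foo%2Fbar
	fmt.Println(escaped)   // /foo%2Fbar
	fmt.Println(reencoded) // /foo/bar (the %2F is gone for good)
}
```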

RFC 3986 section 2 states:

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.

The / is in the reserved set. Thus it is up to the implementation to determine whether it is data or a delimiter. I guess Laravel doesn't know, and it's frankly safer to assume it's a delimiter and treat it in its normalized form.

So yes, this issue is frustrating. As a web server author, I feel like I need to write software that can read people's minds: is this slash data or is it a delimiter? The router needs more information, because both are very valid ways of interpreting a URI!


The solution

I think the key to this problem is trying to read the developer's mind: is this character supposed to be a delimiter (part of the path) or data? Should we collapse repeated slashes or no?

The answers depend on the context. For routing / path matching, the answer may be one way, for rewriting it may be another, and for proxying it may be yet another depending on the applications being proxied to.

Nginx, Apache, and Caddy all merge slashes by default when matching. However, Nginx and Apache have options to disable that behavior and preserve the slashes, which can lead to security vulnerabilities. All three do path matching (or routing) in the normalized space to mitigate bugs but, as we saw with Laravel, this makes it difficult or impossible to route requests with application data that decodes to path-significant characters like %2F (/), leaving many developers frustrated.

This PR introduces a somewhat novel solution that allows the developer to convey their intent to the server when doing matching and rewriting.

Simply put, our solution is to interpret encoded characters and multiple slashes in the configuration as a literal conveyance of the developer's intent. In other words, we don't blanket-unescape the whole URI every time. We do it byte-for-byte in lock-step with the configured pattern to match, and only unescape if the match pattern is not escaped at that position. Similarly, if a configured path has double slashes // in it, we do not merge slashes when comparing paths, because we infer the user's intent is to match repeated slashes.

Path matching

Path matching (aka routing) is still done in the normalized space. That means if you configure a path matcher of /foo/bar, it will match /foo/bar, /foo%2Fbar and even %2F%66%6F%6F%2F%62%61%72 because we normalize the URI. This is unchanged from before.

But now if you have a path matcher of /foo%2Fbar, it will match /foo%2Fbar exactly (the escape sequence is case-insensitive), whereas previously it would have only matched /foo%252Fbar (i.e. % as data). Now, a matcher of /foo%2Fbar will NOT match /foo/bar or %2Ffoo/bar because we interpret escape sequences in the match pattern as application data, not path delimiters.
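The lock-step idea can be sketched roughly as follows. This is not Caddy's implementation: `matchLockstep` and `decodeAt` are hypothetical names, only exact (non-wildcard) patterns are handled, and slash merging and dot-segment cleaning are left out. The request path is walked alongside the pattern, and a request byte is decoded only where the pattern is not itself escaped.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeAt (hypothetical helper) returns the byte represented at s[i] and
// how many input bytes it spans: 3 for a valid %XX escape, otherwise 1.
func decodeAt(s string, i int) (byte, int) {
	if i+2 < len(s) && s[i] == '%' {
		if d, err := url.PathUnescape(s[i : i+3]); err == nil && len(d) == 1 {
			return d[0], 3
		}
	}
	return s[i], 1
}

// matchLockstep compares an exact pattern against an escaped request path.
// Where the pattern has an escape sequence, the request must have one too
// (hex case is ignored because both sides are decoded before comparing).
// Where the pattern is literal, the request may be literal or escaped.
func matchLockstep(pattern, escapedPath string) bool {
	pi, si := 0, 0
	for pi < len(pattern) && si < len(escapedPath) {
		pb, pn := decodeAt(pattern, pi)
		sb, sn := decodeAt(escapedPath, si)
		if pn == 3 && sn != 3 {
			return false // pattern wants encoded data, request has a literal
		}
		if pb != sb {
			return false
		}
		pi, si = pi+pn, si+sn
	}
	return pi == len(pattern) && si == len(escapedPath)
}

func main() {
	fmt.Println(matchLockstep("/foo/bar", "%2F%66%6F%6F%2F%62%61%72")) // true
	fmt.Println(matchLockstep("/foo%2Fbar", "/foo%2fbar"))             // true
	fmt.Println(matchLockstep("/foo%2Fbar", "/foo/bar"))               // false
}
```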

This logic handily extends to wildcards, too. Referring to the previous example from our Laravel discussion, with a matcher of /bands/*/* it is impossible to match a URI of /bands/AC%2fDC/T.N.T (in Laravel, too). But with this change, you can use special "escape-wildcard" characters: /bands/%*/%* to indicate that the span matched by the wildcard should not be URI-decoded and should be kept in the escaped/raw space.

So now, if you want to allow band names to have a / in them, you can simply write /bands/%*/%*.
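A heavily simplified sketch of the escape-wildcard idea follows. It is an assumption-laden illustration, not how Caddy does it: `matchGlob` is a hypothetical function that uses `path.Match` glob semantics (where `*` does not cross `/`), and it switches the whole comparison into the raw space as soon as the pattern contains `%*`, whereas the PR describes doing this per matched span.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// matchGlob (hypothetical) matches a glob pattern against an escaped path.
// Plain "*" patterns are matched against the decoded path; patterns using
// the "%*" escape-wildcard are matched against the raw, still-escaped path.
func matchGlob(pattern, escapedPath string) bool {
	target := escapedPath
	if strings.Contains(pattern, "%*") {
		// Simplification: compare everything in the raw space.
		pattern = strings.ReplaceAll(pattern, "%*", "*")
	} else {
		decoded, err := url.PathUnescape(escapedPath)
		if err != nil {
			return false
		}
		target = decoded
	}
	ok, err := path.Match(pattern, target)
	return err == nil && ok
}

func main() {
	fmt.Println(matchGlob("/bands/*/*", "/bands/Pink/Try"))        // true
	fmt.Println(matchGlob("/bands/*/*", "/bands/AC%2fDC/T.N.T"))   // false
	fmt.Println(matchGlob("/bands/%*/%*", "/bands/AC%2fDC/T.N.T")) // true
}
```

In the second call the decoded path has four segments (/bands/AC/DC/T.N.T), so a three-segment glob cannot match; in the third call the `%2f` stays encoded and the segment count is preserved.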

Double slashes

Similar to escape sequences, we now disable slash merging automatically if the configured pattern has repeated slashes. Previously, it was impossible to match //foo because all URIs were normalized. Now, a path matcher of //foo will preserve multiple slashes. (A matcher of /foo will still match //foo.)
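The rule "merge slashes unless the pattern itself contains repeated slashes" can be sketched like this. Both function names are hypothetical, and only exact matching is shown.

```go
package main

import (
	"fmt"
	"strings"
)

// collapseSlashes replaces every run of slashes with a single slash.
func collapseSlashes(p string) string {
	var sb strings.Builder
	for i := 0; i < len(p); i++ {
		if p[i] == '/' && i > 0 && p[i-1] == '/' {
			continue // skip a slash that directly follows another slash
		}
		sb.WriteByte(p[i])
	}
	return sb.String()
}

// matchRespectingSlashes (hypothetical) merges slashes in the request path
// only when the configured pattern does not itself contain "//".
func matchRespectingSlashes(pattern, reqPath string) bool {
	if !strings.Contains(pattern, "//") {
		reqPath = collapseSlashes(reqPath)
	}
	return pattern == reqPath
}

func main() {
	fmt.Println(matchRespectingSlashes("/foo", "//foo"))  // true
	fmt.Println(matchRespectingSlashes("//foo", "//foo")) // true
	fmt.Println(matchRespectingSlashes("//foo", "/foo"))  // false
}
```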

Rewriting

A common task of rewriting is to strip path prefix and path suffix. The logic explained above has also been implemented for these operations, allowing you to use escaped characters and multiple slashes in your prefix and suffix patterns, and now Caddy will rewrite more intuitively and correctly.

For example, if you want to strip a prefix of //prefix from //prefix/foo, it will work, whereas before it wouldn't find the prefix because it would look at a fully-normalized URI.

Similarly, you can strip prefixes or suffixes with encoded characters. For example, a prefix of /foo%2Fbar will rewrite a URI of /foo%2Fbar/asdf into /asdf, whereas before it wouldn't find the prefix.
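Prefix stripping under the same lock-step rule might look like the sketch below. As before, this is an illustration under assumptions rather than Caddy's code: `stripPrefixLockstep` and `byteAt` are hypothetical names, and slash merging for prefixes without `//` is left out for brevity.

```go
package main

import (
	"fmt"
	"net/url"
)

// byteAt (hypothetical helper) returns the byte represented at s[i] and the
// number of input bytes it spans: 3 for a valid %XX escape, otherwise 1.
func byteAt(s string, i int) (byte, int) {
	if i+2 < len(s) && s[i] == '%' {
		if d, err := url.PathUnescape(s[i : i+3]); err == nil && len(d) == 1 {
			return d[0], 3
		}
	}
	return s[i], 1
}

// stripPrefixLockstep removes prefix from the escaped path if it matches in
// lock-step: escapes in the prefix must be escapes in the path, while
// literal prefix bytes may appear literally or escaped in the path.
// If the prefix does not match, the path is returned unchanged.
func stripPrefixLockstep(escapedPath, prefix string) string {
	pi, si := 0, 0
	for pi < len(prefix) {
		if si >= len(escapedPath) {
			return escapedPath // path is shorter than the prefix
		}
		pb, pn := byteAt(prefix, pi)
		sb, sn := byteAt(escapedPath, si)
		if (pn == 3 && sn != 3) || pb != sb {
			return escapedPath
		}
		pi, si = pi+pn, si+sn
	}
	return escapedPath[si:]
}

func main() {
	fmt.Println(stripPrefixLockstep("//prefix/foo", "//prefix"))      // /foo
	fmt.Println(stripPrefixLockstep("/foo%2Fbar/asdf", "/foo%2Fbar")) // /asdf
	fmt.Println(stripPrefixLockstep("/foo/bar/asdf", "/foo%2Fbar"))   // /foo/bar/asdf
}
```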

Is it perfect?

Probably not. Are there bugs? Probably. Have I overlooked things? Almost certainly yes. I'm pretty sure there are nooks and crannies within Caddy where I missed implementing this. Please file a bug report if you need it to work but it doesn't work like you expect.

I'm pretty happy with this approach though. I think it's very useful and I don't know of other mainstream servers or frameworks that implement this behavior. In true Caddy fashion, this should just work.

Expected for path matching and rewriting. Fixes #4923.
This allows matching spans of raw/URI-escaped portions of the path.
@mholt mholt marked this pull request as ready for review August 11, 2022 20:52
@mholt mholt added under review 🧐 Review is pending before merging and removed in progress 🏃‍♂️ Being actively worked on labels Aug 11, 2022
@mholt mholt changed the title caddyhttp: Consistent URI-decoded (unescaped) form caddyhttp: Smarter path matching and rewriting Aug 12, 2022
@coolaj86
Contributor

Very well reasoned, but I think this is something that you'll never regret not merging, and that you may very well regret putting into the codebase for a use case that is much more art (abusing constraints) than software engineering.

@mholt
Member Author

mholt commented Aug 13, 2022

@coolaj86 Hiya! (I know I just replied on Twitter, but)

I disagree with the part about abusing constraints, but I do think you could be right. This is not my favorite change, but it doesn't violate any standards (and in fact it is perhaps more conforming in some ways than other implementations) and it provides an elegant solution to otherwise hacky or non-compliant alternatives.

Unless I have overlooked some significant problem with this approach, I do think it's well-engineered. In some ways it gives Caddy another competitive technical advantage.

But yes, I do hope I don't regret this. 😅

@coolaj86
Contributor

YOLO!

// CleanPath cleans path p according to path.Clean(), but only
// merges repeated slashes if collapseSlashes is true, and always
// preserves trailing slashes.
func CleanPath(p string, collapseSlashes bool) string {
Collaborator

I gave this another thought.

Typically I guess dot segments ("." and "..") are rarely used in URLs, but I'm afraid the behavior of CleanPath is hard to reason about if dot segments are used along with multiple slashes.

For example, I'm not sure if /foo//./bar is the result most users expect 🤔

| p | CleanPath | path.Clean |
| --- | --- | --- |
| /foo//./..//./bar | /foo//./bar | /bar |
| /foo/./.././bar | /bar | /bar |

Member Author

Interesting! I'll look into this...

Member Author

@mholt mholt Aug 13, 2022

I guess we need to decide if /foo//.. is / or /foo, i.e. whether // counts as a path element (an empty one) or not.

Collaborator

Oh, this is an important point!

Obviously, path.Clean chooses /, but personally I think /foo is a better choice because URI Empty Path Segments Matter.

Collaborator

The same author made an interesting point about path matching in Application Content URI Rules wildcard syntax:

For the path, an asterisk matches within the path segment. For example, http://example.com/a/*/c will match http://example.com/a/b/c and http://example.com/a//c but not http://example.com/a/b/b/c or http://example.com/a/c

Member Author

@RussellLuo Thanks for those resources. I think I tend to agree, if we're not merging slashes, then treat empty path components as significant. I pushed a commit just now which makes the new test case pass.
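For readers following the thread, the rule settled on here (when slashes are not merged, an empty path segment counts as a real segment, so /foo//.. resolves to /foo) can be sketched with a simple segment stack. `cleanKeepEmpty` is a hypothetical function written for this illustration; it is not the code from the commit, and it assumes an absolute path.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanKeepEmpty (hypothetical) resolves "." and ".." segments without
// merging slashes: an empty segment between two slashes is kept and can be
// popped by ".." like any other segment. Trailing slashes are preserved.
func cleanKeepEmpty(p string) string {
	if !strings.HasPrefix(p, "/") {
		p = "/" + p // assumption: treat relative input as rooted
	}
	var stack []string
	for _, seg := range strings.Split(p[1:], "/") {
		switch seg {
		case ".":
			// current directory: contributes nothing
		case "..":
			if len(stack) > 0 {
				stack = stack[:len(stack)-1] // pop the previous segment, even an empty one
			}
		default:
			stack = append(stack, seg) // includes empty segments from "//"
		}
	}
	return "/" + strings.Join(stack, "/")
}

func main() {
	fmt.Println(cleanKeepEmpty("/foo//.."))          // /foo
	fmt.Println(cleanKeepEmpty("/foo//./..//./bar")) // /foo//bar
	fmt.Println(cleanKeepEmpty("/foo/./.././bar"))   // /bar
}
```

Under this rule the first row of the table above would come out as /foo//bar rather than /foo//./bar, because the `..` removes the empty segment before it instead of being left unresolved.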

@mholt mholt removed the under review 🧐 Review is pending before merging label Aug 16, 2022
@mholt mholt merged commit a479943 into master Aug 16, 2022
@mholt mholt deleted the path-escaping branch August 16, 2022 14:49
WilczynskiT pushed a commit to WilczynskiT/caddy that referenced this pull request Aug 17, 2022
Co-authored-by: RussellLuo <luopeng.he@gmail.com>
@francislavoie francislavoie modified the milestones: v2.6.0, v2.6.0-beta.1 Aug 21, 2022
@mholt mholt modified the milestones: v2.6.0-beta.1, v2.6.0 Sep 13, 2022