Syntax for apache RewriteRule to match %-encoded URLs? (to fix character encoding issues; windows-1252 <=> utf-8 )

Solution 1:

You can't "convert encodings" as such using only mod_rewrite. However, you can search for that specific sequence of characters in the requested URL and "correct it".

http://localhost:60151/load?file=http://example.org/project²/some/data/file.bam
RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE]

Note that project² appears in the query string of the example URL you posted. However, the RewriteRule pattern (which you are using above) matches against the %-decoded URL-path only, which excludes the query string. To match against the query string you need an additional RewriteCond directive that matches against the QUERY_STRING (or THE_REQUEST) server variable instead.

Note that the QUERY_STRING and THE_REQUEST server variables are not %-decoded; they hold the values exactly as sent from the client, so any %-encoding is still present.

Try the following instead:

RewriteCond %{QUERY_STRING} (.+)/project%B2/(.*)
RewriteRule ^(load)$ $1?%1/project%C2%B2/%2 [NE,L]

The backreferences %1 and %2 in the substitution string refer to the preceding CondPattern - the parts before and after the troublesome /project%B2/ part.

$1 is simply a backreference to the URL-path (to save repetition), which I assume is always load.

The NE flag prevents the % itself (when used as part of the URL-encoded characters) being URL encoded.

UPDATE: I'm afraid my original question was unclear about who GETs which URL, so the "query-string" part of your answer doesn't apply...

If you need to match the %-encoded URL-path then you should match against the THE_REQUEST server variable instead. THE_REQUEST contains the first line of the HTTP request header and is not %-decoded. It contains the full URL-path (and query string) as sent from the client (as well as the request method and protocol version). For example, in the case of the malformed request, a string of the form:

GET /project%B2/some/data/file.bam HTTP/1.1

Which you could match and correct as follows:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,7}\s(/project)%B2([^\s]+)
RewriteRule ^/?project %1%C2%B2%2 [NE,L]

%1 and %2 are backreferences to the captured subpatterns in the preceding CondPattern.

The RewriteRule pattern, on the other hand, matches against a pre-processed, %-decoded URL-path only (as mentioned above). So %B2 has to be matched as whatever it decodes to, which is the single byte 0xB2 (² in windows-1252, but not a valid character on its own in UTF-8). Since this byte cannot be typed reliably as a literal character, it needs to be represented by a hex escape in the regex, i.e. \xb2 (PCRE syntax for a single byte).
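A minimal sketch of matching the decoded byte directly in the RewriteRule pattern, assuming the malformed segment is always /project%B2/ at the start of the URL-path and that an external redirect is acceptable (the R=302 flag is an assumption made for illustration, not something taken from the question):

```apache
RewriteEngine On

# The pattern sees the %-decoded URL-path, so the raw byte 0xB2 is matched with \xb2.
# The substitution emits the UTF-8 percent-encoding of ² (%C2%B2).
# NE stops mod_rewrite from escaping the % signs in the redirect target.
RewriteRule ^/?project\xb2/(.*)$ /project%C2%B2/$1 [NE,R=302,L]
```

With this in place, a request for /project%B2/some/data/file.bam should be redirected to /project%C2%B2/some/data/file.bam.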

Solution 2:

Solution

RewriteRules must use \x instead of % in order to match %-encoded URLs! (PCRE syntax for byte sequences)

mod_rewrite-config uses PCRE regex syntax, and operates on decoded URLs, so typing a %-encoding in a RewriteRule pattern causes it to look for the literal %-character, not an encoded value.
The correct escape-character in RewriteRules is \x, so the URLencoded value %B2 can be matched using \xb2 (or \xB2, it's case-insensitive).

Note that RewriteRule is a hacky solution for character encoding issues that only works when exactly one specific wrongly encoded character sits in a specific, predictable place.

For multiple wrongly encoded characters in arbitrary places, please see "Can Apache .htaccess convert the percent-encoding in encoded URIs from Win-1252 to UTF-8?", which suggests a general solution using RewriteMap coupled to an external program in a full-featured programming language.

The proper solution is still to prevent this from the source, using explicit %-encoding throughout the entire chain. This avoids OS-dependent encoding accidentally happening 'somewhere in the middle', outside of your control. (assuming no client along the paths does double-encoding, which should be a punishable offense..)


How I got here

Getting desperate, I upped the server-wide logging using LogLevel Warn rewrite:trace3, as suggested in the mod_rewrite docs. The docs warn that this (heavily) impacts server performance, but it was manageable because this is a low-traffic server and there were no pre-existing rewrites.

The additional logging is emitted into (ssl_)error_log. This gave me insight into how exactly the matching was attempted, and what the internal representations for rules and URIs are in mod_rewrite.
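The logging setup described above could be written as follows; a sketch assuming Apache 2.4 and placement in the server or virtual host configuration (LogLevel cannot be set from .htaccess):

```apache
# Keep the general log level at warn, but raise mod_rewrite to trace3.
# Remove this again after debugging, since tracing slows the server down.
LogLevel warn rewrite:trace3
```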

Excerpt from ssl_error_log (many columns omitted for brevity), with rule RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE,L]

[rewrite:trace3] applying pattern '(.*)project%B2/(.*)' to uri 'project\xb2/'
[rewrite:trace1] pass through /var/www/html/example.org/project\xb2

Note that the request-uri from client is written \xb2, but my pattern uses %B2.

Matching the rule-syntax to the uri-syntax, with rule RewriteRule (.*)project\xB2/(.*) $1project²/$2 [NE,L]

[rewrite:trace3] applying pattern '(.*)project\\xb2/(.*)' to uri 'project\xb2/'
[rewrite:trace2] rewrite 'project\xb2/' -> 'project%c2%b2/'
[rewrite:trace1] internal redirect with /auth-test/project\xc2\xb2/ [INTERNAL REDIRECT]

Success! As we can see, the pattern now matches.


Why no [R]/[R=302] flag?

As this is a character-encoding issue, I do not think an extra HTTP round-trip adds value; every link fed into the client will run into the same issue again, unless I fix the encoding before feeding it into the client-side Java program.


Don't forget RewriteBase

Please note that this shortened version omits setting the correct RewriteBase, which can screw up the rewritten path, depending on where in your conf it is written (e.g. <Directory> vs <Location>). Without RewriteBase I accidentally redirected to ❌https://example.org/var/www/html/rewrite-testing/project² instead of ✅https://example.org/rewrite-testing/project².
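A sketch of the per-directory setup with RewriteBase added, assuming the rules live in a <Directory> block for /var/www/html/rewrite-testing that is served under /rewrite-testing/ (both inferred from the URLs above; adjust to your own layout):

```apache
<Directory "/var/www/html/rewrite-testing">
    RewriteEngine On

    # URL prefix that is prepended to relative substitutions,
    # so the filesystem path does not leak into the rewritten URL.
    RewriteBase "/rewrite-testing/"

    # \xB2 matches the decoded byte; the literal ² assumes this file is saved as UTF-8.
    RewriteRule (.*)project\xB2/(.*) $1project²/$2 [NE,L]
</Directory>
```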