How can I reject base64 encoded spam email?
Don't do this with Postfix body_checks; write a SpamAssassin rule for it instead. SpamAssassin decodes the message body before applying its rules. Something like:
    body LOCAL_QUANZHOUCOOWAY /Quanzhoucooway/
    score LOCAL_QUANZHOUCOOWAY 7.0
    describe LOCAL_QUANZHOUCOOWAY Block word Quanzhoucooway
These rules belong in SpamAssassin's local configuration file.
Technically, you could directly filter the base64 encoded data for keywords. I'm not saying it's a practical or a reasonable thing to do, given the existence of better and simpler alternatives (as described e.g. in Esa's answer above), but it is possible.
The trick is to realize that base64 encoding is a deterministic mapping of 3-byte blocks of raw unencoded data into 4-character blocks of base64 characters. Thus, any time a certain sequence of 3-byte blocks appears in the unencoded data, the same sequence of 4-character blocks will appear in the encoded version.
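The block-wise mapping can be checked with a few lines of Python. This is only a sketch to illustrate the point; the helper name encode_blocks is made up for this example and is not part of any library.

```python
import base64

def encode_blocks(data: bytes) -> list:
    """Base64-encode each aligned 3-byte block of data on its own."""
    return [
        base64.b64encode(data[i:i + 3]).decode("ascii")
        for i in range(0, len(data), 3)
    ]

blocks = encode_blocks(b"Quanzhoucooway")
print(blocks)

# Encoding block by block gives the same result as encoding everything at
# once, because each 3-byte block maps to its 4 characters independently.
whole = base64.b64encode(b"Quanzhoucooway").decode("ascii")
print("".join(blocks) == whole)
```

Only the last block, which is shorter than 3 bytes, picks up = padding.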
For example, if you enter the string Quanzhoucooway into a base64 encoder, you'll get the output UXVhbnpob3Vjb293YXk=. Since the length of the input is not a multiple of 3 bytes, the output contains some padding at the end, but if we drop the final = sign and the last actual base64 character k (since it also encodes some padding bits), we get the string UXVhbnpob3Vjb293YX, which is guaranteed to appear in the base64-encoded data whenever the byte triplets Qua, nzh, ouc, oow and the partial triplet ay appear in the input in that order.
But, of course, the string Quanzhoucooway might not start exactly on a triplet boundary. For example, if we encode the string XQuanzhoucooway instead, we get the output WFF1YW56aG91Y29vd2F5, which looks completely different. This time, the input length is divisible by three, so there are no padding characters to discard at the end, but we do need to discard the first two characters (WF), which each encode some of the bits from the prepended X byte, leaving us with F1YW56aG91Y29vd2F5.
Finally, base64 encoding XXQuanzhoucooway gives the output WFhRdWFuemhvdWNvb3dheQ==, which has padding at both ends. Removing the first three characters WFh (which encode the XX prefix) and the last three characters Q== (which encode the zero bit padding at the end), we're left with the string RdWFuemhvdWNvb3dhe. Thus, we obtain the following three base64-encoded strings:
    UXVhbnpob3Vjb293YX
    F1YW56aG91Y29vd2F5
    RdWFuemhvdWNvb3dhe
of which (at least) one must appear in the base64 encoded form of any input string containing the word Quanzhoucooway.
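The three strings can also be derived mechanically. The following Python sketch does that for an arbitrary keyword and then checks the claim against a few sample inputs; the names base64_fragments, LEAD_DROP and TAIL_DROP are invented for this illustration, and the keyword is assumed to be at least a few bytes long.

```python
import base64

# Characters to discard at the start, keyed by how many bytes precede the
# keyword inside its first 3-byte block (they mix in bits of those bytes).
LEAD_DROP = {0: 0, 1: 2, 2: 3}
# Characters to discard at the end, keyed by (total length % 3): the "="
# padding plus the one character that also carries zero padding bits.
TAIL_DROP = {0: 0, 1: 3, 2: 2}

def base64_fragments(keyword: bytes) -> list:
    """Return the base64 fragment of keyword for each of the 3 alignments."""
    fragments = []
    for offset in range(3):
        encoded = base64.b64encode(b"X" * offset + keyword).decode("ascii")
        end = len(encoded) - TAIL_DROP[(offset + len(keyword)) % 3]
        fragments.append(encoded[LEAD_DROP[offset]:end])
    return fragments

fragments = base64_fragments(b"Quanzhoucooway")
print(fragments)

# Whatever precedes the keyword, the fragment for that alignment shows up.
for prefix_len in range(6):
    message = b"a" * prefix_len + b"Quanzhoucooway" + b" and more text"
    encoded = base64.b64encode(message).decode("ascii")
    print(prefix_len, fragments[prefix_len % 3] in encoded)
```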
Of course, if you're unlucky, the base64 encoder may insert a line break in the middle of them, between any two encoded triplets. (Your example message, for example, has one between F1YW56 and aG91Y29vd2F5.) Thus, to reliably match these strings with regexps, you'd need something like the following (using PCRE syntax):
    /UXVh\s*bnpo\s*b3Vj\s*b293\s*YX/ DISCARD
    /F1\s*YW56\s*aG91\s*Y29v\s*d2F5/ DISCARD
    /R\s*dWFu\s*emhv\s*dWNv\s*b3dh\s*e/ DISCARD
Generating these patterns by hand is kind of tedious, but it wouldn't be hard to write a simple script to do it in your favorite programming language, at least as long as it provides a base64 encoder.
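As a rough sketch of such a script, here is one way it might look in Python. The function name base64_patterns and the two lookup tables are hypothetical names chosen for this example; the script assumes a keyword of at least a few bytes and escapes + and / so the output can sit between /.../ delimiters.

```python
import base64
import re

LEAD_DROP = {0: 0, 1: 2, 2: 3}  # leading chars that mix in preceding bytes
TAIL_DROP = {0: 0, 1: 3, 2: 2}  # trailing chars that carry padding

def base64_patterns(keyword: bytes) -> list:
    """Build one regexp per alignment, allowing whitespace between quads."""
    patterns = []
    for offset in range(3):
        encoded = base64.b64encode(b"X" * offset + keyword).decode("ascii")
        # Split on the encoder's own 4-character boundaries first, so the
        # optional whitespace lands exactly where a line break can occur.
        quads = [encoded[i:i + 4] for i in range(0, len(encoded), 4)]
        quads[0] = quads[0][LEAD_DROP[offset]:]
        tail = TAIL_DROP[(offset + len(keyword)) % 3]
        if tail:
            quads[-1] = quads[-1][:-tail]
        escaped = [q.replace("+", r"\+").replace("/", r"\/") for q in quads if q]
        patterns.append(r"\s*".join(escaped))
    return patterns

patterns = base64_patterns(b"Quanzhoucooway")
for pattern in patterns:
    print("/" + pattern + "/ DISCARD")

# Quick self-check: a line-wrapped encoding whose break falls inside the keyword.
wrapped = base64.encodebytes(b"a" * 50 + b"Quanzhoucooway and more").decode("ascii")
print(any(re.search(p, wrapped) for p in patterns))
```

The self-check uses Python's re module rather than PCRE, which is fine here because the generated patterns only use literals and \s*.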
If you really wanted, you could even implement case-insensitive matching by base64 encoding both the lowercase and the uppercase version of the keyword and combining them into a regexp that matches any combination of them. For example, the base64 encoding of quanzhoucooway is cXVhbnpob3Vjb293YXk= while that of QUANZHOUCOOWAY is UVVBTlpIT1VDT09XQVk=, so the rule:
    /[cU][XV]V[hB]\s*[bT][nl]p[oI]\s*[bT][31]V[jD]\s*[bT][20]9[3X]\s*[YQ][XV]/ DISCARD
will match the base64 encoded word "Quanzhoucooway" in any case, provided that it begins on a triplet boundary. Generating the other two corresponding regexps for the shifted versions is left as an exercise. ;)
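For the triplet-aligned case, the combination step could be sketched in Python roughly as follows. The function name caseless_base64_pattern is made up for this example, and the sketch assumes a plain ASCII keyword. It works position by position because upper and lower case ASCII letters differ in a single bit, so each base64 character depends on the case of at most one input byte.

```python
import base64
import re

TAIL_DROP = {0: 0, 1: 3, 2: 2}  # trailing chars that carry padding

def _escape(ch: str) -> str:
    # "+" and "/" are the only base64 characters that need escaping here.
    return "\\" + ch if ch in "+/" else ch

def caseless_base64_pattern(keyword: str) -> str:
    """Regexp for keyword in any letter case, aligned on a triplet boundary."""
    lower = base64.b64encode(keyword.lower().encode("ascii")).decode("ascii")
    upper = base64.b64encode(keyword.upper().encode("ascii")).decode("ascii")
    end = len(lower) - TAIL_DROP[len(keyword) % 3]
    quads = []
    for start in range(0, end, 4):
        stop = min(start + 4, end)
        quad = ""
        for lo, up in zip(lower[start:stop], upper[start:stop]):
            # Same character in both encodings: literal; otherwise a class.
            quad += _escape(lo) if lo == up else "[" + _escape(lo) + _escape(up) + "]"
        quads.append(quad)
    return r"\s*".join(quads)

pattern = caseless_base64_pattern("Quanzhoucooway")
print("/" + pattern + "/ DISCARD")

# Try it on a mixed-case spelling of the keyword.
mixed = base64.b64encode(b"qUaNzHoUcOoWaY").decode("ascii")
print(bool(re.search(pattern, mixed)))
```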
Alas, doing anything more complicated than simple substring matching like this quickly becomes impractical. But at least it's a neat trick. In principle, it could even be useful, if you for some reason could not use SpamAssassin or any other filter that can decode the base64 encoding before filtering. But if you can do that, instead of using hacks like this, you certainly should.