Replace MAC address with UUID

The following Perl script uses either the Digest::MD5 or the Digest::SHA module to turn a MAC address into a hash, using a secret salt. See the modules' man pages for more details. It's worth noting that Digest::SHA offers several more algorithms to choose from.

The code is written to make it easy to choose a different hashing algorithm: uncomment one and comment out the others to pick whichever suits you best. BTW, the output of the _base64 versions of the functions is a little shorter than that of the _hex functions, but looks more like line noise.

I simplified the regex you provided (I couldn't see any need for a look-behind). You may need to tweak it a bit to work with your input data; you didn't provide a sample, so I just guessed.

#!/usr/bin/perl

# choose one of the following digest modules:
use Digest::MD5 qw(md5_hex md5_base64);
#use Digest::SHA qw(sha256_hex sha256_base64);

use strict;
use warnings;

my $salt='secret salt phrase';

# store seen MAC addresses in a hash so we only have to calculate the digest
# for them once.  This speed optimisation is only useful if the input file
# is large AND any given MAC address may be seen many times.
my %macs=();

while(<>) {
  if (m/clientMac:\s*([A-Z0-9]{12})/i) {
    my $mac = $1;

    if (!defined($macs{$mac})) {
      # choose one of the following digest conversions:

      #my $uuid = sha256_hex($mac . $salt);
      #my $uuid = sha256_base64($mac . $salt);
      my $uuid = md5_hex($mac . $salt);
      #my $uuid = md5_base64($mac . $salt);

      $macs{$mac} = $uuid;
    };

    # no /o flag here: it would freeze the pattern at the first MAC seen
    s/(clientMac:\s*)$mac/$1$macs{$mac}/gi;
  };
  print;
};
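
For comparison, here is a minimal Python sketch of the same idea (a salted digest, cached per MAC). It is not a port of the Perl script: it assumes the same guessed clientMac: input format, and the names SALT, obfuscate_mac and obfuscate_line are made up for illustration. It also makes the size difference between the hex and base64 encodings easy to check.

```python
import base64
import hashlib
import re

SALT = "secret salt phrase"  # placeholder value, as in the Perl script

# Assumed input format, same guess as the Perl regex:
# "clientMac:" followed by optional whitespace and 12 alphanumerics.
MAC_RE = re.compile(r"(clientMac:\s*)([A-Z0-9]{12})", re.IGNORECASE)

_cache = {}  # (MAC, algorithm, encoding) -> digest, so each is computed once


def obfuscate_mac(mac, algo="md5", encoding="hex"):
    """Return the salted digest of mac, similar to md5_hex, sha256_base64, etc."""
    key = (mac, algo, encoding)
    if key not in _cache:
        raw = hashlib.new(algo, (mac + SALT).encode()).digest()
        if encoding == "hex":
            _cache[key] = raw.hex()
        else:
            # Perl's *_base64 functions omit the trailing "=" padding.
            _cache[key] = base64.b64encode(raw).decode().rstrip("=")
    return _cache[key]


def obfuscate_line(line):
    """Replace every clientMac value on the line with its digest."""
    return MAC_RE.sub(lambda m: m.group(1) + obfuscate_mac(m.group(2)), line)
```

Under these assumptions an MD5 digest comes out as 32 hex characters or 22 base64 characters, and a SHA-256 digest as 64 or 43.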

As requested in the comment, here is an example of how to perform such a substitution with sed. You used the /linux tag, so it should be safe to use GNU sed with the e flag of its s command:

sed -E 'h;s/.*clientMac":\s"([A-Z0-9]{12}).*/echo secretKey\1|md5sum/e;T
  G;s/([0-9a-f]+)\s*-\n(.*clientMac":\s")[A-Z0-9]{12}(.*)/\2\1\3/' logfile

Explanation:

  • The h command saves the line to the hold space, so we can restore it after messing up the line (-;
  • s/.*clientMac":\s"([A-Z0-9]{12}).*/echo secretKey\1|md5sum/e matches the whole line, capturing the actual MAC in () so it can be reused in the replacement. The replacement forms the command to be executed: echoing the MAC along with the "salt" and piping it into md5sum. The e flag makes sed execute this in the shell and put the result back into the pattern space
  • T branches to the end of the script if no replacement was made, so that lines without a MAC are printed unmodified. The following commands are executed only if a replacement was made
  • G appends the original line from the hold space, so the pattern space now contains the md5sum output, a newline and the original line
  • s/([0-9a-f]+)\s*-\n(.*clientMac":\s")[A-Z0-9]{12}(.*)/\2\1\3/ captures the MD5 in the first pair of (), the part of the line before the MAC in the second and the rest of the line after the MAC in the third, thus \2\1\3 replaces the MAC with the MD5. Matching the hash as [0-9a-f]+ rather than .* keeps the blanks that md5sum prints before its "-" out of the capture
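
The steps above can be sketched in Python as follows. This is only an illustration under the same assumptions as the sed command (the quoted clientMac": "..." layout and the placeholder salt secretKey); the function names are hypothetical. Since echo appends a newline to what it prints, the sketch hashes that newline too, so that the digests agree with what md5sum computes in the pipeline.

```python
import hashlib
import re

SECRET = "secretKey"  # same placeholder salt as in the sed command

# Assumed layout, as in the sed expression: clientMac": "0123456789AB
PATTERN = re.compile(r'(clientMac":\s")([A-Z0-9]{12})')


def md5sum_like(text):
    """Hex digest of text plus a trailing newline, as `echo text | md5sum` sees it."""
    return hashlib.md5((text + "\n").encode()).hexdigest()


def replace_mac(line):
    """Swap each MAC for its salted MD5; lines without a MAC pass through unchanged."""
    return PATTERN.sub(lambda m: m.group(1) + md5sum_like(SECRET + m.group(2)), line)
```

Unlike the sed command, which replaces one MAC per line, re.sub replaces every match on the line.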

As an alternative approach, I have sometimes used simple line numbers as the obfuscation value. This makes the output more compact and more readable.

Also, awk is a good tool when one needs to perform "smart" operations on a text file, as its language is more readable than sed's. The "smart" operation in this case is to avoid re-running the obfuscation algorithm when a MAC address is encountered more than once. This can speed things up quite a lot if you have thousands of lines referring to a small number of MAC addresses.

In practice, consider the following script, which also handles possible multiple MAC addresses occurring on any one line, identifying and replacing each occurrence, and then prints a mapping table at the end:

awk -v pat='clientMac"\\s*"[[:xdigit:]]{12}' -v table='sort -k 1,1n | column -t' -- '
$0 ~ pat {
    for (i=1; i <= NF; i++)
        if (match($i, pat)) {
            if (!($i in cache))
                cache[$i]=NR "." i
            $i = "MAC:" cache[$i]
        }
}
1
END {
    print "---Table: "FILENAME"\nnum MAC" | table
    for (mac in cache)
        print cache[mac], mac | table
}
' file.log

The table at the end can easily be separated from the main output by an additional editing step, or by making the command string in the -v table= argument redirect its output to a file, as in -v table='sort -k 1,1n | column -t > table'. It can also be dropped altogether by removing the entire END{ … } block.
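
A minimal Python sketch of the numbering idea, assuming MACs appear as bare runs of 12 hex digits (a hypothetical simplification of the pattern used above). It swaps in a running counter, in order of first appearance, for the line and field numbers the awk script uses, and it returns the mapping table instead of printing it. The name number_macs is made up for illustration.

```python
import re

# Assumption: a MAC is any standalone run of exactly 12 hex digits.
MAC_RE = re.compile(r"\b[0-9A-Fa-f]{12}\b")


def number_macs(lines):
    """Replace each MAC with "MAC:<n>", numbering MACs by first appearance.

    Returns the rewritten lines and the MAC -> number mapping table.
    """
    table = {}

    def repl(match):
        mac = match.group(0)
        if mac not in table:
            table[mac] = len(table) + 1  # next unused number
        return "MAC:%d" % table[mac]

    return [MAC_RE.sub(repl, line) for line in lines], table
```

Keeping the table separate from the rewritten lines corresponds to redirecting the table to its own file as described above.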

As a variant, the following script uses a real encryption engine to compute the obfuscation values, and hence needs no mapping table at the end:

awk -v pat='clientMac"\\s*"[[:xdigit:]]{12}' -v crypter='openssl enc -aes-256-cbc -a -pass file:mypassfile' -- '
$0 ~ pat {
    for (i=1; i <= NF; i++)
        if (match($i, pat)) {
            addr = cache[$i]
            if (addr == "") {
                cmd = "echo '\''" $i "'\'' | " crypter
                cmd | getline addr
                close(cmd)  # release the pipe; one is opened per distinct MAC
                cache[$i] = addr
            }
            $i = "MAC:" addr
        }
}
1
' file.log

Here I used openssl as the encryption engine, selecting its aes-256-cbc cipher (with base64-encoded output to keep it text-friendly) and making it read the encryption secret from a file named mypassfile.

Strings encrypted with a symmetric cipher (like aes-256-cbc) can be decrypted by anyone who knows the secret used (the contents of mypassfile, which you will want to keep to yourself), so the obfuscation is reversible. Also, since openssl uses a random salt by default, each run produces different values for the same input. Disabling the salt (option -nosalt) makes openssl produce the same value on every run, which is less secure, but on the other hand yields shorter texts that are still encrypted.

The same awk script would work for other external commands instead of openssl by just replacing the command in the -v crypter= argument to awk, as long as the external command you choose can accept input from stdin and print output to stdout.

Strings hashed with algorithms like MD5 or SHA, by contrast, are one-way only (i.e. they can't be reversed) and always produce the same value for the same input. You'd therefore want to "salt" them, so that the original addresses can't be recovered simply by hashing every possible MAC address and comparing the results. You might add a random "salt" as in the following slightly modified script:

awk -v pat='clientMac"\\s*"[[:xdigit:]]{12}' -v crypter='sha256sum' -- '
$0 ~ pat {
    for (i=1; i <= NF; i++)
        if (match($i, pat)) {
            addr = cache[$i]
            if (addr == "") {
                cmd = "(dd if=/dev/random bs=16 count=1 2>/dev/null; echo '\''" $i "'\'') | " crypter
                cmd | getline addr
                close(cmd)  # release the pipe; one is opened per distinct MAC
                cache[$i] = addr
            }
            $i = "MAC:" addr
        }
}
1
' file.log

This last script uses a 16-byte (pseudo-)random value as the "salt", thus producing a different hash value on each run over the same data.
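
The effect of a random salt can be sketched in Python as follows. This is an illustration only, with the hypothetical name make_obfuscator. Unlike the awk script, which reads fresh random bytes for each distinct MAC, it draws one 16-byte salt per run; either way the values differ between runs and stay consistent within a run.

```python
import hashlib
import os


def make_obfuscator(salt=None):
    """Return a function mapping a MAC to a salted SHA-256 hex digest.

    If no salt is given, a fresh 16-byte random one is drawn, so two
    obfuscators (think: two runs) disagree on the same MAC.
    """
    if salt is None:
        salt = os.urandom(16)
    cache = {}  # MAC -> digest, so repeated MACs are hashed only once

    def obfuscate(mac):
        if mac not in cache:
            cache[mac] = hashlib.sha256(salt + mac.encode()).hexdigest()
        return cache[mac]

    return obfuscate
```

Passing an explicit salt (and keeping it secret) gives repeatable values across runs instead, at the cost of having a secret to protect.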