How can I convert two-valued text data to binary (bit-representation)

Another perl:

perl -pe 'BEGIN { binmode \*STDOUT } chomp; tr/AB/\0\1/; $_ = pack "B*", $_'

Proof:

$ echo ABBBAAAABBBBBABBABBBABBB | \
    perl -pe 'BEGIN { binmode \*STDOUT } chomp; tr/AB/\0\1/; $_ = pack "B*", $_' | \
    od -tx1
0000000 70 fb 77
0000003

The above reads input one line at a time. It's up to you to make sure the lines are exactly what they are supposed to be.

Edit: The reverse operation:

#!/usr/bin/env perl

binmode \*STDIN;

while ( defined ( $_ = getc ) ) {
    $_ = unpack "B*";
    tr/01/AB/;
    print;
    print "\n" if ( not ++$cnt % 3 );
}
print "\n" if ( $cnt % 3 );

This reads a byte of input at a time.

Edit 2: Simpler reverse operation:

perl -pe 'BEGIN { $/ = \3; $\ = "\n"; binmode \*STDIN } $_ = unpack "B*"; tr/01/AB/'

The above reads 3 bytes at a time from STDIN (but receiving EOF in the middle of a sequence is not a fatal problem).


{   printf '2i[q]sq[?z0=qPl?x]s?l?x'
    tr -dc AB | tr AB 01 | fold -b24
}   <infile   | dc

In making the following statement, @lcd047 has pretty well nailed my earlier state of confusion:

You seem to be confused by the output of od. Use od -tx1 to look at bytes. od -x reads words, and on little endian machines that swaps bytes. I didn't follow closely the exchange above, but I think your initial version was correct, and you don't need to mess with byte order at all. Just use od -tx1, not od -x.

Now this makes me feel a lot better - the earlier need for dd conv=swab was bugging me all day. I couldn't pin it, but I knew there was something wrong w/ it. Being able to explain it away in my own stupidity is very comforting - especially since I learned something.

Anyway, that will delete every byte which isn't [AB], then translate those to [01] accordingly, before folding the resulting stream at 24 bytes per line. dc ? reads a line at a time, checks if input contained anything, and, if so, Prints the byte value of that number to stdout.

From man dc:

  • P

    • Pops off the value on top of the stack. If it is a string, it is simply printed without a trailing newline. Otherwise it is a number, and the integer portion of its absolute value is printed out as a "base (UCHAR_MAX+1)" byte stream.
  • i

    • Pops the value off the top of the stack and uses it to set the input radix.

some shell automation

Here is a shell function I wrote based on the above which can go both ways:

ABdc()( HOME=/dev/null  A='[fc[fc]]sp[100000000o]p2o[fc]' B=2i
        case    $1      in
        (-B) {  echo "$B"; tr AB 01      | paste -dP - ~      ; }| dc;;
        (-A) {  echo "$A"; od -vAn -tu1  | paste -dlpx - ~ ~ ~; }| dc|
         dc  |  paste - - - ~            | expand -t10,20,30     |
                cut -c2-9,12-19,22-29    | tr ' 01' AAB         ;;
        (*)     set '' "$1";: ${1:?Invalid opt: "'$2'"}         ;;
        esac
)

That will translate the ABABABA stuff to bytes with -B, so you can just do:

ABdc -B <infile

But it will translate arbitrary input to 24 ABABABA bit-per-byte encoded strings - in the same form as that presented for example in the question - w/ -B.

seq 5 | ABdc -A | tee /dev/fd/2 | ABdc -B

AABBAAABAAAABABAAABBAABA
AAAABABAAABBAABBAAAABABA
AABBABAAAAAABABAAABBABAB
AAAABABAAAAAAAAAAAAAAAAA
1
2
3
4
5

For -A output I rolled in cut, expand, and od here, which I'll get into in a minute, but I also added another dc. I dropped the line-for-line ? read dc script for another method which works an array at time with f - which is a command that prints the full dc command-stack to stdout. Of course, because dc is a stack-oriented last-in,first-out type of application, that means that the full-stack comes out in the reverse order it went in.

This might be a problem, but I use another dc anyway with an output radix set to 100000000 to handle all of the 0-padding as simply as possible. And when it reads the other's last-in,first-out stream, it applies that logic to it all over again, and it all comes out in the wash. The two dcs work in concert like this:

{   echo '[fc[fc]]sp[100000000o]p2o[fc]'
    echo some data | 
    od -An -tu1        ###arbitrary input to unsigned decimal ints
    echo lpx           ###load macro stored in p and execute
} | tee /dev/fd/2  |   ###just using tee to show stream stages
dc| tee /dev/fd/2  |dc 

...the stream per the first tee...

[fc[fc]]sp[100000000o]pc2o[fc]            ###dc's init cmd from 1st echo
 115 111 109 101  32 100  97 116  97  10  ###od's output
lpx                                       ###load p; execute

...per the second tee, as written from dc to dc...

100000000o                             ###first set output radix
1010                                   ###bin/rev vs of od's out
1100001                                ###dc #2 reads it in, revs and pads it 
1110100                                
1100001
1100100
100000
1100101
1101101
1101111                                ###this whole process is repeated
1110011                                ###once per od output line, so
fc                                     ###each worked array is 16 bytes.

...and the output which the second dc writes is...

 01110011
 01101111
 01101101
 01100101
 00100000
 01100100
 01100001
 01110100
 01100001
 00001010

From there the function pastes it on <tabs>...

 01110011    01101111    01101101
 01100101    00100000    01100100
 01100001    01110100    01100001
 00001010

...expands <tabs> to spaces at 10 column intervals...

 01110011  01101111  01101101
 01100101  00100000  01100100
 01100001  01110100  01100001
 00001010

...cuts away all but bytes 2-9,12-19,22-29...

011100110110111101101101
011001010010000001100100
011000010111010001100001
00001010

...and translates <spaces> and zeroes to A and ones to B...

ABBBAABBABBABBBBABBABBAB
ABBAABABAABAAAAAABBAABAA
ABBAAAABABBBABAAABBAAAAB
AAAABABAAAAAAAAAAAAAAAAA

You can see on the last line there my primary motivation for including expand - it's such a lightweight filter, and it very easily ensures that every sequence written - even the last - is padded out to 24 encoded-bits. When that is process reversed, and the strings are decoded to -Byte-value, there are two appended NULs:

ABdc -B <<\IN | od -tc
ABBBAABBABBABBBBABBABBAB
ABBAABABAABAAAAAABBAABAA
ABBAAAABABBBABAAABBAAAAB
AAAABABAAAAAAAAAAAAAAAAA
IN

...as you can see...

0000000   s   o   m   e       d   a   t   a  \n  \0  \0
0000014

real world data

I played with it, and tried it with some simple, realistic streams. I constructed this elaborate pipeline for staged reports...

{                            ###dunno why, but I often use man man
    (                        ###as a test input source
        {   man man       |  ###streamed to tee
            tee /dev/fd/3 |  ###branched to stdout
            wc -c >&2        ###and to count source bytes
        }   3>&1          |  ###the branch to stdout is here
        ABdc -A           |  ###converted to ABABABA
        tee /dev/fd/3     |  ###branched again
        ABdc -B              ###converted back to bytes
        times >&2            ###the process is timed
    ) | wc -c >&2            ###ABdc -B's output is counted
} 3>&1| wc -c                ###and so is the output of ABdc -A

I don't have any good basis for performance comparison, here, though. I can only say that I was driven to this test when I was (perhaps naively) impressed enough to do so by...

man man | ABdc -A | ABdc -B

...which painted my terminal screen w/ man's output at the same discernible speed as the unfiltered command might do. The output of the test was...

37595                       ###source byte count
0m0.000000s 0m0.000000s     ###shell processor time nil
0m0.720000s 0m0.250000s     ###shell children's total user, system time
37596                       ###ABdc -B output byte count
313300                      ###ABdc -A output byte count

initial tests

The rest is just a more simple proof of concept that it works at all...

printf %s ABBBAAAABBBBBABBABBBABBB|
tee - - - - - - - -|
tee - - - - - - - - - - - - - - - |
{   printf '2i[q]sq[?z0=qPl?x]s?l?x'
    tr -dc AB | tr AB 01 | fold -b24
} | dc        | od -tx1
0000000 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000020 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000040 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000060 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000100 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000120 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000140 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000160 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000200 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000220 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000240 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000260 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000300 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000320 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000340 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000360 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000400 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000420 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000440 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000460 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000500 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000520 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000540 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000560 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000600 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000620 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000640 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000660