# Golf my "pre-golfed" C

g.(m.w.r.r=<<).lines.f
n:c:z="\n#_0123456789"++['A'..'Z']++['a'..'z']
(!)x=elem x
f('\\':'\n':a)=f a
f(a:b)=a:f b
f a=a
m('#':a)=c:a++[n]
m a=a
g(a:'#':b)=a:[n|a/=n]++c:g b
g(a:b)=a:g b
g a=a
s=span(!" \t")
r=reverse.snd.s
l n(a:b)d|a==d,n=a:w(snd$s b)|1>0=a:l(not$n&&a=='\\')b d
w('/':'/':_)=[]
w(a:b)|a!"\"'"=a:l(1>0)b a|(p,q:u)<-s b=a:[' '|p>"",a!z&&q!z||[a,q]!words"++ -- /*"]++w(q:u)
w a=a


Try it online!

This was a lot of fun to write! First the f function comes through and removes all the backslashes at the end of lines then lines breaks it into a list of strings at the newlines. Then we map a bunch of functions onto the lines and concatenate them all back together. Those functions: strip whitespace from the left (t) and from the right (r.t.r where r is reverse); remove whitespace from the middle, ignoring string and character literals as well as removing comments (w); and finally adds a newline character to the end if the line starts with a #. After all the lines are concatenated back together g looks for # characters and ensures they are preceded by a newline.

w is a little complex so I'll explain it further. First I check for "//" since in w I know I'm not in a string literal I know this is a comment so I drop the rest of the line. Next I check if the head is a delimiter for a string or character literal. If it is I prepend it and pass the baton to l which runs through the characters, tracking the "escape" state with n which will be true if there have been an even number of consecutive slashes. When l detects a delimiter and isn't in the escape state it passes the baton back to w, trimming to eliminate whitespace after the literal because w expects the first character to not be whitespace. When w doesn't find a delimiter it uses span to look for whitespace in the tail. If there's any it checks if the characters around it are cannot be brought into contact and inserts a space if so. Then it recurs after the whitespace is ended. If there was no whitespace no space is inserted and it moves on anyway.

EDIT: Thanks a lot to @DLosc for pointing out a bug in my program that actually led to a way for me to shorten it as well! Hooray for pattern matching!

EDIT2: I'm an idiot who didn't finish reading the spec! Thanks again DLosc for pointing that out!

EDIT3: Just noticed some annoying type reduction thing that turned e=elem into Char->[Char]->Bool for some reason, thus breaking on e[a,q]. I had to add a type signature to force it to be correct. Does anyone know how I could fix that? I've never had this problem in Haskell before. TIO

EDIT4: quick fix for bug @FelixPalmen showed me. I might try to golf it down later when I have some time.

EDIT5: -24 bytes thanks to @Lynn! Thank you! I didn't know you could assign things on the global scope using pattern matching like n:c:z=... that's really cool! Also good idea making an operator for elem wish I'd thought of that.

## Pip, 148135133 138 bytes

aRM"\
"R("|').*?(?<!\\)(\\\\)*\1{lPBaC:++i+191}R[//.*#.*{X*aJw.(?=}.')M[A\wRL2"++""--""/*"]w¶+'·C(192+,#l)][x_WR'¶{aRw'·}xnsl]


Bytes are counted in CP-1252, so ¶ and · are one byte each. Note that this expects the C code as a single command-line argument, which (on an actual command line) would require the use of copious escape sequences. It's much easier at Try it online!

### Explanation of slightly ungolfed version

The code does a bunch of replacement operations, with a couple tricks.

### Backslash continuations

We RM all occurrences of the literal string

"\
"


that is, backslash followed by newline.

### String and character literals

We use a regex replacement with a callback function:

("|').*?(?<!\\)(\\\\)*\1

{
lPBa
C(++i + 191)
}


The regex matches a single or double quote, followed by a non-greedy .*? that matches 0 or more characters, as few as possible. We have a negative lookbehind to ensure that the previous character was not a backslash; then we match an even number of backslashes followed by the opening delimiter again.

The callback function takes the string/character literal and pushes it to the back of the list l. It then returns a character starting with character code 192 (À) and increasing with each literal replaced. Thus, code is transformed like so:

printf("%c", '\'');

printf(À, Á);


These replacement characters are guaranteed not to occur in the source code, which means we can unambiguously back-substitute them later.

//.*

x


The regex matches // plus everything up to the newline and replaces with x (preset to the empty string).

### Preprocessor directives

#.*

_WR'¶


Wraps runs of non-newline characters starting with a pound sign in ¶.

### Spaces that should not be eliminated

{
(
X*a J w.(?=
) . ')
}
M
[
A\w RL 2
"++"
"--"
"/*"
]

{
a R w '·
}


There's a lot going on here. The first part generates this list of regexes to replace:

[
(?a)\w\s+(?=(?a)\w)  Whitespace surrounded by [a-zA-Z_]
\+\s+(?=\+)          Whitespace surrounded by +
\-\s+(?=\-)          Whitespace surrounded by -
\/\s+(?=\*)          Whitespace surrounded by / *
]


Note the use of lookaheads to match, for example, just the e  in define P printf. That way this match doesn't consume the P, which means the next match can use it.

We generate this list of regexes by mapping a function to a list, where the list contains

[
[(?a)\w (?a)\w]
"++"
"--"
"/*"
]


and the function does this to each element:

(X*aJw.(?=).')
X*a              Map unary X to elements/chars a: converts to regex, escaping as needed
Regexes like \w stay unchanged; strings like "+" become \+
J             Join the resulting list on:
w             Preset variable for \s+
.(?=       plus the beginning of the lookahead syntax
(           ).')  Concatenate the closing paren of the lookahead


Once we have our regexes, we replace their occurrences with this callback function:

{aRw'·}


which replaces the run of whitespace in each match with ·.

### Whitespace elimination and cleanup

[w ¶+ '·]

[x n s]


Three successive replacements substitute remaining runs of whitespace (w) for empty string (x), runs of ¶ for newline, and · for space.

### Back-substitution of string and character literals

C(192+,#l)

l


We construct a list of all the characters we used as substitutions for literals by taking 192 + range(len(l)) and converting to characters. We then can replace each of these with its associated literal in l.

And that's it! The resulting string is autoprinted.

## C, 497494490 489 bytes

Since we're processing C, let's do it using C! Function f() takes input from char pointer p and outputs to pointer q, and assumes that the input is in ASCII:

#define O*q++
#define R (r=*p++)
#define V(c)(isalnum(c)||c==95)
char*p,*q,r,s,t;d(){isspace(r)?g():r==47&&*p==r?c(),g():r==92?e():(O=s=r)==34?b():r==39?O=R,a():r?a():(O=r);}a(){R;d();}b(){((O=R)==34?a:r==92?O=R,b:b)();}c(){while(R-10)p+=r==92;}e(){R-10?s=O=92,O=r,a():h();}j(){(!isspace(R)?r==47&&*p==r?c(),j:(t=r==35,d):j)();}f(){t=*p==35;j();}i(){V(s)&&V(r)||s==47&&r==42||(s==43||s==45)&&r==s&&*p==s?O=32:0;d();}h(){isspace(R)?g():i();}g(){(r==10?t?O=r,j:*p==35?s-10?s=O=r,j:0:h:h)();}


We assume that the file is well-formed - string and character literals are closed, and if there's a comment on the final line, there must be a newline to close it.

### Explanation

The pre-golfed version is only slightly more legible, I'm afraid:

#define O *q++=
#define R (r=*p++)
#define V(c)(isalnum(c)||c=='_')
char*p,*q,r,s,t;
d(){isspace(r)?g():r=='/'&&*p==r?c(),g():r=='\\'?e():(O s=r)=='"'?b():r=='\''?O R,a():r?a():(O r);}
a(){R;d();}
b(){((O R)=='"'?a:r=='\\'?O R,b:b)();}
c(){while(R!='\n')p+=r=='\\';}
e(){R!='\n'?s=O'\\',O r,a():h();}
j(){(!isspace(R)?r=='/'&&*p==r?c(),j:(t=r=='#',d):j)();}
f(){t=*p=='#';j();}
i(){V(s)&&V(r)||s=='/'&&r=='*'||(s=='+'||s=='-')&&r==s&&*p==s?O' ':0;d();}
h(){isspace(R)?g():i();}
g(){(r=='\n'?t?O r,j:*p=='#'?s!='\n'?s=O r,j:0:h:h)();}


It implements a state machine by tail recursion. The helper macros and variables are

• O for output
• R to read input into r
• V to determine valid identifier characters (since !isalnum('_'))
• p and q - I/O pointers as described
• r - last character to be read
• s - saved recent non-whitespace character
• t - tag when working on a preprocessor directive

Our states are

• a() - normal C code
• b() - string literal
• c() - comment
• d() - normal C code, after reading r
• e() - escape sequence
• f() - initial state (main function)
• g() - in whitespace
• h() - in whitespace - dispatch to g() or i()
• i() - immediately after whitespace - do we need to insert a space character?
• j() - initial whitespace - never insert a space character

### Test program

#define DEMO(code)                              \
do {                                        \
char in[] = code;                       \
char out[sizeof in];                    \
p=in;q=out;f();                         \
puts("vvvvvvvvvv");                     \
puts(out);                              \
puts("^^^^^^^^^^");                     \
} while (0)

#include<stdio.h>
#include<stdlib.h>
int main()
{
DEMO(
"main() {\n"
"    printf(\"Hello, World!\"); // hi\n"
"}\n"
);
DEMO(
"#define max(x, y)                               \\\n"
"    x > y ? x : y\n"
"#define I(x) scanf(\"%d\", &x)\n"
"a;\n"
"b; // just a needless comment, \\\n"
"        because we can!\n"
"main()\n"
"{\n"
"    I(a);\n"
"    I(b);\n"
"    printf(\"\\\" max \\\": %d\\n\", max(a, b));\n"
"}\n"
);
DEMO(
"x;*c;i;\n"
"main()\n"
"{\n"
"    int _e;\n"
"    for(; scanf(\"%d\", &x) > 0 && ++_e;);\n"
"    for(c = x + _e; c --> x; i = 100 / *x, printf(\"%d \", i - --_e));\n"
"}\n"
);
DEMO(
"// often used functions/keywords:\n"
"#define P printf(\n"
"#define A case\n"
"#define B break\n"
"\n"
"// loops for copying rows upwards/downwards are similar -> macro\n"
"#define L(i, e, t, f, s) \\\n"
"        for (o=i; o e;){ strcpy(l[o t], l[o f]); c[o t]=c[s o]; }\n"
"\n"
"// range check for rows/columns is similar -> macro\n"
"#define R(m,o) { return b<1|b>m ? m o : b; }\n"
"\n"
"// checking for numerical input is needed twice (move and print command):\n"
"#define N(f) sscanf(f, \"%d,%d\", &i, &j) || sscanf(f, \",%d\", &j)\n"
"\n"
"// room for 999 rows with each 999 cols (not specified, should be enough)\n"
"// also declare \"current line pointers\" (*L for data, *C for line length),\n"
"// an input buffer (a) and scratch variables\n"
"r, i, j, o, z, c, *C, x=1, y=1;\n"
"char a, l, (*L);\n"
"\n"
"// move rows down from current cursor position\n"
"D()\n"
"{\n"
"    L(r, >y, , -1, --)\n"
"    r++ ? strcpy(l[o], l[o-1]+--x), c[o-1]=x, l[o-1][x]=0 : 0;\n"
"    c[y++] = strlen(l[o]);\n"
"    x=1;\n"
"}\n"
"\n"
"// move rows up, appending uppermost to current line\n"
"U()\n"
"{\n"
"    strcat(*L, l[y]);\n"
"    *C = strlen(*L);\n"
"    L(y+1, <r, -1, , ++)\n"
"    --r;\n"
"    *l[r] = c[r] = 0;\n"
"}\n"
"\n"
"// normalize positions, treat 0 as max\n"
"X(b) R(c[y-1], +1)\n"
"Y(b) R(r, )\n"
"\n"
"main()\n"
"{\n"
"    for(;;) // forever\n"
"    {\n"
"        // initialize z as current line index, the current line pointers,\n"
"        // i and j for default values of positioning\n"
"        z = i = y;\n"
"        L = l + --z;\n"
"        C = c + z;\n"
"        j = x;\n"
"\n"
"        // prompt:\n"
"        !r || y/r && x > *C\n"
"            ? P \"end> \")\n"
"            : P \"%d,%d> \", y, x);\n"
"\n"
"        // read a line of input (using scanf so we don't need an include)\n"
"        scanf(\"%[^\\n]%*c\", a)\n"
"\n"
"            // no command arguments -> make check easier:\n"
"            ? a *= !!a,\n"
"\n"
"            // numerical input -> have move command:\n"
"            // calculate new coordinates, checking for \"relative\"\n"
"            N(a)\n"
"                ? y = Y(i + (i<0 | *a=='+') * y)\n"
"                    , x = X(j + (j<0 || strchr(a+1, '+')) * x)\n"
"                :0\n"
"\n"
"            // check for empty input, read single newline\n"
"            // and perform <return> command:\n"
"            : ( *a = D(), scanf(\"%*c\") );\n"
"\n"
"        switch(*a)\n"
"        {\n"
"            A 'e':\n"
"                y = r;\n"
"                x = c[r-1] + 1;\n"
"                B;\n"
"\n"
"            A 'b':\n"
"                y = 1;\n"
"                x = 1;\n"
"                B;\n"
"\n"
"            A 'L':\n"
"                for(o = y-4; ++o < y+2;)\n"
"                    o<0 ^ o<r && P \"%c%s\\n\", o^z ? ' ' : '>', l[o]);\n"
"                for(o = x+1; --o;)\n"
"                    P \" \");\n"
"                P \"^\\n\");\n"
"                B;\n"
"\n"
"            A 'l':\n"
"                puts(*L);\n"
"                B;\n"
"\n"
"            A 'p':\n"
"                i = 1;\n"
"                j = 0;\n"
"                N(a+2);\n"
"                for(o = Y(i)-1; o<Y(j); ++o)\n"
"                    puts(l[o]);\n"
"                B;\n"
"\n"
"            A 'A':\n"
"                y = r++;\n"
"                strcpy(l[y], a+2);\n"
"                x = c[y] = strlen(a+2);\n"
"                ++x;\n"
"                ++y;\n"
"                B;\n"
"\n"
"            A 'i':\n"
"                D();\n"
"                --y;\n"
"                x=X(0);\n"
"                // Commands i and r are very similar -> fall through\n"
"                // from i to r after moving rows down and setting\n"
"                // position at end of line:\n"
"\n"
"            A 'r':\n"
"                strcpy(*L+x-1, a+2);\n"
"                *C = strlen(*L);\n"
"                x = 1;\n"
"                ++y > r && ++r;\n"
"                B;\n"
"\n"
"            A 'I':\n"
"                o = strlen(a+2);\n"
"                memmove(*L+x+o-1, *L+x-1, *C-x+1);\n"
"                *C += o;\n"
"                memcpy(*L+x-1, a+2, o);\n"
"                x += o;\n"
"                B;\n"
"\n"
"            A 'd':\n"
"                **L ? **L = *C = 0, x = 1 : U();\n"
"                y = y>r ? r : y;\n"
"                B;\n"
"\n"
"            A 'j':\n"
"                y<r && U();\n"
"        }\n"
"    }\n"
"}\n";);
}


This produces

main(){printf("Hello, World!");}

#define max(x,y)x>y?x:y
#define I(x)scanf("%d",&x)
a;b;main(){I(a);I(b);printf("\" max \": %d\n",max(a,b));}

x;*c;i;main(){int _e;for(;scanf("%d",&x)>0&&++_e;);for(c=x+_e;c-->x;i=100/ *x,printf("%d ",i- --_e));}

#define P printf(
#define A case
#define B break
#define L(i,e,t,f,s)for(o=i;o e;){strcpy(l[o t],l[o f]);c[o t]=c[s o];}
#define R(m,o){return b<1|b>m?m o:b;}
#define N(f)sscanf(f,"%d,%d",&i,&j)||sscanf(f,",%d",&j)
r,i,j,o,z,c,*C,x=1,y=1;char a,l,(*L);D(){L(r,>y,,-1,--)r++?strcpy(l[o],l[o-1]+--x),c[o-1]=x,l[o-1][x]=0:0;c[y++]=strlen(l[o]);x=1;}U(){strcat(*L,l[y]);*C=strlen(*L);L(y+1,<r,-1,,++)--r;*l[r]=c[r]=0;}X(b)R(c[y-1],+1)Y(b)R(r,)main(){for(;;){z=i=y;L=l+--z;C=c+z;j=x;!r||y/r&&x>*C?P"end> "):P"%d,%d> ",y,x);scanf("%[^\n]%*c",a)?a*=!!a,N(a)?y=Y(i+(i<0|*a=='+')*y),x=X(j+(j<0||strchr(a+1,'+'))*x):0:(*a=D(),scanf("%*c"));switch(*a){A'e':y=r;x=c[r-1]+1;B;A'b':y=1;x=1;B;A'L':for(o=y-4;++o<y+2;)o<0^o<r&&P"%c%s\n",o^z?' ' :'>',l[o]);for(o=x+1;--o;)P" ");P"^\n");B;A'l':puts(*L);B;A'p':i=1;j=0;N(a+2);for(o=Y(i)-1;o<Y(j);++o)puts(l[o]);B;A'A':y=r++;strcpy(l[y],a+2);x=c[y]=strlen(a+2);++x;++y;B;A'i':D();--y;x=X(0);A'r':strcpy(*L+x-1,a+2);*C=strlen(*L);x=1;++y>r&&++r;B;A'I':o=strlen(a+2);memmove(*L+x+o-1,*L+x-1,*C-x+1);*C+=o;memcpy(*L+x-1,a+2,o);x+=o;B;A'd':**L?**L=*C=0,x=1:U();y=y>r?r:y;B;A'j':y<r&&U();}}}


### Limitation

This breaks definitions such as

#define A (x)


by removing the space that separates the name from expansion, giving

#define A(x)


with an entirely different meaning. This case is absent from the test sets, so I'm not going to address it.

I suspect I might be able to produce a shorter version with a multi-pass in-place conversion - I might try that next week.