Golf my "pre-golfed" C

Haskell, 327 360 418 394 bytes

g.(m.w.r.r=<<).lines.f
n:c:z="\n#_0123456789"++['A'..'Z']++['a'..'z']
(!)x=elem x
f('\\':'\n':a)=f a
f(a:b)=a:f b
f a=a
m('#':a)=c:a++[n]
m a=a
g(a:'#':b)=a:[n|a/=n]++c:g b
g(a:b)=a:g b
g a=a
s=span(!" \t")
r=reverse.snd.s
l n(a:b)d|a==d,n=a:w(snd$s b)|1>0=a:l(not$n&&a=='\\')b d
w('/':'/':_)=[]
w(a:b)|a!"\"'"=a:l(1>0)b a|(p,q:u)<-s b=a:[' '|p>"",a!z&&q!z||[a,q]!words"++ -- /*"]++w(q:u)
w a=a

Try it online!

This was a lot of fun to write! First the f function comes through and removes all the backslashes at the end of lines then lines breaks it into a list of strings at the newlines. Then we map a bunch of functions onto the lines and concatenate them all back together. Those functions: strip whitespace from the left (t) and from the right (r.t.r where r is reverse); remove whitespace from the middle, ignoring string and character literals as well as removing comments (w); and finally adds a newline character to the end if the line starts with a #. After all the lines are concatenated back together g looks for # characters and ensures they are preceded by a newline.

w is a little complex so I'll explain it further. First I check for "//" since in w I know I'm not in a string literal I know this is a comment so I drop the rest of the line. Next I check if the head is a delimiter for a string or character literal. If it is I prepend it and pass the baton to l which runs through the characters, tracking the "escape" state with n which will be true if there have been an even number of consecutive slashes. When l detects a delimiter and isn't in the escape state it passes the baton back to w, trimming to eliminate whitespace after the literal because w expects the first character to not be whitespace. When w doesn't find a delimiter it uses span to look for whitespace in the tail. If there's any it checks if the characters around it are cannot be brought into contact and inserts a space if so. Then it recurs after the whitespace is ended. If there was no whitespace no space is inserted and it moves on anyway.

EDIT: Thanks a lot to @DLosc for pointing out a bug in my program that actually led to a way for me to shorten it as well! Hooray for pattern matching!

EDIT2: I'm an idiot who didn't finish reading the spec! Thanks again DLosc for pointing that out!

EDIT3: Just noticed some annoying type reduction thing that turned e=elem into Char->[Char]->Bool for some reason, thus breaking on e[a,q]. I had to add a type signature to force it to be correct. Does anyone know how I could fix that? I've never had this problem in Haskell before. TIO

EDIT4: quick fix for bug @FelixPalmen showed me. I might try to golf it down later when I have some time.

EDIT5: -24 bytes thanks to @Lynn! Thank you! I didn't know you could assign things on the global scope using pattern matching like n:c:z=... that's really cool! Also good idea making an operator for elem wish I'd thought of that.

Pip, 148 135 133 138 bytes

aRM"\
"R`("|').*?(?<!\\)(\\\\)*\1`{lPBaC:++i+191}R[`//.*``#.*`{X*aJw.`(?=`}.')M[A`\w`RL2"++""--""/*"]w`¶+`'·C(192+,#l)][x_WR'¶{aRw'·}xnsl]

Bytes are counted in CP-1252, so ¶ and · are one byte each. Note that this expects the C code as a single command-line argument, which (on an actual command line) would require the use of copious escape sequences. It's much easier at Try it online!

Explanation of slightly ungolfed version

The code does a bunch of replacement operations, with a couple tricks.

Backslash continuations

We RM all occurrences of the literal string

"\
"

that is, backslash followed by newline.

String and character literals

We use a regex replacement with a callback function:

`("|').*?(?<!\\)(\\\\)*\1`

{
 lPBa
 C(++i + 191)
}

The regex matches a single or double quote, followed by a non-greedy .*? that matches 0 or more characters, as few as possible. We have a negative lookbehind to ensure that the previous character was not a backslash; then we match an even number of backslashes followed by the opening delimiter again.

The callback function takes the string/character literal and pushes it to the back of the list l. It then returns a character starting with character code 192 (À) and increasing with each literal replaced. Thus, code is transformed like so:

printf("%c", '\'');

printf(À, Á);

These replacement characters are guaranteed not to occur in the source code, which means we can unambiguously back-substitute them later.

Comments

`//.*`

x

The regex matches // plus everything up to the newline and replaces with x (preset to the empty string).

Preprocessor directives

`#.*`

_WR'¶

Wraps runs of non-newline characters starting with a pound sign in ¶.

Spaces that should not be eliminated

{
 (
  X*a J w.`(?=`
 ) . ')
}
M
[
 A`\w` RL 2
 "++"
 "--"
 "/*"
]

{
 a R w '·
}

There's a lot going on here. The first part generates this list of regexes to replace:

[
 `(?a)\w\s+(?=(?a)\w)`  Whitespace surrounded by [a-zA-Z_]
 `\+\s+(?=\+)`          Whitespace surrounded by +
 `\-\s+(?=\-)`          Whitespace surrounded by -
 `\/\s+(?=\*)`          Whitespace surrounded by / *
]

Note the use of lookaheads to match, for example, just the e in define P printf. That way this match doesn't consume the P, which means the next match can use it.

We generate this list of regexes by mapping a function to a list, where the list contains

[
 [`(?a)\w` `(?a)\w`]
 "++"
 "--"
 "/*"
]

and the function does this to each element:

(X*aJw.`(?=`).')
 X*a              Map unary X to elements/chars a: converts to regex, escaping as needed
                  Regexes like `\w` stay unchanged; strings like "+" become `\+`
    J             Join the resulting list on:
     w             Preset variable for `\s+`
      .`(?=`       plus the beginning of the lookahead syntax
(           ).')  Concatenate the closing paren of the lookahead

Once we have our regexes, we replace their occurrences with this callback function:

{aRw'·}

which replaces the run of whitespace in each match with ·.

Whitespace elimination and cleanup

[w `¶+` '·]

[x n s]

Three successive replacements substitute remaining runs of whitespace (w) for empty string (x), runs of ¶ for newline, and · for space.

Back-substitution of string and character literals

C(192+,#l)

l

We construct a list of all the characters we used as substitutions for literals by taking 192 + range(len(l)) and converting to characters. We then can replace each of these with its associated literal in l.

And that's it! The resulting string is autoprinted.

C, 497 494 490 489 bytes

Since we're processing C, let's do it using C! Function f() takes input from char pointer p and outputs to pointer q, and assumes that the input is in ASCII:

#define O*q++
#define R (r=*p++)
#define V(c)(isalnum(c)||c==95)
char*p,*q,r,s,t;d(){isspace(r)?g():r==47&&*p==r?c(),g():r==92?e():(O=s=r)==34?b():r==39?O=R,a():r?a():(O=r);}a(){R;d();}b(){((O=R)==34?a:r==92?O=R,b:b)();}c(){while(R-10)p+=r==92;}e(){R-10?s=O=92,O=r,a():h();}j(){(!isspace(R)?r==47&&*p==r?c(),j:(t=r==35,d):j)();}f(){t=*p==35;j();}i(){V(s)&&V(r)||s==47&&r==42||(s==43||s==45)&&r==s&&*p==s?O=32:0;d();}h(){isspace(R)?g():i();}g(){(r==10?t?O=r,j:*p==35?s-10?s=O=r,j:0:h:h)();}

We assume that the file is well-formed - string and character literals are closed, and if there's a comment on the final line, there must be a newline to close it.

Explanation

The pre-golfed version is only slightly more legible, I'm afraid:

#define O *q++=
#define R (r=*p++)
#define V(c)(isalnum(c)||c=='_')
char*p,*q,r,s,t;
d(){isspace(r)?g():r=='/'&&*p==r?c(),g():r=='\\'?e():(O s=r)=='"'?b():r=='\''?O R,a():r?a():(O r);}
a(){R;d();}
b(){((O R)=='"'?a:r=='\\'?O R,b:b)();}
c(){while(R!='\n')p+=r=='\\';}
e(){R!='\n'?s=O'\\',O r,a():h();}
j(){(!isspace(R)?r=='/'&&*p==r?c(),j:(t=r=='#',d):j)();}
f(){t=*p=='#';j();}
i(){V(s)&&V(r)||s=='/'&&r=='*'||(s=='+'||s=='-')&&r==s&&*p==s?O' ':0;d();}
h(){isspace(R)?g():i();}
g(){(r=='\n'?t?O r,j:*p=='#'?s!='\n'?s=O r,j:0:h:h)();}

It implements a state machine by tail recursion. The helper macros and variables are

O for output
R to read input into r
V to determine valid identifier characters (since !isalnum('_'))
p and q - I/O pointers as described
r - last character to be read
s - saved recent non-whitespace character
t - tag when working on a preprocessor directive

Our states are

a() - normal C code
b() - string literal
c() - comment
d() - normal C code, after reading r
e() - escape sequence
f() - initial state (main function)
g() - in whitespace
h() - in whitespace - dispatch to g() or i()
i() - immediately after whitespace - do we need to insert a space character?
j() - initial whitespace - never insert a space character

Test program

#define DEMO(code)                              \
    do {                                        \
        char in[] = code;                       \
        char out[sizeof in];                    \
        p=in;q=out;f();                         \
        puts("vvvvvvvvvv");                     \
        puts(out);                              \
        puts("^^^^^^^^^^");                     \
    } while (0)

#include<stdio.h>
#include<stdlib.h>
int main()
{
    DEMO(
         "main() {\n"
         "    printf(\"Hello, World!\"); // hi\n"
         "}\n"
         );
    DEMO(
         "#define max(x, y)                               \\\n"
         "    x > y ? x : y\n"
         "#define I(x) scanf(\"%d\", &x)\n"
         "a;\n"
         "b; // just a needless comment, \\\n"
         "        because we can!\n"
         "main()\n"
         "{\n"
         "    I(a);\n"
         "    I(b);\n"
         "    printf(\"\\\" max \\\": %d\\n\", max(a, b));\n"
         "}\n"
         );
    DEMO(
         "x[10];*c;i;\n"
         "main()\n"
         "{\n"
         "    int _e;\n"
         "    for(; scanf(\"%d\", &x) > 0 && ++_e;);\n"
         "    for(c = x + _e; c --> x; i = 100 / *x, printf(\"%d \", i - --_e));\n"
         "}\n"
         );
    DEMO(
         "// often used functions/keywords:\n"
         "#define P printf(\n"
         "#define A case\n"
         "#define B break\n"
         "\n"
         "// loops for copying rows upwards/downwards are similar -> macro\n"
         "#define L(i, e, t, f, s) \\\n"
         "        for (o=i; o e;){ strcpy(l[o t], l[o f]); c[o t]=c[s o]; }\n"
         "\n"
         "// range check for rows/columns is similar -> macro\n"
         "#define R(m,o) { return b<1|b>m ? m o : b; }\n"
         "\n"
         "// checking for numerical input is needed twice (move and print command):\n"
         "#define N(f) sscanf(f, \"%d,%d\", &i, &j) || sscanf(f, \",%d\", &j)\n"
         "\n"
         "// room for 999 rows with each 999 cols (not specified, should be enough)\n"
         "// also declare \"current line pointers\" (*L for data, *C for line length),\n"
         "// an input buffer (a) and scratch variables\n"
         "r, i, j, o, z, c[999], *C, x=1, y=1;\n"
         "char a[999], l[999][999], (*L)[999];\n"
         "\n"
         "// move rows down from current cursor position\n"
         "D()\n"
         "{\n"
         "    L(r, >y, , -1, --)\n"
         "    r++ ? strcpy(l[o], l[o-1]+--x), c[o-1]=x, l[o-1][x]=0 : 0;\n"
         "    c[y++] = strlen(l[o]);\n"
         "    x=1;\n"
         "}\n"
         "\n"
         "// move rows up, appending uppermost to current line\n"
         "U()\n"
         "{\n"
         "    strcat(*L, l[y]);\n"
         "    *C = strlen(*L);\n"
         "    L(y+1, <r, -1, , ++)\n"
         "    --r;\n"
         "    *l[r] = c[r] = 0;\n"
         "}\n"
         "\n"
         "// normalize positions, treat 0 as max\n"
         "X(b) R(c[y-1], +1)\n"
         "Y(b) R(r, )\n"
         "\n"
         "main()\n"
         "{\n"
         "    for(;;) // forever\n"
         "    {\n"
         "        // initialize z as current line index, the current line pointers,\n"
         "        // i and j for default values of positioning\n"
         "        z = i = y;\n"
         "        L = l + --z;\n"
         "        C = c + z;\n"
         "        j = x;\n"
         "\n"
         "        // prompt:\n"
         "        !r || y/r && x > *C\n"
         "            ? P \"end> \")\n"
         "            : P \"%d,%d> \", y, x);\n"
         "\n"
         "        // read a line of input (using scanf so we don't need an include)\n"
         "        scanf(\"%[^\\n]%*c\", a)\n"
         "\n"
         "            // no command arguments -> make check easier:\n"
         "            ? a[2] *= !!a[1],\n"
         "\n"
         "            // numerical input -> have move command:\n"
         "            // calculate new coordinates, checking for \"relative\"\n"
         "            N(a)\n"
         "                ? y = Y(i + (i<0 | *a=='+') * y)\n"
         "                    , x = X(j + (j<0 || strchr(a+1, '+')) * x)\n"
         "                :0\n"
         "\n"
         "            // check for empty input, read single newline\n"
         "            // and perform <return> command:\n"
         "            : ( *a = D(), scanf(\"%*c\") );\n"
         "\n"
         "        switch(*a)\n"
         "        {\n"
         "            A 'e':\n"
         "                y = r;\n"
         "                x = c[r-1] + 1;\n"
         "                B;\n"
         "\n"
         "            A 'b':\n"
         "                y = 1;\n"
         "                x = 1;\n"
         "                B;\n"
         "\n"
         "            A 'L':\n"
         "                for(o = y-4; ++o < y+2;)\n"
         "                    o<0 ^ o<r && P \"%c%s\\n\", o^z ? ' ' : '>', l[o]);\n"
         "                for(o = x+1; --o;)\n"
         "                    P \" \");\n"
         "                P \"^\\n\");\n"
         "                B;\n"
         "\n"
         "            A 'l':\n"
         "                puts(*L);\n"
         "                B;\n"
         "\n"
         "            A 'p':\n"
         "                i = 1;\n"
         "                j = 0;\n"
         "                N(a+2);\n"
         "                for(o = Y(i)-1; o<Y(j); ++o)\n"
         "                    puts(l[o]);\n"
         "                B;\n"
         "\n"
         "            A 'A':\n"
         "                y = r++;\n"
         "                strcpy(l[y], a+2);\n"
         "                x = c[y] = strlen(a+2);\n"
         "                ++x;\n"
         "                ++y;\n"
         "                B;\n"
         "\n"
         "            A 'i':\n"
         "                D();\n"
         "                --y;\n"
         "                x=X(0);\n"
         "                // Commands i and r are very similar -> fall through\n"
         "                // from i to r after moving rows down and setting\n"
         "                // position at end of line:\n"
         "\n"
         "            A 'r':\n"
         "                strcpy(*L+x-1, a+2);\n"
         "                *C = strlen(*L);\n"
         "                x = 1;\n"
         "                ++y > r && ++r;\n"
         "                B;\n"
         "\n"
         "            A 'I':\n"
         "                o = strlen(a+2);\n"
         "                memmove(*L+x+o-1, *L+x-1, *C-x+1);\n"
         "                *C += o;\n"
         "                memcpy(*L+x-1, a+2, o);\n"
         "                x += o;\n"
         "                B;\n"
         "\n"
         "            A 'd':\n"
         "                **L ? **L = *C = 0, x = 1 : U();\n"
         "                y = y>r ? r : y;\n"
         "                B;\n"
         "\n"
         "            A 'j':\n"
         "                y<r && U();\n"
         "        }\n"
         "    }\n"
         "}\n";);
}

This produces

main(){printf("Hello, World!");}

#define max(x,y)x>y?x:y
#define I(x)scanf("%d",&x)
a;b;main(){I(a);I(b);printf("\" max \": %d\n",max(a,b));}

x[10];*c;i;main(){int _e;for(;scanf("%d",&x)>0&&++_e;);for(c=x+_e;c-->x;i=100/ *x,printf("%d ",i- --_e));}

#define P printf(
#define A case
#define B break
#define L(i,e,t,f,s)for(o=i;o e;){strcpy(l[o t],l[o f]);c[o t]=c[s o];}
#define R(m,o){return b<1|b>m?m o:b;}
#define N(f)sscanf(f,"%d,%d",&i,&j)||sscanf(f,",%d",&j)
r,i,j,o,z,c[999],*C,x=1,y=1;char a[999],l[999][999],(*L)[999];D(){L(r,>y,,-1,--)r++?strcpy(l[o],l[o-1]+--x),c[o-1]=x,l[o-1][x]=0:0;c[y++]=strlen(l[o]);x=1;}U(){strcat(*L,l[y]);*C=strlen(*L);L(y+1,<r,-1,,++)--r;*l[r]=c[r]=0;}X(b)R(c[y-1],+1)Y(b)R(r,)main(){for(;;){z=i=y;L=l+--z;C=c+z;j=x;!r||y/r&&x>*C?P"end> "):P"%d,%d> ",y,x);scanf("%[^\n]%*c",a)?a[2]*=!!a[1],N(a)?y=Y(i+(i<0|*a=='+')*y),x=X(j+(j<0||strchr(a+1,'+'))*x):0:(*a=D(),scanf("%*c"));switch(*a){A'e':y=r;x=c[r-1]+1;B;A'b':y=1;x=1;B;A'L':for(o=y-4;++o<y+2;)o<0^o<r&&P"%c%s\n",o^z?' ' :'>',l[o]);for(o=x+1;--o;)P" ");P"^\n");B;A'l':puts(*L);B;A'p':i=1;j=0;N(a+2);for(o=Y(i)-1;o<Y(j);++o)puts(l[o]);B;A'A':y=r++;strcpy(l[y],a+2);x=c[y]=strlen(a+2);++x;++y;B;A'i':D();--y;x=X(0);A'r':strcpy(*L+x-1,a+2);*C=strlen(*L);x=1;++y>r&&++r;B;A'I':o=strlen(a+2);memmove(*L+x+o-1,*L+x-1,*C-x+1);*C+=o;memcpy(*L+x-1,a+2,o);x+=o;B;A'd':**L?**L=*C=0,x=1:U();y=y>r?r:y;B;A'j':y<r&&U();}}}

Limitation

This breaks definitions such as

#define A (x)

by removing the space that separates the name from expansion, giving

#define A(x)

with an entirely different meaning. This case is absent from the test sets, so I'm not going to address it.

I suspect I might be able to produce a shorter version with a multi-pass in-place conversion - I might try that next week.