The logical quine

Verilog, optimized for area, 130 LE gates

The quine itself (the actual file is encoded in DEC SIXBIT):

module q(s,d);inout s,d;wire[0:1023]v=1024'b0110111100101001011001111110110110010110100100001111011010010111010101110101010110010110111100101001011001111110110100100110100100001111011010100111010101110110111000011100111100111010011001111011100000001001000111011010100111000101100111111010100111010111010100100110101101101110111010011111010110111000011011001101111000011110011100111000000010001100001011111100111001011001001001111001010000001100110010011010011001100010001010100111100101010010011000101001011001111010011011100000001010010111011010010010110100010110111010100111011010100001100100011111100000011010010111110101110110100100010110111001011011101001000000001001011011001100111001010000001010100111011010100010110100010110111001011011101001001011011011111001001101011011001001010000000000000000101101101111100100110101101100100101000000110001001000110011001100100100001001011011101001101110101111110101110100000000110011001100100100011011110111101001110010100101111011010000011010010001010000010010010011111101110110011101010001010000010010010100000111100010;reg[9:0]i=759;reg[2:0]j=7;assign d=j<6?j==2:v[i];always@(posedge s)if(j>5)begin i=i+1;j=j&1^!i?7:1;end else j=j+1;endmodule

Readable version with comments and a testbench:

module q(s,d);
   // Declare the ports. Making them both "inout" is shortest.
   inout s,d;
   // Data storage for the program.
   wire[0:1023]v=1024'b{DATA GOES HERE};
   // i is the current bit number within the program.
   // This is relative to the /end of the data storage/ (not to the start
   // of the program), so it starts at a nonzero value so that the output
   // starts at the start of the program.
   reg[9:0]i=759;
   // When expanding bits to (6-bit) bytes, j is the bit number within
   // the expansion, from 1 for the first bit up to 6 for the last.
   // When not expanding, j is always 7.
   // DEC SIXBIT encoding for 0 is (from MSB to LSB) 010 000.
   // DEC SIXBIT encoding for 1 is (from MSB to LSB) 010 001.
   // We use SSI encoding for the output, so the MSB is sent first.
   reg[2:0]j=7;
   assign d=j<6?j==2:v[i];
   // When we get a strobe:
   always@(posedge s)
     // If we just output a bit, move onto the next bit.
     // We may also need to reset j.
     if(j>5)
       begin 
          i=i+1;
          j=j&1^!i?7:1;
       end 
     else 
       // If we're inside a bit, continue to output that bit.
       j=j+1;
endmodule
// {TESTBENCH BELOW HERE}

`timescale 10ns / 1ns
module testbench();
   reg clock = 0;
   wire data, strobe;

   always
     #1 clock <= !clock;
   initial
     #14304 $finish;

   assign strobe = clock;
   q testquine(.s(strobe),.d(data));

   always @(negedge strobe)
      $display("%d", data);

endmodule // testbench

The use of Verilog gives me considerably more control over the low-level details than I'd have with Verity. In particular, it lets me control the clock and reset rules myself. This program's intended for a synchronous serial connection with strobe input s and data output d. Although each is only used in one direction, I declared them both as bidirectional to save a few bytes; I had to golf the non-data parts of the program down to 1024 bits to be able to use 10-bit logic gates internally (extra bits would be more expensive in area), and it only just scrapes under at 1008, so savings like this are important. In order to save a substantial amount of code, I rely on the FPGA's hardware reset circuitry rather than adding my own, and I merge the strobe and clock inputs (which is an old trick that's kind-of frowned upon nowadays because it makes it hard to keep the clock tree balanced at high clock speeds, but it's useful for golfing.) I hope that's synthesizable; I don't know how well Verilog synthesizers cope with using a bidirectional port as a clock.

The source is encoded in DEC SIXBIT (I'm assuming here that we interpret its single alphabet of letters as lowercase; a Verilog synthesizer would have no reason to use an uppercase interpretation). I used a six-bit character set internally in my other solution, then wasted bytes converting it; it's better to use a character set that's "naturally" six bits wide so that no conversion is necessary. I picked this particular six-bit character set because 0 and 1 differ only in their least significant bit, and only have one other bit set, meaning that the circuitry for converting a binary digit to DEC SIXBIT (i.e. "escaping" a string) can be very simple. Interestingly, the character set in question is missing a newline character; the original program's all on one line not just to make it easier to generate, but to make it possible to encode! It's a good thing that Verilog mostly doesn't care about whitespace.
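To make the "escaping is cheap" point concrete, here's a small Python model (not part of the submission). It assumes the usual definition of DEC SIXBIT as "ASCII code minus 32", with lowercase letters folded onto the single alphabet; the names `sixbit` and `escape_bit` are made up for this sketch.

```python
def sixbit(ch: str) -> int:
    """Return the DEC SIXBIT code of a character (ASCII 32..95 map to 0..63)."""
    if "a" <= ch <= "z":
        ch = ch.upper()           # the single alphabet is read as lowercase here
    code = ord(ch) - 32
    if not 0 <= code < 64:
        raise ValueError(f"{ch!r} has no SIXBIT encoding")
    return code

def escape_bit(bit: int) -> list[int]:
    """Expand one stored bit into the six bits of the digit '0' or '1', MSB first."""
    # Only the last position depends on the data; the rest is the constant 01000.
    return [0, 1, 0, 0, 0, bit]

for digit in "01":
    print(digit, format(sixbit(digit), "06b"))
```

This prints `0 010000` and `1 010001`, and `sixbit("\n")` raises an error, which is the "no newline" problem mentioned above.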

The protocol for sending data to the host is based on Synchronous Serial Interface. I picked it because it's clocked (allowing me to use the clock/strobe trick, and also allowing me to write a portable program that doesn't rely on on-chip timing devices), and because it's very simple (thus I don't have to waste much code implementing it). This protocol doesn't define a way to mark where the message ends (the host is supposed to know); in this particular case, I padded the output up to a multiple of 1024 bits with zero bits (a total of 16 padding bits), after which (as required by SSI) the message restarts. (I don't implement an idle mode timer; its purpose is to determine whether to send a new message or whether to repeat the previous message, and as this program always sends its own source code as the message, the distinction isn't visible. You can consider the timer to be of length 0, or infinitely long, depending on your point of view.)
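The padding figure can be checked with a few lines of Python. This is just arithmetic on numbers quoted in this answer; the variable names are my own.

```python
SOURCE_CHARS = 1192      # length of the one-line source in six-bit characters
BITS_PER_CHAR = 6
BLOCK = 1024             # size of the data storage, in bits

message_bits = SOURCE_CHARS * BITS_PER_CHAR
padding_bits = -message_bits % BLOCK        # zero bits needed to reach a multiple of BLOCK
total_bits = message_bits + padding_bits

print(message_bits, padding_bits, total_bits // BLOCK)
```

This prints `7152 16 7`: the message plus padding is exactly seven times the size of the storage. It's also consistent with the testbench, which stops after 14304 time units with the clock toggling once per unit, i.e. after 7152 clock periods.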

In terms of the actual logic, the most interesting thing is the way that I split up the variables to reduce the amount of area needed on the chip. i, the larger register, holds the current "address" within the program's data, and is only ever changed via incrementing it; this means that its logic can be synthesized using the half-adder construction (which, as the name suggests, uses only half the resources that an adder does; this mostly only matters on the smallest FPGAs, as larger ones will use 3-input or even 4-input LUTs, which are powerful enough that they'll have lots of wasted capacity when synthesizing a half-adder). The smaller register, j, is basically a state machine state and thus handles most of the program's complex logic. It's small enough that it can be handled entirely via lookup table on a larger FPGA (making the logic basically disappear); in case the program is synthesized for a small FPGA, I chose its encoding in such a way that few parts of the code care about all three of its bits at once.
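For anyone who hasn't seen the half-adder construction, here's a Python sketch of it (a software model only; what a synthesizer actually produces will differ, and all the function names are invented for the example). An incrementer needs just one XOR and one AND per bit, because the second operand is the constant 1 and only the carry has to ripple.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry) for two one-bit inputs."""
    return a ^ b, a & b

def increment(bits: list[int]) -> list[int]:
    """Add 1 to a little-endian list of bits, using one half adder per bit."""
    carry = 1                      # the constant +1 enters as the first carry
    out = []
    for bit in bits:
        s, carry = half_adder(bit, carry)
        out.append(s)
    return out                     # the final carry is dropped, so the value wraps

def to_bits(value: int, width: int = 10) -> list[int]:
    """Little-endian bit list of a value; 10 bits matches the width of i."""
    return [(value >> k) & 1 for k in range(width)]

def from_bits(bits: list[int]) -> int:
    return sum(bit << k for k, bit in enumerate(bits))

print(from_bits(increment(to_bits(759))), from_bits(increment(to_bits(1023))))
```

This prints `760 0`; the wraparound from 1023 to 0 is what the `!i` test in the Verilog detects.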

It's also worth noting that I cyclically permuted the data storage. We can start i pointing anywhere inside it, not necessarily at the start. With the arrangement seen here, we can print from the initial value of i to the end directly, then print the entire array escaped, then print from the start to the initial value of i, in order to print all the parts of the data in the right places without needing to save and restore the value of i. (This trick might be useful for quines in other languages too.)
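Here's a Python model of the (i, j) state machine on a toy 8-bit storage, to show the three phases falling out of a single counter. It's a sketch under a few assumptions: the host samples d after each rising edge (as the testbench does), so the first bit seen is the one just after the initial index; the storage contents and start index are arbitrary values picked for the demo; and `run` is a name I made up.

```python
ESCAPE_PREFIX = [0, 1, 0, 0, 0]   # leading five bits of SIXBIT '0' and '1'

def run(v, start, cycles):
    """Model the (i, j) state machine; return the bit sampled after each rising edge."""
    n = len(v)
    i, j = start, 7                # j == 7 means "raw output", 1..6 means "escaping"
    out = []
    for _ in range(cycles):
        if j > 5:                  # a whole data bit has just been sent
            i = (i + 1) % n
            # Toggle between raw and escaped mode each time i wraps to 0.
            j = 7 if (j & 1) ^ (i == 0) else 1
        else:
            j += 1
        out.append(int(j == 2) if j < 6 else v[i])
    return out

v = [1, 0, 1, 1, 0, 0, 1, 0]       # toy storage (the real one has 1024 bits)
start = 4
out = run(v, start, 7 * len(v))
expected = (v[start + 1:]                                        # raw tail
            + [b for bit in v for b in ESCAPE_PREFIX + [bit]]    # whole array, escaped
            + v[:start + 1])                                     # raw head
print(out == expected)
```

This prints `True`. After 7 × 8 cycles the model is back in its initial state, so the message simply repeats; the same factor of 7 shows up in the real program as 7 × 1024 output bits.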

The source is 1192 6-bit bytes long, the equivalent of 894 8-bit bytes. It's kind-of embarrassing that this contains fewer source bytes than my Verity submission, despite being optimized for something entirely different; this is mostly because Verilog has strings and Verity doesn't, meaning that even though I've encoded the program in binary rather than octal (which is substantially less efficient in terms of source code size), I can encode each byte of the program using six six-bit characters (one for each bit) rather than eight eight-bit characters (four for each octal digit). A Verilog submission that encoded the program in octal would probably be smaller in terms of source code size, but would almost certainly be larger in area.
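The size comparison is easy to reproduce; this is only arithmetic on the figures above, with variable names of my own choosing.

```python
SOURCE_CHARS = 1192                    # Verilog source length in 6-bit characters

size_in_octets = SOURCE_CHARS * 6 // 8

# Cost of storing one character of the program inside the data section:
verilog_cost_bits = 6 * 6              # six binary digits, each a 6-bit character
verity_cost_bits = 2 * 4 * 8           # two octal digits, each written as 4 ASCII characters ("-a N")

print(size_in_octets, verilog_cost_bits, verity_cost_bits)
```

This prints `894 36 64`, i.e. the binary encoding here costs 36 source bits per encoded character against 64 for the octal encoding in the Verity program.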

I don't know how much area this program will end up using; it depends a lot on how powerful the optimizer is in your Verilog synthesizer (because the minimization problem of converting the stored data into a set of logic gates is something that's done in the synthesizer itself; throwing the work onto the synthesizer makes the source code itself much shorter, and thus reduces the area needed to store it). It should have a complexity of O(n log n), though, and thus be much smaller than the O(n²) of the other program. I'd be interested to see the OP try to run it on their FPGA. (It may take quite some time to synthesize, though; there are various steps you can take to optimize a program for compile time but I didn't take any here as it'd cause a larger program = larger area.)


Verity 0.10, optimized for source code size (1944 bytes)

I originally misread the question and interpreted it as a code-golf. That was probably for the best, as it's much easier to write a quine with short source code than short object code under the restrictions in the question; that made the question easy enough that I felt I could reasonably produce an answer, and might work as a stepping stone on the way to a better answer. It also prompted me to use a higher-level language for the input, meaning that I'd need to express less in the program itself. I didn't create Verity as a golfing language for hardware (I was actually hired to create it a while ago in an entirely different context), but there's quite a resemblance there (e.g. it's substantially higher level than a typical HDL is, and it has much less boilerplate; it's also much more portable than the typical HDL).

I'm pretty sure that the correct solution for short object code involves storing the data in some kind of tree structure, given that the question disallows the use of block ROM, which is where you'd normally store it in a practical program; I might have a go at writing a program that uses this principle (not sure what language, maybe Verity, maybe Verilog; VHDL has too much boilerplate to likely be optimal for this sort of problem) at some point. That would mean that you wouldn't need to pass every bit of the source code to every bit of your "manually created ROM". However, the Verity compiler currently synthesizes the structure of the output based on the precedence and associativity of the input, meaning that it's effectively representing the instruction pointer (thus the index to the lookup table) in unary, and a unary index multiplied by the length of the lookup table gives this O(n²) space performance.

The program itself:

import <print>new x:=0$1296in(\p.\z.\a.new y:=(-a 5-a 1-a 1-a 2-a 4-a 2-a 3-a 2-a 6-a 2-a 0-a 3-a 0-a 4-a 4-a 7-a 4-a 2-a 6-a 2-a 5-a 1-a 2-a 2-a 0-a 3-a 6-a 7-a 2-a 2-a 1-a 1-a 3-a 3-a 0-a 4-a 4-a 3-a 2-a 7-a 5-a 7-a 0-a 6-a 4-a 4-a 1-a 6-a 2-a 6-a 1-a 7-a 6-a 6-a 5-a 1-a 2-a 2-a 0-a 5-a 0-a 0-a 4-a 2-a 6-a 5-a 0-a 0-a 6-a 3-a 6-a 5-a 0-a 0-a 5-a 0-a 6-a 5-a 2-a 2-a 1-a 1-a 3-a 3-a 0-a 4-a 5-a 3-a 2-a 7-a 5-a 7-a 0-a 5-a 5-a 5-a 1-a 4-a 4-a 3-a 1-a 5-a 5-a 1-a 2-a 2-a 0-a 4-a 3-a 3-a 4-a 1-a 5-a 1-a 0-a 2-a 1-a 1-a 1-a 4-a 4-a 3-a 6-a 7-a 0-a 6-a 0-a 1-a 3-a 2-a 0-a 5-a 4-a 2-a 0-a 5-a 5-a 1-a 2-a 1-a 0-a 4-a 6-a 3-a 4-a 7-a 3-a 6-a 2-a 6-a 0-a 3-a 4-a 1-a 1-a 1-a 2-a 2-a 0-a 4-a 6-a 3-a 3-a 5-a 1-a 7-a 2-a 6-a 1-a 1-a 0-a 2-a 7-a 2-a 1-a 1-a 0-a 4-a 6-a 3-a 1-a 5-a 3-a 7-a 5-a 1-a 2-a 1-a 0-a 4-a 6-a 3-a 5-a 7-a 5-a 7-a 4-a 6-a 5-a 6-a 0-a 3-a 4-a 1-a 1-a 1-a 2-a 2-a 0-a 4-a 3-a 3-a 4-a 1-a 5-a 1-a 0-a 2-a 1-a 1-a 1-a 4-a 5-a 3-a 6-a 7-a 0-a 6-a 0-a 1-a 3-a 2-a 0-a 5-a 4-a 2-a 0-a 4-a 1-a 7-a 7-a 6-a 3-a 7-a 4-a 2-a 0-a 4-a 3-a 6-a 2-a 6-a 3-a 7-a 4-a 2-a 0-a 5-a 4-a 6-a 0-a 7-a 2-a 0-a 1-a 4-a 5-a 3-a 4-a 4-a 4-a 4-a 3-a 6-a 4-a 4-a 4-a 4-a 3-a 6-a 2-a 6-a 1-a 5-a 3-a 7-a 4-a 2-a 0-a 4-a 4-a 6-a 5-a 6-a 3-a 7-a 5-a 3-a 2-a 7-a 5-a 7-a 1-a 4-a 5-a 3-a 6-a 7-a 6-a 7-a 3-a 6-a 1-a 5-a 1-a 1-a 0-a 2-a 7-a 2-a 1-a 1-a 0-a 4-a 7-a 2-a 7-a 1-a 5-a 1-a 4-a 2-a 3-a 7-a 4-a 3-a 2-a 7-a 5-a 7-a 1-a 4-a 4-a 3-a 6-a 7-a 6-a 7-a 6-a 6-a 1-a 5-a 1-a 5-a 4-a 2-a 6-a 2-a 5-a 1-a 2-a 2-a 0-a 3-a 0-a 5-a 1-a 4-a 4-a 3-a 4-a 4-a 4-a 4-a 6-a 6-a 4-a 4-a 4-a 4-a 3-a 6-a 2-a 6-a 1-a 5-a 0-a 5-a 0-a 0-a 0-a 1-a 6-a 5-a 4-a 3-a 2-a 7-a 5-a 7-a 1-a 4-a 4-a 3-a 6-a 7-a 6-a 7-a 3-a 6-a 2-a 0-a 0-a 1-a 4-a 7-a 4-a 7-a 1-a 6-a 2-a 6-a 1-a 7-a 3-a 6-a 3-a 7-a 0-a 6-a 1-a 5-!x)in while!x>0do(p(if z<32then z+92else z);if z==45then while!y>0do(p 97;p 32;p(48^!y$$3$$32);p 45;y:=!y>>3)else skip;x:=!x>>6))print(!x$$6$$32)(\d.x:=!x>>3^d<<1293;0)

More readable:

import <print>
new x := 0$1296 in
(\p.\z.\a.
  new y := (-a 5-a 1-
            # a ton of calls to a() omitted...
            -a 1-a 5-!x) in
  while !x>0 do (
    p(if z<32 then z+92 else z);
    if z==45
    then while !y>0 do (
      p 97;
      p 32;
      p(48^!y$$3$$32);
      p 45;
      y:=!y>>3 )
    else skip;
    x:=!x>>6
  )
)(print)(!x$$6$$32)(\d.x:=!x>>3^d<<1293;0)

The basic idea is that we store the entire data in the variable x. (As usual for a quine, we have a code section and a data section; the data encodes the text of the code, and can also be used to regenerate the text of the data.) Unfortunately, Verity doesn't currently allow very large constants to be written in the source code (it uses OCaml integers during compilation to represent integers in the source, which clearly isn't correct in a language that supports arbitrarily wide integer types) – and besides, it doesn't allow constants to be specified in octal – so we generate the value of x at runtime via repeated calls to a function named a. We could make a a void function and call it repeatedly as separate statements, but that would make it hard to identify where to start outputting the text of the data section. So instead, I made the function a return an integer, and use arithmetic to store the data (Verity guarantees that arithmetic evaluates left to right). The data section is encoded in x using a single - sign; when this is encountered at run time, it's expanded to the full -a 5-a 1-, etc., via the use of y.
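Here's a Python model of how the calls to a fill x, and how the digits come back out. It assumes that `!x>>3^d<<1293` groups as `(x >> 3) ^ (d << 1293)`, which is what the program's behaviour implies; `load` and `unload` are hypothetical names, and the demo uses a narrow register rather than the real 1296-bit one.

```python
def load(digits: list[int], width: int) -> int:
    """Model of repeatedly calling a: shift x right by 3, put the new digit in the top 3 bits."""
    x = 0
    for d in digits:
        x = (x >> 3) ^ (d << (width - 3))
    return x

def unload(x: int) -> list[int]:
    """Read digits back the way the y loop does: lowest 3 bits first, until the value is 0."""
    digits = []
    while x > 0:                    # so the final digit stored must be nonzero
        digits.append(x & 7)
        x >>= 3
    return digits

digits = [5, 1, 1, 2, 4, 2]         # the first six digits of the real program
x = load(digits, width=3 * len(digits))
print(oct(x), unload(x))
```

This prints `0o242115 [5, 1, 1, 2, 4, 2]`: once the register has been filled completely, the first digit stored sits in the lowest bits, so reading from the bottom reproduces the digits in source order.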

Initializing y as a copy of x is fairly subtle here. Because the function a always returns zero, most of the sum is just zero minus zero minus… and cancels itself out. We end with !x (i.e. "the value of x"; in Verity, as in OCaml, a variable's name works more like a pointer than anything else, and you have to dereference it explicitly to get at the variable's value). Verity's rules for unary minus are a little complex – the unary minus of v is written as (-v) – thus (-0-0-0-!x) parses as (-(0-0-0-!x)), which is equal to !x, and we end up initializing y as a copy of x. (It's also worth noting that Verity is not call-by-value, but rather allows functions and operators to choose the order they evaluate things; - will evaluate the left argument before the right argument, and in particular, if the left argument has side effects, those will be visible when the right argument is evaluated.)
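The same trick can be imitated in Python, which also evaluates the operands of - from left to right. This is only an analogy (Python integers are unbounded and its evaluation model isn't Verity's); `state`, `a` and `x` are stand-ins written for this sketch, using a 12-bit register and four digits.

```python
WIDTH = 12                          # toy register width: four octal digits
state = {"x": 0}

def a(d: int) -> int:
    """Shift a digit into x as a side effect, and return 0 so the sum is unaffected."""
    state["x"] = (state["x"] >> 3) ^ (d << (WIDTH - 3))
    return 0

def x() -> int:
    """Stand-in for !x: read the current value of x."""
    return state["x"]

# Mirrors (-a 5-a 1-a 1-a 2-!x), which is read as -(a 5 - a 1 - a 1 - a 2 - !x).
# Every call to a has finished by the time x() is evaluated.
y = -(a(5) - a(1) - a(1) - a(2) - x())
print(oct(y), y == state["x"])
```

This prints `0o2115 True`: the zeros cancel, the two negations cancel, and y ends up as a copy of the fully loaded x.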

Each character of the source code is represented using two octal digits. This means that the source code is limited to 64 different characters, so I had to create my own codepage for internal use. The output is in ASCII, so I needed to convert internally; this is what the (if z<32 then z+92 else z) is for. Here's the character set I used in the internal representation, in numerical order (i.e. \ has codepoint 0, ? has codepoint 63):

\]^_`abcdefghijklmnopqrstuvwxyz{ !"#$%&'()*+,-./0123456789:;<=>?
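The mapping is simple enough to check in Python. In this sketch `to_ascii` mirrors the `if z<32 then z+92 else z` expression, and `decode` (a helper invented for the example) reads pairs of octal digits with the low digit first, which is the order they appear in the program.

```python
def to_ascii(z: int) -> str:
    """Convert an internal 6-bit codepoint to the ASCII character it stands for."""
    return chr(z + 92 if z < 32 else z)

def decode(digits: list[int]) -> str:
    """Decode octal digits, two per character, low digit first."""
    pairs = zip(digits[0::2], digits[1::2])
    return "".join(to_ascii(lo | (hi << 3)) for lo, hi in pairs)

charset = "".join(to_ascii(z) for z in range(64))
print(charset)
# The first sixteen digits passed to a in the program:
print(decode([5, 1, 1, 2, 4, 2, 3, 2, 6, 2, 0, 3, 0, 4, 4, 7]))
```

The first line printed is the character set shown above, and the second is `import <`, the start of the program.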

This character set gives us most of the characters important for Verity. Notable characters missing are } (meaning that we can't create a block using {}, but luckily all statements are expressions so we can use () instead); and | (this is why I had to use an exclusive rather than inclusive OR when creating the value of x, meaning I need to initialize it to 0; however, I needed to specify how large it was anyway). Some of the critical characters that I wanted to ensure were in the character set were <> (for imports, also shifts), () (very hard to write a program that can be parsed without these), $ (for everything to do with bitwidth), and \ (for lambdas; theoretically we could work around this with let…in but it would be much more verbose).

In order to make the program a bit shorter, I constructed abbreviations for print and for !x$$6$$32 (i.e. "the bottom 6 bits of !x, cast to be usable by the print library") via temporarily binding them to lambda arguments.

Finally, there's the issue of output. Verity provides a print library that's intended for debug output. On a simulator, it prints the ASCII codes to standard output, which is perfectly usable for testing the program. On a physical circuit board, it depends on a print library having been written for the particular chip and board surrounding it; there's a print library in the Verity distribution for an evaluation board I had access to that prints the output on seven-segment displays. Given that the library will end up taking space on the resulting circuit board, it may be worth using a different language for an optimized solution to this problem so that we can output the bits of the output directly on wires.

By the way, this program is O(n²) on hardware, meaning that it's much worse on a simulator (I suspect O(n⁴), though I'm not sure; it was hard enough to simulate that it seems unlikely to be even cubic, and based on how the time reacted to my changes as I was writing the program, the function seems to grow very quickly indeed). The Verity compiler needed 436 optimization passes (which is much, much more than it'd typically use) to optimize the program, and even after that, simulating it was very hard for my laptop. The complete compile-and-simulate run took the following time:

real  112m6.096s
user  105m25.136s
sys   0m14.080s

and peaked at 2740232 kibibytes of memory. The program takes a total of 213646 clock cycles to run. It does work, though!

Anyway, this answer doesn't really fulfil the question as I was optimizing for the wrong thing, but seeing as there are no other answers yet, this is the best by default (and it's nice to see what a golfed quine would look like in a hardware language). I'm not currently sure whether or not I'll work on a program that aims to produce more optimized output on the chip. (It would likely be a lot larger in terms of source, as an O(n) data encoding would be rather more complex than the one seen here.)