How to write two bytes to a chunk of RAM repeatedly in Z80 asm

First of all, if you're going to move SP, you need to save and restore it. Second, you need to disable interrupts or else you'll have a race condition bug: if an interrupt triggers near the end of the copy, the stack will grow down into whatever is below it, which happens to be the VAT.

; Index registers are actually fast on the eZ80
    ld   ix, 0
    add  ix, sp
    di
; Do some hack using SP here
    ld   sp, ix
    ei

@Ped7g The eZ80 will cache any -IR/-DR suffix instruction; unlike the Z80, it doesn't reread the opcode from memory on each iteration. Consequently, instructions like LDIR can execute each iteration in just 2 bus cycles, one read and one write. The SP hack is therefore not only needlessly complicated, but actually slower. The SP hack is still best left to more experienced programmers.

The eZ80 is very well pipelined and its performance is limited by its lack of any cache and 1-byte-wide bus. The only instruction that runs slower than the bus is MLT, a 2-bus-cycle instruction that needs 5 clock cycles. For every other instruction, just count the number of bytes in the opcode, and the number of read and write cycles, and you've got its execution time. It's a huge pity that in the TI-84+CE series, TI decided to pair the fast eZ80 with an SRAM that somehow needs four clock cycles for each read and write (at 48 MHz)! Yes, TI, a world leader in semiconductor design, managed to design a slow SRAM. Getting on-die SRAM to perform poorly is an engineering feat.
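
As a rough illustration of that counting rule, here are a few ADL-mode instructions annotated with the bus cycles it predicts. This is only a sketch: the counts apply the rule of thumb above (opcode bytes plus data reads and writes), assume no wait states, and have not been checked against Zilog's timing tables.

    ld    hl, $D40000   ; 4 opcode bytes (opcode + 24-bit immediate) -> 4 bus cycles
    ld    (hl), 31      ; 2 opcode bytes + 1 write                   -> 3 bus cycles
    inc   hl            ; 1 opcode byte                              -> 1 bus cycle
    push  hl            ; 1 opcode byte + 3 writes (24-bit push)     -> 4 bus cycles
    pop   hl            ; 1 opcode byte + 3 reads                    -> 4 bus cycles

On the TI-84+CE the wait states mentioned above come on top of these figures.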

@harold has the right answer, though I prefer optimizing for size instead of speed outside of inner loops.

#include "includes\ti84pce.inc"

    .assume ADL=1
    .org userMem-2
    .db tExtTok,tAsm84CeCmp

    call  _homeup
    call  _ClrScrnFull
; Initialize registers
    ld    hl, vRam
    ld    bc, lcdWidth * lcdHeight * 2 - 2
    push  hl
    pop   de
; Write initial 2-byte value
    ld    (hl), 31
    inc   hl
    ld    (hl), 0
    inc   hl
    ex    de, hl
; Copy everything all at once.  Interrupts may trigger while this instruction is processing.
    ldir
    call  _GetKey
    call  _ClrScrnFull
    ret

On EFnet, #ez80-dev is a good place to ask questions. cemetech.net is also a good place.


This does not work:

    dec   bc
    jr    z,j2

dec bc leaves the flags untouched; only the 8-bit dec and inc modify them, so the jr z ends up testing stale flags. It could be fixed by properly detecting whether bc is zero.
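
One common way to do that detection is the classic Z80 idiom of ORing the two halves of BC into A. A minimal sketch, assuming A is free to clobber and reusing the j2 label from the snippet above. Note that in ADL mode BC is 24 bits wide and this test only looks at the low 16 bits, so it assumes the count fits in 16 bits:

    dec   bc            ; 16-bit dec: flags are left unchanged
    ld    a, b
    or    c             ; Z is set only if B and C are both zero
    jr    z, j2         ; the branch now tests BC rather than stale flags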

Here is a different technique without manual looping:

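    ld    hl,$D40000       ; start of VRAM
    ld    (hl),31          ; seed the first two bytes with the pattern 31, 0
    inc   hl
    ld    (hl),0
    dec   hl               ; HL = source = start of VRAM
    ld    de,$D40002       ; DE = destination, 2 bytes past the source
    ld    bc,$25800 - 2    ; VRAM size minus the 2 seeded bytes
    ldir                   ; overlapping copy repeats the pair across VRAM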
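
The trick works because LDIR copies one byte at a time in ascending address order, so by the time it reads a source byte, that byte has already been written by an earlier step. A small sketch of the same idea on a hypothetical 8-byte buffer (buf is a made-up label, and the .db syntax assumes a SPASM-style assembler):

    ld    hl, buf          ; HL = source, the start of the buffer
    ld    (hl), 31         ; seed buf[0] = 31
    inc   hl
    ld    (hl), 0          ; seed buf[1] = 0
    dec   hl
    ld    de, buf + 2      ; DE = destination, 2 bytes past the source
    ld    bc, 8 - 2        ; 6 bytes remain after the seed
    ldir                   ; buf[2] <- buf[0], buf[3] <- buf[1], buf[4] <- buf[2], ...
    ret                    ; buf now holds 31, 0, 31, 0, 31, 0, 31, 0
buf:
    .db   0, 0, 0, 0, 0, 0, 0, 0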

A variation of tum_'s answer that loops with DJNZ and an 8-bit outer counter, which is faster than a regular dec bc zero test.

    LD   SP,$D65800    ; end of VRAM: $D40000 + $25800
    LD   BC,$004B      ; C = $4B outer iterations, B = 0 (256 inner iterations)
        ; that gives $4B00 passes through the loop; at 8 bytes per pass
        ; the total is $25800 bytes (the VRAM size)
        ; (unrolling to more than 8 bytes per pass makes the initial BC
        ; slightly trickier to calculate, but only slightly)
    LD   HL,31         ; H <- 0, L <- 31
.L1
    PUSH HL            ; (SP - 2) <- L, (SP - 1) <- H, SP <- SP - 2
    PUSH HL            ; four pushes set 8 bytes in each pass
    PUSH HL
    PUSH HL
    DJNZ .L1           ; inner loop on B (it starts at 0, so 256 passes)
    DEC  C             ; outer loop on C
    JR   NZ,.L1        ; btw JP is faster than JR on original Z80, but not on eZ80
.END

(BTW, I have never done any eZ80 programming, and I didn't verify this in a debugger, so it is full of assumptions. Actually, thinking about it, isn't push on the eZ80 32-bit? Then the init of hl should be ld hl,$001F001F to set four bytes with a single push, and the inner body of the loop should have only two push hl.)

(But I have done a ton of Z80 programming, which is why I bother commenting on this topic at all, even though I had never seen eZ80 code before.)

Edit: it turns out the eZ80 push is 24-bit, so the code above will produce an incorrect result. It can of course be easily fixed (the issue is an implementation detail, not one of principle), like this:

    LD   SP,$D65800    ; <end of VRAM>: 0xD40000+0x25800
    LD   BC,$0014      ; C = $14 outer iterations, B = 0 (256 inner iterations)
        ; that gives $1400 passes through the loop; at 30 bytes per pass
        ; the total is $25800 bytes (the VRAM size)
    LD   HL,$1F001F    ; will set bytes 31,  0, 31
    LD   DE,$001F00    ; will set bytes  0, 31,  0
.L1
    PUSH DE
    PUSH HL
        ; here SP = SP-6, and 6 bytes 31, 0, 31, 0, 31, 0 were set
    PUSH DE
    PUSH HL
    PUSH DE
    PUSH HL
    PUSH DE
    PUSH HL
    PUSH DE
    PUSH HL            ; unrolled 5 times to set 30 bytes in total
    DJNZ .L1           ; loop by B value (in this example it starts as 0 => 256x loop)
    DEC  C             ; loop by C ("outer" counter)
    JR   NZ,.L1