Steve's Tech Talk : Building a VM

Building a VM

In 1995 or so, I wrote an emulator for the 6800 processor. This was not an unfamiliar task in that I'd written an emulator for the 6502 in college as well as a few fictitious processors. The 6800 took me a weekend (which says a lot about my social life at the time), but I got to try a few techniques to push performance.

Side tangent about performance - I wrote this in straight C on a Macintosh LCIII (a 68030 machine running at 25MHz). It ran at about 75% real time, which meant that if I really hand-tuned everything and did a lot of the emulation in assembly, I was pretty sure I could get it to run faster than real time. Then I compiled it and ran it on an early Power PC machine. It ran 4x real time in the debug version. There was no point in spending my time optimizing it-CPU performance had made up for code limitations.

The architecture was simple: a few globals representing the CPU (they really should've been in a struct), a bunch of tables, and a few functions to control the cpu. I want to start with the definition of an instruction:

typedef struct {
    void (*func)(void);
    unsigned char instructionType, mode;
} t_instr;

This is 6 bytes: a function pointer, the type of the instruction (ie, load, store, branch, etc) and the addressing mode of the instruction. The instruction type and mode are used for simplifying a disassembler. The function pointer is a function that performs that instruction.

Here is the definition of the CPU state:
unsigned short PC;
unsigned short SP;
unsigned short X;
unsigned char A;
unsigned char B;
unsigned char CCR;
unsigned short cycles;
unsigned char pendingInterrupt;

All the capital variables are registers. Cycles is the number of cycles taken up by the last instruction. In previous versions, I had each instruction return the number of cycles it consumed, but that was a performance bottleneck (!!). If I were reimplementing this today, I'd change that back. PendingInterrupt indicates that an interrupt has happened.

Given that, this is how the CPU is initialized:
void Init6800( void )
{
    PC = fetch(0xFFFE) << 8;
    PC |= fetch(0xFFFF);
    CCR = X_BIT | Y_BIT;
    X = 0;
    A = B = 0;
    pendingInterrupt = 0;
}

In the 6800, the starting address at cold reset is taken from the last two bytes in the address space. The condition code register (CCR) is starts with two bits (wihch are actually undefined!) set.

Now, here's how to execute a single instruction:
void Step( void )
{
    char buf[MAX_DISASSEMBLY_LENGTH];
    unsigned char theInstruction;

    CheckForInterrupt();
    DumpLine6800(PC, buf, fetch);
    cprintf("%s", buf);
    theInstruction = fetch(PC++);
    (*(i6808[theInstruction].func))();
    SpinPIA(cycles);
    cprintf(" SP: %04X X: %04X A: %02X B: %02X CCR: %02X\n",
        SP, X, A, B, CCR);
}

The first thing we do is check to see if there was an interrupt. This will modify the PC register if there was an interrupt. Then we dump the next instruction.   DumpLine6800() creates a single line string dump of the code at the given address and uses a supplied function to read from memory. Cprintf() is a console based flavor of printf I knocked together for debugging and output information.

After handling output, I fetch the next instruction then execute it. i6808[] is an array of t_instr structs that define every instruction for every possible opcode. This just pulls out the function for that instruction and executes it.

SpinPIA() gives the PIA time to figure out what to do. In this machine, there was a PIA chip that handled I/O. It was basically a latch that controlled external hardware.

Finally, I wrote out the contents of the registers.

Using step, we can write a routine to trace execution:
void Trace( Boolean (*checkForHalt)( void ) )
{
    long finalTicks;

    do {
        finalTicks = TickCount() + UI_LATENCY;
        while(finalTicks > TickCount())
            Step();
    } while (!(*checkForHalt)());
}

CheckForHalt is a function pointer that does UI maintenance and looks to see if the users has requested that tracing stop. The purpose of finalTicks is that we don't want to run the UI every single instruction. Instead, we check every so often.   In a modern system, we'd probably put the CPU in its own thread instead of polling the UI.

Now, in normal execution, we don't want to call Step(). Since it hits the console, it's way too slow. Instead, we have a special purpose Run() that does the work for normal execution:

void Run( register Boolean (*checkForHalt)( void ) )
{
    register long finalTicks;
    register unsigned char theInstruction;

    do {
        finalTicks = TickCount() + LONG_UI_LATENCY;
        while(finalTicks > TickCount()) {
            CheckForInterrupt();
            theInstruction = fetch(PC++);
            (*(i6808[theInstruction].func))();
            SpinPIA(cycles);
        }
    } while (!(*checkForHalt)());
}

This is nearly the same as trace, but there is no call to Step() (or a non-output equivalent), instead the code is inline for performance. Also the UI is given a much difference latency and there are some compiler hints for what things should go into registers if at all possible.

Emulating memory is harder than you might think. See, in older processors there is the possibility that writing to a memory location or reading from a memory location will activate hardware or even change the memory layout. It's possible to hardcode the memory map in, but we'd like the code to be flexible, fast, and easy to maintain. Checking rules for special memory cases is going to be painfully slow, especially since the special cases will fail most of the time. The way to handle this is to break memory operations down into their two basic operations: fetch and store, then figure out how to abstract those operations a little bit. Here are some definitions that do this:

typedef void (*storefunc)(unsigned short addr, unsigned char val);
typedef unsigned char (*fetchfunc)(unsigned short addr);

typedef struct {
    unsigned short start, end;
    storefunc f;
} t_storevec;

typedef struct {
    unsigned short start, end;
    fetchfunc f;
} t_fetchvec;

extern void store(unsigned short addr, unsigned char val);
extern unsigned char fetch(unsigned short addr);
extern void DefaultWriter(unsigned short addr, unsigned char val);
extern unsigned char DefaultReader(unsigned short addr);

extern void InitMemory( t_fetchvec *fv, t_storevec *sv );

A storefunc writes a byte into memory at the given address.
A fetchfunc reads a byte from memory at the given address.
A storevec defines the storefunc to use for a given memory range.
A fetchvec defines the fetchfunc to use for a given memory range.
DefaultWriter() and DefaultReader() are basic implementations that will be used if nothing else is specified.

InitMemory() allocates some memory and then creates two arrays of function pointers for reading and writing memory. Each of these array elements is initialized to the DefaultWriter and DefaultReader functions, then the given fetch and store vectors fill in the code. As a result, fetch() looks like this:
unsigned char fetch(unsigned short addr)
{
    return (*(memoryReaders[addr]))(addr);
}

Store is similar. The nice thing here is that there is no special casing for memory layouts.

So to read memory, you look up a function to read from that address. To write memory, you look up a function to write to that address. This lets you emulate RAM, ROM, and I/O. You can emulate circumstances where writing a value to an address reads something different back. You can also emulate bank switching to switch in new ROM/RAM or shadow ROM over video memory and so on.

The second to last ingredient is managing interrupts. On this particular processor, there are two types of interrupts: maskable and non-maskable. A maskable interrupt is one that can be disabled and enabled by the CPU. A non-maskable interrupt can't be ignored. During the execution of any instruction, an interrupt may come in. In this model, this interrupt condition is stored in the pendingInterrupt variable. Before executing any instruction, the CPU needs to check to see if there is a pending interrupt:
static void CheckForInterrupt( void )
{
    unsigned short addr;

    if (pendingInterrupt == 0 ||
              (pendingInterrupt == I_BIT && CCR & I_BIT))
        return;
    if (pendingInterrupt & C_BIT) { /* NMI */
        addr = 0xFFFC;
    }
    else { /* IRQ */
        addr = 0xFFF8;
    }
    PC = fetch(addr) << 8;
    PC |= fetch(addr + 1);
    pendingInterrupt = 0;
}
The first if statement checks to see if there is an interrupt waiting. If pendingInterrupt is zero, then we'll short circuit out of this routine. The trick is to make sure that the most common case (no interrupt) does as little work as possible to keep overhead down. The second part of the interrupt checks to see if there was an interrupt and if it was masked.

If there was a pending interrupt, we distinguish the type and set the PC to a new address. The next instruction will be the correct interrupt handler.

The final piece is the actual instructions themselves. Here is my favorite instruction:
void NOP( void )
{
    cycles = 2;
}
NOP is No OPeration. It does nothing except take some time. Here is the implementation of LDX in the direct addressing mode. This means that the instruction contains an address of memory to put into X:
void LDXDIR( void )
{
    unsigned short effad;

    effad = fetch(PC++);
    X = fetch(effad) << 8;
    X |= fetch(effad + 1);

    if (X & 0x8000) CCR |= N_BIT;
    else CCR &= ~N_BIT;

    if (X == 0) CCR |= Z_BIT;
    else CCR &= ~Z_BIT;

    CCR &= ~V_BIT;
    cycles = 5;
}
Effad is the effective address of where memory will be read into the X register. The last chunk of code checks for changes in the bits in the condition code register. For example, the N bit is the negative bit. It is set whenever the result of the last operation resulted in a negtive number. Since the X register is 16 bits, we check the high bit by masking X with 0x8000.

For branch instructions, I used this code:
#define M_BRANCH(cond) {\
    char disp = fetch(PC++);\
    if (cond) {\
        PC += (short)disp;\
    }\
    cycles = 4;\
}
This is a macro which encompasses everything used in a branch instruction. Now this is one of the times that I love the C preprocessor. It lets me define the code body once and then use it inline in multiple places. So here are three branch instructions:
void BRA( void )
{
    M_BRANCH(1);
}

void BNE( void )
{
    M_BRANCH( (CCR & Z_BIT) == 0);
}

void BGE( void )
{
    M_BRANCH( ((CCR >> N_POS) & 1) == ((CCR >> Z_POS) & 1));
}
BRA is Branch unconditional. The macro is nice because the if test compiles out.
BNE is branch on not equal - this tests if the Zero bit is set.
BGE is Branch on greater than or equal to. In this case, the Negative bit needs to be equal to the Zero bit.

I'm not going to describe the arithmetic instructions other than to say that they stink. Most of the work is done in setting condition codes. Checking for overflow is tricky. Checking for half carry (overflow from bit 3 to bit 4) is a real pain and it's made worse by the fact that it has to be done every time there's an addition and it adds an extra 20% to execution time of any addition instruction.

Published Thursday, March 09, 2006 3:20 PM by Steve Hawley

Comments

No Comments

Anonymous comments are disabled

Steve's Tech Talk

This Blog

Syndication

Search

Navigation

Tags

Recent Posts

Archives

Building a VM

Comments