473 lines
21 KiB
Text
473 lines
21 KiB
Text
|
|
||
|
HACKING ON THE GNUBOY SOURCE TREE
|
||
|
|
||
|
|
||
|
BASIC INFO
|
||
|
|
||
|
In preparation for the first release, I'm putting together a simple
|
||
|
document to aid anyone interested in playing around with or improving
|
||
|
the gnuboy source. First of all, before working on anything, you
|
||
|
should know my policies as maintainer. I'm happy to accept contributed
|
||
|
code, but there are a few guidelines:
|
||
|
|
||
|
* Obviously, all code must be able to be distributed under the GNU
|
||
|
GPL. This means that your terms of use for the code must be equivalent
|
||
|
to or weaker than those of the GPL. Public domain and MIT-style
|
||
|
licenses are perfectly fine for new code that doesn't incorporate
|
||
|
existing parts of gnuboy, e.g. libraries, but anything derived from or
|
||
|
built upon the GPL'd code can only be distributed under GPL. When in
|
||
|
doubt, read COPYING.
|
||
|
|
||
|
* Please stick to a coding and naming convention similar to the
|
||
|
existing code. I can reformat contributions if I need to when
|
||
|
integrating them, but it makes it much easier if that's already done
|
||
|
by the coder. In particular, indentions are a single tab (char 9), and
|
||
|
all symbols are all lowercase, except for macros which are all
|
||
|
uppercase.
|
||
|
|
||
|
* All code must be completely deterministic and consistent across all
|
||
|
platforms. this results in the two following rules...
|
||
|
|
||
|
* No floating point code whatsoever. Use fixed point or better yet
|
||
|
exact analytical integer methods as opposed to any approximation.
|
||
|
|
||
|
* No threads. Emulation with threads is a poor approximation if done
|
||
|
sloppily, and it's slow anyway even if done right since things must be
|
||
|
kept synchronous. Also, threads are not portable. Just say no to
|
||
|
threads.
|
||
|
|
||
|
* All non-portable code belongs in the sys/ or asm/ trees. #ifdef
|
||
|
should be avoided except for general conditionally-compiled code, as
|
||
|
opposed to little special cases for one particular cpu or operating
|
||
|
system. (i.e. #ifdef USE_ASM is ok, #ifdef __i386__ is NOT!)
|
||
|
|
||
|
* That goes for *nix code too. gnuboy is written in ANSI C, and I'm
|
||
|
not going to go adding K&R function declarations or #ifdef's to make
|
||
|
sure the standard library is functional. If your system is THAT
|
||
|
broken, fix the system, don't "fix" the emulator.
|
||
|
|
||
|
* Please no feature-creep. If something can be done through an
|
||
|
external utility or front-end, or through clever use of the rc
|
||
|
subsystem, don't add extra code to the main program.
|
||
|
|
||
|
* On that note, the modules in the sys/ tree serve the singular
|
||
|
purpose of implementing calls necessary to get input and display
|
||
|
graphics (and eventually sound). Unlike in poorly-designed emulators,
|
||
|
they are not there to give every different target platform its own gui
|
||
|
and different set of key bindings.
|
||
|
|
||
|
* Furthermore, the main loop is not in the platform-specific code, and
|
||
|
it will never be. Windows people, put your code that would normally go
|
||
|
in a message loop in ev_refresh and/or sys_sleep!
|
||
|
|
||
|
* Commented code is welcome but not required.
|
||
|
|
||
|
* I prefer asm in AT&T syntax (the style used by *nix assemblers and
|
||
|
likewise DJGPP) as opposed to Intel/NASM/etc style. If you really must
|
||
|
use a different style, I can convert it, but I don't want to add extra
|
||
|
dependencies on nonstandard assemblers to the build process. Also,
|
||
|
portable C versions of all code should be available.
|
||
|
|
||
|
* Have fun with it. If my demands stifle your creativity, feel free to
|
||
|
fork your own projects. I can always adapt and merge code later if
|
||
|
your rogue ideas are good enough. :)
|
||
|
|
||
|
OK, enough of that. Now for the fun part...
|
||
|
|
||
|
|
||
|
THE SOURCE TREE STRUCTURE
|
||
|
|
||
|
[documentation]
|
||
|
README - general information related to using gnuboy
|
||
|
INSTALL - compiling and installation instructions
|
||
|
HACKING - this file, obviously
|
||
|
COPYING - the gnu gpl, grants freedom under condition of preseving it
|
||
|
|
||
|
[build files]
|
||
|
Version - doubles as a C and makefile include, identifies version number
|
||
|
Rules - generic build rules to be included by makefiles
|
||
|
Makefile.* - system-specific makefiles
|
||
|
configure* - script for generating *nix makefiles
|
||
|
|
||
|
[non-portable code]
|
||
|
sys/*/* - hardware and software platform-specific code
|
||
|
asm/*/* - optimized asm versions of some code, not used yet
|
||
|
asm/*/asm.h - header specifying which functions are replaced by asm
|
||
|
asm/i386/asmnames.h - #defines to fix _ prefix brain damage on DOS/Windows
|
||
|
|
||
|
[main emulator stuff]
|
||
|
main.c - entry point, event handler...basically a mess
|
||
|
loader.c - handles file io for rom and ram
|
||
|
emu.c - another mess, basically the frame loop that calls state.c
|
||
|
debug.c - currently just cpu trace, eventually interactive debugging
|
||
|
hw.c - interrupt generation, gamepad state, dma, etc.
|
||
|
mem.c - memory mapper, read and write operations
|
||
|
fastmem.h - short static functions that will inline for fast memory io
|
||
|
regs.h - macros for accessing hardware registers
|
||
|
save.c - savestate handling
|
||
|
|
||
|
[cpu subsystem]
|
||
|
cpu.c - main cpu emulation
|
||
|
cpuregs.h - macros for cpu registers and flags
|
||
|
cpucore.h - data tables for cpu emulation
|
||
|
asm/i386/cpu.s - entire cpu core, rewritten in asm
|
||
|
|
||
|
[graphics subsystem]
|
||
|
fb.h - abstract framebuffer definition, extern from platform-specifics
|
||
|
lcd.c - main control of refresh procedure
|
||
|
lcd.h - vram, palette, and internal structures for refresh
|
||
|
asm/i386/lcd.s - asm versions of a few critical functions
|
||
|
lcdc.c - lcdc phase transitioning
|
||
|
|
||
|
[input subsystem]
|
||
|
input.h - internal keycode definitions, etc.
|
||
|
keytables.c - translations between key names and internal keycodes
|
||
|
events.c - event queue
|
||
|
|
||
|
[resource/config subsystem]
|
||
|
rc.h - structure defs
|
||
|
rccmds.c - command parser/processor
|
||
|
rcvars.c - variable exports and command to set rcvars
|
||
|
rckeys.c - keybindingds
|
||
|
|
||
|
[misc code]
|
||
|
path.c - path searching
|
||
|
split.c - general purpose code to split strings into argv-style arrays
|
||
|
|
||
|
|
||
|
OVERVIEW OF PROGRAM FLOW
|
||
|
|
||
|
The initial entry point main() main.c, which will process the command
|
||
|
line, call the system/video initialization routines, load the
|
||
|
rom/sram, and pass control to the main loop in emu.c. Note that the
|
||
|
system-specific main() hook has been removed since it is not needed.
|
||
|
|
||
|
There have been significant changes to gnuboy's main loop since the
|
||
|
original 0.8.0 release. The former state.c is no more, and the new
|
||
|
code that takes its place, in lcdc.c, is now called from the cpu loop,
|
||
|
which although slightly unfortunate for performance reasons, is
|
||
|
necessary to handle some strange special cases.
|
||
|
|
||
|
Still, unlike some emulators, gnuboy's main loop is not the cpu
|
||
|
emulation loop. Instead, a main loop in emu.c which handles video
|
||
|
refresh, polling events, sleeping between frames, etc. calls
|
||
|
cpu_emulate passing it an idea number of cycles to run. The actual
|
||
|
number of cycles for which the cpu runs will vary slightly depending
|
||
|
on the length of the final instruction processed, but it should never
|
||
|
be more than 8 or 9 beyond the ideal cycle count passed, and the
|
||
|
actual number will be returned to the calling function in case it
|
||
|
needs this information. The cpu code now takes care of all timer and
|
||
|
lcdc events in its main loop, so the caller no longer needs to be
|
||
|
aware of such things.
|
||
|
|
||
|
Note that all cycle counts are measured in CGB double speed MACHINE
|
||
|
cycles (2**21 Hz), NOT hardware clock cycles (2**23 Hz). This is
|
||
|
necessary because the cpu speed can be switched between single and
|
||
|
double speed during a single call to cpu_emulate. When running in
|
||
|
single speed or DMG mode, all instruction lengths are doubled.
|
||
|
|
||
|
As for the LCDC state, things are much simpler now. No more huge
|
||
|
glorious state table, no more P/Q/R, just a couple simple functions.
|
||
|
Aside from the number of cycles left before the next state change, all
|
||
|
the state information fits nicely in the locations the Game Boy itself
|
||
|
provides for it -- the LCDC, STAT, and LY registers.
|
||
|
|
||
|
If the special cases for the last line of VBLANK look strange to you,
|
||
|
good. There's some weird stuff going on here. According to documents
|
||
|
I've found, LY changes from 153 to 0 early in the last line, then
|
||
|
remains at 0 until the end of the first visible scanline. I don't
|
||
|
recall finding any roms that rely on this behavior, but I implemented
|
||
|
it anyway.
|
||
|
|
||
|
That covers the basics. As for flow of execution, here's a simplified
|
||
|
call tree that covers most of the significant function calls taking
|
||
|
place in normal operation:
|
||
|
|
||
|
main sys/
|
||
|
\_ real_main main.c
|
||
|
|_ sys_init sys/
|
||
|
|_ vid_init sys/
|
||
|
|_ loader_init loader.c
|
||
|
|_ emu_reset emu.c
|
||
|
\_ emu_run emu.c
|
||
|
|_ cpu_emulate cpu.c
|
||
|
| |_ div_advance cpu.c *
|
||
|
| |_ timer_advance cpu.c *
|
||
|
| |_ lcdc_advance cpu.c *
|
||
|
| | \_ lcdc_trans lcdc.c
|
||
|
| | |_ lcd_refreshline lcd.c
|
||
|
| | |_ stat_change lcdc.c
|
||
|
| | | \_ lcd_begin lcd.c
|
||
|
| | \_ stat_trigger lcdc.c
|
||
|
| \_ sound_advance cpu.c *
|
||
|
|_ vid_end sys/
|
||
|
|_ sys_elapsed sys/
|
||
|
|_ sys_sleep sys/
|
||
|
|_ vid_begin sys/
|
||
|
\_ doevents main.c
|
||
|
|
||
|
(* included in cpu.c so they can inline; also in cpu.s)
|
||
|
|
||
|
|
||
|
MEMORY READ/WRITE MAP
|
||
|
|
||
|
Whenever possible, gnuboy avoids emulating memory reads and writes
|
||
|
with a function call. To this end, two pointer tables are kept -- one
|
||
|
for reading, the other for writing. They are indexed by bits 12-15 of
|
||
|
the address in Game Boy memory space, and yield a base pointer from
|
||
|
which the whole address can be used as an offset to access Game Boy
|
||
|
memory with no function calls whatsoever. For regions that cannot be
|
||
|
accessed without function calls, the pointer in the table is NULL.
|
||
|
|
||
|
For example, reading from address addr can be accomplished by testing
|
||
|
to make sure mbc.rmap[addr>>12] is not NULL, then simply reading
|
||
|
mbc.rmap[addr>>12][addr].
|
||
|
|
||
|
And for the disbelievers in this optimization, here are some numbers
|
||
|
to compare. First, FFL2 with memory tables disabled:
|
||
|
|
||
|
% cumulative self self total
|
||
|
time seconds seconds calls us/call us/call name
|
||
|
28.69 0.57 0.57 refresh_2
|
||
|
13.17 0.84 0.26 4307863 0.06 0.06 mem_read
|
||
|
11.63 1.07 0.23 cpu_emulate
|
||
|
|
||
|
Now, with memory tables enabled:
|
||
|
|
||
|
38.86 0.66 0.66 refresh_2
|
||
|
8.42 0.80 0.14 156380 0.91 0.91 spr_enum
|
||
|
6.76 0.91 0.11 483134 0.24 1.31 lcdc_trans
|
||
|
6.16 1.02 0.10 cpu_emulate
|
||
|
.
|
||
|
.
|
||
|
.
|
||
|
0.59 1.61 0.01 216497 0.05 0.05 mem_read
|
||
|
|
||
|
As you can see, not only does mem_read take up (proportionally) 1/20
|
||
|
as much time, since it is rarely called, but the main cpu loop in
|
||
|
cpu_emulate also runs considerably faster with all the function call
|
||
|
overhead and cache misses avoided.
|
||
|
|
||
|
These tests were performed on K6-2/450 with the assembly cores
|
||
|
enabled; your milage may vary. Regardless, however, I think it's clear
|
||
|
that using the address mapping tables is quite a worthwhile
|
||
|
optimization.
|
||
|
|
||
|
|
||
|
LCD RENDERING CORE DESIGN
|
||
|
|
||
|
The LCD core presently used in gnuboy is very much a high-level one,
|
||
|
performing the task of rasterizing scanlines as many independent steps
|
||
|
rather than one big loop, as is often seen in other emulators and the
|
||
|
original gnuboy LCD core. In some ways, this is a bit of a tradeoff --
|
||
|
there's a good deal of overhead in rebuilding the tile pattern cache
|
||
|
for roms that change their tile patterns frequently, such as full
|
||
|
motion video demos. Even still, I consider the method we're presently
|
||
|
using far superior to generating the output display directly from the
|
||
|
gameboy tiledata -- in the vast majority of roms, tiles are changed so
|
||
|
infrequently that the overhead is irrelevant. Even if the tiles are
|
||
|
changed rapidly, the only chance for overhead beyond what would be
|
||
|
present in a monolithic rendering loop lies in (host cpu) cache misses
|
||
|
and the possibility that we might (tile pattern) cache a tile that has
|
||
|
changed but that will never actually be used, or that will only be
|
||
|
used in one orientation (horizontally and vertically flipped versions
|
||
|
of all tiles are cached as well). Such tile caching issues could be
|
||
|
addressed in the long term if they cause a problem, but I don't see it
|
||
|
hurting performance too significantly at the present. As for host cpu
|
||
|
cache miss issues, I find that putting multiple data decoding and
|
||
|
rendering steps together in a single loop harms performance much more
|
||
|
significantly than building a 256k (pattern) cache table, on account
|
||
|
of interfering with branch prediction, register allocation, and so on.
|
||
|
|
||
|
Well, with those justifications given, let's proceed to the steps
|
||
|
involved in rendering a scanline:
|
||
|
|
||
|
updatepatpix() - updates tile pattern cache.
|
||
|
|
||
|
tilebuf() - reads gb tile memory according to its complicated tile
|
||
|
addressing system which can be changed via the LCDC register, and
|
||
|
outputs nice linear arrays of the actual tile indices used in the
|
||
|
background and window on the present line.
|
||
|
|
||
|
Before continuing, let me explain the output format used by the
|
||
|
following functions. There is a byte array scan.buf, accessible by
|
||
|
macro as BUF, which is the output buffer for the line. The structure
|
||
|
of this array is simple: it is composed of 6 bpp gameboy color
|
||
|
numbers, where the bits 0-1 are the color number from the tile, bits
|
||
|
2-4 are the (cgb or dmg) palette index, and bit 5 is 0 for background
|
||
|
or window, 1 for sprite.
|
||
|
|
||
|
What is the justification for using a strange format like this, rather
|
||
|
than raw host color numbers for output? Well, believe it or not, it
|
||
|
improves performance. It's already necessary to have the gameboy color
|
||
|
numbers available for use in sprite priority. And, when running in
|
||
|
mono gb mode, building this output data is VERY fast -- it's just a
|
||
|
matter of doing 64 bit copies from the tile pattern cache to the
|
||
|
output buffer.
|
||
|
|
||
|
Furthermore, using a unified output format like this eliminates the
|
||
|
need to have separate rendering functions for each host color depth or
|
||
|
mode. We just call a one-line function to apply a palette to the
|
||
|
output buffer as we copy it to the video display, and we're done. And,
|
||
|
if you're not convinced about performance, just do some profiling.
|
||
|
You'll see that the vast majority of the graphics time is spent in the
|
||
|
one-line copy function (render_[124] depending on bytes per pixel),
|
||
|
even when using the fast asm versions of those routines. That is to
|
||
|
say, any overhead in the following functions is for all intents and
|
||
|
purposes irrelevant to performance. With that said, here they are:
|
||
|
|
||
|
bg_scan() - expands the background layer to the output buffer.
|
||
|
|
||
|
wnd_scan() - expands the window layer.
|
||
|
|
||
|
spr_scan() - expands the sprites. Note that this requires spr_enum()
|
||
|
to have been called already to build a list of which sprites are
|
||
|
visible on the current scanline and sort them by priority.
|
||
|
|
||
|
It should be noted that the background and window functions also have
|
||
|
color counterparts, which are considerably slower due to merging of
|
||
|
palette data. At this point, they're staying down around 8% time
|
||
|
according to the profiler, so I don't see a major need to rewrite them
|
||
|
anytime soon. It should be considered, however, that a different
|
||
|
intermediate format could be used for gbc, or that asm versions of
|
||
|
these two routines could be written, in the long term.
|
||
|
|
||
|
Finally, some notes on palettes. You may be wondering why the 6 bpp
|
||
|
intermediate output can't be used directly on 256-color display
|
||
|
targets. After all, that would give a huge performance boost. The
|
||
|
problem, however, is that the gameboy palette can change midscreen,
|
||
|
whereas none of the presently targetted host systems can handle such a
|
||
|
thing, much less do it portably. For color roms, using our own
|
||
|
internal color mappings in addition to the host system palette is
|
||
|
essential. For details on how this is accomplished, read palette.c.
|
||
|
|
||
|
Now, in the long term, it MAY be possible to use the 6 bpp color
|
||
|
"almost" directly for mono roms. Note that I say almost. The idea is
|
||
|
this. Using the color number as an index into a table is slow. It
|
||
|
takes an extra read and causes various pipeline stalls depending on
|
||
|
the host cpu architecture. But, since there are relatively few
|
||
|
possible mono palettes, it may actually be possible to set up the host
|
||
|
palette in a clever way so as to cover all the possibilities, then use
|
||
|
some fancy arithmetic or bit-twiddling to convert without a lookup
|
||
|
table -- and this could presumably be done 4 pixels at a time with
|
||
|
32bit operations. This area remains to be explored, but if it works,
|
||
|
it might end up being the last hurdle to getting realtime emulation
|
||
|
working on very low-end systems like i486.
|
||
|
|
||
|
|
||
|
SOUND
|
||
|
|
||
|
Rather than processing sound after every few instructions (and thus
|
||
|
killing the cache coherency), we update sound in big chunks. Yet this
|
||
|
in no way affects precise sound timing, because sound_mix is always
|
||
|
called before reading or writing a sound register, and at the end of
|
||
|
each frame.
|
||
|
|
||
|
The main sound module interfaces with the system-specific code through
|
||
|
one structure, pcm, and a few functions: pcm_init, pcm_close, and
|
||
|
pcm_submit. While the first two should be obvious, pcm_submit needs
|
||
|
some explaining. Whenever realtime sound output is operational,
|
||
|
pcm_submit is responsible for timing, and should not return until it
|
||
|
has successfully processed all the data in its input buffer (pcm.buf).
|
||
|
On *nix sound devices, this typically means just waiting for the write
|
||
|
syscall to return, but on systems such as DOS where low level IO must
|
||
|
be handled in the program, pcm_submit needs to delay until the current
|
||
|
position in the DMA buffer has advanced sufficiently to make space for
|
||
|
the new samples, then copy them.
|
||
|
|
||
|
For special sound output implementations like write-to-file or the
|
||
|
dummy sound device, pcm_submit should write the data immediately and
|
||
|
return 0, indicating to the caller that other methods must be used for
|
||
|
timing. On real sound devices that are presently functional,
|
||
|
pcm_submit should return 1, regardless of whether it buffered or
|
||
|
actually wrote the sound data.
|
||
|
|
||
|
And yes, for unices without OSS, we hope to add piped audio output
|
||
|
soon. Perhaps Sun audio device and a few others as well.
|
||
|
|
||
|
|
||
|
OPTIMIZED ASSEMBLY CODE
|
||
|
|
||
|
A lot can be said on this matter. Nothing has been said yet.
|
||
|
|
||
|
|
||
|
INTERACTIVE DEBUGGER
|
||
|
|
||
|
Apologies, there is no interactive debugger in gnuboy at present. I'm
|
||
|
still working out the design for it. In the long run, it should be
|
||
|
integrated with the rc subsystem, kinda like a cross between gdb and
|
||
|
Quake's ever-famous console. Whether it will require a terminal device
|
||
|
or support the graphical display remains to be determined.
|
||
|
|
||
|
In the mean time, you can use the debug trace code already
|
||
|
implemented. Just "set trace 1" from your gnuboy.rc or the command
|
||
|
line. Read debug.c for info on how to interpret the output, which is
|
||
|
condensed as much as possible and not quite self-explanatory.
|
||
|
|
||
|
|
||
|
PORTING
|
||
|
|
||
|
On all systems on which it is available, the gnu compiler should
|
||
|
probably be used. Writing code specific to non-free compilers makes it
|
||
|
impossible for free software users to actively contribute. On the
|
||
|
other hand, compiler-specific code should always be kept to a minimum,
|
||
|
to make porting to or from non-gnu compilers easier.
|
||
|
|
||
|
Porting to new cpu architectures should not be necessary. Just make
|
||
|
sure you unset IS_LITTLE_ENDIAN in the makefiles to enable the big
|
||
|
endian default if the target system is big endian. If you do have
|
||
|
problems building on certain cpus, however, let us know. Eventually,
|
||
|
we will also want asm cpu and graphics code for popular host cpus, but
|
||
|
this can wait, since the c code should be sufficiently fast on most
|
||
|
platforms.
|
||
|
|
||
|
The bulk of porting efforts will probably be spent on adding support
|
||
|
for new operating systems, and on systems with multiple video (or
|
||
|
sound, once that's implemented) architectures, new interfaces for
|
||
|
those. In general, the operating system interface code goes in a
|
||
|
directory under sys/ named for the os (e.g. sys/nix/ for *nix
|
||
|
systems), and display interfaces likewise go in their respective
|
||
|
directories under sys/ (e.g. sys/x11/ for the x window system
|
||
|
interface).
|
||
|
|
||
|
For guidelines in writing new system and display interface modules, i
|
||
|
recommend reading the files in the sys/dos/, sys/svga/, and sys/nix/
|
||
|
directories. These are some of the simpler versions (aside from the
|
||
|
tricky dos keyboard handling), as opposed to all the mess needed for
|
||
|
x11 support.
|
||
|
|
||
|
Also, please be aware that the existing system and display interface
|
||
|
modules are somewhat primitive; they are designed to be as quick and
|
||
|
sloppy as possible while still functioning properly. Eventually they
|
||
|
will be greatly improved.
|
||
|
|
||
|
Finally, remember your obligations under the GNU GPL. If you produce
|
||
|
any binaries that are compiled strictly from the source you received,
|
||
|
and you intend to release those, you *must* also release the exact
|
||
|
sources you used to produce those binaries. This is not pseudo-free
|
||
|
software like Snes9x where binaries usually appear before the latest
|
||
|
source, and where the source only compiles on one or two platforms;
|
||
|
this is true free software, and the source to all binaries always
|
||
|
needs to be available at the same time or sooner than the
|
||
|
corresponding binaries, if binaries are to be released at all. This of
|
||
|
course applies to all releases, not just new ports, but from
|
||
|
experience i find that ports people usually need the most reminding.
|
||
|
|
||
|
|
||
|
EPILOGUE
|
||
|
|
||
|
That's it for now. More info will eventually follow. Happy hacking!
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|