Tides Development Log

12/31/2015

Well, it's New Years Eve here in Baltimore and I'm busy coding :) The deadline
for Synchrony is in 9 days and, as usual, I'm just getting started. I was
planning on doing a shiny Windows 4k, but I have decided against it for 3
reasons:

    1. I no longer have time to do this.
    2. There is no 4k category at Synchrony.
    3. It's been done to death.
    
What I do have time for is the nifty "nano" category ( <= 256 bytes ). I'm
expecting to see a lot of DOS mode 13h stuff in this category, so my plan is
to try to turn some heads by doing something different...

I've experimented in the past with writting multiboot kernels. Multiboot is
a bootloader specification that allows the kernel to forgo the headache of
setting up protected mode. Conveniently, it also supports VESA. In short,
this allows one to code in a 32 bit environment with a VESA buffer already
configured all for the low, low cost of a 44 byte header.

The only bootloader that supports multiboot that I know of is GRUB. As an
added bonus, GRUB also supports compressed kernels :) I don't know how much
use this will be in a 256 byter, but in theory...

Since this kernel will be running directly on the metal (no DosBox or other
emulation layer to slow things down) I'll be able to take advantage of the
entirety of the (hopefully new and fast) compo machine CPU. This struck me as a
good occasion to take advantage of the SSE 4.1 instruction set extension to
write a nice highres raytracer. SSE 4.1 has the nifty DPPS instruction (aka dot
product) which should be perfect for this endeavor.

But, you say, SSE is huge! Most of the instructions are 3 or 4 bytes long! How
on earth will you possibly do anything productive in 256 bytes? Maybe you're
right, but I suspect the size of these instructions is worth it for something as
parallel as a raytracer. For example, say we want to normalize a vector before
we cast it out into oblivion. If we were using the regular FPU, at an absolute
minimum we'd use 3 FMULs (2 bytes each) and 2 FADDs (2 bytes each) for the dot
product, 1 FSQRT (2 bytes) and 3 FDIVs (2 bytes each). That comes out to 18
bytes, not counting code to move values into and out of the FPU. To do the 
equivalent using SSE instructions we have (assume x, y, and z are stored in
the first 3 positions in xmm1):

    movups      xmm2, xmm1          ; 3 bytes
    dpps        xmm2, xmm2, 0x7F    ; 6 bytes
    sqrtps      xmm2, xmm2          ; 3 bytes
    divps       xmm1, xmm2          ; 3 bytes
    
which comes out to 15 bytes. As you can see, in this case it's probably faster
and more space efficient to use SSE instructions. Whether or not this pays off
in the big picture we are yet to see...

So, my plan is to code up a simple reflective raytraced sphere in a box with
a dynamic lightsource buzzing around. 1280x1024x32 might be too optimistic for
a software raytracer, but time and testing will tell. I'm off for the next 3
days to research SSE and will hopefully be getting the bulk of the work done.

Wish me luck!

1/1/2016

Happy New Year! I was up till 4:30 a.m. last night writting a prototype in C.
I'm pretty happy with the result. No SSE intrinsics or any optimization, it runs
at about 2 - 3 FPS. My code must be terribly suboptimal because I'm only doing
basic Lambert shading without any reflections of a sphere and a plane at 640
by 480. I was expecting to do better. Looks like I'll have my work cut out for
me.

I've started writting the assembly code. The sphere intersection test alone is
currently using a whopping 180 bytes. Of course it doesn't work at all, it just
results in a triple fault. Fortunately, VirtualBox supports SSE 4.1 now and has
a nifty built-in debugger that I'm currently learning to use :) In other good
news, gzip manages to squeeze a few bytes out even at these tiny sizes, so the
kernel compression feature of GRUB will come in handy after all.

Apparently GRUB now has an aptly named "multiboot" command. If you enter the
GRUB command line and type "multiboot /kernel.bin" it will boot kernel.bin
provided it is in the /boot directory (under Ubuntu at least). Annoyingly, GRUB
doesn't seem to recognize the latest versions of ext, so its neccesary to keep
kernels on a seperate partition that gets mounted at /boot now. This all means
that to boot the intro I'm working on, you'll have to copy it to /boot. I hate
asking people to use root privledges just to run my code...

I'm taking the rest of the night off to eat and relax with friends. Can't wait
to show off screenshots of the prototype.

1/2/2016

Spent the day trying to set up a reasonable development environment. Started
with VirtualBox. It is fast, and has a built-in debugger. But the built-in
debugger doesn't have the ability to display the values in the SSE registers.
Thanks a fucking ton Oracle. Moved on to Bochs. Waaaay too slow, but otherwise
nice. Tried QEMU. Faster than Bochs, and has a GDB stub so I can use GDB to
debug my code as it runs! My heart jumped! But alas, the GDB stub appears to be
currently broken :( So back to Bochs for debugging. I'm actually doing
development in VirtualBox, then loading the VirtualBox disk image in Bochs,
booting from a GRUB2 ISO and loading kernel from the VirtualBox disk image to
debug. Not the most efficient development setup, but it's the best I've managed
to pull together.

I also spent a lot of time trying to get GRUB to boot over TFTP, as Bochs
has a built-in TFTP server that allows you to host files on the host machine,
but this appears to be nearly impossiible to get working.

Once I got my debugging environment sorted out I managed to track down my triple
fault (had the CR0 and CR4 registers configured incorrectly) and got the code
working well enough to render a sphere without any shading. It's currently
sitting at a bloated 190 bytes. That only leaves 66 bytes for dynamic lighting
and background... going to have to figure out some way to trim this down. Good
thing tomorrow is Sunday and I still have a whole day to work on this before I
return to "real life" on Monday.

1/3/2016

It was a good day. Finally made some progress on the code. I'm somewhat glad
that I spent some time yesterday getting a debugging environment set up, but I
barely used it. I had the good fortune to write mostly bug free code (that
NEVER happens). The intro is basically finished! Now it's just down to size
optimizing. It's currently at 328 bytes after a little bit of optimizing. So I
72 bytes to lose and realistically 3 evenings to do it in. Ouch! It turns out
that gzipping the intro is still relatively productive even at this tiny size.
It's currently saving me about 30 bytes. I should play around with different
versions of gzip and different compression algorithms to see what is most
productive (note to self).

I've been putting some thought into what I want to name this monster. The first
thing that came to mind was "Celestial" since it looks slightly lunar (the light
is rotating about the y-axis), but that's a little too boring. Metoikos is
giving a talk at the party titled "What is a demoscene?" I thought it might be
funny to call it "This is a demoscene"

Time for chicken fingers and Trailer Park Boys. More updates tomorrow.

1/4/2016

After spending some timem trying to size optimize the 328 byte monstrosity down
to something reasonably close to 256 bytes, I finally gave up. It turns out SSE
instructions are just simply too big to be useful at this size :( It was worth
a shot. On a positive note, I managed to rewrite the entire thing using the FPU
in about 6 hours. It's currently sitting at 286 bytes. I feel fairly confident
that I can shave off the last 30 bytes before the party deadline this Saturday
(it's currently Monday).

After rewriting it using the FPU I unfortunately encountered a nasty uneven
gradient on my previously beautifully smooth shaded sphere. Adding a tiny amount
of noise to the surface did the trick though, and it added an interesting grainy
effect that I like.

It's amazing how much difference color makes in a demo. The FPU rewrite is
currently only using shades of blue and it looks like dogshit. Definitely need
to add some color variation. In the previous incarnation I was multiplying the
Lambertian by 255 for the blue value and multiplying the Lambertian squared by
255 to get the green value. That really created some beautiful hues. Hopefully
I can still fit that color scheme into the final release.

I did find a better program to compress the code with. It's called Zopfli. It
uses DEFLATE and produces gzip files, but the number of passes it uses is
configurable whereas regular Gnu gzip uses 15 passes. It only saved me 2 or 3
bytes, but every byte counts!

Another name idea: "Real men don't use shaders"

1/5/2016

Down to 277 bytes. Now has more color. Need to figure out how to fake specular
lighting in a tiny amount of space.

1/9/2016

I'm sitting at Synchrony waiting for the next talk. No updates in a few days, 
but I managed to get it down to 256 bytes and added a blue wave-ish thing in the
background. I decided on the name "Tides" because the sphere in the center looks
vaguely like a moon and the blue in the background looks kind of like water
moving (if you squint hard). Party started yesterday. There was much music and
boozing last night. Having a great time so far :)

Looking back on the development of this thing, I had a good time. It's not a
compo winner, but I think it's an interesting idea that's worth sharing. It's
too slow to run in anything higher than 640 x 480 because I ended up not using
SSE instructions, but it looks good at that resolution. I'm excited to see what
comes out of the compos.

Well folks, time to wrap this up. Thanks for reading. I hope it offered some
interesting insights, or at least was mildly entertaining. Peace, love, and
demoscene.

orbitaldecay 2016