Blobless Linux on Raspberry Pi (rpi-open-firmware).

14 Jan 2017 in research

While there was a brief hiatus in the development of the open firmware for Raspberry Pi, we can now boot Linux straight from bootcode.bin (note, this does require an initrd built in with zImage as there are still issues with eMMC working reliably at boot).

We are also currently investigating and getting closer to getting the following working:

  • USB PHY: Preliminary driver for USB PHY initialisation is ready but USB itself relies on the DMA engine working and we're still figuring that out.
  • Centralised power and clock management drivers in the firmware (already used for BCM2708PowerDomainARM and BCM2708PowerDomainImage)
  • eMMC (SDHOST) driver reliability (at the moment it seems to have 1 in 100 or so failure rate, requiring a reboot).
  • eMMC (SDHOST) teardown for the Linux kernel.

The current trunk of the rpi-open-firmware is able to boot a minimal Linux kernel image. You will need cmdline.txt, rpi.dtb (the compiled device tree for your rPi model) and zImage on your boot partition (the same place bootcode.bin resides). You're encouraged to build a minimal kernel and try it out (a good starting point is this .config file, which you can use to build the latest Linux kernel and which also includes a correctly configured early printk).

A few caveats:

  • You need to use the SDHOST driver and make sure it's in your device tree if you want any chance of eMMC being recognised at boot (not that relevant given the fact that initrd is currently almost required).
  • Because of memory map differences, stock Linux will not currently boot on rPi1 models. Anything above that (BCM2709 - rPi2 / BCM2710 - rPi3) should be fine, however.
  • GPIOs work, eMMC (with SDHOST driver) kind of works, USB, DMA and video stuff do not work yet (as outlined above, we're working on bringing them up).

Here is a snippet of Linux booting up (you can find the full log, including firmware logs, here):

[LDR:LoaderImpl]: Jumping to the Linux kernel...
Uncompressing Linux... done, booting the kernel.
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 4.4.31-v7+ (alyssa@debian) (gcc version 6.2.1 20161124 (Debian 6.2.1-5) ) #81 Fri Jan 6 13:44:14 PST 2017
[    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c53c7d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[    0.000000] Machine model: Raspberry Pi 2 Model B
-snip-
/ # uname -a
Linux (none) 4.4.31-v7+ #81 Fri Jan 6 13:44:14 PST 2017 armv7l GNU/Linux

If you're interested in contributing or have any questions, feel free to join #raspberrypi-internals on Freenode and check out the project on GitHub.

Improving LLVM for VideoCore4 (Raspberry Pi VPU): Part 1

20 Mar 2016 in research

Recently, I've been interested in messing around with the Raspberry Pi VPU but sadly I was not able to find any existing C compiler that I would consider viable. The first compiler I looked at was VBCC but sadly the version that I built from source appeared to have several significant bugs that led to incorrect code generation, and after spending a significant amount of time trying to fix them I gave up. The second option was the VideoCore4 fork of The Amsterdam Compiler Kit but unfortunately it only supported ANSI C and generated suboptimal code.

I therefore decided to improve David Given's LLVM for VideoCore4 fork in order to get it to generate semi-reasonable code based on the LLVM IR produced by Clang.

Fixing VA args

One of the main things I wanted to get to work was a printf-like function which required adding VA args support to VideoCore4TargetLowering::LowerCCCArguments. I decided to use the XCore backend as a reference since it had a fairly similar calling convention and an easy to understand VA arg ABI. Recycling code from the XCore backend was simple enough and involved just replacing XCore specific parts (such as register classes).

In order to further support VA args, the ISD::FrameIndex DAG node had to be correctly lowered which simply involved converting it into an addition of the frame index to the stack pointer.

res = CurDAG->getMachineNode(VideoCore4::ADD_R_RRI, dl, VT, TFI, CurDAG->getRegister(VideoCore4::SP, MVT::i32));

As the last step, lowering of ISD::VAARG and ISD::VASTART nodes had to be added (which were emitted when va_start/va_arg functions were used) which was, again, heavily based on the XCore backend.

After implementing those primitives, any function that accepted VA args would have a correct prolog that involved pushing all argument registers onto the stack and addressing them through the add/ld combination.
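As a rough illustration of what this enables, here is a minimal variadic C function of the kind that exercises va_start and va_arg. The function name sum_ints is made up for this example and is not part of the firmware; it is only meant to show the sort of source code that needs the lowering described above.

```c
#include <stdarg.h>

/* Hypothetical example: sums `count` int arguments.
 * va_start/va_arg/va_end are the C-level operations that the backend
 * has to be able to lower for a function like this to compile. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);            /* point ap at the first unnamed argument */
    for (int i = 0; i < count; i++) {
        total += va_arg(ap, int);   /* fetch the next int and advance ap */
    }
    va_end(ap);

    return total;
}
```

Calling sum_ints(3, 1, 2, 3) should return 6. A printf-like function walks its arguments in the same way, just driven by the format string instead of a count.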

Jump Tables

The next step was implementing jump tables, which involved lowering the ISD::BR_JT node. The way I dealt with it involved creating a pseudo instruction and then just emitting an inline jump table. While it wasn't especially nice (considering VideoCore4 had tbb/tbs instructions that dealt specifically with jump tables), it worked as a temporary hack.

# Inline jumptable on r1
# Jumptable entry 0
cmp r1, 0
beq BB3_36
# Jumptable entry 1
cmp r1, 1
beq BB3_39
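To make the trade-off a bit more concrete, here is a small C sketch of what the inline jump table above amounts to. The enum values and function names are invented for this example; they just stand in for the basic blocks BB3_36 and BB3_39. The first function is a dense switch of the kind that is a typical jump table candidate, the second models the emitted compare chain.

```c
/* Hypothetical stand-ins for the branch targets in the listing above. */
enum target { TARGET_DEFAULT = 0, TARGET_BB3_36 = 36, TARGET_BB3_39 = 39 };

/* A dense switch like this is a typical candidate for a jump table. */
enum target dispatch_switch(unsigned index)
{
    switch (index) {
    case 0:  return TARGET_BB3_36;
    case 1:  return TARGET_BB3_39;
    default: return TARGET_DEFAULT;
    }
}

/* Model of the inline jump table: one compare and one conditional
 * branch per entry, so the cost grows linearly with the table size. */
enum target dispatch_chain(unsigned index, const enum target *table, unsigned entries)
{
    for (unsigned i = 0; i < entries; i++) {
        if (index == i) {           /* cmp r1, i */
            return table[i];        /* beq <table entry i> */
        }
    }
    return TARGET_DEFAULT;          /* no entry matched */
}
```

Both functions pick the same target for every index, but the chain takes up to one compare per entry, whereas a real table lookup would take constant time. That is the reason this only counts as a temporary hack.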

Weird BRCONDs/SELECTs

Because of expansions performed by LLVM, I often found it generating ISD::SELECT and ISD::BRCOND nodes whose condition was not an ISD::SETCC node, which caused a lot of failed selections. After brief research, I found out that these nodes were the equivalent of if (var != 0) then X else Y, so I lowered those nodes to a sequence of cmp and either b or mov instructions with condition codes.

Toolchain pain

While the VideoCore4 LLVM backend was capable of generating assembly files, it could not emit actual machine instructions without having to add instruction encoding information to TableGen as well as implementing a machine instruction emitter (which included having to come up with a relocation format and having to write a separate linker). I decided to not take that route for the time being and instead use an external assembler/linker to produce the final firmware blob.

binutils for VideoCore4

The GNU binutils fork for VideoCore4 was an obvious option as the target toolchain, having the advantage of being free as in freedom. After fighting autotools, I finally managed to build it and tried compiling and linking the firmware binary with a flat linker script. After running it on the Raspberry Pi, it hung up early at boot and appeared to be unresponsive. After disassembling the binary I found out that instead of using PC-relative addresses in b/bl/lea instructions, it was instead encoding absolute zero-relative addresses into PC-relative instructions. Oops ...
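The effect of that bug can be sketched in a few lines of C. This is only a model: the function names are made up, and it assumes the displacement is counted in bytes from the address of the branch instruction itself, which may not match the real VideoCore4 encoding exactly.

```c
#include <stdint.h>

/* What the assembler should encode: the distance from the branch
 * instruction to its target (assumed here to be in bytes). */
int32_t pcrel_displacement(uint32_t insn_addr, uint32_t target_addr)
{
    return (int32_t)(target_addr - insn_addr);
}

/* What the CPU does with the encoded field of a PC-relative branch. */
uint32_t resolve_branch(uint32_t insn_addr, int32_t encoded)
{
    return insn_addr + (uint32_t)encoded;
}
```

For a branch at 0x100 targeting 0x40, the correct displacement is -0xC0 and the branch resolves back to 0x40. If the absolute address 0x40 is encoded instead, the same branch lands on 0x140, which is consistent with the firmware hanging early at boot.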

After briefly looking at the code, I decided to leave it for the time being since there was another option for the toolchain.

VASM/VLINK

While VASM/VLINK are free, the source code is provided for reference only and, unfortunately, it is not possible to redistribute modifications without the author's prior consent. Because those tools seemed to work reliably once I had fixed a small bug in the VASM assembler (related to it ignoring 48-bit encodings of arithmetic instructions) and made it play nicely with some of the directives produced by LLVM, I decided to target that toolchain to generate the firmware binary (bootcode.bin).

And while it may seem insignificant, it felt so good to have printf finally working correctly:

[Image: printf on RaspberryPi]

(More information to follow)

Boot image compression

07 Jul 2013 in misc

Because kernel and driver images are fairly large nowadays, compressing them before saving them to disk or sending them over TFTP helps reduce boot time. This is especially helpful with TFTP, which is what I use for uploading debug kernels and drivers, so I thought that compressing the mach-o binaries inside the IMGX files was a good idea.

At the time, the best candidate ended up being the LZSS.C compressor, which is very widely used in a variety of Apple boot-related things (kernelcaches, mkexts, IOGraphics etc.), so I went with it. While LZSS.C does provide fast decompression and a fairly good compression ratio (mach_kernel went from 8615092 bytes to 4513873 bytes), the compression speed was fairly slow (it took around 3 seconds on my machine).
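For reference, the decompression side of LZSS is small, which is part of why it is attractive for a bootloader. Below is a minimal decoder sketch. It assumes the parameters of the classic LZSS.C (4096-byte ring buffer pre-filled with spaces, maximum match length 18, threshold 2); the function name lzss_decode and the bounds checks are my own additions, and the framing used inside IMGX files is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LZSS_N         4096  /* ring buffer size (assumed, as in classic LZSS.C) */
#define LZSS_F         18    /* maximum match length */
#define LZSS_THRESHOLD 2     /* matches are always longer than this */

/* Decodes src into dst and returns the number of bytes written. */
size_t lzss_decode(uint8_t *dst, size_t dst_len, const uint8_t *src, size_t src_len)
{
    uint8_t ring[LZSS_N];
    size_t in = 0, out = 0;
    unsigned r = LZSS_N - LZSS_F;   /* initial write position in the ring */
    unsigned flags = 0;

    memset(ring, ' ', sizeof(ring));

    for (;;) {
        flags >>= 1;
        if ((flags & 0x100) == 0) {
            /* All 8 flag bits consumed: fetch the next flag byte. The
             * high byte acts as a counter for the remaining bits. */
            if (in >= src_len) break;
            flags = src[in++] | 0xFF00;
        }
        if (flags & 1) {
            /* Flag bit set: a single literal byte. */
            if (in >= src_len || out >= dst_len) break;
            uint8_t c = src[in++];
            dst[out++] = c;
            ring[r] = c;
            r = (r + 1) & (LZSS_N - 1);
        } else {
            /* Flag bit clear: a 12-bit position and a 4-bit length. */
            if (in + 1 >= src_len) break;
            unsigned pos = src[in++];
            unsigned len = src[in++];
            pos |= (len & 0xF0) << 4;
            len = (len & 0x0F) + LZSS_THRESHOLD + 1;
            for (unsigned k = 0; k < len; k++) {
                if (out >= dst_len) return out;
                uint8_t c = ring[(pos + k) & (LZSS_N - 1)];
                dst[out++] = c;
                ring[r] = c;
                r = (r + 1) & (LZSS_N - 1);
            }
        }
    }
    return out;
}
```

As a hand-built example, the five bytes 03 61 62 EE F0 hold two literals ('a', 'b') followed by a back-reference of length 3 to ring position 0xFEE, and decode to "ababa". The decoder is a tight loop over a small buffer, which matches the observation that decompression is fast while the match search during compression is the slow part.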

Slow LZSS

Although 3 seconds was not a lot, I decided to look for an alternative that could be used in place of LZSS.C and found QuickLZ, which, according to benchmarks, seemed a lot faster. I integrated it into the IMGX builder and my bootloader, and found that it was around 4 times faster in terms of compression speed and slightly better in terms of compression ratio as well (as you can see below), while providing the same decompression time as LZSS.C.

[Image: QuickLZ]

U-Boot shenanigans

24 Jun 2013 in research

One of the things that has always irritated me about u-boot is how command oriented it was. This didn't play nicely with the way I wanted to load XNU, as I had to come up with a solution that would allow me to load drivers before loading the actual kernel.

XNU has three ways of loading drivers at boot time - driver entries in DT, a mkext in DT or a kernelcache. The kernelcache method is the one Apple uses to "load" drivers on iOS devices, where the entire kernel is prelinked into a single blob with drivers. An obvious disadvantage of this method is that those drivers cannot really be unloaded later as they're a part of the kernel image. The device tree method involves loading the bare minimum set of drivers into the kernel memory (for example, the main platform driver and something like an MMC driver) and adding them to the device tree with names like Driver-<ADDRESS>. This is what I've decided to implement in my fork of u-boot.

Because some files may be loaded over TFTP and some from MMC, I needed a unified format which I could use for compressing the files as well as unifying them prior to transfers (as initiating a TFTP transfer takes around 3 seconds, so it was best to only do it once). I've also wanted to utilise LZSS compression in order to reduce the amount that needed to be transferred. So what I've ended up doing is shoving the kernel itself and all the drivers into tables of content (which could contain a single or multiple files). That way files could be kept separate on the MMC while allowing me to do batch transfers of multiple drivers and possibly the kernel in one go over TFTP.

typedef struct {
    uint32_t magic;
    uint32_t ncmds;
    /* ... first command ... */
} table_of_contents_t;

This also meant implementing two new commands to replace bootx. The first command, imgx, loaded a general image (either a table of contents or a mach command container straight away) and, depending on the flags in the container, either treated it as a driver, adding it to a linked list, or as a kernel, which would actually have to be mapped out.

else if (command->flags & kMachKernel) {
    /* ... */

    /* Hand over to the mach-o loader */
    ret = load_macho(
        image_address,
        image_size,
        command->load_address,
        gKernelMemoryTop,
        &entry_point,
        &size
    );

    /* ... */
}
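The overall shape of the imgx dispatch can be sketched as a walk over the table of contents. Everything about the command layout below is an assumption made for illustration: the magic value, the flag values, the kMachDriver name and the flags/size fields of toc_command_t are all hypothetical, and the real loader hands the payloads to the driver list or the mach-o loader rather than just counting them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOC_MAGIC   0x58474D49u  /* hypothetical magic value */
#define kMachDriver 0x2u         /* hypothetical flag name and value */
#define kMachKernel 0x1u         /* flag name from the post, value hypothetical */

typedef struct {
    uint32_t magic;
    uint32_t ncmds;
} toc_header_t;

/* Hypothetical command header; the payload follows it directly. */
typedef struct {
    uint32_t flags;
    uint32_t size;      /* payload size in bytes */
} toc_command_t;

typedef struct {
    unsigned drivers;
    unsigned kernels;
} toc_summary_t;

/* Walks every command in the table, bounds-checking as it goes.
 * Returns 0 on success, -1 on a malformed or truncated image. */
int toc_walk(const uint8_t *buf, size_t len, toc_summary_t *out)
{
    toc_header_t hdr;
    size_t off = sizeof(hdr);

    if (len < sizeof(hdr)) return -1;
    memcpy(&hdr, buf, sizeof(hdr));
    if (hdr.magic != TOC_MAGIC) return -1;

    out->drivers = 0;
    out->kernels = 0;

    for (uint32_t i = 0; i < hdr.ncmds; i++) {
        toc_command_t cmd;

        if (len - off < sizeof(cmd)) return -1;
        memcpy(&cmd, buf + off, sizeof(cmd));
        off += sizeof(cmd);
        if (len - off < cmd.size) return -1;

        if (cmd.flags & kMachDriver) {
            out->drivers++;     /* real code: append to the driver list */
        } else if (cmd.flags & kMachKernel) {
            out->kernels++;     /* real code: hand over to the mach-o loader */
        }
        off += cmd.size;        /* skip the payload */
    }
    return 0;
}
```

The point of the sketch is that a single transfer can carry any mix of drivers and a kernel, and that the loader only needs the per-command flags to decide where each payload goes.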

The other new command was the mach_boot command, which used the data accumulated by all imgx commands to bootstrap and execute the kernel image, in a similar way to the old bootx command.

I also needed to add a kCommandXMLDeviceTree command type which indicated an XML device tree specification file that would be loaded and preserved somewhere, for mach_boot to construct a flattened DT from.

Porting XNU (iOS kernel) to BeagleBoard-xM (DM37x): Part 2

19 Jun 2013 in research

This is the second part of my "XNU on BeagleBoard" saga, which took me quite some time to get right after I ran into annoying bugs related to both my implementation of machine dependent OSFMK components and the technical "quirks" of OMAP3. You can find the first part of this here.

Timers and Interrupts

To pick up where I left off last time, I had managed to get the kernel to boot up to RTC initialisation, where it panicked due to the system clock frequency being 0 Hz (as the system clock code for this board wasn't implemented yet). So the course of action involved first implementing the interrupt controller initialisation code, followed by timer initialisation. DM37x has 11 general purpose timers (GPTIMER) and I've picked GPTIMER1 as the main system timer, which was clocked at 32kHz (actually, the timer logic can be clocked higher, but 32kHz was more than enough for this).

I've set up timer and interrupt controller code in the OMAP3 platform expert (pexpert/arm/OMAP3) adding routines for timebase init, handling interrupts and handling timer-specific interrupts. Interrupts that weren't timer interrupts were left to the IOKit interrupt controller to handle.

For the purpose of this, the timer was set up in incrementing mode, where an interrupt was generated on timer overflow. For now, I've put the timer on the IRQ line, but using FIQ for timer interrupts could be a better idea in the long run, although not crucial.
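With an up-counting timer that interrupts on overflow, the tick rate is set by the value the counter is reloaded with. Here is a small sketch of that arithmetic. The function name and the 100 Hz tick rate used in the example are made up, and I'm assuming a 32-bit counter that overflows when it increments past 0xFFFFFFFF and an input clock of exactly 32768 Hz.

```c
#include <stdint.h>

/* Returns the reload value for an up-counting 32-bit timer so that it
 * overflows tick_hz times per second when clocked at timer_hz.
 * Precondition: 0 < tick_hz <= timer_hz. */
uint32_t timer_load_value(uint32_t timer_hz, uint32_t tick_hz)
{
    uint32_t counts_per_tick = timer_hz / tick_hz;

    /* Start counts_per_tick counts below the overflow point, which is
     * 2^32 - counts_per_tick in unsigned 32-bit arithmetic. */
    return (uint32_t)(0u - counts_per_tick);
}
```

For a 32768 Hz clock and a 100 Hz tick this gives 327 counts per tick and a load value of 0xFFFFFEB9. A higher load value leaves fewer counts before the overflow, so the interrupt fires more often.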

One thing that caught me out is that the timer interrupt status register has to be cleared in GPTIMER before allowing new interrupts in the interrupt controller, otherwise you get stuck in an interrupt loop.

/* clear timer interrupt */
gSysTimer->TISR |= 0x7;

/* enable new interrupts */
r32(gPIC + INTCPS_CONTROL) = 1;
asm volatile("dsb");

Userland: Attempt #1

Shortly after setting up interrupts and the system clock, I was able to boot past IOKit initialisation and up to userland startup. To test everything, I decided to use an iOS recovery ramdisk as a base, replacing restored with my simple daemon that would print some junk and then hang. To get further, I prepared the ramdisk and uploaded it onto the MMC card. I also had to add an option to uBoot to load this ramdisk, so I extended the bootx command to support the ramdisk address as the last argument. Since HFS+ is big endian, I had to flip the endianness when determining the total size of the volume (for copying it into the kernel memory region while still in the bootloader):

size = (OSSwapInt32(header->totalBlocks) * OSSwapInt32(header->blockSize));
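The same calculation can be written without relying on the host byte order at all. The sketch below is not the bootloader code: the helper names are made up, and it takes pointers to the raw big-endian totalBlocks and blockSize fields rather than a parsed volume header. It also does the multiplication in 64 bits, since the product of two 32-bit values can overflow for large volumes.

```c
#include <stdint.h>

/* Reads a 32-bit big-endian value from four raw bytes. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Total volume size in bytes from the raw on-disk fields. */
uint64_t hfs_volume_size(const uint8_t *total_blocks_be, const uint8_t *block_size_be)
{
    return (uint64_t)read_be32(total_blocks_be) * read_be32(block_size_be);
}
```

For example, 4096 blocks of 4096 bytes each come out as 16777216 bytes (16 MiB), regardless of whether the code runs on a little or big endian CPU.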

After implementing the code for loading the ramdisk, I changed my boot command in the environment variables, adding this to it:

"echo loading ramdisk.dmg from flash ...;" \
"fatload mmc ${mmcdev} ${rdsk_addr} /ramdisk.dmg;"\
"echo starting kernel *with* ramdisk ...;" \
"bootx 0x80001000 ${kern_addr} ${dt_addr} ${rdsk_addr}\0" \

Domain Saga

To give a little introduction, ARMv7 has this nifty feature called domains, which affects how the MMU treats permissions for pages. Each domain has a mode setting denoted by two bits. Bits 11 denote manager access, in which access flags for a page are not checked (RWX everywhere), and bits 01 denote client access, in which access flags are checked.

Now, going back to booting up the userland: after launchd[1] had forked, I was experiencing all sorts of weird crashes all over the userland (several launchds were spawned, some of which crashed while others hung; here is an example log). What was even more odd was that, at the time, the stack was also ending up being executed for some unknown reason.

After spending a lot of time thinking this was an issue in the fork system call, I finally decided to investigate why the stack was being executed. My initial suspicion was a bug in my pmap (physical mapper) where pages somehow ended up getting the wrong access bits. After tracing all entered mappings, I realised that this was not the case. It looked like all access permission bits were simply being ignored, so I checked the actual bits to make sure they were correct, and they were. I was really confused.

After some time, I had nearly given up when I decided to check DACR, and boom, I had a revelation: uBoot was setting DACR to 0xffffffff, unlike bootkit, which set it to 0x00000001. This is why all access bits were ignored. Fixing this up by setting domain 0 to client mode fixed the "forking" issue, allowing launchd[1] to start my little daemon.

 old dacr = 0xffffffff, new dacr=0x01
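To see why those two values behave so differently, it helps to decode DACR field by field. The register holds 16 domains with two bits each (00 is no access, 01 client, 10 reserved, 11 manager). The helper below is just an illustration with made-up names.

```c
#include <stdint.h>

/* The four possible encodings of a 2-bit domain field in DACR. */
enum domain_mode {
    DOMAIN_NO_ACCESS = 0,   /* any access generates a domain fault */
    DOMAIN_CLIENT    = 1,   /* page permission bits are checked */
    DOMAIN_RESERVED  = 2,
    DOMAIN_MANAGER   = 3    /* page permission bits are ignored */
};

/* Extracts the mode of domain 0..15 from a DACR value. */
enum domain_mode dacr_domain_mode(uint32_t dacr, unsigned domain)
{
    return (enum domain_mode)((dacr >> (2 * (domain & 15))) & 3);
}
```

With 0xffffffff every domain decodes as manager, so no page permissions are ever checked. With 0x00000001 domain 0 is a client and the other fifteen domains have no access at all.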

removeKextBootstrap

Another weird quirk I noticed was that I was sometimes getting crashes while the kernel was printing out its backtraces (my debug code in the kernel does symbol lookups when printing backtraces). As I suspected that something was overwriting pages in the kernel identity mapping, I added a little macro to all pmap calls to make sure that the identity mapping was not being removed or overwritten.

/* Panic if va falls inside the kernel identity mapping. */
#define CHECK_VIRTUAL_ADDRESS(va) \
    do { \
        if ((va) > gVirtBase && (va) < (gVirtBase + mem_size)) { \
            panic("%s: va 0x%08x is in the identity region", __FUNCTION__, (va)); \
        } \
    } while (0)

This quickly revealed the culprit: OSKext::removeKextBootstrap, which was responsible for making the kernel's __LINKEDIT (a section in the kernel binary that holds the symbol table) swappable. Because we didn't have swapping on ARM anyway, I decided to put the call to ml_static_mfree under #ifndef __arm__. I would like to note that this part of the routine only gets invoked when you are building your kernel with kxld (the kext linker) in it, so that the kernel is able to load kernel extensions (commonly known as drivers) dynamically.

iOS devices do not typically do that, and instead rely on a prelinked kernel also known as a kernelcache. While I have the option of building one and using it, I did not feel that it was necessary for now, and I thought that loading kexts dynamically could certainly be useful.

Final Frontier

At first glance, I had managed to get the userland to boot. But when I tweaked the system clock rate (higher load value, so quicker overflow), I started getting crashes in dyld when my daemon started up. I determined that the crash was coming from macho_header being passed as NULL to dyld::start, as opposed to 0x00001000. This value is passed on the initial stack of the application and is set in exec_mach_imgact in bsd/kern/kern_exec.c:

 ap = thread_adjuserstack(thread, -new_ptr_size);
 error = copyoutptr(load_result.mach_header, ap, new_ptr_size);

After two days of debugging, I was not able to get any closer. My watchpoints were not being triggered, and that's assuming the condition had even occurred. Baffled by that, I decided to trace the pmap entries made by exec_mach_imgact and found that, when the crash occurred, the pmap into which the stack info was entered was not the same pmap as the one used when the process executed. Hooray, race condition.

It turned out that the code responsible for process control block switching was not entirely correct and failed under certain conditions, which could only occur while a mach-o binary was getting started and only if this process was interrupted by a context switch. I was meant to compare thread maps and not task maps before doing the switch:

/* Should do this: */
if (vm_map_pmap(old_th->map) != vm_map_pmap(new_th->map)) {

/* Instead of this: */
if (vm_map_pmap(old_th->task->map) != vm_map_pmap(new_th->task->map)) {

Having fixed that, I was able to get everything to start up and run as expected in all of my tests. I was able to get launchd to start up and execute my simple boot.rc, which then forked and started retardrc, which prints out some stuff before intentionally crashing by doing a NULL deref :)

[Image: xnu single user boot]

I would guess that the next step would be getting my IOSurface and IOMobileFramebuffer drivers to work, so I would have something very very cool to show. But as far as this post goes, the initial bringup on this board is done and I've enjoyed it.