Blobless Linux on Raspberry Pi (rpi-open-firmware).

14 Jan 2017 in research

While there was a brief hiatus in the development of the open firmware for Raspberry Pi, we can now boot Linux straight from bootcode.bin (note that this currently requires an initrd built into the zImage, as there are still issues with the eMMC working reliably at boot).

We are also currently investigating and getting closer to getting the following working:

  • USB PHY: Preliminary driver for USB PHY initialisation is ready but USB itself relies on the DMA engine working and we're still figuring that out.
  • Centralised power and clock management drivers in the firmware (already used for BCM2708PowerDomainARM and BCM2708PowerDomainImage)
  • eMMC (SDHOST) driver reliability (at the moment it seems to have 1 in 100 or so failure rate, requiring a reboot).
  • eMMC (SDHOST) teardown for the Linux kernel.

The current trunk of rpi-open-firmware is able to boot a minimal Linux kernel image. You will need cmdline.txt, rpi.dtb (the compiled device tree for your rPi model) and zImage on your boot partition (the same place bootcode.bin resides). You're encouraged to build a minimal kernel and try it out (a good starting point is this .config file for the latest Linux kernel, which also includes correctly configured early printk support).

A few caveats:

  • You need to use the SDHOST driver and make sure it's in your device tree if you want any chance of eMMC being recognised at boot (though this is less relevant given that an initrd is currently almost required).
  • Because of memory map differences, stock Linux will not currently boot on rPi1 models; anything above that (BCM2709 - rPi2 / BCM2710 - rPi3) should be fine, however.
  • GPIOs work, eMMC (with SDHOST driver) kind of works, USB, DMA and video stuff do not work yet (as outlined above, we're working on bringing them up).

Here is a snippet of Linux booting up (you can find the full log, including firmware logs, here):

[LDR:LoaderImpl]: Jumping to the Linux kernel...
Uncompressing Linux... done, booting the kernel.
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 4.4.31-v7+ (alyssa@debian) (gcc version 6.2.1 20161124 (Debian 6.2.1-5) ) #81 Fri Jan 6 13:44:14 PST 2017
[    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c53c7d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[    0.000000] Machine model: Raspberry Pi 2 Model B
/ # uname -a
Linux (none) 4.4.31-v7+ #81 Fri Jan 6 13:44:14 PST 2017 armv7l GNU/Linux

If you're interested in contributing or have any questions, feel free to join #raspberrypi-internals on Freenode and check out the project on GitHub.

Improving LLVM for VideoCore4 (Raspberry Pi VPU): Part 1

20 Mar 2016 in research

Recently, I've been interested in messing around with the Raspberry Pi VPU, but sadly I was not able to find any existing C compiler that I would consider viable. The first compiler I looked at was VBCC, but the version that I built from source appeared to have several significant bugs leading to incorrect code generation, and after spending a significant amount of time trying to fix them I gave up. The second option was the VideoCore4 fork of The Amsterdam Compiler Kit, but unfortunately it only supported ANSI C and generated suboptimal code.

I therefore decided to improve David Given's LLVM for VideoCore4 fork in order to get it to generate semi-reasonable code based on the LLVM IR produced by Clang.

Fixing VA args

One of the main things I wanted to get to work was a printf-like function, which required adding VA args support to VideoCore4TargetLowering::LowerCCCArguments. I decided to use the XCore backend as a reference since it had a fairly similar calling convention and an easy-to-understand VA args ABI. Recycling code from the XCore backend was simple enough and involved just replacing the XCore specific parts (such as register classes).

In order to further support VA args, the ISD::FrameIndex DAG node had to be correctly lowered which simply involved converting it into an addition of the frame index to the stack pointer.

res = CurDAG->getMachineNode(VideoCore4::ADD_R_RRI, dl, VT, TFI, CurDAG->getRegister(VideoCore4::SP, MVT::i32));

As the last step, lowering of the ISD::VAARG and ISD::VASTART nodes (which are emitted when the va_start/va_arg macros are used) had to be added, which was, again, heavily based on the XCore backend.

After implementing those primitives, any function that accepted VA args would have a correct prolog that involved pushing all argument registers onto the stack and addressing them through the add/ld combination.
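In C terms, the net effect of that prolog is exactly what standard va_arg semantics require: the variadic callee can walk its spilled arguments as one contiguous area on the stack. A minimal sketch (the function here is illustrative, not taken from the backend or its test suite):

```c
#include <stdarg.h>

/* Minimal variadic function: once the prolog has spilled the argument
 * registers onto the stack, va_arg just walks that contiguous area,
 * one slot per call (the add/ld combination mentioned above). */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);

    return total;
}
```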

Jump Tables

The next step was implementing jump tables, which involved lowering the ISD::BR_JT node. The way I dealt with it involved creating a pseudo instruction and then just emitting an inline jump table. While it wasn't especially nice (considering VideoCore4 has tbb/tbs instructions that deal specifically with jump tables), it worked as a temporary hack.

# Inline jumptable on r1
# Jumptable entry 0
cmp r1, 0
beq BB3_36
# Jumptable entry 1
cmp r1, 1
beq BB3_39
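The emitted compare chain behaves like a hand-rolled switch. As a C sketch of the same control flow (the return values and the default are placeholders, not taken from the real code):

```c
/* Equivalent control flow to the inline "jump table" above: one
 * cmp/beq pair per table entry, falling through to the default. */
static int dispatch(int r1)
{
    if (r1 == 0) return 36;  /* cmp r1, 0; beq BB3_36 */
    if (r1 == 1) return 39;  /* cmp r1, 1; beq BB3_39 */
    /* ... one cmp/beq pair per remaining entry ... */
    return -1;               /* default block */
}
```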


Because of expansions performed by LLVM, I often found it generating ISD::SELECT and ISD::BRCOND nodes that were not fed by an ISD::SETCC node, which caused a lot of failed selections. After brief research, I found out that these nodes were the equivalent of if (var != 0) then X else Y, so I lowered them to a sequence of cmp and either b or mov instructions with condition codes.
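Written out in C, that lowering amounts to the following (a sketch of the semantics; the predicated-mov mnemonic in the comments is an assumption, not the backend's actual output):

```c
/* ISD::SELECT with a plain integer condition is just a ternary:
 *   res = (var != 0) ? x : y;
 * lowered as a cmp followed by a conditionally executed mov. */
static int select_i32(int var, int x, int y)
{
    int res = y;      /* mov res, y                      */
    if (var != 0)     /* cmp var, 0                      */
        res = x;      /* conditional mov on "not equal"  */
    return res;
}
```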

Toolchain pain

While the VideoCore4 LLVM backend was capable of generating assembly files, it could not emit actual machine code without instruction encoding information being added to TableGen and a machine instruction emitter being implemented (which would also mean coming up with a relocation format and writing a separate linker). I decided not to take that route for the time being and instead use an external assembler/linker to produce the final firmware blob.

binutils for VideoCore4

The GNU binutils fork for VideoCore4 was an obvious option as the target toolchain, having the advantage of being free as in freedom. After fighting autotools, I finally managed to build it and tried compiling and linking the firmware binary with a flat linker script. After running it on the Raspberry Pi, it hung up early at boot and appeared to be unresponsive. After disassembling the binary I found out that instead of using PC-relative addresses in b/bl/lea instructions, it was instead encoding absolute zero-relative addresses into PC-relative instructions. Oops ...

After briefly looking at the code, I decided to leave it for the time being since there was another option for the toolchain.


VASM and VLINK

While VASM/VLINK are free, their source code is provided for reference only and, unfortunately, it is not possible to redistribute modifications without the author's prior consent. Because those tools seemed to work reliably after I fixed a small bug in the VASM assembler (it was ignoring 48-bit encodings of arithmetic instructions) and made it play nicely with some of the directives produced by LLVM, I decided to target that toolchain to generate the firmware binary (bootcode.bin).

And while it may seem insignificant, it felt so good to have printf finally working correctly:

printf on RaspberryPi

(More information to follow)

U-Boot shenanigans

24 Jun 2013 in research

One of the things that has always irritated me about u-boot is how command-oriented it is. This didn't play nicely with the way I wanted to load XNU, as I had to come up with a solution that would allow me to load drivers before loading the actual kernel.

XNU has three ways of loading drivers at boot time - driver entries in DT, a mkext in DT or a kernelcache. The kernelcache method is the one Apple uses to "load" drivers on iOS devices, where the entire kernel is prelinked into a single blob with drivers. An obvious disadvantage of this method is that those drivers cannot really be unloaded later as they're a part of the kernel image. The device tree method involves loading the bare minimum set of drivers into the kernel memory (for example, the main platform driver and something like an MMC driver) and adding them to the device tree with names like Driver-<ADDRESS>. This is what I've decided to implement in my fork of u-boot.

Because some files may be loaded over TFTP and some from MMC, I needed a unified format which I could use for compressing the files as well as bundling them prior to transfers (initiating a TFTP transfer takes around 3 seconds, so it was best to only do it once). I also wanted to utilise LZSS compression in order to reduce the amount of data that needed to be transferred. So what I ended up doing is shoving the kernel itself and all the drivers into tables of contents (each of which could contain a single file or multiple files). That way files could be kept separate on the MMC while allowing me to do batch transfers of multiple drivers, and possibly the kernel, in one go over TFTP.

typedef struct {
    uint32_t magic;
    uint32_t ncmds;
    /* ... first command ... */
} toc_header_t; /* name illustrative */
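Assuming each command carries its own size (the field and type names below are my reconstruction for illustration, not the actual on-disk format), walking a container looks roughly like this:

```c
#include <stdint.h>

/* Reconstructed layout: a header followed by ncmds self-sizing
 * commands, so a walker can hop from one to the next. */
typedef struct {
    uint32_t magic;
    uint32_t ncmds;
    /* commands follow immediately after the header */
} toc_header_t;

typedef struct {
    uint32_t flags;   /* e.g. kernel vs. driver */
    uint32_t size;    /* size of this command including payload */
} toc_command_t;

/* Count the commands whose flags match `flag`. */
static int toc_count(const toc_header_t *hdr, uint32_t flag)
{
    const uint8_t *p = (const uint8_t *)(hdr + 1);
    int n = 0;

    for (uint32_t i = 0; i < hdr->ncmds; i++) {
        const toc_command_t *cmd = (const toc_command_t *)p;
        if (cmd->flags == flag)
            n++;
        p += cmd->size;   /* skip over this command and its payload */
    }
    return n;
}
```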

This also meant implementing two new commands to replace bootx. The first command, imgx, loaded a general image (either a table of contents or a mach command container directly) and, depending on the flags in the container, either treated it as a driver (adding it to a linked list) or as a kernel (which would actually have to be mapped out).

else if (command->flags & kMachKernel) {
    /* ... */

    /* Hand over to the mach-o loader */
    ret = load_macho(

    /* ... */

The other new command was mach_boot, which used the data accumulated by all the imgx commands to bootstrap and execute the kernel image, in a similar way to the old bootx command.

I also needed to add a kCommandXMLDeviceTree command type which indicated an XML device tree specification file that would be loaded and preserved somewhere, for mach_boot to construct a flattened DT from.

Porting XNU (iOS kernel) to BeagleBoard-xM (DM37x): Part 2

19 Jun 2013 in research

This is the second part of my "XNU on BeagleBoard" saga, which took me quite some time to get right after I ran into annoying bugs related to both my implementation of machine dependent OSFMK components and the technical "quirks" of OMAP3. You can find the first part of this here.

Timers and Interrupts

To pick up where I left off last time, I had managed to get the kernel to boot up to RTC initialisation, where it panicked due to the system clock frequency being 0 Hz (as the system clock code for this board wasn't implemented yet). So the course of action involved first implementing the interrupt controller initialisation code, followed by timer initialisation. The DM37x has 11 general purpose timers (GPTIMER) and I picked GPTIMER1 as the main system timer, clocked at 32 kHz (the timer logic can actually be clocked higher, but 32 kHz was more than enough for this).

I've set up timer and interrupt controller code in the OMAP3 platform expert (pexpert/arm/OMAP3) adding routines for timebase init, handling interrupts and handling timer-specific interrupts. Interrupts that weren't timer interrupts were left to the IOKit interrupt controller to handle.

For the purpose of this, the timer was set up in incrementing mode, where an interrupt was generated on timer overflow. For now, I've put the timer on the IRQ line, but using FIQ for timer interrupts could be a better idea in the long run, although not crucial.
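For an up-counter that interrupts on overflow, the reload value comes out of "maximum count minus ticks per period". A sketch of that arithmetic, under the assumption of the 32768 Hz timer clock mentioned above and a 100 Hz system tick (the actual tick rate I used isn't stated here; on OMAP the GPTIMER TLDR register holds this reload value):

```c
#include <stdint.h>

/* In incrementing mode the counter runs from its load value up to
 * 0xFFFFFFFF and raises the overflow interrupt on wrap, so the reload
 * value is the maximum count minus the number of timer ticks per
 * system tick, plus one for the wrap itself. */
static uint32_t gptimer_load_value(uint32_t clock_hz, uint32_t tick_hz)
{
    return 0xFFFFFFFFu - (clock_hz / tick_hz) + 1u;
}
```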

One thing that caught me out is that the timer interrupt has to be cleared in GPTIMER before allowing new interrupts in the interrupt controller, otherwise you get stuck in an interrupt loop.

/* clear timer interrupt */
gSysTimer->TISR |= 0x7;

/* make sure the clear has landed before allowing new interrupts */
asm volatile("dsb");

Userland: Attempt #1

Shortly after setting up interrupts and the system clock, I was able to boot past IOKit initialisation and up to userland startup. To test everything, I decided to use an iOS recovery ramdisk as a base, replacing restored with my own simple daemon that would print some junk and then hang. To get further, I prepared the ramdisk and uploaded it onto the MMC card. I also had to add an option to u-boot to load this ramdisk, so I extended the bootx command to support the ramdisk address as the last argument. Since HFS+ is big endian, I had to flip the endianness when determining the total size of the volume (for copying it into the kernel memory region while still in the bootloader):

size = (OSSwapInt32(header->totalBlocks) * OSSwapInt32(header->blockSize));
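The swap itself is mechanical. A standalone sketch of an OSSwapInt32-style byte swap and the size computation (the helper names here are mine, not u-boot's or XNU's):

```c
#include <stdint.h>

/* HFS+ volume header fields are stored big-endian; on a little-endian
 * ARM they have to be byte-swapped before use. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0xff00u) |
           ((v << 8) & 0xff0000u) | (v << 24);
}

/* Total volume size = totalBlocks * blockSize, both big-endian. */
static uint64_t hfs_volume_size(uint32_t total_blocks_be,
                                uint32_t block_size_be)
{
    return (uint64_t)swap32(total_blocks_be) * swap32(block_size_be);
}
```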

After implementing the code for loading the ramdisk, I changed my boot command in the environment variables, adding this to it:

"echo loading ramdisk.dmg from flash ...;" \
"fatload mmc ${mmcdev} ${rdsk_addr} /ramdisk.dmg;"\
"echo starting kernel *with* ramdisk ...;" \
"bootx 0x80001000 ${kern_addr} ${dt_addr} ${rdsk_addr}\0" \

Domain Saga

To give a little introduction, ARMv7 has this nifty feature called domains, which affect how the MMU treats permissions for pages. Each domain's mode is denoted by two bits in DACR: 11 denotes manager access, in which the access flags for a page are not checked (RWX everywhere), and 01 denotes client access, in which the access flags are checked.
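The encoding is compact enough to sketch. Two bits per domain, sixteen domains per 32-bit register (the helper and constant names below are mine):

```c
#include <stdint.h>

/* DACR packs 2 bits per domain: 0b01 = client (permission bits are
 * checked), 0b11 = manager (permission bits are ignored). */
#define DOMAIN_CLIENT  0x1u
#define DOMAIN_MANAGER 0x3u

/* Return `dacr` with the given domain's 2-bit field set to `access`. */
static uint32_t dacr_set(uint32_t dacr, unsigned domain, uint32_t access)
{
    dacr &= ~(0x3u << (domain * 2));    /* clear the 2-bit field */
    return dacr | (access << (domain * 2));
}
```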

Now, going back to booting up the userland: after launchd[1] had forked, I was experiencing all sorts of weird crashes all over the userland (several launchds were spawned, and some crashed while others hung up, here is an example log). What was even odder was that the stack was also ending up being executed for some unknown reason.

After spending a lot of time thinking this was an issue in the fork system call, I finally decided to investigate why the stack was being executed. My initial suspicion was a bug in my pmap (physical mapper) where pages somehow ended up getting the wrong access bits. After tracing all entered mappings, I realised that this was not the case. It looked like all access permission bits were simply being ignored; I checked the actual bits to make sure they were correct, and they were. I was really confused.

After some time, I had nearly given up, when I decided to check DACR, and boom. I had a revelation - u-boot was setting DACR to 0xffffffff, unlike BootKit which set it to 0x00000001. This is why all access bits were ignored. Fixing this up and setting domain 0 to client mode fixed the "forking" issue, allowing launchd[1] to start my little daemon.

 old dacr = 0xffffffff, new dacr=0x01


Another weird quirk I noticed was that sometimes I was getting crashes while the kernel was printing out its backtraces (my debug code in the kernel does symbol lookups when printing backtraces). As I suspected that something was overwriting pages in the kernel identity mapping, I added a little macro to all pmap calls to make sure that the identity mapping was not being removed or overwritten.

    if (va > gVirtBase && va < (gVirtBase + mem_size)) { \
         panic("%s: va 0x%08x is in the identity region", __FUNCTION__, va); \

This quickly revealed the culprit - OSKext::removeKextBootstrap, which was responsible for making the kernel's __LINKEDIT (a section in the kernel binary that holds the symbol table) swappable. Because we didn't have swapping on ARM anyway, I decided to put the call to ml_static_mfree under #ifndef __arm__. I would like to note that this part of the routine only gets invoked when you build your kernel with kxld (the kext linker) in it, so the kernel is able to load kernel extensions (commonly known as drivers) dynamically.

iOS devices do not typically do that, and instead rely on a prelinked kernel also known as a kernelcache. While I have the option of building one and using it, I did not feel that it was necessary for now, and I thought that loading kexts dynamically could certainly be useful.

Final Frontier

At first glance, I had managed to get the userland to boot. But when I tweaked the system clock rate (a higher load value, so quicker overflow), I started getting crashes in dyld when my daemon started up. I determined that the crash was coming from macho_header being passed as NULL to dyld::start, as opposed to 0x00001000. This value is passed on the initial stack of the application and is set in exec_mach_imgact in bsd/kern/kern_exec.c:

 ap = thread_adjuserstack(thread, -new_ptr_size);
 error = copyoutptr(load_result.mach_header, ap, new_ptr_size);

After two days of debugging, I was not able to get any closer. My watchpoints were not being triggered, and that's assuming the condition even occurred. Baffled by that, I decided to trace the pmap entries made by exec_mach_imgact and found that when this occurred, the pmap into which the stack info was entered was not the same pmap as the one used when the process executed. Hooray, race condition.

It turned out the code responsible for process control block switching was not entirely correct and failed under certain conditions which could only occur while a mach-o binary was being started, and only if that process was interrupted by a context switch. I was meant to compare thread maps and not task maps before doing the switch:

/* Should do this: */
if (vm_map_pmap(old_th->map) != vm_map_pmap(new_th->map)) {

/* Instead of this: */
if (vm_map_pmap(old_th->task->map) != vm_map_pmap(new_th->task->map)) {

Having fixed that, I was able to get everything to start up and run as expected in all of my tests. I was able to get launchd to start up and execute my simple boot.rc, which then forked and started retardrc, which prints out some stuff before intentionally crashing with a NULL dereference :)

xnu single user boot

I would guess that the next step would be getting my IOSurface and IOMobileFramebuffer drivers to work, so I would have something very very cool to show. But as far as this post goes, the initial bringup on this board is done and I've enjoyed it.

Porting XNU to BeagleBoard-xM (DM37x): Part 1

11 Jun 2013 in research

After a fairly long period of not doing anything due to personal issues, I've decided to finally get XNU running on the DM37x system on a chip developed by Texas Instruments. To assist early bringup I've decided to use a JTAG debugger and a serial port adapter.

Baby Steps

To start off, I needed to find a suitable way of booting the XNU kernel on BeagleBoard so I could test my code as soon as possible. My initial plan was to port my own bootloader, BootKit (which already has support for XNU), to that platform and then use it to boot up the kernel. About two days into working on the low level initialisation code for the platform bits, it started to seem like a waste of time and I abandoned this plan.

My next choice was the Das U-Boot bootloader which already had full support for BeagleBoard and was used both as the first level and the second level bootloader. Obviously, this bootloader lacked any support for the XNU kernel so I had to develop my own "extension" for it. To simplify the task, I've decided to recycle the code from the BootKit bootloader, integrating it into the u-boot tree and implementing a command that would compile the device tree definitions, map the mach-o kernel and execute it, which I've named bootx.

In essence, this command would load mach_kernel, map out the mach-o executable file at a certain address (0x80001000 for XNU), parse the XML device tree specification (yes, that's right, an XML parser in a bootloader), flatten it, populate the BootArgs structure and call the kernel entry point, passing the boot arguments to it.
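"Mapping out" a 32-bit mach-o file boils down to checking the header magic and iterating its load commands, copying each segment to its destination. A minimal taste of that iteration (the constants and struct layouts are the standard ones from mach-o/loader.h; the function itself is my sketch, not bootx code):

```c
#include <stdint.h>

/* 32-bit Mach-O constants and headers, as in mach-o/loader.h. */
#define MH_MAGIC   0xfeedfaceu
#define LC_SEGMENT 0x1u

struct mach_header {
    uint32_t magic, cputype, cpusubtype, filetype;
    uint32_t ncmds, sizeofcmds, flags;
};

struct load_command {
    uint32_t cmd, cmdsize;
};

/* Walk the ncmds load commands and count the LC_SEGMENT entries
 * (a real loader would copy each segment to its target address). */
static int count_segments(const struct mach_header *mh)
{
    const uint8_t *p = (const uint8_t *)(mh + 1);
    int segs = 0;

    if (mh->magic != MH_MAGIC)
        return -1;   /* not a 32-bit mach-o image */

    for (uint32_t i = 0; i < mh->ncmds; i++) {
        const struct load_command *lc = (const struct load_command *)p;
        if (lc->cmd == LC_SEGMENT)
            segs++;
        p += lc->cmdsize;   /* commands are self-sizing */
    }
    return segs;
}
```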

Loading mach_kernel

In order to ease the pain of repetitively having to swap out the SD card, I decided to use u-boot's built-in TFTP support to download the kernel from my development machine (downloading it over UART was out of the question because the debug kernel image was just over 8MB in size). I ended up with the following command in my environment that would download and start the kernel:

/* Boot XNU from the network */ \
"xn=setenv ipaddr;" \
    "setenv serverip;" \
    "setenv usbethaddr ba:d0:4a:9c:4e:ce;" \
    "usb start;" \
    "echo downloading mach_kernel from tftp ...;" \
    "tftpboot 0x84000000 /mach_kernel;"\
    "echo loading devicetree.plist from flash ...;" \
    "fatload mmc ${mmcdev} 0x83000000 /devicetree.plist;"\
    "echo starting kernel ...;" \
    "bootx 0x80001000 0x84000000 0x83000000 0x0\0"

Platform Expert

The platform expert is a hardware abstraction module used by XNU to access machine specific components (not to be confused with architecture specific components like the MMU); it consists of early access to the serial port, interrupt controller initialisation and handling of interrupts from the system timer. Unlike the x86 version, the platform expert routines are specific to each machine. For now, I have only decided to implement the serial output routine for BeagleBoard. Having done that, I tweaked several kernel makefiles to add support for OMAP3 as a machine config. I could now build the kernel using the following command:

make TARGET_CONFIGS="debug arm OMAP3" SDKROOT="/Darwin_ARM_SDK/"

Hello XNU

Of course, as expected, soon after starting, the kernel crashed very early without producing any visible output. I anticipated that early printing might produce some useful information, so I tried to implement semihosting in the kernel in order to get some early output, which unfortunately didn't quite work for an unknown reason. Instead of fiddling around with debugger configuration files, I decided to do something simpler and make my early printing routine print to a preallocated buffer, which I could then dump using the debugger.

Early panic dump

From my early dump, I could see that the kernel was panicking in arm_init due to a misaligned L1 table. I quickly fixed that in the bootloader and, having recompiled it, was able to get past early initialization, where it called machine_startup to do general kernel initialization.
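The alignment requirement itself is simple to state: with the default TTBCR setting, TTBR0 only holds the upper bits of the table base, so the L1 translation table must be 16 KB aligned. A quick check of the kind the bootloader could have done (a sketch, not the actual fix):

```c
#include <stdint.h>

/* ARMv7 L1 translation table must be 16 KB aligned with TTBCR.N = 0:
 * the low 14 bits of the base address must be zero. */
static int l1_table_aligned(uint32_t base)
{
    return (base & ((1u << 14) - 1u)) == 0;
}
```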

Odd crash

Shortly after being started, the kernel died due to a data abort happening before first VM map was created (full log here):

Fatal Exception: map is NULL, prob a fault during VM init, fault_addr is 0x8064bbb4

Quickly, I tried to read that address using a debugger, which seemed to work just fine, so I was sure it was not a virtual memory related problem. Just to be sure, I dumped the translation tables and examined the entry for this page, which was perfectly valid. I then looked at the data fault status register and noticed that it indicated that an external abort had occurred (0x8).

Because I was confused as to what this actually meant, I did some quick searching and found an article on the TI forums where someone complained about similar behaviour. Apparently, this SoC asserts an external abort when an STREX or LDREX instruction is used on a page that is mapped with the Shareable bit set. As this SoC is uniprocessor, I simply disabled setting that bit for pages, which resolved the issue.
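In the ARMv7 short-descriptor format, the Shareable attribute of an L1 section entry is bit 16, so the fix amounts to masking it out of the descriptors the pmap writes (a sketch of the idea, not the actual pmap code):

```c
#include <stdint.h>

/* ARMv7 short-descriptor L1 section entry: the Shareable (S) bit is
 * bit 16. On this uniprocessor SoC it is left clear so LDREX/STREX
 * don't trigger external aborts. */
#define L1_SECTION_S (1u << 16)

static uint32_t l1_section_nonshared(uint32_t descriptor)
{
    return descriptor & ~L1_SECTION_S;
}
```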

Next Steps

Because getting all of the output via a serial console seemed boring, I decided to implement video initialisation in the platform expert so I could use XNU's video console. Usually, this task is left to the bootloader, but I decided to implement it in the kernel. Because of the complexity of the OMAP DSS system, it took me a few hours to work out how to configure the display controller, after which I was able to populate the video parameters and use them to set up the console. This was the result:

Video console

In order to advance further, I had to implement support routines for setting up the interrupt controller and the system timer in the Platform Expert, which I'm going to explain in my next post.