Stripping, Compiling, Linking, Then Somehow, Loading the Driver! (followed by a kernel panic) (2026/06/11)

Hi guys, it's been a bit longer than expected. I originally wanted to post an update last week, but I didn't want to post one without some kind of substantial progress. Since then, I have made a bit of progress: not a ton, but a meaningful amount in the grand scheme of things.

To get the first thing out of the way, I will be at BSDCan 2026! Sadly, I cannot attend the dev summit as I have a midterm on Thursday evening covering classical mechanics and special relativity. That being said, I am going to be at the rest of the conference from Friday onwards, and will likely stick around for a bit on Sunday even after it ends.

Stripping the Driver

So, if you've been reading this series, you've likely realized that this is a massive project that I cannot realistically do perfectly or "correctly." If I want any chance of finishing it within the given deadline, I'll need to strip everything down to the bare minimum. This includes the driver itself.

To strip the driver, I deleted everything that was not strictly needed for headless compute. This meant deleting the display engine, old generations, connector logic, start/stop routines, codecs, and a few other things I cannot remember. Now, my first attempt at this was to just delete everything that didn't look useful (literally deleting the files). This turned out to be a terrible idea because the files within amdgpu are like a massive web of weird interdependencies (like, tell me why the JPEG engine was somehow transitively including power management). Doing it that way led to too many errors, and I could not be bothered to even try fixing them.

Pivoting the Strategy

Thus, I pivoted harder than a pre-seed startup pivots after they realize a project is too hard (iykyk). I pivoted to deleting as many files as I reasonably could from the makefile instead. Then, I kept applying patches. Every 300~400 patches, I would check to see if they would compile, spending maybe 20–30 minutes at those checkpoints. What I did not realize at the time is that the lld command had a flag that would disregard undefined functions...

After making it through about half the patches (I think I'm around 6.16~), I thought, "Hmm, I should probably check to see if it loads." As expected, it did not load. As one does, I looked at the dmesg output, and it said vcn_v2_???? smth smth undefined (it was actually a different function, but this is the most recent one I remember; there were also others from gfxhub, athub, mmhub, vce, and a lot of other files). So I ran nm -u, and saw... SO MANY FUNCTIONS ARE UNDEFINED.

So I did a bit of digging and found out that during the final step of building drm-kmod (the linking stage, not counting installation), it basically just stitches the objects together. I was REALLY surprised by this. Reading into why, it turns out that when you load a .ko module, the kernel DYNAMICALLY links everything at load time. So, all those undefined functions (which are actually defined by the kernel or by other modules) get resolved at load time, and anything that nobody defines makes the load fail. This was something I'd heard of before, but for some reason, I was not expecting it here.
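
To make the load-time resolution idea a bit more concrete, here is a minimal Python sketch of the check I could have run earlier: take the nm output of the module, take the nm output of whatever is supposed to provide its symbols, and list what is left over. This is only an illustration, not how the kernel linker is implemented. The sample text and every symbol name in it (vcn_v2_0_example_fn, amdgpu_example_init, amdgpu_static_helper) are made up, and it assumes the usual three-column "address type name" nm layout.

```python
# Global symbol types that can satisfy another object's undefined reference.
# Lowercase types (t, d, b, ...) are local/static and are ignored here.
GLOBAL_DEFINED = {"T", "D", "B", "R"}


def parse_nm(text):
    """Split nm-style output into (defined, undefined) sets of symbol names."""
    defined, undefined = set(), set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        # The last two fields are the type letter and the name;
        # undefined symbols simply have no address column.
        sym_type, name = parts[-2], parts[-1]
        if sym_type == "U":
            undefined.add(name)
        elif sym_type in GLOBAL_DEFINED:
            defined.add(name)
    return defined, undefined


def unresolved_symbols(module_nm, provider_nms):
    """Return the module's undefined symbols that no provider defines."""
    defined, undefined = parse_nm(module_nm)
    available = set(defined)
    for text in provider_nms:
        available |= parse_nm(text)[0]
    return sorted(undefined - available)


# Hypothetical sample data standing in for real nm output.
MODULE_NM = """\
                 U malloc
                 U vcn_v2_0_example_fn
0000000000000000 T amdgpu_example_init
0000000000000040 t amdgpu_static_helper
"""
KERNEL_NM = """\
ffffffff80b00000 T malloc
ffffffff80b00100 T free
"""

print(unresolved_symbols(MODULE_NM, [KERNEL_NM]))  # ['vcn_v2_0_example_fn']
```

Anything this prints is a symbol that would still be undefined after the kernel and the other modules have had their say, which is exactly the kind of thing that made my module refuse to load.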

The Battle with Dynamic Linking

This structural quirk resulted in me spending a week fixing all those undefined errors. At some point, I got really pissed at the tediousness and tried to use "AI agents" to do it, and they spectacularly FAILED. They tried editing files they didn't need to and stubbed out functions whose calls should have just been commented out. It was so... disappointing. I even gave them specific instructions to run an exact command, look at the output, and comment out the function call... but no luck. I had to sit down for like 3 to 4 days and just manually walk through fixing all of them.

At another point, I had a lot of difficulty getting the module's file format recognized, and it refused to load. After a lot of googling, I figured out it was a "freshness"/"integrity" value causing a mismatch, so I just disabled it (the file configuration is at the end). Then, IT FINALLY HAPPENED. IT LOADEDDD WHOOOO!

It happened very unexpectedly—I had honestly lost hope, and then it just worked! It does still panic/crash, but that is to be expected at this stage. I haven't gotten much further as of now. Here is the link/hash [put in before publishing].

kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
zfs_load="YES"
autoboot_delay="30"
vm.aslr.enable=0
hw.amdgpu.dc=0
vfs.zfs.arc.max="128M"

Side Quests and Random Thoughts/Happenings

As you are an astute, smart, and intelligent reader, you may have noticed that I disabled ASLR and capped the ZFS ARC at just 128M via vfs.zfs.arc.max. The reasoning behind the ASLR change was that something with kernel security was stopping the loader, so I disabled it in the name of speed rather than correctness.

Then, I had to drastically lower the ZFS ARC cache size because it was taking up too much space in RAM, causing kernel dumps to not get written to my 2G swap partition... (Pro tip: don't make your swap only 2G when setting up a system for debugging. I did not know better at the time and really do not want to reinstall everything right now). Otherwise, everything works fine. Well, that's a lie—for some reason, kgdb doesn't exist despite installing gdb from ports.

Debugging in the Wild

I was working from home almost this entire week because I was waking up late and working even later, so I rarely went into the office. Because of that, I was caught completely off guard by the panic. It would just freeze and kill my SSH session. After reconnecting 2–3 minutes later and checking dmesg, I saw it was "wiped," meaning a full reboot occurred (which, iirc, only happens if there is a panic).

Pairing that with the core dump issue, I realized I needed to go handle it in person. It was around 2:00 PM, and I was sitting in the University of Waterloo Computer Science Club hacking away on one of the monitors when I realized I actually had to commute to the office for the first time this week (normally I go 4/5 days a week). Lol.

Upstreaming Changes to FreeBSD

Here are some of the changes that have been upstreamed or are pending review: 1, 2, 3, 4

I did also open one PR to AMD's rocm-systems repo, but there has been no response so far.

Potentially Needed Changes

Now, if you, the super smart, intelligent, and most importantly curious reader, have been paying attention, you'll remember the "this is foreshadowing for later" from a while back. Well, if you were hoping for a nicely written cinematic battle with DMA, I have bad news: I haven't yet progressed to that stage of the quest. But I won't leave you fully hanging. Below are the functions which I believe are required to get ROCm working on FreeBSD (and potentially CUDA!?!?). Note: there are a lot more functions to come.

drm_sched_init
drm_sched_entity_init
drm_sched_entity_push_job
drm_sched_job_init
drm_sched_job_arm
drm_sched_start
dma_fence_init
dma_fence_signal
dma_fence_wait
dma_resv_add_fence
dma_resv_reserve_fences
dma_resv_wait_timeout
sync_file_create
__devm_drm_dev_alloc
drm_dev_register
drm_ioctl
drm_buddy_init
drm_buddy_alloc_blocks
drm_gem_private_object_init
drm_suballoc_manager_init
drm_suballoc_new
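
Since a flat list like this is hard to scan, here is a small Python sketch that buckets a few of the names above by their prefix. The prefixes come straight from the function names; the subsystem labels are just my own shorthand rather than official names, and anything without a matching prefix is dumped into a catch-all "DRM core" bucket.

```python
from collections import defaultdict

# (prefix, label) pairs. The labels are informal shorthand, not official names.
SUBSYSTEMS = [
    ("drm_sched_", "GPU scheduler"),
    ("dma_fence_", "DMA fences"),
    ("dma_resv_", "DMA reservation objects"),
    ("sync_file_", "sync files"),
    ("drm_buddy_", "buddy allocator"),
    ("drm_gem_", "GEM objects"),
    ("drm_suballoc_", "suballocator"),
]


def group_by_subsystem(names):
    """Bucket function names by the first matching prefix, keeping input order."""
    groups = defaultdict(list)
    for name in names:
        label = next(
            (lbl for prefix, lbl in SUBSYSTEMS if name.startswith(prefix)),
            "DRM core",  # catch-all for names with no listed prefix
        )
        groups[label].append(name)
    return dict(groups)


# A subset of the list above, just to keep the example short.
REQUIRED = [
    "drm_sched_init",
    "drm_sched_job_arm",
    "dma_fence_signal",
    "dma_resv_add_fence",
    "sync_file_create",
    "drm_dev_register",
    "drm_ioctl",
    "drm_buddy_init",
    "drm_gem_private_object_init",
    "drm_suballoc_new",
]

for label, funcs in group_by_subsystem(REQUIRED).items():
    print(f"{label}: {', '.join(funcs)}")
```

Grouped like this, the list is mostly scheduler, fence, and reservation plumbing, plus a few allocators and the core device registration bits.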