Context
I managed to brick my NixOS server. The recipe to achieve that is quite simple: just throw all best practices out the window by doing the following three things at the same time:
- Make a potentially harmful change.
- Make it impossible to recover.
- Do not test.
Pretty easy, right?
Potentially Harmful Change
My server has its root filesystem on an encrypted ZFS pool. This means I need to enter a passphrase for my server to boot. To be able to ssh into the server and enter the passphrase remotely, I had the boot.initrd options set up correctly.
But I wasn’t happy with my setup. See, when I sshed in, I was shown the prompt to type the passphrase; I could enter it and press Enter to mount the root partition, but the shell then stayed open and I had to exit it manually. This extra exit step was not to my liking. So I changed:
echo "zfs load-key zroot; killall zfs" \
>> /root/.profile
to this:
echo "zfs load-key zroot; killall zfs" \
>> /root/.profile; exit
instead of this:
echo "zfs load-key zroot; killall zfs; exit" \
>> /root/.profile
See the mistake? The exit was not in the correct location! It sits outside the quoted string, so instead of being appended to /root/.profile it was executed directly. So instead of showing me the prompt to decrypt the partition, the startup would fail and I couldn’t do anything about it.
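To make the mistake explicit, here are both versions again, annotated:
# Broken: the exit sits outside the quotes, so it is executed right away
# by the script doing the appending instead of landing in .profile.
echo "zfs load-key zroot; killall zfs" >> /root/.profile; exit
# Intended: the exit is part of the appended text and only runs in the
# remote shell after the passphrase has been entered.
echo "zfs load-key zroot; killall zfs; exit" >> /root/.profile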
Impossible to Recover
I had been playing around with tweaking the boot.initrd options for some time. Each tweak creates a new, voluminous file inside /boot. I gave that partition less than 1 GB of disk space, so it fills up quickly during my tweaks.
I had already written another blog post about how to recover from a filled-up /boot partition.
Anyway, I did what I wrote in that blog post and ran a command to remove old generations:
sudo nix profile wipe-history \
--profile /nix/var/nix/profiles/system \
--older-than 14d
But since that was not enough, I removed the --older-than argument. This removed all previous generations: just what I wanted, but not what I should have done.
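In other words, I ran something like this (the command above, minus the flag):
sudo nix profile wipe-history \
--profile /nix/var/nix/profiles/system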
Did not Test
nixos-rebuild build-vm is a thing and I should learn to use this tremendously useful feature. I already use NixOS VM tests quite extensively in my Self Host Blocks project, so I have no excuse.
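For a flake-based setup like mine, testing a change before deploying looks something like this, with <mymachine> standing in for the machine’s name:
# Build a VM variant of the configuration instead of deploying it
nixos-rebuild build-vm --flake .#<mymachine>
# Boot the VM; the script is named after the machine's hostname
./result/bin/run-<mymachine>-vm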
Bonus Stress Inducing Factor
Of course, I did this 24 hours before leaving on a trip. Why not?
Recovery Overview
So, how do you recover from this? The quick overview that @k900 from Matrix gave me is:
- Boot on a NixOS Recovery USB stick
- nixos-enter
- change configuration
- nixos-rebuild
I knew nixos-enter was a thing from the NixOS installation procedure but never would’ve thought about using it to recover a system!
That being said, because I was using flakes, ZFS for the partitions, and colmena to deploy, each step had unforeseen complications. This blog post goes over those and how I proceeded for each step.
NixOS Recovery USB stick
Also called a live CD. But how do you make such a thing? The wiki explains how to do this quite well. And it’s so easy, I love Nix.
Here’s exactly how I created the live CD:
nixosConfigurations.recovery-iso = let
  inherit (inputs) nixpkgs;
  system = "x86_64-linux";
in
  (nixpkgs.lib.nixosSystem {
    inherit system;
    modules = [
      "${nixpkgs}/nixos/modules/installer/cd-dvd/installation-cd-minimal.nix"
      "${nixpkgs}/nixos/modules/installer/cd-dvd/channel.nix"
      ({ config, pkgs, ... }: {
        boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;
        environment.systemPackages = [
          pkgs.colmena
        ];
      })
    ];
  }).config.system.build.isoImage;
Since I wanted to work on ZFS partitions, I changed the kernel for a ZFS-compatible one. I also added the colmena package. You could avoid doing that, as long as the machine you want to recover has internet access, by running nix shell nixpkgs#colmena once booted in the recovery environment.
To build the CD:
nix build .#nixosConfigurations.recovery-iso
And finally, to copy it onto a USB stick at /dev/sdb:
sudo dd \
if=./result/iso/nixos-24.05.20240421.6143fc5-x86_64-linux.iso \
of=/dev/sdb \
bs=4M \
status=progress \
conv=fdatasync
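As an aside that wasn’t in my original notes: before running dd, it’s worth double-checking which device is actually the USB stick:
lsblk -o NAME,SIZE,MODEL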
nixos-enter
Again, the wiki explains well what to do in the general case.
For this step, I needed to import the ZFS pool called zroot and decrypt it:
sudo zpool import zroot -f
sudo zfs load-key zroot
# <enter passphrase>
I could then mount the three required filesystems. That being said, because I let ZFS manage the mounts by setting the mountpoint option, I couldn’t just use zfs mount to mount them under the /mnt directory. I needed this instead:
sudo mount -t zfs -o zfsutil zroot/local/root /mnt
sudo mount -t zfs -o zfsutil zroot/local/nix /mnt/nix
sudo mount -t zfs -o zfsutil zroot/safe/home /mnt/home
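As an aside that was not part of my original steps: if you’re unsure what the datasets are called or where ZFS wants to mount them, you can list them:
zfs list -r -o name,mountpoint,canmount zroot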
Now, I could run nixos-enter and get chrooted under /mnt.
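Concretely:
sudo nixos-enter --root /mnt
The --root flag is optional here since /mnt is the default.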
flake
At this point, I should have been able to change configuration.nix, but there is no such file when using flakes! So instead, I needed a copy of the repository used to deploy this machine.
I could have copied it from my laptop but instead I went back out of the chrooted environment and copied it over from an external hard drive used to store the repositories:
# Import the external data pool
sudo zpool import data -f
# Find where ZFS expects the encryption key for this pool
zfs get keylocation data
# Copy the key from the mounted root system to that expected location
cp /mnt/persist/data_passphrase /persist
sudo zfs load-key data
sudo mount -t zfs -o zfsutil data/nextcloud /mnt/srv/nextcloud
# Repo is now under /mnt/srv/nextcloud/.../nix-config
cp -r /mnt/srv/nextcloud/.../nix-config /mnt/root
The instructions above have additional steps compared to the root partition because those external hard drives use a different passphrase, one that’s only stored on the ZFS root system. So I needed to copy the key over to the location expected by ZFS.
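To double-check that the key was actually loaded, the pool’s key status can be queried (an aside, not in my original notes):
zfs get keylocation,keystatus data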
Update Configuration
Now, I could finally make the change to the repository!
nixos-enter
cd /root/nix-config
then change the offending line to:
echo "zfs load-key zroot; killall zfs; exit" \
>> /root/.profile
nixos-rebuild
Since I was not using nixos-rebuild to deploy in the first place, I could not run nixos-rebuild to deploy locally either, because there is no nixosConfigurations flake output. When using colmena, there is a colmena output instead. This meant I needed to:
- Allow colmena to deploy this machine locally by enabling the option:
deployment.allowLocalDeployment = true;
I had not enabled that option yet for this machine since I had never needed to deploy this configuration locally.
- Run colmena apply-local --node <mymachine> boot.
But that failed because I had not mounted the boot partition!
So, after getting out of the chroot once more, I mounted the correct boot partition:
cat /mnt/etc/fstab | grep boot
sudo mount /dev/disk/by-partlabel/disk-x-ESP /mnt/boot
Then I re-entered the chroot with nixos-enter and ran the following command, which finally succeeded:
colmena apply-local --node <mymachine> boot
Unmount Cleanly
But wait! If you reboot now, the system will not be able to mount the ZFS partitions. You first need to export the zpools:
sudo zpool export zroot
sudo zpool export data
If you forget this last step, just reboot on the USB drive and re-import the pools then export them.
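That round trip looks something like this, reusing the same commands as before:
sudo zpool import zroot -f
sudo zpool import data -f
sudo zpool export zroot
sudo zpool export data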
Of course I forgot to do this on my first try.
Takeaway
What was the takeaway here? It was nerve-wracking and exhausting. Never again.