The Problem
When I think back on old projects, websites, or blogs that I’ve created over the years, I always remember one concept that was a huge thorn in my side: Data Persistence.
Simply put, I didn’t want to put a lot of effort into a project if it wasn’t guaranteed to be “permanently” available - that is, if there was any real chance of data loss. This is why a lot of my blogs used to die - because I didn’t consider the content to be permanent. Usually, I’d use WordPress, or sometimes even write my own CMS, backed by a database server that typically lived on the same VPS on which the web application was hosted. This was simply a cost-saving measure - I didn’t really want to keep paying 10€ a month or more for a cloud-hosted database server.
But a VPS feels “fickle”. Especially when I was younger and less experienced with system administration, I was always scared that some update or configuration could bomb my VPS, or the cloud provider could experience some issue which might lead to irrecoverable data loss. This may sound silly, but it was a real concern for me, and often the reason why I didn’t end up putting a lot of effort into publicly accessible projects or web presences. If I wasn’t personally convinced that the creative output could not easily be lost, I just didn’t want to put hours or days of work into it.
Some might say a simple solution would be a large cloud provider, like AWS, Azure or Google Cloud, but I’ve talked about my issues with enterprise-grade cloud solutions before. Not only is the pricing just not very appealing for a consumer, but the lack of a billing limit is a complete dealbreaker. Also, nobody has time to learn the overcomplicated mess that is the AWS console. And, you can technically still lose your data on enterprise-grade cloud, although admittedly, that chance is super negligible.
Recently, though, I’ve been building my own home server setup. In the process, I’ve learned about Proxmox, simple consumer-grade server virtualisation, and the built-in backup functionality. While I was working on this setup, I suddenly remembered a Synology NAS I bought a good 7 years ago or so, and believe it or not - I actually found it.
At the same time, I was facing an issue with backing up my user data. I’ve talked a bit about it in this article, where I discussed different online backup solutions like Google Drive and Proton Drive, the lack of native Linux app support, and my experiments with rclone. Eventually, I ended up switching to a tool called Insync, which worked decently enough, but the two-way syncing gave me some headaches when I accidentally almost deleted my entire user home directory and couldn’t open the app anymore without it trying to repeat this process. I guess some might call this a skill issue.
In any case, with all these pieces falling into place, my vision was simple - a three-step backup plan, where data would flow from On-Device (either my workstation or a server) to my Synology NAS (via rsync or Proxmox’s built-in backups to network storage, respectively), where it would be stored in a RAID-1 configuration. In the very unlikely case that something goes wrong with rsync or both drives of the RAID-1 solution break down simultaneously, the data would then be backed up in an encrypted form on cloud storage.
Voila - indestructible data.
Sketching Out The Process
Mapping Out The Data Flow
The centrepiece of this backup concept is the NAS, where all data is centrally stored and managed. I split my data into three functional use cases:
- Data, which is volatile data that lives on my workstations.
- Servers, which are server backups.
- Glacier, which is slow-changing data with low transfer requirements.
These use cases have different requirements in terms of storage, backup rotations, and expected egress. For now, the cloud backup target for all folders is a storage box (NAS Drive) on a Hetzner server, where the backups are stored after being encrypted client-side. However, I might refine this concept in the future, e.g. storing Server backups or Glacier data on a storage provider optimised for infrequent reads. For now, the categorisation is mostly a semantic one.
As a cloud provider, I went with Hetzner, who offer cheap NAS storage. The advantage of my setup is that this decision is very non-committal, as the cloud backup is only a last-ditch crutch in case of home NAS failure. Since the data is also client-encrypted, privacy is not a real concern. If I decide to go with a different cloud provider in the future, all I have to do is change the target on my Synology backup tasks. At first, I wanted to use Google Drive, but I ended up changing my mind because:
- The built-in backup to Google Drive has very slow throughput and poor availability at times.
- It’s more expensive than a storage box.
- Giving less money to Google is always good.
Hetzner even offers a tutorial on how to use Synology Hyper Backup to connect to one of their storage boxes, which made the process very simple.
In the rest of the article, I’ll briefly explain my setup for each of the three types of data.
Backing Up Servers
Since I’m running my servers virtualised on Proxmox, it’s pretty easy to manage the backups. In the Proxmox UI, under Datacenter > Storage, we can add an “SMB/CIFS” storage to connect to an SMB network share. This service can be activated on the Synology NAS under Control Panel > File Services > SMB. I chose this protocol because it also allows me to easily connect to the network share from my other devices, like my workstation. In the SMB Advanced Settings on the NAS, I had to increase the Maximum SMB Protocol to SMB3, so that Proxmox would properly connect to it.
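For reference, the same storage can also be added from the Proxmox shell with pvesm instead of the UI. This is only a sketch - the storage name, share name, NAS address, and credentials below are placeholders for my actual values, and the available options may differ slightly between Proxmox versions.
# Add the Synology SMB share as a backup target (placeholder names and credentials)
pvesm add cifs nas-backup --server <nas-address> --share proxmox-backups --username <smb-user> --password <smb-password> --content backup --smbversion 3.0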
After connecting the SMB share, I set up a backup task on Proxmox under Datacenter > Backup. This automatically backs up all servers under the datacenter by default, or desired servers can be manually included or excluded. We can also define the retention policy here.
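Under the hood, these backup jobs use vzdump, which can also be run by hand. As a rough, hedged example, a comparable one-off backup with retention settings might look like this (the guest IDs and storage name are placeholders):
# Back up guests 100 and 101 to the SMB storage, keeping 7 daily and 4 weekly versions
vzdump 100 101 --storage nas-backup --mode snapshot --compress zstd --prune-backups keep-daily=7,keep-weekly=4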
On my Synology NAS, I use the Hyper Backup application to mirror an encrypted daily backup to the cloud storage. Since versioning is already managed by Proxmox, I don’t set up a backup rotation for this task.
Backing Up Data
Synology provides a special utility for rsync in Control Panel > File Services > rsync.
After enabling rsync on the NAS, we can create a new user with rsync application permissions in the NAS user settings. We also need to enable the user home service in Control Panel > User > Advanced > User Home to create a home directory for our users. This creates a homes directory on the NAS, where we can upload a public key file into our user’s .ssh subdirectory. We also need to create an authorized_keys file and copy the public key’s contents into it. The directories and key files must be sufficiently guarded (700 for the home and .ssh directories, 600 for the key files) and owned by the correct user and group (e.g. “rsync:users”), or they won’t be picked up properly.
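Condensed into shell commands, the key setup looks roughly like the sketch below. The username, volume path, and key file name are placeholders, and it assumes the commands are run on the NAS itself (the same structure can also be replicated via File Station):
# Create the .ssh directory in the rsync user's home and install the public key
mkdir -p /volume1/homes/rsyncuser/.ssh
cat /tmp/id_ed25519.pub >> /volume1/homes/rsyncuser/.ssh/authorized_keys
# Restrict permissions so the key files are actually picked up
chmod 700 /volume1/homes/rsyncuser /volume1/homes/rsyncuser/.ssh
chmod 600 /volume1/homes/rsyncuser/.ssh/authorized_keys
# Make sure the rsync user owns its own .ssh directory
chown -R rsyncuser:users /volume1/homes/rsyncuser/.ssh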
There seems to be some information online that suggests SSH is required for rsync to work - this is partially correct. Technically, the rsync daemon does not require SSH access and can transport data without any encryption. However, this makes it susceptible to a range of attacks that, even on a local network, aren’t really ideal for the recurring transport of possibly sensitive information. To enable transport encryption, it’s recommended to add the “-e ssh” flag, which executes rsync in the context of a secure shell.
On Synology, to use interactive SSH, a user must be a member of the administrators group on the NAS. Overall terminal access can then be toggled in Control Panel > Terminal & SNMP > Terminal.
In my experiments, though, I observed that
- Even without the -e ssh flag, rsync will by default open a connection using SSH.
- Even when interactive shells are disabled on the NAS, or when a user is not part of the administrators group, SSH is not fully disabled, and rsync can still function normally.
Thus, to run rsync in a secure context, it is not necessary to add the user to the administrators group, nor to enable interactive shells system-wide.
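Before wiring anything into cron, a quick dry run is an easy way to verify that the key, permissions, and target path all line up. The user and host below are placeholders:
# List what would be transferred (and deleted) without actually changing anything
rsync -avn -e ssh --delete /home/vanilla/Data/ rsyncuser@<nas-address>:/volume1/Data/Data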
At this point, my workflow for this side of the backup basically became:
- Moving all important data into as few directories as possible.
- Running daily rsync cron jobs from my workstation during times when I am likely to be online.
- Running Hyper Backup with backup rotation on the Synology NAS.
Currently, I’m syncing my data directory, which contains volatile data like documents, images, and videos, every 30 minutes. I am also syncing a select number of game configurations every evening. A sample crontab can be found below.
# Run backup task of Data every 30 minutes
*/30 * * * * rsync -av -e ssh --delete /home/vanilla/Data/ [email protected]:/volume1/Data/Data
# Run Backup Task of Game Folders Daily
0 22 * * * rsync -av -e ssh --delete "/mnt/extra/Battle.NET/drive_c/Program Files (x86)/World of Warcraft/_retail_/WTF" [email protected]:/volume1/Data/Games/WoW/WTF
10 22 * * * rsync -av -e ssh --delete "/mnt/extra/Battle.NET/drive_c/Program Files (x86)/World of Warcraft/_retail_/Interface" [email protected]:/volume1/Data/Games/WoW/Interface
20 22 * * * rsync -av -e ssh --delete "/home/vanilla/.xlcore/ffxivConfig/FFXIV_CHR0040002E92C8FF28" [email protected]:/volume1/Data/Games/FFXIV
30 22 * * * rsync -av --delete "/home/vanilla/.minecraft/saves" [email protected]:/volume1/Data/Games/Minecraft
Here, the -a flag enables archive mode and -v adds verbosity. -e ssh, as noted, should ensure rsync runs in a secure context. --delete ensures that files that are no longer present on the source system are also deleted from the destination system. I consider this to be safe, as I still have the cloud backup in case of critical failures.
On Hyper Backup, I use a backup rotation to keep daily versions of the resulting Data directory, as well as some older versions for contingency.
Backing Up Glacier
The data I call “Glacier” is essentially my “digital attic”. It contains old diaries, music, videos, documents from my school days, and so on… Basically, stuff that’s neat to have, but unless I’m in the middle of some drunken nostalgia trip, I’ll probably not actively look at it. It also doesn’t really change very frequently. Mostly, it’ll change when I start a new chapter of my life, or when I don’t expect a specific topic to come up anymore in my regular data.
For now, my Glacier backups are very similar to my Data backups, with two notable differences:
- I don’t keep Glacier data on my workstation. For Glacier data, the NAS is the single source of truth. If I want to update data, I can just connect directly to the network drive to do so (see the example after this list).
- The backup rotation is less frequent and contains a lower number of total versions, since data is not expected to change frequently.
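When I do want to touch Glacier data, I simply mount the share over SMB on my workstation for the duration of the edit. A rough sketch using mount.cifs (this assumes cifs-utils is installed; the share name, NAS address, and credentials file are placeholders):
# Temporarily mount the Glacier share over SMB 3, then unmount when done
sudo mount -t cifs //<nas-address>/Glacier /mnt/glacier -o credentials=/etc/nas-credentials,vers=3.0,uid=$(id -u)
sudo umount /mnt/glacier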
Closing Words
With this system I’ve built, I finally have data persistence that I truly trust. This has some pretty important ramifications:
- I’ll be able to expand my home network with more utility services like bookmarks, notes, or tasks, without being concerned about losing configurations or data long-term.
- I’ll be able to gradually move applications from my cloud server to my home server, so I can wind down VPS usage.
- I can rely on data safety enough to move critical services, like code repositories (Git) and password managers (VaultWarden) to my home server setup. This will be a benefit both financially and in terms of security, and a self-hosted Git setup will also improve DevOps for future projects.
However, my work is not fully done yet. Over the next months, I’ll likely have to fine-tune several parameters related to backup retention. I’ll have to monitor storage space to estimate the best tradeoff between cloud space and data safety. I might have to evaluate different cloud providers based on throughput, availability, and ease of use. And finally, I’ll have to test the system to see if I can actually properly restore the data at each step (i.e., from the cloud backup to the NAS, to local storage, and to the Proxmox server), so I can truly rest easy.
And if all that works out, who knows, maybe I’ll even want to expand the hardware components in the future.