A server installed with half-decent care can run uninterrupted for a long long time, given minimal maintenance and usual care (update, and reboot if you change the kernel).
Also, not installing a n+3 Kubernetes cluster with an external storage backend reduces overheads and number of running cogs drastically.
VMs, containers, K8S and other things are nice, but pulling the trigger so blindly assuming every new technology is a silver bullet to all problems is just not right on many levels.
As for home hardware, I'm running a single OrangePi Zero with DNS and SyncThing. That fits the bill, for now. Fitting into smallest hardware possible is also pretty fun.
For my earlier home setups, this was actually part of the problem! My servers and apps were so zero-touch, that by the time I needed to do anything, I'd forgotten everything about them!
Now, I could have meticulously documented everything, but... I find that pretty boring. The thing with Docker is that, to some extent, Dockerfiles are a kind of documentation. They also mean I can run my workloads on any server - I don't need a special snowflake server that I'm scared to touch.
NextCloud is a prime example. Adding some extensions (apps) on NextCloud becomes almost impossible when installed via Docker.
We have two teams with their own NextCloud installations. One is installed on bare metal, and other one is a Docker setup. The bare metal one is much easier to update, add apps, diagnose and operate in general. Docker installation needed three days of tinkering and headbanging to get what other team has enabled in 25 seconds flat.
To prevent such problems, JS Wiki runs a special container just for update duties for example.
I'd rather live document an installation and have an easier time in the long run, rather than bang my head during some routine update or config change, to be honest.
I document my home installations the same way, too. It creates a great knowledge base in the long run.
Not all applications fit into the scenario I told above, but Docker is not a panacea or a valid reason to not to document something, in my experience and perspective.
Last year when we’ve been asked to redeploy our servers because the OS version was being added to a nope list we decided to migrate to Kubernetes so we don’t have to manage servers anymore (the nodes are a control plane thing so not our problem). Now we just build our stuff using wathever curated image is there that can be used and ship it without worrying about OS patches.
So you basically replaced the "we regularly have to update the OS" with "we regularly have to pull the newest image". It's possible because I have a Linux admin background but I don't see that big of a difference here. Oh and just using whatever curated image is there doesn't necessarily provide you with a secure environment. [0]
[0] https://snyk.io/blog/top-ten-most-popular-docker-images-each...
Checking uptime on my FreeBSD home server ... 388 days. Probably 15-year-old hardware. That thing's a rock.
I can't say I'm very proud of my security posture, though (:
One day I'll do boot-to-ZFS and transubstantiate into the realm of never worrying about failed upgrades again.
https://forums.freebsd.org/threads/ufs-boot-environments.796...
Our highest maintenance stuff is entirely "the dev fucked up"/"the dev is clueless". Stuff like server deciding to dump gigabytes of logs per minute or run out of disk space coz of some code error. Only difference compared to k8s is that the dead app would signal dev first, not actual monitoring we had.
And honorable mention for WordPress "developers" that make the first place in "most issues per server" every single year. And we have very few WP compared to everything else. Stuff like "dev uploaded some plugins, got instantly hacked (partially, we force using outgoing proxy and that stopped full compromise), we reverted it, he uploaded same stuff and got hacked again". So, nothing to do with actual servers.
Stuff like setting up automation to deploy ceph cluster took some time... once. Now it takes nothing. We even managed to plug it to k8s cluster.
> A server installed with half-decent care can run uninterrupted for a long long time, given minimal maintenance and usual care (update, and reboot if you change the kernel).
could even not be touched if you set up automated updates and reboots, the red tape is usually a problem, like dumbly written support deals needing notification for every restart even if service is in HA. Or some dumbo selling a service with SLA but only on single machine...
full down to back online if hardware is not damaged is less than 15 min.
it is very easy to do and rock solid, unattend-upgrades does the trick. i tried almost every combination and even though i am a k8s user from version 1.2 onwoards the complexity for k8s at home or even vsphere is too much for me vs. this super simple and stable configuration i now have.
Those are almost entirely independent domains
the only point where they connect is that
> full down to back online if hardware is not damaged is less than 15 min.
automation makes both easier to deal with.
"Mean time to repair" is all fine if you don't have to drive 30 minutes to datacenter to swap out stuff.
Even if its cloudy cloud and you have backups, restores can take a lot of time.
Also you will want HA when it's your internet gateway to die
There are also levels of implementation. Making fully redundant set of loadbalancers is relatively easy. But any stateful app is much, much harder. In case of applications active-passive type of setups are also much easier than full on active active redundancy, especially if app wasn't written for it.
Definitely relevant for, say, SCADA controls with terrible security.
Rebuilding existing systems (or adding new ones) is excellent, but sometimes life really gets simpler when you have one well-known/HA endpoint to use in a given context
What does HA stand for?
Thank you.
not about stress but to have HA proxmox cluster you will need at least 3 machines or fake one with a quorum machine without vms on it. Sure your vms will run if one machines goes down, but do you have HA Storage underneath? ganesha would work but more complexity or another network storage with more machines. Don't get me wrong it is fun to play with, but i doubt any homelab needs HA and cannot have a few minutes downtime. Or what do you do in case your internet provider goes down or when you have a power outage? i don't want to provoke, i have fiber switches and 10G at home and have 2 locations to switch in case one location goes down but i can live with multiple days of downtime if i have to, or if not i take my backups and fire some VM's on some cloudprovider and be back online in a few min and pay for it because backups are in both locations.
i would much rather develop a simple LB + autoscaling group with a deployment pipeline (lambda or some other control loop) and containers on it than a k8s cluster the client is not prepared for. if they outgrow this, most likely the following solution is better than going 100% "cloud" the first time. Most clients go from java 8 Jboss monolith to spring containers in k8s and then wonder why it is such a shitshow. but yeah pays the bills so i am not complaining that often ^^
It's interesting from a job market standpoint how little incentive there is to learn the basics when more and more we are seeing developers being made responsible for ops and admin tasks.
As someone who’s been hiring for both sides, I see this reflected in candidates more and more. The good devs rarely know ops, and the good ops rarely code well. For our “platform” teams, we end up just hiring good devs and teaching them ops. I think the people that are really good at both often end up working at the actual cloud providers or DevOps startups.
"The cloud" is both a technical solution for flexible workload requirements, and a political solution to allow others to continue delivering when IT is quoting them 3 months of "prep work" for setting up a dev environment and 2 to 4 weeks for a single SSL certificate.
As a consultant, I am confronted to hostile IT requirements daily. Oftentimes, departments hiring us end up setting and training their own Ops teams outside of IT. despite leading to incredible redundancy, that's often credited as a huge success for that department, because of the gained flexibility.
So when problems do occur, I can make some pretty good guesses on what's wrong before I actually know for sure.
At home, well, I enjoy my work, so I containerize all my home stuff as well. And I have a plan to switch my main linux box to ProxMox and then host both a Nomad cluster and a Rancher cluster.
If I weren't interested in the learning experience, I'd just stick with docker-compose based app deployment and Ansible for configuring the docker host vm.
Once the line has been crossed with "I should be backing this up", "I should have a dev copy vs the one i'm using", the existing setups can port quite well/into something like ProxMox to keep everything in a homelab setup running relatively like an appliance with a minimum of manual system administration, maintenance and upgrading.
For home use, I've been experimenting with Portainer. It seems to work well for apps I'm not developing and am just running.
Yes, we'd have more than one instance of the app running. Ideally 3 or 5 instances. And they'd be load balanced. So I could update the app one instance at a time with no downtime. At least that is the goal.
"cattle" - servers which have no distinct personality traits, are fed, loved and cared for as a group or heard. You know it's a "cattle" server when the name isn't something you can easily remember
Neither pets nor cattle are "exploited", they all have their jobs to do, however they have an ROI and also a consequence of being unique or not. Cattle tend to be born, graze, die and it's part of the cycle. Pets die, and you are emotionally upset and do something stupid -- and you find yourself doing unholy things to resurrect them like using backup tapes or something
https://www.engineyard.com/blog/pets-vs-cattle/
Direct link to slide screencap here:
https://www.engineyard.com/wp-content/uploads/2022/02/vCloud...
Ansible with ‘pet’ containers is the way to go. Use it to automate the backups and patching. Ansible cookbooks are surprisingly easy to work with. Again, a learning curve, but it pays for itself within months.
Running a single machine with all the apps in a single environment is a recipe for tears as you are always one OS upgrade patch, library requirement change, hard drive failure away from disaster and hours of rebuilding.
I completely agree about the value of isolation. You can update or up-rebuild for a new os version on one service at a time - which I find helpful when pulling in updated packages/libraries. You can also cheaply experiment with a variation on your environment.
Nice thing about automation, is you can just rebuild that easily to a clean state, rather than having to remember how staging is different from prod. And as you write your automation and find those human errors, you put it in code so that you don't forget a step. Nothing more frustrating than thinking you've solved the problem in staging only to run a command in prod that breaks things more, and in a different way, because it turns out the fix was a combination of earlier experimentation plus the command you just ran.
As long as you avoid technology that takes more than a few minutes to set up, rebuilding is trivial.
I take your point and generally agree. Leave the major hardware to the actual data centers. Stuff at home should be either resume driven or very efficient.
Last year I ran some numbers and realized that it just wasn't worth it, as electricity costs alone make it fairly close to what Hetzner/OVH would charge for similar hardware. I also had power go down in my area, for probably first time in 5 years or so that I lived there, which I took as a sign to just migrate away.
I peer it into my internal network using WireGuard, so I barely notice a difference in my use and now that electricity costs are skyrocketing in Europe I'm certainly very happy I went this route.
I often feel a bit out of my depth when dealing with sysadmin tasks. I can achieve pretty much any task I set out to, but it is difficult to ascertain if I have done so well, or with obvious gaps, or if I have reinvented some wheels.
Does anyone have a recommendation for in-depth "So you want to be a sysadmin..." content? Whether books, articles, presentations, or any other format.
But it never hurts to understand the fundamentals:
- networking (IPv4/IPv6/TCP/UDP/ICMP/various link layer technologies (ethernet, wifi, ppp, USB, ...)... and concepts like VPNs/VLAN/bridging/switching/routing/...)
- protocols (DHCP, DNS, NTP, SMTP, HTTP, TLS, ...)
- chosen OS fundamentals (so on Linux things like processes, memory, storage, process isolation, file access control, sockets, VFS, ...)
- OS tooling (process managers, logging tools, network configuration tools, firewall, NAT, ...)
- how to setup typical services (mail servers, file servers, http servers/proxies, DNS servers, DNS resolvers, ...)
- how to use tools to debug issues (tracing, packet capture, logging, probing, traffic generators, ...)
- how to use tools to monitor performance of the network and hosts
These are just some basics to run a home network or maybe a small business network.
Do you have any good references for resources in the vein of teaching the fundamentals. Not like "DNS: here is how to configure a BIND server" but like "DNS: Here are the core concepts, what you will typically need to set up, common issues / confusions, and further reading."
I have tried going through RFCs for a lot of these, but they tend to be focused on implementation rather than "why". Similarly, software-specific content is "how to achieve this outcome with this implementation", but lacks context on why one would want that outcome.
For me, the middle ground seems to be a very plain Debian server, configured with Ansible and all services running in Docker. I consume a few official images and have a (free) pipeline set up to create and maintain some of mine own. I appreciate the flexibility of moving a container over to different HW, if needed, and also the isolation (both security and dependencies).
Going back pure, OS level configuration is always fun - and often necessary to understand how things work. It’s very true that the new cattle methodology takes away any deeper level of troubleshooting and pushes “SRE” towards not only focusing on their app code but also away from understanding the OS mechanics. Instance has died? Replace it. How did it happen? Disk filled up because of wrong configuration? Bad code caused excessive logging? Server was compromised? _They_ will never know.
> cattle methodology takes away any deeper level of troubleshooting and
> pushes “SRE” towards not only focusing on their app code but also away
> from understanding the OS mechanics
This is a minor point, but as a former SRE: We are the people writing the layer between the OS and the product developers (owners of the "app code"). Understanding the OS mechanics is a core part of the job.I know some places are trying to redefine "SRE" to mean "sysadmin", but that seems silly given that we'd need to invent a new term to mean the people who write and operate low-level infrastructure software that the rest of a modern distributed stack runs on.
Most of the time, I see two kinds of SRE around me. 1. The “rebranded sysadmin” SRE. Typically heavy on Ops skills and deep OS level knowledge but no formal education or experience in Software development.
2. The “SDE in Cloud” SRE. Person coming from software development side, having very little Ops experience, relying on benefits of quick and disposable computational power from your favorite Cloud provider.
I don’t mean to “shame” either of those cases, I am just pointing out that person with years (or decades) of experience on one side cannot suddenly become well balanced SRE the way Google defined it.
Having some servers as pets is fine for time being, but after being used to the cattle little pesky issues & upkeep starts to annoy more and more and you either give up on it and leave neglected or slowly build up most of the infra you ran away from in the first place.
I still don't know why you wouldn't use containers though, even if you dislike Docker, something like LXC makes everything oh-so neater.
I was initially pretty apprehensive about running Kubernetes on the compute nodes, my workloads all being special snowflakes and all. I looked at off-the-shelf apps like mailu for hosting my mail system, for instance, but I have some really bizarre postfix rules that it wouldn't support. So I was worried that I'd have to maintain Dockerfiles, and a registry, and lots of config files in Git, and all that.
And guess what? I do maintain Dockerfiles, and a registry, and lots of config files in Git, but the world didn't end. Once I got over the "this is different" hump, I actually found that the ability to pull an entire node out of service (or have fan failures do it for me), more than makes up for the difference. I no longer have awkward downtime when I need to reboot (or have to worry that the machines will reboot), or little bits of storage spread across lots of machines.
That being said, I can only encourage the author with this plan. All those abstractions are great, but at least for me it was massively valuable to know what you are replacing and what an old-school setup is actually capable of.
The note about dns is fairly orthogonal to this though. Sensible naming should be the default regardless of the physical setup. My photos are at photos.home.mydomain.com, my nas is nas.home.mydomain.com. They're the same hardware - perhaps different containers.
There are a million ways to make your life harder in this world, and a bunch of ways to make it easier. I'd put docker in the simplify category, and a bunch of the other tooling the author mentions in the complexify side. YMMV.
That's how I'll go about the next server I rent. And that's how I did with my main project.
You can have nested VMs if you need them.
But do whatever you want ofc.
My setup is not quite as simple. I have one homeprod server running Proxmox with a number of single task VMs and LXCs. A task, for my purposes, is a set of one or more related services. So I have an internal proxy VM that also runs my dashboard. I have a media VM that runs the *arrs. I have an LXC that runs Jellyfin (GPU pass through is easier with LXC). A VM running Home Assistant OS. Etcetera.
Most of these VMs are running Docker on top of Alpine and a silly container management scheme I've cooked up[1]. I've found this setup really easy to wrap my head around, vs docker swarm or k8s or what have you. I'm even in the process of stripping dokku out of my stack in favor of this setup.
Edit: the homeprod environment also consists of a SFF desktop running VyOS and a number of Dell Wyse 3040 thin clients running the same basic stack as the VMs, most running zwave and zigbee gateways and one more running DHCP and DNS.
That’s all I use (for administration stuff), found out not too long ago I don’t even know the su password or it isn’t set up to allow users to su — dunno, never missed the ability until a couple weeks ago when I got tired of typing sudo for multiple commands.
Also found out a long time ago installing software to /opt turns into a big enough PITA that I’ll go to all the trouble to make an rpm and install stuff using the package manager. Usually I can find an old rpm script thingie to update so isn’t too much hassle and there’s only a couple libraries I use without (un)official packages. Why, one might ask, don’t I try to upstream these rpm so everyone will benefit? Well, I’ve tried a couple times and they make it nearly impossible so I just don’t care anymore.
For homeserver I use some rock chip arm computers. It runs stuff that is (arguably) useful like a file server and a local DNS with ad and malware filtering rules. These are two fanless ARM machines that use less than 15W idle and they have a single OS and funny names. You can even say 2 machines is a bit too much already.
For home lab I’ve set a k3s environment on Oracle free tier cloud. No personal data or anything that would cause me problems if Oracle decides supporting my freeloading ass is not worth the hassle. I can test things there for exemple recently dabbling into Prometheus since workplace will soon introduce it as a part of a managed offering for Azure Kubernetes services.
"Hyper-V" -- Don't. Just... don't. Don't waste your time.
Source: ran Hyper-V in production for 3 years. Still have nightmares.