Rest assured that after the first time I messed it up (which required ssh into each box individually), I wrote a lot of unit and integration tests to make sure that it never failed to deploy again. One of the integration tests ensured that the app started up and could always go through the internal auto update process. This ran in CI and would fail the build if it didn't pass.
While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.
Having worked for companies that produce network devices - including devices that are unreachable for example for 6 months of the year - and on software installation and upgrade, I am baffled how this bricking is possible. For one thing, you generally use some kind of confirmed boot mechanism - you upgrade a standby partition, set an ephemeral boot value that causes the device to boot the alternate image, and reboot - only when the image is declared "up" does that get persisted (and then the alternate is upgraded, in order to prevent rollback in the event of a media error). You use watchdogs that are tied to actual forward progress (and not just some daemon that the kernel schedules and bangs on the watchdog even if the rest of the system is hung) and if they fail, the WD reboots you. (This is one of the reasons that event driven programming is somewhat preferred - actually processing events from a single dispatch thread makes it easier to reason about the system.)
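To make the confirmed boot idea concrete, here is a minimal sketch, not any vendor's actual bootloader logic. The two slot names, the `BootState` fields and every function name are assumptions made up for illustration; real platforms store these values in bootloader environment variables or NVRAM under their own names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    """Hypothetical bootloader variables; real names vary by platform."""
    persistent_slot: str = "A"            # survives every reboot
    one_shot_slot: Optional[str] = None   # ephemeral: consumed by the next boot only


def other(slot: str) -> str:
    return "B" if slot == "A" else "A"


def stage_upgrade(state: BootState) -> str:
    """After the new image is written to the standby slot, arm a one-shot boot into it."""
    standby = other(state.persistent_slot)
    state.one_shot_slot = standby
    return standby


def select_slot(state: BootState) -> str:
    """Bootloader step: use the ephemeral value once, then clear it."""
    if state.one_shot_slot is not None:
        slot, state.one_shot_slot = state.one_shot_slot, None
        return slot
    return state.persistent_slot


def confirm_boot(state: BootState, running_slot: str, healthy: bool) -> None:
    """Persist the running slot only once the system has declared itself up."""
    if healthy:
        state.persistent_slot = running_slot
```

If the new image never calls `confirm_boot` with `healthy=True`, the next reboot (for example one forced by the watchdog) lands back on the old slot, because the ephemeral value was already consumed.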
On top of that, you make sure that the core system is an immutable filesystem so that you can validate the _offline_ alternate image before rebooting (write-and-read-back-uncached) and periodically scrub the alternate image (same).
Like.. this is all embedded 101, stuff people have been widely doing since the mid 1990s and I think I can find examples going back to the 70s. Sometimes you get a little more sophisticated (allow sub-packages or overlays and use a manifest to check the ensemble instead of just a single image), but it's very standard.
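The manifest check mentioned above can be sketched roughly like this. It is only an illustration: the manifest format (relative path mapped to a SHA-256 hex digest) and the function names are assumptions, and a plain file read like this does not bypass the page cache, so it does not by itself give the "read-back-uncached" guarantee.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose digest differs."""
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != expected:
            bad.append(rel)
    return bad
```

The same function could serve both cases described: run it on the offline image before arming the reboot, and run it periodically as a scrub. An empty list means the ensemble matches the manifest.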
If it is the most likely “management plane TLS certs” issue, I bet the watchdog won’t confirm the new boot args until the command dispatch daemon gets a pong from the C&C server moving forward (:
IMO there's not a lot of regular OSS for building embedded systems that comes with A/B partitioning, watchdogs, secure and verified boot - it's all custom at every org and tailored for individual products.
[1] https://arstechnica.com/gadgets/2023/11/android-14-patches-r...
That made me think: imagine NASA bricking Voyager with a SW update.
I don't mean to be pedantic, but since we're talking about what should happen instead, this is insufficient. It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.
People should do what you described in CI, but as well as that, you need phased rollout, where e.g. the build can only be rolled out to the next percentage point of randomly selected users in a specific segment (e.g. each hardware revision and country as independent segments) after meeting a ratio of successful check-ins, in the field, from the new build by production customers in that segment. That's the actual metric for proceeding with the rollout: actual customers are successfully checking in from the new version of the software.
Except, that's actually not sufficient either. What if the new build is good, but it contains an update to the updater which bricks the updater? Now you're getting successful check-ins from the new version in the field, but none of those customers will ever successfully auto-update again. So, test the new updater's ability to go forwards successfully, too.
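A rough sketch of the check-in gate described above, per segment. The `SegmentStats` fields, the 99% ratio and the minimum sample size are all invented for illustration; the comment does not give concrete thresholds.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """One segment, e.g. a (hardware revision, country) pair."""
    offered: int       # devices in the segment that were offered the new build
    checked_in: int    # of those, devices that checked in from the new build
    rollout_pct: int   # current rollout percentage for the segment


def next_rollout_pct(stats: SegmentStats,
                     min_ratio: float = 0.99,
                     min_sample: int = 50) -> int:
    """Advance by one percentage point only when field check-ins justify it."""
    if stats.offered < min_sample:
        return stats.rollout_pct   # not enough evidence yet, hold
    if stats.checked_in / stats.offered < min_ratio:
        return stats.rollout_pct   # too many devices went quiet, hold
    return min(100, stats.rollout_pct + 1)
```

Note that this only measures check-ins from the new build. It says nothing about whether those devices can still update afterwards, which is the separate updater problem raised above.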
Nah, my CI process was solid. This was proven in the field over the course of years.
> I don't mean to be pedantic... you need phased rollout
You don't need to be pedantic, but better to ask the question rather than assume that was all that I did. =) You have to realize that what I built, worked flawlessly. It wasn't easy either, took a lot of trial and error.
I did have a CIDR based rollout. I could specify down to the individual box that it would run a specific version. Or I could write "latest" to always keep certain boxes running on the latest build. This was another part of my testing, but ended up not being fully necessary because I had enough automated testing in CI that "latest" always worked.
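For anyone curious what a CIDR based pin might look like, here is a minimal sketch using Python's standard `ipaddress` module. It is not my original implementation; the rule format, the most-specific-match policy and the default of "latest" when nothing matches are assumptions for the example.

```python
import ipaddress
from typing import Dict


def version_for(ip: str, rules: Dict[str, str], latest: str) -> str:
    """Pick the version from the most specific CIDR rule that contains `ip`."""
    addr = ipaddress.ip_address(ip)
    best = None  # (network, version) of the longest matching prefix so far
    for cidr, version in rules.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, version)
    chosen = best[1] if best else "latest"   # assumed default when no rule matches
    return latest if chosen == "latest" else chosen
```

A `/32` rule pins a single box, a wider prefix pins a whole site, and the literal string `"latest"` resolves to whatever build is newest at check time:

```python
rules = {"10.0.0.0/8": "1.4.2", "10.1.2.0/24": "latest", "10.1.2.7/32": "1.3.0"}
version_for("10.1.2.7", rules, latest="1.5.0")   # "1.3.0"
```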
> but it contains an update to the updater which bricks the updater?
This happened, so I wrote a lot of test code to make sure that would never happen again. My CI would catch that since I was E2E testing that it could actually run the upgrade process.
Once I implemented all of this, I never had a single failure and would routinely, several times a day, deploy to the entire cluster, over the course of a couple years.
It was all eventually consistent as I could also control the "check for update" frequency as well.
I feel like it's going to happen to someone that makes network devices eventually. I'm always scared to update my (several hundred) UniFi devices. Their update process isn't foolproof and they push auto-updates via the UI pretty hard.
Several years ago they caused some people's devices to disconnect from the management controller when they enabled 'https' communication. Prior to that, if you were pointing devices at 'https://example.com:8080...' they would ignore the 'https' part and do an 'http' request to port '8080'. Then they pushed their 'https' update which expected an 'https' connection and didn't fall back to the old behavior for anyone that was mistakenly using 'https' in their URL initially. Some people on their forums complained about having to manually SSH to every device to fix the issue.
It was caused by an end-user mistake, but they knew it was a potential issue. AFAIK, their attitude on it hasn't changed a lot, and at the time their response was that they knew it would break some people, but that it wouldn't be that many (lol).
IMO, the issue with those systems is that basic communication back to the update / config server is part of the total package which is too complex (ie: a full Debian install). I'd rather see something like Mender (mender.io) where the core communications / updates come from a hardened system with watchdog, recovery, rollback logic.
Think of how crazy it is to have something like pfSense doing package based updates rather than slice based updates. At least with boot environments they could add some watchdog and rollback type logic, but it'll still be part of the total system instead of something like a hardened slice based setup where the most critical logic is isolated from everything else and treated like a princess.
Do you have any insight on package vs slice based systems for updates? Did you isolate update logic from the rest of the system or am I out of touch with that opinion?
The way I built my app was that I could install it cleanly via a curl | bash.
So, I just had a simple shell script that iterated through the list of IP addresses (from the DHCP leases), ran curl | bash and that cleaned up the mess pretty quickly.
It’s also a testament to the way that the system was designed that they were able to get it back online.
One thing my little control process did on the box was to always set the password to be the same... user/1.
None of these boxes needed inbound connections, so it wasn't a big deal to do that.
If we pushed a broken update it might mean someone from the radio company would have to make a trip to go pull the device and send it to us physically.
Our upgrader did not run as root, but one time we had to move a file as root.. so I had to figure out a way to exploit our machine reliably from a local user, gain root, and move the file out of the way. We'd then deploy this over the satellite head end and N remote units would receive and run the upgrade autonomously. Fun stuff.
Turns out we had a separate process running that listened on a local socket and would run any command it received as root. Nobody remembered building or releasing it but it made my work quick.
So... worse than subterfuge? That being said it only listened on the local socket, so it's slightly less bad, and I don't want to get into the myriad of correct ways that original problem could have been solved, but let's just say that company doesn't exist anymore.
No offense, but what a shit show. It makes me assume no source control, and a really good chance that state actors made their way into your network/product. This almost happened at a communication startup I know, with three letter agencies helping resolve it. State actors really like infiltrating communication stuffs.
Or, you know, having an A/B boot partition scheme with a watchdog. Things that have been around for decades at this point.
Disclaimer: Former Googler, Worked closely with Automotive.
Maybe they've got a test fleet, but it accepts code signed with the test build key.
Maybe they've got a watchdog timer, but it doesn't get configured until later in the boot process.
Maybe they've got A/B boot partitions, but trouble counting their boot attempts - maybe they don't have any writable storage that early in the boot process.
I wouldn't be surprised if, as a newer company, they'd made a 'Minimum Viable Product' secure boot setup & release procedure, and the auto-fallback and fat-finger-protection were waiting to get to the top of the backlog.
> Maybe they've got a test fleet, but it accepts code signed with the test build key.
Polestar solves this by only delivering signed updates to their vehicles. The vehicle headunit will refuse to flash a partition that isn't signed by the private key held by Polestar. Pulls double duty to prevent someone from flashing a malicious update, as well as corruption detection.
> Maybe they've got a watchdog timer, but it doesn't get configured until later in the boot process.
Based on what the Rivian reports are showing (Speedometer, cameras, safety systems are working), they likely are running their infotainment as a "virtual machine" within their systems. Again, something that Polestar does.
Implementation of a watchdog with a "sub-system" like this is relatively braindead simple.
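As a rough illustration of how simple the supervisor side can be, here is a sketch of a watchdog that only counts real forward progress. The class and method names are hypothetical, and time is passed in explicitly so the logic is easy to test; a real supervisor would read a monotonic clock and trigger an actual VM reset.

```python
from typing import Optional


class ProgressWatchdog:
    """Trips when the supervised subsystem stops making forward progress."""

    def __init__(self, timeout_s: float, now: float = 0.0):
        self.timeout_s = timeout_s
        self.last_counter: Optional[int] = None
        self.last_progress_at = now

    def report(self, counter: int, now: float) -> None:
        """Called by the subsystem with a counter of work actually completed."""
        if self.last_counter is None or counter > self.last_counter:
            self.last_counter = counter
            self.last_progress_at = now
        # A repeated counter is a heartbeat without progress, so it is ignored.

    def should_reset(self, now: float) -> bool:
        return now - self.last_progress_at > self.timeout_s
```

The point of the counter is that a process which is alive but stuck keeps reporting the same value, so it cannot keep the watchdog quiet just by being scheduled.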
> Maybe they've got A/B boot partitions, but trouble counting their boot attempts - maybe they don't have any writable storage that early in the boot process.
Generally, A/B partitioning is part of the bootloader, the first program that executes after the reset (on many modern processors) pin is released. This also leads to reboot counters and such being stored as part of the NVRAM that is available at boot.
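A minimal sketch of that boot-attempt counting, assuming a handful of hypothetical NVRAM fields. The field names and the default of three tries are made up for the example; actual bootloaders each have their own layout for this.

```python
from dataclasses import dataclass


@dataclass
class Nvram:
    """Hypothetical NVRAM fields the bootloader can read and write at boot."""
    active: str = "A"
    fallback: str = "B"
    tries_left: int = 3
    confirmed: bool = True


def choose_slot(nv: Nvram) -> str:
    """Bootloader step: pick a slot, spending one attempt if unconfirmed."""
    if nv.confirmed:
        return nv.active
    if nv.tries_left > 0:
        nv.tries_left -= 1   # decrement before handing off, so a hang still counts
        return nv.active
    # Attempts exhausted: swap back to the known-good slot.
    nv.active, nv.fallback = nv.fallback, nv.active
    nv.confirmed = True
    return nv.active


def mark_successful(nv: Nvram, tries: int = 3) -> None:
    """Called by the OS, and only after the whole system has come up."""
    nv.confirmed = True
    nv.tries_left = tries
```

Because the counter is decremented before control leaves the bootloader, an image that hangs before it can write anything still uses up its attempts.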
Opinion: Maybe I'm biased, but maybe if you can't develop something yourself, there's reason for you to get an off the shelf option that handles a lot of these things.
Disclaimer: Former Googler, Worked closely with Automotive.
You do not report a successful boot until and unless the entire system loads up successfully. You will definitely have writable storage by then.
Owners should have to bring the vehicle into a shop to have changes made, and they should be very rare.
This is lazy control freakery of the worst kind.
Something very bad is going to happen and people will die before we realize that it is a stupid, dangerous practice.
There are a few different kinds of updates that can be applied, each with their own protective layers.
Infotainment updates, like what happened to Rivian, aren't that dangerous. You lose "convenience features" like maps, air con, etc, but generally nothing that could kill you or someone else.
Then there's system updates, which is where danger noodle things happen. Automotive manufacturers are significantly more risk averse to updating these components, and generally, if _anything_ within the system looks wonky, it's an immediate revert.
If I, as a Polestar owner, wanted to get an update for my vehicle, the nearest service center is 1.5h away. If I lived in Montana (United States), it would be realistically impossible for me to update my car. Thus, if we want to enable competition within the markets, we shouldn't have regulations that force a new manufacturer to have a global network just to add CarPlay to a screen.
It should be fine to push software updates out, as long as the correct safety and fallback procedures are in place. It simply has to be designed to handle failure and procedures need to be in place to mitigate risks.
It sounds like that wasn't the case here. Also, why wouldn't you have a small initial release pool when you have such a large potential for disruption?
What amazes me is that any grown up person thinks it is a good idea to update telephones as if they were software and not phones.
Or rather that it is a good idea to have phones that need updates? Either way, we're all one half-assed push update to a fridge, vacuum, washing machine, phone or car away from a really annoying day.
There's really no excuse from Rivian on this, this is shoddy
That just felt like a massive product to build and maintain for what really could have been backed by AWS iam. GCP IAM if they really really needed hierarchy. I guess I'm not surprised at this outage.
Not a bug in the software itself.
That is independent of testing the software, but still a distribution issue.
* "signed with the wrong cert" should mean the software package is rejected before it is installed.
* software upgrades are tricky and there should be at least 2 versions available so that fallback to the previous is possible and automatic in case of issues.
They should have had further staging of the rollout (randomizing when it is offered to users).
Edit: Also, why the heck isn't the entertainment system completely air gapped from the software running the car?
So far, only Tesla seems to be able to update car software remotely, regularly and reliably. I'm certain it's neither easy nor cheap.
All things considered, physical buttons and dials are probably easier and cheaper, because they don't require software updates!
If you absolutely must have updates, then at least not OTA updates. Have them done at the dealership or service center so any issues can be dealt with immediately.
Come on, is this engineering or hacking? This is a car, not a CRUD app. Get. It. Right.
OTA is better for consumer when done properly. Other manufacturers manage it fine, and one bad example shouldn’t be what we base things on. It’s what we should learn from and improve on.
I also mostly WFH so... yea. lol.
I am pretty sure there is a market for a dumb modern car, but no one is building it. I am thinking of an electric car without anything "smart" in it. Modern safety features can stay, if they work completely self contained and without requiring an external connection ever over the lifespan of the car.
They designed, built, and shipped all the hardware. There is ABSOLUTELY NO excuse for not having a database of the exact hardware configs by serial number. They have the ability to test every single shipped configuration.
If they don't, they have already failed as a car company.
The update servers almost certainly don't talk to that system though.
My Jeep Grand Cherokee has had OTA for 5+ years. BMW has been doing it since 2018.
I’m almost positive a family member had it with GMC OnStar back in the late 2000s.
Almost all automotive control modules have firmware, whether that firmware is parsing touchscreen inputs or a rotary encoder.
It's still a free market - these companies could choose not to put tech into their product. But look at the backlash against GM when they announced they wouldn't support Apple Car Play or Android Auto. Consumers want it.
Tesla, whose computer systems quite regularly need to be hard rebooted while the car is driving? That Tesla?
I still do love the car though.... but a very sketchy moment that I shouldn't have brought on myself while driving in that situation.
Not really. Vehicle computers aren’t vastly different on every model year and every trim level or option package. These parts are standardized, tested, and carried across model years.
Even with changes, the teams would be expected to have the different variants in their development and test cycles. The 2020, 2021, and 2022 model infotainment systems likely share a lot more in common than an iPhone 13, iPhone 14, and iPhone 15 with all of the non-Pro, Pro, and Max variants.
If it ain't broke it's ripe for disruption
else if (cpu == B) do other code
They invited the multiple combination vampire into their house. They know what devices are being used. If you don't want a dedicated update per piece of equipment, it'll be a large binary with lots of branching. Saying they don't know what device is where is just lazy. Ask the device what it is, and have a branch for it. If the device IDs itself as something unknown, don't do anything.
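Spelled out, the "ask the device what it is, branch on it, do nothing if unknown" idea could look something like this sketch. The device IDs and handler names are placeholders, not anything Rivian actually uses.

```python
from typing import Callable, Dict


def update_cpu_a() -> str:
    return "applied cpu-a image"


def update_cpu_b() -> str:
    return "applied cpu-b image"


# One handler per known hardware ID (IDs here are hypothetical).
HANDLERS: Dict[str, Callable[[], str]] = {
    "cpu-a": update_cpu_a,
    "cpu-b": update_cpu_b,
}


def apply_update(device_id: str) -> str:
    """Run the branch for a known device; leave unknown hardware untouched."""
    handler = HANDLERS.get(device_id)
    if handler is None:
        return "skipped: unknown device"
    return handler()
```

The important part is the default: an ID that is not in the table results in no change at all rather than a best guess.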
In at least one instance, they fixed the cars manually by running a massive remote command on all cars after a messed up update: https://lobste.rs/s/v42zil/former_tesla_employee_ssh_d_as_ma...
I wouldn’t call that very reliable, but they indeed do it regularly.
They'd never do that, except when they did do that.
I don't think a botched update is a big deal. It happens, and should be expected, in a sane design. The fact that the customer noticed is a big deal.
There are many implementations that could be used for an "auto rollback" feature. They either failed to implement that in a sane way, or they were goobers, and assumed things would always be rosy.
Rivian seems more like a "ship it and we'll fix it in the next sprint!" company.
How do other manufacturers handle updates?
Updating cars with new features OTA, even "just" an infotainment update, can possibly cost lives, because the driver might get confused and take their eyes off the road.
It should be forbidden and every change should be made clear to the driver, shown in detail, and should need verification twice before being accepted. There must not be any kind of surprise in a car for the driver.
It should even be possible to skip an update or stop updating at all.
What are you referring to? That is not relevant to this story, and would require a deep understanding of the system to make such a claim of negligence.
“The issue impacts the infotainment system. In most cases, the rest of the vehicle systems are still operational ...”
Also, you can't do an update while driving.
Not the specifics of this article, but more generally about the gravity of the situation car makers (and their software engineers) operate under. The very idea that an OTA software update that causes a bug within more critical features of a car could be life threatening. So my point isn't about the specifics of this particular bug, rather the capacity for a bug that could kill.
Also, yes, I'm specifically avoiding using the word "owner" above for obvious reasons.
https://www.autoblog.com/2022/02/09/seattle-radio-station-br...
So far I've had to take my Chevy Bolt to the dealership twice due to major software problems causing the "service needed" indicator to be lit (equivalent to "check engine"), and I've owned it for barely over a year.
The first time, some random bug made the car think there was something wrong with the transmission under some extremely specific set of circumstances, and as a safety precaution it would refuse to shift into drive if not serviced within 100 key cycles.
The second time, it was a bug with the software that manages battery health making the car think the battery had a severe problem. In that situation, as a safety precaution, the car refuses to charge above 40%, disables regenerative braking, limits the HVAC usage, and slightly limits max acceleration.
This is getting very irritating. I bought an EV because I thought it would require fewer maintenance visits to the dealer!
Maybe it’s auto company smoke but source: https://fortune.com/2023/10/27/tesla-elon-musk-hertz-evs-ren...
I'd like to please force any attackers to at least be within 50 feet of my TPMS, instead of being literally anywhere on the planet.
A car doesn't need data updates, and definitely not code updates[1].
1. source: every car built in the previous century.
I don't think this is accurate. Many advanced driving assistance capabilities need access to updated map tiles, which is a data update. They may need code updates to fix errors or shortcomings that can be detected only after deployment on extensive fleets or in response to changes to the environment/infrastructure. This is just one example for why data and code updates are needed.
I think it is more accurate to say that a "dumb" car with mostly electro-mechanical systems doesn't need data updates and definitely not code updates. But that isn't true for vehicles built within the last few years and definitely untrue for vehicles that will be built in the coming years.
Your phone (or GPS or even a paper map) can guide you; none of the following need access to map tiles:
* forward collision warning
* automatic emergency braking
* lane departure warning
* adaptive cruise control
* blind spot detection
* stability control
> code updates to fix errors or shortcomings
That's what recalls and TSBs have traditionally been for, and the driver can refuse them if desired. I mean, actual lives are at stake here. Would we (or should we) allow 737's to get OTA updates? Of course not. The target is too valuable and surface area too vast to adequately protect it.
(2005 is just an arbitrary date I settled on, nothing significant about it)
But who knows what these guys were doing. :/
I don’t really like or trust most (if not all) of the established automakers, but there is something to be said for having several decades (over a century in some cases) of experience building potential killing machines vs. a company that’s not even 15 years old. The established players have put out cars which suffered freak malfunctions, but Rivian (and Tesla) seem to be struggling more with QA.
Non-rhetorical question: do companies have safeguards for critical components like braking systems, or are they also prone to catastrophic failure if a software engineer pushes a bad commit?
They never deployed bad software updates but they sure have designed & deployed bad fuel pumps.
In some ways it’s all engineering and quality control.
This incident does NOT give me confidence that Rivian is likely to offer a better alternative to CarPlay, despite their statements otherwise.
I suspect the EX90 will be what I land on eventually.
I have complete faith that, 5 and maybe even 10 years from now, no auto maker will have delivered anything that can compete with either CarPlay or Android Auto. The fact that an auto maker thinks they can do better is a sign of a really high level of either arrogance or outright greed. Complete deal breaker.
For example, if you have a distributed system and you want to upgrade a component that every caller uses: you have a large exercise on your hands where you might have to roll out a change over time and then clean up your incremental branches where you have to handle two control flow paths through the code. It reminds me of Google's protobuf required field discussions.
It reminds me of repository-per-microservice and a Java library that other microservices use and updating a dependency and having to deploy the change to every service.
It's like trying to change wheels on a car while the car is moving or refueling a jet in flight.
Unison lang is trying to solve this problem I think, by allowing multiple versions of a function to be available.
Migrations in databases are painful too.
One solution I've thought of, which is probably overengineered, is that API call sites are abstract objects and their schema and arguments are centrally deployed. I called this a "protocol manager".
The idea is you write all your code to use a "span" and have contextual data in a span, and you can include or exclude data in a span with a non-software rollout. Your communication schema of RPC and API calls is a runtime decided thing, not hardcoded.
If you have N deployed versions of code and you want to upgrade to X, you have to test 1..N to X versions. So nobody does that.
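To put a shape on that, here is a small sketch that enumerates the upgrade paths you would need to test. Representing versions as tuples and only counting direct upgrades (no multi-hop paths) are simplifying assumptions.

```python
from typing import Iterable, List, Tuple

Version = Tuple[int, ...]


def upgrade_test_matrix(deployed: Iterable[Version],
                        target: Version) -> List[Tuple[Version, Version]]:
    """List every (from, to) pair for a direct upgrade from a deployed version."""
    return [(v, target) for v in sorted(set(deployed)) if v != target]
```

With N distinct versions in the field this yields N pairs for a single target, and the count grows again for every additional target or intermediate hop you decide to support.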
[0] https://github.com/fabianlindfors/reshape [1] https://reshapedb.com
- Better isolation of different parts of the system (e.g. infotainment unit, instrument cluster, et al).
- Better isolation for updates (e.g. run a "beta" update, and a "stable" update side-by-side).
- Automatic error detection and rollback (e.g. if a VM keeps restarting after an update).
- Ease of offering features like rollbacks to end-users.
- Rare hypervisor updates can be held to a much higher standard relative to other VM updates.
The only downside of hypervisor-based systems is slightly higher hardware costs. But even that is largely mitigated by modern architectures that natively support virtualization.
PS - You can also look to any containerization. I specifically brought up the XBox because it is a hardware product, just like a vehicle.
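The "VM keeps restarting" check from the list above could be sketched like this on the hypervisor side. The class name, the threshold and the window are illustrative assumptions, and timestamps are passed in so the logic can be exercised without a real clock.

```python
from collections import deque
from typing import Deque


class CrashLoopDetector:
    """Flags a guest that restarts too often within a sliding time window."""

    def __init__(self, max_restarts: int, window_s: float):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.restarts: Deque[float] = deque()

    def record_restart(self, now: float) -> bool:
        """Record a restart; return True if the VM should be rolled back."""
        self.restarts.append(now)
        # Drop restarts that have aged out of the window.
        while self.restarts and now - self.restarts[0] > self.window_s:
            self.restarts.popleft()
        return len(self.restarts) >= self.max_restarts
```

When `record_restart` returns True, the hypervisor would boot the previous VM image instead of retrying the new one, which is what makes the rollback automatic rather than something the owner has to trigger.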
Is there any reason not to do it this way?
Perhaps there are advantages to tighter integration with my car (at least the newer one) but IMO they are outweighed by the risks of things like this, or even just getting a software update that borks a small feature that I like.
Apple could potentially offer an API to have "reverse" CarPlay where the car's app can feed information into iOS. I recently rented an Mercedes EV which had Apple CarPlay and it was a weird experience having to manage two sets of experiences.
https://qz.com/1522309/how-chinas-electric-car-surveillance-...
https://www.consumerreports.org/lexus/what-to-do-if-your-lex...
It forced the company into bankruptcy because they had to replace all of them.
Or at least the ability to re-init/download from scratch, like a borked MacBook disk. And hey, don't make that an extra ability: make it "the way it works" so you're always testing it.
The future is going great.
https://discussions.apple.com/thread/253315438
With the mandatory mobile phone updates for a few years you're definitely going to see a lot more cases like that.
A thread about Tesla directly related to your question:
https://teslamotorsclub.com/tmc/threads/wholl-be-responsible...
Simple example: my Subaru was sold to me with an interesting design decision that caused the radio to come on whenever the car was started. This was not a bug. Every Subaru worked this way for years. A year into ownership I received an OTA update that added a “not playing” state on startup.
This was never a safety issue and was likely not a defect. It was, however, stupid and needed to be changed.
Thousands of test points having to be verified was my understanding. That’s before even getting to the confirmed boot/watchdog aspect.
What a hassle, hope they like spending money on labor because it sounds like they are going to need to.
This is what I do with my Prius to get a comfortably distraction-free driving environment. Sounds like a feature not a bug.
Does anyone here have some practical tips to turn an embedded Linux machine into an appliance? The kind of system that a botched update cannot brick but only momentarily disable until a non-technical user presses a factory reset button of some sort.
Lol
I suppose this is the negative about having sensors that make sure water gets hot enough to be sanitizing, but not so hot that it wastes energy. And I'm sure you can imagine 100 other uses of having a microcontroller/CPU process data and do feedback. (I'm sure there are EE only ways of doing it, but theoretically possible and useful are two different things)
What a time to be alive. Software updates (almost) turning cars into paper weights lol
I'd love to be a fly on the wall at Rivian engineering/operations this week!
Like, what do you mean "in most cases"? I can understand a broken infotainment needing a reset, but imagine if you had to tow your truck. I'd be furious.
All I need is a gauge cluster screen that can display the normal info like speed and heading while also letting me configure the car's performance and safety features. Then let me mount a double DIN radio that isn't dog shit. I've not seen a single new car with these dumb screens with a sound system that's not tinny muddy garbage with zero adjustment save for "bass" and "treble" settings. I mean all that technology and you can't be assed to put an eq in there. HVAC never needed more than two or three knobs anyway.
The speedometer screen is gone, so does that not imply the vehicle is inherently unsafe to drive?
So without more info we cannot know if it is accurate or not.
> That’s the last update we had over 10 hours after Rivian customer vehicles were fed the bad software update.
"Over 10 hours"!
I suppose it isn't Tesla, who yeets updates over the fence that break new things, yeets another update that fixes that problem but introduces another one, then reverts back to two versions prior, before the issue. The Tesla that gets firmware fixes from vendors that have a test harness that should take 36+ hours to run, but says YOLO and flashes it onto a random car they have lying around and emails the vendor back 3 hours later saying "LGTM, WFM, thanks!"
Getting cute with basic stuff like tail lights is forgettable or annoying at best, and absolutely can be dangerous.
[0]https://jalopnik.com/congratulations-mini-you-made-the-stupi...
Sounds more like you’ve just bought into the doom and gloom that a few specific news outlets have been pushing.
Car companies suck at tech. Let’s be realistic. They should stay in their lane and focus on improving the car and physical aspects (safety, reducing carbon output, longevity, ease of repairability, reducing supply chain issues).
I'm not aware of any Tesla OTA updates bricking the infotainment system. At least since I've been paying attention. I don't see them quite as similar as you suggest.