Somebody always wants to undelete something, or examine it to see why it was deleted, or see who changed something, or blah blah blah. It helps the business, it helps you as developer by giving you debug information as well as helping you to cover your ass when you are blamed for some data loss bug that was really user error.
Soft deletion has obvious drawbacks but is usually far less work than implementing equivalent functionality out-of-stream, with verbose logging or some such.
Retrofitting your app and adding soft deletion and audit trails after the fact is usually an order of magnitude more work. Can always add it pre-launch and leave it turned off.
If performance is a concern, this is usually something that can be mitigated. You can e.g. have a reaper job that runs daily and hard-deletes everything that was soft-deleted more than n days ago, or whatever.
Though you really shouldn't be relying on a database for an audit trail. It might help find some issues, but things actually used for security shouldn't be writable so easily.
I think this is the part they miss. I've never
undeleted a user either, but there have been many
times I've gone back to look at something.
Yeah. As far as a user-facing "Undelete" button existing or being used... that's very rare in my experience.What's much more common is a user accidentally deletes some data. They deny they made an error. The developers are blamed. You then have to go on a wild goose chase figuring out if it was possible for the app to actually screw up in that way. There's usually no definitive answer, and even if there is, management can't understand it. And regardless of how any of that plays out, you still probably have to try and recover the data from backups or something.
Alternately, maybe it was the app's fault. Still plays out nearly the same!
Soft deletes and/or audit trails save you from all of that.
Though you really shouldn't be relying on a
database for an audit trail. It might help
find some issues, but things actually used
for security shouldn't be writable so easily.
I mean, at some level you need to trust the database right?Been ages since I did it, but it's usually possible to set up a "secure" audit trail with use of database permissions. For example, the application's DB credentials can have SELECT and INSERT permissions on the audit trail table, but no UPDATE or DELETE perms.
How would you set up a secure audit trail that didn't rely on the application and/or database at some level? Even if it lives outside of the database, that data came from the database.
Not a rhetorical question. Genuinely curious!
I, for one, have undeleted things tons of times, taking them of the trash can before emptying it, undoing the delete action (in apps where this is possible), and so on.
The exceptions are when there is a well tested query that affects a single account or something. Like GDPR
But did they though?
"Although I’ve never seen an undelete work in practice, soft deletion wasn’t completely useless because we would occasionally use it to refer to deleted data – usually a manual process where someone wanted to see to a deleted object for purposes of assisting with a support ticket or trying to squash a bug."
You cannot just keep user information forever "just in case" they are useful again.
Also note, if you use a soft-deleted column, indexes need to be keyed by that column as well if you want to access non-deleted objects quickly. That's extra complexity.
but that is less of a cost than having to add "deleted=False"
predicates in all of your queries.
Maybe or maybe not. You can use a view. Or you may be using an ORM that lets you set a default scope (essentially, a default WHERE clause - ActiveRecord lets you do this)It also depends on if you're designing an app for this from the ground up or if you're trying to retrofit an existing app with 90 million different hardcoded queries.
Also depends on what you want to do with the deleted records. Do you want to do things with them? Maybe after a user is deleted, you want them to be able to log in, but you would prefer them to see a "sorry - your account has been deleted" message instead of "user not found." Maybe you want your support staff to be able to look at deleted users. Etc. Now you need to update your app logic so that it's flexible enough to look at "users_deleted" and "users". Which may be at least as onerous as messing with the WHERE clause on every single query.
To be clear, I don't hate the "another table" solution. It's the right choice in a lot of situations IMO.
It's like people have forgotten what views are.
I don't actually like using a "deleted" column, my standard table has a status column, and deleted is one of those states, along with active/pending/suspended/etc, as the needs dictate. This way I get soft deletes for basically free both in the schema, but also in the queries (which would generally default to active), so it's not really the spaghetti that the author discusses.
Suppose I soft-delete an active, pending, or suspended user using your scheme.
Now I need to un-delete the user. What status should they have? We don't know.
This is another "best practice" I've learned over the years. Those flags/statuses you think are mutually exclusive? Maybe they aren't. Or maybe they are now but won't always be. It's usually easier in the long run to give each status its own column even if you think they'll always be mutually exclusive. Because I mean, what are you really saving? A few bytes per record? In most cases that's not worth it.
EDIT: Soft delete is a trivial piece of code when the framework has a well defined transaction system for its ORM. It's not really related to Rails per se. Your statement is extremely disingenuous, while trying to look smart. Audit trails _can_ be(but don't have to be) more complex, especially when the framework uses a lot of stored procedures to handle operations. But other than that these frameworks are specifically designed to REDUCE complexity of such operations, dependency costs - which are huge in node.js, specifically because you can mix and match anything into everything.
Node.js people tend to stitch together XSS solutions, random templating solutions based on their frontend work, even basic salting of auth passwords becomes unpredictable because you have 30 options on minimal auth libraries.
But yes nothing is ever free. If you want to use Rails you still have to learn ruby and the framework and a basic understand of how ActiveRecord builds queries if you want to be writing performant code. And the same applies to Laravel, Django, or whatever of the 50 patchwork node.js solutions you want to base your code on.
Also you're comparing language/runtime with an actual framework and then dogging those users...
If you want to compare Rails with Node/PHP then I'd suggest comparing with things like Laravel (PHP), Adonis (Node) and you'll find everything you can do in Rails is done in Node/PHP too.
Which is funny because Laravel give you this for free as well through it's ORM. Soft deletes are an easily solved problem with $table->softDeletes() in your migration.
"Soft deletion" and "audit trail" are technical terms we developers come up for solutions the business wants but maybe hasn't asked for yet. It's not really a soft deletion it's a "deactivate" or "hide". Likewise, it's not an audit trail it's a "history" or "undo". Most of the time your stakeholders and users actually want these features, but don't ask because they perceive this as more expensive to build then just a "delete" button.
Depends on the industry. The one I work in audit trail is a well-defined and mandatory business concern.
I think Datomic is neat, and I’d like to use it, but it is prohibitively expensive for a personal or hobby project. Personal projects are where I get excited about tech, and when I’m excited I’m more likely to adopt it in my day job.
They’re really shooting themselves in the foot by not having a one-click install free tier or a self hosted option.
I really like the concept of datomic though.
I can't wait until Postgres has this kind of functionality baked in. It's such a nice feature.
But... your experiences are also real and I believe you when you say that your experiences DO support your conclusion. :)
The "soft delete" design decision is usually pretty impactful and is in my experience potentially much more of a pain in the butt to implement later if you haven't included it from day 0.
Audit trails and soft-deletes are also crazy useful for developers (both for debugging and for general cover-thine-ass utility) even if end users never touch them.
Whereas, tags are easier to tack on later and are not intrinsically useful to developers.
> Somebody always wants to undelete something
> or examine it to see why it was deleted
> or see who changed something, or blah blah blah
People for whom this resonates should look into Dolt: https://dolthub.com/.It's a mysql-compat database that is versioned and forkable - basically imagine git and mysql had a baby. Every transaction creates an entry in the commit log that lets you see when, how, and why the database changed. And just like git you can `dolt checkout <hash>` to see what the data was like at some point in time, or `dolt checkout -b temp-branch <hash>` to make exploratory changes to some historical version, or `dolt revert` to revert a transaction... etc.
There is a lot more power that comes with making the entire database versioned and forkable by default. For example it makes it much easier to recover from catastrophic bad code pushes, etc.
note: Dolt was forked from Noms, my previous project, but I don't work for Dolt or have a stake. Just a fan.
Yes.
To note, with GDPR there's now legal reasons to do so regarding user personal data. That can be the moment the devs realize they actually can't delete the data, because they soft deleted for so long, many relations are now interlocked and the data model needs to be changed to give a starting point to the deletion cascade.
My lesson from that was to at least have one test deleting a mock user that spans the maximum data breadth of the service . We caught a bunch of these loops in test at dev time and that was pretty great.
My lesson from that was to at least have one test
deleting a mock user that spans the maximum data
breadth of the service . We caught a bunch of these
loops in test at dev time and that was pretty great.
Thank you. That is SUPER insightful. If it wasn't too late to edit my initial post I would add this.In my experience this happens “rarely”, not “always”.
It can happen, and in some ultra-rare cases the impact of not being able to recover some data might be huge (company-ending, even), and engineers are good at worrying about such edge cases. That’s why we habe protective measures like soft deleting and event sourcing - because of nightmare edge cases, not because we are always having to actually use them. It’s driven by engineers avoiding their worst nightmare: having to say “I’m sorry, I cannot solve this problem for you. The data is gone.” It’s a peace-of-mind thing, not an everyday-need thing.
IMO soft deletion is a hack trying to fix problems in CRUD, which is a hack in itself.
CRUD attempts to model everything in the universe as a collections of mutable items, loosely based on RDBMS/SQL.
Event Sourcing is more realistic: it models everything as a append-only logs of immutable events/facts, which preserves both the historical data and, more importantly, the original intent.
Unlike CRUD, Event Sourcing is technology agnostic.
I don't think soft delete is wrong per se, but it is something that should be native to the database engine.
This allows you to do a `select * from users where deleted_at is null` to get the active records, but also know when the user was deleted if you need to audit / rollback.
- How many records are there? Enough that performance will be an issue? If I'm planning on keeping things in the same table, can I utilize database features like indexes and partitions to mitigate any perf issues? (For some access patterns, partitions might solve 100% of your perf concerns)
- How common is deletion? Are we expecting 1% of the records to be soft-deleted, or more like 90%? If it's the latter, you may not want all those records clogging up your main table.
- Is this greenfield development, or am I adding soft-delete to an existing app? Greenfield favors "same table" soft delete; if you're retrofitting an existing app it may be better to keep deleted stuff in a separate table so that you don't break existing functionality.
- What do you want to do with the soft-deleted records? Are there times when you want to treat them just like regular records, e.g. "Give me a list of all the users who joined last week, even if we've deleted their accounts?" If the answer is "yes" then a lot of those things will probably be easier if you keep deleted and non-deleted in a single table.
But there’s always something that needs to be undeleted. You can either have an easy way to do it or restore an entire DB backup and cross query the missing records. Soft deletes are a lot easier.
I feel like a lot of developers underutilize the capabilities of the massively advanced database engines they code against. Sure, concerns about splitting logic between the DB and app layers are valid, but there are fairly well developed techniques for keeping DB and app states, logic and schemas aligned via migrations and partitioning and whatnot.
This is one gripe I have with soft-deletion. Since I can no longer rely on ON DELETE CASCADE relationships, I need to re-defined these relationship between objects at the application layer. This gets more and more difficult as relationships between objects increase.
If the goal is to keep a history of all records for compliance reasons or "just in case", I tend to prefer a CDC stream into a separate historical system of record.
Cascaded deletes scare me anyway. It only takes one idiot to implement UPSERT as DELETE+INSERT because it seems easier, and child data is lost. You could always use triggers to cascade you soft-delete flags as an alternative method, though that would be less efficient (and more likely to be buggy) than the built-in solution that cascaded deletes are.
If you look at how system-versioned (or “temporal”) tables are implemented in some DBMSs, that is a good compromise. The history table is your audit, containing all old versions of rows even deleted ones, and the base table can be really deleted from, so you don't need views or other abstractions away from the base data to avoid accidentally resurrecting data. You can also apply different storage options to the archive data (compression/not, different indexes, ... depending on expected use cases) without more manaully setting up partitioning based on the deleted/not flag. It can make some query times less efficient (you need to union two tables to get the latest version of things including deleted ones, etc.) but they make other things easier (especially with the syntactic sugar like AS AT SYSTEM_TIME <when> and so forth) and yet more things are rendered possible (if inefficient) where they were not before.
> I tend to prefer a CDC stream into a separate historical system of record.
This is similar, though with system versioned tables you are pretty much always keeping the history/audit in the same DB.
---
FWIW: we create systems for highly regulated finance companies where really deleting things is often verboten, until it isn't and then you have the other extreme and need to absolutely purge information, so these things are often on my mind.
Mostly I use soft-delete because for auditing requirements we pretty much can't remove anything but also because nothing ever truly goes away. If we have an Invoice or Order then, from our perspective, we must have those forever even if the corresponding client is deleted and can never place another one.
If you use soft deletes on all tables, you can also cascade them as long as you either cascade updates to the real keys as well, or prevent such updates, by having a deleted flag column on each table, including it in a unique constraint with the actual key column(s), and including it in the foreign key.
This is merely annoying when dealing with regular views because recreating even a large number of views is fast, but can be catastrophic if you have any matviews in your table dependency tree. A matview can easily turn what should be an instantaneous DDL operation into a partial outage while the matview is being regenerated.
(this is all postgres specific, it may be untrue for other systems)
You're not wrong, especially with the second part. I.e., deeply nested or convoluted dependencies between views can definitely make it awkward or painful to make adjustments near the root of the tree.
When I started this reply I was going to say "I hear you, but it's not an issue I run into very often". But that's not true. I've actually been burned by that moderately often, and have sometimes avoided or redesigned the root-level table change to avoid having to propagate all those changes to the rest of the dependency tree.
That said, in my experience (also mostly with postgres for this context) I feel like that's usually been more of a developer laziness issue (my own laziness that is), rather than an "ossified schema" issue. It's definitely a PITA when some simple change is going to break a dozen inter-connected views, but that's a coding issue not a deployment issue almost all of the time.
To be fair I don't really use matviews very often, but for true basic views I am guessing that the actual execution of the DDL to rebuild changed views is manageable in all but the most extreme cases. Even then there _should_ be a maintenance window of some sort available.
Thinking this thru a little bit, I believe the "anti-pattern" you're warning against isn't really views themselves but deeply nested/interconnected views (views that query other views, etc). I use views often (for this logical-delete type idiom for example) and I have rarely regretted it. I have often regretted creating complicated view-base dependency trees however, so I think I'm wholeheartedly in agreement on that point.
if you want view depending on mat view - materialize it yourself in a table, and refresh it yourself controllably.
``` CREATE OR REPLACE VIEW active_customer AS SELECT * FROM customer WHERE deleted_at IS NULL OR deleted_at <= NOW() ; ```
There, I fixed it.
Just use `active_customer` instead of `customer ... deleted_at IS NULL`.
In fact, since the deleted_at column is a timestamp, the original "leakage" query:
``` SELECT * FROM customer WHERE id = @id AND deleted_at IS NULL; ```
is actually broken. A non-null `deleted_at` timestamp that's in the future implies the record hasn't been deleted yet, right?
I've often had junior devs assert that views are some kind of code smell, but these sorts of "canned query/filter that you want to apply very very often" seem like the perfect use case for a view to me. It's DRY, and the fact that your standard "query" is in the database" means you can change it more readily than trying to make sure you hit all the points it might be embedded in the application code.
> I feel like a lot of developers underutilize the capabilities of the massively advanced database engines they code against
Early-ish in the JDBC days a senior dev I was working with at the time (as a junior dev myself) made a pretty good case that "the database is part of the application" that's always stuck with me. Full database independence via software level abstractions is a pretty silly goal outside of library code. If you have a service that makes extensive use of the database, don't throw away the database features in the interest of some abstract "we could swap out oracle with mysql without changing anything" objective. If you want it to be generic, use the SQL standard, but don't be afraid to have a few db-specific bits in the app code if that's a subsystem you might replace once a decade or something.
I blame the DBA/Dev divide for a lot of this. A lot of the impedance between these layers is social/procedural. If you can change the DB as easily as the code, there's a lot less fear of using the right tool for the specific job.
100% this. If you accept that the database is part of the application, you give yourself permission to use the full feature set of the database, and life becomes a lot simpler. Using views, stored procedures and other features lets you implement things like soft delete trivially, without it infecting all your application code.
In my entire career I've changed backend databases for an application exactly twice. It's not easy, and no amount of abstraction is likely to make it easier.
Even with conditional indexes where you exclude deleted data you take a significant performance hit reading dead blocks because there is no way to quickly vacuum them. You accumulate hours of bloat until your vacuum finishes.
You can't beat a separate insert only archive table which you never have to vacuum.
* hot and cold do not churn each others indexes or tables, you effectively have only one set of indexes (and data) that's actually churning and the others are stable.
* hot and cold can be treated differently - you can perform more aggressive indexing on data once you know it's out of the hot-set, or partition your hot data onto a tablespace with dedicated hardware while cold data is on bulk hardware, etc. Since queries are planned onto each partition individually, postgres can often select "the right way" for each partition.
* "deleted_at" is a special case of cold table. Dropping any partition is basically free, so if you partition into date ranges, then at the end of your data retention period you just drop the partition for that date range, which doesn't churn indexes or induce vacuuming
If data can never exit the cold-data state, then it's effectively an append-only table too, it just exists as a supplement to your online/hot-data table but it doesn't require special attention/etc. So we're in agreement on that point, that's a good design feature if you can swing it!
(note that for audit logging, I think it's simpler to just do the separate table. But the partition strategy does have some advantages for "cold" or "dead" data as a more generic concern imo)
However, indexes do currently have to be scanned as a whole. But that's only done by autovacuum if there's enough row versions for that to be worth it (in recent versions).
It couples nicely with some hackery which turns removes into soft-deletes. You can remove objects as usual and they get soft-deleted in the database.
I've used this in a few projects and it's fantastic.
https://docs.microsoft.com/en-us/ef/core/querying/filters
https://www.thereformedprogrammer.net/ef-core-in-depth-soft-...
Hi, <1 yr experience swe here. Would HN mind unpacking "whatnot" with specific names of some these techniques?
You could have views that say 'thing I have a foreign key to is not deleted' of course, but that sort of seems like 'code leakage' again, just in SQL this time.
So true. There are so many amazing, powerful features in all of the major players.
Also: updatable views are amazing. With query rewriting (whatever you vendor calls it) you can affect some truly material changes to the system without any changes to the client applications. An example would be implementing temporal relations.
I agree that one should make use of RDBMS capabilities. A check constraint may be practical instead of (or in addition to) the foreign-key constraint.
This is the way. Also, save record creation timestamp, and you can have very flexible "time-machine" selects/views of your table essentially for free.
What's an appropriate naming convention?
Should I do it universally and put transparent views in front of all my tables so I don't have to refactor my code to point to new views whenever I do suddenly need to put in a model constraint that isn't 1:1 with my data layer? Is a transparent view a no-op in terms of perf? if it matters, this is being done in Postgres
I will probably make my constraints partial over the not-deleted records, particularly for unique constraints used for upserts. Am I about to footgun myself? Is it even necessary with the new uniqueness behavior with NULLs being implemented in postgres? Will my performance characteristics be better one way or the other in particular circumstances? It sounds like if I have a high ratio of deleted to not-deleted records a partial index becomes necessary.
Everywhere I have worked people know a decent amount about their data store. Not architects, just mid devs and higher.
I'm with the author on this one. Any soft delete logic does have a tendency to bleed into other systems and make your systems more complicated, for very little gain.
That's one part. The other part is that in many industries you have regulatory data retention and audit requirements. This is arguably the most valuable and common reason to perform Logical deletes.
Hard deletes retain no memory of what you wanted to be gone, so any malfunctioning sync process will continuously recreate the deleted record soon after it's deleted. Soft deletes are often the only way to make sure deleted records don't reappear.
And it will go a long way to making your services harder to use if you don't allow users to associate friendly names with things. And to assume that the same friendly name will be used for a future item. (For example, if you name devices based on the room you put them in. Is reasonable to think that when you replace a device, that you are likely to want to reuse the name.)
This also helps with the original goal of making them safer by manually implementing "eventual consistency" for data living outside the transactional world.
Event sourcing is hard – because the engineers responsible for setting it up and managing it aren't generally well skilled in this domain (myself included) and there aren't a wealth of great tools helping engineers find their way into the pit of success.
The downsides of soft-deletes (as identified in the article) are numerous. The biggest problem is that it appears "simple" at first blush (just add a deleted_at column!), but it rots your data model from the inside out.
If accidentally writing the wrong query is a problem, then writing the wrong query is your problem.
At my startup, soft deletes for our SKUs are critical, because we work with data sources where notoriously both the technical systems and the humans driving will all-too-frequently accidentally represent something to our connection as deleted. Or there might be an irrecoverable error when asking "what things are still active upstream" - but that doesn't mean the SKUs are deleted, we might just not have certain live details until a bugfix is made. So "error status" and "soft delete" are somewhat synonymous, and both require investigation into root causes and root intents. Yes, the concept of "unerrored and active" is peppered through our codebase and analytics - but our ability to recover from supplier technical mistakes is much higher as a result. And we could absolutely do this with an event sourced system - but the tooling for relational databases is so much better, it's night and day.
Agreed, sometimes it makes business-sense to implement it, but in the big picture it's still kludgy and not-ideal.
While full-on event-sourcing isn't always the answer, once business-rules prevent you from un-deleting anything there's not much point of having all those dead-rows interspersed in your regular tables.
It was 100% our fault. But soft deletes saved us that day. If you're in a situation where you or your customers could benefit from the same, it's wise to not only embrace them but also make sure they work.
For context, I've worked in fintech where I often needed to review backoffice approvals, transactions, offers, etc.
And since you can never really be sure what you'll need 2 years from now, I imagine there are a lot of anecdotes out there of people who implemented it thinking it would be used a lot, and turned out to be wrong.
But the easiest way is to deactivate the user account (is_active boolean) and continue to reference the user in internal records.
ALTER TABLE blah ADD COLUMN deleted_at NULL TIMESTAMP;
ALTER TABLE blah RENAME TO blahwithdeleted;
CREATE VIEW blah (SELECT * FROM blahwithdeleted WHERE deleted_at IS NULL);
And thus your entire application just needs to keep SELECTing from blah while only a few select pieces of code related to undeleting things (or generating reports including deleted things) need to be shifted to read from blahwithdeleted.Furthermore, the cost of an error is potentially massive. Someone new at the company makes a revenue report based in the billed Invoices and does not realize they should query the view and not the table... Not great if 90% of all invoices belong to soft-deleted customers!
The author is right; soft-deletes are probably most definitely not worth it. There are many better ways to solve the problem.
The disadvantage of this is that if you ever do want to access this "deleted" data, e.g. in admin or compliance tools, you now have to do it in two different ways, one way for the main data and a different way in case the data has been "deleted".
The article asserts you'll never need to "undelete" the data. So they're presenting a solution with that assumption, fair enough. Without that assumption, however, moving the data back from an archive table becomes a pain, and if there are any unique constraints e.g. on username or email address, you'll have a problem if you've moved the data out of the main table and another user has used that username or email address.
IMO it's worth distinguishing between (A) some kind of "click to undelete" feature versus (B) simply having that old-data conveniently exposed for a developer to manually-edit things or craft database-change scripts.
In practice I've only ever seen the latter get used, because it requires a developer to figure out how the heck to get "the parts that matter" back while preserving the integrity of other newer data and obeying certain business-rules.
Use a view.
> if there are any unique constraints e.g. on username or email address
Have those in a dedicated table where they aren’t deleted, and add a synthetic key referenced by the other tables.
I can’t think of a single case where you’d want to remove the invoices of a customer you delete. Ever. In fact, the opposite is more likely to be a big problem, accidentally cascading your delete to your financial records!
Using a soft delete, your invoices won’t “disappear” because your app WILL have a view for looking at just the invoices.
Source: I built a bookkeeping system and soft deletes is a necessary feature.
CCPA will require you to delete the invoices. And I would love for all platforms to support deleting everything, including invoices, considering some things are illegal in other places and if there's proof of you buying said illegal thing, you can get in serious trouble (think gay dating apps in the UAE).
But I don't really agree with the author on his take about soft deletions.
https://docs.microsoft.com/en-us/sql/relational-databases/ta...
That works for updates too, by preserving the old data and showing you a time machine like backlog. But the archive database gets too large over time and you need to purge it periodically. You can create some delete triggers for automating this "save before delete" behavior.
E.g. you have 3 users sign up with the same email (a unique field) one after the other with deletions in-between each sign-up?
I don't get this statement. You wouldn't have had the env or data without soft delete? You did use it!
I would say, soft delete isn't a tick the box and done solution as many ORMs make it.
You need to consider the data model, and adjust your queries to that.
It may make sense for a product to be deleted, but orderlines still able to access it to display product name etc.
With blob data, I tend to move that to a "bin" with a 30-60 day grace period. Customers know quickly reporting, we can fully recover, while outside that time they'll have to provide images etc. It's a decent compromise.
Reuse of unique fields is the sticking point I run into often, as mysql interprets null as not clashing with other nulls so composite uniques using the ID and deletion date don't work.
Which is correct per SQL. Null is NaN, not zero (or negative infinity).
- Feature creep. "Sometimes our users accidentally hit the delete button, or change their minds a minute later. We want to give them a way to undo the deletion." Or "I know we said last quarter that we users want to delete stuff, but they also want to see a list of everything they've deleted in the past." Soft-deletes handle feature-creep a lot better than hard-deletions
- It simplifies foreign-keys management. If you want to hard-delete something that some other entity is referencing, you'll have to hard-delete or modify that other entity first. And potentially repeat this process recursively for their own references. This is a pain. One could argue that if you really want to delete something, you should be deleting all children as well. Such arguments are highly domain specific, and very bad universal claims. We've seen some use-cases where such pedantry is not necessary
- It makes it easier to recover from mistakes and bugs. Customer deleted something accidentally and emailed you begging for help? Your code has a bug causing stuff to get deleted when it shouldn't be? You'll be thankful you did a soft-delete and not a hard-delete. Is it going to solve every single problem where the data has system-wide ripple effects in a unicorn sized organization? No. But it'll still solve a number of problems where the data impact is more localized
- It makes debugging easier. You have a clear record of everything that used to exist. You don't have to go digging through your logs to find something that used to exist but has now been deleted
- Speed. All of the above problems can be solved in other ways too. The author suggests putting all deleted data in a "deleted records table." So now you need to maintain a 2nd table for every table that you may want to delete stuff from. All schema updates will need to be mirrored on this 2nd table. And you'll need to write and maintain code to populate this deleted-records-table every time you delete stuff from the original table. All doable and straight-forward but takes time away from other things you could be doing instead
The main benefit from hard-deletions is data compliance and liability. Ie, being able to tell privacy-conscious customers that you actually deleted their data. If you're handling any sensitive data, you should definitely do hard-deletions at some point for this reason. But the other reason the author gave - "it's annoying having to check for `deleted_at` when writing SQL queries" - seems pretty minor compared to the benefits.
1) Client wants to remove user from the system who have left their org but
2) There are objects that were contributed by that user which are required to persist beyond the user's deletion.
Those are ideal cases for soft deletion. We can still query information about the deleted user to explain who created this object, with the note that their account has been deleted.
Probably I should be doing full event-sourcing for this case, but delete flag works well. MS offers temporal tables for this use case and I'm still considering the implications there -- AFAIK ORM support is WIP.
And unlike the article author, I have used soft deletion to undelete things. Many times. Maybe he has better users than I do, I don't know.
If disk gets to full, project the latest time slice into a new database and move the old database onto a cold storage.
I've probably restored data from backup maybe 4 times in my career. I greatly prefer to do this on the rare one-off scenario than to deal with the overhead of soft deleting everything.
> I've probably restored data from backup maybe 4 times in my career.
Yet, I often use soft-deletes because it allows people to undelete things from the software interface and not call me all day long.
But that's not the most common reason I have for them. Normally it is because the data just can not be gone, and the full table is still important somewhere.
I think people arguing for and against soft deletes both understand that there are cases where you want to use them and when you don't.
- For many applications, it's easiest to put the state of the object in the primary key, and thus point reads will fail when something gets deleted. This has other problems though with hotspotting and compaction during deletes. The deleted table doesn't really solve this either.
- For storage systems, GC is critical functionality to implement. Most systems whether they want to believe it or not are glorified storage systems. Garbage collection is hard to do at scale, and I've never seen it implemented as SQL statements rather than code. Especially for GDPR etc.
- For large scale distributed systems, foreign key constraints are rare if impossible to implement with reasonable latency, so they don't exist either way. I haven't worked on a system in >15 years that had fk constraints.
- For large scale restores where you need to undelete trillions of rows, keeping the rows basically pre-assigns the distribution of writes. When you have to re-create the rows, you tend to get intense hotspotting and failures along the way as you attempt to load balance on the keyspace of the writes.
A deleted records table is good for smaller (<10000 machine) systems when latency between nodes can be kept within the same campus. It can really improve performance of your GC if reading by column isn't fast compared to reading by table.
Then you could do
FOREIGN KEY (foreign_id, NULL) REFERENCES foreign_table(id, deleted_at)But... weren't you using all those env and data info from the soft-deleted set?
I've typically been using soft-deletes for most projects for years. People have accidentally deleted records, and having a process to undelete them - manually or giving them a screen to review/restore - has usually been great.
Yes, if there's a lot of related artefacts not in the database (files/etc) that were literally deleted, you may not be able to get them back. But that's an ever greater edge case in projects I work in as to not be a huge issue. We probably have some files in a backup somewhere, if it's recent. Trying to 'undelete' a record from years ago - yeah, likely ain't gonna happen.
People are used to 'undo' and 'undelete'. Soft-deletes are one way to provide that functionality for some projects.
Once your main tables start getting to the order of tens of millions of records, the filtering by 'deleted_at is NULL' or 'deleted_at is NOT NULL' gets in the way of query performance.
NULL is also not indexed. So, that throws the spanner in the works sometimes (depending on the query).
Most of these downsides are easily mitigatable issues as well. As many users have stated, something like views solves the issue of forgetting the 'deleted' clause.
Lastly, I'm not sure the issue with foreign keys/stray records really resonates with me. I'd be hard pressed to be comfortable allowing a developer or DBA who isn't fully comfortable with the data model to be hard deleting records, let alone flagging them as soft deleted.
We don't need to have `where deleted_at is null` on every single query. But the best part though is our actual working data set of records we actually care about is tiny compared to the deleted cruft that would have otherwise been just sticking around forever. Backups and restores take no time at all. It's really cool that postgres lets you have conditional indexes on things, but it's even cooler not to need them.
Source: I write back of house software for resale store owners, and accidental deletes happen occasionally. Being able to restore things instills a lot of confidence for our customers.
It's surprisingly manageable. I mean, yes, it's definitely the largest table in the db, but:
1. it's well worth it
2. most of the stuff in it isn't the main scenario above (a human does something and I record the change) but various automated processes I also want to track, like API calls. which leads to:
3. it's easy to prune - both in time period kept, and by selectively deleting the automated stuff earlier
But it mostly helps by localizing things. It's just one meta-data log table, and everything related to logging actions is there. Not very elegant to keep adding fluff fields to every table, like "add_date" or "deleted_at". When I decided I want to also track the URL of the request I had to change things in just one place, and now I have it for every action everywhere.
Note: don't fall into the "everything is a nail" mistake. Some other dedicated log tables may be necessary, for high-volume or distinct stuff. I also have a mail_log, a sms_log and a separate table for events coming from mobile users (like location history).
For foreign key constraints you can set the foreign key to null and orphan the records if the relation is deleted. You could also hard delete them in this case. It depend on your use case.
When the data volume grows or the ratio of soft deleted to normal records is high, you should consider another solution. One solution you suggested, moving the record to a deleted table is a fine one.
The other solution that I've used successfully is to journal your deletions in another table or system. For smaller volumes having an audit table Journaling the data and storing the pk, fkeys, and a serialized version of the record, json works great in postgres, works well. For large volumes or frequent deletions something like Kafka or PubSub work better.
You may very well find others interested in consuming your audit journal to track changes. Updates and even inserts fit great in the more general case.
Either way, it allows you to have soft deletion and/or full activity logging functionality without the application having to know about it.
I don't get how this rocket science. Almost every query in the system is some kind of where clause on a fk to account or user or project or some other critical object ... so there are only a few places in the ORM where I need to support this.
So yeah, be careful with the soft-delete pattern :)
These days I use a deleted table as per the article as I decided it would be better to deal with the more complex undelete process. It keeps that process to a single section instead of spreading it all throughout your database.
Some of the suggestions here like "use views" don't really work for two reasons - sometimes the is_deleted check must be performed in the ON clause, not in the WHERE clause, and sometimes you want to count the deleted or show the deleted, while other times you don't.
This not a problem, its is almost always what's desired, otherwise you have no records for, for example, the tax auditor.
Obviously when, say, an employee leaves basically all things they did on a corporate system can't disappear. Any documents they created/updated still need to be accessed, their git history/commits can't disappear.
When you switch classrooms you don't want all the events that ever happened in the old classroom to disappear.
This sort of systems are the kinds of systems I've worked with my entire career. Undeletion happens all the time too (employees get rehired, for example).
Most computer systems aren't B2C free social media sites where you CAN just delete anything you want because no data is important.
At least for our use case, soft deletes made everything slower because it's much harder to index. For our database we basically had to do an audit of all of our WHERE clauses and create partial indexes on "not yet deleted" records. Of course, this bloats your indexes/disk and hurts write performance so it's not a silver bullet.
We've also taken to inserting into "delete records tables" for records we may want to recover or for historical reasons. You still lose foreign keys but indexing and query optimization is a lot easier, and your old data is just still a simple query away.
Your company has a database backup system right? That system should be configured so that when it runs a backup, it will not remove deleted entries from the previous backup, instead just mark them as "deleted_since" the current backup time.
Idk if any backup system actually support this, if there's some glaring problem (like you can't just overwrite parts of a database backup for some reason), or if most companies just don't have backups because they're too expensive (probably not), but this is the solution I would go with. It works for other sorts of data like file systems as well.
IME, as with “updated_by” and “last_modified_at” columns, it's usually hazy audit requirements, not making deletion reversible, that motivates it.
A proper history store maintained by appropriate triggers solves this, and leaves the referential integrity constraints on the base table intact. (It can also be used for reversibility if you need that.)
Views conceptually would work, but then you get bitten by all the ways that all relations are not equal in real-world RDBMSs.
The point that is true is that queries get more complicated as you'll have to add a "WHERE deleted_at IS NULL" to every SELECT (once for each table you refer to), but that can be automated if you use an ORM. A paradigm that I often use is that all objects in the database belong to a role object that determines who can read/write/delete the given object. So before doing anything with an object I always check the role object (e.g. the "user" referencing an "invoice", to stay with the example OP gives), and as part of this I check whether the user object still exists. Alternatively, you could automate most of the required update logic using triggers as well.
But otherwise I agree, soft deletes often don't seem to be a worthwhile tradeoff, not sure if I would use them again when designing a relational schema. They are very useful for auditing and undo though: In a current project, whenever a set of objects gets updated I soft-delete the old versions and create new objects, keeping the UUIDs intact. That allows me to display the entire version history of each object to the user, which can be necessary e.g. for compliance reasons. You can achieve this with an audit log as well but that would require more logic and different queries, whereas querying soft-deleted objects just requires a slight modification of existing queries.
There are cases where you shouldn’t be deleting or updating data; auditable and non-repudiation systems for some regulatory compliance come to mind. Best to use patterns that don’t require those operations.
Soft deletion does come at a cost. Choose carefully!
In your code you can isolate all soft-deleting from business logic in the ORM layer or data layer, so the complaint about littering your codebase is moot for me. For instance, using Entity Framework, you can change deletes to soft deletes in a centralized place for all records matching a particular interface, then add a query filter that applies in the background for all queries.
The complaint that soft deleting is never done is maybe valid since you can review things with audit logging or backups without risking unknown effects of an un-delete. But if you need a recycle bin feature then you get that for free if you just build that in from the start, and it is one more guarantee.
The risk of orphaned records is real, although you could probably handle most cases generically in the data layer or ORM as well. It seems like there's just tradeoffs to the various approaches. Do you want to err on the side of deleting data, or on the side of keeping it? Do you worry more about orphaned records or data loss?
At my current employer, we noticed that `acts_as_paranoid`'s default behavior was not what we wanted, so we migrated over to `discard`. We also added a concern that reflects on dependent associations, finds if they are discardable, and discards them if possible. And that cascades down, easing those concerns. This `Discardable` concern is automatically added to every single soft-deletable model and it has been working out great for us.
Effectively, you don't check for the soft delete flag if you get to it from a an un-deleted record, but you do check for it if you access it the other way around.
In other situations though, you may have some reason to treat the customer as if they were deleted, but better to examine the reason for that, and use an attribute more relevant to that reason, such as active/inactive etc. Would be different for different entities of course.
Usually archival does the main thing (“get this out of my main resource list”) without breaking audit trails or resource links. For people for whom this is insufficient, you can of course offer hard deletion.
Someday.
I'm embarrassed to admit how many decades I've been waiting for this.
1. This can easily be done with a trigger, so that you just call a DELETE on the table and deleted tables are copied to the deletion table automatically.
2. I prefer, instead of having a jsonb column, that each table has a corresponding `deleted_original_table_name` table that exactly matches the schema of the base table, with the addition that the first column is a `deleted_at` timestamp. It's easy to use helper methods in schema migrations to always keep the table definitions in sync.
For all tables containing a BOOLEAN column it’s always possible to simply split this table into two separate tables with the same columns, where the name of the table signals whether the factored-out BOOLEAN column would be TRUE or FALSE.
My gut instinct says it’s cleaner to have two separate tables, but I’ve never found a definite answer.
Mechanically, I've typically supported soft deletes with audit tables that shadow the primary table, with a bunch of automation in the database to make management mostly automagic. It isn't too bad in PostgreSQL.
I disagree with the terminology in the article. "Soft deletion" suggests that the complexity is in the "soft" part, and that a common way of implementing it is problematic. I disagree. There isn't some orthogonal "soft vs hard" dimension to a generic concept of deletion. The complexity is in the meaning of deletion.
In accounting, you don't simply delete. Or when you do, you really do, and the two operations aren't the same in any meaningful sense. If you want data to still be available—whether it's for debugging or analysis or auditing or whatever—you should be thinking about the semantics of what you need, and structure your data model accordingly.
The `deleted_at` column approach is a DB design smell if it isn't supporting application logic (where "application" may include auditing or whatever). It works against the DB's mechanisms to maintain data integrity. FKs are just one example.
An example: consider the place where you want to keep historical data, but you're also going to be modifying your schema. If you use a `deleted_at` column, your migrations will start inventing more and more things that were simply not true at the time a deleted row was alive. It will lie to you. It's the same if you move data to a single deleted data table and then migrate that table repeatedly. For maximal semantic purity, you probably ought to leave "deleted" data in a historical table matching the historical schema, and then use views to glue things together for convenience. If you update the live data schema in an incompatible way, you might even be saved by the FK constraints on the archive tables.
But that's a pain, and whether or not it's less pain than the other approaches depends again on what deletion semantics you are targeting. Crossing your fingers and closing your eyes and hoping that your chosen mechanism's semantics are close enough to the semantics you need is going to bite you.
A `deleted_at` column can absolutely be the right solution if your rows have a status that changes over time, and one of the statuses that you're willing to support (with potentially brittle code) is "archived".
Keeping a deleted recrords table via app code or triggers has always been more trouble than it took to build.
> As far as I’m aware, never once, in ten plus years, did anyone at any of these places ever actually use soft deletion to undelete something.
That's wild. So it seems the idea of needing undelete is largely an unfounded fear.
In Django, it's really easy to create almost seamless soft deletion logic in the model manager or in your querysets.
Over the last decade, I find myself using soft deletion more and more - usually to accommodate user/client requests.
Excuse the dramatic title of the post
We renamed "deleted_at" to "archived_at". And there was much rejoicing.
Don't ask me why the linter configurations weren't simply persisted in git.
If you're just going to throw in some deleted bool or deleted_at timestamp without thorough testing, you might as well just skip it. It's virtually certain to go wrong.
You could even sprinkle cryptographic guarantees into the mix. This would be very challenging to do with mutable DB rows.
Edit: Deletion is always reversible btw since there are backups.
Every table had the same table with _del suffix (eg. users_del). If a record was deleted, it was simply moved to _del table. We used code for this but later we started to use db triggers.
It worked quite well, and yes, there was always someone who wanted to undelete things. One downside was, if the schema changed on the source table, we needed to also change the schema in _del table. I like the approach with storing the data as json. That way there could be only 1 deleted_stuff table because it was looking quite strange having all the _del tables.
What? You do not delete invoices, unless you’re trying to take revenge on your accountant. This is what soft deletion is (partially) for: you don’t want to see Alice in a customer list for some reason, but her invoices are the accomplished fact. You can even visit her card from there, but it is unlisted everywhere else.
Of course that depends on which sense you put into deletion, e.g. you may put obsolete cards into a special group instead and only use deletion to remove data completely with all references. But then deletion is useless, because only a programmer to the bone can imagine deletion of a customer together with all historical (legal) documents they participated in.
Solution… a whole other table of deleted stuff… in a new structure.
Man soft deletes just look better to my eye.
For accidental delete, we will undelete from archive.
Then have a statement like RESTORE * FROM ...
In any case, soft deletion is usually a sign of incompetence. Whenever I saw it on a project, both soft deletion and the project turned sour.
If you can _actually_ delete something, then that means that a malicious actor can fabricate data an claim that you deleted it. GDPR may be well intentioned but systems that have the ability to remove any record of a thing lay the groundwork for systematic fabrication of data, because any record of the past has been erased.
Operationally, I can totally see why soft delete might be considered to be problematic in certain cases, but from an information security point of view I think it is absolutely critic for protecting users against a whole class of attacks.
0. https://news.ycombinator.com/item?id=27249738 1. https://news.ycombinator.com/item?id=27691442
There's only deactivate account, archive a legacy product,...
Because there's no such thing as delete something from real world.
GDPR, among other regulations, requires that you be able to do this sometimes; and it requires that the data REALLY BE GONE.
But I really think that soft deletion should be the default unless you think you’ll be fielding user data deletion requests.
However, GDPR/customer data demands across regimes makes me agree with him and would suggests folks listen. <3
GDPR is the big thing to consider, I think.
Just put a filtered index on the column to better query non deleted data.
On the whole I don’t think in practice the author’s take makes much sense.
Post a daft hot take.
A view solves that problem. Make a view that only has the non-deleted stuff. Give it ON UPDATE and ON INSERT triggers so that it walks, talks, swims, and quacks like a table. Voila, no more code bleeding.
> Another consequence of soft deletion is that foreign keys are effectively lost.
Bit trickier, you have a few options:
* Make the constraint include that the foreign object has `deleted_at IS NULL`.
* Add a trigger that automatically marks as deleted (I'm assuming a setup where you'd ordinarily use ON DELETE CASCADE) anything that refs row X when you mark row X as deleted. If you prefer commit failure ON DELETE instead, triggers can do that too.
> GDPR scaremongering
You don't have to delete records from backup tapes either (SOURCE: I read the whole thing). Using soft delete is actually making life easier for you - you presumably _do_ have certain data storage requirements (for audit trails and the like), and now you can just have the one canonical database that contains it all. When its time to prune an entire customer/user into oblivion, it's simpler to do that then - just `DELETE` the right rows away (actual DELETE, not UPDATE SET deleted_at).
Yes, GDPR has something to say about keeping data around where you have no feasible auditing or any other reason to have it, but that's a red herring: You don't want your database tables to grow humongous with 99% of the rows 'deleted'. That view with some indexes can do a lot but it isn't magic. Presumably you want a cleanup task that, every month or so, DELETEs anything with a deleted_at value that's older than a month or what not. This fully takes care of your GDPR requirements as far as unreasonable data retention goes: A script automatically runs to wipe out all rows in all tables whose deleted_at is too long ago, and then reports that it did this so that you have the audit trail.
So, for your requirement to delete specific records upon request, it's easier. For your requirement to not keep unneccessary data around beyond reasonable bounds, it's a simple script.
> Here’s a snippet from one that I wrote recently which keeps all foreign keys satisfied by removing everything as part of a single operation
If you set your constraints explicitly to checking only at the end of a commit this isn't at all difficult the way the author says it is. Just delete what you wanna delete, commit at the end, and poof - all is well. You can force postgres specifically into a 'yeah yeah do not check any constraints until I commit' mode if you don't want to change your ON DELETE clauses.
> data deletion has non-data sideeffects
That depends on the use case. It feels like a bit of a strawman argument - obviously if the delete action does irreversible damage, marking the database row using a soft delete is rather silly. Of course. Most delete operations are nothing like that though.
> Alternative: A deleted records table
Author's previous point about non-data sideeffects kills this just as badly. But, sure, this isn't a bad idea. However, most of the complaints about soft deletion apply in a different fashion to this model. For example, if you have reference constraints, and using ON DELETE CASCADE, you need to do a heck of a lot of copying. You don't just 'copy' the row you want to delete, you also have to copy every row of every table that refs your table with ODCascade constraints to its 'deleted' variant first, and only then can you delete the lot.
> Hard deleting old records for regulatory requirements gets really, really easy: DELETE FROM deleted_record WHERE deleted_at < now() - '1 year'::interval.
It is _exactly_ as simple to do this if you use soft-delete. Bit of an own goal.
Author's got the right idea (soft delete needs some thought), but the technical aspects are a swing and a miss, I think. However, some database make some of these solutions hard. As far as I remember, they don't all support ON UPDATE/ON INSERT rules on views, for example. Fortunately, postgres supports all of this stuff.
Remember the Attlassian outage from earlier this year. They sure would have appreciated a soft delete