It’s curious that over those projections, we then build event stores for CQRS/ES systems with their own projections mediated by application code.
Let’s also mention the journaled filesystem on which the database logs reside. And the log structure that your SSD is using internally to balance writes.
It’s been a long time since we wrote an application event stream linearly straight to media, and although I appreciate the separate concerns that each of these layers addresses, I’d probably struggle to justify them from first principles to even a slightly more Socratic version of myself.
[1] https://www.confluent.io/blog/turning-the-database-inside-ou...
Also, to my knowledge, the logs in a DB are not kept forever. Instead they are trimmed as soon as reasonable. It starts to smell a little bit like a https://en.wikipedia.org/wiki/Log-structured_merge-tree
The academic/enterprise database space has been discussing and tackling the types of questions you raise for decades. I don't think that is a useful lens to evaluate this article which is effectively a "tips on when to use our GoLang SQL Pub/Sub layer".
One must be careful not to use this as an excuse, of course, and keep an eye on the scaling concerns and certain other details. SQL-as-pubsub has certain well-known issues and anyone using it this way ought to be aware of them. But it's a thought worth having.
I've got a system I'm managing where Cassandra is the backend. I've got about 5 "documents" (in the MongoDB sense, let's say) I want to store in the system. I don't put up a whole "document DB" for them, I just have a table in Cassandra. I have in my entire system, one distributed lock I'd like to have per certain resource, of which I expect there to be single digit numbers of that resource over the lifetime of the system. Cassandra is not a great distributed locker, but it does work (and as far as I can tell, done properly, is also correct), so rather than install an entire distributed lock server, I use Cassandra.
Should this ever turn out to become a mistake, all code that uses either of these functions is cleanly isolated and I can easily swap them out later. I am aware of the possibility that could happen in the future and have prepared for it. In the meantime, I've avoided two entire systems being poorly deployed and understood in favor of the one system that is well-deployed and understood by the team.
I would update all references of the former to the latter.
Do databases keep the whole transaction log forever? It seems like that could keep growing forever even when the tables stay at constant size.
For your question about the logs growing forever, there are usually points in the process where a transaction log is saved off somewhere else and then can be overwritten, but on a transactional system the logs over a period of a week or two can sometimes be many times larger than the actual data stored at any given time, yes.
That's really logical. From the view of the application there is no transaction log, only a table. It's an implementation detail of the database.
The application wants similar guarantees a log can provide, so they build their own.
It might work, but it’s not the general case and you might spend more time to debug your table then to write the code to use a real queue.
And I’ve also seen people build their own queueing engine for a few hundred tasks per day. Why don’t they just choose one of the very good open source solutions?
If you already need a database for something else, using the DB as a Queue means you don't need to list {mqFlavorOfChoice} as a requirement for new hires. You also don't have to manage that extra infrastructure. Of course, you are putting additional load on the DB.
Mind you, I'm speaking of a pub-sub type queue and not a FIFO here. You can do FIFO queues in DB as well of course, it's just not as compelling of a story nowadays.
Also way easier to look at and 'poke' a Database queue if you need to. The queries are also not really difficult to write for a general purpose use case.
If I was building a new system that required a queue I'd definitely put in the same postgres db as the rest of the data until I had a good reason not to.
If you don’t take this into consideration then you’re detracting from the business to satisfy another need.
Messages will never arrive, arrive out of order and I don't remember the third one right now (messages will arrive late?)
EDIT: More context for the above process[1]
[1]https://www.eversql.com/faster-pagination-in-mysql-why-order...
What parent mean is that there may be holes in the sequence of primary keys. What you do with pagination is that you first sort the sequence, then thrown away the first N results, and finally select only the next M results.
It will work just fine.
Polling isn't a huge issue to begin with, and is mitigated with LISTEN/NOTIFY (on certain DBs). Inserts with indexes are not a performance problem at the scale of most applications. A separate messaging service won't prevent you from building a "hugely coupled monster".
Personally, I almost always start with the database as a queue. The operational overhead of running, updating, and monitoring another entire service is non trivial. If the messaging rate exceeds the database's capabilities in the future, I'll migrate then.
I'm working on a module that send notifications to a user when an alert is generated. I have PostGreSQL as the database and NodeJS is the handler and for connection pooling. Are there any good pub/sub tools that I can use. Thanks in advance.
It's an Elixir server (Phoenix) that allows you to listen to changes in your database via websockets. Basically the Phoenix server listens to PostgreSQL's replication functionality, converts the byte stream into JSON, and then broadcasts over websockets. The beauty of listening to the replication functionality is that you can make changes to your database from anywhere - your api, directly in the DB, via a console etc - and you will still receive the changes via websockets.
The article suggests Postgres’ native LISTEN/NOTIFY functionality. I tried that originally and found that NOTIFY payloads have a limit of 8000 bytes, as well a few other inconveniences.
It's still in very early stages, although I am using it in production at my company and will work on it full time starting Jan.
https://deepstream.io/tutorials/concepts/what-is-deepstream/
https://github.com/deepstreamIO/deepstream.io
(I'm not affiliated with the project.)
(Blockchain another "oplog" that ends up caring a lot about state eventually).
It's no wonder you can use them interchangeably in many common base cases.
When the "forward" part of "store-and-forward" is most important then Kafka is a fine solution.
However, when the "store" part - for example you want to be able to stream historical data again, or interact with the data in different ways - is most important I have recommended HBase (+ Phoenix) as a better solution in the past.
[1] https://www.reddit.com/r/PostgreSQL/comments/ebu6nh/message_...
An alternative implementation is provided by Debezium [1], a general solution for change data capture for MySQL, Postgres, MongoDB, SQL Server and others, based on top of Apache Kafka (but can also be used with Pulsar and others).
There's support for outbox coming as part of Debezium out of the box [2].
Disclaimer: I'm working on Debezium.
[1] https://debezium.io/ [2] https://debezium.io/documentation/reference/1.0/configuratio...
It's incredibly well written and I am using it in a project.
It works by following the MySQL binary log and triggering a reactive query based on event conditions specified by the programmer, e.g. a change in a field.