https://github.com/frankmcsherry/timely-dataflow
https://github.com/frankmcsherry/differential-dataflow
Or, just tell your friends. :)
Better yet, write some python / pandas / dataframes / whatever_the_cool_kids_need layer on top, and rule the next big data drama cycle.
You are right, though, that if the processing is extremely "batchy" and all data dies at the same time, then it doesn't make a difference.
I'm not convinced that's the reason why Java is used for it. There are native alternatives like HPCC which claim to perform better.
As was noted, concurrent access to shared data is not very common in such distributed computation scenarios. Well-designed processing will avoid it, and thus will avoid the need for locking as well.
Could someone (or OP) elaborate on the value that re-implementing a whole piece of software in a new language provides, compared to just building an interface "bridging" both worlds?
To clarify, my metric for "value" is usefulness to other people. That is, without considering the (interesting) learning opportunity that it represents for the author.
For example, someone developed a Python interface to the Stanford Core-NLP library (written in Java). Would rewriting the Core NLP library in Python be useful to the community? How does one figure out what people need?
I am asking because, while I think it would be a ton of fun and allow me to learn a lot, I also value building useful software, and rewriting a whole system sounds like overkill except for a few very niche cases.
And if I am not mistaken you would need a team at least as large as the parent project to implement new features, fix bugs and keep pace with it. Looking forward to hearing what HNers think!
edit: clarified ambiguities
To learn Rust.
Edit: It also mentions not being tied to UNIX and appears to claim it will run on Windows. That's certainly something.
Making it safer and even catching bugs in the original implementation (both things Rust will help with)?
Making it integrate seamlessly with the new language's ecosystem? E.g. Lucene is Java, and someone could use that, but there are tons of ports of it in C, C++, Python etc, providing convenience to integrate it with projects in these languages.
>And if I am not mistaken you would need a team at least as large as the parent project to implement new features, fix bugs and keep pace with it.
Not necessarily. A project with 10 part-time contributors could be matched, or even surpassed, by a project with 2-3 full-time competent hackers, for example.
There used to be several ports, though most are dead and/or are several major versions behind. A new C++ or Rust port would be great, though unrealistic given the huge project size.
More generally, as coldtea mentions, making integration into the rest of the language's ecosystem is the primary benefit of rewriting in another language.
The value of such a port to others depends on how easy it is to integrate the two languages, whether via libraries or other methods. The harder that integration is (and the fewer automated translation tools exist), the more valuable the rewrite is to others.
Your Core-NLP example is actually an interesting one, because that library has already been ported to other languages... It is available for the C#/F# ecosystem (http://sergey-tihon.github.io/Stanford.NLP.NET/).
- maxmemory key eviction
- hash values
- ~2/3 of the set operators
- multi/exec
- lua scripting
This is an interesting and potentially useful effort, but a replacement for Redis it is not.
I'm sure pull requests to bring it up to feature parity would be welcome!
http://redis.io/topics/data-types-intro
The data structures are all addressed by string keys.
Redis can persist this heap to disk, and load it again, so you get a measure of durability, but the typical use case is for data you can afford to lose - caches, metrics, some kinds of events, etc.
Redis's key non-functional strengths are speed and robustness. Operations people love it because you stick it in production and it just quietly keeps on working without needing attention or setting your CPU on fire.
To my mind, any project should have PostgreSQL as its first data store. But it should probably have Redis as its second, when it finally has some data that needs to be changed or accessed so fast that PostgreSQL can't keep up.
(Kafka is third)
It simply means that the key-value store is directly loaded into the memory (RAM) and is available for fast access, but the data is retained (persistent) even after the application is closed.
It is usually used as a cache store, or for queuing messages to communicate between different processes, either locally or distributed.
Its data structures cover a good part of what you'd need from generic data structures, which makes Redis an easy way to implement the logic of, say, a list intersection of friends common to multiple people, or a sorted set of goods ranked by their amount, all of it shared with other processes.
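As a rough illustration of those two examples, here is a minimal pure-Python sketch of what the operations compute. It does not talk to Redis at all; the key names (`friends:alice`, `goods:stock`) and the helper functions are made up for illustration, loosely modelled on what the SINTER and ZADD/ZRANGE commands do on the server side.

```python
# A toy in-process keyspace: string keys mapping to data structures.
# This only models the semantics; a real Redis server would hold this
# state itself and share it between client processes.
store = {}

def sadd(key, *members):
    """Add members to the set stored at key (hypothetical helper)."""
    store.setdefault(key, set()).update(members)

def sinter(*keys):
    """Return the members common to all sets stored at the given keys."""
    sets = [store.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()

def zadd(key, member, score):
    """Set a member's score in the sorted set at key (kept as a dict)."""
    store.setdefault(key, {})[member] = score

def zrange(key):
    """Return members ordered by ascending score, ties broken by name."""
    zset = store.get(key, {})
    return [m for m, _ in sorted(zset.items(), key=lambda kv: (kv[1], kv[0]))]

# Friends common to two people.
sadd("friends:alice", "carol", "dave", "erin")
sadd("friends:bob", "dave", "erin", "frank")
common = sinter("friends:alice", "friends:bob")

# Goods ranked by their amount.
zadd("goods:stock", "apples", 40)
zadd("goods:stock", "pears", 5)
zadd("goods:stock", "plums", 12)
ranked = zrange("goods:stock")
```

The point of doing this in Redis rather than in a local dict is the last part of the sentence: every process talking to the same server sees the same sets.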
Redis also offers pubsub capabilities in two forms:
- A standard PUB/SUB couple which does what you think it does
- Blocking pop on a list for a client, and a push for another client, which will "wake up" the first one with the value.
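The second form can be modelled in-process with a condition variable. This is only a sketch of the behaviour described above, not how Redis implements it; the `BlockingList` class and its method names are hypothetical, loosely mirroring the BLPOP and RPUSH commands, with threads standing in for clients.

```python
import threading
from collections import deque

class BlockingList:
    """Toy list whose pop blocks until another thread pushes a value."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def push(self, value):
        # Append the value and wake up one waiting consumer, if any.
        with self._cond:
            self._items.append(value)
            self._cond.notify()

    def blocking_pop(self, timeout=None):
        # Wait until an item is available or the timeout expires.
        with self._cond:
            if not self._cond.wait_for(lambda: self._items, timeout=timeout):
                return None  # timed out with nothing to pop
            return self._items.popleft()

queue = BlockingList()
received = []

consumer = threading.Thread(
    target=lambda: received.append(queue.blocking_pop(timeout=5))
)
consumer.start()      # the consumer waits while the list is empty
queue.push("job-1")   # the push hands it the value
consumer.join()
```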
It's a very versatile Swiss Army knife.
It can be used for caching, queues and for applications with volatile data.
GitHub.com/sudhirj/restis
Also wondering if some rethinking is possible - would an HTTP interface a la DynamoDB be more useful? Can complexity and performance be improved by using a purely memory backend with no disk persistence? If there were pluggable back ends would a Postgres or DynamoDB back end be more useful for terabytes / petabytes of data? Is the beauty of Redis the API or the implementation?
The answer is "no" with a certain amount of probability. Redis isn't single threaded by lack of capability, but by design. Concurrency for multiple CPUs will actually slow down a lot of the stuff you see, as you will need to introduce locking mechanisms.
Also, memory management is highly tuned and customized in Redis for the use case of an in-memory DB (in stark contrast to the usual allocation patterns of an application), to the point where it's almost impossible to replicate the performance in a garbage-collected language.
I love Go and we're a 100% Go (and Angular) shop, but for an in-memory DB it wouldn't be a sane choice.
There should be minimal overhead from having the capability in Redis due to the way it implements disk snapshots (RDB snapshots are done by fork()'ing and relying on copy on write to let the main process keep on doing its thing while the child process writes the snapshot, so the main process doesn't need to care; other than that Redis offers logging/journalling of changes, but the cost of having that as an option is trivial if it's switched off).
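The fork-and-copy-on-write idea can be sketched in a few lines. This is a minimal illustration of the technique, not Redis's RDB code: the `snapshot` function, the JSON format and the file name are all assumptions made for the example, and it only runs on POSIX systems where `os.fork()` is available.

```python
import json
import os
import tempfile

def snapshot(data, path):
    """Fork a child that writes `data` as it looked at fork time.

    Returns the child's pid; the parent is free to keep mutating `data`.
    """
    pid = os.fork()
    if pid == 0:
        # Child: it sees a copy-on-write view of the parent's memory,
        # frozen at the moment of the fork.
        try:
            with open(path, "w") as f:
                json.dump(data, f)
        finally:
            os._exit(0)  # exit without running the parent's cleanup handlers
    return pid

data = {"counter": 1}
path = os.path.join(tempfile.mkdtemp(), "dump.json")

child = snapshot(data, path)
data["counter"] = 2       # the parent keeps writing; the child is unaffected
os.waitpid(child, 0)      # wait for the snapshot to finish

with open(path) as f:
    saved = json.load(f)  # holds the state as of the fork, not the later write
```

The parent never takes a lock or pauses for the write; the only cost it pays is the fork itself and whatever pages the kernel has to copy when it modifies them.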
Having pluggable backends for things like Postgres or DynamoDB seems a bit at odds with the purpose of Redis, which is exactly that you pay the (very low) cost of in-memory manipulation of simple data structures, though if a single Redis server could partition the key space between plugins, it might potentially be useful by letting you e.g. move keys between backends and still access them transparently to the client. E.g. for the samples I mentioned above, we roll up and transfer data to a CouchDB instance for archival now (doesn't matter that it's CouchDB really - we just need a persistent key-value store; Postgres or DynamoDB would also both have worked fine), but if I could archive them while still having them visible in a Redis-like server together with the in-memory keys, that'd make the client a tiny bit simpler.
For most Redis usage, I think paying the cost of connection setup and teardown and sending HTTP headers etc. would likely slow things down immensely. At least it would for my usage. Having an HTTP interface as an addition might be useful in some scenarios to enable new use cases, but as a replacement for the current Redis API it would be a disaster.
If you want to explore alternative interfaces, I'd instead suggest going in the opposite direction, and experimenting with a UDP interface. In a typical data centre setting packet loss is low enough that while you'd need retry logic, it wouldn't necessarily get exercised much in normal situations.
(On the other hand, for the typical request/reply cycle it might very well not give any benefits vs. TCP in most scenarios where multiple request/replies are done over a single connection, thus amortising the connection setup cost - would be interesting to benchmark, though.)
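To make the retry logic concrete, here is a small sketch of a UDP request/reply client that resends on timeout, with a toy server that deliberately ignores the first datagram to simulate loss. Nothing here is a Redis interface: the `udp_request` and `serve_once_lossy` names, the PING/PONG payloads, and the timeout and retry values are all invented for the example.

```python
import socket
import threading

def udp_request(addr, payload, retries=3, timeout=0.2):
    """Send payload and wait for a reply, resending on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _attempt in range(retries):
            sock.sendto(payload, addr)
            try:
                reply, _sender = sock.recvfrom(4096)
                return reply
            except socket.timeout:
                continue  # request or reply was lost; try again
    raise TimeoutError("no reply after %d attempts" % retries)

def serve_once_lossy(sock):
    """Toy server: ignore the first datagram, answer the second."""
    sock.recvfrom(4096)                       # simulate a lost packet
    data, client = sock.recvfrom(4096)
    sock.sendto(b"+PONG " + data, client)

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))                 # let the OS pick a free port
thread = threading.Thread(target=serve_once_lossy, args=(server,), daemon=True)
thread.start()

reply = udp_request(server.getsockname(), b"PING")
thread.join()
server.close()
```

Note that a blind resend like this is only safe for idempotent commands; anything like an increment would need a request ID so the server can recognise duplicates.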
I also have no intention of making this project live as long or have as many users as Redis does.
Can Rust be readable?
It's a very interesting piece of work though.
I'll be interested to see Antirez's view on the trade-offs between C and Rust for this.
Looks like a really cool effort, but authors of open source projects often think people will read the code and figure it all out. The truth is people usually look at what's in the README, and that's all the attention span most people are going to have. My 2c: improve your README.md.
And if you read "Why? To learn Rust" and ask "should I use this in production"...