- functions are versioned by name
- a function will "pull in" its dependencies, transitively, at compile time; a function will never change behaviour just because a dependency has a new version available
- the global database can store all functions ever created, without worrying about anyone stepping on anyone else's toes
- magical zero-install (a runtime reference to a function hash that isn't present locally -> the process blocks while it gets downloaded from the database). This is safe: presuming a currently-accepted cryptographic hash, if you ask for a function with hash X, you'll be running known code.
- you can still build "curation" schemes on top of this, with author versioning, using basically Freenet's Signed Subspace Key approach (sort of equivalent to a checkout of a git repo). The module author publishes a signed function which returns a function when passed an identifier (this is your "module"). Later, they publish a new function that maps identifiers to other functions. The whole stdlib could live in the DB and be dereferenced into cache on first run from a burned-in module-function ref.
- function unloading can be done automatically when nothing has called into (or is running in the context of) a function for a while. Basically, garbage collection.
- you can still do late binding if you want. In Erlang, "remote" (fully-qualified) calls don't usually mean to switch semantics on version change; they just get conflated with fully-qualified self-calls, which are explicitly for that. In a flat function namespace, you'd probably have to make late-binding explicit for the compiler, since it would never be assumed otherwise. E.g. you'd call apply() with a function identifier, which would kick in the function metadata resolution mechanism (now normally just part of the linker) at runtime.
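The list above can be sketched roughly as follows. This is a toy in Python rather than anything BEAM-specific: `GLOBAL_DB`, `LOCAL_CACHE`, `publish`, and `load` are hypothetical names, the "database" is a dict, and each published source is assumed to define a function called `f`. It only illustrates hash-pinned dependencies and fetch-on-first-use.

```python
import hashlib

# Hypothetical in-memory stand-ins for the global database and the local node.
GLOBAL_DB = {}
LOCAL_CACHE = {}


def publish(source: str) -> str:
    """Store source under the SHA-256 of its text and return that hash."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    GLOBAL_DB[digest] = source
    return digest


def load(digest: str):
    """Return the function `f` defined by the source with this hash."""
    if digest not in LOCAL_CACHE:
        # A real system would block here on a network fetch.
        source = GLOBAL_DB[digest]
        if hashlib.sha256(source.encode("utf-8")).hexdigest() != digest:
            raise ValueError("content does not match requested hash")
        namespace = {"load": load}
        exec(source, namespace)  # toy only: no sandboxing
        LOCAL_CACHE[digest] = namespace["f"]
    return LOCAL_CACHE[digest]


double_v1 = publish("def f(x):\n    return x + x\n")
# The caller embeds the dependency's hash at "compile time".
quad = publish(
    f"def f(x):\n    double = load({double_v1!r})\n    return double(double(x))\n"
)
# A changed dependency gets a new hash; `quad` still points at double_v1.
double_v2 = publish("def f(x):\n    return 2 * x + 1\n")
```

Calling `load(quad)(3)` pulls in `double_v1` on first use and never touches `double_v2`, because the reference is to a hash, not a name.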
Plug: I am already working on a BEAM-compatible VM with exactly these semantics. (Also: 1. a container-like concept of security domains, allowing for multiple "virtual nodes" to share the same VM schedulers while keeping isolated heaps, atom tables, etc. [E.g. you set up a container for a given user's web requests to run under; if they crash the VM, no problem, it was just their virtual VM.] 2. Some logic with code signing such that calling a function written by X, where you haven't explicitly trusted X, sets up a domain for X and runs it in there. 3. Some PNaCl-like tricks where object files are simply code-signed binary ASTs, and final compilation happens at load-time. But the cached compiled artifact can sit in the global database and can be checked by the compiler, and reused, as an optimization of actually doing compilation. Etc.) If you want to know more, please send me an email (levi@leviaul.com).
I stopped when the madness became unbearable.
Exhibit A: https://github.com/mattsta/zlang/blob/master/priv/site_macro...
Exhibit B: https://github.com/mattsta/zlang/blob/master/examples/editor...
(unbearable from an "I created a new unmaintainable language" POV; nothing bad about the distributed function storage feature.)
I made a graph-based UI for editing this kind of program since you wouldn't want to deal with hash-based names directly. You can see an example program here: http://nickretallack.com/visual_language/#/f2983238d90bd3e0a...
Currently it runs in JavaScript, but I was looking at other languages that it would be good to compile to, and Erlang seems like a pretty good fit for it.
Here's some other thoughts about it: https://docs.google.com/a/thinair.com/document/d/1WtgfUqN6Sd...
Can we work together?
However, sometimes code needs the broken version of some function...
There really is no great universal solution to this stuff.
apply(repo_handle_foo, bar_fn, {'~>', 2, 3})
Meaning the system would bind that to a version of bar_fn that exists in the foo repo, and is considered to be in the 2.3.X release series. (Conveniently, given the global code DB, it'd probably always resolve to the very latest thing in that series.)
I would suggest a slight change in thinking, though: at the individual function level, you can version implementation, but you don't version semantics. If you have foo 1.0 and foo 2.0 that do different things, that have different API contracts, then those are simply different functions which should have separate identifiers. The only time a function-contract identifier should have "revisions" is to correct flaws that diverge the implementation from the contract. Eventually, the function perfectly does what the contract specifies; at that point, the function is done. If you want the function to do something else, new contract = new identifier.
(But what if your old implementation works and meets the contract, but is way too slow? This is where you get into alternative implementations of the same contract, and compilers doing global runtime A/B-testing of different implementations as a weird sort of JIT. This is orthogonal to versioning, almost, but it means you can't put the version constraints like "relies on [revision > bugfix] of foo" in your code—because what if the VM is running a foo that never had the bug? So those constraints go in the database itself, and effectively "hide" old versions from being offered as resolution targets, without impeding direct reference.)
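As a rough illustration of the `{'~>', 2, 3}` resolution and the "hidden" revisions described above, here is a minimal Python sketch. The version table, the placeholder hash strings, and the `resolve` function are all made up for the example; nothing here is an existing API.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Version = Tuple[int, int, int]

# Hypothetical contents of the "foo" repo for bar_fn: version -> function hash.
FOO_BAR_FN: Dict[Version, str] = {
    (2, 2, 9): "hash-a",
    (2, 3, 0): "hash-b",
    (2, 3, 1): "hash-c",
    (2, 3, 2): "hash-d",
    (2, 4, 0): "hash-e",
}

# Revisions the database hides from resolution; still reachable by hash.
HIDDEN: FrozenSet[Version] = frozenset({(2, 3, 2)})


def resolve(
    versions: Dict[Version, str],
    constraint: Tuple[str, int, int],
    hidden: FrozenSet[Version] = frozenset(),
) -> Optional[str]:
    """Resolve ('~>', major, minor) to the newest visible major.minor.X hash."""
    op, major, minor = constraint
    if op != "~>":
        raise ValueError(f"unsupported operator: {op}")
    candidates = [
        v for v in versions if v[:2] == (major, minor) and v not in hidden
    ]
    return versions[max(candidates)] if candidates else None
```

With nothing hidden, `('~>', 2, 3)` resolves to the newest 2.3.X entry; passing `HIDDEN` makes the resolver skip 2.3.2 while the hash itself stays directly addressable.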
If you use a strong crypto hashing algorithm, that would be impossible given current computational resources and their growth for eons into the future. For example, there are no known collisions of SHA-2: no one has found two inputs that have the same SHA-2 hash.
Quantum computers or some breakthrough algorithm could change that. If that happens, all encryption on the internet likely breaks as well, except where quantum cryptography is being used.
If you're not designing your software to worry about cosmic rays, you shouldn't design your software to worry about hash collisions. Just pick a good hash.
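To put a number on "just pick a good hash": for an ideal b-bit hash, the chance that any two of n items collide by accident is at most n(n-1)/2^(b+1) (the birthday bound). The sketch below computes that bound exactly. It assumes the hash behaves like a random function and says nothing about deliberate attacks; the item counts are arbitrary illustrations.

```python
from fractions import Fraction


def collision_probability_bound(num_items: int, hash_bits: int) -> Fraction:
    """Birthday (union) bound on an accidental collision among num_items
    values of an ideal hash with hash_bits bits of output."""
    return Fraction(num_items * (num_items - 1), 2 ** (hash_bits + 1))


# A billion billion functions stored under a 256-bit hash.
p_256 = collision_probability_bound(10**18, 256)

# For contrast: a hundred thousand items under a 32-bit hash.
p_32 = collision_probability_bound(100_000, 32)
```

`p_256` is far below any hardware error rate, while `p_32` exceeds 1, meaning the bound is vacuous and collisions should be expected at that size.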
Presumably this only works for pure-functional languages?
IDE could do "It looks like you're trying to write qsort - we already have it written 10000 times in these functions: ...".
I've had similar thoughts in the past, though my thought was as a function loader for javascript. Why not just experiment with this as a library for javascript functions? Downloading the code from the "global database" would be provided by a function:
getDataByHash(hash) : Blob
To load the returned code, use eval (blech), or back up a step and use a script tag plus an advanced version of magnet links I'd call snow links:
<script src="snow:?xt=urn:sha2:beaaca..."></script>
There will have to be a dom-scanning/watching js library that the pages load from a normal url which scans the page for snow links, downloads them by calling getDataByHash(), and then swaps in new URLs using URL.createObjectURL(Blob).
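The resolve-and-verify step behind a snow link might look something like the sketch below. It is written in Python rather than browser JavaScript to keep it self-contained, uses a dict in place of the global database, and assumes "sha2" means SHA-256; `put`, `get_data_by_hash`, and `parse_snow_link` are hypothetical names.

```python
import hashlib
from urllib.parse import parse_qs

BLOBS = {}  # hypothetical stand-in for the global database


def put(data: bytes) -> str:
    """Store a blob under its SHA-256 hex digest and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    BLOBS[digest] = data
    return digest


def get_data_by_hash(digest: str) -> bytes:
    """Fetch a blob and refuse to return it unless it matches its hash."""
    data = BLOBS[digest]
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("blob does not match its hash")
    return data


def parse_snow_link(url: str) -> str:
    """Extract the hex digest from 'snow:?xt=urn:sha2:<hex>'."""
    scheme, _, rest = url.partition(":")
    if scheme != "snow" or not rest.startswith("?"):
        raise ValueError("not a snow link")
    xt = parse_qs(rest[1:]).get("xt", [""])[0]
    prefix = "urn:sha2:"
    if not xt.startswith(prefix):
        raise ValueError("missing urn:sha2 parameter")
    return xt[len(prefix):]
```

Because the blob is checked against the digest in the link, it does not matter which peer or mirror supplied the bytes.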
This leads me in a somewhat orthogonal direction to your post...
Obviously any element in the page that loads resources would benefit from this (link rel=stylesheet, img, iframe, etc) so the document-scanner should work its magic on them as well.
To me the tricky part becomes "the global database." There are several ways to implement it. My thought would be to build it as a DHT on top of WebRTC. I'd look into webtorrent; it has to scale from very small to very large files. Maybe have multiple different DHTs that the scanner will try.
Storing stuff in the DHT ought to be as simple as declaring to the network that data (or solution) of a given hash is known. Clients p2p connect over to the knower (or daisy chain style as a sort of ad-hoc STUN/TURN setup?), and virally spread the data by declaring they too now have the solution for the hash they just downloaded. A CDN can still store a file and can be listed as a fallback provider in the snow url.
As an example application:
Once this DHT exists and p2p is built on top of it, it should not be hard to ask peers to run the content of a given SHA and return the result (or they will return its SHA if it's large, effectively memoizing the function). Something like:
run(hash) : RunOffer
Any peer (AWS or Google etc could implement a peer as well) could respond to that with an offer to run the computation for an amount of p2p currency that scales to billions or trillions of tiny transactions per second.
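The memoizing part of `run(hash)` can be sketched on a single peer as below. This leaves out the offer/payment side entirely, assumes programs are deterministic and define a `main()` function, and uses hypothetical names (`STORE`, `RESULTS`, `put`, `run`).

```python
import hashlib

STORE = {}    # content hash -> bytes (programs and results alike)
RESULTS = {}  # program hash -> hash of its result
RUN_COUNT = {"n": 0}  # how many times a program was actually executed


def put(data: bytes) -> str:
    """Store bytes under their SHA-256 hex digest and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    STORE[digest] = data
    return digest


def run(program_hash: str) -> str:
    """Run the program with this hash once; return the hash of its result."""
    if program_hash not in RESULTS:
        RUN_COUNT["n"] += 1
        namespace = {}
        exec(STORE[program_hash].decode("utf-8"), namespace)  # toy: unsandboxed
        result = repr(namespace["main"]()).encode("utf-8")
        RESULTS[program_hash] = put(result)
    return RESULTS[program_hash]


program = put(b"def main():\n    return sum(range(10))\n")
```

A second `run(program)` returns the same result hash without executing anything, which is only sound if the program is pure.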
...
Anyhow this comment is way too long already, but the ideas keep flowing from here. There are a lot of technical challenges for vital features (like how to register (im)mutable aliases for hashes and distribute them to the network without using DNS for authoritative top-level namespacing? how to get reliable redundancy guarantees for stored data without resorting to a CDN? What to do about realtime streaming of data where data is produced by an ongoing process? grouping multiple devices for the same user?)
All-in-all it seems like being able to get data by the hash of its content from inside of a program (including library code) as easily as loading it from a URL or off the filesystem is pretty useful. It also seems like we can engineer the technology to make this happen, so I think it's inevitable and will happen pretty soon.
This is, in fact, exactly what Freenet does! It's a global DHT acting as a content-addressable store, where there's no "direct me to X" protocol function, only a "proxy my request for X to whoever you suspect has X, and cache X for yourself when returning it to me" function.
(Aside: Freenet also does Tor-like onion-routing stuff between the DHT nodes, which makes it extremely slow to the point that it was discarded by most as a solution for anonymous mesh-networking. But it doesn't have to do that. "Caching forwarding-proxy DHT" and "encrypted onion-routing message transport" are completely orthogonal ideas, which could be stacked, or used separately, on a case by case basis. "Caching forward-proxying DHT" is to the IP layer as "encrypted onion-routing message transport" is to TLS. PersistentTor-over-PlaintextFreenet would be much better than either Tor or Freenet as they are today.)
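The "proxy my request and cache it on the way back" behaviour can be illustrated with a toy chain of nodes. This is not Freenet's routing (which picks the next hop by key closeness); the fixed `next_hop` and the `Node` class are simplifying assumptions for the sketch.

```python
import hashlib
from typing import Dict, Optional


class Node:
    """A node that forwards requests to one neighbour and caches replies."""

    def __init__(self, name: str, next_hop: Optional["Node"] = None):
        self.name = name
        self.next_hop = next_hop
        self.cache: Dict[str, bytes] = {}

    def request(self, digest: str) -> Optional[bytes]:
        """Return the data for digest, fetching via the neighbour if needed."""
        if digest in self.cache:
            return self.cache[digest]
        if self.next_hop is None:
            return None
        data = self.next_hop.request(digest)
        # Only cache and return data that actually matches the requested hash.
        if data is not None and hashlib.sha256(data).hexdigest() == digest:
            self.cache[digest] = data
            return data
        return None


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
```

After one request through a chain a -> b -> c, both a and b hold a verified copy, so the data survives even if c disappears.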
I agree that this could totally be done as a library in pretty much any language that allows for runtime module loading or runtime code evaluation. And, like you say, there are tons of interesting corollaries.
I think, though, that to be truly useful, you have to push this sort of idea as low in the stack as possible.
Unix established the "file descriptor" metaphor: a seekable stream of blocks-of-octets, some of which (files) existed persistently on disk, some of which (pipes) existed ephemerally between processes and only required resources proportional to the unconsumed buffer, some of which (sockets) existed between processes on entirely separate hosts. Everything in Unix is a file. (Or could be, at least. Unix programmers get "expose this as a file descriptor" right at about the same rate web API designers get REST right.) Unix (and most descendants) are "built on" files.
To truly expose the power of a global code-sharing system, you'd need the OS to be "built on" global-DHT-dereferenceable URNs. There would be cryptographic hashes everywhere in the system-call API. Because your data likely wouldn't just be on your computer (just cached there), you'd see cryptographic signatures instead of ownership, and encryption where a regular OS would use ACL-based policies.
At about the same level of importance that Unix treats "directories", such an OS would have to treat the concept of data equivalence-mapping manifests (telling you how to get git objects from packfiles, a movie from a list of torrent chunk hashes, etc.)
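One way to picture such a manifest: a small content-addressed blob that lists the hashes of the chunks making up a larger object. The JSON layout and the names `put_chunked` and `get_chunked` below are invented for illustration only.

```python
import hashlib
import json

STORE = {}  # hypothetical content-addressed store: hash -> bytes


def put(data: bytes) -> str:
    """Store bytes under their SHA-256 hex digest and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    STORE[digest] = data
    return digest


def put_chunked(data: bytes, chunk_size: int) -> str:
    """Store data as chunks plus a manifest; return the manifest's hash."""
    chunks = [
        put(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)
    ]
    whole = hashlib.sha256(data).hexdigest()
    manifest = json.dumps({"chunks": chunks, "whole": whole}).encode("utf-8")
    return put(manifest)


def get_chunked(manifest_hash: str) -> bytes:
    """Reassemble the object a manifest describes and verify the result."""
    manifest = json.loads(STORE[manifest_hash].decode("utf-8"))
    data = b"".join(STORE[h] for h in manifest["chunks"])
    if hashlib.sha256(data).hexdigest() != manifest["whole"]:
        raise ValueError("reassembled data does not match manifest")
    return data
```

The manifest is itself just another hashed blob, so a single hash is enough to name the whole object.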
For any sort of collaborative manipulation of shared state, you'd probably see blocks from little private cryptocurrency block-chains floating about in the DHT, where the "currency" just represents CPU cycles the nodes are willing to donate to one-another's work.
And then (like in Plan9's Fossil), you might see regular filesystems and such built on top of this. But they'd only be metaphors—your files would be "on" your disk to about the same degree that a process's stack is "in" L1 cache. Disks would really just be another layer of the memory hierarchy, entirely page-cache; and marking a memory page "required to be persisted" would shove it out to the DHT, not to disk, since your disk isn't highly available.
—but. Designing this as an OS for PCs would be silly. It would have much too far to go to catch up with things like Windows and OSX. Much better to design it to run in userspace, or as a Xen image, or on a rump kernel on server hardware somewhere. It'd be an OS for cloud computers, or to be emulated in a background service on a PC and spoken with over a socket, silently installed as a dependency by some shiny new GUI app.
Which are, of course, the niches Erlang already fits into. So see above :)
    define('object/clone', ['object/each'], function(each) {
      /**
       * Shallow copy objects
       *
       * @param {object} obj - object from which to copy properties
       * @return shallow copy of `obj`
       */
      return function(obj) {
        var clone = {};
        each(obj, function(key, value) {
          clone[key] = value;
        });
        return clone;
      };
    });
If I'm pulling in a function, I want it to do what I think I want. Sometimes I want that to change (get a bug fix), but sometimes I don't (someone introduces a bug, or makes the func more general and introduces slowdowns).
This feels like a job for a content-addressable git-like tool. How about this:
I can discover my function (via whatever means). The function is actually named 8804ea505fda087da53b799434c377f015933707 (the sha-something of its (normalised?) textual representation).
I then import it into my codebase as "useful_fun". My code reads like:
useful_fun("do it", "to it")
but I have some kind of dependencies/import record which says that "useful_fun" is actually 8804ea505fda087da53b799434c377f015933707. That means one and only one thing across all time, the func with that hash.
So how do we handle updates? If we want a golang-like model, the developer could run something like "update deps". This would:
- go back to the central repository, looking for updates to 8804ea505fda087da53b799434c377f015933707. It might find 5. Local policy then determines what happens. Could be "always choose the original author's update" or "choose the one with the most votes" or "always ask the dev, showing diffs".
Note that because the unique name is based on the function content, any change to it creates a new item in the db. (Content-addressability, same way git and other systems do it.)
- stuff can be grouped and batched. If I pull in 10 functions tagged with the same project ('module') and they've all been updated, I can say "and do the same with all the others".
- This kind of metadata allows all kinds of good stuff. I can subscribe to alerts on the functions I've imported and get told about new versions, or security warnings. This kind of subscription information can be used as a popularity contest to solve the "which fork on github do I want to use" problem.
- people can still publish modules. They now look like a git directory or tree. A git tree is a blob which contains the hashes of the files within it. A 'module' could be a blob which specifies which (immutable) functions are in it.
If we use normalised functions, we've now got a module representation which allows arbitrary functions to be pulled together. At fetch time, we can denormalise into the user's preferred coding style. At push time, we renormalise. We aren't grouping stuff into files, so a 'project' or a 'module' consists solely of the semantic contents, nothing to do with artificial grouping for the file system.
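The import record and the "update deps" step described above could be sketched like this. Everything here is hypothetical: the `DB` and `SUCCESSORS` tables stand in for the central repository, authors are plain strings, and only the "always choose the original author's update" policy is shown.

```python
import hashlib

DB = {}          # function hash -> {"source": ..., "author": ...}
SUCCESSORS = {}  # function hash -> hashes published as updates to it


def publish(source: str, author: str, replaces: str = None) -> str:
    """Store a function under the hash of its text; optionally mark it as an
    update to an earlier hash. Any change in text yields a new entry."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    DB[digest] = {"source": source, "author": author}
    if replaces is not None:
        SUCCESSORS.setdefault(replaces, []).append(digest)
    return digest


def prefer_original_author(current: str, candidates: list):
    """Policy: take the first update published by the original author."""
    author = DB[current]["author"]
    for candidate in candidates:
        if DB[candidate]["author"] == author:
            return candidate
    return None


def update_deps(imports: dict, policy) -> dict:
    """Return a new import record (local name -> hash) after applying policy."""
    updated = {}
    for alias, digest in imports.items():
        choice = policy(digest, SUCCESSORS.get(digest, []))
        updated[alias] = choice if choice is not None else digest
    return updated
```

The old record is left untouched, so a developer can diff the two records before committing to the new hashes.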
Seems like an interesting future.
I see too many issues with this (for example):
- I publish a news article. I publish a retraction/update to said article. Now the article has a new hash. Does the old hash give you the old version of the article, or redirect you to the new version?
- How do we define 'document?' If we define it as the complete HTML page served up to the browser, then changes to the design of the site would invalidate all previous hashes. Pointing old hashes to new hashes is work, which will not always be done (leading to the same situation we have with site redesigns breaking old URLs).
What we think is great about it is that the hash is location-independent. There could be multiple copies of the document at various locations at any given point. As long as there is at least one copy and that it is reachable via search, it will be retrieved.
We also built a distributed Wiki based on that idea and platform, called P2Pedia (http://p2pedia.sf.net).
It's all very much an academic research project, so don't expect a beautiful interface or easy-to-install packaging or anything, but I think it's a good proof of concept.
(note to self: we should really move these to GitHub).
Effectively, this gives you "program as a value" where the same url means the same program. Immutable programs basically.
I've tried this and the current downside is that it looks extremely ugly when you try to share a link lol. But this should be circumventable. The other downside is that this is a bit theoretical still. You'll have to exclude sensitive information such as passwords. Sometimes stuff is in a closure rather than in your `state`.
It's initially just doing the simplest possible thing (making the resource unavailable unless its hash is valid) but semantically it will probably be allowed for the browser to resolve the resource using other methods (e.g. if it already has that resource cached from another URL) so long as the hash matches.
"I am thinking more and more that if would be nice to have all functions in a key_value database with unique names."
Yeah, sure... Sounds good, right. Until you have naming conflicts.
So then the patch is "oh, let's just add another column to make it more unique", without realizing that you've just, in essence, created a "module" of sorts except it's stored in some sort of giant key/value database.
And then you've come full-circle back to the dilemma the author complains of which is that he doesn't know where to put a function that seems to belong in two modules.
Eventually, I'd say this is a general failing of modules that could potentially be solved by some sort of inheritance. Maybe even a tagging mechanism if you really want to be "patch-work joe" about it.
Let's re-word "all functions in a database" as "a revision control system in a database."
So, let's make a revision control system. All contents, branches, tags are kept in a database.
"Sounds good, right. Until you have naming conflicts."
No, no, no. There are no naming conflicts. The names humans will use are just pointers to the most recently updated underlying contents. The _actual_ names are garbage hash identifiers. The _usable_ names are human names bound to underlying contents.
So, if master is commit A and you make commit B, there is no naming conflict on the name "master," you just re-point it to commit B.
"a function that seems to belong in two modules."
That's the problem with explicit hierarchy and why the world now runs on tagging-based crowdsourced folksonomies.
"No, no, no. There are no naming conflicts. The names humans will use are just pointers to the most recently updated underlying contents. The _actual_ names are garbage hash identifiers. The _usable_ names are human names bound to underlying contents. So, if master is commit A and you make commit B, there is no naming conflict on the name "master," you just re-point it to commit B."
"Master" is the actual name that is going to be conflicting, if I understand your example.
In a third module, where else?
Not saying this Erlang idea isn't good or won't work, just these are the pitfalls besides the obvious namespacing and conflicts.
Nope. Joe's thought experiment is: what if every function became available in the global namespace? What if every function got kept in a global datastore so you: launch your REPL, run any function, and have it pulled down and work immediately. No imports unless you need to pin a specific function to a specific past revision.
Of course, someone would come along and say "these 30 functions only work together when pinned to these specific revisions," so you end up pulling down a named bundle of specific revisions, ...
7 ± 2. [1] That's the number of things our prefrontal cortex/short term memory can track at once. That's why we (humans) organize things into hierarchies. That's why the best team size is around that number. Etcetera.
Heck, everything in the world on a computer is serialized into memory or onto disk. Or addressed as some disk in a serial array of disks. Serialized as in, "there's some data somewhere in these 2TB that tells me where in the same 2TB the rest of the data is." Computers excel at this. Humans are terrible at this.
I guess my point is, humans are the reasons we need modules.
[1] http://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_...
It is all about handling complexity! By putting together things that belong together, complexity is reduced. Engines and other machines are designed that way too: things that belong together are put together in the same spot.
More specifically I feel like there are two problems. 1) It feels suspiciously like there's a combination of halting problem and diagonalisation that shows there are an uncountably infinite number of functions that we want to write that can't be named (although I would want to have a better idea of how this is supposed to work before I try to hammer out a proof). 2) I don't understand how it's possible for any hashing scheme to encode necessary properties of a function such that the function with necessary properties has a different hash than an otherwise identical function without these properties. For example can we hash these functions such that stable sort looks different than unstable sort? Wouldn't we need dependent typing to encode all required properties? And if that's the case couldn't I pull a Gödel and show that there's always one more property not encodable in your system?
[1] - https://www.youtube.com/watch?v=lKXe3HUG2l4 [2]
[2] - https://news.ycombinator.com/item?id=8572920 (thanks for the link)
But what it enabled is an Emacs community where single functions are freely shared, for example on http://www.emacswiki.org/emacs/. People just copy them into their Emacs init file. Sometimes they modify them a little and post them again with their own prefix. This has obvious downsides such as lack of versioning and organization. But it provides a low barrier to entry and creates a dynamic community.
If you postulate for a minute that the (truly nontrivial) surface problems are all solved, and concentrate only on the idea of a universally accessible group of functions that accretes value over time -- like a stdlib that every language on every runtime could access -- that seems like a pretty exciting idea worth thinking about.
I had something like that idea almost two decades ago (http://www.gossamer-threads.com/lists/perl/porters/26139?do=...) but at the time it was all in fun. But these days, that sort of thing starts looking pretty possible, especially for the group of pure functions.
Another idea: Unit tests could be stored as function metadata.
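A minimal sketch of tests-as-metadata, assuming a hypothetical record layout where each database entry carries example calls alongside the source. It also shows how a property the hash cannot express, such as sort stability, can at least be spot-checked at load time.

```python
import hashlib

DB = {}  # function hash -> {"source": str, "tests": [(args, expected), ...]}


def publish(source: str, tests: list) -> str:
    """Store source plus its example-based tests under the source's hash."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    DB[digest] = {"source": source, "tests": list(tests)}
    return digest


def verified_load(digest: str, name: str):
    """Load the named function and run its stored tests before returning it."""
    entry = DB[digest]
    namespace = {}
    exec(entry["source"], namespace)  # toy only: no sandboxing
    fn = namespace[name]
    for args, expected in entry["tests"]:
        if fn(*args) != expected:
            raise AssertionError(f"{name} failed a stored test for {args!r}")
    return fn


# The test pins down stability: ties keep their input order.
STABILITY_TEST = (
    ([("b", 1), ("a", 1), ("c", 0)],),
    [("c", 0), ("b", 1), ("a", 1)],
)

stable = publish(
    "def sort_by_second(pairs):\n"
    "    return sorted(pairs, key=lambda p: p[1])\n",
    [STABILITY_TEST],
)
```

An implementation that sorts correctly but reorders ties would get a different hash and fail the same stored test at load time.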
The fact that it is difficult to decide which module a function belongs in doesn't make them pointless. People who have to read or debug your code use them to quickly zero in on areas of likely interest.
Why does Erlang (or any other language) have modules?
The biggest reason for me (and I think the one with the most merit) is for clarity and usability.
Modules exist as ways of grouping units of code by the responsibilities of that code. If you removed this hierarchy, wouldn't things become a lot more difficult to navigate and understand as a developer?
Lately, I've been moving more toward the proposed design in my Node.js projects. It keeps individual files concise, makes code sharing trivial, encourages stateless methods, and it makes writing tests a breeze.
That said, a nice SML module probably could work as the base abstraction here.
The beauty of commonjs modules is they allow you to focus on implementation, rather than identification. All functions can be anonymous, identified only by their path and named at the whims of the caller.
Except in this case the core is user-generated and ever-expanding.
I bet there are a lot of issues in Java history that could predict possible bumps in the road for such a system (since it was essentially concurrently designed by a bunch of actors -- except in that case they were corporate entities)
Seems especially applicable to fp (like erlang), where code reuse is more often of small functions.
Because, really, if you start throwing everything into one store, you're going to run into the naming conflict issue, and any attempt at addressing the naming conflict issue is going to either look like importing modules or look like namespaces. You either have to explicitly state what your program has access to, or you explicitly state what function you mean when you have access to everything. Realistically, if you give every function a unique name and don't use namespaces, then there will start to be functions called system_event_fire() and game_gun_fire() and disasters_house_fire() and you're right back to having namespaces, just not in name or with a syntax that makes things nice when you know you're dealing with specific things.
Though, it'd be nice if types weren't the only thing that could be placed into a namespace directly in .NET. I'd like to put free functions in there. The Math class in the System namespace only exists because of this. I'd have preferred there to be a System.Math namespace and Cosine and Sine be members of it. Then I could "using System.Math;" and call "Cos(angle)". Instead, I'm stuck in a limbo of half-qualified names.
And I like it. I like it a lot more than Python, Racket, Node.js, etc. and having to import this Thing X from that Module Y. I like the idea that linking modules together is defined at the build level, not at the individual source file level. These languages are supposed to be better for exploratory programming than Java and C#, but actually, you know, doing the exploring part is harder!
Sometimes, I really do just want to blap out the fully qualified name of a function, in place in my code. System.Web.HttpContext.Current.User. If I'm doing something like that, it's a hack, and I know it's a hack, and having the fully qualified name in there, uglying it up, makes clearer that it's a hack. Though, I suppose I'm one of the rare people who actually do go back and clean up my hacks.
EDIT: I thought I wrote more, weird.
The network-accessible database of every library, ever, is definitely a great idea. I think it's where we're heading, with tools like NPM, NuGet, etc. It seems like a natural progression to move the package manager into the compiler (or linker, rather, but that's in the compiler in most languages now). Add in support in an editor to have code completion lists include a search of the package repository and you're there.