Why do we need modules at all? (2011)

(erlang.org)

155 points · thomas11 · 11y ago · 76 comments

61 comments · 24 top-level

derefr · 11y ago · 20 in thread

A function's true name should be its content hash. (Where that content hash is calculated after canonicalizing all the call-sites in the function into content hash refs themselves.) This way:

- functions are versioned by name

- a function will "pull in" its dependencies, transitively, at compile time; a function will never change behaviour just because a dependency has a new version available

- the global database can store all functions ever created, without worrying about anyone stepping on anyone else's toes

- magical zero-install (runtime reference of a function hash that doesn't exist -> the process blocks while it gets downloaded from the database.) This is safe: presuming a currently-accepted cryptographic hash, if you ask for a function with hash X, you'll be running known code.

- you can still build "curation" schemes on top of this, with author versioning, using basically Freenet's Signed Subspace Key approach (sort of equivalent to a checkout of a git repo). The module author publishes a signed function which returns a function when passed an identifier (this is your "module"). Later, they publish a new function that maps identifiers to other functions. The whole stdlib could live in the DB and be dereferenced into cache on first run from a burned-in module-function ref.

- function unloading can be done automatically when nothing has called into (or is running in the context of) a function for a while. Basically, garbage collection.

- you can still do late binding if you want. In Erlang, "remote" (fully-qualified) calls don't usually mean to switch semantics on version change; they just get conflated with fully-qualified self-calls, which are explicitly for that. In a flat function namespace, you'd probably have to make late-binding explicit for the compiler, since it would never be assumed otherwise. E.g. you'd call apply() with a function identifier, which would kick in the function metadata resolution mechanism (now normally just part of the linker) at runtime.

Plug: I am already working on a BEAM-compatible VM with exactly these semantics. (Also: 1. a container-like concept of security domains, allowing for multiple "virtual nodes" to share the same VM schedulers while keeping isolated heaps, atom tables, etc. [E.g. you set up a container for a given user's web requests to run under; if they crash the VM, no problem, it was just their virtual VM.] 2. Some logic with code signing such that calling a function written by X, where you haven't explicitly trusted X, sets up a domain for X and runs it in there. 3. Some PNaCl-like tricks where object files are simply code-signed binary ASTs, and final compilation happens at load-time. But the cached compiled artifact can sit in the global database and can be checked by the compiler, and reused, as an optimization of actually doing compilation. Etc.) If you want to know more, please send me an email (levi@leviaul.com).
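A minimal sketch of the naming scheme above, in JavaScript for Node.js. `FunctionStore`, `put`, and `load` are hypothetical names, and passing dependencies as an explicit name-to-hash table is a simplifying assumption standing in for canonicalizing the call-sites in an AST.

```javascript
const { createHash } = require('node:crypto');

const sha256 = (text) => createHash('sha256').update(text, 'utf8').digest('hex');

// Toy in-memory stand-in for the "global database" (hypothetical API).
class FunctionStore {
  constructor() {
    this.entries = new Map(); // content hash -> { source, deps }
  }

  // `source` is a JS function expression; `deps` maps each free name used
  // in `source` to the content hash of the function it refers to.
  put(source, deps = {}) {
    for (const hash of Object.values(deps)) {
      if (!this.entries.has(hash)) throw new Error(`unknown dependency ${hash}`);
    }
    // Sort the dependency table so the hash does not depend on key order.
    const pinned = Object.keys(deps).sort().map((name) => [name, deps[name]]);
    const hash = sha256(JSON.stringify({ source, deps: pinned }));
    this.entries.set(hash, { source, deps });
    return hash;
  }

  // Rebuild a callable, binding each dependency to the exact hash pinned at put() time.
  load(hash) {
    const entry = this.entries.get(hash);
    if (!entry) throw new Error(`no function named ${hash}`);
    const names = Object.keys(entry.deps);
    const values = names.map((name) => this.load(entry.deps[name]));
    return new Function(...names, `return (${entry.source});`)(...values);
  }
}

const store = new FunctionStore();
const incV1 = store.put('(x) => x + 1');
const twice = store.put('(x) => inc(inc(x))', { inc: incV1 });
const incV2 = store.put('(x) => x + 2'); // a "new version" is simply a new name
console.log(store.load(twice)(1)); // 3: `twice` still calls incV1
```

Because the dependency hashes are part of what gets hashed, publishing `incV2` cannot change what `twice` does; a caller that wants the new behaviour has to publish a new `twice` with a new name.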

seiji · 11y ago

I wrote a prototype programming language + distributed function storage system like this a few years ago.

I stopped when the madness became unbearable.

Exhibit A: https://github.com/mattsta/zlang/blob/master/priv/site_macro...

Exhibit B: https://github.com/mattsta/zlang/blob/master/examples/editor...

(Unbearable from an "I created a new unmaintainable language" POV; nothing bad about the distributed function storage feature.)

tiglionabbit · 11y ago

Oh my god! This is exactly what I've been working on in my spare time. I was just trying to come up with the specific content ID algorithm for it.

I made a graph-based UI for editing this kind of program since you wouldn't want to deal with hash-based names directly. You can see an example program here: http://nickretallack.com/visual_language/#/f2983238d90bd3e0a...

Currently it runs in JavaScript, but I was looking at other languages that it would be good to compile to, and Erlang seems like a pretty good fit for it.

Here's some other thoughts about it: https://docs.google.com/a/thinair.com/document/d/1WtgfUqN6Sd...

Can we work together?

Retric · 11y ago

This sounds great. The issue is that when you fix a function, you generally want everyone to use your bug fix. This is even more important for security issues.

However, sometimes code needs the broken version of some function...

There really is no great universal solution to this stuff.

derefr · 11y ago

The second one would be the behaviour you'd get out of the "low-level" contract of the system. The former could be built on top: you could have late-bound references that effectively call a function with a semver constraint. e.g.

    apply(repo_handle_foo, bar_fn, {'~>', 2, 3})

Meaning the system would bind that to a version of bar_fn that exists in the foo repo, and is considered to be in the 2.3.X release series. (Conveniently, given the global code DB, it'd probably always resolve to the very latest thing in that series.)
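A rough sketch of that resolution step in JavaScript, assuming the constraint means "newest revision in the 2.3.X series" as described above. The revision list and `resolveSeries` are made up for illustration.

```javascript
// Each published revision: a version triple plus the content hash it names.
// The data below is invented for this example.
const barFnRevisions = [
  { version: [2, 2, 9], hash: 'aaa' },
  { version: [2, 3, 0], hash: 'bbb' },
  { version: [2, 3, 4], hash: 'ccc' },
  { version: [3, 0, 0], hash: 'ddd' },
];

// Resolve {'~>', major, minor} to the newest revision in the major.minor.X series.
function resolveSeries(revisions, major, minor) {
  const inSeries = revisions.filter(
    ({ version }) => version[0] === major && version[1] === minor
  );
  if (inSeries.length === 0) return null;
  return inSeries.reduce((best, rev) => (rev.version[2] > best.version[2] ? rev : best));
}

console.log(resolveSeries(barFnRevisions, 2, 3).hash); // 'ccc'
```

The late-bound call only picks which hash to use; once resolved, the call goes to an immutable function like any other.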

I would suggest a slight change in thinking, though: at the individual function level, you can version implementation, but you don't version semantics. If you have foo 1.0 and foo 2.0 that do different things, that have different API contracts—then those are simply different functions which should have separate identifiers. The only time a function-contract identifier should have "revisions" is to correct flaws that diverge the implementation from the contract. Eventually, the function perfectly does what the contract specifies; at that point, the function is done. If you want the function to do something else, new contract = new identifier.

(But what if your old implementation works and meets the contract, but is way too slow? This is where you get into alternative implementations of the same contract, and compilers doing global runtime A/B-testing of different implementations as a weird sort of JIT. This is orthogonal to versioning, almost, but it means you can't put the version constraints like "relies on [revision > bugfix] of foo" in your code—because what if the VM is running a foo that never had the bug? So those constraints go in the database itself, and effectively "hide" old versions from being offered as resolution targets, without impeding direct reference.)

agumonkey · 11y ago

The hash must account for bound variables, so that fun add(x, y) { return x + y } is the same as fun add(a, b) { return a + b }. In other words, hashing should respect alpha-equivalence (independence from the names of bound variables).

derefr · 11y ago

Right. Identifier names effectively become metadata; the AST would look like numbered SSA references (as in LLVM, which does a similar transformation, to allow the optimizer pass to just pattern-match known AST "signatures" and rewrite them.)
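One way to sketch "names become metadata" in JavaScript: convert bound variables to positional (De Bruijn) indices before hashing. The tiny term format and the names `canonicalize` and `contentHash` are assumptions for this example, not anything LLVM or BEAM actually does.

```javascript
const { createHash } = require('node:crypto');

// Terms: { var: name } | { lam: name, body: term } | { app: [fn, arg] } | { prim: name }
function canonicalize(term, env = []) {
  if ('var' in term) {
    const index = env.indexOf(term.var); // 0 = innermost enclosing binder
    return index === -1 ? { free: term.var } : { bound: index };
  }
  // The binder's name is dropped; only its position survives.
  if ('lam' in term) return { lam: canonicalize(term.body, [term.lam, ...env]) };
  if ('app' in term) return { app: term.app.map((t) => canonicalize(t, env)) };
  return { prim: term.prim };
}

function contentHash(term) {
  return createHash('sha256').update(JSON.stringify(canonicalize(term))).digest('hex');
}

// Build the curried term  \p. \q. ((+ left) right)
const add = (p, q, left, right) => ({
  lam: p,
  body: { lam: q, body: { app: [{ app: [{ prim: '+' }, { var: left }] }, { var: right }] } },
});

console.log(contentHash(add('x', 'y', 'x', 'y')) === contentHash(add('a', 'b', 'a', 'b'))); // true
console.log(contentHash(add('x', 'y', 'x', 'y')) === contentHash(add('x', 'y', 'y', 'x'))); // false
```

Renaming the parameters leaves the hash unchanged, while swapping the operands does not: the hash tracks structure, not behaviour, so `x + y` and `y + x` are still different functions to it.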

3rd3 · 11y ago

What is the probability to have a hash collision of two functional pieces of code? For example one could possibly replace a hard-coded domain "http://example.com" with "http://exàmlpœ.com" (assuming it results in the same hash), then register exàmlpœ.com and do a man-in-the-middle attack with it.

benjaminjackman · 11y ago

I am not a cryptographer so maybe one of them can chime in and verify this:

If you use a strong crypto hashing algorithm that would be impossible given current computational resources and their growth for eons into the future. For example, there are no known collisions of SHA-2. No one has found 2 items that have the same SHA-2 hash.

Quantum computers / some break-through algorithm could change that. If that happens all encryption on the internet likely breaks then as well, except where quantum cryptography is being used.

geofft · 11y ago

The probability, assuming a properly designed hash function, is no more than about one in 300 undecillion. No algorithmic flaws are known that would let an attacker take an arbitrary input to MD5, SHA-1, SHA-2, or SHA-3 and construct a hash collision in noticeably better time. There are attacks that let an attacker construct two inputs with the same hash for MD5 and SHA-1, so out of an abundance of caution we consider MD5 and SHA-1 both "broken"; there are now two successor hashes to SHA-1, and we're hurrying to replace all uses of MD5 and SHA-1 where collisions matter.

If you're not designing your software to worry about cosmic rays, you shouldn't design your software to worry about hash collisions. Just pick a good hash.
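For a sense of scale, a small worked calculation in JavaScript. It assumes the "300 undecillion" figure refers to a 128-bit digest space (2^128 is about 3.4 × 10^38), and it uses the standard birthday approximation for accidental collisions. The workload of a trillion stored functions is a made-up number.

```javascript
// Size of a 128-bit digest space, computed exactly with BigInt.
const space128 = 2n ** 128n;
console.log(space128.toString()); // 340282366920938463463374607431768211456

// Birthday approximation: with n random items and a b-bit hash,
// P(any collision) is roughly n^2 / 2^(b+1) while that value is small.
function collisionProbability(n, bits) {
  return (n * n) / Math.pow(2, bits + 1);
}

// A trillion stored functions under a 256-bit hash (hypothetical workload).
console.log(collisionProbability(1e12, 256));
```

This only covers accidental collisions between honestly produced inputs; deliberate attacks depend on weaknesses in the specific hash, which is the MD5 and SHA-1 situation described above.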

pjc50 · 11y ago

> a function will never change behaviour just because a dependency has a new version available

Presumably this only works for pure-functional languages?

endergen · 11y ago

I'm assuming his suggestion is no updates ever unless you update your dependencies' versions. Many module systems, npm for example, have a culture of fuzzy version matching, which means that when you run npm install again you can easily pull in new versions of libraries that were released since your last npm install. I'm a fan of strict dependencies, but some people prefer to make it easier to stay on the latest minor patch or to follow other logical upgrade patterns.

endergen · 11y ago

Another option would be to have an area reserved for metadata about upgrades, one that can't be modified by page content but is rendered at the browser level, while pages still have fixed URLs and stay hash-verifiable. At the top there would be something like: There have been 5 changes to this page since this version. See Changes [link] Go to latest version [link]

tel · 11y ago

Not even them. A data type could change: the interfaces between functions are what must be versioned, e.g. with an SML-style module system.

ajuc · 11y ago

It would be nice if the hash was local (so functions that have similar structure have similar hashes).

IDE could do "It looks like you're trying to write qsort - we already have it written 10000 times in these functions: ...".
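A cryptographic hash deliberately has no locality, so this would have to be a second, similarity-preserving signature stored next to the real name. A minimal sketch in JavaScript using MinHash over token shingles, as one possible technique; `shingles`, `minhash`, and `similarity` are hypothetical helpers and the token regex is deliberately crude.

```javascript
const { createHash } = require('node:crypto');

// Split source into identifier, number, and punctuation tokens,
// then collect the set of overlapping 3-token shingles.
function shingles(source, size = 3) {
  const tokens = source.match(/[A-Za-z_]\w*|\d+|\S/g) || [];
  const out = new Set();
  for (let i = 0; i + size <= tokens.length; i++) {
    out.add(tokens.slice(i, i + size).join(' '));
  }
  return out;
}

// MinHash signature: for each of k salted hash functions, keep the smallest value seen.
function minhash(source, k = 64) {
  const set = shingles(source);
  const signature = [];
  for (let seed = 0; seed < k; seed++) {
    let min = Infinity;
    for (const s of set) {
      const h = createHash('sha256').update(`${seed}:${s}`).digest().readUInt32BE(0);
      if (h < min) min = h;
    }
    signature.push(min);
  }
  return signature;
}

// The fraction of matching slots estimates the Jaccard similarity of the shingle sets.
function similarity(a, b) {
  return a.filter((v, i) => v === b[i]).length / a.length;
}

const sortA = 'function sort(xs) { return xs.slice().sort((a, b) => a - b); }';
const sortB = 'function sort(ys) { return ys.slice().sort((a, b) => a - b); }';
const other = 'const greet = name => `hello ${name}!`;';
console.log(similarity(minhash(sortA), minhash(sortB)));
console.log(similarity(minhash(sortA), minhash(other)));
```

An IDE could index these signatures to suggest near-duplicates, while the content hash remains the only thing used for naming and trust.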

benjaminjackman · 11y ago

"the global database can store all functions ever created, without worrying about anyone stepping on anyone else's toes"

I've had similar thoughts in the past, though my thought was as a function loader for javascript. Why not just experiment with this as a library for javascript functions? Downloading the code from the "global database" would be provided by a function:

    getDataByHash(hash) : Blob

To load the returned code, use eval (blech), or back up a step and use a script tag plus an advanced version of magnet links I'd call snow links:

There will have to be a dom-scanning/watching js library that the pages load from a normal url which scans the page for snow links, downloads them by calling getDataByHash(), and then swaps in new URLs using URL.createObjectURL(Blob).
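A minimal sketch of the `getDataByHash` step in JavaScript (Node.js), assuming SHA-256 names and a list of provider callbacks. The second parameter and the synchronous providers are simplifications for this example; a real version would be asynchronous and return a Blob in the browser.

```javascript
const { createHash } = require('node:crypto');

const sha256 = (bytes) => createHash('sha256').update(bytes).digest('hex');

// Try each provider in turn; accept bytes only if they hash to the requested name.
// A provider is any function hash -> Buffer | null (peer, local cache, CDN fallback).
function getDataByHash(hash, providers) {
  for (const provider of providers) {
    const bytes = provider(hash);
    if (bytes && sha256(bytes) === hash) return bytes;
  }
  throw new Error(`no provider returned valid data for ${hash}`);
}

const script = Buffer.from('console.log("hello from the DHT")');
const name = sha256(script);

const lyingPeer = () => Buffer.from('alert("pwned")'); // wrong content, gets rejected
const honestPeer = (hash) => (hash === name ? script : null);

console.log(getDataByHash(name, [lyingPeer, honestPeer]).toString());
```

The verification step is what makes untrusted peers usable: it does not matter who serves the bytes as long as they hash to the name that was requested.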

This leads me in a somewhat orthogonal direction to your post...

Obviously any element in the page that loads resources would benefit from this (link rel=stylesheet, img, iframe, etc) so the document-scanner should work its magic on them as well.

To me the tricky part becomes "the global database." There are several ways to implement it. My thought would be to build it as a DHT on top of WebRTC. I'd look into webtorrent; it has to scale from very small to very large files. Maybe have multiple different DHTs that the scanner will try.

Storing stuff in the DHT ought to be as simple as declaring to the network that data (or solution) of a given hash is known. Clients p2p connect over to the knower (or daisy chain style as a sort of ad-hoc STUN/TURN setup?), and virally spread the data by declaring they too now have the solution for the hash they just downloaded. A CDN can still store a file and can be listed as a fallback provider in the snow url.

As an example application:

Once this DHT exists and p2p is built on top of it, it should not be hard to ask peers to run the content of a given SHA and return the result (or return its SHA if it's large, effectively memoizing the function). Something like:

    run(hash) : RunOffer

Any peer (AWS or Google etc. could implement a peer as well) could respond to that with an offer to run the computation for an amount of p2p currency that scales to billions or trillions of tiny transactions per second.

...

Anyhow this comment is way too long already, but the ideas keep flowing from here. There are a lot of technical challenges for vital features (like how to register (im)mutable aliases for hashes and distribute them to the network without using DNS for authoritative top-level namespacing? How to get reliable redundancy guarantees for stored data without resorting to a CDN? What to do about realtime streaming of data, where data is produced by an ongoing process? Grouping multiple devices for the same user?)

All-in-all it seems like being able to get data by the hash of its content from inside of a program (including library code) as easily as loading it from a URL or off the filesystem is pretty useful. It also seems like we can engineer the technology to make this happen, so I think it's inevitable and will happen pretty soon.

derefr · 11y ago

> Clients p2p connect over to the knower (or daisy chain style as a sort of ad-hoc STUN/TURN setup?), and virally spread the data by declaring they too now have the solution for the hash they just downloaded.

This is, in fact, exactly what Freenet does! It's a global DHT acting as a content-addressable store, where there's no "direct me to X" protocol function, only a "proxy my request for X to whoever you suspect has X, and cache X for yourself when returning it to me" function.
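A toy model of that "proxy and cache on the way back" behaviour in JavaScript. It is only a sketch of the idea: the `Peer` class is hypothetical, neighbors are tried in order rather than by any "who probably has X" heuristic, and nothing here reflects Freenet's actual routing or wire protocol.

```javascript
class Peer {
  constructor(name) {
    this.name = name;
    this.cache = new Map(); // content hash -> data
    this.neighbors = [];
  }

  // There is no "tell me who has X": a peer either answers from its cache or
  // proxies the request to its neighbors, caching the answer on the way back.
  request(hash, visited = new Set()) {
    if (this.cache.has(hash)) return this.cache.get(hash);
    visited.add(this);
    for (const neighbor of this.neighbors) {
      if (visited.has(neighbor)) continue; // avoid request loops
      const data = neighbor.request(hash, visited);
      if (data !== undefined) {
        this.cache.set(hash, data);
        return data;
      }
    }
    return undefined;
  }
}

const [a, b, c] = ['a', 'b', 'c'].map((n) => new Peer(n));
a.neighbors.push(b);
b.neighbors.push(a, c);
c.neighbors.push(b);
c.cache.set('abc123', 'payload'); // placeholder key, not a real digest

console.log(a.request('abc123')); // 'payload'
console.log(b.cache.has('abc123')); // true: b cached it while proxying
```

Popular content ends up replicated along the paths it is requested over, which is the "viral spread" property from the quoted comment.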

(Aside: Freenet also does Tor-like onion-routing stuff between the DHT nodes, which makes it extremely slow to the point that it was discarded by most as a solution for anonymous mesh-networking. But it doesn't have to do that. "Caching forwarding-proxy DHT" and "encrypted onion-routing message transport" are completely orthogonal ideas, which could be stacked, or used separately, on a case by case basis. "Caching forward-proxying DHT" is to the IP layer as "encrypted onion-routing message transport" is to TLS. PersistentTor-over-PlaintextFreenet would be much better than either Tor or Freenet as they are today.)

I agree that this could totally be done as a library in pretty much any language that allows for runtime module loading or runtime code evaluation. And, like you say, there are tons of interesting corollaries.

I think, though, that to be truly useful, you have to push this sort of idea as low in the stack as possible.

Unix established the "file descriptor" metaphor: a seekable stream of blocks-of-octets, some of which (files) existed persistently on disk, some of which (pipes) existed ephemerally between processes and only required resources proportional to the unconsumed buffer, some of which (sockets) existed between processes on entirely separate hosts. Everything in Unix is a file. (Or could be, at least. Unix programmers get "expose this as a file descriptor" right at about the same rate web API designers get REST right.) Unix (and most descendants) are "built on" files.

To truly expose the power of a global code-sharing system, you'd need the OS to be "built on" global-DHT-dereferenceable URNs. There would be cryptographic hashes everywhere in the system-call API. Because your data likely wouldn't just be on your computer (just cached there), you'd see cryptographic signatures instead of ownership, and encryption where a regular OS would use ACL-based policies.

At about the same level of importance that Unix treats "directories", such an OS would have to treat the concept of data equivalence-mapping manifests (telling you how to get git objects from packfiles, a movie from a list of torrent chunk hashes, etc.).

For any sort of collaborative manipulation of shared state, you'd probably see blocks from little private cryptocurrency block-chains floating about in the DHT, where the "currency" just represents CPU cycles the nodes are willing to donate to one-another's work.

And then (like in Plan9's Fossil), you might see regular filesystems and such built on top of this. But they'd only be metaphors—your files would be "on" your disk to about the same degree that a process's stack is "in" L1 cache. Disks would really just be another layer of the memory hierarchy, entirely page-cache; and marking a memory page "required to be persisted" would shove it out to the DHT, not to disk, since your disk isn't highly available.

—but. Designing this as an OS for PCs would be silly. It would have much too far to go to catch up with things like Windows and OSX. Much better to design it to run in userspace, or as a Xen image, or on a rump kernel on server hardware somewhere. It'd be an OS for cloud computers, or to be emulated in a background service on a PC and spoken with over a socket, silently installed as a dependency by some shiny new GUI app.

Which is, of course, the niches Erlang already fits into. So see above :)

thekaleb · 11y ago

I use requirejs for module loading and for my personal projects try to limit each module to a single function. Contrived example:

    define('object/clone', ['object/each'], function(each) {
        /**
         * Shallow copy objects
         *
         * @param {object} obj - object from which to copy properties
         * @return shallow copy of `obj`
         */
        return function(obj) {
            var clone = {};
            each(obj, function(key, value) {
                clone[key] = value;
            });
            return clone;
        };
    });

chongli · 11y ago

What happens if somebody discovers a critical weakness in your hash function?

flixt · 11y ago

Is this a bit like blockspring?

hackerews · 11y ago

Yup - building a universal library of functions. Let me know if you have any questions paul@blockspring.com.

jbert · 11y ago · 6 in thread

So, immutability and/or api contract is important here.

If I'm pulling in a function, I want it to do what I think I want. Sometimes I want that to change (get a bug fix), but sometimes I don't (someone introduces a bug, or makes the func more general and introduces slowdowns).

This feels like a job for a content-addressable git-like tool. How about this:

I can discover my function (via whatever means). The function is actually named 8804ea505fda087da53b799434c377f015933707 (the sha-something of its (normalised?) textual representation).

I then import it into my codebase as "useful_fun". My code reads like:

    useful_fun("do it", "to it")

but I have some kind of dependencies/import record which says that "useful_fun" is actually 8804ea505fda087da53b799434c377f015933707. That means one and only one thing across all time, the func with that hash.

So how do we handle updates? If we want a golang-like model, the developer could run something like "update deps". This would:

- go back to the central repository, looking for updates to 8804ea505fda087da53b799434c377f015933707. It might find 5. Local policy then determines what happens. Could be "always choose the original author's update" or "choose the one with the most votes" or "always ask the dev, showing diffs".

Note that because the unique name is based on the function content, any change to it creates a new item in the db. (Content-addressability, same way git and other systems do it.)

- stuff can be grouped and batched. If I pull in 10 functions tagged with the same project ('module') and they've all been updated, I can say "and do the same with all the others".

- This kind of metadata allows all kinds of good stuff. I can subscribe to alerts on the functions I've imported and get told about new versions, or security warnings. This kind of subscription information can be used as a popularity contest to solve the "which fork on github do I want to use" problem.

- people can still publish modules. They now look like a git directory or tree. A git tree is a blob which contains the hashes of the files within it. A 'module' could be a blob which specifies which (immutable) functions are in it.

If we use normalised functions, we've now got a module representation which allows arbitrary functions to be pulled together. At fetch time, we can denormalise into the user's preferred coding style. At push time, we renormalise. We aren't grouping stuff into files, so a 'project' or a 'module' consists solely of the semantic contents, nothing to do with artificial grouping for the file system.

Seems like an interesting future.
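A small sketch of the import record and the "update deps" step in JavaScript. The successor list, the `author` and `votes` fields, and the short replacement hashes are all invented for illustration; only the alias `useful_fun` and its hash come from the example above.

```javascript
// Import record: human-friendly alias -> immutable content hash.
const lockfile = { useful_fun: '8804ea505fda087da53b799434c377f015933707' };

// What the central repository might report: known successors of each hash.
// All entries here are made up.
const successors = new Map([
  ['8804ea505fda087da53b799434c377f015933707', [
    { hash: '1111aaaa', author: 'original', votes: 3 },
    { hash: '2222bbbb', author: 'someone_else', votes: 40 },
  ]],
]);

// Two of the local policies mentioned above.
const preferOriginalAuthor = (cands) => cands.find((c) => c.author === 'original') || null;
const preferMostVotes = (cands) => cands.reduce((best, c) => (c.votes > best.votes ? c : best));

// Returns a new import record; the old one is left untouched.
function updateDeps(lock, known, policy) {
  const next = {};
  for (const [alias, hash] of Object.entries(lock)) {
    const candidates = known.get(hash) || [];
    const choice = candidates.length > 0 ? policy(candidates) : null;
    next[alias] = choice ? choice.hash : hash;
  }
  return next;
}

console.log(updateDeps(lockfile, successors, preferMostVotes)); // { useful_fun: '2222bbbb' }
```

Calling code keeps saying `useful_fun(...)`; only the record changes, and the old hash still means exactly what it always did.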

doty · 11y ago

I think Mr. Armstrong would approve, given his comments near the end of https://www.youtube.com/watch?v=lKXe3HUG2l4, where he opines that the web would be great if, instead of URLs, every published document were just named with a hash of its content.

pyre · 11y ago

> every published document were just named with a hash of its content.

I see too many issues with this (for example):

- I publish a news article. I publish a retraction/update to said article. Now the article has a new hash. Does the old hash give you the old version of the article, or redirect you to the new version?

- How do we define 'document?' If we define it as the complete HTML page served up to the browser, then changes to the design of the site would invalidate all previous hashes. Pointing old hashes to new hashes is work, which will not always be done (leading to the same situation we have with site redesigns breaking old URLs).

esfandia · 11y ago

We have a P2P file-sharing program that does this, called U-P2P (http://u-p2p.sf.net). Content is hashed, and you use a Gnutella search using the hash to retrieve it. Documents are organized by what we call "communities", which themselves are represented by a document and its corresponding hash. So the document name is really made up of two hashes: the one of the community it belongs to, and its own hash. You can use these hashes as hyperlinks, and U-P2P resolves it via search, as previously mentioned.

What we think is great about it is that the hash is location-independent. There could be multiple copies of the document at various locations at any given point. As long as there is at least one copy and that it is reachable via search, it will be retrieved.

We also built a distributed Wiki based on that idea and platform, called P2Pedia (http://p2pedia.sf.net).

It's all very much an academic research project, so don't expect a beautiful interface or easy-to-install packaging or anything, but I think it's a good proof of concept.

(note to self: we should really move these to GitHub).

chenglou · 11y ago

In React.js, you can serialize your whole app state through a simple `JSON.stringify` and base64 encode that into the url. The nice property of that is that you get to pass that url around to friends, and when they click on it they'll go to the page, which decodes and deserializes the url and reproduces the exact app state, down to the letters in the input boxes.

Effectively, this gives you "program as a value" where the same url means the same program. Immutable programs basically.

I've tried this and the current downside is that it looks extremely ugly when you try to share a link lol. But this should be circumventable. The other downside is that this is a bit theoretical still. You'll have to exclude sensitive information such as passwords. Sometimes stuff is in a closure rather than in your `state`.
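A minimal sketch of the round trip in JavaScript (Node.js). `stateToUrl`, `urlToState`, and the `https://example.com/app` base are placeholders; in a browser you would use the page's own location and a browser-side base64 routine instead of `Buffer`.

```javascript
// Serialize state into a URL fragment (JSON, then URL-safe base64).
function stateToUrl(state, base = 'https://example.com/app') {
  const encoded = Buffer.from(JSON.stringify(state), 'utf8').toString('base64url');
  return `${base}#${encoded}`;
}

// Decode the fragment back into the original state object.
function urlToState(url) {
  const encoded = new URL(url).hash.slice(1); // drop the leading '#'
  return JSON.parse(Buffer.from(encoded, 'base64url').toString('utf8'));
}

const state = { user: 'joe', password: 'hunter2', draft: 'Why do we need modules?' };
const { password, ...shareable } = state; // keep secrets out of the link
const link = stateToUrl(shareable);
console.log(link);
console.log(urlToState(link));
```

Anything that is not plain JSON data (closures, DOM nodes, open connections) does not survive the round trip, which is the "stuff in a closure" caveat above.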

rictic · 11y ago

Emerging standard in that area, subresource integrity: http://w3c.github.io/webappsec/specs/subresourceintegrity/

It's initially just doing the simplest possible thing (making the resource unavailable unless its hash is valid) but semantically it will probably be allowed for the browser to resolve the resource using other methods (e.g. if it already has that resource cached from another URL) so long as the hash matches.
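A short sketch in JavaScript (Node.js) of computing and checking an integrity value in the `<algorithm>-<base64 digest>` form used by subresource integrity. `integrityFor` and `matchesIntegrity` are hypothetical helper names, not browser APIs.

```javascript
const { createHash } = require('node:crypto');

// Build an integrity value of the form "<algorithm>-<base64 digest>".
function integrityFor(bytes, algorithm = 'sha384') {
  return `${algorithm}-${createHash(algorithm).update(bytes).digest('base64')}`;
}

// Accept the resource only if its digest matches, wherever the bytes came from.
function matchesIntegrity(bytes, integrity) {
  const algorithm = integrity.slice(0, integrity.indexOf('-'));
  return integrityFor(bytes, algorithm) === integrity;
}

const resource = Buffer.from('console.log("library code")');
const integrity = integrityFor(resource);
console.log(integrity);
console.log(matchesIntegrity(resource, integrity)); // true
console.log(matchesIntegrity(Buffer.from('tampered'), integrity)); // false
```

Since the check depends only on the bytes, a cache keyed by the integrity value could in principle serve the resource regardless of which URL it was first loaded from.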

jacquesm · 11y ago

So we could simply set up a url shortening service that published such hashes. Unfortunately with the 'dynamic' nature of web pages these days that's going to be hard to go back to. It may be an interesting way to re-boot the web though. 'regular' Urls are then merely a DNS like layer on top of a content hashing scheme.

Alex3917 · 11y ago · 4 in thread

This is basically what Urbit is doing, among other things.

balquhidder · 11y ago

Is Urbit a real thing, or an elaborate hoax?

reirob · 11y ago

It made me curious too. Found this HN post about urbit: https://news.ycombinator.com/item?id=6438320

kencausey · 11y ago

https://github.com/cgyarvin/

shaurz · 11y ago

Probably both.

zo1 · 11y ago · 3 in thread

I don't know Erlang, so I might be missing something key here.

"I am thinking more and more that if would be nice to have all functions in a key_value database with unique names."

Yeah, sure... Sounds good, right. Until you have naming conflicts.

So then the patch is "oh, let's just add another column to make it more unique", without realizing that you've just, in essence, created a "module" of sorts except it's stored in some sort of giant key/value database.

And then you've come full-circle back to the dilemma the author complains of which is that he doesn't know where to put a function that seems to belong in two modules.

Eventually, I'd say this is a general failing of modules that could potentially be solved by some sort of inheritance. Maybe even a tagging mechanism if you really want to be "patch-work joe" about it.

seiji · 11y ago

Okay, let's try to not shit all over new ideas here. If we take what Joe-from-2011 means instead of hallucinating him to be incompetent...

Let's re-word "all functions in a database" as "a revision control system in a database."

So, let's make a revision control system. All contents, branches, tags are kept in a database.

> Sounds good, right. Until you have naming conflicts.

No, no, no. There are no naming conflicts. The names humans will use are just pointers to the most recently updated underlying contents. The _actual_ names are garbage hash identifiers. The _usable_ names are human names bound to underlying contents.

So, if master is commit A and you make commit B, there is no naming conflict on the name "master," you just re-point it to commit B.

> a function that seems to belong in two modules.

That's the problem with explicit hierarchy and why the world now runs on tagging-based crowdsourced folksonomies.
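The name-as-pointer and tagging ideas above fit in a few lines of JavaScript. The three maps and the `commitA`/`commitB` keys are placeholders standing in for real content hashes.

```javascript
const contents = new Map(); // content hash -> immutable content (keys are placeholders)
const refs = new Map();     // human name -> content hash
const tags = new Map();     // tag -> set of content hashes

contents.set('commitA', 'first version');
contents.set('commitB', 'second version');

refs.set('master', 'commitA');
refs.set('master', 'commitB'); // re-point the name; commitA is still stored

// A piece of content can carry any number of tags, so it never has to
// "belong" to exactly one place in a hierarchy.
function tag(label, hash) {
  if (!tags.has(label)) tags.set(label, new Set());
  tags.get(label).add(hash);
}
tag('strings', 'commitB');
tag('parsing', 'commitB');

const resolve = (name) => contents.get(refs.get(name));
console.log(resolve('master')); // 'second version'
```

Two people both wanting the name `master` is a question about whose ref table you consult, which is separate from the conflict-free content store underneath.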

zo1 · 11y ago

"A revision control system in a database"? A revision control system is a database. I think what you and the author are trying to get at is some sort of "docker, but for functions" type of thing. And we all know what a mess that is when it comes to public docker images.

"No, no, no. There are no naming conflicts. The names humans will use are just pointers to the most recently updated underlying contents. The _actual_ names are garbage hash identifiers. The _usable_ names are human names bound to underlying contents. So, if master is commit A and you make commit B, there is no naming conflict on the name "master," you just re-point it to commit B."

"Master" is the actual name that is going to be conflicting, if I understand your example.

vegedor · 11y ago

>he doesn't know where to put a function that seems to belong in two modules.

In a third module, where else?

cwmma · 11y ago · 2 in thread

JavaScript works similarly to this, and apps/libraries that wrap themselves in a giant closure work almost exactly like this. The disadvantage compared with using modules lies in the dependencies between functions. When you don't have modules and you try to refactor, you get this annoying tendency for function a in file b to break when you change function y in file z. When you have modules, you can easily tell before changing a function whether it is exported, and if it is, check in the other files whether its module is imported.

Not saying this Erlang idea isn't good or won't work, just that these are the pitfalls besides the obvious namespacing and conflicts.

seiji · 11y ago

> JavaScript works similar to this and apps/libraries that wrap themselves in a giant closure work almost exactly like this.

Nope. Joe's thought experiment is: what if every function became available in the global namespace? What if every function got kept in a global datastore so you: launch your REPL, run any function, and have it pulled down and work immediately. No imports unless you need to pin a specific function to a specific past revision.

Of course, someone would come along and say "these 30 functions only work together when pinned to these specific revisions," so you end up pulling down a named bundle of specific revisions, ...

protonfish · 11y ago

This problem can happen in any codebase, though I can see the argument that it might be more prevalent without modules. I can also see a lot of solutions: modifying functions to always be backward-compatible and defining the behavior of a function with a thorough unit test.

andrewstuart2 · 11y ago · 1 in thread

Because humans suck at serialized content.

7 +- 2. [1] That's the number of things our prefrontal cortex/short term memory can track at once. That's why we (humans) organize things into hierarchies. That's why the best team size is around that number. Etcetera.

Heck, everything in the world on a computer is serialized into memory or onto disk. Or addressed as some disk in a serial array of disks. Serialized as in, "there's some data somewhere in these 2TB that tell me where in the same 2TB the rest of the data is." Computers excel at this. Humans are terrible at this.

I guess my point is, humans are the reasons we need modules.

[1] http://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_...

PythonicAlpha · 11y ago

That is exactly, what I thought, too!

It is all about handling complexity! By putting together things that belong together, complexity is reduced. Also engines and other machines are designed that way: Things belonging together, are put together in the same spot.

Verdex · 11y ago · 1 in thread

I saw Joe's strange loop talk [1] a while ago and I get the same vibe reading his post as I did when watching the video. It sounds very cool, but I can't shake the feeling that it only works for 85% of the code. That is to say if you program in exactly the right way, you will be able to do everything you want and it will work with this system, but there are ways of programming that won't work with this system.

More specifically I feel like there are two problems. 1) It feels suspiciously like there's a combination of halting problem and diagonalisation that shows there are an uncountably infinite number of functions that we want to write that can't be named (although I would want to have a better idea of how this is supposed to work before I try to hammer out a proof). 2) I don't understand how it's possible for any hashing scheme to encode necessary properties of a function such that the function with necessary properties has a different hash than an otherwise identical function without these properties. For example can we hash these functions such that stable sort looks different than unstable sort? Wouldn't we need dependent typing to encode all required properties? And if that's the case couldn't I pull a Gödel and show that there's always one more property not encodable in your system?

[1] - https://www.youtube.com/watch?v=lKXe3HUG2l4 [2]

[2] - https://news.ycombinator.com/item?id=8572920 (thanks for the link)

nowne · 11y ago

There are a countably infinite number of functions. A simple proof is that each function can be represented as a string, and there are countably infinitely many strings over a finite alphabet. You could also argue that functions are equivalent to Turing machines, and there are only countably many Turing machines.
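The countability argument can be made concrete by enumerating every finite string over an alphabet in length-then-alphabetical order, so each string gets exactly one natural number. A sketch in JavaScript; `nthString` is a hypothetical helper implementing bijective base-k numbering.

```javascript
// Map a natural number n to the n-th string over `alphabet`,
// ordered by length and then alphabetically (0 -> empty string).
function nthString(n, alphabet) {
  const k = alphabet.length;
  let out = '';
  while (n > 0) {
    n -= 1; // shift to 0-based digits: this is what makes the numbering bijective
    out = alphabet[n % k] + out;
    n = Math.floor(n / k);
  }
  return out;
}

console.log([0, 1, 2, 3, 4, 5, 6].map((n) => JSON.stringify(nthString(n, 'ab'))).join(' '));
// "" "a" "b" "aa" "ab" "ba" "bb"
```

Every program text appears somewhere in this list, so the set of functions that can be written down (and hence named by hashing their text) is countable.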

thomas11 · OP · 11y ago

Armstrong's proposal reminds me a bit of Emacs extensions. Since Emacs Lisp doesn't have namespaces or modules, all functions must be uniquely named, which is done by prefixing them: foo-replace. This is not that different from having a module foo, as Armstrong notes: "managing a namespace with names like foo.bar.baz.z is just as complex as managing a namespace with names like foo_bar_baz_z".

But what it enabled is an Emacs community where single functions are freely shared, for example on http://www.emacswiki.org/emacs/. People just copy them into their Emacs init file. Sometimes they modify them a little and post them again with their own prefix. This has obvious downsides such as lack of versioning and organization. But it provides a low barrier to entry and creates a dynamic community.

inflagranti11y ago

To me this is the same question as whether we need directories in a file system. Ideally, your file system is a flat database and files are indexed by a vast array of automatically and manually added metadata that makes them easy to retrieve. Microsoft tried to go in this direction with WinFS, which was eventually cut from Vista, maybe because it wasn't practical (yet). Looking at how people use the Internet, though, where 90% of browsing starts at Google, this does seem a very reasonable approach for many things in the future. In the end, why should humans do manual indexing and retrieval if the computer can handle this part?

felixgallo11y ago

I think a lot of people are focusing on the implementation details here, which is fun and great, but the real deep insight here is the idea of a global registry of correct functions.

If you postulate for a minute that the (truly nontrivial) surface problems are all solved, and concentrate only on the idea of a universally accessible group of functions that accretes value over time -- like a stdlib that every language on every runtime could access -- that seems like a pretty exciting idea worth thinking about.

I had something like that idea almost two decades ago (http://www.gossamer-threads.com/lists/perl/porters/26139?do=...) but at the time it was all in fun. But these days, that sort of thing starts looking pretty possible, especially for the group of pure functions.

shaurz11y ago

I quite like the idea. I think it would probably still make sense to have "collections" where a bunch of related functions can be grouped together, discovered and worked on as a unit (this would just be an optional extra layer on top of the global function database). There would be no exclusivity in collections, though, so a function might appear in more than one collection, or in none.

Another idea: Unit tests could be stored as function metadata.
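A rough Python sketch of both ideas, under stated assumptions: `FunctionStore` and `Entry` are hypothetical names, the key is assumed to be a SHA-256 of the source text, and tests are stored as plain assertion strings. Collections are just non-exclusive labels on an entry, and tests travel with the function as metadata.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Entry:
    source: str
    collections: set = field(default_factory=set)  # zero or more labels
    tests: list = field(default_factory=list)      # assertion snippets

class FunctionStore:
    def __init__(self):
        self._entries = {}

    def put(self, source, collections=(), tests=()):
        # Assumption: the key is the SHA-256 of the source text.
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        entry = self._entries.setdefault(key, Entry(source))
        entry.collections.update(collections)
        entry.tests.extend(tests)
        return key

    def in_collection(self, name):
        return sorted(k for k, e in self._entries.items() if name in e.collections)

    def run_tests(self, key):
        # Illustration only: exec on untrusted code is unsafe.
        entry = self._entries[key]
        namespace = {}
        exec(entry.source, namespace)
        for test in entry.tests:
            exec(test, namespace)  # raises AssertionError on failure
        return len(entry.tests)

store = FunctionStore()
clamp_key = store.put(
    "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n",
    collections={"math", "ui"},
    tests=["assert clamp(5, 0, 3) == 3", "assert clamp(-1, 0, 3) == 0"],
)
orphan_key = store.put("def identity(x):\n    return x\n")
```

Here `clamp` sits in two collections at once and `identity` in none, which is the non-exclusivity described above.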

protomyth11y ago

Lambda the Ultimate's discussion http://lambda-the-ultimate.org/node/5079 is pretty interesting.

philbo11y ago

To answer the question in the title directly, I think modules are to aid reading and discovery.

The fact that it is difficult to decide which module a function belongs in doesn't make them pointless. People who have to read or debug your code use them to quickly zero in on areas of likely interest.

al2o3cr11y ago

In my experience, telling programmers "all functions must have unique names" means you get a half-ass module system tacked on via common prefixes. In other words, you get "foo_bar_function1", "foo_bar_function2" etc.

ryanisnan11y ago

While you're talking about Erlang specifically, the concepts you bring up can be applied to programming in general.

Why does Erlang (or any other language) have modules?

The biggest reason for me (and I think the one with the most merit) is for clarity and usability.

Modules exist as ways of grouping units of code by the responsibilities of that code. If you removed this hierarchy, wouldn't things become a lot more difficult to navigate and understand as a developer?

brianshaler11y ago

Is the author's use of the term `module` specific to erlang? To me, it sounds like he's advocating for modules that are comprised of a single function, rather than utility belt modules that contain many functions. As I understand it, I agree with what the author proposes, and I feel like a subset of npm already provides what he's talking about. The best example is probably underscore.js versus lodash.js, which both have many functions and a wide API surface area. What's notable is that you can cherry-pick individual lodash functions and depend on a specific version[0]. (Admittedly, I lazily pull in the full lodash module instead of importing only the function(s) I'm using)

Lately, I've been moving more toward the proposed design in my Node.js projects. It keeps individual files concise, makes code sharing trivial, encourages stateless methods, and it makes writing tests a breeze.

[0] https://www.npmjs.org/browse/keyword/lodash-modularized

tel11y ago

The problem is now you either have zero data abstraction or uncontrolled data abstraction without even a convention like "these functions work together as a bundle" to save you.

That said, a nice SML module probably could work as the base abstraction here.

rymohr11y ago

The problem with this approach is you need to consider every existing function name in order to define a new one.

The beauty of commonjs modules is they allow you to focus on implementation, rather than identification. All functions can be anonymous, identified only by their path and named at the whims of the caller.

endergen11y ago

Related to this would be all the cool content-addressable third-party metadata. Services could automatically generate pre-compiles of things or alternate optimizations. Or autocomplete data, statistics, test suites, behavioral diffing, example code, documentation; the options are endless.

the_cat_kittles11y ago

this talk about modules as a way to organize similar code makes me wonder- if you had all the functions in a global namespace, you could probably automatically generate some kind of organization by extracting relevant features from each function and doing some kind of clustering. maybe some features could be the function's dependencies, who depends on it, what it returns, its signature, and maybe even nlp in the hope that people are actually using descriptive variable names.
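A toy version of this in Python, as a sketch only: the feature set (arity, parameter names, directly called names), the Jaccard threshold of 0.5 and the greedy grouping are all assumptions picked for brevity, and `features`, `jaccard` and `cluster` are hypothetical names.

```python
import ast

def features(source):
    """Return a set of crude features for one function definition."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    feats = {f"arity:{len(func.args.args)}"}
    feats.update(f"param:{a.arg}" for a in func.args.args)
    for node in ast.walk(func):
        # Only direct calls by name, e.g. len(x); method calls are ignored.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            feats.add(f"calls:{node.func.id}")
    return feats

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster(sources, threshold=0.5):
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_features, [names])
    for name, src in sources.items():
        feats = features(src)
        for seed, members in clusters:
            if jaccard(seed, feats) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((feats, [name]))
    return [members for _, members in clusters]

SOURCES = {
    "total_len": "def total_len(items):\n    return sum(len(x) for x in items)\n",
    "count_chars": "def count_chars(items):\n    return sum(len(s) for s in items)\n",
    "open_log": "def open_log(path):\n    return open(path, 'a')\n",
}
GROUPS = cluster(SOURCES)
# GROUPS == [['total_len', 'count_chars'], ['open_log']]
```

A real system would want richer features (callers, return types, names in docstrings) and a proper clustering algorithm, but even this crude version separates the two length-summing functions from the file helper.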

fat0wl11y ago

Isn't this issue sort of analogous to the expansion/contraction of a language core?

Except in this case the core is user-generated and ever-expanding.

I bet there are a lot of issues in Java history that could predict possible bumps in the road for such a system (since it was essentially concurrently designed by a bunch of actors -- except in that case they were corporate entities)

hyp011y ago

reminds me of gmail: instead of hierarchical directories ("modules"), just search, and have multiple tags, so an email can be in more than one directory ("metadata").

Seems especially applicable to fp (like erlang), where code reuse is more often of small functions.

moron4hire11y ago

I think what you're discussing is really just namespacing à la C++, Java, or .NET. Especially with Java and .NET, you don't import a self-contained module directly from individual source files. The modules are technically all accessible at any time (or at least, the ones linked into the build, which in the case of the Java and .NET standard libraries is quite a smorgasbord). You just reference the class/function you want in some way: either with using statements or with fully qualified names.

Because, really, if you start throwing everything into one store, you're going to run into the naming conflict issue, and any attempt at addressing the naming conflict issue is going to either look like importing modules or look like namespaces. You either have to explicitly state what your program has access to, or you explicitly state what function you mean when you have access to everything. Realistically, if you give every function a unique name and don't use namespaces, then there will start to be functions called system_event_fire() and game_gun_fire() and disasters_house_fire() and you're right back to having namespaces, just not in name or with a syntax that makes things nice when you know you're dealing with specific things.

Though, it'd be nice if types weren't the only thing that could be placed into a namespace directly in .NET. I'd like to put free functions in there. The Math class in the System namespace only exists because of this. I'd have preferred there to be a System.Math namespace and Cosine and Sine be members of it. Then I could "using System.Math;" and call "Cos(angle)". Instead, I'm stuck in a limbo of half-qualified names.

And I like it. I like it a lot more than Python, Racket, Node.js, etc. and having to import this Thing X from that Module Y. I like the idea that linking modules together is defined at the build level, not at the individual source file level. These languages are supposed to be better for exploratory programming than Java and C#, but actually, you know, doing the exploring part is harder!

Sometimes, I really do just want to blap out the fully qualified name of a function, in place in my code. System.Web.HttpContext.Current.User. If I'm doing something like that, it's a hack, and I know it's a hack, and having the fully qualified name in there, uglying it up, makes clearer that it's a hack. Though, I suppose I'm one of the rare people who actually do go back and clean up my hacks.

EDIT: I thought I wrote more, weird.

The network-accessible database of every library, ever, is definitely a great idea. I think it's where we're heading, with tools like NPM, NuGet, etc. It seems like a natural progression to move the package manager into the compiler (or linker, rather, but that's in the compiler in most languages now). Add in support in an editor to have code completion lists include a search of the package repository and you're there.

tracker111y ago

dibs on create_uuid_v4!!


