Maybe, once you've found a location to read, insert, or write to. The author neglects the runtime cost of the hash algorithm itself, which may not be trivial; computing the hash of a string key is typically an O(n) operation in the length of the string.
Furthermore, unless a suitable table size is selected, even integer keys (should one use a map like an array) will eventually hash to the same bucket, requiring still more time to iterate through all the colliding entries until the desired value is found.
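The integer-key collision point can be seen with a toy table. This is a sketch, not any particular language's implementation; the capacity of 8 and the modulo bucketing scheme are assumptions chosen to make collisions obvious:

```javascript
// Toy fixed-size table using separate chaining; keys collide modulo capacity.
const CAPACITY = 8; // deliberately small so collisions are easy to see

function bucketIndex(intKey) {
  return intKey % CAPACITY; // keys 3, 11, 19, ... all land in bucket 3
}

const buckets = Array.from({ length: CAPACITY }, () => []);

function insert(key, value) {
  buckets[bucketIndex(key)].push([key, value]);
}

function lookup(key) {
  // After a collision we must scan the whole chain -- this is the
  // "iterate through all the remaining values" cost described above.
  for (const [k, v] of buckets[bucketIndex(key)]) {
    if (k === key) return v;
  }
  return undefined;
}

insert(3, 'a');
insert(11, 'b'); // collides with 3
insert(19, 'c'); // collides again
```

With a table size that isn't matched to the key distribution, all three keys share one bucket, and the lookup degrades into a linear scan of that bucket.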
"I don’t know why you would ever use [a linked list] over an array (or a hash)..."
Here's why: because arrays take up space whether you use it or not. Linked lists don't suffer from this problem, but at the cost of 1-2 pointers per item. Has the author seriously never managed memory before? Please tell me this article is a joke.
for (k in ['a', 'b']) {
console.log(k);
}
and get back 0, 1? However:

var a = {0:'a', 1:'b'};
for (k in a) {
  console.log(k);
}

also outputs 0, 1.

Edit: turns out, you can even have doubles, too:

var weird = {3.14:'hello', 6.28:'world'};
// the for loop above emits: 3.14, 6.28
console.log(weird[3.14]); // emits 'hello'

Yeah, I know, it always came off to me as BS to repeat that hashes are O(1) lookup, like we're making a special exception. Anywhere else, you can look up the algorithm and derive its big-O without having to memorize it, but not this one.
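The apparent "double keys" make more sense once you know that JavaScript object property names are always strings: the numeric literal 3.14 is coerced to the string '3.14' on both insert and lookup. A quick check (the variable name is just carried over from the snippet above):

```javascript
var weird = {3.14: 'hello', 6.28: 'world'};

// Property names are stored as strings, so numeric literals get coerced.
var keys = Object.keys(weird);
console.log(typeof keys[0]); // 'string'

// Lookup coerces the same way, which is why weird[3.14] works at all:
console.log(weird[3.14] === weird['3.14']); // true
```

So the object isn't really keyed by floats; it's keyed by the string forms of those floats.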
Pretty clearly, as the hash table grows linearly, the keyspace has to grow linearly and the key size logarithmically. Since you get the keys from the output of a (nice, well-mixed) hash function, then the computation for where to insert has to increase logarithmically -- a longer hash output requires more steps.
You only get O(1) by selectively ignoring one kind of asymptotic behavior (that of the computations needed for the hash).
Reddit discussion: http://www.reddit.com/r/compsci/comments/2z74z8/why_is_hasht...
Since that's usually not relevant to big-O analysis, we can sidestep the issue by specifying what we're counting: rather than say that mergesort runs in O(n (log n)^2) time on O(log n)-bit keys, we say it takes O(n log n) comparisons.
Whether you think of that as a theoretical computer with a constant-time comparison instruction or as a real computer with a commonsense (2^64) upper bound, the upshot is that we don't care. As long as what we're studying is sorting, not comparison, it will fall out of every algorithm, so the difference isn't important even in theory.
It's still an important thing to notice, though-- there are plenty of cases where you definitely do care about counting bits.
In more detail: http://cs.stackexchange.com/questions/1643/how-can-we-assume...
But for hashtables, what exactly are we studying that doesn't care about hash time? The whole reason a hashtable is supposed to save time is that you locate the desired key's value by computing the location from the key itself. To whatever extent that computation requires more steps, the cost cannot itself be abstracted away -- if only because a limitation on key size is a limitation on table size.
Also, if the key space is known beforehand, it is possible to build a minimal perfect hash table with roughly 1.3n total slots (where n is the size of the keyspace).
Another cost is cache misses. If you are doing operations in a sequential fashion, an array has a significant advantage when the number of elements is high. That's another tradeoff to take into consideration.
This is a good point, though to be fair, this aspect of it is usually also ignored when analyzing the main alternative data structures, which are balanced binary trees. At least every successful lookup in such a tree with strings as keys also has to read the entire string.
This means that hashes probably still have an advantage over balanced binary trees. Tries, on the other hand, could really beat hashes when the keys are strings. In practice, they might have a cache and branching disadvantage, though.
At the very least, this makes it easier for your algorithms to be more cache friendly, because tries' cache behavior is predictable in a way that hash maps' isn't.
This is even more relevant for branch prediction: hash functions (by design) are harder to predict, giving the lookup algorithm of a trie an advantage.
A cache aware design like "adaptive radix trees" delivers really good performance in practice, comparable or better than hash maps while also supporting fast iteration and certain queries (like iterating over all keys with a given prefix).
Take a look at the adaptive radix tree paper[1]: the idea is very accessible, and the resulting performance very impressive. They also generalize the system to work with different types of keys through a systematic method for turning values into binary strings.
[1]: http://www3.informatik.tu-muenchen.de/~leis/papers/ART.pdf
By the time you hash a key, you could have likely already inserted it into a trie.
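As a rough illustration of that point: a minimal (and completely unoptimized) trie touches each character of the key exactly once, which is the same amount of work a string hash function does before the table lookup even starts. A sketch, with made-up keys:

```javascript
// Minimal string-keyed trie: one node per character, value stored at the end.
function trieInsert(root, key, value) {
  let node = root;
  for (const ch of key) { // one step per character, same as hashing the key
    let child = node.children[ch];
    if (child === undefined) {
      child = { children: {} };
      node.children[ch] = child;
    }
    node = child;
  }
  node.value = value;
}

function trieLookup(root, key) {
  let node = root;
  for (const ch of key) {
    node = node.children[ch];
    if (node === undefined) return undefined;
  }
  return node.value;
}

const root = { children: {} };
trieInsert(root, 'hash', 1);
trieInsert(root, 'hashtable', 2); // shares the 'hash' prefix
```

Unlike a hash table, the trie also gives you ordered traversal and prefix queries for free, which is the angle the adaptive radix tree paper pushes much further.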
Lookups on hashes are also not O(1), for similar reasons. You have to hash the search string, then compare it against the key at whatever location it hashed to (usually a string comparison, which isn't O(1)), and, depending on the collision strategy, do more work if it doesn't match but isn't an empty slot.
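Those steps can be made explicit. This sketch uses linear probing and a toy polynomial string hash (both assumptions for illustration, not any real library's scheme); note that every step before the return touches O(k) characters, where k is the key length:

```javascript
// Toy open-addressing hash table with linear probing.
// Assumes the table never fills up (fine for this tiny demo).
const SIZE = 16;
const slots = new Array(SIZE).fill(null);

function hashString(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) { // O(k): reads every character
    h = (h * 31 + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

function set(key, value) {
  let i = hashString(key) % SIZE;
  while (slots[i] !== null && slots[i].key !== key) i = (i + 1) % SIZE;
  slots[i] = { key, value };
}

function get(key) {
  let i = hashString(key) % SIZE;             // step 1: hash the key, O(k)
  while (slots[i] !== null) {
    if (slots[i].key === key) {               // step 2: string compare, O(k)
      return slots[i].value;
    }
    i = (i + 1) % SIZE;                       // step 3: probe on collision
  }
  return undefined;                           // hit an empty slot: not present
}

set('alpha', 1);
set('beta', 2);
```

Even on the happy path with zero collisions, the hash and the confirming comparison each scan the full key.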
Hash table lookups and insertions take time O(1), when talking about number of items already in the hash table - and that 'when talking about' is implicit and doesn't have to be said as anyone talking about the complexity of a hash table knows that, or would state otherwise as it would be an exception.
Talking about the length of strings used as keys in the hash table is therefore nonsense when we have already agreed that we're talking about the number of elements in the hash table, as length of the strings isn't a parameter in that function.
And they never account for growth -- yeah, it's amortised, isn't it? That's exactly the kind of thing we want to capture when we do an O() analysis.
I mean what you are proposing would be an interesting study - the complexity of hash tables parameterised for all sorts of different properties such as the size increase function or the load factor or whatever, but it's not what anyone means when they talk about complexity of a hash table unless they state otherwise.
So nobody's getting it wrong or making a mistake. They're using some assumptions that everyone's already agreed on and know about. If you want to talk about the complexity of something else you need to clarify.
I'm fairly positive that a unit of measure should _never_ be variable, otherwise it's fairly pointless. And if you don't think hashing a 1TB string takes significantly longer than hashing a 100-byte string...
Furthermore, big O notation is supposed to be an upper bound, but that doesn't work if it's not actually an upper bound. If you were to tell your bosses something took constant time, and it took an hour for string A (1TB) and 100ms for string B (10B), I'm pretty sure your opinion wouldn't mean much after that.
The hashmap should absolutely be analyzed in terms of the hash function, because it's the longest-running part of the algorithm. To do otherwise is disingenuous.
Using that logic, I could call any iterative algorithm O(1) since the top level function only gets called once.
However, this is absolutely dominated by the complexity of the hashing function, growth (which can be amortized, but is not O(1)), deletion (also not O(1)), and comparison functions (which are usually O(mn) or O(n), or something similar, depending).
We end up measuring the cheapest parts of hashing and pretending the expensive operations, which aren't O(1), don't exist. This is wrong and dishonest when analyzing algorithms. We know, for example, that string comparison is not O(1), and it's part of most hash table algorithms, and yet it magically disappears when analyzing hash table complexity, and everything is somehow supposed to be considered a mem+offset, which is stupid.
Hash-tables also usually have aggressive memory growth behaviour which nobody ever pays attention to. Of course there are versions which don't do this and/or have fixed memory sizes, but the kind you find in most languages grow O(n^2) or similar. And this is also ignored: we pretend it doesn't happen, and resizing magically becomes part of the "1" in O(1), even though copying arrays isn't free, and as resizes happen the arrays get bigger, so the big-O should get worse.
Hash-table operations can be pretty expensive, and faulty big-O analysis doesn't help. So sure: for a static, fixed-size hash table, with no growth, instantaneous hash functions that use temporal oracles, and no collisions to deal with, O(1) is correct. But such hash tables pretty much don't exist in nature, and most programmers won't be working with them. The standard libraries for most languages sure as hell don't implement such ideal hash tables.
O(n) is at least an honest approximation. I honestly have never seen a well-considered analysis of hash function complexity, but I know it grows super-linearly just from using them a lot. This kind of fanciful analysis doesn't help anybody.
So yes, O(something), where something has a unit. But the unit sure as heck isn't whatever "1" is supposed to represent. It's probably closer to string length or array length (or some combination of the two), but a single "add" it most definitely is not.
When strings are immutable, the hash values can be cached, so the computation is amortized (this is how it works in Java).
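The caching trick can be sketched like this (a hypothetical wrapper class for illustration; Java's actual String.hashCode stores the computed value in a field in the same spirit):

```javascript
// Immutable-string wrapper that computes its hash once and caches it.
class HashedString {
  constructor(s) {
    this.s = s;
    this.cachedHash = null;    // not computed yet
    this.hashComputations = 0; // just for demonstration
  }
  hashCode() {
    if (this.cachedHash === null) {
      this.hashComputations++;
      let h = 0;
      for (let i = 0; i < this.s.length; i++) {
        h = (h * 31 + this.s.charCodeAt(i)) >>> 0;
      }
      this.cachedHash = h; // safe to cache because this.s never changes
    }
    return this.cachedHash;
  }
}

const key = new HashedString('some long key');
key.hashCode();
key.hashCode(); // second call is O(1): the O(k) scan happened only once
```

The O(k) cost still has to be paid once per distinct string, so this amortizes repeated lookups with the same key object rather than eliminating the hashing cost.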
Consider that the universe enforces both a maximum information density per cubic meter, and a maximum speed at accessing a given physical location.
The n in operation time on data structures usually refers to the number of members in the data structure. String length would be a different variable.
For example worst case for finding a string member in a linked list would be O(n * k) where n is the elements in the linked list and k is the string length. I.e. the worst case here assumes lots of members with shared prefixes.
So the O(1) in hash tables actually is O(1 * k).
This distinction often is important because the length of strings and the number of members are independent variables.
And for many use-cases (e.g. member names in classes or text-protocol keywords) k is essentially fixed, e.g. a maximum member name length of 255 characters or a similar limitation. And for big O notation that means O(1 * 1), since it's a constant.
And decreased memory locality (more cache misses) and increased number of allocations.
Best of all, I expect sincere discussion of the merits of the argument in this thread. Well executed, and heh heh.
Which is not necessarily a good thing. PHP's array and JS's Object are essentially the same thing.
Plus all the metatables goodies: weak key and/or value references, prototype inheritance, ...
That sounds... completely normal? Python dictionaries will do that too. So will Java HashMaps.
In those languages, you'll probably need to use a data structure well known for its simplicity, efficiency, and broad cross-language-platform compatibility: The Resource Description Framework's graph model. Graphs are also a well-known candidate for "universal data structure" [2]. Also, it's semantic, which is a clear advantage over the entirely not semantic UniversalHash, because semantic is better than not semantic when it comes to universal semantics of semanticness.
Semantic.
Otherwise, the article is pretty solid and all serious developers should definitely consider implementing it forthwith, as they will find it a very, very educational experience. I know I've seen people implement this model in Java before, and I can certainly vouch for the fact that it was an education for all involved.
I'm going to say "semantic" one more time, because it's been my observation that the more you say it, the smarter you sound, so: semantic. Oh, yeah, that's the stuff. Mmmmmmmmm.
[1]: https://wiki.haskell.org/Tying_the_Knot
[2]: Seriously, if you really do want to discuss the "universal data structure", perhaps because you want to analyze data structures very generically for some mathematical reason, graphs really are a good candidate. Not necessarily RDF graphs, which are both bizarrely complicated while lacking simple features; the three blessed collection types are a weird combination of features.
(By the by... you know you can trust me, because... semantic. Semantic.)
[1] http://steve-yegge.blogspot.com/2008/10/universal-design-pat...
"God-given axioms, which is academic lingo for a truth we accept purely by our God-given logic, like: if something is not true then it is false, something cannot exist in an empty set, something logical is logical."
But then I saw this and started laughing:
"Remember when everyone used tables to lay out their HTML? Well, that proved to be a horrible way to do things, because tables are inherently inflexible. It’s a strictly geometric constraint. Now we all use divs and CSS, because we get a much-more flexible CSS engine to define our layout. Postgres and her SQL friends are all table based, just like the <table> of 1999. Don’t be 1999."
Many axioms may be picked because they seem 'obviously true' (or, more likely, because they are useful), but that doesn't make their truth simple or make them the result of logic. (For an example, take a look at the axiom of infinity.)
Additionally, the 'axioms' he lists are all what I would generally consider tautologies. (Although you might argue that the first one is actually an axiom of bivalent logic systems.)
"A hash is simple. A hash is fast. A hash is all you need to start with".
I can think of plenty of good reasons to stop using a map/hash/associative array in code, but I can't think of very many good reasons not to start coding with associative arrays as your default data structure. If there's a performance/memory problem, fix it later. I've seen a lot more code suffer from premature optimization than I've seen suffer from using data structures that were a little too inefficient.
Using hashes as a first-choice data structure is not necessarily a bad idea.[1] Until profiling a working implementation demonstrates otherwise, other data structures may be premature optimization.
[1] Clearly an improvement over the Lisper's association lists.
So, really, that's not a different universal data structure.
Not in this case. An ordered map implemented as a tree has the same interface but with very different operation complexities
One might even call them "functions", but that ruins the joke.
Hash tables are cool but they are far from being the only thing you need.
The problem with this kind of subtle satire without a disclaimer at the end is that some people will fall for it and blindly follow it (collisions weren't mentioned even once, nor were cache misses).
"Unlike most academic work that has little to no practical implications, I think blindly following the stuff here will prove to be incalculably beneficial for you."
When you go from "Objects are super easy and useful in this language" to "Everything Is An Object" you basically doom yourself to using objects to implement a bunch of stuff that doesn't really make sense as objects and could be implemented much easier as another data structure.
Big-brained academics love the challenge of "ooh, can I make everything an object?" because they are always free to decrease the scope of their research a little to compensate for the implementation taking a long time. And the more phenomena you can contort into agreement with your thesis, the more scholarpoints you get.
Blow advocates "data-driven programming" which, as a rule of thumb, I translate in my head as "don't move anything around you don't have to."
For example, rather than just copying a giant array of JSON objects over the wire when you only need some image URLs, each with an array of timestamped strings, you write the code that serializes exactly that data. And if you do that a few times, write the tooling you need to make that kind of thing easy.
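A small sketch of that idea (the field names and data shape here are made up for illustration): instead of shipping whole records, pull out just the fields the receiver needs before serializing.

```javascript
// Hypothetical source records: large objects, of which we only need a little.
const records = [
  {
    imageUrl: 'http://example.com/a.png',
    comments: [{ ts: 1700000000, text: 'nice' }],
    // ...imagine many other fields we don't need on the wire...
    rawPixels: new Array(1000).fill(0),
  },
];

// Serialize only the image URL and its timestamped strings.
function serializeForWire(records) {
  return JSON.stringify(
    records.map(r => ({
      url: r.imageUrl,
      notes: r.comments.map(c => [c.ts, c.text]),
    }))
  );
}

const wire = serializeForWire(records);
```

The payload carries exactly what the consumer asked for and nothing else, which is the "don't move anything around you don't have to" rule in miniature.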
The pitch is that it's not more work. And I'm kind of convinced. It just gets rid of so much pollution when you are debugging.
Your first cut of things is often a little weird: "do I need a generator generator here?" but typically you realize that a simpler solution works just as well in the refactor.
When you hack in a "wrong but easier to sketch out" solution into your code as the first try, it often just lives like that forever. Correct, confusing code often collapses into correct, simple code. Simple, functional-but-wrong code just seems less inclined to self improvement.
And I am continually surprised by how many problems, when simplified down as much as possible, are best described with the basics: functions, structs, arrays. You need fancy stuff sometimes for sure, but most of our human problems are trivial enough that these blunt tools suffice. I just often won't be able to see it until I've renamed all the variables three times.
What's interesting is I've been doing JavaScript programming this way, and Jonathan Blow is... shall I say... not a fan of JS. But I think the concepts translate pretty well! It's just instead of targeting raw memory, you target the DOM/js runtime which is actually a pretty powerful piece of metal if you have the patience to learn about the runtime and keep thinking about your data every step of the way.
> Hashes are always O(1) reads, inserts and writes.
To realise that this was a joke.
*or was, I am not sure about current state
Let's replace main memory with hash maps and then implement existing data structures on that.
Yes. This is satire.
I would agree - didn't realize it at first, but it does sound a bit too OTT on second thoughts...
The major historical selling point of object-oriented programming (OOP) to the Forces of Evil -- not all business people are "Forces of Evil"; really, there are some great business people out there, so I'm referring specifically to the cost-cutting mini-Eichmanns who've attempted to commoditize, humiliate, and infantilize us with "user scrum stories" and a culture of mediocrity -- was that OOP (after diverging far from Alan Kay's original vision) would allow the Forces of Evil to replace high-cost experts with teams of mediocre programmers, and thereby shift the balance of power. Culturally, it worked (the culture of mediocrity is well-established in the software industry); economically, it failed (because large teams of mediocrities are actually fucking expensive, given that a "10x" engineer only costs about 1.5-2.5x an average one).
The sales pitch for OOP to the Forces of Evil was that OOP would make it possible to hire a couple of low-paid body-shop programmers too stupid to recognize the OP as either (a) satire or (b), missing the joke but still being correct, just wrong. Smart wizards in the open-source world and at companies like Google would do the actual engineering that made efficient hash-maps possible, and CommodityScrumDrones would staple shit together using topologies thereof, without really understanding any of the technologies they were gluing together, and probably ignorant of why these things are sometimes called "hashes" in the first place.
The problem is that when CommodityScrumDrones grow up and become middle managers and get to start making technical choices, they often make bad ones. They reject PostgreSQL as too hard or too old and use some NoSQL JSON storage engine that was a "Show HN" project 17 days ago.
Even though the CommodityScrumProgrammer phenomenon has been a massive failure in economic terms -- the Forces of Evil have won on ego terms by utterly dominating their targets, but they've lost money -- it has been a cultural success that has inflicted mediocrity, anti-intellectualism, and subordination to "The Business" on the software industry. And now we have people calling important technical shots who have literally no idea why the OP is either satire or wrong.
This is relevant: http://www.smashcompany.com/technology/object-oriented-progr...