undefined | Better HN

0 pointsAloisius14y ago0 comments

Which underlying algorithm?

The search engine was built on a ternary tree with a custom merging algorithm. I honestly don't know if the merging algorithm has a name as it was something I came up with (literally) while sleeping one night. Because we mostly used ID3 tags and file names, it was completely unnecessary to use a stemming algorithm because if you typed in a misspelled search, given the size of our index, there was probably someone who tagged their file using the same misspelling.

The network biasing code used BGP data combined from a number of looking glass servers to build a map of ip/prefix -> ASN number and ASN->ASN distances. It was then used to reorder search results based on network distance to users so they would bias towards their own networks and save ISPs money and speed transfers on broadband connections.

Servers were linked through a fully meshed network. Each had presence information about every user on the network so that they could route IMs around. The chat system was semi-linked (fully linked on some servers, but we couldn't fully link the whole thing because the client had no administrative functions for chat). If we couldn't send the user back enough results for a search, the query was simply passed around the backend.

The whole thing was written in C++. At its peak, there were about 2.3 million users online at any given time (80 million total users growing by a million every 4 days). The system would be indexing about 17.6 million files per second (and de-indexing about the same amount). The whole system pushed out about 2 Gbps of bandwidth in search results (which were tiny).

Napster was one of the very first services to push past 10K connections on a Linux machine. At peak, I could get over 100K users on a single process (though I'd run out of memory indexing files on the tiny 2 GB machines and blow out the NIC sending search results). During normal operations, each server process had around 40K users on it and between 7-12 million files indexed.

There were a bunch of side infrastructure things no one saw. Court mandated copyright filtering systems, recommender systems (most for play), load balancing servers, bot detection and sequestration systems, analytics reporting jobs, etc.

Nowadays, I could probably fit all of Napster on one big machine. Heh.

0 comments

4 comments · 2 top-level

Mizza14y ago· 2 in thread

Wow, that's incredibly fascinating! Thanks for posting such a great reply! You built something that really changed the world, in my opinion, and you should be very proud.

I'm particularly pleased by the use of ASN/distancing weights in your results! None of the BT trackers I've hacked around on have had anything like that in them.

Where is the code now? It should be in a museum.

AloisiusOP14y ago

No problem! It occurred to me that I've never really talked about the Napster architecture or algorithms publicly.

I once hacked up Transmission to re-prioritize based on network distance. It worked rather well when it discovered over the DHT. It is a lot easier when you only have to calculate distance between you and other networks. Storing the graph of distances between two arbitrary users is harder.

The code was part of the assets that were bought out of bankruptcy by Roxio (who renamed themselves Napster). I doubt it is being used for anything.

tuna14y ago

you should definitely publish this kind of things somewhere. it was big way before the whole big data stuff around.

cheers.

prawn14y ago

Loved this:

"...it was completely unnecessary to use a stemming algorithm because if you typed in a misspelled search, given the size of our index, there was probably someone who tagged their file using the same misspelling."

j / k navigate · click thread line to collapse