The bulk of the work we do is building an inverted index. Oversimplified, it is equivalent to this:

    import json
    from collections import defaultdict

    inverted_index = defaultdict(list)
    for (doc_id, doc_json) in enumerate(doc_jsons):
        doc = json.loads(doc_json)
        for (field, field_text) in doc.items():
            # Whitespace splitting stands in for a real tokenizer here.
            for (position, token) in enumerate(field_text.split()):
                inverted_index[token].append((doc_id, position))
    serialize_in_compressed_way_that_allows_lookup(inverted_index)

You can implement it in a couple of hours in the language of your choice to get a proper baseline.
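To make that baseline concrete, here is a minimal runnable sketch in Python. It is not how tantivy does it. It assumes every field value is a string, uses whitespace splitting in place of a real tokenizer, and fills in the serialization step with one simple possibility: delta-encoded doc ids plus variable-length integers, with a term dictionary pointing into a single byte blob. The names `build_index`, `serialize`, `lookup`, `encode_varint` and `decode_varints` are made up for this example.

```python
import json
from collections import defaultdict


def encode_varint(n):
    """Encode a non-negative int using 7 bits per byte (high bit = 'more follows')."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(buf):
    """Decode a byte string holding back-to-back varints into a list of ints."""
    values, current, shift = [], 0, 0
    for byte in buf:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def build_index(doc_jsons):
    """Map each token to its list of (doc_id, position) pairs."""
    inverted_index = defaultdict(list)
    for doc_id, doc_json in enumerate(doc_jsons):
        doc = json.loads(doc_json)
        for field, field_text in doc.items():
            # Assumption: whitespace splitting stands in for a tokenizer.
            for position, token in enumerate(field_text.split()):
                inverted_index[token].append((doc_id, position))
    return inverted_index


def serialize(inverted_index):
    """Return a term dictionary {token: (offset, length)} and one bytes blob."""
    term_dict, blob = {}, bytearray()
    for token in sorted(inverted_index):
        start, prev_doc = len(blob), 0
        for doc_id, position in inverted_index[token]:
            # Doc ids only grow within a posting list, so deltas stay small.
            blob += encode_varint(doc_id - prev_doc)
            blob += encode_varint(position)
            prev_doc = doc_id
        term_dict[token] = (start, len(blob) - start)
    return term_dict, bytes(blob)


def lookup(term_dict, blob, token):
    """Decode only the slice of the blob that belongs to `token`."""
    if token not in term_dict:
        return []
    start, length = term_dict[token]
    flat = decode_varints(blob[start:start + length])
    postings, doc_id = [], 0
    for delta, position in zip(flat[0::2], flat[1::2]):
        doc_id += delta
        postings.append((doc_id, position))
    return postings
```

Calling `serialize(build_index(docs))` on a list of JSON strings gives back the term dictionary and the blob, and `lookup` then reads a single posting list without touching the rest. A real engine would also keep the term dictionary itself in a compact on-disk structure, which this sketch skips.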
I am sure we can still improve our indexing throughput... but I have never seen any search engine indexing as fast as tantivy.
If someone knows a project I should know of, I'd be genuinely keen on learning from it.