Eg you can’t just do equality on collision, you need to do equality on every lookup at least once.
> you need to do equality on every lookup at least once.
I still don't think so. You only need for when the split would correspond to a good partial match.
There are also other smarter hashing thing you can do to help convince you.
For example if hash( s ) = sum_over_i( g(s[i]) ) if you pick g(x) = x or g(x) = xx or g(x)=xxx (or any function you are free to choose) (it's kind a like matching statistical moments of words).
You can calculate this hashes in amortized O(1) because of sliding window trick.
You can also index on split length.
You can filter your set of candidate split positions so much that you only have very probably valid split positions in your candidate set, and therefore the complexity is just verifying the validity of the candidates splits, so still O(n
(number of candidates after filter) ).> > Part 2
A solution based on filtering a candidate set could also probably be made to work for multiple words. You use the matching statistics to get candidates starting positions and length for words of the dictionary that appear in the string and you verify if they touch each other. Pathological cases are when words overlap each other but this may not appear very often in practice.
You can also add some filter based on pairs of words (or triplets..., or n-grams) .
All in all, I think the problem is rather bad because smart candidates will often have a solution that the interviewer may not be able to comprehend, and as a candidate you have just put yourself in a possible ego battle with the interviewer were even explaining the proof is not easy. In fact if the alternative solution is more interesting than the rest of the problem, you might get your candidate be focusing on that instead of your simpler but boring trie based solution.
My initial hunch was an example where every prefix of the word is part of the dictionary, then on an example like `abcdefgh` you'd test `a`, `ab,` `abc,`... But on such a case you can be clever and defer the string match until you verify that the hash of the suffix also exists in the dictionary, which it won't.
So it would seem a pathological example needs both prefix and suffix matches, but this just means that it's very easy for us to find a valid segmentation, so we will find one very early on.
I guess this is probably what you meant by "split would correspond to a good partial match" though.