The site implements an API that describes what data it needs to perform the check; the standard would be to accept hashes of the data.
There are then sites that provide a UI over the API. The user points the UI at the API URL (the UI can also allow it to be specified in its own URL so it can be linked to); the UI performs the hashing client side and makes requests to the API.
The worst the people providing the API can get is hashes, and people can check the source of the UI to verify it isn't siphoning off data.
Because the UI is decoupled from the data leak, there is less code to check.
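To make the idea above concrete, here is a rough sketch of what the client-side part might look like. Everything in it is an assumption for illustration: the SHA-256 choice, the trim-and-lowercase normalization, the query-string format, and the names `hash_field` and `build_check_url` are all made up, not any real site's API. It only builds the request URL and never sends the raw value anywhere.

```python
import hashlib
from urllib.parse import urlencode


def normalize(value: str) -> str:
    # Assumed normalization: trim whitespace and lowercase, so that
    # " Alice@Example.com " and "alice@example.com" hash identically.
    return value.strip().lower()


def hash_field(value: str) -> str:
    # Hash on the client so only the hex digest ever leaves the machine.
    return hashlib.sha256(normalize(value).encode("utf-8")).hexdigest()


def build_check_url(api_url: str, fields: dict) -> str:
    # Hypothetical request format: one query parameter per field,
    # each carrying the hash instead of the raw value.
    hashed = {name: hash_field(raw) for name, raw in fields.items()}
    return f"{api_url}?{urlencode(hashed)}"


if __name__ == "__main__":
    # "example.invalid" is a placeholder host, not a real service.
    print(build_check_url("https://example.invalid/check",
                          {"email": "alice@example.com"}))
```

Since the hashing lives in a few lines like these, the part people have to audit stays small, which is the point of keeping the UI separate from the API.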
A client-side, Bloom-filter-based solution would be nice IMHO. You would get either a definitive "No, your data wasn't leaked" or a "Your data was very likely (xx% probability) leaked."
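For anyone curious how that would work, below is a minimal Bloom filter sketch. The sizes, the SHA-256-based double hashing, and the names `BloomFilter`, `might_contain`, and `false_positive_rate` are illustrative assumptions, not taken from any existing checker. The site would build the filter from the dump and ship only the bit array; the lookup then runs entirely on the client.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)
        self.count = 0  # number of items added, used for the estimate below

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves
        # of a single SHA-256 digest (an assumed, common construction).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in the set"; True means "probably".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def false_positive_rate(self) -> float:
        # Standard approximation: (1 - e^(-k*n/m))^k
        k, n, m = self.num_hashes, self.count, self.num_bits
        return (1 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    leaked = BloomFilter(num_bits=1024, num_hashes=3)
    leaked.add("alice@example.com")
    print(leaked.might_contain("alice@example.com"))  # True: no false negatives
    print(leaked.false_positive_rate())
```

A `False` from `might_contain` is the definitive "not leaked" answer. A `True` can be a false positive, and the "xx%" shown to the user would have to be derived from something like `false_positive_rate()`.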
This all still doesn't help non-technical people decide whether a site can be trusted though :)
Interesting decision. Sure, the dump is publicly available, but this is much more accessible.
"For now, we have censored the last two digits of the phone numbers in order to minimize spam and abuse. Feel free to contact us to ask for the uncensored database. Under certain circumstances, we may agree to release it."
What about SheepCheck? It does not sound right, but then again, which other internet slang term does? ;)