Search By Hash?

- 1 answer

I had the idea of a search engine that would index web items the way current search engines do, but would store only each file's title, URL, and a hash of its contents.

That would make it easy to locate an item on the web when you already have a copy but don't know where it came from, or when you want to know every place it appears.

It would be most useful for non-textual items such as images, executables, and archives.
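To make the idea concrete, here is a minimal sketch of such an index, kept in memory. The `HashIndex` class, its method names, the choice of SHA-256, and the example URLs are all assumptions for illustration, not an existing service or API. Note that an exact hash like this only matches byte-identical copies.

```python
import hashlib
from collections import defaultdict


class HashIndex:
    """Toy in-memory index: content hash -> places the content was seen.

    Hypothetical class for illustration; only the title, URL, and hash
    are stored, never the content itself.
    """

    def __init__(self):
        # hex digest -> list of (title, url) pairs
        self._entries = defaultdict(list)

    @staticmethod
    def digest(content: bytes) -> str:
        # SHA-256 is an assumed choice; any cryptographic hash would do.
        return hashlib.sha256(content).hexdigest()

    def add(self, title: str, url: str, content: bytes) -> None:
        self._entries[self.digest(content)].append((title, url))

    def lookup(self, content: bytes) -> list:
        # Return a copy so callers cannot mutate the index.
        return list(self._entries.get(self.digest(content), []))


index = HashIndex()
image = b"\x89PNG pretend these are image bytes"
index.add("Logo", "http://example.com/logo.png", image)
index.add("Logo (mirror)", "http://example.org/img/logo.png", image)
index.add("Archive", "http://example.net/files.zip", b"unrelated bytes")

# Find every place the file we already have appears.
matches = index.lookup(image)
for title, url in matches:
    print(title, url)
```

A lookup is a single dictionary access, so the cost is dominated by hashing the file you already have.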

I was wondering if there is already something similar?

Answer

Check out the Wikipedia page on locality-sensitive hashing. There's also a good page hosted by a researcher at MIT.

In general, there are several flavors available: hashes for strings (such as simhash), sets or 0/1 features (such as min-wise hashes), and for real vectors.
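As a rough illustration of the set flavor, here is a small min-wise hashing sketch. The helper names (`_hash_with_seed`, `minhash_signature`, `estimate_jaccard`) and the use of seeded BLAKE2b as the hash family are assumptions made for this example, not a reference implementation. The fraction of signature positions where two sets agree estimates their Jaccard similarity.

```python
import hashlib


def _hash_with_seed(seed: int, item: str) -> int:
    # Simulate a family of hash functions by prefixing a seed (assumed scheme).
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes: int = 128) -> list:
    # For each hash function, keep the minimum hash value over the set.
    # `items` must be non-empty, otherwise min() raises ValueError.
    return [min(_hash_with_seed(seed, x) for x in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    # Fraction of positions where the two signatures agree.
    agree = sum(a == b for a, b in zip(sig_a, sig_b))
    return agree / len(sig_a)


a = {"apple", "banana", "cherry", "date", "elderberry"}
b = {"apple", "banana", "cherry", "date", "fig"}
c = {"xylophone", "yak", "zebra"}

sig_a, sig_b, sig_c = (minhash_signature(s) for s in (a, b, c))

# True Jaccard similarity of a and b is 4/6; the estimate should be near it.
print(estimate_jaccard(sig_a, sig_b))
print(estimate_jaccard(sig_a, sig_c))
```

More hash functions give a tighter estimate at the cost of a longer signature.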

So far, the main trick for numerical hashes is basically dimension reduction. For strings, the idea is to come up with a representation that's robust in the face of minor edits.
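For the string case, a simhash-style fingerprint can be sketched as below. This is a simplified variant under stated assumptions: character 3-grams as features, unweighted votes, and BLAKE2b as the per-feature hash. The function names are made up for the example. Similar strings should end up with fingerprints that differ in only a few bits.

```python
import hashlib


def simhash(text: str, bits: int = 64, shingle: int = 3) -> int:
    # `bits` is assumed to be a multiple of 8 and at most 512 (BLAKE2b limit).
    # Features are overlapping character n-grams (an assumed choice).
    features = [text[i:i + shingle]
                for i in range(max(1, len(text) - shingle + 1))]
    votes = [0] * bits
    for feat in features:
        digest = hashlib.blake2b(feat.encode("utf-8"),
                                 digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps the sign of each vote total.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    # Number of bit positions where the two fingerprints differ.
    return bin(a ^ b).count("1")


original = ("The quick brown fox jumps over the lazy dog while the cat "
            "sleeps quietly on the warm windowsill all afternoon.")
edited = original.replace("lazy", "lazi")
unrelated = ("Locality sensitive hashing maps nearby points to the same "
             "bucket with high probability, unlike cryptographic hashes.")

near = hamming(simhash(original), simhash(edited))
far = hamming(simhash(original), simhash(unrelated))
print(near, far)
```

A small edit changes only a few features, so most vote totals keep their sign, whereas a cryptographic hash would change completely.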

I'm also doing a little research in this field, although I guess Stack Overflow might not be the right place for nascent work.

source: stackoverflow.com