SharePoint - SharePoint 2010 Search: When is a document a duplicate?

I will post my little bit of research as an answer (although it is not a real answer).

I followed the link provided by @AnitaBoerboom, and I believe it applies to the 2010 version as well.

Quote:

How does the duplicate document is identified when we do a search?

Document similarity for purposes of identifying duplicates is based only on a hash of the content of the document. No File properties (e.g. file name, type, author, create and modify dates) are input to this hash. The MSSDuplicateHashes table in the SSP’s search database holds, for each document, all the 64bit hashes necessary to determine if one document is a near-duplicate of another. This is read while doing a search if duplicate collapsing is enabled.

This probably explains @MikeOryszak's strange issue. He uploaded 250 documents with the same content, so search sees just 1 document and 249 duplicates.

The 64-bit hash is what puzzles me. The hash is computed when the document is crawled during a full crawl.

After reading a few documents from the SharePoint Back-End Protocols and exploring the SQL stored procedures on my local SharePoint, I found this: Duplicate Identifier Block

And it led me to 'something that is completely different' (or not):

Finding Duplicate Documents in SharePoint using PowerShell

This script uses the MD5 Message-Digest Algorithm to determine whether two files are the same. That was also my first hunch after reading this question.

So, without any hard evidence, IMHO this is the procedure that determines whether two files are exact duplicates.

So the chance that 2 documents with different content have the same hash is practically nil, and arguably lower than the chance that 2 items in a single list get the same GUID.

Disclaimer: I am really not an expert in this field, so don't take my findings for granted. I would like to hear from someone with the 'right knowledge'.
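To make the idea concrete, here is a minimal Python sketch of exact-duplicate detection by content hash. It is not SharePoint's actual algorithm (the quote above talks about several 64-bit hashes and near-duplicates), and it is not the linked PowerShell script either. It only shows the MD5 approach I am guessing at. The file names and contents are made up for illustration.

```python
import hashlib
from collections import defaultdict


def content_md5(data: bytes) -> str:
    """Return the MD5 hex digest of the raw document bytes only."""
    return hashlib.md5(data).hexdigest()


def group_duplicates(documents: dict[str, bytes]) -> list[list[str]]:
    """Group document names whose content hashes match.

    `documents` maps a (hypothetical) file name to its content. The name
    never enters the hash, so renamed copies still land in one group.
    """
    by_hash = defaultdict(list)
    for name, data in documents.items():
        by_hash[content_md5(data)].append(name)
    # Keep only hashes shared by more than one document.
    return [sorted(names) for names in by_hash.values() if len(names) > 1]


# Hypothetical sample data: two identical files and one edited file.
docs = {
    "report.docx": b"quarterly numbers",
    "copy of report.docx": b"quarterly numbers",      # same content, new name
    "report-v2.docx": b"quarterly numbers, revised",  # edited content
}
duplicate_groups = group_duplicates(docs)
print(duplicate_groups)
```

The two files with identical bytes end up in one group regardless of their names, while the edited file hashes differently and stays on its own.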
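A rough way to put a number on this is the birthday bound, p ≈ n(n-1) / 2^(b+1) for n documents and a b-bit hash. The sketch below assumes the hash output is uniformly distributed and only covers accidental collisions, not deliberately crafted ones. The 250 comes from the test mentioned above, 64 bits from the quoted blog post, and 128 bits is the size of an MD5 digest.

```python
from fractions import Fraction


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Birthday-bound estimate: p ~ n(n-1) / 2^(bits+1).

    Assumes a uniformly distributed hash; this is an approximation that
    holds while the result is much smaller than 1.
    """
    pairs = Fraction(num_items * (num_items - 1), 2)
    return float(pairs / 2**hash_bits)


p64 = collision_probability(250, 64)    # hash size from the quoted post
p128 = collision_probability(250, 128)  # size of an MD5 digest
print(f"64-bit hash, 250 docs:  {p64:.3e}")
print(f"128-bit hash, 250 docs: {p128:.3e}")
```

Both values come out many orders of magnitude below anything you would hit in practice, which is why a matching hash can be treated as matching content.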


Per the blog post for 2007, it uses a hash of the entire document content and ignores the document properties. I ran some tests today, outlined in my comment above, and all of the documents showed up as duplicates, even across site collections. Any documents that were changed afterwards showed up as unique documents after another crawl.

Tags:

Search