On our ACS we have about 1 million files, with more to come. It is a community-sharing collection of books, articles, videos, audio files, etc.
Now I want to make sure that when an upload is made (even a bulk upload), no already existing file gets added to the repository again.
My idea is to solve this with file fingerprints, e.g. MD5 hashes.
So I added an MD5 property, which is set automatically in an input folder whenever I upload new files in bulk.
During upload, the process has to look for an existing file in the repository with the same MD5 fingerprint; if one is found, it should refuse to add the file to the repository again.
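For illustration, computing the fingerprint per file is straightforward in plain Java (a minimal sketch, independent of whatever mechanism actually sets the property in Alfresco; the class and method names are mine):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class Md5Util {
    // Compute the MD5 fingerprint of a file, streaming the content so
    // large videos/audio files do not have to fit into memory.
    public static String md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return HexFormat.of().formatHex(md.digest()); // lowercase hex (Java 17+)
    }
}
```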
My programmer is concerned that this will not work as intended, because the Solr index is built with a time delay, long after a document has been uploaded.
So my solution is to bypass ACS and Solr with a separate SQL table and direct SQL commands. The table would have just one column, MD5; there is no need to know which file a hash belongs to. I would simply insert each MD5 there and then look up, via the MySQL index (without Solr), whether a given MD5 already exists. This would work without any time lag.
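To make the intent concrete, here is the kind of table and check I have in mind (a sketch only; the table name upload_md5 and the connection details are placeholders). Inserting first and catching the duplicate-key error keeps the check atomic even when several bulk uploads run in parallel:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class Md5Registry {
    // One-time setup (MySQL):
    //   CREATE TABLE upload_md5 (
    //       md5 CHAR(32) NOT NULL,
    //       PRIMARY KEY (md5)
    //   );
    // The primary key gives both the fast index lookup and the
    // uniqueness guarantee, so no separate SELECT is needed.

    // Returns true if the hash was new (file may be added),
    // false if it was already registered (refuse the upload).
    public static boolean registerIfNew(Connection con, String md5) throws Exception {
        try (PreparedStatement ps =
                con.prepareStatement("INSERT INTO upload_md5 (md5) VALUES (?)")) {
            ps.setString(1, md5);
            ps.executeUpdate();
            return true;  // insert succeeded -> MD5 not seen before
        } catch (SQLIntegrityConstraintViolationException dup) {
            return false; // primary key violation -> duplicate file
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/dedup", "user", "password")) {
            System.out.println(registerIfNew(con, "d41d8cd98f00b204e9800998ecf8427e"));
        }
    }
}
```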
Will this work, or does anyone have a better idea?
Perhaps this has already been solved with a plugin, as I can't see this feature as very exotic; everyone wants to avoid redundancy in their repository.