
Safely crawl all documents via webscript

Question asked by jaeni on Jan 20, 2016
Latest reply on Jan 20, 2016 by jaeni
What I am trying to do:

Find every node in the repository and get its file size. For each node, also fetch all of its versions and calculate the combined file size of the node and its versions.
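In other words, per node I want something like the sum below. This is only a minimal sketch, assuming the sizes are already known in bytes and that the version sizes are counted separately from the current content; `NodeSizes` and its parameters are made-up names for illustration, not Alfresco API.

```java
import java.util.List;

// Hypothetical helper, not part of Alfresco.
final class NodeSizes {

    // Adds the size of the current content to the sizes of all stored versions.
    // Assumption: every value is a size in bytes and the list holds no nulls.
    static long totalSize(long currentSize, List<Long> versionSizes) {
        long total = currentSize;
        for (long versionSize : versionSizes) {
            total += versionSize;
        }
        return total;
    }
}
```

For example, a node of 100 bytes with two older versions of 40 and 60 bytes would count as 200 bytes overall.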

How can I safely crawl every document in the repository?

The SearchService hits the result limit easily, and even if I increase the limit, the searches do not return results.

Traversing the repository recursively also seems to fill up the transactional caches (code and log output below):



private static void traverse(List<FileInfo> context) {
    for (FileInfo node : context) {
        if (node.isFolder()) {
            traverse(fileFolderService.list(node.getNodeRef()));
        }
        else {
            // it is a file: do stuff (e.g. read its size)
        }
    }
}



… :44,186  INFO  [solr.component.AsyncBuildSuggestComponent] [Suggestor-alfresco-1] Loaded suggester shingleBasedSuggestions, took 267411 ms
… :53,005  WARN  [cache.node.nodesTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.node.nodesTransactionalCache' is full (125000).
… :17,075  WARN  [cache.node.aspectsTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.node.aspectsTransactionalCache' is full (65000).
… :17,081  WARN  [cache.node.propertiesTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.node.propertiesTransactionalCache' is full (65000).
… :19,938  WARN  [alfresco.cache.contentUrlTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.contentUrlTransactionalCache' is full (65000).
… :19,991  WARN  [alfresco.cache.contentDataTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.contentDataTransactionalCache' is full (65000).
… :49,599  WARN  [org.alfresco.nodeOwnerTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.nodeOwnerTransactionalCache' is full (40000).
… :27,516  WARN  [cache.node.childByNameTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.node.childByNameTransactionalCache' is full (65000).



I understand it is an antipattern to grab everything at once, but I don't know of any service/API that lets me page the results into batches.
Please enlighten me :(
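To make the question concrete, here is a rough sketch of the kind of traversal I am after: an explicit stack instead of recursion, children fetched one page at a time, and files handed over in small batches so that each batch could be processed on its own. Everything in it is an assumption for illustration: `Entry`, `PagedLister` and `PagedCrawler` are made-up names, not Alfresco types, and what I am looking for is the real service that could stand behind `PagedLister`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a repository entry; not an Alfresco type.
final class Entry {
    final String id;
    final boolean folder;
    final long size; // content size in bytes, 0 for folders

    Entry(String id, boolean folder, long size) {
        this.id = id;
        this.folder = folder;
        this.size = size;
    }
}

// Hypothetical stand-in for a paged child listing.
interface PagedLister {
    // Returns at most maxItems children of folderId, starting at offset skipCount.
    List<Entry> listChildren(String folderId, int skipCount, int maxItems);
}

final class PagedCrawler {

    // Walks the tree below rootId without recursion and passes files to
    // batchHandler in lists of at most batchSize entries.
    static void crawl(PagedLister lister, String rootId, int pageSize,
                      int batchSize, Consumer<List<Entry>> batchHandler) {
        if (pageSize <= 0 || batchSize <= 0) {
            throw new IllegalArgumentException("pageSize and batchSize must be positive");
        }
        Deque<String> folders = new ArrayDeque<>();
        folders.push(rootId);
        List<Entry> batch = new ArrayList<>();

        while (!folders.isEmpty()) {
            String folderId = folders.pop();
            int skip = 0;
            List<Entry> page;
            do {
                page = lister.listChildren(folderId, skip, pageSize);
                for (Entry entry : page) {
                    if (entry.folder) {
                        folders.push(entry.id); // visit later instead of recursing
                    } else {
                        batch.add(entry);
                        if (batch.size() == batchSize) {
                            batchHandler.accept(batch);
                            batch = new ArrayList<>();
                        }
                    }
                }
                skip += page.size();
            } while (page.size() == pageSize); // a short page means the folder is done
        }
        if (!batch.isEmpty()) {
            batchHandler.accept(batch); // flush the last, partially filled batch
        }
    }
}
```

The idea would be that the handler does the per-file work (reading sizes, looking up versions) for one batch only, ideally in its own transaction, so that nothing accumulates across the whole crawl.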

Alfresco version: 5.0.c
