Hi,
this is an old product: it started on Community 2.1 and has reached Community 4.2.c. We could not upgrade further because our solution is too tightly coupled to the old Alfresco architecture.
We run it on a CentOS 5.6 VM (everything on a single machine) with 4 processors and 32 GB of memory. We have more than 2 million documents and 12 GB of Lucene indexes, serving around 450 users. The system is a bit sluggish, and one of the ways we are trying to speed it up is through the repository setup.
We have left most of the repository properties untouched, but I think we could get some improvement with better Lucene settings. To be clear, we don't use Solr; I don't know whether it would work better, either on the same machine or on a separate one.
We disabled content indexing in our content model, so I was surprised to find this in our properties:
lucene.indexer.contentIndexingEnabled=true
Can we safely set it to false?
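What we are thinking of putting in our properties, assuming this switch simply controls whether Lucene extracts and indexes document text (we have not verified that assumption):

```properties
# Our assumption: with content indexing already disabled in the content model,
# this should also stop Lucene from extracting and indexing document text.
lucene.indexer.contentIndexingEnabled=false
```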
Our documents have no more than 150 fields.
I'm also interested in stop words and analyzers; those could help.
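From what we can see in our own configuration, analyzers seem to be mapped per data type through a resource bundle. Our idea, untested, is to point that property at a custom bundle on the extension classpath (the bundle name below is our own placeholder):

```properties
# Current value from our file:
# lucene.defaultAnalyserResourceBundleName=alfresco/model/dataTypeAnalyzers
# Our (untested) idea: override it with a custom bundle:
lucene.defaultAnalyserResourceBundleName=alfresco/extension/customAnalyzers
```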
What about the query cache? Our users repeat some queries, but not many. Performance-wise, is it better to have it on or off?
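The only cache toggles we could find so far are these indexer caches (values copied from our current file); we are not sure whether any of them acts as a query-result cache proper, or whether that is something else entirely:

```properties
# Indexer caches from our current configuration -- unclear to us
# whether these have anything to do with caching query results.
lucene.indexer.cacheEnabled=true
lucene.indexer.maxDocIdCacheSize=100000
lucene.indexer.maxDocumentCacheSize=100
lucene.indexer.maxParentCacheSize=100000
lucene.indexer.maxPathCacheSize=100000
```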
Regarding the index files, we have one large segment of about 10 GB. Is there a way to tell Lucene to split it into smaller files? Maybe I'm wrong, but wouldn't it be easier for Lucene to work with smaller files?
And the merge parameters: how important are they during the day, while users are working, and how important at night, when the index maintenance happens?
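These are the merge-related settings we would experiment with. The current values are copied from our file; lowering `mergerMaxMergeDocs` is only our guess for how to end up with smaller segment files, and we have not verified it:

```properties
# Current values from our configuration:
lucene.indexer.mergerMaxMergeDocs=1000000
lucene.indexer.mergerMergeFactor=5
lucene.indexer.mergerRamBufferSizeMb=20
# Our (unverified) idea for smaller segment files:
# lucene.indexer.mergerMaxMergeDocs=100000
```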
Well, I hope you can help us; you would make us and our users very happy.
Here are our repository properties (some entries were removed due to post length; assume default values where needed):
# Repository configuration
repository.name=Main Repository

# Directory configuration
dir.root=/var/alfresco/data
dir.contentstore=${dir.root}/contentstore
dir.contentstore.deleted=${dir.root}/contentstore.deleted
# The location of cached content
dir.cachedcontent=${dir.root}/cachedcontent
dir.auditcontentstore=${dir.root}/audit.contentstore

# The value for the maximum permitted size in bytes of all content.
# No value (or a negative long) will be taken to mean that no limit should be applied.
# See content-services-context.xml
system.content.maximumFileSizeLimit=

# The location for lucene index files
dir.indexes=${dir.root}/lucene-indexes
# The location for index backups
dir.indexes.backup=${dir.root}/backup-lucene-indexes
# The location for lucene index locks
dir.indexes.lock=${dir.indexes}/locks

# Directory to find external license
dir.license.external=.
# Spring resource location of external license files
location.license.external=file://${dir.license.external}/*.lic
# Spring resource location of embedded license files
location.license.embedded=/WEB-INF/alfresco/license/*.lic
# Spring resource location of license files on shared classpath
location.license.shared=classpath*:/alfresco/extension/license/*.lic

# WebDAV initialization properties
system.webdav.servlet.enabled=true
system.webdav.url.path.prefix=
system.webdav.storeName=${protocols.storeName}
system.webdav.rootPath=${protocols.rootPath}
system.webdav.activities.enabled=true

# File name patterns that trigger rename shuffle detection
# pattern is used by move - tested against full path after it has been lower cased.
system.webdav.renameShufflePattern=(.*/\\..*)|(.*[a-f0-9]{8}+$)|(.*\\.tmp$)|(.*\\.wbk$)|(.*\\.bak$)|(.*\\~$)

# Is the JBPM Deploy Process Servlet enabled?
# Default is false. Should not be enabled in production environments as the
# servlet allows unauthenticated deployment of new workflows.
system.workflow.deployservlet.enabled=true

# Sets the location for the JBPM Configuration File
system.workflow.jbpm.config.location=classpath:org/alfresco/repo/workflow/jbpm/jbpm.cfg.xml

# Determines if JBPM workflow definitions are shown.
# Default is false. This controls the visibility of JBPM
# workflow definitions from the getDefinitions and
# getAllDefinitions WorkflowService API but still allows
# any in-flight JBPM workflows to be completed.
system.workflow.engine.jbpm.definitions.visible=true
# Determines if Activiti definitions are visible
system.workflow.engine.activiti.definitions.visible=true
# Determines if the JBPM engine is enabled
system.workflow.engine.jbpm.enabled=true
# Determines if the Activiti engine is enabled
system.workflow.engine.activiti.enabled=true

index.subsystem.name=lucene

# ######################################### #
# Index Recovery and Tracking Configuration #
# ######################################### #
#
# Recovery types are:
#  NONE:     Ignore
#  VALIDATE: Checks that the first and last transaction for each store is represented in the indexes
#  AUTO:     Validates and auto-recovers if validation fails
#  FULL:     Full index rebuild, processing all transactions in order. The server is temporarily suspended.
index.recovery.mode=AUTO
# FULL recovery continues when encountering errors
index.recovery.stopOnError=false
index.recovery.maximumPoolSize=5
# Set the frequency with which the index tracking is triggered.
# For more information on index tracking in a cluster:
#   http://wiki.alfresco.com/wiki/High_Availability_Configuration_V1.4_to_V2.1#Version_1.4.5.2C_2.1.1_and_later
# By default, this is effectively never, but can be modified as required.
#   Examples:
#      Never:                    * * * * * ? 2099
#      Once every five seconds:  0/5 * * * * ?
#      Once every two seconds:   0/2 * * * * ?
# See http://www.quartz-scheduler.org/docs/tutorials/crontrigger.html
index.tracking.cronExpression=0/5 * * * * ?
index.tracking.adm.cronExpression=${index.tracking.cronExpression}
index.tracking.avm.cronExpression=${index.tracking.cronExpression}
# Other properties.
index.tracking.maxTxnDurationMinutes=10
index.tracking.reindexLagMs=1000
index.tracking.maxRecordSetSize=1000
index.tracking.maxTransactionsPerLuceneCommit=100
index.tracking.disableInTransactionIndexing=false
# Index tracking information of a certain age is cleaned out by a scheduled job.
# Any clustered system that has been offline for longer than this period will need to be seeded
# with a more recent backup of the Lucene indexes or the indexes will have to be fully rebuilt.
# Use -1 to disable purging. This can be switched on at any stage.
index.tracking.minRecordPurgeAgeDays=30
# Unused transactions will be purged in chunks determined by commit time boundaries.
# 'index.tracking.purgeSize' specifies the size of the chunk (in ms). Default is a couple of hours.
index.tracking.purgeSize=7200000
# Reindexing of missing content is by default 'never' carried out.
# The cron expression below can be changed to control the timing of this reindexing.
# Users of Enterprise Alfresco can configure this cron expression via JMX without a server restart.
# Note that if alfresco.cluster.name is not set, then reindexing will not occur.
index.reindexMissingContent.cronExpression=* * * * * ? 2099

# Change the failure behaviour of the configuration checker
system.bootstrap.config_check.strict=true

# How long should shutdown wait to complete normally before
# taking stronger action and calling System.exit()
# in ms, 10,000 is 10 seconds
shutdown.backstop.timeout=10000
shutdown.backstop.enabled=false

# Server Single User Mode
# note:
#   only allow named user (note: if blank or not set then will allow all users)
#   assuming maxusers is not set to 0
#server.singleuseronly.name=admin

# Server Max Users - limit number of users with non-expired tickets
# note:
#   -1 allows any number of users, assuming not in single-user mode
#   0 prevents further logins, including the ability to enter single-user mode
server.maxusers=-1

# The Cron expression controlling the frequency with which the OpenOffice connection is tested
openOffice.test.cronExpression=0 * * * * ?

# Disable all shared caches (mutable and immutable)
# These properties are used for diagnostic purposes
system.cache.disableMutableSharedCaches=false
system.cache.disableImmutableSharedCaches=false

# The maximum capacity of the parent assocs cache (the number of nodes whose parents can be cached)
system.cache.parentAssocs.maxSize=130000
# The average number of parents expected per cache entry. This parameter is multiplied by the above
# value to compute a limit on the total number of cached parents, which will be proportional to the
# cache's memory usage. The cache will be pruned when this limit is exceeded to avoid excessive
# memory usage.
system.cache.parentAssocs.limitFactor=8

# Properties to limit resources spent on individual searches
#
# The maximum time spent pruning results
system.acl.maxPermissionCheckTimeMillis=10000
# The maximum number of search results to perform permission checks against
system.acl.maxPermissionChecks=1000
# The maximum number of filefolder list results
system.filefolderservice.defaultListMaxResults=5000
# Properties to control read permission evaluation for acegi
system.readpermissions.optimise=true
system.readpermissions.bulkfetchsize=1000

# Manually control how the system handles maximum string lengths.
# Any zero or negative value is ignored.
# Only change this after consulting support or reading the appropriate Javadocs for
# org.alfresco.repo.domain.schema.SchemaBootstrap for V2.1.2
system.maximumStringLength=-1

# Limit hibernate session size by trying to amalgamate events for the L2 session invalidation
# - hibernate works as is up to this size
# - after the limit is hit events that can be grouped invalidate the L2 cache by type and not instance
#   events may not group if there are post action listener registered (this is not the case with the default distribution)
system.hibernateMaxExecutions=20000

# Determine if modification timestamp propagation from child to parent nodes is respected or not.
# Even if 'true', the functionality is only supported for child associations that declare the
# 'propagateTimestamps' element in the dictionary definition.
system.enableTimestampPropagation=true

# Decide if content should be removed from the system immediately after being orphaned.
# Do not change this unless you have examined the impact it has on your backup procedures.
system.content.eagerOrphanCleanup=false
# The number of days to keep orphaned content in the content stores.
# This has no effect on the 'deleted' content stores, which are not automatically emptied.
system.content.orphanProtectDays=14
# The action to take when a store or stores fails to delete orphaned content
#    IGNORE:   Just log a warning. The binary remains and the record is expunged
#    KEEP_URL: Log a warning and create a URL entry with orphan time 0. It won't be processed or removed.
system.content.deletionFailureAction=IGNORE
# The CRON expression to trigger the deletion of resources associated with orphaned content.
system.content.orphanCleanup.cronExpression=0 0 4 * * ?
# The CRON expression to trigger content URL conversion. This process is not intesive and can
# be triggered on a live system. Similarly, it can be triggered using JMX on a dedicated machine.
system.content.contentUrlConverter.cronExpression=* * * * * ? 2099
system.content.contentUrlConverter.threadCount=2
system.content.contentUrlConverter.batchSize=500
system.content.contentUrlConverter.runAsScheduledJob=false

# #################### #
# Lucene configuration #
# #################### #
#
# Millisecond threshold for text transformations
# Slower transformers will force the text extraction to be asynchronous
lucene.maxAtomicTransformationTime=100
# The maximum number of clauses that are allowed in a lucene query
lucene.query.maxClauses=10000
# The size of the queue of nodes waiting for index
# Events are generated as nodes are changed, this is the maximum size of the queue used to coalesce event
# When this size is reached the lists of nodes will be indexed
#
# http://issues.alfresco.com/browse/AR-1280: Setting this high is the workaround as of 1.4.3.
lucene.indexer.batchSize=1000000
fts.indexer.batchSize=1000

# Index cache sizes
lucene.indexer.cacheEnabled=true
lucene.indexer.maxDocIdCacheSize=100000
lucene.indexer.maxDocumentCacheSize=100
lucene.indexer.maxIsCategoryCacheSize=-1
lucene.indexer.maxLinkAspectCacheSize=10000
lucene.indexer.maxParentCacheSize=100000
lucene.indexer.maxPathCacheSize=100000
lucene.indexer.maxTypeCacheSize=10000

# Properties for merge (not this does not affect the final index segment which will be optimised)
# Max merge docs only applies to the merge process not the resulting index which will be optimised.
lucene.indexer.mergerMaxMergeDocs=1000000
lucene.indexer.mergerMergeFactor=5
lucene.indexer.mergerMaxBufferedDocs=-1
#lucene.indexer.mergerRamBufferSizeMb=16
lucene.indexer.mergerRamBufferSizeMb=20

# Properties for delta indexes (not this does not affect the final index segment which will be optimised)
# Max merge docs only applies to the index building process not the resulting index which will be optimised.
lucene.indexer.writerMaxMergeDocs=1000000
lucene.indexer.writerMergeFactor=5
lucene.indexer.writerMaxBufferedDocs=-1
#lucene.indexer.writerRamBufferSizeMb=16
lucene.indexer.writerRamBufferSizeMb=20

# Target number of indexes and deltas in the overall index and what index size to merge in memory
lucene.indexer.mergerTargetIndexCount=8
lucene.indexer.mergerTargetOverlayCount=5
lucene.indexer.mergerTargetOverlaysBlockingFactor=2
lucene.indexer.maxDocsForInMemoryMerge=60000
lucene.indexer.maxRamInMbForInMemoryMerge=16
lucene.indexer.maxDocsForInMemoryIndex=60000
#lucene.indexer.maxRamInMbForInMemoryIndex=16
lucene.indexer.maxRamInMbForInMemoryIndex=20

# Other lucene properties
lucene.indexer.termIndexInterval=128
lucene.indexer.useNioMemoryMapping=true
# over-ride to false for pre 3.0 behaviour
lucene.indexer.postSortDateTime=true
lucene.indexer.defaultMLIndexAnalysisMode=EXACT_LANGUAGE_AND_ALL
lucene.indexer.defaultMLSearchAnalysisMode=EXACT_LANGUAGE_AND_ALL

# The number of terms from a document that will be indexed
lucene.indexer.maxFieldLength=10000

# Should we use a 'fair' locking policy, giving queue-like access behaviour to
# the indexes and avoiding starvation of waiting writers? Set to false on old
# JVMs where this appears to cause deadlock
lucene.indexer.fairLocking=true

# Index locks (mostly deprecated and will be tidied up with the next lucene upgrade)
lucene.write.lock.timeout=10000
lucene.commit.lock.timeout=100000
lucene.lock.poll.interval=100

lucene.indexer.useInMemorySort=true
lucene.indexer.maxRawResultSetSizeForInMemorySort=1000
lucene.indexer.contentIndexingEnabled=true

index.backup.cronExpression=0 0 3 * * ?

lucene.defaultAnalyserResourceBundleName=alfresco/model/dataTypeAnalyzers

# When transforming archive files (.zip etc) into text representations (such as
# for full text indexing), should the files within the archive be processed too?
# If enabled, transformation takes longer, but searches of the files find more.
transformer.Archive.includeContents=false

# Database configuration
db.schema.stopAfterSchemaBootstrap=false
db.schema.update=true
db.schema.update.lockRetryCount=24
db.schema.update.lockRetryWaitSeconds=5
db.driver=org.gjt.mm.mysql.Driver
db.name=alfresco
db.url=jdbc:mysql:///${db.name}
db.username=alfresco
db.password=*
db.pool.initial=10
db.pool.max=40
db.txn.isolation=-1
db.pool.statements.enable=true
db.pool.statements.max=40
db.pool.min=0
db.pool.idle=-1
db.pool.wait.max=-1
db.pool.validate.query=
db.pool.evict.interval=-1
db.pool.evict.idle.min=1800000
db.pool.validate.borrow=true
db.pool.validate.return=false
db.pool.evict.validate=false
#
db.pool.abandoned.detect=false
db.pool.abandoned.time=300
#
# db.pool.abandoned.log=true (logAbandoned) adds overhead (http://commons.apache.org/dbcp/configuration.html)
# and also requires db.pool.abandoned.detect=true (removeAbandoned)
#
db.pool.abandoned.log=false

# Caching Content Store
system.content.caching.cacheOnInbound=true
system.content.caching.maxDeleteWatchCount=1
# Clean up every day at 3 am
system.content.caching.contentCleanup.cronExpression=0 0 3 * * ?
system.content.caching.minFileAgeMillis=60000
system.content.caching.maxUsageMB=4096
# maxFileSizeMB - 0 means no max file size.
system.content.caching.maxFileSizeMB=0

mybatis.useLocalCaches=false

fileFolderService.checkHidden.enabled=true

ticket.cleanup.cronExpression=0 0 * * * ?

# Disable load of sample site
sample.site.disabled=false

# Download Service Cleanup
download.cleaner.startDelayMins=60
download.cleaner.repeatIntervalMins=60
download.cleaner.maxAgeMins=60

# enable QuickShare - if false then the QuickShare-specific REST APIs will return 403 Forbidden
system.quickshare.enabled=true

# Cache configuration
cache.propertyValueCache.maxItems=10000
cache.contentDataSharedCache.maxItems=130000
cache.immutableEntitySharedCache.maxItems=50000
cache.node.rootNodesSharedCache.maxItems=1000
cache.node.allRootNodesSharedCache.maxItems=1000
cache.node.nodesSharedCache.maxItems=250000
cache.node.aspectsSharedCache.maxItems=130000
cache.node.propertiesSharedCache.maxItems=130000
cache.node.parentAssocsSharedCache.maxItems=130000
cache.node.childByNameSharedCache.maxItems=130000
cache.userToAuthoritySharedCache.maxItems=5000
cache.authenticationSharedCache.maxItems=5000
cache.authoritySharedCache.maxItems=10000
cache.authorityToChildAuthoritySharedCache.maxItems=40000
cache.zoneToAuthoritySharedCache.maxItems=500
cache.permissionsAccessSharedCache.maxItems=50000
cache.readersSharedCache.maxItems=10000
cache.readersDeniedSharedCache.maxItems=10000
cache.nodeOwnerSharedCache.maxItems=40000
cache.personSharedCache.maxItems=1000
cache.ticketsCache.maxItems=1000
cache.avmEntitySharedCache.maxItems=5000
cache.avmVersionRootEntitySharedCache.maxItems=1000
cache.avmNodeSharedCache.maxItems=5000
cache.avmNodeAspectsSharedCache.maxItems=5000
cache.webServicesQuerySessionSharedCache.maxItems=1000
cache.aclSharedCache.maxItems=50000
cache.aclEntitySharedCache.maxItems=50000
cache.resourceBundleBaseNamesSharedCache.maxItems=1000
cache.loadedResourceBundlesSharedCache.maxItems=1000
cache.messagesSharedCache.maxItems=1000
cache.compiledModelsSharedCache.maxItems=1000
cache.prefixesSharedCache.maxItems=1000
cache.webScriptsRegistrySharedCache.maxItems=1000
cache.routingContentStoreSharedCache.maxItems=10000
cache.executingActionsCache.maxItems=1000
cache.tagscopeSummarySharedCache.maxItems=1000
cache.imapMessageSharedCache.maxItems=2000
cache.tenantEntitySharedCache.maxItems=1000
cache.immutableSingletonSharedCache.maxItems=12000
cache.remoteAlfrescoTicketService.ticketsCache.maxItems=1000
cache.contentDiskDriver.fileInfoCache.maxItems=1000
cache.globalConfigSharedCache.maxItems=1000
cache.authorityBridgeTableByTenantSharedCache.maxItems=10

# Download Service Limits, in bytes
download.maxContentSize=2152852358

# Use bridge tables for caching authority evaluation.
authority.useBridgeTable=true
Our progress so far:
Question: how can we tell Alfresco to use our analyzer?
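Our guess so far, completely unverified: copy the stock dataTypeAnalyzers bundle, change the d:text entry to our own analyzer class, and point lucene.defaultAnalyserResourceBundleName at the copy. The key format below is what we think the stock alfresco/model/dataTypeAnalyzers bundle uses; the class name is a placeholder for our own Lucene analyzer:

```properties
# customAnalyzers.properties (hypothetical location on the extension classpath)
# Key format assumed from the stock dataTypeAnalyzers bundle;
# com.example.OurAnalyzer stands in for our own analyzer class.
d_dictionary.datatype.d_text.analyzer=com.example.OurAnalyzer
```

Is this the right mechanism, or is there a supported way to register a custom analyzer per property rather than per data type?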