Alfresco indexing slow due to transformation

Question asked by dmorozov on May 12, 2011
Latest reply on Mar 12, 2012 by ebogaard
I spent the last week fighting with Alfresco running terribly slowly, which (I think) is caused by Tika transformations happening in the background.
Please advise how to solve this issue.

We have Alfresco 3.4.d installed on a 64-bit Ubuntu server.
RAM: 16G
CPU: 4
JVM settings: -Djava.awt.headless=true -server -Xss1M -Xms1G -Xmx4G -XX:NewSize=1G -XX:MaxPermSize=512M -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:CMSInitiatingOccupancyFraction=80 -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:+UseTLAB
Database -> MySQL on separate server
Content repository size: 23G
Server: Apache Tomcat 6.0.32

Alfresco starts with about 1.5G of memory, and after some time memory usage jumps up to 4.6G.
This seems okay as long as it has good throughput: no slowness, no errors.
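
For reference, here is a rough sketch of how the 4.6G figure compares with the JVM settings above. It only adds up the configured ceilings (-Xmx, -XX:MaxPermSize) and the thread stacks (-Xss times the number of threads). The thread count of 100 is purely an assumption for illustration, I have not measured it, and native overhead such as the code cache and GC structures is ignored.

```python
def parse_size_mb(value):
    """Convert a JVM size string such as '4G' or '512M' to megabytes."""
    units = {"K": 1 / 1024, "M": 1, "G": 1024}
    return float(value[:-1]) * units[value[-1].upper()]


def estimate_footprint_mb(xmx, max_perm, xss, thread_count):
    """Upper-bound estimate: max heap + max PermGen + all thread stacks."""
    heap = parse_size_mb(xmx)
    perm = parse_size_mb(max_perm)
    stacks = parse_size_mb(xss) * thread_count
    return heap + perm + stacks


# ASSUMPTION: 100 live threads; the real number was not measured.
THREAD_COUNT = 100

# Values taken from the JVM settings listed above.
total = estimate_footprint_mb("4G", "512M", "1M", THREAD_COUNT)
print(f"{total:.0f} MB (~{total / 1024:.1f} GB)")
```

Under these assumptions the ceiling comes out at roughly the level I observe, so the 4.6G may simply be the heap and PermGen filling up to their configured limits rather than a leak outside the heap.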

But after some time it gets really slow and then hangs, even if nobody has used the site for a while. I don't know what causes the issue, but here is what I have:
1. Linux top shows the average CPU utilization is about 25% (so I assume one of the 4 CPUs is loaded at ~100%?)
2. A thread dump (kill -3 PID) always shows the same picture. The only really interesting thread, which always shows up during the slowness, is the Tika transformer for Excel files, started from the full-text search job:

"DefaultScheduler_Worker-2" prio=10 tid=0x0000000041280800 nid=0x7891 runnable [0x00007fe37f4f2000]
   java.lang.Thread.State: RUNNABLE
        at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTSheetDataImpl.sizeOfRowArray(Unknown Source)
        - locked <0x00007fe3e3352a58> (a
        at org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTSheetDataImpl$1RowList.size(Unknown Source)
        at java.util.AbstractList$Itr.hasNext(
        at org.apache.poi.xssf.usermodel.XSSFSheet.initRows(
        at org.apache.poi.xssf.usermodel.XSSFSheet.onDocumentRead(
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(
        at org.apache.poi.POIXMLDocument.load(
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(
        at org.apache.poi.xssf.extractor.XSSFExcelExtractor.<init>(
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(
        at org.alfresco.repo.content.TikaOfficeDetectParser.parse(
        at org.alfresco.repo.content.transform.TikaPoweredContentTransformer.transformInternal(
        at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(
        at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(
        at sun.reflect.GeneratedMethodAccessor329.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
        at java.lang.reflect.Method.invoke(
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(
        at $Proxy79.index(Unknown Source)
        at org.quartz.simpl.SimpleThreadPool$

3. JProfiler showed that memory allocation is mostly caused by the same Tika transformer classes.
Most memory is taken by the xmlbeans, poi and openxmlformats packages, and the allocation tree showed the same transformation job.

4. A full re-index completed without any issues.
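
To check the assumption in point 1 against the thread in point 2, the nid value in the thread dump header can be converted from hex to decimal and compared with the per-thread IDs shown by top -H -p PID. On a HotSpot JVM on Linux the nid is the native thread ID in hex. Below is a minimal sketch of that conversion; the function name thread_ids is just something I made up for illustration.

```python
import re

# Matches a thread header line: quoted thread name, then nid=0x... later on.
NID_PATTERN = re.compile(r'^"(?P<name>[^"]+)".*\bnid=(?P<nid>0x[0-9a-fA-F]+)')


def thread_ids(dump_text):
    """Map each thread name in a thread dump to its decimal native thread ID."""
    result = {}
    for line in dump_text.splitlines():
        match = NID_PATTERN.match(line)
        if match:
            result[match.group("name")] = int(match.group("nid"), 16)
    return result


# Header line copied from the thread dump above.
header = ('"DefaultScheduler_Worker-2" prio=10 tid=0x0000000041280800 '
          'nid=0x7891 runnable [0x00007fe37f4f2000]')
print(thread_ids(header))  # {'DefaultScheduler_Worker-2': 30865}
```

If thread 30865 is the one sitting at ~100% CPU in top -H, that would confirm that the Excel transformation is what keeps one CPU busy.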

Can anybody suggest what else I can do and what the reason for all this is?
Is it common for Alfresco to take almost 5G of RAM?
How can I disable CONTENT indexing for Excel files (it doesn't make sense for me)?
I believe users can upload pretty big Excel files into the repository (say 3M-10M). Can that cause the issue?

Any suggestions are appreciated.