
Cluster content inconsistencies

Question asked by crichsource360 on Jun 4, 2014
Latest reply on Jun 10, 2014 by crichsource360
Hello,

We have a two-node Alfresco cluster, and content is pushed to the repository via WebDAV. Every once in a while, the opposite node does not have the updated file. The database has the updated information, but the object (from Hibernate, I presume) is not updated. So when any code accesses the file on node1, it is correct, but when you access it on node2, it is inaccurate. The servers have to be restarted for node2 to reflect the latest changes.
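
To pin down when the two nodes diverge, a small check like the one below could be run after each WebDAV push. It is only a rough sketch: the class name, the two URLs, and the credentials are placeholders supplied on the command line, not values from our setup. It also assumes the WebDAV endpoint accepts HTTP Basic authentication and that Java 9 or later is available.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper: fetches the same path from both nodes and compares digests.
final class NodeContentCheck {

    // Returns the SHA-256 digest of the given bytes as lowercase hex.
    static String sha256Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // True when both byte arrays hash to the same digest.
    static boolean sameContent(byte[] a, byte[] b) throws NoSuchAlgorithmException {
        return sha256Hex(a).equals(sha256Hex(b));
    }

    // Downloads a resource over HTTP(S) using Basic auth (an assumption about the endpoint).
    static byte[] fetch(String url, String user, String password) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        String token = Base64.getEncoder().encodeToString(
                (user + ":" + password).getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + token);
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes(); // requires Java 9+
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: NodeContentCheck <node1-url> <node2-url> <user> <password>");
            return;
        }
        byte[] fromNode1 = fetch(args[0], args[2], args[3]);
        byte[] fromNode2 = fetch(args[1], args[2], args[3]);
        System.out.println("node1 sha256: " + sha256Hex(fromNode1));
        System.out.println("node2 sha256: " + sha256Hex(fromNode2));
        System.out.println(sameContent(fromNode1, fromNode2) ? "MATCH" : "MISMATCH");
    }
}
```

Logging the timestamp of each MISMATCH would at least narrow down which part of the JGroups/EHCache logs to read.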

We increased the RMI socket timeout (alfresco.ehcache.rmi.sockettimeoutmillis), which seems to have made a small improvement. However, the problem is still happening, and the timeout is LONG (hours).

We are using TCP for JGroups, and I've increased logging, but it's like finding a needle in a haystack.

However, I also realize that the ehcache objects are configured to never expire.  Our ehcache_custom.xml has entries like this:
    <cache
        name="org.alfresco.repo.domain.hibernate.NodeImpl"
        maxElementsInMemory="10000"
        eternal="false"
        timeToIdleSeconds="900"
        timeToLiveSeconds="900"
        overflowToDisk="false">

            <cacheEventListenerFactory
                    class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"
                    properties="replicatePuts = false,
                                replicateUpdates = true,
                                replicateRemovals = true,
                                replicateUpdatesViaCopy = false,
                                replicateAsynchronously = false"/>
    </cache>

I'm concerned that expiring cache entries would potentially move load to the database, which is not desirable.

Does anybody have any advice to help improve the situation?

UPDATE: Digging a little more, I realize I may be going down the wrong path.  I believe it's using JGroups instead of EHCache, so my change probably didn't affect anything!  Here is the relevant entry from ehcache_custom.xml:
    <cacheManagerPeerProviderFactory
        class="org.alfresco.repo.cache.AlfrescoCacheManagerPeerProviderFactory"
        properties="heartbeatInterval=5000,
                    peerDiscovery=automatic,
                    multicastGroupAddress=230.0.0.1,
                    multicastGroupPort=4446"/>

I guess this means it's using JGroups?  So maybe a timeout in JGroups?  The alfresco-jgroups-TCP.xml is this:
<config>
    <TCP bind_port="${alfresco.tcp.start_port:7800}"
         loopback="true"
         recv_buf_size="20000000"
         send_buf_size="640000"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="false"
         sock_conn_timeout="300"
         skip_suspected_members="true"
        
         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="25"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="run"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="run"/>
                        
    <TCPPING timeout="3000"
             initial_hosts="${alfresco.tcp.initial_hosts:localhost[7800]}"
             port_range="${alfresco.tcp.port_range:3}"
             num_initial_members="2"/>
    <MERGE2 max_interval="30000"
              min_interval="10000"/>
    <FD_SIMPLE timeout="10000" max_missed_hbs="10" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK
                   use_mcast_xmit="false" gc_lag="0"
                   retransmit_timeout="300,600,1200,2400,4800"
                   discard_delivered_msgs="true"/>
    <UNICAST timeout="300,600,1200" />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="400000"/>
    <VIEW_SYNC avg_send_interval="60000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <FC max_credits="2000000"
        min_threshold="0.10"/>
    <FRAG2 frag_size="60000"  />
    <pbcast.STREAMING_STATE_TRANSFER/>
    <!-- <pbcast.STATE_TRANSFER/> -->
</config>


Any help would be appreciated.


Thanks,
