 
Recycling and tuning recommendations
Many of these recommendations are based on extensive benchmarking carried out in January 2012. The goal of the testing was to prove that the specified cluster could support 6000 concurrent respondents for many hours, with bursts of up to 7200 respondents. Four project sizes were used – small, medium, large, and extra-large – and each project included sample and quota. Thousands of projects were set up in DPM, and thousands of projects were cached in memory. Users were spread unevenly over 12 projects during testing. A number of tests were run over the course of more than two weeks, modeling different ramp-up rates and project mixes, with the goal of supporting 500 interviews starting within a single minute. These tests included simultaneous activations and data extractions. Application and Web tier failures were injected to verify that the cluster could recover.
The testing, together with input from users and the online services team at IBM, has led to the following recommendations.
Recycling recommendations
Recycle engine application pools every 24 hours on a fixed schedule, during the period when interviewing activity is at its lowest.
Recycles should be spaced far enough apart (at least five minutes) to allow all traffic to fail over from one engine before the next engine is recycled.
Engines should be recycled in reverse order of registration so that a single respondent is unlikely to experience two failovers.
The server team should be alerted each time an engine recycles. The alert process should be set up to ignore scheduled recycles.
Do not recycle engine application pools based on memory.
Engine application pool memory usage does grow over time; however, testing shows that daily recycling is sufficient even under full load.
Recycling at unpredictable times means failing over hundreds of respondents, which degrades the respondent experience and places additional load on the other engines. It can also cause cluster-wide issues if the tuning recommendations are not followed.
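As a minimal sketch of this approach, the following Python script uses the IIS appcmd.exe tool to give each engine application pool a fixed daily recycle time, spaced more than five minutes apart and ordered so that the most recently registered engine recycles first, and disables time- and memory-based recycling so that the schedule is the only trigger. The pool names, times, and appcmd location are assumptions and should be adapted to your cluster; run the script elevated on each computer that hosts interview engines.

import subprocess

APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"

# Hypothetical engine application pools listed in reverse order of
# registration, with recycle times spaced more than five minutes apart
# during the low-traffic window.
RECYCLE_SCHEDULE = [
    ("InterviewEngine3Pool", "03:00:00"),
    ("InterviewEngine2Pool", "03:10:00"),
    ("InterviewEngine1Pool", "03:20:00"),
]

def configure_recycling(pool, recycle_time):
    # Add the fixed daily recycle time to the pool's schedule collection.
    subprocess.run(
        [APPCMD, "set", "apppool", pool,
         "/+recycling.periodicRestart.schedule.[value='{0}']".format(recycle_time)],
        check=True)
    # Disable the default elapsed-time recycle and both memory limits so
    # that the fixed daily schedule is the only recycle trigger.
    subprocess.run(
        [APPCMD, "set", "apppool", pool,
         "/recycling.periodicRestart.time:00:00:00",
         "/recycling.periodicRestart.memory:0",
         "/recycling.periodicRestart.privateMemory:0"],
        check=True)

for pool, recycle_time in RECYCLE_SCHEDULE:
    configure_recycling(pool, recycle_time)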
Log all application pool recycling events
Log all application pool recycle events so that it is easier to determine why an interviewing engine restart occurred. It should always be possible to associate an interviewing engine restart log message with either an application pool recycle event or a w3wp failure in the Windows Event logs.
For more information, see the Microsoft article “Configure Logging of Events when an Application Pool Recycles because of Configured Recycling Events (IIS 7)”:
http://technet.microsoft.com/en-us/library/cc771318%28v=ws.10%29.aspx
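Following the article above, this sketch applies the same configuration from a script: it asks IIS to write a Windows Event Log entry for every recycle trigger on one (hypothetical) engine application pool, so that an interviewing engine restart can always be matched to a recycle event or a w3wp failure. Verify the flag list against your IIS version and run the script elevated.

import subprocess

APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"

# Log a Windows event for every recycle trigger on the (hypothetical) pool.
subprocess.run(
    [APPCMD, "set", "apppool", "InterviewEngine1Pool",
     "/recycling.logEventOnRecycle:Time,Requests,Schedule,Memory,"
     "IsapiUnhealthy,OnDemand,ConfigChange,PrivateMemory"],
    check=True)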
Tuning recommendations
Configure the ConnectionLimit and StartsPerSecLimit counters
The ConnectionLimit and StartsPerSecLimit counters are used by the engine load balancing script to decide which engine should be used for a new respondent. It is important to load balance engines properly to make the best use of the CPU and memory available on each computer. The goal of limiting projects to single engines, to avoid extra memory use, must be balanced against the requirement to handle failover efficiently. A failover situation is similar to a mass mail-out in that a large number of respondents arrive at one time and must be distributed. In large clusters (those designed to handle thousands of simultaneous respondents), the failover of a heavily loaded engine can overload another engine, resulting in cascading failovers if these limits are not managed.
These limits must be set or updated in the registry of each engine computer under HKEY_LOCAL_MACHINE\Software\SPSS\mrInterview\3\LoadBalancing. Refer to the UNICOM Intelligence Developer Documentation Library topic Counters for Interview Engine load balancing for more information on these counters.
General tuning should take the following factors into account:
available CPU resource per engine
CPU cycles required to start or restart surveys
number of interviews that must be restarted with each engine failure.
Set the ConnectionLimit counter based on the following:
number of engines per server
number of cores per server
mix of projects.
The benchmark testing found that each 4-core server could comfortably support over 2,000 interviews using a small project, or 1,000 interviews of mixed size. Limiting the ConnectionLimit counter to 400 allows the engine to stop accepting interviews as it approaches capacity. Some users have reduced this number further to 300.
Set the startup limit (StartsPerSecLimit counter) based on the following:
CPU cycles required to restart surveys
any other startup delays
monitoring the StartsPerSecond and RestartsPerSecond counters.
Benchmarking and user experience have shown that a value of 40 is suitable for the StartsPerSecLimit counter on heavily loaded clusters. However, the optimal value can only be found by monitoring the StartsPerSecond and RestartsPerSecond performance counters and finding the number that prevents the CPU from reaching 100% usage. If the default value of 100 is too large for your cluster, reduce the value to 40, then monitor the StartsPerSecond, RestartsPerSecond, and CPU performance counters and adjust accordingly. Some users have reduced this value to as low as 30.
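As a sketch of how these starting values might be applied on an engine computer, the following Python script writes both limits to the registry key named above. It assumes the limits are stored as DWORD values in the default registry view; confirm the value types against the Counters for Interview Engine load balancing topic and run the script elevated.

import winreg

KEY_PATH = r"Software\SPSS\mrInterview\3\LoadBalancing"

# Starting values from the benchmarking guidance above; adjust after
# monitoring the StartsPerSecond, RestartsPerSecond, and CPU counters.
LIMITS = {
    "ConnectionLimit": 400,   # some sites reduce this further to 300
    "StartsPerSecLimit": 40,  # some sites reduce this to 30
}

# Assumption: the limits are REG_DWORD values in the default registry view.
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    for name, value in LIMITS.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)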
Reduce the CacheSubscriptionTimeout value
The computer hosting FMRoot has the additional overhead of managing the cache files. This can stress the CPU of the FMRoot computer if the transfer process fails or a large number of interviews are abandoned, resulting in an inordinate number of cache files. This behavior can be identified by regular CPU spikes on the FMRoot computer.
If your respondent base is likely to complete an interview within 8 hours, you can reduce the CacheSubscriptionTimeout from 48 to 8 hours to remove abandoned cache files more quickly and reduce this load.
To reduce network load, you can also increase the ScanInterval value from the default of 5 minutes to as much as 30 minutes. The only cost is that, when a failure prevents writing to the database, data takes longer to appear there. During normal operation, data is written directly to the database after each question, so the longer interval has no impact.
See also
Cluster tuning