Recycling and tuning recommendations
Many of these recommendations are based on extensive benchmarking carried out in January 2012. The goal of the testing was to prove that the specified cluster could support 6000 concurrent respondents for many hours, with bursts of up to 7200 respondents. Four project sizes were used: small, medium, large, and extra-large. Each project included sample and quota. Thousands of projects were set up in DPM, and thousands of projects were cached in memory. Users were spread unevenly over 12 projects during testing. A number of tests were run over the course of more than two weeks, modelling different ramp-up values and project mixes, with the goal of supporting 500 interviews starting within a one-minute window. These tests included simultaneous activations and data extractions. Application and Web tier failures were introduced to verify that recovery could be accomplished.
The testing, along with input from users and the online services team at IBM, has led to the following recommendations.
Recycling recommendations
Engine application pools should be recycled every 24 hours on a fixed schedule during the time when the lowest level of interviewing is occurring.
Recycles should be spaced far enough apart (at least five minutes) to allow all traffic to fail over from one engine before the next engine is recycled.
Engines should be recycled in reverse order of registration so that a single respondent is unlikely to experience two failovers.
The server team should be alerted each time an engine recycles. The alert process should be set up to ignore scheduled recycles.
Engine application pools should not be recycled based on memory.
Engine application pool memory usage will grow over time; however, tests show that daily recycling is sufficient even under full load.
Recycling at random times means failing over hundreds of respondents, which can lead to a poor respondent experience and additional load on other engines. It can also cause cluster issues if the tuning recommendations are not followed.
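As an illustration of this schedule, the following sketch uses the IIS appcmd.exe tool to give each engine application pool a fixed daily recycle time, spaced five minutes apart. The pool names are hypothetical (verify them with "appcmd list apppool"), and the 03:00 start time is an assumption standing in for your lowest-traffic window; run it with administrative rights on each computer that hosts engine application pools.

    # Sketch: staggered daily recycles for engine application pools (hypothetical
    # pool names). List the pools in reverse order of registration so that a single
    # respondent is unlikely to experience two failovers.
    import subprocess
    from datetime import datetime, timedelta

    APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"
    ENGINE_POOLS = ["Engine3", "Engine2", "Engine1"]   # hypothetical names, reverse registration order
    START = datetime.strptime("03:00", "%H:%M")        # assumed lowest interviewing load
    SPACING = timedelta(minutes=5)                     # let failover complete before the next recycle

    for index, pool in enumerate(ENGINE_POOLS):
        recycle_at = (START + index * SPACING).strftime("%H:%M:%S")
        # Turn off the elapsed-time restart so only the fixed schedule applies.
        subprocess.run([APPCMD, "set", "apppool", pool,
                        "/recycling.periodicRestart.time:00:00:00"], check=True)
        # Add the fixed recycle time to the pool's schedule collection.
        subprocess.run([APPCMD, "set", "apppool", pool,
                        f"/+recycling.periodicRestart.schedule.[value='{recycle_at}']"],
                       check=True)
        print(f"{pool}: daily recycle scheduled at {recycle_at}")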
Log all application pool recycling events
All application pool recycle events should be logged in order to ease the process of determining why interviewing engine restarts have occurred. It should always be possible to associate an interviewing engine restart log message with either an application pool recycle event or a w3wp failure in the Windows Event logs.
For more information, see the Microsoft article “Configure Logging of Events when an Application Pool Recycles because of Configured Recycling Events (IIS 7)”:
http://technet.microsoft.com/en-us/library/cc771318%28v=ws.10%29.aspx
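As a minimal sketch of the configuration described in that article, the following enables event log entries for every recycle reason on the engine application pools (the pool names are hypothetical), so that an interviewing engine restart message can always be matched with a recycle event in the Windows Event logs:

    # Sketch: log all application pool recycle reasons for the engine pools.
    # Pool names are hypothetical; verify with "appcmd list apppool".
    import subprocess

    APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"
    ALL_REASONS = ("Time,Requests,Schedule,Memory,IsapiUnhealthy,"
                   "OnDemand,ConfigChange,PrivateMemory")

    for pool in ["Engine1", "Engine2", "Engine3"]:   # hypothetical names
        subprocess.run([APPCMD, "set", "apppool", pool,
                        f"/recycling.logEventOnRecycle:{ALL_REASONS}"], check=True)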
Tuning recommendations
Configure the ConnectionLimit and StartsPerSecLimit counters
The ConnectionLimit and StartsPerSecLimit counters are used by the engine load balancing script to decide which engine should be used for a new respondent. It is important to load balance engines properly to make the best use of the CPU and memory available on each computer. The goal of limiting projects to single engines, to avoid extra memory use, must be balanced against the requirement to handle failover efficiently. A failover situation is similar to a mass mail-out in that a large number of respondents will arrive at one time to be distributed. In large clusters (intended to handle thousands of simultaneous respondents), a failover of a heavily loaded engine can overload another engine, resulting in a cascading failover if these limits are not managed.
These limits must be set or updated in the registry of each engine computer under \HKEY_LOCAL_MACHINE\Software\SPSS\mrInterview\3\LoadBalancing. Refer to the UNICOM Intelligence Developer Documentation Library topic Counters for Interview Engine load balancing for more information on these counters.
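As an illustration, the following sketch reads the current values on an engine computer using Python's winreg module. The key path and value names come from the documentation above; the assumption that the values are DWORDs, and the alternative Wow6432Node path checked for 64-bit installations, should be verified against your own registry.

    # Sketch: read the load-balancing limits on an engine computer.
    # On 64-bit Windows the key may live under Software\Wow6432Node\SPSS\... ,
    # so both locations are checked.
    import winreg

    KEY_PATHS = [r"Software\SPSS\mrInterview\3\LoadBalancing",
                 r"Software\Wow6432Node\SPSS\mrInterview\3\LoadBalancing"]

    def read_limits():
        for path in KEY_PATHS:
            try:
                with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
                    connection_limit, _ = winreg.QueryValueEx(key, "ConnectionLimit")
                    starts_per_sec, _ = winreg.QueryValueEx(key, "StartsPerSecLimit")
                    return path, connection_limit, starts_per_sec
            except FileNotFoundError:
                continue
        raise FileNotFoundError("LoadBalancing key not found on this computer")

    path, connection_limit, starts_per_sec = read_limits()
    print(f"{path}: ConnectionLimit={connection_limit}, StartsPerSecLimit={starts_per_sec}")

Writing new values (for example with winreg.SetValueEx) should follow the limits discussed below; check the UNICOM Intelligence Developer Documentation Library for whether a change is picked up without recycling the engine.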
General tuning should take the following factors into account:
available CPU resources per engine
CPU cycles required to start or restart surveys
number of interviews that must be restarted with each engine failure
Adjust the ConnectionLimit counter based on the following:
number of engines per server
number of cores per server
mix of projects.
Benchmark testing found that each 4-core server could comfortably support over 2,000 interviews using a small project, or 1,000 interviews of mixed size. Setting the ConnectionLimit counter to 400 allows the engine to stop accepting new interviews as it approaches capacity. Some users have reduced this number further, to 300.
Adjust the startup limit (StartsPerSecLimit counter) based on the following:
CPU cycles required to restart surveys
any other startup delays
monitoring the StartsPerSecond and RestartsPerSecond counters.
Benchmarking and user experience have shown a value of 40 to be suitable for the StartsPerSecLimit counter on heavily loaded clusters. However, the optimal value for these limits can only be found by monitoring the StartsPerSecond and RestartsPerSecond performance counters and finding the number that prevents the CPU from reaching 100% usage. If the default value of 100 is too large for your cluster, reduce the value to 40, then monitor the StartsPerSecond, RestartsPerSecond, and CPU performance counters and adjust accordingly. Some users have reduced this value to as low as 30.
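The standard Windows CPU counter path in the sketch below is real, but the exact paths for the StartsPerSecond and RestartsPerSecond counters depend on the performance object the interview engine exposes, so they are shown as placeholders; copy the actual paths from Performance Monitor on an engine computer before running this sketch, which samples the counters to a CSV file for review while you tune StartsPerSecLimit.

    # Sketch: sample CPU alongside the engine start/restart counters while tuning
    # StartsPerSecLimit. The two engine counter paths are placeholders; replace
    # them with the paths shown in Performance Monitor (perfmon).
    import subprocess

    COUNTERS = [
        r"\Processor(_Total)\% Processor Time",     # standard Windows counter
        r"\<engine object>\StartsPerSecond",        # placeholder - replace
        r"\<engine object>\RestartsPerSecond",      # placeholder - replace
    ]

    # Sample every 5 seconds for 120 samples (10 minutes) and write a CSV file.
    subprocess.run(["typeperf", *COUNTERS, "-si", "5", "-sc", "120",
                    "-f", "CSV", "-o", "starts_tuning.csv", "-y"], check=True)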
Reduce the CacheSubscriptionTimeout value
The computer hosting FMRoot has the additional overhead of managing the cache files. This can stress the CPU of the FMRoot computer if the transfer process fails or a large number of interviews are abandoned, resulting in an inordinate number of cache files. This behavior can be identified by regular CPU spikes on the FMRoot computer.
If your respondents are likely to complete an interview within 8 hours, you can reduce the CacheSubscriptionTimeout value from 48 hours to 8 hours so that abandoned cache files are removed more quickly, reducing this load.
To reduce network load, you can increase the ScanInterval value from the default of 5 minutes to as much as 30 minutes. This increases the time before data appears in the database only when a write to the database has failed; during normal operation, data is written directly to the database after each question, so the longer interval has no impact.
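If you suspect that abandoned cache files are building up on the FMRoot computer, a quick check such as the following sketch can help. The cache directory path is an assumption (the location varies by installation), so substitute the directory your cluster actually uses; the 8-hour threshold matches the reduced CacheSubscriptionTimeout discussed above.

    # Sketch: count cache files older than a threshold on the FMRoot computer to
    # spot a build-up of abandoned interview caches. CACHE_DIR is hypothetical;
    # replace it with the cache file location used by your installation.
    import time
    from pathlib import Path

    CACHE_DIR = Path(r"D:\InterviewCache")   # hypothetical path - replace
    THRESHOLD_HOURS = 8                      # matches a CacheSubscriptionTimeout of 8 hours

    cutoff = time.time() - THRESHOLD_HOURS * 3600
    stale = [p for p in CACHE_DIR.rglob("*")
             if p.is_file() and p.stat().st_mtime < cutoff]
    print(f"{len(stale)} cache files older than {THRESHOLD_HOURS} hours")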
See also
Cluster tuning