UNICOM Intelligence monitoring guidelines
Overview
This white paper provides tuning recommendations for UNICOM Intelligence clusters, suggestions for monitoring clusters with performance counters, and guidance on investigating clusters with log queries.
The tuning and monitoring recommendations are intended only for interviewing (web and phone). Tuning and monitoring for synchronization (for personal interviewing) and for the Interviewer Server Administration/Interviewer Server activities are not covered.
Tuning
Every cluster configuration should be benchmarked and tuned before use. A broad range of settings and configurations can impact the performance of the cluster, including IT department-provided settings, network configuration, load balancer setup, and project mix. The settings vary from business to business and can decrease the expected performance of your cluster. The best method for evaluating cluster performance is to benchmark it under load. Testing must include load testing, performance benchmarking, stress testing, and endurance testing, with defined usage scenarios that cover your specific projects and users, and defined success criteria.
Setting up a repeatable load test for a cluster, with a large number of respondents and different user roles, is a challenge that is different for every cluster and customer. Load testing typically requires Interviewer Server Load Tool scripting, scripting a user interface load tool, numerous workstations to drive the load, and people with different skills to monitor and interpret the results. Benchmarking cluster capacity is not described in this document. However, the following sections provide tuning suggestions that proved useful in previous benchmark tests.
Measurement and analysis
Several tools are suggested to measure and analyze performance during benchmark testing:
▪The Interviewer Server Load Tool provides page-to-page response times and per-question statistics.
▪The Monitoring with performance counters section suggests performance counters that can be used during tuning.
▪The Monitoring with log files section discusses log queries that can be used to find errors and check for other useful activities.
▪A user interface load tool is typically required when you test user interfaces. Such a tool provides response times and other custom timers.
▪Microsoft SQL Server Reports are useful when you check performance. The reports that are most useful for tuning are Top Queries By Average CPU Time and Top Queries By Total CPU Time. When more information is required, the Microsoft SQL Server Profiler can be used to retrieve detailed trace output. However, running the profiler impacts database performance, so use it only when issues arise.
Tuning the load balancing script
The ConnectionLimit, StartsPerSecLimit, and RestartsPerSecLimit settings are used by the engine load balancing script to decide which engine is used for a new respondent. It is important to properly load balance engines to make the best use of the CPU and memory that is available on each server. The goal of limiting projects to single engines, to avoid extra memory use, must be balanced with the requirement to efficiently handle failover. A failover situation is similar to a mass email invitation in that many respondents arrive at the same time and must be distributed. In large clusters (that are designed to handle thousands of simultaneous respondents), a failover of a heavily loaded engine can overload another engine, which can result in cascading failovers when the limits are not managed.
Limits must be set or updated in the registry on each engine server under:
\HKEY_LOCAL_MACHINE\Software\SPSS\mrInterview\3\LoadBalancing
For more information, see Counters for Interview Engine load balancing.
When tuning, pay attention to the following items:
▪Available CPU resources per engine.
▪CPU cycles that are required to start or restart surveys (dependent on your project mix).
▪The number of interviews that must be restarted with each engine failure.
Adjust the ConnectionLimit based on:
▪Number of engines per server.
▪Number of cores per server.
▪Mix of projects.
Previous benchmark testing found that each 4-core server can comfortably support over 2,000 concurrent interviews for a small project, or 1,000 interviews for a mix of project sizes. Setting the ConnectionLimit value to 400 causes an engine to stop accepting new interviews as it approaches capacity. Some customers have successfully reduced the ConnectionLimit value to 300.
Adjust the start limit (StartsPerSecLimit) based on:
▪CPU cycles that are required to restart surveys.
▪Any other start delays.
▪Monitoring StartsPerSecond and RestartsPerSecond.
Benchmarking, along with customer usage, shows that a StartsPerSecLimit value of 40 is adequate for heavily loaded clusters. However, the optimal value can be identified only by monitoring the StartsPerSecond and RestartsPerSecond performance counters and finding a value that prevents CPU usage from reaching 100%. If the default value of 100 is too large for your cluster, reduce it to 40, monitor the StartsPerSecond, RestartsPerSecond, and CPU performance counters, and adjust the value as needed. Some customers have successfully reduced the value to as low as 30.
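For example, the following mrScriptBasic sketch sets all three limits on an engine server. The REG_DWORD value type and the values themselves are assumptions for illustration; verify the value names, types, and suitable values against your own installation before use.
' Sketch: set the engine load balancing limits on one engine server
' Assumption: the limits are stored as DWORD values under the LoadBalancing key
Dim wshShell, keyPath
Set wshShell = CreateObject("WScript.Shell")
keyPath = "HKEY_LOCAL_MACHINE\Software\SPSS\mrInterview\3\LoadBalancing\"
wshShell.RegWrite(keyPath + "ConnectionLimit", 400, "REG_DWORD")
wshShell.RegWrite(keyPath + "StartsPerSecLimit", 40, "REG_DWORD")
wshShell.RegWrite(keyPath + "RestartsPerSecLimit", 40, "REG_DWORD")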
Other tuning
The server that hosts FMRoot has the additional load of managing the cache files. That load can stress the CPU of the FMRoot server when the transfer process fails, or when many interviews are abandoned (both of which result in numerous cache files). The stress shows up as regular CPU spikes on the FMRoot server. A large number of cache files can also stress the network.
If your respondents are likely to complete an interview within 8 hours, you can reduce the CacheSubscriptionTimeout value from the default of 48 hours to 8 hours. Reducing the value removes abandoned cache files more quickly and reduces the server load.
To reduce the network load, you can increase the ScanInterval value from the default of 5 minutes to as much as 30 minutes. Increasing the ScanInterval value reduces the network load; the only cost is an increased delay before data appears in the database, and only in cases where writing to the database fails. During normal operation, data is written directly to the database after each question, so the increased interval has no impact on server performance.
Per project tuning
Each engine has a single database connection for each project. Updating the database through this single connection can become a bottleneck when a single project has a large number (over 500) of concurrent respondents. For projects with a large number of respondents, performance can be improved by setting the MaxCommandsPerConnection property for the RDB DSC 2. The setting specifies the maximum number of commands that can share one database connection for the project; another database connection is created after the limit is reached. Based on internal tests, the suggested value for projects with a high load is between 30 and 50. The value can be updated in DPM at: Server > Projects > [ProjectName] > mrInterview > MaxCommandsPerConnection.
Monitoring with performance counters
Performance counters provide important information in the tuning phase for a new cluster. After the cluster is tuned, a subset of the counters can be used to monitor ongoing stability and performance. Extra counters might be required when you investigate an issue on the cluster.
Customers often ask about the impact on the cluster of monitoring performance counters. Performance counters are lightweight and are updated in memory whether they are monitored or not. Monitoring is also lightweight, provided that a suitable sampling interval is used; in most cases, a sampling interval of 1 minute is sufficient. However, continuous sampling requires a large amount of disk space. Set up the monitoring tool to use a circular log, with the option to overwrite older logs. Set the log size to provide the system administration team enough time to investigate or back up the log when an alert occurs (some experimentation might be required).
Monitoring must be conducted from a remote workstation, rather than directly on a cluster server. The following items must be monitored:
▪Database - For web and telephone interviewing, the major bottleneck is typically the database and the disk on the database server.
▪Memory - For most server applications on 32-bit systems, the addressable memory space is a bottleneck to the number of projects that can be loaded at one time. For 64-bit installations, while addressable virtual memory is no longer a concern, physical memory is still a consideration, especially given the performance impact of paging.
▪CPU - For web and telephone interviewing, the interviewing tier can become CPU bound, depending on the types of projects.
▪Network - For interviewing, when bandwidth and latency between the browser and the server are a consideration, network I/O can be the bottleneck. The network can also be a bottleneck when many cache files are stored on FMRoot, and many interviewing engines are checking the cache files.
▪Threads - Given that the server environments use thread pools, a potential bottleneck is the number of threads that are available in the thread pool. If a thread is not available to process a request, the request is queued, which results in a response delay.
The following tables list the available performance counters.
Cluster performance counters
Group | Counter | Servers | When to monitor? | Alert at
Memory | % Committed Bytes In Use | All DC | When issues arise |
Memory | Available Mbytes | All DC | Always | Consistently < 20% of installed RAM indicates insufficient memory; if alerted, investigate which process is using the memory
Memory | Pages/sec | All DC | When issues arise | Sustained > 5
Network Interface | Bytes Total/sec | All DC | Benchmarking | Sustained > 80% of bandwidth
Network Interface | Output Queue Length | All DC | When issues arise |
Network Interface | Packets/sec | All DC | When issues arise |
Processor(_Total) | % Processor Time | All DC | Always | >= 80% for more than 1 minute; if alerted, investigate which process is using the CPU
System | Context Switches/sec | All DC | When issues arise |
System | Processor Queue Length | All DC | Benchmarking | Average value > 2
ASP.NET | Active Threads | All DC | When issues arise |
ASP.NET | Request Execution Time | All DC | When issues arise |
ASP.NET | Request Wait Time | All DC | When issues arise |
ASP.NET | Requests Queued | All DC | Always | > 0 for more than 1 minute
Interview Web [For all web tier instances] | Active Threads | Accessories, Web | Always | >= 30
Interview Web [For all web tier instances] | Average Response Time | Accessories, Web | Always | >= 500ms
Interview Web [For all web tier instances] | Current Queued Requests | Accessories, Web | Always | > 0 for more than 1 minute
Interview Web [For all web tier instances] | Engines Failed | Accessories, Web | Always | At each increase
Interview Web [For all web tier instances] | Server Requests/sec | Accessories, Web | Always | >= 50 per sec
Process [For all w3wp processes] | % Processor Time | All DC | Benchmarking | >= 80%
Process [For all w3wp processes] | Thread Count | All DC | When issues arise | >= 64 * cores + 200
Process [For all w3wp processes] | Virtual Bytes | All DC | When issues arise | Increasing without leveling off
Process [For all w3wp processes] | Virtual Bytes Peak | All DC | When issues arise |
Process [For all w3wp processes] | Working Set | All DC | When issues arise |
Process [For all w3wp processes] | Working Set Peak | All DC | When issues arise |
APP_POOL_WAS(_Total) | Total Worker Process Failures | Interviewing | Always | At each increase
APP_POOL_WAS(_Total) | Total Application Pool Recycles | Interviewing | Always | Log only
Interview Engine [For all interview engines] | Active Threads | Interviewing | Always | >= 30
Interview Engine [For all interview engines] | Average Response Time | Interviewing | Always | >= 100ms
Interview Engine [For all interview engines] | Completes/sec | Interviewing | Always | Log only
Interview Engine [For all interview engines] | Current Interviews | Interviewing | Always | Log only; ConnectionLimit stops engines from overloading; use Percent Loaded for alerting
Interview Engine [For all interview engines] | Current Queued Requests | Interviewing | Always | > 0 for more than 1 minute
Interview Engine [For all interview engines] | Percent Loaded | Interviewing | Always | >= 80%
Interview Engine [For all interview engines] | Server Requests/sec | Interviewing | Always | >= 50 per sec
Interview Engine [For all interview engines] | Total Interviews | Interviewing | When issues arise |
LogicalDisk [All Volumes] | % Free Space | All DC | Always | <= 20%
PhysicalDisk(_Total) | % Disk Time | Database, FMRoot | Always | >= 50%
PhysicalDisk(_Total) | Avg. Disk Read Queue Length | Database, FMRoot | Always | >= 2
PhysicalDisk(_Total) | Avg. Disk Write Queue Length | Database, FMRoot | Always | >= 2
Process(sqlservr) | % Processor Time | Database | Always | >= 80% for more than 1 minute; if alerted, investigate which process is using the CPU
Process(sqlservr) | Thread Count | Database | When issues arise |
Process(sqlservr) | Virtual Bytes | Database | When issues arise |
Process(sqlservr) | Virtual Bytes Peak | Database | When issues arise |
Process(sqlservr) | Working Set | Database | When issues arise |
Process(sqlservr) | Working Set Peak | Database | When issues arise |
SQLServer:Access Methods | Full Scans/sec | Database | When issues arise |
SQLServer:Buffer Manager | Buffer cache hit ratio | Database | When issues arise |
SQLServer:General Statistics | Logical Connections | Database | When issues arise |
SQLServer:General Statistics | User Connections | Database | When issues arise |
SQLServer:Locks(_Total) | Lock Requests/sec | Database | Benchmarking |
SQLServer:Locks(_Total) | Lock Waits/sec | Database | Benchmarking |
SQLServer:Locks(_Total) | Number of Deadlocks/sec | Database | When issues arise |
SQLServer:Memory Manager | Target Server Memory (KB) | Database | When issues arise |
SQLServer:Memory Manager | Total Server Memory (KB) | Database | When issues arise |
SQLServer:SQL Statistics | Batch Requests/sec | Database | When issues arise |
Project performance counters
Group | Counter | Servers | When to monitor? | Alert at
Interview Project | Web - Average time to start interview | Interviewing | Always | > 1 second
Interview Project | Web - Maximum time to start interview | Interviewing | Always | > 4 seconds
Interview Project | Web - Average time page-to-page | Interviewing | Always | > 1 second
Interview Project | Web - Maximum time page-to-page | Interviewing | Always | > 2 seconds
Interview Project | Telephone - Average time to start interview | Interviewing | Always | > 1 second
Interview Project | Telephone - Maximum time to start interview | Interviewing | Always | > 4 seconds
Interview Project | Telephone - Average time page-to-page | Interviewing | Always | > 1 second
Interview Project | Telephone - Maximum time page-to-page | Interviewing | Always | > 2 seconds
Use the following recommendations, in association with the previous tables, when you monitor with performance counters:
▪Always monitor the counters that are marked Always in the When to monitor? column. The Alert at column documents a level for the counter that is considered too high.
▪When you benchmark, monitor the counters that are marked Benchmarking in addition to the Always counters. When you tune, look for levels that are greater than the Alert at values.
▪The remaining counters, which are marked When issues arise, might be useful when you investigate issues.
The suggested alerts do not always indicate that a cluster is about to fail. The alerts identify a cluster that is starting to respond less than optimally, which might impact the user experience. You might want to adjust your alerts based on your acceptable level of temporary performance degradation and the availability of your cluster support staff.
It might be useful to create performance counter templates for each of the monitoring requirements. Some counters are only created when the process is created. For example, the engine counters for a particular engine are only created when the engine is created. This fact makes it difficult to create a template in Microsoft Performance Monitor for a production environment. However, it is possible to monitor transient counters with systems management and monitoring tools, such as IBM Tivoli or Microsoft Systems Manager.
To avoid performance counter corruption, always allow sufficient time for the IIS and engine processes to shut down; 120 seconds is typically sufficient for IIS. When you reset the application pools, set the shutdown time limit to 180 seconds.
Project monitoring
The stability and performance of a UNICOM Intelligence cluster can vary greatly depending on the active projects. Large projects, with many variable instances, require more memory. Projects with complex routing require more CPU resources. Projects that use external COM objects (with CreateObject) can be impacted by the requirements of the external objects. For example, some objects might limit the number of simultaneous accesses, which causes contention issues in a server environment. Other objects might load files into memory, which increases the memory requirements. ADO objects might add load on the database server.
Many of the issues can be monitored and tracked with the performance counters. However, to monitor the actual performance of each project, and therefore the performance that is experienced by the respondents (or interviewers), it might be useful to monitor the per-project counters.
Active projects change over time, which makes monitoring per-project performance counters a challenge. Companies with programmatic activation processes have an advantage in that the counters can be added into the monitoring tool programmatically. Systems management and monitoring tools such as IBM Tivoli or Microsoft Systems Manager can also be used to monitor transient counters.
Refer to the Project performance counters table for suggested project counters.
Monitoring with log files
Log files can be useful for identifying project issues before they cause issues in the cluster. They are also useful when issues arise that need to be investigated. To get the most out of the log files, it is crucial that all the required log files are available. On busy clusters, regular archiving of the log files might be required, sometimes as frequently as every 2 hours. As your cluster load increases, check how quickly log files are overwritten and set up an archive process to ensure that no log files are lost. Log files do not need to be stored for a long period, but they do need to be available when an issue is reported. Keep logs for the largest interval between when an issue occurs and when it is investigated; a week's worth of log files is typically sufficient.
The Log DSC is useful when you investigate log files because it reduces the parsing requirements and allows the investigator to use SQL queries during the investigation. The Log DSC can be used with DM Query to view logs manually, or it can be used programmatically for automatic monitoring of log files.
For more information, see Log DSC and DM Query.
The UNICOM Intelligence Interviewer - Server Admin Manage Logs activity is often used to gather logs. The activity archives logs in server-specific subfolders. It is useful to structure log files for investigation in this manner so that you can query log files across all servers (or focus on a particular server). To view logs across all folders, select a directory (rather than a specific file) and include SearchSubFolders=True in the MR Init Custom property of the connection string.
If the log set is large, you can also filter the set. For example, use the following to filter the logs by type:
FileTypeFilter=IVW;SearchSubFolders=True
The log set can also be filtered by time, provided that the original creation and modified times are preserved in the log files. For example, the following filters by time and type:
FileDateFilter=2013/04/12 7:00PM-2013/04/12 11:59PM;FileTypeFilter=IVW;SearchSubFolders=True
In the following queries, the parts of the query that require substitution (such as file paths, character offsets, process IDs, and respondent IDs) must be adapted to your environment.
Useful queries
There is no recipe for log file investigation. However, some queries might be useful or might provide useful templates. The following MRS script might prove useful for initial log investigation. The script runs a subset of the following queries and exports the results to CSV files. You must set the LOG_FILES_FOLDER and TIER_TYPE constants before you run the script.
' *********************************************************************************
' Last updated: 2014-04-24
' This script uses the Log DSC to query log files
' It also runs some standard queries
' The standard queries depend on if the logs are Web or Interview tier logs
' It is expected that the LOG_FILES_FOLDER is the top level of a directory structure similar to the following
' LOG_FILES_FOLDER
' --- Engine machine 1
' --- Engine machine 2
' --- Engine machine 3
#define DELIMITER ","
' Specify the tier type as Web or Interview
#define TIER_TYPE "Interview"
' Specify the folder for the logs
#define LOG_FILES_FOLDER "C:\Logs\App"
' *********************************************************************************
' Open the logs using the Log DSC
' Specify SearchSubFolders=True to load logs from all sub folders
Dim ConnectionString, adoConnection
ConnectionString = "Provider=mrOleDB.Provider.2;" + _
"Initial Catalog=" + LOG_FILES_FOLDER + ";" + _
"MR Init MDSC=mrLogDsc;" + _
"MR Init Category Names=1;" + _
"MR Init Custom=SearchSubFolders=True"
Set adoConnection = CreateObject("ADODB.Connection")
adoConnection.Open(ConnectionString)
Dim OutputPath, ReportName, SQLQuery, HeaderString
OutputPath = LOG_FILES_FOLDER + "\"
' Find the start and length of the machine name
' Should be a subdirectory of the LOG_FILES_FOLDER
Dim MachineIndex, MachineLength, MachineExpression
MachineIndex = Len(OutputPath)
MachineLength = GetSubFolderLength(LOG_FILES_FOLDER)
MachineExpression = "Mid(LogFile," + CText(MachineIndex) + "," + CText(MachineLength) + ")"
' First run a query to get the log file overlap
ReportName = "LogOverlap"
SQLQuery = "SELECT Min(DateTime), Max(DateTime), __MachineFromLogFile__ " + _
"FROM VDATA " + _
"GROUP BY __MachineFromLogFile__"
HeaderString = "StartDateTime, EndDateTime, Machine"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
' Execute a set of standard SQL queries based on the tier type
If (UCase(TIER_TYPE) = "WEB") Then
' Get the number of IsActive timeouts in 15 minute intervals
ReportName = "IsActiveTimeoutsIn15MinuteIntervals"
SQLQuery = "SELECT DateTime.DateOnly() As Day, " + _
"CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour, " + _
"COUNT(*) As IsActiveErrors " + _
"FROM VDATA " + _
"WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning} " + _
"GROUP BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15 " + _
"ORDER BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15"
HeaderString = "Day,QuarterHour,IsActiveErrors"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
' Get the number of timeouts per question
ReportName = "TimeoutsPerProjectAndQuestion"
SQLQuery = "SELECT LogEntry.Mid(LogEntry.Find('I.Project') + 10, 8) As Project, " + _
"LogEntry.Mid(LogEntry.Find('HTTP Client')+ 12, 20) As Error, " + _
"IIf(LogEntry.Find('CreateInterview') = -1, IIf(LogEntry.Find('I.SavePoint') >= 0, LogEntry.Mid(LogEntry.Find('I.SavePoint') + 12, LogEntry.Find('&', LogEntry.Find('I.SavePoint')) - LogEntry.Find('I.SavePoint')-12), '[Post]'), '[CreateInterview]') As SavePoint, " + _
"COUNT(*) " + _
"FROM VDATA WHERE LogEntry.Find('http') = 0 and LogLevel = {error} " + _
"GROUP BY " + _
"LogEntry.Mid(LogEntry.Find('I.Project') + 10, 8), " + _
"LogEntry.Mid(LogEntry.Find('HTTP Client')+ 12), " + _
"IIf(LogEntry.find('CreateInterview') = -1, IIf(LogEntry.Find('I.SavePoint') >= 0, LogEntry.Mid(LogEntry.Find('I.SavePoint') + 12, LogEntry.Find('&', LogEntry.find('I.SavePoint')) - LogEntry.Find('I.SavePoint')-12), '[Post]'), '[CreateInterview]')"
HeaderString = "Project,Error,SavePoint,Count"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
' Template errors by project
ReportName = "TemplateErrorsByProject"
SQLQuery = "SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogEntry FROM VDATA " + _
"WHERE (LogLevel * {warning, error, fatal}) AND LogEntry LIKE '%Template%' " + _
"GROUP BY Project, LogEntry " + _
"ORDER BY Project, COUNT(*) DESC"
HeaderString = "Project,FirstTime,LastTime,Count,LogEntry"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
Else
' List the engine shutdowns and starts
ReportName = "EngineShutdownsAndStarts"
SQLQuery = "SELECT __MachineFromLogFile__, DateTime, Milliseconds, LogScope, LogEntry " + _
"FROM VDATA " + _
"WHERE LogEntry.Find('Session Engine s') >= 0"
HeaderString = "Machine,DateTime,Milliseconds,LogScope,LogEntry"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
' Count the interview starts and stops over all logs, hopefully the entire cluster
ReportName = "InterviewStartsAndStops"
SQLQuery = "SELECT " + _
"DateTime.DateOnly() As Day, " + _
"CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour, " + _
"SUM(LogEntry.Find('Interview start') = 0) As InterviewStarts, " + _
"SUM(LogEntry.Find('Interview restart') = 0) As InterviewRestarts, " + _
"SUM(LogEntry.Find('Interview terminated') = 0) As InterviewTerminates, " + _
"SUM(LogEntry.Find('Interview complete') = 0) As InterviewCompletes, " + _
"SUM(LogEntry = 'Interview timeout') As InterviewTimeouts, " + _
"SUM(LogEntry.Find('Abandoning') >= 0) As InterviewTimeoutAbandons, " + _
"SUM(LogEntry.Find('Interview stopped') = 0) As InterviewStops, " + _
"SUM(LogEntry.Find('Interview shutdown') = 0) As InterviewShutdowns, " + _
"SUM(LogEntry.Find('Interview rejected') = 0) As InterviewRejects " + _
"FROM VDATA " + _
"WHERE LogLevel = {metric} " + _
"GROUP BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15 " + _
"ORDER BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15"
HeaderString = "Day,QuarterHour,Start,Restart,Terminate,Complete,Timeout,TimeoutAbandon,Stop,Shutdown,Reject"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
' Count the interview starts and stops per logscope/engine
ReportName = "InterviewStartsAndStops_PerEngine"
SQLQuery = "SELECT " + _
"LogScope, " + _
"DateTime.DateOnly() As Day, " + _
"CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour, " + _
"SUM(LogEntry.Find('Interview start') = 0) As InterviewStarts, " + _
"SUM(LogEntry.Find('Interview restart') = 0) As InterviewRestarts, " + _
"SUM(LogEntry.Find('Interview terminated') = 0) As InterviewTerminates, " + _
"SUM(LogEntry.Find('Interview complete') = 0) As InterviewCompletes, " + _
"SUM(LogEntry = 'Interview timeout') As InterviewTimeouts, " + _
"SUM(LogEntry.Find('Abandoning') >= 0) As InterviewTimeoutAbandons, " + _
"SUM(LogEntry.Find('Interview stopped') = 0) As InterviewStops, " + _
"SUM(LogEntry.Find('Interview shutdown') = 0) As InterviewShutdowns, " + _
"SUM(LogEntry.Find('Interview rejected') = 0) As InterviewRejects " + _
"FROM VDATA " + _
"WHERE LogLevel = {metric} " + _
"GROUP BY " + _
"LogScope, " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15 " + _
"ORDER BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15"
HeaderString = "LogScope,Day,QuarterHour,Start,Restart,Terminate,Complete,Timeout,TimeoutAbandon,Stop,Shutdown,Reject"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
' Scripting errors by project
ReportName = "ScriptingErrorsByProject"
SQLQuery = "SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogEntry " + _
"FROM VDATA " + _
"WHERE (LogLevel * {error, fatal}) AND LEFT(LogEntry,4) <> 'HTTP' AND LEFT(LogEntry,26) <> 'The requested sample field' " + _
"GROUP BY Project, LogEntry " + _
"ORDER BY Project, COUNT(*) DESC"
HeaderString = "Project,FirstTime,LastTime,Count,LogEntry"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
' Sample errors by project
ReportName = "SampleErrorsByProject"
SQLQuery = "SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogLevel, LogEntry " + _
"FROM VDATA " + _
"WHERE (LogLevel * {warning, error, fatal}) AND (LogEntry LIKE 'The requested sample field%') " + _
"GROUP BY Project, LogEntry, LogLevel " + _
"ORDER BY Project, COUNT(*) DESC"
HeaderString = "Project,FirstTime,LastTime,Count,LogLevel,LogEntry"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
End If
adoConnection.Close()
' *********************************************************************************
Function ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
Dim fso, FileName, TextStream, OutputLine
Dim adoRecordSet, adoField
FileName = TIER_TYPE + "_" + ReportName + ".csv"
Debug.Log("Creating report " + FileName)
' Substitute the machine index to create a machine name
SQLQuery = Replace(SQLQuery, "__MachineFromLogFile__", MachineExpression)
Set adoRecordSet = adoConnection.Execute(SQLQuery)
If (adoRecordSet.BOF And adoRecordSet.EOF) Then
Debug.Log(ReportName + ": No records")
ExportSQLOutput = False
Exit Function
End If
Set fso = CreateObject("Scripting.FileSystemObject")
On Error Resume Next
fso.DeleteFile(OutputPath + FileName, True)
On Error GoTo ExportSQLOutput_ErrorHandler
Set TextStream = fso.OpenTextFile(OutputPath + FileName, 2 '! ForWriting !', True)
TextStream.WriteLine(HeaderString)
adoRecordSet.MoveFirst()
While (Not(adoRecordSet.EOF))
OutputLine = ""
For Each adoField In adoRecordSet.Fields
OutputLine = OutputLine + CText(adoField.Value) + DELIMITER
Next
TextStream.WriteLine(OutputLine)
adoRecordSet.MoveNext()
End While
TextStream.Close()
ExportSQLOutput = True
Exit Function
ExportSQLOutput_ErrorHandler:
Debug.Log(ReportName + ": Error at line " + CText(Err.LineNumber) + ": " + Err.Description)
ExportSQLOutput = False
End Function
' *********************************************************************************
Function GetSubFolderLength(FolderPath)
Dim fso, folder, subfolder
Dim SubFolderLength
SubFolderLength = 0
Set fso = CreateObject("Scripting.FileSystemObject")
' Note: no error handling; assumes that FolderPath exists and is readable
Set folder = fso.GetFolder(FolderPath)
For Each subfolder In folder.SubFolders
If (Len(subfolder.Name) > SubFolderLength) Then
SubFolderLength = Len(subfolder.Name)
End If
Next
GetSubFolderLength = SubFolderLength
End Function
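To run the script, save it with an .mrs extension (for example, QueryLogs.mrs) and run it with the mrScriptCL.exe command line runner that is installed with the UNICOM Intelligence desktop tools. The CSV reports are written to the LOG_FILES_FOLDER directory.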
All – finding the log file overlap
In many cases, an incomplete set of log files is delivered. The following query lists the start and end time for the log files in each directory, which identifies the period for which log files are available for all servers.
After the log file overlap is determined, the FileDateFilter setting (shown earlier) can be used to restrict the queries to that period. It is important to set up the WHERE clause to filter for the log file types that are important to the investigation, because each log file type is overwritten at a different rate.
The following example lists the minimum and maximum time in the IVW logs. The parameter for the call to Left depends on the log directory structure; it must be set to include the directory that identifies the server.
SELECT
Min(DateTime), Max(DateTime), LogFile.Left(68)
FROM VDATA
GROUP BY LogFile.Left(68)
All – finding errors
Use the following query to search for errors:
SELECT Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, LogLevel, Project, RespondentId, InterviewId, LogEntry
FROM VDATA
WHERE LogLevel * {error, fatal}
Use the following query to find all errors by frequency:
SELECT COUNT(*) AS Number, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, LogEntry
FROM VDATA
WHERE LogLevel * {error, fatal}
GROUP BY LogEntry
ORDER BY COUNT(*)
Use the following query to search for errors and warnings:
SELECT
Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, LogLevel, Project, RespondentId, InterviewId, Mid(LogEntry, 35, 23)
FROM VDATA
WHERE LogLevel * {warning, error, fatal}
The following query groups the errors and warnings by project:
SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogLevel, LogEntry
FROM VDATA
WHERE AppGroup = 'IVW' AND (LogLevel * {warning, error, fatal})
GROUP BY Project, LogEntry, LogLevel
ORDER BY Project, COUNT(*) DESC
You typically find extraneous errors and warnings in your log files. In particular, project errors are often ignored by scriptwriters. The errors might include missing template errors (which show up on the web tier) or missing sample fields (which show up on the interview tier). It is best to correct the errors, but when you monitor for stability and performance, you might want to create a query that filters out the persistent errors.
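For example, the following sketch extends the grouped errors-and-warnings query to exclude two persistent error types. The exclusion expressions are placeholders; adapt them to the persistent errors that appear in your own logs.
SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogLevel, LogEntry
FROM VDATA
WHERE (LogLevel * {warning, error, fatal})
AND LEFT(LogEntry,26) <> 'The requested sample field'
AND LogEntry.Find('Template') = -1
GROUP BY Project, LogEntry, LogLevel
ORDER BY Project, COUNT(*) DESC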
Interview tier – checking for engine shutdowns or starts
The following query lists engine shutdowns and starts, so they can be compared with the times of other issues:
SELECT
Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, LogEntry
FROM VDATA
WHERE LogEntry.Find('Session Engine s') >= 0
When there are numerous shutdowns and restarts, check for spikes with the following query. The query groups engine shutdowns and starts by minute, and includes the log file directory to show which server each engine runs on.
SELECT
Mid(LogFile, 54, 14),
DatePart(DateTime, 'h'),
DatePart(DateTime, 'n'),
SUM(LogEntry.Find('Interviewer Session Engine shutdown') = 0) As EngineShutdowns,
SUM(LogEntry.Find('Interviewer Session Engine start') = 0) As EngineStarts
FROM VDATA
WHERE LogEntry.Find('Interviewer Session Engine s') >= 0
GROUP BY
Mid(LogFile, 54, 14),
DatePart(DateTime, 'h'),
DatePart(DateTime, 'n')
The following query groups engine shutdown/starts into 15-minute intervals, to compare with values from other queries:
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
SUM(LogEntry.Find('Interviewer Session Engine shutdown') = 0) As EngineShutdowns,
SUM(LogEntry.Find('Interviewer Session Engine start') = 0) As EngineStarts
FROM VDATA
WHERE LogEntry.Find('Interviewer Session Engine s') = 0
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
Interview tier – counting the number of interview starts/restarts
The following query counts the number of interview starts or restarts in 15-minute intervals:
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
SUM(LogEntry.Find('Interview start') = 0) As InterviewStarts,
SUM(LogEntry.Find('Interview restart') = 0) As InterviewRestarts
FROM VDATA
WHERE LogLevel = {metric} AND (LogEntry.Find('Interview start') = 0 OR LogEntry.Find('Interview restart') = 0)
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
The following query counts all types of starts, restarts, and stops separately. Abandons are included in the timeouts.
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
SUM(LogEntry.Find('Interview start') = 0) As InterviewStarts,
SUM(LogEntry.Find('Interview restart') = 0) As InterviewRestarts,
SUM(LogEntry.Find('Interview terminated') = 0) As InterviewTerminates,
SUM(LogEntry.Find('Interview complete') = 0) As InterviewCompletes,
SUM(LogEntry = 'Interview timeout') As InterviewTimeouts,
SUM(LogEntry.Find('Abandoning') >= 0) As InterviewTimeoutAbandons,
SUM(LogEntry.Find('Interview stopped') = 0) As InterviewStops,
SUM(LogEntry.Find('Interview shutdown') = 0) As InterviewShutdowns,
SUM(LogEntry.Find('Interview rejected') = 0) As InterviewRejects
FROM VDATA
WHERE LogLevel = {metric}
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
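To break the counts down by engine, add LogScope to the select list and to the GROUP BY clause, as in the InterviewStartsAndStops_PerEngine report in the script above.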
Interview tier – counting the number of projects cached
The following query counts the number of projects that are cached in 15-minute intervals:
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
SUM(LogEntry.Find('has been cached') >= 0 And LogEntry.Find('Version') = 0) As CachedProjects
FROM VDATA
WHERE (LogEntry.Find('has been cached') >= 0 And LogEntry.Find('Version') = 0)
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
Interview tier – finding respondents who restart
The following query lists the respondents that restart an interview:
SELECT InterviewId, COUNT(*)
FROM VDATA
WHERE LogLevel = {metric}
GROUP BY InterviewId
HAVING COUNT(InterviewId) <> 2
Interview tier – counting restarts from the ACTIVE queue
The following query counts the records that restart from the ACTIVE queue, and divides them into records that are recognized as a failover, and records that fail to restart:
SELECT
DatePart(DateTime, 'h') As Hour, DatePart(DateTime, 'n') As Minute,
SUM(LogEntry.Find('Failover situation found') >= 0) As FailoverSituationFound,
SUM(LogEntry.Find('Rejected, sample in ACTIVE') >= 0) As RejectedFromACTIVE
FROM VDATA
WHERE (LogEntry.Find('Failover situation found') >= 0 OR LogEntry.Find('Rejected, sample in ACTIVE') >= 0)
GROUP BY
DateTime.DatePart('h'),
DateTime.DatePart('n')
Interview tier – investigating a particular respondent
Use the following query when a particular respondent experiences an error:
SELECT Mid(LogFile, 54), DateTime, Milliseconds, LogScope, Project, RespondentId, InterviewId, LogEntry
FROM VDATA
WHERE RespondentId = '1066724411'
ORDER BY DateTime
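When only the interview ID is known, the same query can be filtered on InterviewId instead (the ID that is shown is a placeholder):
SELECT Mid(LogFile, 54), DateTime, Milliseconds, LogScope, Project, RespondentId, InterviewId, LogEntry
FROM VDATA
WHERE InterviewId = '12345678'
ORDER BY DateTime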
Interview tier – finding errors by project
Project errors must be sent to the associated scriptwriter. The following query excludes HTTP errors and sample field errors; most of the remaining errors are project errors.
SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogEntry
FROM VDATA
WHERE (LogLevel * {error, fatal}) AND LEFT(LogEntry,4) <> 'HTTP' AND LEFT(LogEntry,26) <> 'The requested sample field'
GROUP BY Project, LogEntry
ORDER BY Project, COUNT(*) DESC
Interview tier – finding sample field errors by project
Sample field errors must be sent to the associated scriptwriter. Scriptwriters can avoid the errors by checking whether a sample field exists before they use it:
IOM.SampleRecord.Exists("SampleFieldName")
The following query finds the sample field errors by project:
SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogLevel, LogEntry
FROM VDATA
WHERE (LogLevel * {warning, error, fatal}) AND (LogEntry LIKE 'The requested sample field%')
GROUP BY Project, LogEntry, LogLevel
ORDER BY Project, COUNT(*) DESC
Web tier – counting the number of IsActive timeouts
IsActive timeouts can indicate that engines are spending too long processing script logic. The timeouts are an indication of high CPU, excessive script logic, or contention on an object that is used by the script.
The following example groups the IsActive timeouts by hour and minute, so that spikes can be identified:
SELECT
DatePart(DateTime, 'h') As Hour, DatePart(DateTime, 'n') As Minute, COUNT(*) As IsActiveTimeouts
FROM VDATA
WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning}
GROUP BY DatePart(DateTime, 'h'), DatePart(DateTime, 'n')
It might be useful to determine which engines are causing the spikes. The following query lists all of the IsActive entries, with the web server reporting the timeout and the engine that caused the issue:
SELECT
Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, Project, RespondentId, InterviewId, Mid(LogEntry, 35, 23)
FROM VDATA
WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning}
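To see which engines report the most timeouts, the same entries can be grouped on the engine portion of the log entry. The following sketch reuses the Mid offsets from the previous query; the offsets depend on your log format and might need adjustment.
SELECT Mid(LogEntry, 35, 23) As Engine, COUNT(*) As IsActiveTimeouts
FROM VDATA
WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning}
GROUP BY Mid(LogEntry, 35, 23)
ORDER BY COUNT(*) DESC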
When there are many IsActive timeouts, the following query can be used to group the timeouts into 15-minute intervals, to compare with values from other queries:
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
COUNT(*) As IsActiveErrors
FROM VDATA
WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning}
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
Web tier – counting the number of timeouts per question
When the cluster experiences failover due to IsActive timeouts, it is possible that particular projects are taking more time than expected. The following query can be used to find the specific projects and questions that experience timeouts, so that those projects and questions can be investigated further. Run the query against logs from before the bulk of the issues occurred, to verify which projects were consuming excessive CPU.
The query counts the number of timeouts by project and question type, so that timeouts that occur frequently can be investigated further:
SELECT
LogEntry.Mid(LogEntry.Find('I.Project') + 10, 8) As Project,
LogEntry.Mid(LogEntry.Find('HTTP Client')+ 12, 20) As Error,
IIf(LogEntry.Find('CreateInterview') = -1, IIf(LogEntry.Find('I.SavePoint') >= 0,
LogEntry.Mid(LogEntry.Find('I.SavePoint') + 12, LogEntry.Find('&', LogEntry.Find('I.SavePoint')) -
LogEntry.Find('I.SavePoint')-12), '[Post]'), '[CreateInterview]') As SavePoint,
COUNT(*)
FROM VDATA WHERE LogEntry.Find('http') = 0 And LogLevel = {error}
GROUP BY
LogEntry.Mid(LogEntry.Find('I.Project') + 10, 8),
LogEntry.Mid(LogEntry.Find('HTTP Client')+ 12),
IIf(LogEntry.find('CreateInterview') = -1, IIf(LogEntry.Find('I.SavePoint') >= 0,
LogEntry.Mid(LogEntry.Find('I.SavePoint') + 12, LogEntry.Find('&', LogEntry.find('I.SavePoint')) -
LogEntry.Find('I.SavePoint')-12), '[Post]'), '[CreateInterview]')
Web tier – finding template errors by project
Template errors must be sent to the associated template developer:
SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogEntry FROM VDATA
WHERE (LogLevel * {warning, error, fatal}) AND LogEntry.Find('Template') >= 0
GROUP BY Project, LogEntry
ORDER BY Project, COUNT(*) DESC
Determining the number of interviewers who are affected by an application pool crash
The LogScope field in each log message can be used to identify the process that logged the message. The LogScope contains the faulting process ID. You must first identify the process ID from the application crash event in the Windows Event Log. For example, in the following application event, the process ID is 0x19c8:
Level Date and Time Source EventID Task Category
Error 2013/4/16 1:42:50 Application Error 1000 Application Crashing Events
Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: Qsamp.dll, version: 1.1.1.8352, time stamp: 0x4fc3943e
Exception code: 0xc0000005
Fault offset: 0x0000000000019cfc
Faulting process id: 0x19c8
Faulting application start time: 0x01ce397e563dbf67
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: C:\Program Files\Common Files\IBM\SPSS\DataCollection\6\IOM\Qsamp.dll
On the interview tier, the following IVW log query can be used to find the number of interviewers who were logged on through the specified application pool. Replace the process ID in the WHERE clause with the process ID from the event log. The calculation is based on the available logs; the count is not accurate when log files are missing.
SELECT SUM(LogEntry LIKE 'Interviewer session%created%') -
SUM(LogEntry LIKE '%Interviewer session%closed')
FROM VDATA
WHERE LogScope LIKE '%0x19c8'
On the Accessories server, similar steps can be used to find out how many users are affected by an Interviewer Server Administration application pool crash.
▪'Logging on' is found for a user login to 'Phone Participants' in the CIV* logs.
▪'Logging off' is found for a user logout from 'Phone Participants' in the CIV* logs.
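For example, the following sketch estimates the number of affected Phone Participants users. It assumes the 'Logging on' and 'Logging off' entry texts that are described above and reuses the 0x19c8 process ID; as before, the count is only as accurate as the available logs.
SELECT SUM(LogEntry LIKE '%Logging on%') -
SUM(LogEntry LIKE '%Logging off%')
FROM VDATA
WHERE LogScope LIKE '%0x19c8'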
Listing activation errors
Use the following query to list the activation errors:
SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogLevel, AppGroup, LogEntry
FROM VDATA
WHERE (AppGroup = 'ACT') AND (LogLevel * {warning, error, fatal})
GROUP BY Project, LogEntry, LogLevel, AppGroup
ORDER BY Project, COUNT(*) DESC