Monitoring using log files
Log files are useful for catching project issues before they cause cluster issues, and for investigating issues after they arise. To get the most out of the log files, ensure that all of the required log files are available. On very busy clusters, this might require regular log file archival, sometimes as frequently as every two hours. As the cluster load increases, check the log file overwrite interval, and then set up an archive process to ensure that no log files are lost. Log files do not need to be stored for a long period of time, but they must be available when an issue is reported. Keep logs for at least the longest expected interval between an issue being reported and that issue being investigated; a week’s worth of log files is typically sufficient.
The Log DSC is useful when investigating log files because it reduces the parsing requirements and enables you to use SQL queries for investigation. The Log DSC can be used with DM Query to view logs, or it can be used programmatically to monitor log files automatically. For more information, see Log DSC and DM Query.
The UNICOM Intelligence Interviewer - Server Admin Manage Logs feature is often used to gather logs. The feature compresses logs into server-specific subfolders and is useful for structuring log files for investigation. It provides options for querying log files across all servers or focusing on a particular server. To view logs across all folders, select a directory instead of a specific file, and then set SearchSubFolders in the MR Init Custom property of the connection string.
When the log set is large, you can filter it. For example, use the following to filter logs by type:
FileTypeFilter=IVW;SearchSubFolders=True
The log set can also be filtered by time if the original created and modified times are available in the log files. For example, the following statement filters by time:
FileDateFilter=2013/04/12 7:00PM-2013/04/12 11:59PM;FileTypeFilter=IVW;SearchSubFolders=True
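These filter properties are passed to the Log DSC in the MR Init Custom property of the connection string. The following is a minimal sketch of a complete connection string; it assumes that the Log DSC is registered under the name mrLogDsc (its usual name) and that the gathered logs are in the hypothetical folder C:\GatheredLogs. The MR Init Custom value is quoted because it contains semicolons.
Provider=mrOleDB.Provider.2; Data Source=mrLogDsc; Location=C:\GatheredLogs; MR Init Custom="FileDateFilter=2013/04/12 7:00PM-2013/04/12 11:59PM;FileTypeFilter=IVW;SearchSubFolders=True"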
Useful queries
There is no fixed recipe for log file investigation. However, the following queries often prove useful, either directly or as templates.
All: Finding the log file overlap
There are many instances when an incomplete set of log files is delivered. This query lists the start and end times for the log files in each directory, thereby showing the period of time for which log files are available for all computers.
After the log file overlap has been determined, these filter settings can be used to restrict the log set to that period. Set the WHERE clause to filter for the log file types that are important to the investigation (each log file type is overwritten at a different rate).
The following example lists the minimum and maximum time in the IVW logs. The parameter for the call to Left depends on the log directory structure; set the parameter so that it includes the directory that identifies the server.
SELECT
Min(DateTime), Max(DateTime), LogFile.Left(68)
FROM VDATA
GROUP BY LogFile.Left(68)
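If the connection is not already restricted with FileTypeFilter, the WHERE clause can restrict the overlap check to a particular log type instead. The following is a sketch that assumes the IVW log file names (and therefore the LogFile values) contain the IVW prefix:
SELECT
Min(DateTime), Max(DateTime), LogFile.Left(68)
FROM VDATA
WHERE LogFile.Find('IVW') >= 0
GROUP BY LogFile.Left(68)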
All: Finding errors
Use the following query to search for errors:
SELECT
Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, LogLevel, Project, RespondentId, InterviewId, LogEntry
FROM VDATA
WHERE LogLevel = {error} OR LogLevel = {fatal}
Use the following query to search for errors and warnings:
SELECT
Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, LogLevel, Project, RespondentId, InterviewId, Mid(LogEntry, 35, 23)
FROM VDATA
WHERE LogLevel = {warning} OR LogLevel = {error} OR LogLevel = {fatal}
Extraneous errors and warnings are typically found in log files. In particular, there are often project errors that scriptwriters ignore, such as missing template errors that show up on the web tier, or missing sample fields that show up on the interview tier. It is best to correct these errors, but when monitoring for stability and performance, you might want to create a query that filters out those persistent errors.
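For example, the following sketch extends the error query to exclude a known, persistent project error; the excluded text ('Unable to open template') is only a placeholder for whatever error text you have already triaged:
SELECT
Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, LogLevel, Project, RespondentId, InterviewId, LogEntry
FROM VDATA
WHERE (LogLevel = {error} OR LogLevel = {fatal}) AND LogEntry.Find('Unable to open template') < 0
Because Find returns -1 when the text is not present, the additional condition keeps only the entries that do not contain the excluded text.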
Web tier: Counting the number of IsActive timeouts
IsActive timeouts can indicate that engines are spending too long processing script logic. This can be caused by high CPU usage, excessive script logic, or contention on an object that the script uses.
The following example groups the IsActive timeouts by hour and minute so that spikes can be identified.
SELECT
DatePart(DateTime, 'h') As Hour, DatePart(DateTime, 'n') As Minute, COUNT(*) As IsActiveTimeouts
FROM VDATA
WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning}
GROUP BY DatePart(DateTime, 'h'), DatePart(DateTime, 'n')
It might be useful to see which engines are causing the spikes. The following query lists all IsActive entries, showing the web tier computer that reported the timeout and the engine that caused the issue.
SELECT
Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, Project, RespondentId, InterviewId, Mid(LogEntry, 35, 23)
FROM VDATA
WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning}
If there are many IsActive timeouts, the following query groups them into 15-minute intervals so that the values can be compared with those from other queries.
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
COUNT(*) As IsActiveErrors
FROM VDATA
WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning}
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
Interview tier: Checking for engine shutdowns or starts
The following query lists engine shutdowns and starts, allowing them to be compared with times of other issues.
SELECT
Mid(LogFile, 54, 14), DateTime, Milliseconds, LogScope, Project, RespondentId, InterviewId, LogEntry
FROM VDATA
WHERE LogEntry.Find('Session Engine s') >= 0
If there are many shutdowns and starts, check for spikes by using the following query, which groups engine shutdowns and starts by hour and minute. The query includes the log file directory, which indicates the computer that each engine runs on.
SELECT
Mid(LogFile, 54, 14),
DatePart(DateTime, 'h'),
DatePart(DateTime, 'n'),
SUM(LogEntry.Find('Interviewer Session Engine shutdown') = 0) As EngineShutdowns,
SUM(LogEntry.Find('Interviewer Session Engine start') = 0) As EngineStarts
FROM VDATA
WHERE LogEntry.Find('Interviewer Session Engine s') >= 0
GROUP BY
Mid(LogFile, 54, 14),
DatePart(DateTime, 'h'),
DatePart(DateTime, 'n')
Use the following query to group engine shutdowns and starts into 15-minute intervals so that the values can be compared with those from other queries.
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
SUM(LogEntry.Find('Interviewer Session Engine shutdown') = 0) As EngineShutdowns,
SUM(LogEntry.Find('Interviewer Session Engine start') = 0) As EngineStarts
FROM VDATA
WHERE LogEntry.Find('Interviewer Session Engine s') = 0
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
Interview tier: Counting the number of interview starts and restarts
The following query counts the number of interview starts or restarts (in 15-minute intervals).
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
SUM(LogEntry.Find('Interview start') = 0) As InterviewStarts,
SUM(LogEntry.Find('Interview restart') = 0) As InterviewRestarts
FROM VDATA
WHERE LogLevel = {metric} AND (LogEntry.Find('Interview start') = 0 OR LogEntry.Find('Interview restart') = 0)
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
Web tier: Counting the number of timeouts per question
When a cluster is experiencing failover due to IsActive timeouts, particular projects might be taking more time than expected. The following query can be used to find the specific projects and questions that experience timeouts so that they can be investigated further. Run the query against a period before the majority of the issues were encountered in order to determine which projects are consuming excessive CPU (slowing the entire cluster and possibly causing failovers).
The following query counts the number of timeouts by project and question type; timeouts that occur frequently can be investigated further.
SELECT
LogEntry.Mid(LogEntry.Find('I.Project') + 10, 8) As Project,
LogEntry.Mid(LogEntry.Find('HTTP Client') + 12, 20) As Error,
IIf(LogEntry.Find('CreateInterview') = -1, IIf(LogEntry.Find('I.SavePoint') >= 0,
LogEntry.Mid(LogEntry.Find('I.SavePoint') + 12, LogEntry.Find('&', LogEntry.Find('I.SavePoint')) - LogEntry.Find('I.SavePoint') - 12), '[Post]'), '[CreateInterview]') As SavePoint,
COUNT(*)
FROM VDATA
WHERE LogEntry.Find('http') = 0 AND LogLevel = {error}
GROUP BY
LogEntry.Mid(LogEntry.Find('I.Project') + 10, 8),
LogEntry.Mid(LogEntry.Find('HTTP Client') + 12, 20),
IIf(LogEntry.Find('CreateInterview') = -1, IIf(LogEntry.Find('I.SavePoint') >= 0,
LogEntry.Mid(LogEntry.Find('I.SavePoint') + 12, LogEntry.Find('&', LogEntry.Find('I.SavePoint')) - LogEntry.Find('I.SavePoint') - 12), '[Post]'), '[CreateInterview]')
Interview tier: Counting the number of projects cached
The following query counts the number of projects cached in 15-minute intervals.
SELECT
DateTime.DateOnly() As Day,
CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour,
SUM(LogEntry.Find('cached') >= 0) As CachedProjects
FROM VDATA
WHERE LogEntry.Find('cached') >= 0
GROUP BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
ORDER BY
DateTime.DateOnly(),
DateTime.DatePart('h'),
DateTime.DatePart('n')/15
Interview tier: Finding respondents who restart
The following query lists the respondents that restart an interview.
SELECT InterviewId, COUNT(*) FROM VDATA WHERE LogLevel = {metric} GROUP BY InterviewId HAVING COUNT(InterviewId) <> 2
Interview tier: Investigating a particular respondent
Use the following query when a particular respondent experiences an error.
SELECT Mid(LogFile, 54), DateTime, Milliseconds, LogScope, Project, RespondentId, InterviewId, LogEntry FROM VDATA WHERE RespondentId = '1066724411' ORDER BY DateTime