UNICOM Intelligence monitoring guidelines

Group	Counter	Servers	When to monitor?	Alert at
Memory	% Committed Bytes In Use	All DC	When issues arise
Memory	Available MBytes	All DC	Always	Consistently < 20% of installed RAM indicates insufficient memory; if alerted, investigate which process is using the memory
Memory	Pages/sec	All DC	When issues arise	Sustained > 5
NetworkInterface	Bytes Total/sec	All DC	Benchmarking	Sustained > 80% of bandwidth
NetworkInterface	Output Queue Length	All DC	When issues arise
NetworkInterface	Packets/sec	All DC	When issues arise
Processor(_Total)	% Processor Time	All DC	Always	>= 80% for more than 1 minute; if alerted, investigate which process is using the CPU
System	Context Switches/sec	All DC	When issues arise
System	Processor Queue Length	All DC	Benchmarking	Average value > 2
ASP.NET	Active Threads	All DC	When issues arise
ASP.NET	Request Execution Time	All DC	When issues arise
ASP.NET	Request Wait Time	All DC	When issues arise
ASP.NET	Requests Queued	All DC	Always	> 0 for more than 1 minute
Interview Web [For all web tier instances]	Active Threads	Accessories, Web	Always	>= 30 per sec
Interview Web [For all web tier instances]	Average Response Time	Accessories, Web	Always	>= 500ms
Interview Web [For all web tier instances]	Current Queued Requests	Accessories, Web	Always	> 0 for more than 1 minute
Interview Web [For all web tier instances]	Engines Failed	Accessories, Web	Always	At each increase
Interview Web [For all web tier instances]	Server Requests/sec	Accessories, Web	Always	>= 50 per sec
Process [For all w3wp processes]	% Processor Time	All DC	Benchmarking	>= 80%
Process [For all w3wp processes]	Thread Count	All DC	When issues arise	>= 64 * cores + 200
Process [For all w3wp processes]	Virtual Bytes	All DC	When issues arise	Increasing without leveling off
Process [For all w3wp processes]	Virtual Bytes Peak	All DC	When issues arise
Process [For all w3wp processes]	Working Set	All DC	When issues arise
Process [For all w3wp processes]	Working Set Peak	All DC	When issues arise
APP_POOL_WAS(_Total)	Total Worker Process Failures	Interviewing	Always	At each increase
APP_POOL_WAS(_Total)	Total Application Pool Recycles	Interviewing	Always	Log only
Interview Engine [For all interview engines]	Active Threads	Interviewing	Always	>= 30 per sec
Interview Engine [For all interview engines]	Average Response Time	Interviewing	Always	>= 100ms
Interview Engine [For all interview engines]	Completes/sec	Interviewing	Always	Log only
Interview Engine [For all interview engines]	Current Interviews	Interviewing	Always	Log only. ConnectionLimit stops engines from overloading; use Percent Loaded for alerting
Interview Engine [For all interview engines]	Current Queued Requests	Interviewing	Always	> 0 for more than 1 minute
Interview Engine [For all interview engines]	Percent Loaded	Interviewing	Always	>= 80%
Interview Engine [For all interview engines]	Server Requests/sec	Interviewing	Always	>= 50 per sec
Interview Engine [For all interview engines]	Total Interviews	Interviewing	When issues arise
LogicalDisk [All Volumes]	% Free Space	All DC	Always	<= 20%
PhysicalDisk(_Total)	% Disk Time	Database, FMRoot	Always	>= 50%
PhysicalDisk(_Total)	Avg. Disk Read Queue Length	Database, FMRoot	Always	>= 2
PhysicalDisk(_Total)	Avg. Disk Write Queue Length	Database, FMRoot	Always	>= 2
Process(sqlservr)	% Processor Time	Database	Always	>= 80% for more than 1 minute; if alerted, investigate which process is using the CPU
Process(sqlservr)	Thread Count	Database	When issues arise
Process(sqlservr)	Virtual Bytes	Database	When issues arise
Process(sqlservr)	Virtual Bytes Peak	Database	When issues arise
Process(sqlservr)	Working Set	Database	When issues arise
Process(sqlservr)	Working Set Peak	Database	When issues arise
SQLServer:Access Methods	Full Scans/sec	Database	When issues arise
SQLServer:Buffer Manager	Buffer cache hit ratio	Database	When issues arise
SQLServer:General Statistics	Logical Connections	Database	When issues arise
SQLServer:General Statistics	User Connections	Database	When issues arise
SQLServer:Locks(_Total)	Lock Requests/sec	Database	Benchmarking
SQLServer:Locks(_Total)	Lock Waits/sec	Database	Benchmarking
SQLServer:Locks(_Total)	Number of Deadlocks/sec	Database	When issues arise
SQLServer:Memory Manager	Target Server Memory (KB)	Database	When issues arise
SQLServer:Memory Manager	Total Server Memory (KB)	Database	When issues arise
SQLServer:SQL Statistics	Batch Requests/sec	Database	When issues arise
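
Several thresholds in the table above only count when they hold for a period of time (for example, ">= 80% for more than 1 minute"), and one is a formula (Thread Count >= 64 * cores + 200). The following is a minimal Python sketch of how such rules might be evaluated against sampled counter values. The function names, the sample format (timestamp in seconds, value), and the 15-second sampling interval in the usage note are illustrative assumptions, not part of UNICOM Intelligence or Performance Monitor.

```python
def sustained_breach(samples, threshold, min_seconds):
    """Return True if the value stays at or above the threshold for
    longer than min_seconds without interruption.

    samples is a list of (timestamp_seconds, value) pairs in time
    order (a hypothetical format chosen for this sketch).
    """
    breach_start = None
    for timestamp, value in samples:
        if value >= threshold:
            if breach_start is None:
                # First sample of a new run at or above the threshold
                breach_start = timestamp
            if timestamp - breach_start > min_seconds:
                return True
        else:
            # The run was interrupted, so start over
            breach_start = None
    return False


def w3wp_thread_limit(cores):
    """Thread Count alert level from the table: 64 * cores + 200."""
    return 64 * cores + 200
```

With samples taken every 15 seconds, six consecutive readings at or above 80% span 75 seconds and would raise the "% Processor Time" alert, while a single dip below 80% restarts the clock. For an 8-core server, `w3wp_thread_limit(8)` gives 712 threads.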

Group	Counter	Servers	When to monitor?	Alert at
Interview Project	Web - Average time to start interview	Interviewing	Always	> 1 second
Interview Project	Web - Maximum time to start interview	Interviewing	Always	> 4 seconds
Interview Project	Web - Average time page-to-page	Interviewing	Always	> 1 second
Interview Project	Web - Maximum time page-to-page	Interviewing	Always	> 2 seconds
Interview Project	Telephone - Average time to start interview	Interviewing	Always	> 1 second
Interview Project	Telephone - Maximum time to start interview	Interviewing	Always	> 4 seconds
Interview Project	Telephone - Average time page-to-page	Interviewing	Always	> 1 second
Interview Project	Telephone - Maximum time page-to-page	Interviewing	Always	> 2 seconds
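
The Interview Project thresholds above pair an average limit with a maximum limit for each timing. As a rough illustration, the Python sketch below checks a batch of measured timings against both limits. The dictionary layout, the function name, and the idea of collecting timings into a list are assumptions made for this example; only the threshold values come from the table.

```python
from statistics import mean

# Alert levels in seconds, taken from the table above.
# The same values apply to both Web and Telephone interviews.
THRESHOLDS = {
    "start": {"average": 1.0, "maximum": 4.0},
    "page": {"average": 1.0, "maximum": 2.0},
}


def timing_alerts(kind, timings_seconds):
    """Return the list of limits ("average", "maximum") that are exceeded.

    kind is "start" (time to start interview) or "page" (page-to-page).
    """
    alerts = []
    if not timings_seconds:
        # Nothing was measured, so there is nothing to alert on
        return alerts
    limits = THRESHOLDS[kind]
    if mean(timings_seconds) > limits["average"]:
        alerts.append("average")
    if max(timings_seconds) > limits["maximum"]:
        alerts.append("maximum")
    return alerts
```

For example, page-to-page timings of 0.2, 0.3, 0.4 and 3.0 seconds average below 1 second but still breach the 2-second maximum, so only the "maximum" alert is returned. The same timings measured for interview starts raise no alert, because the start maximum is 4 seconds.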

' *********************************************************************************
' Last updated: 2014-04-24
' This script uses the Log DSC to query log files
' It also runs some standard queries
' The standard queries depend on whether the logs are Web or Interview tier logs
' LOG_FILES_FOLDER is expected to be the top level of a directory structure similar to the following:
' LOG_FILES_FOLDER
' --- Engine machine 1
' --- Engine machine 2
' --- Engine machine 3
#define DELIMITER ","
' Specify the tier type: "Web" runs the Web tier queries; any other value (such as "App" or "Interview") runs the Interview tier queries
#define TIER_TYPE "App"
' Specify the folder for the logs
#define LOG_FILES_FOLDER "C:\Logs\App"
' *********************************************************************************
' Open the logs using the Log DSC
' Specify SearchSubFolders=True to load logs from all sub folders
Dim ConnectionString, adoConnection
ConnectionString = "Provider=mrOleDB.Provider.2;" + _
"Initial Catalog=" + LOG_FILES_FOLDER + ";" + _
"MR Init MDSC=mrLogDsc;" + _
"MR Init Category Names=1;" + _
"MR Init Custom=SearchSubFolders=True"
Set adoConnection = CreateObject("ADODB.Connection")
adoConnection.Open(ConnectionString)
Dim OutputPath, ReportName, SQLQuery, HeaderString
OutputPath = LOG_FILES_FOLDER + "\"
' Find the start and length of the machine name
' Should be a subdirectory of the LOG_FILES_FOLDER
Dim MachineIndex, MachineLength, MachineExpression
MachineIndex = Len(OutputPath)
MachineLength = GetSubFolderLength(LOG_FILES_FOLDER)
MachineExpression = "Mid(LogFile," + CText(MachineIndex) + "," + CText(MachineLength) + ")"
' First run a query to get the log file overlap
ReportName = "LogOverlap"
SQLQuery = "SELECT Min(DateTime), Max(DateTime), __MachineFromLogFile__ " + _
"FROM VDATA " + _
"GROUP BY __MachineFromLogFile__"
HeaderString = "StartDateTime,EndDateTime,Machine"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
' Execute a set of standard SQL queries based on the tier type
If (UCase(TIER_TYPE) = "WEB") Then
' Get the number of IsActive timeouts in 15 minute intervals
ReportName = "IsActiveTimeoutsIn15MinuteIntervals"
SQLQuery = "SELECT DateTime.DateOnly() As Day, " + _
"CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour, " + _
"COUNT(*) As IsActiveErrors " + _
"FROM VDATA " + _
"WHERE LogEntry.Find('IsActive') >= 0 AND LogLevel = {warning} " + _
"GROUP BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15 " + _
"ORDER BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15"
HeaderString = "Day,QuarterHour,IsActiveErrors"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)

' Get the number of timeouts per project and question (save point)
ReportName = "TimeoutsPerProjectAndQuestion"
SQLQuery = "SELECT LogEntry.Mid(LogEntry.Find('I.Project') + 10, 8) As Project, " + _
"LogEntry.Mid(LogEntry.Find('HTTP Client')+ 12, 20) As Error, " + _
"IIf(LogEntry.Find('CreateInterview') = -1, IIf(LogEntry.Find('I.SavePoint') >= 0, LogEntry.Mid(LogEntry.Find('I.SavePoint') + 12, LogEntry.Find('&', LogEntry.Find('I.SavePoint')) - LogEntry.Find('I.SavePoint')-12), '[Post]'), '[CreateInterview]') As SavePoint, " + _
"COUNT(*) " + _
"FROM VDATA WHERE LogEntry.Find('http') = 0 and LogLevel = {error} " + _
"GROUP BY " + _
"LogEntry.Mid(LogEntry.Find('I.Project') + 10, 8), " + _
"LogEntry.Mid(LogEntry.Find('HTTP Client')+ 12, 20), " + _
"IIf(LogEntry.find('CreateInterview') = -1, IIf(LogEntry.Find('I.SavePoint') >= 0, LogEntry.Mid(LogEntry.Find('I.SavePoint') + 12, LogEntry.Find('&', LogEntry.find('I.SavePoint')) - LogEntry.Find('I.SavePoint')-12), '[Post]'), '[CreateInterview]')"
HeaderString = "Project,Error,SavePoint,Count"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)

' Template errors by project
ReportName = "TemplateErrorsByProject"
SQLQuery = "SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogEntry FROM VDATA " + _
"WHERE (LogLevel * {warning, error, fatal}) AND LogEntry LIKE '%Template%' " + _
"GROUP BY Project, LogEntry " + _
"ORDER BY Project, COUNT(*) DESC"
HeaderString = "Project,FirstTime,LastTime,Count,LogEntry"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
Else
' List the engine shutdowns and starts
ReportName = "EngineShutdownsAndStarts"
SQLQuery = "SELECT __MachineFromLogFile__, DateTime, Milliseconds, LogScope, LogEntry " + _
"FROM VDATA " + _
"WHERE LogEntry.Find('Session Engine s') >= 0"
HeaderString = "Machine,DateTime,Milliseconds,LogScope,LogEntry"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)

' Count the interview starts and stops over all logs, hopefully the entire cluster
ReportName = "InterviewStartsAndStops"
SQLQuery = "SELECT " + _
"DateTime.DateOnly() As Day, " + _
"CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour, " + _
"SUM(LogEntry.Find('Interview start') = 0) As InterviewStarts, " + _
"SUM(LogEntry.Find('Interview restart') = 0) As InterviewRestarts, " + _
"SUM(LogEntry.Find('Interview terminated') = 0) As InterviewTerminates, " + _
"SUM(LogEntry.Find('Interview complete') = 0) As InterviewCompletes, " + _
"SUM(LogEntry = 'Interview timeout') As InterviewTimeouts, " + _
"SUM(LogEntry.Find('Abandoning') >= 0) As InterviewTimeoutAbandons, " + _
"SUM(LogEntry.Find('Interview stopped') = 0) As InterviewStops, " + _
"SUM(LogEntry.Find('Interview shutdown') = 0) As InterviewShutdowns, " + _
"SUM(LogEntry.Find('Interview rejected') = 0) As InterviewRejects " + _
"FROM VDATA " + _
"WHERE LogLevel = {metric} " + _
"GROUP BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15 " + _
"ORDER BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15"
HeaderString = "Day,QuarterHour,Start,Restart,Terminate,Complete,Timeout,TimeoutAbandon,Stop,Shutdown,Reject"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)

' Count the interview starts and stops per logscope/engine
ReportName = "InterviewStartsAndStops_PerEngine"
SQLQuery = "SELECT " + _
"LogScope, " + _
"DateTime.DateOnly() As Day, " + _
"CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15) As QuarterHour, " + _
"SUM(LogEntry.Find('Interview start') = 0) As InterviewStarts, " + _
"SUM(LogEntry.Find('Interview restart') = 0) As InterviewRestarts, " + _
"SUM(LogEntry.Find('Interview terminated') = 0) As InterviewTerminates, " + _
"SUM(LogEntry.Find('Interview complete') = 0) As InterviewCompletes, " + _
"SUM(LogEntry = 'Interview timeout') As InterviewTimeouts, " + _
"SUM(LogEntry.Find('Abandoning') >= 0) As InterviewTimeoutAbandons, " + _
"SUM(LogEntry.Find('Interview stopped') = 0) As InterviewStops, " + _
"SUM(LogEntry.Find('Interview shutdown') = 0) As InterviewShutdowns, " + _
"SUM(LogEntry.Find('Interview rejected') = 0) As InterviewRejects " + _
"FROM VDATA " + _
"WHERE LogLevel = {metric} " + _
"GROUP BY " + _
"LogScope, " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15 " + _
"ORDER BY " + _
"DateTime.DateOnly(), " + _
"DateTime.DatePart('h'), " + _
"DateTime.DatePart('n')/15"
HeaderString = "LogScope,Day,QuarterHour,Start,Restart,Terminate,Complete,Timeout,TimeoutAbandon,Stop,Shutdown,Reject"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)

' Scripting errors by project
ReportName = "ScriptingErrorsByProject"
SQLQuery = "SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogEntry " + _
"FROM VDATA " + _
"WHERE (LogLevel * {error, fatal}) AND LEFT(LogEntry,4) <> 'HTTP' AND LEFT(LogEntry,26) <> 'The requested sample field' " + _
"GROUP BY Project, LogEntry " + _
"ORDER BY Project, COUNT(*) DESC"
HeaderString = "Project,FirstTime,LastTime,Count,LogEntry"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)

' Sample errors by project
ReportName = "SampleErrorsByProject"
SQLQuery = "SELECT Project, MIN(DateTime) AS FirstTime, MAX(DateTime) AS LastTime, COUNT(*) AS Number, LogLevel, LogEntry " + _
"FROM VDATA " + _
"WHERE (LogLevel * {warning, error, fatal}) AND (LogEntry LIKE 'The requested sample field%') " + _
"GROUP BY Project, LogEntry, LogLevel " + _
"ORDER BY Project, COUNT(*) DESC"
HeaderString = "Project,FirstTime,LastTime,Count,LogLevel,LogEntry"
ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
End If
adoConnection.Close()
' *********************************************************************************
Function ExportSQLOutput(adoConnection, ReportName, HeaderString, SQLQuery, MachineExpression, OutputPath)
    Dim fso, FileName, TextStream, OutputLine
    Dim adoRecordSet, adoField
    FileName = TIER_TYPE + "_" + ReportName + ".csv"
    Debug.Log("Creating report " + FileName)

    ' Substitute the machine expression to extract the machine name from the log file path
    SQLQuery = Replace(SQLQuery, "__MachineFromLogFile__", MachineExpression)
    Set adoRecordSet = adoConnection.Execute(SQLQuery)
    If (adoRecordSet.BOF And adoRecordSet.EOF) Then
        Debug.Log(ReportName + ": No records")
        ExportSQLOutput = False
        Exit Function
    End If

    Set fso = CreateObject("Scripting.FileSystemObject")
    ' Delete any previous report; ignore the error if the file does not exist
    On Error Resume Next
    fso.DeleteFile(OutputPath + FileName, True)
    On Error GoTo ExportSQLOutput_ErrorHandler
    Set TextStream = fso.OpenTextFile(OutputPath + FileName, 2 '! ForWriting !', True)
    TextStream.WriteLine(HeaderString)

    ' Write one delimited line per record
    adoRecordSet.MoveFirst()
    While (Not(adoRecordSet.EOF))
        OutputLine = ""
        For Each adoField In adoRecordSet.Fields
            OutputLine = OutputLine + CText(adoField.Value) + DELIMITER
        Next
        TextStream.WriteLine(OutputLine)
        adoRecordSet.MoveNext()
    End While
    TextStream.Close()

    ExportSQLOutput = True

    Exit Function

ExportSQLOutput_ErrorHandler:
    Debug.Log(ReportName + ": Error at line " + CText(Err.LineNumber) + ": " + Err.Description)
    ExportSQLOutput = False
End Function
' *********************************************************************************
Function GetSubFolderLength(FolderPath)
    Dim fso, folder, subfolder
    Dim SubFolderLength

    SubFolderLength = 0

    Set fso = CreateObject("Scripting.FileSystemObject")

    ' TODO - Error
    ' Return the length of the longest sub folder (machine) name
    Set folder = fso.GetFolder(FolderPath)
    For Each subfolder In folder.SubFolders
        If (Len(subfolder.Name) > SubFolderLength) Then
            SubFolderLength = Len(subfolder.Name)
        End If
    Next

    GetSubFolderLength = SubFolderLength
End Function
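
Several of the queries in the script group log entries into 15-minute intervals with the expression CText(DateTime.DatePart('h'))+':'+CText(DateTime.DatePart('n')/15 * 15). To make the intent of that bucketing easier to follow, here is a small sketch of the same idea in Python rather than mrScriptBasic. It assumes that the division truncates to a whole number, as the QuarterHour column name suggests; the function names are made up for this example.

```python
from collections import Counter
from datetime import datetime


def quarter_hour_label(timestamp):
    """Build an unpadded "hour:minute" label for the start of the
    15-minute interval that contains the timestamp."""
    # Integer division drops the minutes past the quarter hour
    bucket_minute = timestamp.minute // 15 * 15
    return f"{timestamp.hour}:{bucket_minute}"


def count_per_quarter_hour(timestamps):
    """Count entries per (day, quarter hour), similar in spirit to the
    GROUP BY on DateOnly, hour and minute/15 in the script's queries."""
    counts = Counter()
    for timestamp in timestamps:
        key = (timestamp.date().isoformat(), quarter_hour_label(timestamp))
        counts[key] += 1
    return counts
```

An entry logged at 09:37 falls into the interval labeled "9:30", and one logged at 09:05 into "9:0". The labels are not zero-padded because the sketch simply concatenates the numbers, as the query expression does.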