Data management event scripts
The OnNextCase event is the primary focus for improving data management scripting performance. The OnNextCase event is called for each output record and consequently provides the best opportunity for improving overall DMS performance.
While a number of different customer scripts were used to analyze DMS performance, one script in particular was used to benchmark performance. The metadata characteristics of that data management script, which was used for performance tuning, are as follows:
Metadata characteristics
Fields: 468
Variable instances: 27640
Categories: 3335
Records: 200533
OnNextCase event (lines of script): 3750
Unless stated otherwise, the estimated performance improvements are based on the previously mentioned script.
Note: All of the examples that are provided in the following section are anonymized.
Performance improvement methods
A number of different methods were used to improve data management performance.
Scripting engine performance. Event sections within a DMS are executed by the UNICOM Intelligence scripting engine. The scripting engine parses the script to a P-code representation, which is then interpreted at runtime.
Data management function libraries. Interpreted scripts are not as fast as compiled native code. As a first step towards reducing the amount of interpreted script, common data management functions are ported to a compiled language, such as C++, and released as a data management function library.
Improving tabulation features to reduce data management. Analysis of several different DMS scripts has revealed that a large portion of the script that is added in the OnNextCase event is designed to work around limitations in the tabulation products. When the tabulation products meet the project's reporting needs directly, the amount of data management script can be reduced.
Scripting guidelines. The final approach that was taken for improving data management performance was to investigate techniques that can be used to create more efficient and easier to maintain scripts. The guidelines that are established from this investigation are documented in the following sections.
The core UNICOM Intelligence DMOM objects were rewritten in C++, with a focus on improving performance for data cleaning operations in the OnNextCase event. The following performance improvements were made:
Core DMOM objects are now implemented in C++. This avoids the interop cost of calls from the scripting engine.
Optimized by-name collection access. Most data management scripts access child-level questions by name, so name-based collection access was improved.
Reduced data copying. Heap memory allocation and copying were identified as a serious bottleneck in the execution of data management scripts. Data copying is now avoided where possible.
The following performance improvement methods were also investigated, but are not yet included in UNICOM Intelligence:
Multiple-thread execution of OnNextCase. With the proliferation of multiple-core processors, an obvious performance improvement is the parallel execution of cases through the OnNextCase event. Unfortunately, implementing multiple-threaded OnNextCase execution is not without its challenges. In particular, the DMOM classes, initially written for single-threaded access, need to be enhanced for concurrent access. The reading and writing of cases also needs to be synchronized such that cases are written in the same order as they are read.
Compiled script. Beyond the manual porting of script functions to a compiled function library, other possibilities exist for compilation of script to native code. For example:
Automatic conversion of script to a compiled language. To provide the best performance improvement, the script needs to be converted to a statically typed, compiled language (such as C++). Unfortunately, the dynamic nature of UNICOM Intelligence script makes it very difficult to provide 100% conversion rates. Also, a third-party compiler is required to compile the generated code.
Just-in-Time (JIT) compilation. Switching from an interpreted scripting engine to JIT compilation would provide a significant performance improvement. Instead of parsing scripts to the P-code that is used by the scripting engine interpreter, the parser would generate byte code that can be used by a JIT runtime environment (such as Mono or the JRE). Although JIT compilation would provide significant performance gains for data management scripts, it requires significant effort to implement and is less useful for server-side interview scripts (where greater control of execution and state is required).
Scripting engine performance
Timing analysis, for the execution of a typical data management script, has shown that between 40% and 50% of execution time is spent in the scripting engine. Overall scripting engine performance has significantly improved since the UNICOM Intelligence 5.6 release.
Of the time that is spent in the scripting engine, most of the time is spent in runtime type discovery. For each call to an unknown object, such as an object that is assigned to a temporary variable, the scripting engine needs to check and then validate the property or method call. Before the 5.6 release, the scripting engine would validate by enumerating the method or property as it was called, but would not retain the type information for future calls. Starting with the 5.6 release, the scripting engine determines the type of the object and then checks to see whether type information is already held for the object. If type information is not held for the object, all of the methods and properties are enumerated for the object and the information is added to the types collection for future calls to objects of the same type.
The change to runtime type discovery resulted in a performance improvement of approximately 15% (based on the DMS used for benchmarking). Because the scripting engine accounts for 40% to 50% of total execution time, a 15% overall gain corresponds to an improvement of at least 30% (15 / 50) within the scripting engine itself.
General guidelines
The following section provides general scripting guidelines for improving performance in the OnNextCase event.
Use temporary variables
Use a temporary variable to store the expression result instead of repeating the expression when a series of statements uses the same expression (for example, you are comparing the result of a function call several times).
For example, in the following script the defined categories are requested for the Desired variable in each expression:
If BrandAttr[{ABC}].Attribs.ContainsAny(Desired.DefinedCategories()) Then
    DesiredBrands = DesiredBrands + {ABC}
End If
If BrandAttr[{Green}].Attribs.ContainsAny(Desired.DefinedCategories()) Then
    DesiredBrands = DesiredBrands + {Green}
End If
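As a minimal sketch, the same two checks could be rewritten so that the repeated expression is evaluated only once. The variable name DesiredAttributes is an illustrative choice, and DesiredBrands is assumed to be the same variable that the script above assigns to:

```mrscriptbasic
Dim DesiredAttributes

' Evaluate the repeated expression once and keep the result.
DesiredAttributes = Desired.DefinedCategories()

If BrandAttr[{ABC}].Attribs.ContainsAny(DesiredAttributes) Then
    DesiredBrands = DesiredBrands + {ABC}
End If
If BrandAttr[{Green}].Attribs.ContainsAny(DesiredAttributes) Then
    DesiredBrands = DesiredBrands + {Green}
End If
```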
Similarly, if the expression is used within a For...Next statement, consider using a temporary variable. For example:
Dim Brand, DesiredAttributes
DesiredAttributes = Desired.DefinedCategories()
For Each Brand in BrandAttr
    If Brand.Attribs.ContainsAny(DesiredAttributes) Then
        DesiredBrands = DesiredBrands + CCategorical(Brand.QuestionName)
    End If
Next
The benefits of using temporary variables include the following:
Potentially less script needs to be written because the expression does not need to be repeated. Declaring and assigning the temporary variable might, however, result in more lines of script.
Improved performance because the expression is run only once. Converting the benchmark script to use temporary variables, where expressions were repeated more than twice, resulted in an improvement of approximately 8%.
Use the With statement
Use a With statement to define the portion of code that is common to all statements when a series of statements applies to the same object (for example, when you are comparing the value of different nested questions on the same class). Using a With statement prevents the need to type each statement in full.
The following script summarizes questions from a nested loop:
If BrandAttr[{ABC}].Model[{X}].Rating = {_1} Then
    ABC_X_Summary = ABC_X_Summary + {Rating}
ElseIf BrandAttr[{ABC}].Model[{X}].Quality = {_1} Then
    ABC_X_Summary = ABC_X_Summary + {Quality}
ElseIf BrandAttr[{ABC}].Model[{X}].Price = {_1} Then
    ABC_X_Summary = ABC_X_Summary + {Price}
End If
Using a With statement reduces the script:
With BrandAttr[{ABC}].Model[{X}]
    If .Rating = {_1} Then
        ABC_X_Summary = ABC_X_Summary + {Rating}
    ElseIf .Quality = {_1} Then
        ABC_X_Summary = ABC_X_Summary + {Quality}
    ElseIf .Price = {_1} Then
        ABC_X_Summary = ABC_X_Summary + {Price}
    End If
End With
In another example, this time from an interview script, the script assigns several properties on the same object:
Sports.Label.Style.Font.Family = "'Palatino', 'Times New Roman'"
Sports.Label.Style.Font.Size = 16
Sports.Label.Style.Font.Effects = FontEffects.feBold
When a With statement is used, the script is simplified as follows:
With Sports.Label.Style.Font
    .Family = "'Palatino', 'Times New Roman'"
    .Size = 16
    .Effects = FontEffects.feBold
End With
With statement benefits
Calls to the object are grouped within the With statement, which makes the script easier to read and maintain.
Less script needs to be written; the full object name does not need to be repeated.
Improved performance when the object that is being accessed is from a child collection in the object hierarchy. Using the With statement causes the object look-up to occur only once. Converting the benchmark script to use the With statement resulted in a performance improvement of approximately 8%.
Use Select Case to simplify If...Then...Else logic
Use a Select Case statement, instead of If...Then...Else logic, when a series of conditional statements run comparisons on the same variable.
In the following example, If...Then...Else logic is used to band a numeric variable:
If Brands[{Green}].Quantity = 0 Then
    GreenQuantity = {Band1}
ElseIf Brands[{Green}].Quantity >= 1 And Brands[{Green}].Quantity <= 5 Then
    GreenQuantity = {Band2}
ElseIf Brands[{Green}].Quantity >= 6 And Brands[{Green}].Quantity <= 10 Then
    GreenQuantity = {Band3}
ElseIf Brands[{Green}].Quantity >= 11 And Brands[{Green}].Quantity <= 20 Then
    GreenQuantity = {Band4}
ElseIf Brands[{Green}].Quantity >= 21 And Brands[{Green}].Quantity <= 40 Then
    GreenQuantity = {Band5}
ElseIf Brands[{Green}].Quantity >= 41 And Brands[{Green}].Quantity <= 70 Then
    GreenQuantity = {Band6}
ElseIf Brands[{Green}].Quantity > 70 Then
    GreenQuantity = {Band7}
End If
Using Select Case reduces the script to:
Select Case Brands[{Green}].Quantity
    Case 0
        GreenQuantity = {Band1}
    Case 1 To 5
        GreenQuantity = {Band2}
    Case 6 To 10
        GreenQuantity = {Band3}
    Case 11 To 20
        GreenQuantity = {Band4}
    Case 21 To 40
        GreenQuantity = {Band5}
    Case 41 To 70
        GreenQuantity = {Band6}
    Case > 70
        GreenQuantity = {Band7}
End Select
Note: The comparison operators run set comparisons when they are used with categorical variables. Set comparisons can be especially useful when they are used in the Select Case statement. For example, the following Case clause is selected when BrandsQ contains the Red category:
Select Case BrandsQ
Case >= {Red}
...
In another example, when the variable that is being compared is a single response categorical (one that contains only one answer), the following syntax can be used:
Select Case Country
    Case <= {Brazil,Chile,Colombia,Ecuador,Paraguay,Peru,Uruguay,Venezuela}
        Continent = {South_America}
End Select
Note: This Case clause is also selected when Country is NULL or empty ({}), because an empty category list is a subset of any category list. The script should therefore also verify that the question was answered.
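A minimal sketch of verifying that the question was answered before the comparison is run is shown below. It assumes that the AnswerCount function returns 0 for a NULL or empty response, so that unanswered cases skip the Select Case statement; the variable and category names are taken from the example above:

```mrscriptbasic
' Only assign a continent when Country holds an answer.
If Country.AnswerCount() > 0 Then
    Select Case Country
        Case <= {Brazil,Chile,Colombia,Ecuador,Paraguay,Peru,Uruguay,Venezuela}
            Continent = {South_America}
    End Select
End If
```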
Select Case statement benefits
Less script needs to be written; the variable that is being compared does not need to be repeated. The script is also easier to read and maintain.
Improved performance when the object that is being compared is from a child collection in the object hierarchy. Using the Select Case statement causes the object look-up to occur only once.
Use For Each...Next when setting multiple properties
Use a single For Each statement, rather than [..], when more than one property is set on collection objects. Each [..] instance is equivalent to a For Each statement, so it is more efficient to iterate through the collection only once.
In the following script, each of the nested loop questions is initialized with [..]:
BrandsLoop[..].Price = NULL
BrandsLoop[..].Rating = NULL
BrandsLoop[..].Quantity = 0
To increase performance, the previous script must instead be written as follows:
Dim Brand
For Each Brand in BrandsLoop
    With Brand
        .Price = NULL
        .Rating = NULL
        .Quantity = 0
    End With
Next
Note: [..] should be used in cases where just one property is set on the iterated collection objects.
For Each...Next statement benefits
Calls to the object can be grouped within a With statement, which makes the script easier to read and maintain.
Performance is improved because the collection is iterated only once.
Use categorical literals
Use categorical literals instead of string literals when running categorical operations. In the following example, a string literal is used to simplify the look-up of key attributes in the attributes loop:
Dim Attribute
For Each Attribute in Split("Price,Quality,Appearance", ",")
    KeyBrands = KeyBrands + Attribs[Attribute].Brands
Next
A categorical literal should be used instead of the string literal. For example:
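Dim Attribute
For Each Attribute in {Price, Quality, Appearance}
    ' Attribute holds the category value as a long, so convert it
    ' to a categorical before using it for the collection look-up.
    KeyBrands = KeyBrands + Attribs[CCategorical(Attribute)].Brands
Next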
Note: When For Each...Next is used to iterate over a categorical list, each category value is stored in the loop variable as a long. The CCategorical function must be used to convert the category value into a categorical so that it can be used for look-ups in question or category collections. If the value is not converted to a categorical, the collection interprets the value as a numeric index and attempts to retrieve the item by its position in the collection, not by its category value.
Categorical literal benefits
Categorical literals are converted to category values when parsed. As a result, the category names are validated when the script is parsed, not when the script is run, which can help identify mistyped names earlier in the DMS process.
Performance can be improved when a categorical literal is used in categorical expressions.
Use categorical set logic
Use categorical set logic to combine, filter, and remove answers. The following operators can be used with categorical sets:
Union: The union (+) operator runs a union of two category lists. The union operator is used when the responses from 2 different questions are combined. In the following example, the total awareness question is assigned the union of the Spontaneous and Prompted questions:
TotalAwareness = Spontaneous + Prompted
Intersection: The Intersection (*) operator runs an intersection of two category lists. The Intersection operator is used when one category list is filtered by another. In the following example, the answers from Q4Tot are filtered by the categories in Q5:
For Each Code in Q5.Categories
    If Q4Tot.ContainsAny(Code) Then
        Q5 = Q5 + CCategorical(Code)
    End If
Next
Using the Intersection operator, the previous script can be simplified as follows:
Q5 = Q4Tot * Q5.DefinedCategories()
Has Intersection: The Has Intersection (=*) operator returns True when the intersection of the two values is not empty. A common scenario when creating derived elements for tabulation is to check whether the answers for a variable contain at least one of the categories in a given category set. While the Intersection operator might be used for this purpose, there is a cost to returning the intersection category set. Because only a Boolean result is required, the Has Intersection operator provides better performance in this scenario. For example:
Galleries expression('museums =* {National_Art_Gallery, Northern_Gallery}')
Note: The FilterBy function can be used in place of the Intersection operator. The FilterBy function is equivalent to the Intersection operator, but its name makes its intended purpose more obvious.
Q5 = Q4Tot.FilterBy(Q5.DefinedCategories())
Difference: The difference (-) operator returns categories that are in the category list on the left, but not in the category list on the right. The difference operator should be used when removing responses. The following example shows how special responses are removed from the analysis variable.
AnalysisBrands = BrandsQ - {DK, NA}
XUnion: The exclusive union (/) operator returns the categories that are in either category list, but not in both category lists. The exclusive union operator is used to remove categories that are answered in both questions. The following example shows how brands that are selected for price or quality are returned, but brands selected for both are not.
BrandsPriceOrQuality = Attribs[{Price}].Brands / Attribs[{Quality}].Brands
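The operators described above can be compared side by side on two small category lists. The following is a sketch only: the variables A, B, and Result are hypothetical, and the category names are assumed to exist in the project metadata so that the literals can be parsed. The comments list the expected members of each result, as implied by the operator descriptions above:

```mrscriptbasic
Dim A, B, Result
A = {Red, Green, Blue}
B = {Green, Yellow}

Result = A + B       ' Union: Red, Green, Blue, and Yellow
Result = A * B       ' Intersection: Green
Result = A - B       ' Difference: Red and Blue
Result = A / B       ' Exclusive union: Red, Blue, and Yellow
Result = (A =* B)    ' Has Intersection: True, because Green is in both lists
```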
Notes
Each of the categorical operators has a corresponding function that can be used when the operators are unfamiliar. Each function is named after its set operation. For example, instead of using the intersection (*) operator, the function name can be used:
Q5 = Intersection(Q4Tot, Q5.DefinedCategories())
The category order from the left side is maintained when running categorical set operations. If the order of mention is important, the question where the order of mention needs to be preserved must be on the left side of the expression. In the following example, the answered categories are filtered, but the order of mention is preserved from the BrandsQ question:
AnalysisBrands = BrandsQ * {Red, Green, Blue, Yellow}
The following script adds categories to the banner variable in a specific order:
If Segments.ContainsAny({Seg8}) Then Banner=Banner+{Seg8}
If Segments.ContainsAny({Seg5}) Then Banner=Banner+{Seg5}
If Segments.ContainsAny({Seg3}) Then Banner=Banner+{Seg3}
If Segments.ContainsAny({Seg2}) Then Banner=Banner+{Seg2}
If Segments.ContainsAny({Seg4}) Then Banner=Banner+{Seg4}
If Segments.ContainsAny({Seg6}) Then Banner=Banner+{Seg6}
If Segments.ContainsAny({Seg1}) Then Banner=Banner+{Seg1}
If Segments.ContainsAny({Seg7}) Then Banner=Banner+{Seg7}
Using the Intersection operator, the script can be simplified as follows:
Banner = {Seg8,Seg5,Seg3,Seg2,Seg4,Seg6,Seg1,Seg7} * Segments
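The function equivalents mentioned in the notes above can be sketched for the other operators as well. This assumes that the functions follow the same naming pattern as Intersection; the names Union, Difference, and XUnion are taken from the set operation names and should be confirmed against the function library reference. The variables are those used in the earlier operator examples:

```mrscriptbasic
' Equivalent to: Spontaneous + Prompted
TotalAwareness = Union(Spontaneous, Prompted)

' Equivalent to: BrandsQ - {DK, NA}
AnalysisBrands = Difference(BrandsQ, {DK, NA})

' Equivalent to: Attribs[{Price}].Brands / Attribs[{Quality}].Brands
BrandsPriceOrQuality = XUnion(Attribs[{Price}].Brands, Attribs[{Quality}].Brands)
```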
Categorical set logic benefits
Less script needs to be written and, after the operators are understood, the script is easier to read and maintain.
Improved performance when compared to the style of script that is listed for the Intersection operator. Converting the benchmark script to use the Intersection operator, instead of For Each...Next when category lists are filtered, provides a performance improvement of approximately 5%.