Creating clean and dirty data files
Quick reference
To write correct records out to a clean data file and incorrect records out to a dirty data file, type:
split [only]
at the point at which records are to be written out. Type split only if the edit does not alter the contents of the record and you want to copy records directly from the original data file rather than from an intermediate file.
Clean and dirty data files are the files of correct records and of incorrect or rejected records that the edit statement split creates automatically.
Each time a record is read and reaches split, it is written out to the appropriate file in its current state. If any changes have been made with assignment statements, emit, delete, priority, require or the on-line edit, they will be saved in the clean data file if the record is now correct or in the dirty data file if the record still contains errors or has been rejected.
Split may occur several times in the edit, but each record will be written out once only. In the example below, the second split is redundant since all records will have been written out by the first one. The data to be checked is:
Card   Column   Reference   Contents
1      46       c146        5
2      34       c234        2
3      9        c309        3
and the program is:
r sp c234'1/5', c309'1/5-&' :'&'
split
if (c146'12') emit c180'1'; else; reject
split
Suppose that the record has reached the require statement without error. Since c234 contains '2' and c309 contains '3', the record passes the require statement and the first split copies it to the clean file. The next statement then checks c146, which contains '5' rather than '1' or '2', so the record is rejected and should be copied to the dirty file by the second split. This does not happen because the record has already been written out by the first split. To place the record in the dirty file instead, the program should read:
r sp c234'1/5', c309'1/5-&' :'&'
if (c146'12') emit c180'1'; else; reject
split
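The write-once behavior in the two programs above can be sketched with a small Python model. This is not Quantum code: the Record class and the require, check_c146, split and run functions are hypothetical names used only for illustration, and the model deliberately ignores the :'&' part of the require statement and the emit into c180.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    c146: str
    c234: str
    c309: str
    has_error: bool = False            # set when the require check fails
    rejected: bool = False             # set by reject
    written_to: Optional[str] = None   # "clean", "dirty", or None


def require(rec):
    # Simplified stand-in for: r sp c234'1/5', c309'1/5-&'
    if rec.c234 not in set("12345") or rec.c309 not in set("12345-&"):
        rec.has_error = True


def check_c146(rec):
    # Stand-in for: if (c146'12') emit c180'1'; else; reject
    if rec.c146 not in ("1", "2"):
        rec.rejected = True


def split(rec):
    # A record is written only the first time it reaches split.
    if rec.written_to is None:
        bad = rec.has_error or rec.rejected
        rec.written_to = "dirty" if bad else "clean"


def run(rec, steps):
    for step in steps:
        step(rec)
    return rec.written_to


# First program: split before and after the c146 check.
early_split = run(Record("5", "2", "3"), [require, split, check_c146, split])
# Corrected program: a single split at the end.
late_split = run(Record("5", "2", "3"), [require, check_c146, split])
print(early_split, late_split)  # clean dirty
```

In the model, the second split in the first program finds that the record has already been written, so the later rejection never reaches the dirty file.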
Split is often used at the end of an edit after online. This causes all records found in error by write and require statements to be offered in the on-line edit for correction and then saved in the clean or dirty file according to the type of on-line commands you use. For example, if a record is flagged as incorrect and you correct those errors, the record is placed in the clean data file. The same is true if you use ac to accept the record even if you do not make corrections. If you reject the record with rj, the record is placed in the dirty data file. By putting both statements at the end of the edit, you can be sure of seeing all erroneous records and of saving all records in their final state.
If some records are rejected from the run using reject;return, these records will not be included in the clean or dirty files unless the data is split before the records are rejected:
split
if (c132n'1/9') reject; return
In this example, because split appears in the edit before reject;return, all records will appear in one or other of the clean or dirty files (depending on whether or not they contain errors) even though records in which c132 does not contain any of the codes 1 through 9 have their edit terminated and are rejected from the tables.
if (c132n'1/9') reject; return
split
Here, because split appears after reject; return, only records in which c132 contains any of the codes 1 through 9 will appear in the clean or dirty files. Again, which file the records are written to depends on whether or not they contain errors.
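The effect of the statement order can be sketched with a short Python model. It is an illustration only, not Quantum code: run_edit, place and VALID_C132 are hypothetical names, c132 is treated as a single character, and has_error stands for any error found earlier in the edit.

```python
VALID_C132 = set("123456789")  # the codes tested by c132n'1/9'


def place(has_error):
    # split writes correct records to clean, incorrect ones to dirty.
    return "dirty" if has_error else "clean"


def run_edit(c132, split_first, has_error=False):
    """Return the file the record lands in, or None if it is never written."""
    written_to = None
    if split_first:
        written_to = place(has_error)   # split reached before the test
    if c132 not in VALID_C132:
        return written_to               # reject; return ends the edit here
    if not split_first:
        written_to = place(has_error)   # split reached only by kept records
    return written_to


print(run_edit("0", split_first=True))   # clean
print(run_edit("0", split_first=False))  # None
print(run_edit("4", split_first=False))  # clean
```

With split first, every record is written to one of the two files. With split last, a record that fails the c132 test leaves the edit before it is written to either file.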
Note: For more information about using reject, see Rejecting records. For more information about using return, see Jumping to the tabulation section.
By default, an intermediate data file is created for splitting. The name of this file is clean.q. If the run does not contain statements which alter the data (for example, recoding with assignment statements or creating new columns) then this file will be identical to the original data file. In such cases, you can save disk space during the run by splitting the original data file instead with the statement:
split only
Quantum does not change the original data file in any way; it reads records directly from this file and allocates them to the clean and dirty files rather than taking a backup copy of this file and reading records from there.
Notes
You cannot use split only when the datapass reads input from another program (for example, when you use a corrections file to correct records rather than writing a forced edit or using the on-line edit). Instead, you should run Quantum using the corrections file only and write all records to a new data file. Then run the datapass on this new data file.
If you do an on-line edit but forget split or write, your changes are not saved. Also, if you have created new cards and have not made thisread true for them (for example, thisread3=1 for a new card 3), the new cards are not written out.
If you use split on a levels (trailer card) job, splitting is switched on for all levels and must therefore be part of the top level edit. Additionally, it must appear once only and must not be part of an if statement. A reject statement at any level rejects the whole record and writes it to the dirty file.
See also
Data correction