Quickly verify, cleanse and analyze very large files
Overview
Andi Integrated Scripting includes many features that help you to verify and cleanse comma delimited files. The table paradigm allows any developer comfortable with Excel or SQL to quickly get to work accessing each "line" using the row/column paradigm. Data access to the rows is direct and extremely fast allowing you to iterate over the entire file in both forward and reverse direction.
The CSVTable provides an architecture that is extensible to provide new business rules or quickly switch plugins between customers to provide different cleansing functions.
Main class which enables high speed verification and cleansing of CSV file data. Navigation has been optimized to allow direct access to each line of the file using a row/column pattern much like an Excel spreadsheet. You access a row using a simple getValue(<rowIndex>,<columnIndex|columnName>) method or iterate over all rows using standard groovy closures.
There are two Column Value Processor approaches for performing verification and cleansing of your data. These approaches provide the opportunity to verify/cleanse column data before it is returned.
Andi Included Extension Processor
Extension which includes the following functions to help quickly configure your CSVTable scripts. The Andi extension ID is andi:included which allows you to access the included functions listed below.
stateCode - State name to code (Texas to TX)
stateDecode - State code to name (TX to Texas)
lastName - Formats in mixed case with additional rules to handle complex names such as O'Brien, Smith-Jones etc.
phoneNumber
zipCode
lowerCase
mixedCase
upperCase
replaceNonAscii - Replace non-ascii characters with blanks or your own character. Set CSVTable.setAlertIfNonAscii( true ) and turn on verfication mode as log to have verification messages logged. NOTE: Use the setReplaceNonAsciiChars(String replacement) to pre-replace all non-ascii characters to avoid verification exceptions.
trim
leftPad
rightPad
formatInteger
formatDecimal
For complete function information including available parameters, execute the following in an Andi Integrated Scripting environment.
// The details of the andi:included extensions
println extensionRegistry.getExtensionDetails("andi:included")Allows you to perform initial verifications and data-cleansing when a column is accessed. If your processor finds a data issue, it can immediately change the value or throw a VerificationException which can fail the script or Log the exception to a comma delimited error file enabling advanced issue resolution.
The processor is a Groovy Closure, or an anonymous block of code. The processor is called when the column is accessed using a getValue(<rowIndex>,<columnIndex>) method. A processor can be registered for each column of data allowing all columns to be verified/cleansed before they are used in your business logic.
A second type of processor is called an Extension Processor. It is registered like a Column Value Processor but the extension is a Java archive (jar) that can support 1-n functions. The advantages of an extension are:
all of the logic for verification and cleansing is located is a reusable jar
easily referenced within your script using a simple extension ID.
allows daisy-chaining multiple functions for a column to easily create more advanced verification/cleansing
The extensionRegistry reserved variable provides the following methods to work with the extensions.
register - register your custom extension by extending the com.jreflections.scripting.extensions.verifications.AbstractFileViewColumnVerifier class
getExtensionIds - Lists all of the registered extensions, including andi:included extension provided with the product
getExtensionDetails - Lists functions, parameters and more for every available function
extensionRegistry.register( new SampleVerifications( ) )
println extensionRegistry.getExtensionIds()
println extensionRegistry.getExtensionDetails("andi:included") Samples
Samples are attached that you can download and run in the Andi Integrated Scripting environment with very few changes. There are two attachments for each sample, a *.csv file which is the CSV file that the *.groovy script reads. To run the sample(s), download the attachments to a directory. Update the dir variable towards the top of the script, it should point to the directory where you downloaded the CSV file.
A separate path points to a temporary directory which is where temp files are created during processing. If processing large files, set this location to an alternate large drive using the setTempDirectory method.
address.groovy - Mixed case formatting, zip code formatting, state decoding and more
numbers.groovy - How to format integer and decimal numbers
names - Mixed case and last name formatting
multiFilesAddress.groovy - Same as address.groovy but allows processing all files in a directory with seperate batchRunId and logfiles
studentProfile.groovy - student, name, address files with processors on multiple tables. A more complex example showing how to join multiple files while still using processors on each to perform verification and transformation
(3 KB)
(3 KB)
(3 KB)
(2 KB)
(3 KB)
(3 KB)
(2 KB)
(2,775 KB)
(3,941 KB)
(167 KB)
(73 KB)
(218 KB)
(7,127 KB)