Quickly verify, cleanse and analyze very large files

Quickly verify, cleanse and analyze very large files

Overview

Andi Integrated Scripting includes many features that help you to verify and cleanse comma delimited files. The table paradigm allows any developer comfortable with Excel or SQL to quickly get to work accessing each "line" using the row/column paradigm. Data access to the rows is direct and extremely fast allowing you to iterate over the entire file in both forward and reverse direction. 

The CSVTable provides an architecture that is extensible  to provide new business rules or quickly switch plugins between customers to provide different cleansing functions.

image-20250906-143227.png

 

Main class which enables high speed verification and cleansing of CSV file data. Navigation has been optimized to allow direct access to each line of the file using a row/column pattern much like an Excel spreadsheet. You access a row using a simple getValue(<rowIndex>,<columnIndex|columnName>) method or iterate over all rows using standard groovy closures.

There are two Column Value Processor approaches for performing verification and cleansing of your data. These approaches provide the opportunity to verify/cleanse column data before it is returned.

Andi Included Extension Processor

Extension which includes the following functions to help quickly configure your CSVTable scripts. The Andi extension ID is andi:included which allows you to access the included functions listed below.

  • stateCode - State name to code (Texas to TX)

  • stateDecode - State code to name (TX to Texas)

  • lastName - Formats in mixed case with additional rules to handle complex names such as O'Brien, Smith-Jones etc.

  • phoneNumber

  • zipCode

  • lowerCase

  • mixedCase

  • upperCase

  • replaceNonAscii - Replace non-ascii characters with blanks or your own character. Set CSVTable.setAlertIfNonAscii( true ) and turn on verfication mode as log to have verification messages logged. NOTE: Use the setReplaceNonAsciiChars(String replacement) to pre-replace all non-ascii characters to avoid verification exceptions.

  • trim

  • leftPad

  • rightPad

  • formatInteger

  • formatDecimal

For complete function information including available parameters,  execute the following in an Andi Integrated Scripting environment.

// The details of the andi:included extensions println extensionRegistry.getExtensionDetails("andi:included")

Allows you to perform initial verifications and data-cleansing when a column is accessed. If your processor finds a data issue, it can immediately change the value or throw a VerificationException which can fail the script or Log the exception to a comma delimited error file enabling advanced issue resolution.

The processor is a Groovy Closure, or an anonymous block of code. The processor is called when the column is accessed using a getValue(<rowIndex>,<columnIndex>) method. A processor can be registered for each column of data allowing all columns to be verified/cleansed before they are used in your business logic.

A second type of processor is called an Extension Processor. It is registered like a Column Value Processor but the extension is a Java archive (jar) that can support 1-n functions. The advantages of an extension are:

  • all of the logic for verification and cleansing is located is a reusable jar

  • easily referenced within your script using a simple extension ID.

  • allows daisy-chaining multiple functions for a column to easily create more advanced verification/cleansing

The extensionRegistry  reserved variable provides the following methods to work with the extensions.

  • register - register your custom extension by extending the com.jreflections.scripting.extensions.verifications.AbstractFileViewColumnVerifier class

  • getExtensionIds - Lists all of the registered extensions, including andi:included extension provided with the product

  • getExtensionDetails - Lists functions, parameters and more for every available function

extensionRegistry.register( new SampleVerifications( ) ) println extensionRegistry.getExtensionIds() println extensionRegistry.getExtensionDetails("andi:included")

Samples

Samples are attached that you can download and run in the Andi Integrated Scripting environment with very few changes. There are two attachments for each sample, a *.csv file which is the CSV file that the *.groovy script reads. To run the sample(s), download the attachments to a directory. Update the dir variable towards the top of the script, it should point to the directory where you downloaded the CSV file.

A separate path points to a temporary directory which is where temp files are created during processing. If processing large files, set this location to an alternate large drive using the setTempDirectory method.

  • address.groovy - Mixed case formatting, zip code formatting, state decoding and more

  • numbers.groovy - How to format integer and decimal numbers

  • names - Mixed case and last name formatting

  • multiFilesAddress.groovy - Same as address.groovy but allows processing all files in a directory with seperate batchRunId and logfiles

  • studentProfile.groovy - student, name, address files with processors on multiple tables. A more complex example showing how to join multiple files while still using processors on each to perform verification and transformation

(3 KB)

(3 KB)

(3 KB)

(2 KB)

(3 KB)

(3 KB)

(2 KB)

(2,775 KB)

(3,941 KB)

(167 KB)

(73 KB)

(218 KB)

(7,127 KB)