Join CSV Files

Join CSV Files

Overview

To create a join between two CSV files, you create the join using the master table.

Method: DelimitedFileJoin addJoin( String joinId, CSVTable joinedFileView )

Indexed Joins

Indexed access provides performance improvements and prevents table scanning which can drastically reduce the time spent in looking up values.

Creating

Methods:

  • int createJoinIndex( joinId, joinTableKeyColumn, [<cacheColumn>,<cacheColumn>] )

  • int createJoinIndex( String joinId, String joinTableKeyColumn, List<String> cacheColumns, int initialCacheSize )

To create the join index pass the joinTableKeyColumn which tells the join which column to use when joining. The one or more columns passed as cacheColumns whose values will be cached. Optionally you can pass the initial Map size for the cache to attempt to increase performance. NOTE: The initial cache size is not the total cache size so don't add megabytes of cache space as the cache will grow in size automatically. Refer to java.util.HashMap initialCapacity.

table = new CSVTable( "c:/fileView/simpleJoin/address.csv" ) stateCodes = new CSVTable( "c:/fileView/simpleJoin/state_cd.csv" ) table.addJoin( "stateJoin", stateCodes ) stateRowsIndexed = table.createJoinIndex( 'stateJoin', 'state_cd', ['state'] )

Get Joined Value using a Method

Method:

  • MatchingRow getIndexedJoinValue( String joinId, String value)

The last step is to get values from the joined tables from your processor. The first method getIndexedJoinValue returns the first match found only. Use the Object MatchingRow.getValue( String columnName ) method to retrieve the value for a specific column. Remember that all values from the cache are returned.

Method:

  • MatchingRows getIndexedJoinValues( String joinId, String value )

Use the int MatchingRows.getNumberMatches( ) and MatchingRows.getRow( int index ) methods to get one of the values and then use the methods above to access the value.

createDelimitedRow and @indexedJoin

As discussed in another article, the createDelimitedRow will automatically get the required value for a column to create the delimited row. In the example below the "@indexedJoin(stateJoin:state_cd:state)" value is passed in the column name list. This will be interpreted and will use the stateJoin join created earlier in the script. Using the "state_cd" column value at rowIndex, it will lookup the value for the state column in the joined table.

Change the joinColumnName to any other name to access a different cached value in the index. For example, assuming the join table has an additional populate column, we can pass table.createJoinIndex( 'stateJoin', 'state_cd', ['state', 'population'] ) to cached the population as well. The look up the value using "@indexedJoin(stateJoin:state_cd,population)".

@indexedJoin(<joinId>:<masterColumnName>:<joinColumnName>)

import groovy.time.* timeStart = new Date() table = new CSVTable( "c:/fileView/simpleJoin/address.csv", 25000 ) table.setVerificationMode( "log", "d:/andi/validation/validationErrors.txt" ) stateCodes = new CSVTable( "c:/fileView/simpleJoin/state_cd.csv" ) table.addJoin( "stateJoin", stateCodes ) // For a simple code lookup, use a simple joinIndex which allows a key and one or // more cached column values. // state_cd column will be used as the key for the join and the column(s) in the // parens will be cached, in this case the state name. println "Join Created rowCount: " + table.createJoinIndex( 'stateJoin', 'state_cd', ['state'] ) remap = new File( "c:/fileView/simpleJoin/address.remap" ) remap.delete() out = remap.newWriter() try { table.eachRow() { rowIndex, table -> context.testExecutionCancelled() out.write( table.createDelimitedRow( rowIndex, ['@indexedJoin(stateJoin:state_cd:state)','city','zip_code','STREET_ADDRESS'], "*" ) + "\n" ) } } finally { out.close() } TimeDuration duration = TimeCategory.minus(new Date( ), timeStart) println duration