Data Selection

Data Selection

Overview

Andi has been optimized to quickly retrieve millions of rows for Data Analysis. By changing the way you use SQL, it will:

Simplify SQL Creation

Through decades of SQL optimization and use, it became very clear that the simpler the SQL, the faster you will receive the results. The approaches used have been the same approaches used by successful applications.

  • Select only the columns you need

  • Limit the number of SQL Where clause expressions. The best case is only one specific filter, preferably an indexed column, that will limit the number of rows returned.

  • Don't add an SQL Order By clause, instead sort the data in the Result Set

  • Format Timestamp values in the Result Set versus using Database Functions

The primary reason for these changes is to reduce Database CPU and Memory usage allowing Andi to be used in Production environments. This approach takes a little practice, but once you get the hang of it, you will never go back. The following provides two approaches to gathering the data before creating a chart.

Data Access Approaches

Database Only

Aggregate functions have been available as part of the SQL language for a long time.

SELECT SUM(AMOUNT) as SUM, MONTHNAME(SOLD_ON) as MONTH, YEAR(SOLD_ON), ITEM FROM ITEM_SALES GROUP BY MONTHNAME(SOLD_ON), YEAR(SOLD_ON), ITEM HAVING YEAR(SOLD_ON) = 2019 ORDER BY MONTH, ITEM;

When to use:

  • If you are an advanced SQL user

  • If you have an advanced SQL user on your team

  • You have spare resources - CPU and Memory on the target database server

Pros:

  • Useful when not using Andi.

Cons:

  • Requires advanced SQL skills

  • Uses increased amounts of Database CPU and Memory

  • Must call the DBA to stop the query from executing as Andi will not let you cancel a query, only stop retrieving the results once the query execution completes

  • Multiple temporary result sets created within the database using additional resources which can impact production performance. As the number of rows are increased in the initial set retrieved before the WHERE clause is applied, the more temporary resources as used by the database to produce the result.

Simple Where Clause

This first approach uses the database YEAR() function to limit the number of rows returned by the query as we are only interested in the data for 2019. 

SELECT AMOUNT, SOLD_ON as MONTH, ITEM FROM ITEM_SALES where YEAR(SOLD_ON) = 2019

When to use:

  • If you have 5 million or more rows that have a year other than 2019 that will end up in the result set that will never be used. An example would be the case where there are 50 years of data in the table and you are only going to be analyzing one of the year.

Pros:

  • Simple SQL

  • Less resources used on the Database

  • The query runs much faster allowing you to quickly stop the data retrieval process and release valuable production resources if a change is required before re-running the query

Cons:

  • None

No Where Clause

The first thing to remember when you are selecting data without a where clause is how many rows will be retrieved. The plus is that you can stop anytime and work on the results retrieved. The downside is that you may not have all of the data you are interested in. Always think through the ramifications.

SELECT AMOUNT, SOLD_ON as MONTH, ITEM FROM ITEM_SALES or SELECT AMOUNT, SOLD_ON as MONTH, SOLD_ON as YEAR, ITEM FROM ITEM_SALES

When to use:

  • When you can retrieve all of the rows in a reasonable amount of time and you have enough memory allocated to the Andi Java JVM to hold the result set

  • When creating an SQL statement with two timestamps, one as a month and the other as a year. This enables you to quickly filter by year so you only have a specific years worth of data in a chart

Pros:

  • Simple SQL

  • The Database almost immediately starts streaming rows

  • Allows you to quickly stop the data retrieval at any time

Cons:

  • If you have more rows to process than memory allocated, Andi could run out of memory. When this happens it appears that Andi becomes sluggish and unresponsive.