Data Selection
Overview
Andi has been optimized to quickly retrieve millions of rows for Data Analysis. By changing the way you use SQL, it will:
Dramatically reduce Database Server resource usage
Decrease the SQL execution time
Allow you to do more effective analysis
Simplify SQL Creation
Through decades of SQL optimization and use, it became very clear that the simpler the SQL, the faster you will receive the results. The approaches used have been the same approaches used by successful applications.
Select only the columns you need
Limit the number of SQL Where clause expressions. The best case is only one specific filter, preferably an indexed column, that will limit the number of rows returned.
Don't add an SQL Order By clause, instead sort the data in the Result Set
Format Timestamp values in the Result Set versus using Database Functions
The primary reason for these changes is to reduce Database CPU and Memory usage allowing Andi to be used in Production environments. This approach takes a little practice, but once you get the hang of it, you will never go back. The following provides two approaches to gathering the data before creating a chart.
Data Access Approaches
Database Only
Aggregate functions have been available as part of the SQL language for a long time.
SELECT SUM(AMOUNT) as SUM, MONTHNAME(SOLD_ON) as MONTH, YEAR(SOLD_ON), ITEM
FROM ITEM_SALES
GROUP BY MONTHNAME(SOLD_ON), YEAR(SOLD_ON), ITEM
HAVING YEAR(SOLD_ON) = 2019
ORDER BY MONTH, ITEM;When to use:
If you are an advanced SQL user
If you have an advanced SQL user on your team
You have spare resources - CPU and Memory on the target database server
Pros:
Useful when not using Andi.
Cons:
Requires advanced SQL skills
Uses increased amounts of Database CPU and Memory
Must call the DBA to stop the query from executing as Andi will not let you cancel a query, only stop retrieving the results once the query execution completes
Multiple temporary result sets created within the database using additional resources which can impact production performance. As the number of rows are increased in the initial set retrieved before the WHERE clause is applied, the more temporary resources as used by the database to produce the result.
Simple Where Clause
This first approach uses the database YEAR() function to limit the number of rows returned by the query as we are only interested in the data for 2019.
SELECT AMOUNT, SOLD_ON as MONTH, ITEM
FROM ITEM_SALES
where YEAR(SOLD_ON) = 2019When to use:
If you have 5 million or more rows that have a year other than 2019 that will end up in the result set that will never be used. An example would be the case where there are 50 years of data in the table and you are only going to be analyzing one of the year.
Pros:
Simple SQL
Less resources used on the Database
The query runs much faster allowing you to quickly stop the data retrieval process and release valuable production resources if a change is required before re-running the query
Cons:
None
No Where Clause
The first thing to remember when you are selecting data without a where clause is how many rows will be retrieved. The plus is that you can stop anytime and work on the results retrieved. The downside is that you may not have all of the data you are interested in. Always think through the ramifications.
SELECT AMOUNT, SOLD_ON as MONTH, ITEM
FROM ITEM_SALES
or
SELECT AMOUNT, SOLD_ON as MONTH, SOLD_ON as YEAR, ITEM
FROM ITEM_SALESWhen to use:
When you can retrieve all of the rows in a reasonable amount of time and you have enough memory allocated to the Andi Java JVM to hold the result set
When creating an SQL statement with two timestamps, one as a month and the other as a year. This enables you to quickly filter by year so you only have a specific years worth of data in a chart
Pros:
Simple SQL
The Database almost immediately starts streaming rows
Allows you to quickly stop the data retrieval at any time
Cons:
If you have more rows to process than memory allocated, Andi could run out of memory. When this happens it appears that Andi becomes sluggish and unresponsive.