software engineering
Why This Query Is Fast, and Why This Query Is Slow
date
slug
why-this-query-fast-and-why-this-slow
author
status
Public
tags
Query
Databases
summary
A never-ending question, especially when it comes to database design. You swear you didn’t change anything, but all of a sudden the database is slowing down. What happened?
type
Post
category
software engineering
updatedAt
Apr 19, 2023 10:42 AM
Like
One night after finishing my daily job, I opened Twitter to see what had been happening lately, and I found this tweet from Tim Ottinger:

Then, while looking for another article, I found this one from Martin Fowler:

It led me to a query question. First things first: every decision made when building software depends on what we need, and when it comes to the database we usually have a lot of options, such as NoSQL, relational, distributed, and so on. Then comes the question of how we are going to map our data into the database. The first answer is a “query”, which leaves another question behind: how are we going to write that query, as a native query or through an ORM? Back to the topic: why does a query sometimes take a lot of time, whether it runs through an ORM or as a native query?
As an example question, assume we are using PostgreSQL:
A table User with integer columns A, B, and C
A has the primary, clustered index
B has a secondary index
All indexes are B+Trees
C holds no values (it is empty)
The use case was written like this:
1st Query
SELECT * FROM "User" WHERE A BETWEEN 1 AND 100000;
2nd Query
SELECT * FROM "User" WHERE B BETWEEN 10 AND 100000;
3rd Query
SELECT * FROM "User" WHERE C BETWEEN 5 AND 1000000;
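Before breaking the three queries down by hand, here is a minimal sketch of how you could check what the planner does with each predicate. SQLite stands in for PostgreSQL here (so the clustered-index and heap behaviour does not carry over exactly), and the table and index names are my own, but `EXPLAIN QUERY PLAN` shows the same basic split between index searches and full scans:

```python
import sqlite3

# Toy stand-in for the example: A is the primary key, B has a secondary
# index, C has no index and is left empty (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (a INTEGER PRIMARY KEY, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX idx_b ON user(b)")  # secondary index on B
conn.executemany(
    "INSERT INTO user (a, b) VALUES (?, ?)",
    [(i, 100000 - i) for i in range(1, 1001)],
)

for label, query in [
    ("A (primary key)    ", "SELECT * FROM user WHERE a BETWEEN 1 AND 100000"),
    ("B (secondary index)", "SELECT * FROM user WHERE b BETWEEN 10 AND 100000"),
    ("C (no index)       ", "SELECT * FROM user WHERE c BETWEEN 5 AND 1000000"),
]:
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    print(label, "->", plan[0][3])
```

The A and B queries come back as a SEARCH using an index, while the C query is a plain SCAN of the table, which lines up with the breakdown below.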
Let’s break it down starting from the 3rd query. We scan through C, which has no index, and since C holds no values this query is actually fast: a sequential scan is cheap when there is nothing in the heap to read. (If the table instead held billions of rows, the 3rd query would no longer be the answer.) The point is that if the column or table is empty, there is no benefit in having an index: it is more expensive to crack open the B+Tree data structure and pull in all of the index pages when there is nothing to find, so going through the index would actually be slower than scanning the values themselves.
Let’s pick the second one. Assume we scan the secondary index on B, and assume the table is large. The secondary index stores, for every entry of B, a pointer to the primary key. At the end of the scan you get the matching B values tucked together in the B+Tree very quickly, but the list of primary keys you collect is completely random: B100 could live in the last row while B1 sits in row 300. The lookups will be all over the place, and the query will look like this:

2nd Query
SELECT * FROM "User" WHERE B BETWEEN 10 AND 100000 ORDER BY B;
We sort the results, then follow the pointers back into the table through A, the primary index, and that is going to be expensive (whereas for C it was free, since C is empty). Because this is an unclustered index, we can’t depend on insertion order. If we insert A = 1, 2, 3, and 4, those rows fit in the same page, since A is the clustered index. For B the placement is completely random: there is no guarantee that what we insert first lands first. We might insert B1 and find it on a middle page, and insert B100 and find it on the first page. We can conclude that this query will be slower in this case, because for every match we need to find the page where the row was put, and that is not optimal.
What happens when we turn the scan on B into a bitmap scan? It becomes very fast for this case, because we mark, for each matching row, the page it lives on. We collect all of the marked pages and then visit each one exactly once, even though B is an unclustered index. Still, that is not the fastest query for this question.
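The page-marking idea can be sketched in plain Python. This is a toy model, not PostgreSQL’s actual bitmap heap scan, and the pseudo-random page placement is made up to mimic the random heap order of an unclustered index:

```python
# Toy model: a secondary index maps each B value to a (page, row) location.
# The multiplier makes page placement look random, like an unclustered heap.
index_b = {b: (b * 7919 % 50, b) for b in range(1, 201)}

# Index scan: find the matching B values (cheap, they sit together in the tree).
matches = [b for b in index_b if 10 <= b <= 100]

# Phase 1: instead of jumping to the heap per row, mark the touched pages.
pages = sorted({index_b[b][0] for b in matches})

# Phase 2: one ordered pass over the marked pages, each visited exactly once.
print(len(matches), "matching rows, but only", len(pages), "page visits, in order")
```

The win is in phase 2: the heap pages are read sequentially and at most once each, rather than being revisited in whatever random order the index entries point at them.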
The last one is:
SELECT * FROM "User" WHERE A BETWEEN 1 AND 100000;
This is the most beautiful one, because we are searching the primary index, where A is clustered. In a clustered index everything sits together in the same pages, so this is just a single scan of the primary index. We find the lower bound and collect everything from there: in the B+Tree we descend toward the left, meaning we reach the smaller values first, and then simply move on to the next page, and the next. At that point we are pointing at the heap pages themselves, because the primary key is the heap of this User table.
And because A is the primary key, the B and C values are sitting right there too.
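The clustered scan above can be sketched with a sorted list standing in for the ordered heap pages (the row contents are made up; the point is the access pattern):

```python
import bisect

# Sketch: the heap itself is ordered by A, so BETWEEN is one B+Tree-style
# descent to the lower bound plus a walk over contiguous entries.
# B and C come along for free because the whole row lives right there.
heap = [(a, f"b-{a}", None) for a in range(1, 1001)]  # rows ordered by A; C empty
keys = [row[0] for row in heap]

lo = bisect.bisect_left(keys, 1)     # descend to the smallest matching key
hi = bisect.bisect_right(keys, 100)  # then just keep walking to the next page
rows = heap[lo:hi]                   # one contiguous slice, no random jumps

print(len(rows), "rows read in a single sequential pass")
```

Contrast this with the secondary-index case: there, every match forced a separate jump to wherever the heap happened to put the row; here, the matches are one contiguous run.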
That’s an example of how we break all of this down for one use case and its queries. But what about the mechanism when we write these things through an ORM?
Before we even map those queries, what we are really doing is synchronizing two quite different representations of data: one in the relational database, and the other in memory. It is really an in-memory problem, and that is where the “object” comes from (hence the name: Object-Relational Mapping).

The complex part, even before we get to the query, is that the ORM itself has to handle any concurrency problems that come up; for that we usually rely on the ORM’s transactions. In most cases you can’t leave a transaction open: one of the side effects is reduced overall system throughput, and eventually open transactions hit timeout limits and fail, which is a problem for overall system behaviour.
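As a hedged sketch of the “don’t leave the transaction open” point, here is the shape of a short-lived transaction using Python’s sqlite3 (the `account` table and `transfer` helper are my own examples, not any particular ORM’s API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    # `with conn:` opens a transaction, commits on success, and rolls
    # back on error -- so locks are held only around these two writes,
    # not across whatever unrelated work happens before or after.
    with conn:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))

transfer(conn, 1, 2, 30)
print(conn.execute("SELECT balance FROM account ORDER BY id").fetchall())
```

The same principle applies whichever ORM you use: scope the transaction to the writes that belong together, and do slow work (network calls, rendering, user input) outside it.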
Between using an ORM and writing native queries, neither choice is wrong; it depends on the use case. If your use case is transaction-heavy, the ORM can become a pitfall for your system, and you will need optimizations such as picking the right locking level. But even then,
Everything is right in the world