Data retrieval optimization
‘Hi Jano, the site is running slow, don’t you know where the problem might be?’
‘Hi Viktor. Yes, I do. In your application.’ ?
This is what the initial communication with an admin often looked like in the past, when the amount of data in the database increased and the page started to run more and more slowly until the server sometimes stopped coping and simply crashed :). 99% of time, the admin is right, so today we’ll look at how to write the code so that it’s optimally written from the server’s point of view and clear at the same time.
When developing a web application, you must have encountered object-oriented programming (OOP for short) in connection with a data warehouse, which is mostly a relational database (MySQL, MariaDB, PostgreSQL,…).
We won’t discuss what OOP is. As part of OOP design, we can elegantly encapsulate properties and functionality for individual objects. We then initialize and use the objects. Simple, isn’t it?
However, OOP often lures us to well-written code, but not always to code optimal in terms of server load. That’s why it’s necessary to pay attention to how the data is loaded and subsequently filled into individual objects.
Within the relational database, data is divided into structured tables between which there’s a relational relationship of type 1:N, N:1 or N:M.
For a more specific idea, we’ll work with the following data model:
- interest (id, name)
- skill (id, name)
- student (id, name, surname, sex, birth_date)
- student_interest (student_id, nterest_id)
- student_skill (student_id, skill_id)
In layman’s terms, this structure says that we have a database of students for whom we record their interests and program skills. One student may have multiple interests and multiple program skills. It’s assumed that the interests and skills will be repeated in different students. That’s why we have data on interests and skills in separate tables, the so-called dial tables, and then we have the student’s link to the dial in the linking tables.
With such a data structure, we’ll create an OOP model:
- Student, Student_List, Student_Search
- Interest, Interest_List, Interest_Search
- Skill, Skill_List, Skill_Search
In this OOP model, Student, Interest, Skill represent classes for Entities, *_List classes for Entity list and *_Search are classes for Entity search. The search class has a connector to the database available, with which it can read data from the database and create a response in the required output structure.
So let’s see how to load a list of 50 students with their interests and skills.
Since we have created an OOP model, the logical step is to use the *_Search class to retrieve the necessary data.
It’s a very clear solution from the point of view of reading the code, but from the point of view of the load on the database server, it isn’t a very optimal solution. Why? Well, because the server must make 2n + 1 number of queries to the database, where ‘n’ is the number of students found from the first query (50 in our example).
Now imagine that we have a high level of website traffic and at one time we get 1000 visitors per page. In this case, the server must make 1000 x (2n + 1) queries to the database, which represents an exponential increase in the number of queries and the server works until it may not manage anymore and goes down ?
Therefore, it’s appropriate to look at our code from the other side during development, and that is performance or server load.
In our case, we’ll try to optimize our solution at the code level by finding the IDs of students we loaded in the first query. Subsequently, we retrieve data for these IDs from the database in a single query and process it at the code level.
Here is the resulting change in the data access logic:
As you may have noticed, suddenly we get from the number of 2n + 1 queries to the number 3, regardless of the value of n. Interesting, isn’t it?
Now some calculations:
For comparison with 1000 visitors and loading 50 students:
- 1st Method: 1000 x (2 x 50 + 1) = 101 000 SQL queries
- 2nd Method: 1000 x 3 = 3000 SQL queries
So we saved 101 000 – 3000, i.e. 98 000 queries – suddenly the server is slacking ?
How to polish it up even more?
If we have data that does not change too often over time, then we can write it to the CACHE layer at the application level (File cache, Redis cache, etc.). With this adjustment, we can get to the following numbers:
- 1st Method: (2 x 50 + 1) = 101
- 2nd Method: 1 x 3 = 3
As you have noticed, with 1000 visitors, in fact, each request is sent only once and the remaining 999 visitors get the data returned from the CACHE layer. Let’s invalidate this CACHE layer regularly to keep the data up-to-date.
What to watch out for?
It’s also necessary to pay attention to what data we ask for from the database. You should consider whether it wouldn’t be more appropriate to first read only the IDs of records (smaller data transfer) with a more complicated request to the database, and then to retrieve the necessary data on the basis of the found IDs. This request will be fast as we will be retrieving data based on the primary key.
Here we come to the alpha and omega of database optimization as such. That is table indexing. Properly set indexes can help us significantly in streamlining data retrieval from the database.
The index should be made up of the most searched columns. The table can contain multiple indexes. Beware, however, the rule ‘less is more’ applies here as well. The more indexes, the slower the INSERT and UPDATE requests, as the index needs to be updated. With massive tables, this can be a noticeable slowdown.
The optimization must first be properly analyzed and then implemented. The result of optimization should be to make data retrieval more efficient, not more complicated.