Fix Slow Queries Now: Must-Know Indexing Hacks!
Indexes are a powerful tool that significantly speeds up query execution in databases. They work like the index at the back of a book, helping you locate the necessary information without reading every page. This article explains how indexes work, their internal structure, advantages, and limitations.
Why Are Indexes Needed?
When executing a query, the database searches for rows that meet the conditions. If the table contains a large number of records, searching without an index can be slow, as every row must be checked—this is called a full table scan. This increases query execution time and server load. Indexes help avoid this by providing a more efficient way to search, directing the query to relevant rows and minimizing read operations.
Index Types
B-Trees
- Algorithmic Complexity:
  - Search: O(log n)
  - Insertion: O(log n)
  - Deletion: O(log n)
- Structure: Balanced tree in which nodes contain keys and links to child nodes. All keys are kept in sorted order, which supports ordered and range access.
- Search: Logarithmic, because the tree stays balanced and therefore shallow.
- Disk Usage: Often around 20-30% of the indexed data size for keys and metadata; the actual figure depends on key width and the database engine.
- Disk Storage: Stored in separate structures optimized for sequential reads, reducing I/O operations.
- Advantages:
  - Suitable for range queries like WHERE age BETWEEN 20 AND 30.
  - Universal and used for most search operations.
- Disadvantages:
  - Increased insertion and deletion time if rebalancing is required.
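Sorted keys are what make both point lookups and range queries cheap in a B-Tree. A full B-Tree is too long to show here, so the sketch below uses a sorted Python list with binary search as a stand-in for that ordering. The sample ages and the helper names `point_lookup` and `range_scan` are illustrative assumptions, not part of any database API.

```python
from bisect import bisect_left, bisect_right

# Sorted keys stand in for the ordered keys of a B-Tree (illustrative data).
ages = sorted([18, 45, 23, 30, 27, 61, 20, 35, 29, 52])

def point_lookup(keys, target):
    """Return True if target is present, using O(log n) binary search."""
    i = bisect_left(keys, target)
    return i < len(keys) and keys[i] == target

def range_scan(keys, low, high):
    """Return all keys with low <= key <= high (like BETWEEN low AND high)."""
    start = bisect_left(keys, low)    # first position with key >= low
    end = bisect_right(keys, high)    # first position with key > high
    return keys[start:end]

print(point_lookup(ages, 27))
print(range_scan(ages, 20, 30))
```

The range scan needs only two binary searches to find its boundaries; everything between them is already in order, which is the property a hash-based structure lacks.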
B+ Trees (Enhanced B-Trees)
- Algorithmic Complexity:
  - Search: O(log n)
  - Insertion: O(log n)
  - Deletion: O(log n)
- Structure: In a B+ Tree, data is stored only in leaf nodes, minimizing I/O operations. Internal nodes contain only keys and pointers. Leaf nodes are typically linked to one another, so a range can be read by walking the leaves in order.
- Search: Efficient due to compact data organization in leaf nodes.
- Disk Usage: May occupy 30-40% of indexed data size, because keys used as separators in internal nodes are repeated in the leaf nodes; the actual figure depends on the engine and key width.
- Disk Storage: Stored in sequential disk blocks to minimize fragmentation.
- Advantages:
  - Enhanced performance for batch queries and reads.
  - Supports sequential reading for ranges.
- Disadvantages:
  - Consumes more disk space due to key duplication.
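To make the leaf-level range read concrete, here is a deliberately simplified sketch: one internal level of separator keys over a chain of linked leaves. It assumes unique keys and a non-empty key list, and it omits insertion, deletion, and rebalancing. The names `Leaf`, `build_leaves`, and `range_scan` are hypothetical.

```python
from bisect import bisect_right

class Leaf:
    """A leaf node: sorted keys plus a pointer to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

def build_leaves(sorted_keys, leaf_size):
    """Split sorted, unique keys into linked leaves; return (leaves, separators)."""
    leaves = [Leaf(sorted_keys[i:i + leaf_size])
              for i in range(0, len(sorted_keys), leaf_size)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    # One internal level: separator i is the smallest key of leaf i + 1.
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return leaves, separators

def range_scan(leaves, separators, low, high):
    """Descend once to the leaf that may hold `low`, then follow next pointers."""
    leaf = leaves[bisect_right(separators, low)]
    result = []
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result

leaves, separators = build_leaves(list(range(0, 100, 5)), leaf_size=4)
print(range_scan(leaves, separators, 12, 41))
```

Only the first leaf is located through the internal level; the rest of the range is read by following `next` pointers, which is the sequential access pattern described above.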
Hash Indexes
- Algorithmic Complexity:
  - Search: O(1) (average)
  - Insertion: O(1) (average)
  - Deletion: O(1) (average)
- Structure: Uses a hash function to transform keys into hash values that point to bucket locations, allowing quick exact matches.
- Search: Extremely fast for exact matches, e.g., WHERE id = 42.
- Disk Usage: Requires approximately 10-20% of the indexed data size, depending on the engine and key width.
- Disk Storage: Stored in segmented areas optimized for search operations.
- Advantages:
  - High speed for exact matches.
- Disadvantages:
  - Does not support ordered or range queries.
  - Can consume significant space with a large number of records.
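The trade-off can be sketched with a Python dictionary, which is itself a hash table. The rows and the helper names `find_by_id` and `find_id_range` are made up for illustration; a real hash index stores row locations on disk rather than list positions.

```python
from collections import defaultdict

# Hypothetical rows: (id, name).
rows = [(7, "ann"), (42, "bob"), (99, "cy"), (13, "dee")]

# Build a hash index on id: key -> list of row positions.
hash_index = defaultdict(list)
for position, (row_id, _name) in enumerate(rows):
    hash_index[row_id].append(position)

def find_by_id(row_id):
    """Exact match: a single dictionary probe, O(1) on average."""
    return [rows[p] for p in hash_index.get(row_id, [])]

def find_id_range(low, high):
    """Range query: hashing destroys key order, so every key must be checked."""
    return sorted(rows[p]
                  for key, positions in hash_index.items()
                  if low <= key <= high
                  for p in positions)

print(find_by_id(42))
print(find_id_range(10, 50))
```

The exact-match lookup touches one bucket, while the range query degenerates into a scan over all keys, which is why ordered structures are preferred for BETWEEN-style conditions.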
Full-Text Indexes
- Algorithmic Complexity (typical for term lookup; varies by implementation):
  - Search: O(log n)
  - Insertion: O(log n)
  - Deletion: O(log n)
- Structure: Typically an inverted index that maps each token (word) to the rows or documents containing it.
- Search: Supports phrase matching, stemming, and relevance ranking.
- Disk Usage: Requires additional space for tokenized data and metadata.
- Disk Storage: Stored in compressed formats for fast retrieval of textual data.
- Advantages:
  - Efficient text searching in large datasets.
  - Supports complex search operations like keyword proximity and partial word matching.
- Disadvantages:
  - Requires specific configuration and may significantly increase storage.
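A minimal inverted index fits in a few lines. This sketch assumes a naive tokenizer (lowercase, letters only) and simple AND semantics; stemming, stop words, phrase matching, and ranking are left out. The documents and function names are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical documents keyed by row id.
documents = {
    1: "Indexes speed up database queries",
    2: "A full table scan reads every row",
    3: "Database indexes cost extra disk space",
}

def tokenize(text):
    """Lowercase and split on non-letters; real engines also stem and drop stop words."""
    return re.findall(r"[a-z]+", text.lower())

# Inverted index: token -> set of document ids containing it.
inverted = defaultdict(set)
for doc_id, text in documents.items():
    for token in tokenize(text):
        inverted[token].add(doc_id)

def search_all(query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(inverted.get(t, set()) for t in tokens))
    return sorted(matches)

print(search_all("database indexes"))
```

Each query term costs one lookup in the token map, and the per-term result sets are then intersected, so the search never reads the document text itself.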
Index Maintenance
Why is index maintenance important?
Over time, frequent insert, update, and delete operations fragment indexes, which degrades performance. Regular maintenance keeps them efficient.
Key Maintenance Methods
- REINDEX (PostgreSQL): Rebuilds the index, eliminating fragmentation and restoring performance.
REINDEX INDEX index_name;
- OPTIMIZE TABLE (MySQL): Eliminates table and associated index fragmentation.
OPTIMIZE TABLE table_name;
- Automatic Maintenance: Some databases have built-in mechanisms for scheduled or automatic maintenance (e.g., Maintenance Plans in SQL Server, autovacuum in PostgreSQL).
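The PostgreSQL and MySQL commands above need a running server, so the following sketch uses SQLite through Python's standard `sqlite3` module as a stand-in; SQLite also has a `REINDEX` statement that rebuilds an index from scratch. The table, index name, and data are assumptions made for the example, and SQLite's storage behavior differs from that of the other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Churn the table: insert rows, then delete most of them.
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])
conn.execute("DELETE FROM users WHERE id % 10 != 0")
conn.commit()

# Rebuild one index from scratch (SQLite's counterpart of REINDEX INDEX ...).
conn.execute("REINDEX idx_users_email")

# Sanity checks: the database is consistent and the index still answers lookups.
integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
hit = conn.execute("SELECT id FROM users WHERE email = ?",
                   ("user9@example.com",)).fetchone()
print(integrity, hit)
```

A rebuild does not change query results, only the physical layout of the index, so checks like these are mainly a safeguard when maintenance is scripted.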
Index Comparison
Index Type | Search Complexity | Insert/Delete Complexity | Disk Usage | Main Advantages | Main Disadvantages |
---|---|---|---|---|---|
B-Tree | O(log n) | O(log n) | Moderate | Supports range queries | May require rebalancing |
B+ Tree | O(log n) | O(log n) | High | Speeds up sequential reads | Consumes more disk space |
Hash | O(1) (average) | O(1) (average) | Low | Fast exact-match search | Does not support range queries |
Full-Text | O(log n) | O(log n) | High | Text searching | High storage requirements |
Bitmap | O(1) | O(n) | Very Low | Efficient for low cardinality | Not suitable for write-heavy workloads |
Composite | O(log n) | O(log n) | High | Speeds up multi-column queries | Requires careful column ordering |
Covering | O(log n) | O(log n) | Very High | Avoids table lookups | Increases storage, slows writes |
Spatial | O(log n) | O(log n) | Moderate | Efficient for geospatial queries | Complex setup, limited scenarios |
Unique | O(log n) | O(log n) | Moderate | Ensures data uniqueness | Slightly reduces insert performance |
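
Two rows of the table, composite and covering indexes, are easiest to understand from a query plan. The sketch below uses SQLite via Python's `sqlite3` module; the `orders` schema and the index name `idx_cust_date` are assumptions, and the exact plan wording varies between SQLite versions and differs in other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, amount REAL)""")
conn.execute("CREATE INDEX idx_cust_date ON orders (customer_id, order_date)")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Leading column present: the composite index can be used.
leading = plan("SELECT * FROM orders "
               "WHERE customer_id = 7 AND order_date >= '2024-12-01'")
# Leading column missing: expect a scan instead of an index search.
skipped = plan("SELECT * FROM orders WHERE order_date >= '2024-12-01'")
# Only indexed columns requested: the index alone can answer the query.
covered = plan("SELECT customer_id, order_date FROM orders WHERE customer_id = 7")

for label, detail in [("leading", leading), ("skipped", skipped), ("covered", covered)]:
    print(label, detail)
```

The second query illustrates why column order matters: the index is sorted by `customer_id` first, so a condition on `order_date` alone cannot narrow the search.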
Using EXPLAIN for Query Optimization
EXPLAIN is a tool provided by most relational databases that allows analysis of query execution plans. It shows how the database plans to execute the query, including index usage, the number of rows to be read, and the order of operations.
How to Use EXPLAIN
- Execute the query with the EXPLAIN prefix:
EXPLAIN SELECT * FROM users WHERE id = 42;
- The result will show:
  - Execution type (e.g., index search or full table scan).
  - Index used (if any).
  - The number of rows to be processed.
How EXPLAIN Helps Optimize Queries
- Identifying Unused Indexes: If the query performs a full table scan instead of using an index, this may indicate the need for creating or modifying an index.
- Improving Operation Order: Analysis reveals suboptimal sequences of operations (e.g., table joins).
- Reducing Processed Rows: EXPLAIN shows the number of rows to be read, helping evaluate the need for additional indexes.
- JOIN Optimization: The plan shows how each table in a join is accessed, which helps spot joins that lack a supporting index.
Example:
EXPLAIN SELECT * FROM orders WHERE order_date BETWEEN '2024-12-01' AND '2024-12-31';
The result may show:
- Use of an index for range queries.
- Expected number of rows to be processed.
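For a version that runs without a database server, SQLite offers `EXPLAIN QUERY PLAN`, which plays a similar role. The sketch below assumes a small `users` table and a hypothetical helper named `explain`; the output format is SQLite-specific and looks different from PostgreSQL or MySQL plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

def explain(sql, params=()):
    """Run EXPLAIN QUERY PLAN and return the detail text of each plan step."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

by_id = explain("SELECT * FROM users WHERE id = ?", (42,))
by_email = explain("SELECT * FROM users WHERE email = ?", ("a@example.com",))

print(by_id)     # lookup through the primary key
print(by_email)  # no index on email, so the whole table is scanned
```

In SQLite's output, a step that begins with SEARCH uses an index or key, while SCAN means every row is visited; the second query is a candidate for a new index on `email`.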
Index Selectivity
Definition: Index selectivity measures how effectively an index can reduce the number of rows returned by a query. It is calculated as the ratio of unique values in the index to the total number of rows in the table.
High Selectivity:
- Characterized by a large number of unique values in the column.
- Example: Primary keys or unique identifiers.
- Advantages: Such indexes significantly reduce the search space and improve performance.
Low Selectivity:
- Characterized by a small number of unique values (e.g., Boolean fields).
- Example: Columns with gender or status data.
- Limitations: Such indexes are less effective for filtering large datasets.
Tip: Prioritize high-selectivity indexes for better performance.
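The ratio from the definition can be measured directly with COUNT(DISTINCT ...) and COUNT(*). This sketch uses SQLite through Python's `sqlite3` module; the `users` table, its columns, and the `selectivity` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, is_active INTEGER)")
conn.executemany(
    "INSERT INTO users (email, is_active) VALUES (?, ?)",
    [(f"user{i}@example.com", i % 2) for i in range(1000)],
)

def selectivity(column):
    """Distinct values divided by total rows; column must be a trusted identifier."""
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM users"
    ).fetchone()
    return distinct / total

print(selectivity("email"))      # 1.0
print(selectivity("is_active"))  # 0.002
```

A value near 1 means almost every row has its own key, so an index lookup returns very few rows; a value near 0 means each key matches a large share of the table.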
When Indexes Fail to Work
- When the query affects a large number of rows, even with an index, execution may fall back to a full table scan.
- Low selectivity makes an index less effective, so the optimizer may ignore it.
- Functions applied to an indexed field in the query (e.g., WHERE LOWER(name) = 'john') can prevent index usage.
- Queries whose conditions do not match the column order of a composite index.
Example: SELECT * FROM users WHERE LOWER(name) = 'john';
- Even with an index on the name field, it cannot be used due to the LOWER function: the index stores the original name values, not the lowercased ones.
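One common workaround is to index the expression itself. The sketch below demonstrates this in SQLite, which supports indexes on expressions; PostgreSQL and recent MySQL versions offer comparable features with their own syntax. The schema and index names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE INDEX idx_users_name ON users (name)")

QUERY = "SELECT * FROM users WHERE LOWER(name) = 'john'"

def plan(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN step."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan(QUERY)  # the plain index on name cannot serve LOWER(name)

# Index the expression itself so the predicate matches the indexed value.
conn.execute("CREATE INDEX idx_users_lower_name ON users (LOWER(name))")
after = plan(QUERY)

print(before)
print(after)
```

The expression in the query has to match the indexed expression for the planner to pick it up, so this approach works best when the application consistently uses the same form of the predicate.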
Examples of Index Usage
Simple Example: Searching by ID
Suppose we have a users table with a million rows, and we want to find the user with id = 42:
- Without Index:
  - The database scans every table row until the desired one is found.
  - In the worst case, a million operations are needed.
- With Index:
  - The index quickly finds the row with id = 42, using the B-Tree structure.
  - The search takes only a few steps.
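The gap between the two approaches can be counted rather than timed. This sketch compares a linear scan with binary search over a sorted list of one million ids; binary search is used here as a simplified stand-in for a B-Tree descent, and the function names are hypothetical.

```python
def linear_scan_steps(keys, target):
    """Count rows inspected by a scan that stops at the first match."""
    for steps, key in enumerate(keys, start=1):
        if key == target:
            return steps
    return len(keys)

def binary_search_steps(keys, target):
    """Count probes made by binary search over sorted keys."""
    low, high, steps = 0, len(keys) - 1, 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if keys[mid] == target:
            return steps
        if keys[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps

ids = list(range(1, 1_000_001))  # one million sorted ids
print(linear_scan_steps(ids, 1_000_000))  # worst case for the scan
print(binary_search_steps(ids, 42))
```

Binary search needs at most 20 probes for a million keys. A real B-Tree node holds many keys per page, so the number of page reads is usually smaller still.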
More Complex Example: Range Query
Suppose we have an orders table with a column order_date, and we want to find all orders made in December 2024:
SELECT * FROM orders WHERE order_date BETWEEN '2024-12-01' AND '2024-12-31';
- Without Index:
  - The database performs a full table scan, checking each row.
- With Index:
  - An index on the order_date column speeds up the query by focusing only on the specified range.
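The before-and-after difference can be reproduced in SQLite with Python's `sqlite3` module. In this sketch the `orders` schema, the one-order-per-day data, and the index name `idx_orders_order_date` are assumptions; dates are stored as ISO-8601 text, which sorts chronologically.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)")

# One hypothetical order per day of 2024, stored as ISO-8601 text.
start = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [((start + timedelta(days=d)).isoformat(), 10.0 + d) for d in range(366)],
)

QUERY = ("SELECT * FROM orders "
         "WHERE order_date BETWEEN '2024-12-01' AND '2024-12-31'")

def plan():
    """Return the detail text of each EXPLAIN QUERY PLAN step for QUERY."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]

plan_without = plan()
conn.execute("CREATE INDEX idx_orders_order_date ON orders (order_date)")
plan_with = plan()
december = conn.execute(QUERY).fetchall()

print(plan_without)
print(plan_with)
print(len(december))
```

The query returns the same rows either way; only the access path changes, from visiting all rows to reading the slice of the index that covers December.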
What to read next?
- Database Systems: The Complete Book by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom
- SQL Performance Explained by Markus Winand
- High Performance MySQL by Silvia Botros, Jeremy Tinley, and Baron Schwartz
- PostgreSQL 14 Internals by Egor Rogov
- Designing Data-Intensive Applications by Martin Kleppmann
Conclusion
Indexes are a powerful tool for optimizing database performance. Understanding how they work and applying them effectively helps significantly speed up operations on large datasets. However, it is important to consider their limitations and use indexes wisely to avoid unnecessary resource overhead.
This article does not cover advanced index types such as GiST, SP-GiST, GIN, and BRIN, which are useful for specialized use cases like full-text search, spatial data, and large sequential datasets. You can learn more about these indexes in the PostgreSQL Index Documentation.