Subtleties of SQLite Indexes

emschwartz.me

130 points by emschwartz 2 days ago


ComputerGuru - 2 days ago

Not a great article; I clicked expecting something super technical about SQLite internals and found a mix of RDBMS basics and some misconceptions. The limitations in the blog post aren't really specific to SQLite (for the most part); they're just how indexes (indices) and database engines work across the board. And some of the things phrased as "SQLite [can't] do this" are things that wouldn't make sense to do in the first place.

If you understand what (multi-column) indexes are at the lowest level (i.e. what data structure they represent, how they are used by the database, what the code reading them would look like), then all of this makes immediate and natural sense. Indexes aren't magic. They're just a shortcut to help the db engine find your data more effectively.

This doesn't require a super complex understanding of data structures, btw. The same limitations and advice would apply if you were trying to, for example, search a stack of resumes sorted by one or more attributes. If you have them sorted by position, years of experience, and applicant's last name, you wouldn't be able to quickly grab all resumes for the HR Manager position that have a last name starting with J - after all, within each position they're sorted by years of experience before last name! It's physically not possible! You can only use the first part of that index, the position, to jump to the HR Manager resumes; then you would need to manually go through each years-of-experience subgroup one by one to grab only the ones starting with J (if any).
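
To make the analogy concrete, here is a rough SQL sketch (the table, column, and index names are invented for illustration):

    -- One composite index, sorted left to right.
    CREATE TABLE resumes (
        position  TEXT,
        years_exp INTEGER,
        last_name TEXT
    );
    CREATE INDEX resumes_idx ON resumes (position, years_exp, last_name);

    -- Can use the index: the leftmost column is constrained,
    -- so the engine jumps straight to the 'HR Manager' block.
    SELECT * FROM resumes WHERE position = 'HR Manager';

    -- Only partially helped by the index: it can seek to 'HR Manager',
    -- but last_name is not sorted within that block (it is only sorted
    -- within each years_exp subgroup), so the remaining rows still have
    -- to be checked one by one.
    SELECT * FROM resumes
     WHERE position = 'HR Manager' AND last_name LIKE 'J%';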

porridgeraisin - 2 days ago

> multiple indices

Yes, sqlite uses only one index per table in the FROM clause... except when it can use the OR optimization [1]

> Left to right, no skipping, stops at the first range

I don't know if we need a soundbite for this, nor is it sqlite specific. This is a direct consequence of the fact that indexes are basically just sorted lists you can binary search over. They are sorted left to right (you have to choose some order anyhow), meaning all the other orderings cannot avail of the speedup, and thus sqlite chooses not to use it.

Except if the left column of the index is measured to have low cardinality by ANALYZE [2]'s output in the sqlite_stat1 table, in which case it is OK to scan over it and then binary search the right column. More at the skip-scan optimization [3]; a rough sketch follows the links.

[1] https://www.sqlite.org/optoverview.html#evaluating_or_constr...

[2] https://www.sqlite.org/lang_analyze.html

[3] https://www.sqlite.org/optoverview.html#the_skip_scan_optimi...
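
Here is that sketch, covering both points above (table and column names invented; whether the skip scan actually kicks in depends on what ANALYZE records):

    CREATE INDEX items_cat_idx ON items (category, created_at);

    -- Left to right: an equality on category lets the planner binary-search
    -- the index; a constraint on created_at alone cannot use it, because
    -- entries are sorted by category first.
    SELECT * FROM items WHERE category = 'news' AND created_at > '2024-01-01';
    SELECT * FROM items WHERE created_at > '2024-01-01';  -- no usable prefix

    -- Skip-scan: after ANALYZE, if category turns out to have very few
    -- distinct values, SQLite may step through each category value and
    -- binary-search created_at within it instead of scanning everything.
    ANALYZE;
    SELECT * FROM items WHERE created_at > '2024-01-01';

    -- OR optimization: with a second index on lang, an OR of independently
    -- indexable terms can be evaluated as two index lookups unioned together.
    CREATE INDEX items_lang_idx ON items (lang);
    SELECT * FROM items WHERE category = 'news' OR lang = 'en';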

nikeee - 2 days ago

I use the mental model of nested maps for "column order matters". For example, an index on "published, low_quality_probability, lang" is just a Map<date, Map<number, Map<lang, rowId>>> in my mental model. The maps are sorted in the order the index's columns are declared. That explains why column order matters, why you cannot skip columns, and why the index stops helping at a range query.

Just imagine getting a final rowId from these nested maps and you'll see why the index works for some queries and doesn't for others.
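
In SQL terms (schema invented for illustration, column names taken from the comment above):

    CREATE INDEX items_pub_idx
        ON items (published, low_quality_probability, lang);

    -- All three lookups are exact, so every level of the "nested map"
    -- can be navigated directly to a small set of rowids.
    SELECT rowid FROM items
     WHERE published = '2024-06-01'
       AND low_quality_probability = 0.1
       AND lang = 'en';

    -- The range on low_quality_probability selects a span of the middle
    -- map; within that span the lang keys are no longer contiguous, so
    -- the index narrows by published and by the range, but the lang = 'en'
    -- filter still has to be checked entry by entry.
    SELECT rowid FROM items
     WHERE published = '2024-06-01'
       AND low_quality_probability < 0.2
       AND lang = 'en';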

wodenokoto - 2 days ago

The article is a developer's journey into indexes, and not a bad journey or travelogue imho.

Sure, if you are a database expert, it might be disappointing but I enjoyed reading it.

themafia - 2 days ago

You may not need a course. SQLite publishes a very useful guide on the query planner which goes over most of what you need to know with respect to indexes:

https://www.sqlite.org/optoverview.html

crazygringo - 2 days ago

I agree with other commenters that there's nothing new or surprising here [1] if you actually understand how indexes work. Which, if you're working in SQL, you should.

But the main takeaway, I disagree with. The author explains:

> When I first set up Scour's database, I put a bunch of indexes on the items table without really thinking about whether they would help. For example, I had separate indexes on the published date, the language, and the quality rating. Useless. It's more important to have one or a small handful of good composite indexes on multiple columns than to have separate indexes on each column.

Yes, the biggest error is throwing indexes at a table without having the slightest idea whether they're helpful. But the idea that a smaller number of composite indexes is the fix - that's not the answer either.

The answer is to go through every single query and make sure you have an index that matches the query, adding/changing indexes (including composite indexes) or rewriting queries as required.

Indexes don't exist in a vacuum. They are optimizations for specific queries. Proper database design requires knowing exactly what information you will need to look up using which parameters, and designing the tables and indexes to produce that performantly. When you need to look up data in new ways, you often need to add new indexes or rearchitect tables entirely.
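
One concrete way to do that per-query check in SQLite (the query shape here is invented for illustration):

    -- Ask the planner how it would execute each real query.
    EXPLAIN QUERY PLAN
    SELECT * FROM items
     WHERE lang = 'en' AND published > '2024-01-01'
     ORDER BY published DESC;

    -- Output along the lines of "SEARCH items USING INDEX ..." means an
    -- index is being used; "SCAN items" means a full table scan, i.e. that
    -- query needs a new/changed index or a rewrite.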

[1] Except that the partial index match condition treats 0.9 and .9 as different. That is unexpected to me, but it is kind of in the docs at "The terms in W and X must match exactly": https://www.sqlite.org/partialindex.html
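
A sketch of that footnote (column names invented); the WHERE clause on the index and the one in the query have to match term for term:

    CREATE INDEX items_hq_idx ON items (published)
        WHERE quality_rating >= 0.9;

    -- Written the same way as the index's WHERE clause: the partial
    -- index can be used.
    SELECT * FROM items
     WHERE quality_rating >= 0.9 AND published > '2024-01-01';

    -- Per the footnote above, spelling the literal differently (.9 vs 0.9)
    -- can be enough for the planner to decide the index does not apply.
    SELECT * FROM items
     WHERE quality_rating >= .9 AND published > '2024-01-01';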

Ameo - 2 days ago

The main takeaway from this for me is that SQLite’s query planner seems to be pretty limited. It’s reliant on stuff like the order in which WHERE conditions are specified, isn’t able to use multiple indexes in queries in many cases, bails out to scans when a variety of different operations show up in queries, etc.

It might be the case that SQLite has a simpler or less sophisticated query planner than other databases like Postgres or MariaDB, but in my experience those DBs struggle a lot with good query planning as well. I’ve spent many hours in the past on issues like Postgres suddenly starting to ignore an index entirely because its computed table data distribution statistics got out of balance, or having to add manual annotations to MariaDB queries like STRAIGHT_JOIN in order to get a query to run faster.

I’m guessing that this is a really hard problem since it doesn’t seem to be really “solved” by any major DB vendor I’ve seen. A lot of modern DB engines like ClickHouse tend to just work around this problem by being so fast at full table scans that they don’t even need any sophisticated indexing set up at all.

kccqzy - 2 days ago

> It's worth being careful to only add indexes that will be used by real queries.

This reminds me of a technique used by the Google App Engine SDK a long time ago, before it was called cloud. Basically, in development mode the SDK captures the kinds of queries you make, and then automatically adds any index that would speed up those queries to a configuration file. You then deploy with this configuration file, which tells the production data store to create those indexes. I thought this was a genius idea. Not sure if something similar exists for SQLite.

rtpg - 2 days ago

Postgres is pretty good about using index data from multiple single column indices... if you don't order results! Otherwise the planner will get in the way and do nonsense.

A bunch of people spend years studying computer science and trees etc, and then when we actually need to care about that stuff the databases absolutely do not want us to declare the plan. Very annoying.

Think about how people would be guided to do the right thing on indices if, when dealing with bad performance, they had to say "go along this index to get my data" in a "planned mode". So many data layout and index issues would become immediately obvious. You wouldn't magically get as many benefits from DB updates, granted.
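
For what it's worth, SQLite itself has a narrow version of this: INDEXED BY pins a query to a named index and fails query preparation if that index can't be used, which is handy for exactly this kind of debugging. A small sketch, with invented table and index names:

    -- Fail loudly if the named index can't be used, instead of silently
    -- falling back to a different plan.
    SELECT * FROM items INDEXED BY items_pub_idx
     WHERE published > '2024-01-01';

    -- Or forbid index use entirely, to compare against a plain scan.
    SELECT * FROM items NOT INDEXED
     WHERE published > '2024-01-01';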

garaetjjte - 2 days ago

One thing I'm missing in SQLite is multi-valued indexes. I would want to have an index on func_returning_list(some_column), and a query doing WHERE x IN func_returning_list(some_column) should be able to use that index.