A SQL Heuristic: ORs Are Expensive

ethanseal.com

164 points by ethanseal 3 days ago


magicalhippo - 2 days ago

We had a case where a single OR was a massive performance problem on MSSQL, but not at all on the Sybase SQLAnywhere we're migrating away from. Which one might consider slightly ironic, given the origins of MSSQL...

Anyway, the solution was to manually rewrite the query as a UNION ALL of the two cases, which was fast on both. I'm still annoyed, though, that MSSQL couldn't just have done that for me.
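For the curious, a minimal sketch of that rewrite (the table, columns, and data are made up for illustration, run against SQLite rather than MSSQL). The second branch excludes rows already matched by the first, so the union stays duplicate-free without paying UNION's sort/dedup cost:

```python
import sqlite3

# Hypothetical schema: two independently indexed columns, where
# "customer_id = ? OR status = ?" can defeat a single-index plan
# on some engines.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    CREATE INDEX idx_customer ON orders (customer_id);
    CREATE INDEX idx_status ON orders (status);
    INSERT INTO orders (customer_id, status) VALUES
        (1, 'open'), (1, 'closed'), (2, 'open'), (3, 'closed');
""")

# Original query with OR:
with_or = conn.execute(
    "SELECT id FROM orders WHERE customer_id = 1 OR status = 'open'"
).fetchall()

# Manual rewrite as UNION ALL; the second branch excludes rows the
# first branch already returned. (Note: "customer_id <> 1" would also
# drop NULL customer_ids, which matters if the column is nullable.)
with_union = conn.execute("""
    SELECT id FROM orders WHERE customer_id = 1
    UNION ALL
    SELECT id FROM orders WHERE status = 'open' AND customer_id <> 1
""").fetchall()
```

Each branch can then use its own index, which is exactly the plan the commenter wishes MSSQL had chosen on its own.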

boramalper - 2 days ago

I find MongoDB's ESR (Equality, Sort, Range) Guideline[0] quite helpful in that regard; it applies to SQL databases as well, since nearly all of them use B-trees too.

> Index keys correspond to document fields. In most cases, applying the ESR (Equality, Sort, Range) Guideline to arrange the index keys helps to create a more efficient compound index.

> Ensure that equality fields always come first. Applying equality to the leading field(s) of the compound index allows you to take advantage of the rest of the field values being in sorted order. Choose whether to use a sort or range field next based on your index's specific needs:

> * If avoiding in-memory sorts is critical, place sort fields before range fields (ESR)

> * If your range predicate in the query is very selective, then put it before sort fields (ERS)

[0] https://www.mongodb.com/docs/manual/tutorial/equality-sort-r...
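A hypothetical illustration of the ESR ordering, using SQLite and invented column names: status is the equality field, created_at the sort field, amount the range field, so the compound index lists them in that order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY,
        status TEXT,        -- Equality predicate
        created_at INTEGER, -- Sort key
        amount REAL         -- Range predicate
    );
    -- ESR: Equality first, then Sort, then Range.
    CREATE INDEX idx_esr ON payments (status, created_at, amount);
    INSERT INTO payments (status, created_at, amount) VALUES
        ('paid', 100, 5.0), ('paid', 200, 50.0),
        ('paid', 300, 7.5), ('void', 150, 9.9);
""")

# The equality on status pins a contiguous index range in which
# created_at is already sorted, so the ORDER BY needs no extra sort;
# the trailing amount column is filtered within that range.
rows = conn.execute("""
    SELECT created_at, amount FROM payments
    WHERE status = 'paid' AND amount < 10
    ORDER BY created_at
""").fetchall()
```

Running EXPLAIN QUERY PLAN on that SELECT should show the index being used without a temp B-tree for the ORDER BY, which is the in-memory sort the ESR ordering is meant to avoid.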

ulrikrasmussen - 2 days ago

I am split on SQL. On one hand I love the declarative approach and the fact that I can improve the run-time complexity of my queries just by adding an index and leaving the queries as-is. On the other hand, I hate how the run-time complexity of my queries can suddenly go from linear to quadratic when the statistics are out of date and the query planner misjudges the number of rows returned by a complex sub-query, falling back to a sequential scan instead of an index lookup.

I would be interested in a query execution language where you are more explicit about the use of indexes and joins, and where a static analysis can calculate the worst-case run time complexity of a given query. Does anyone know if something like that exists?

Glyptodon - 2 days ago

If optimization is really as simple as applying De Morgan's laws, surely it could be done within the query planner if that really is the main optimization switch? Or am I misreading this somehow?

Edit: I guess the main difference is that it's just calculating separate sets and then merging them, which isn't really De Morgan's, but a calculation approach.

ntonozzi - 2 days ago

I really like the extension pattern. I wish more of the tables at my company used it.

Another huge benefit that we're realizing as we move some of our heaviest tables to this pattern is that it makes it really easy to index a core set of fields in Elasticsearch, along with a tag to associate it with a particular domain model. This has drastically cut down on our search-specific denormalization and let us avoid expensive index schema updates.

prein - 2 days ago

This sort of thing is why it's so important to look at the generated SQL while developing, instead of just trusting the ORM to write good queries.

I find query planning (and databases in general) to be very difficult to reason about, basically magic. Does anyone have some recommended reading or advice?

hans_castorp - 2 days ago

Not sure on which Postgres version this was tested with, but the first example runs in about 2ms with my Postgres 17 installation ("cold cache"). It uses a BitmapOr on the two defined indexes.

https://notebin.de/?5ff1d00b292e1cd5#AU4Gg8hnY6RAmS9LoZ18xWn...

This used the setup.sql from the linked GitHub repository.

VladVladikoff - 2 days ago

I absolutely adore LLMs for SQL help. I’m no spring chicken with SQL, but so many times now I’ve taken a poorly optimized query, run it with ‘explain’ in front of it, and dumped it into an LLM asking to improve performance, and the results are really great! Performance is vastly improved, and I have yet to see it make a single mistake.

Animats - 2 days ago

The query optimizer knows how many items are in each index, but has no advance idea how many items will be in the result of a JOIN. An "a OR b" query on a table with millions of rows might have three hits on a, or millions of hits. The optimal query strategy for the two cases is very different.

Has anyone put machine learning in an SQL query optimizer yet?

jalk - 2 days ago

Can we agree that this only applies to queries where all the filter conditions use indexed columns? If no indexes can be used, a single full table scan with OR is surely faster than multiple full table scans.

andersmurphy - a day ago

SQLite does this automatically [1] https://www.sqlite.org/optoverview.html#or_optimizations
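A quick way to see the OR optimization from Python (table and column names are invented; the exact plan text varies across SQLite versions):

```python
import sqlite3

# With separate indexes on a and b, "a = ? OR b = ?" should be planned
# as a union of index lookups rather than a full-table scan.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER);
    CREATE INDEX idx_a ON t (a);
    CREATE INDEX idx_b ON t (b);
""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1 OR b = 2"
).fetchall()

# The detail string is the last column of each plan row; on recent
# versions it mentions the indexes (and often "MULTI-INDEX OR").
details = " ".join(row[-1] for row in plan)
```

In other words, SQLite performs the OR-to-union rewrite internally, which is the transformation the article suggests doing by hand.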

galkk - 2 days ago

I strongly dislike the way the problem is presented and the “solution” is promoted. The author mentions a merge join with a count on top, and if the database supports index merges, it can be extremely efficient in the described scenario. There are a lot of real optimizations that can be baked into such merges that the author chooses to ignore.

The generalized guidance, without even mentioning a database server as a baseline, and without showing plans and/or checking whether the database can be guided, is just bad teaching.

It looks like the author discovered a trick and tries to hammer everything into it by overgeneralizing.

ape4 - 2 days ago

The OR query certainly states user intent better than the rewrite, so it's better at communicating with a future programmer.

sgarland - 2 days ago

As an aside, MySQL will optimize WHERE IN better than OR, assuming that the predicates are all constants, and not JSON. Specifically, it sorts the IN array and uses binary search.

That said, I’m not sure if it would have any impact on this specific query; I’d need to test.
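The idea behind that IN-list optimization can be shown in miniature (this is an illustration of the technique, not MySQL's actual internals): sort the constant list once, then probe each row's value with binary search instead of walking an OR chain.

```python
from bisect import bisect_left

# Sort the constant IN-list once up front...
in_list = sorted([42, 7, 19, 3, 88])

def in_set(value, sorted_list):
    """Membership test via binary search: O(log n) per probe,
    versus O(n) for evaluating an equivalent OR chain."""
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

# ...then each row's value is probed against the sorted list.
hit = in_set(19, in_list)
miss = in_set(20, in_list)
```

This only works because the predicates are constants known before execution starts, which is exactly the condition the comment calls out.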

cyanydeez - 2 days ago

Yeah, just watch Community. Every OR splits the universe, duh.

to11mtm - 2 days ago

If ORs are expensive, you've probably got a poor table design, or the ORs are for some form of nasty query that is more UI/dashboard based.

A good table has the right indexes, and a good API for the table uses the RIGHT indexes in its criteria to get a good result.