Diff Algorithms

flo.znkr.io

270 points by znkr a day ago


o11c - a day ago

There are at least 3 fundamentally different kinds of diff:

* Single-dimensional. Diffs of text lines are just this.

* Multi-dimensional. Diffs of words or characters are usually going to be this since lines still matter, but there are multiple approaches (line-first? weighted tokens?).

* Tree-based. Unfortunately, these are woefully scarce and poorly documented.

For text diffs, it's nontrivial to get the "missing newline at end of file" logic working.

For tree diffs, consider that for HTML something like `<p>x</p><p>y</p>` should be unmergeable, whereas `<b>x</b><b>y</b>` should be mergeable.

(Aside: the blind promotion of `<em>` and `<strong>` did great harm to the notion of semantic HTML. Most things people use italics for (book titles, thoughts, foreign words) are explicitly things that `<em>` should not be used for.)

dekhn - a day ago

The creator of the Myers algorithm is Gene Myers. He also helped create the BLAST algorithm, one of the fastest and most important DNA and protein search algorithms, and also implemented most of the original human genome assembly done by Celera. he also helped invent and publish the suffix array.

therealfigtree - 14 hours ago

Not to critique on the article, but as a general suggestion, it would be great if people don't use emojis as a scale. Unless it shows a specific character like number emojis, they are sometimes read differently by people from different language backgrounds. So what the author might be trying to say might lead to misunderstandings.

yboris - a day ago

Mildly related: my favorite tool for viewing .git diffs diff2html - a CLI that with one command opens the diff in your browser

https://diff2html.xyz/ -- https://github.com/rtfpessoa/diff2html

teabee89 - a day ago

A while ago I discovered the lesser known Tichy diff algorithm that seems to preserve more context and is better suited for big code refactors compared to Myers: https://www.researchgate.net/publication/220439403_The_Strin... via https://bryanpendleton.blogspot.com/2010/04/more-study-of-di...

rs186 - a day ago

Great work. I was just recently dealing with creating diff in Go and faced the same problem of finding a good library. Some diffmatchpatch APIs expect/return escaped texts, and I was screaming why would you do that??? Why doesn't the library just return raw strings? I ended up using diffmatchpatch to get patch objects and then produced unified diffs with some vibecoding. I'll definitely try this out when I revisit this.

PS regarding readability, I think VSCode put a lot of effort into creating nice-to-read diffs (e.g. https://code.visualstudio.com/updates/v1_81#_diff-editor), some of which is done in the algorithm itself (https://code.visualstudio.com/updates/v1_78#_diff-algorithm-...). But apparently that's in TypeScript, and not all heuristics done there for an editor is suitable to be in a generic diff algorithm. Still, there might be something worth exploring.

linsomniac - 3 hours ago

My favorite differ in more than trivial cases is "vimdiff". Just hit a case where I wished I had "git vimdiff".

shae - 7 hours ago

For source code diffs where a tree sitter grammar exists, difftastic is the best choice by far. It's better than you think it is.

https://github.com/Wilfred/difftastic

No really, if you haven't tried it, it's better than you think it is.

https://www.scannedinavian.com/tools-built-on-tree-sitters-c...

(I know, already mentioned later in comments by leeoniya, still deserves a top level comment!)

h4ch1 - 13 hours ago

What's a SoTA diffing algorithm for diffing of a deeply nested JSON structure for example?

I am currently researching and have built my own naive implementation for diffing a dense tree and the performance... isn't very great.

The ideal outcome is to create patches by diffing different versions of the JSON object, then being able to apply said patches anywhere in the "commit" tree.

jFriedensreich - 19 hours ago

while work on pure algorithms is invaluable i always feel work on knowledge augmented algorithms has lots of untapped potential. two examples: recording key events like move and delete on a more fine grained timescale or directly from editors and then storing those as mutable metadata in commits that is only allowed to be used for diff generation. as its provable if diffs are technically correct these do not weaken the consistency guarantees while adding helpful context. they are also highly compressable and pruneable. another one is optimizing diffs for consumption by llms and let those generate for optimal human readability.

ashu1461 - a day ago

Apart from source code versioning what are the other most important real world use cases of diff algorithms ?

karteum - 21 hours ago

I wish that the diff/patch would be able to better take into account moved data (not only as deletion+add but with with a proper semantic indicating the moved block). This would both lead to smaller and more readable patches.

I noticed that some peoble have worked on such an algorithm, e.g. https://en.wikipedia.org/wiki/User:Cacycle/diff

fiddlerwoaroof - 21 hours ago

I wonder about the importance of minimality: it itself seems like a heuristic for “interest” or some other thing that users of diffs actually care about.

For example, a diff like:

    + x += 1
    - x -= 1
Seems almost useless: it doesn’t provide any context about the meaning of x and, as a result, nearly every source review tool provides unchanged line in addition to highlighting the change. And, even then, by preventing comments on arbitrary lines of the file, GitHub’s code review makes it pretty hard to call out other relevant code.
dvcoolarun - 10 hours ago

I recently learned that diff algorithms (beyond just code comparison) are used to detect database schema and configuration drift. I thought that was a pretty intriguing application!

jeisc - 11 hours ago

I got Beyond Compare and never looked back

amai - 14 hours ago

See also

Nugroho (2019): How Different Are Different diff Algorithms in Git?

https://arxiv.org/abs/1902.02467