Git is Elegant

Enough with the fear of the bad rebase. Let’s learn how git works under the hood.

Files and Folders

For each of our source files, a SHA-1 hash is created from a short header (the word “blob” and the file’s size) followed by the file’s complete contents, and then that same data is compressed with zlib. The hash becomes git’s “filename” for that compressed version.
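As a minimal sketch of that hashing step, the Python below reproduces what git does for a single file. The function name git_blob is ours, not git’s. The one detail worth noticing is that git hashes a small header, "blob <size>" plus a zero byte, ahead of the contents.

```python
import hashlib
import zlib
from typing import Tuple


def git_blob(content: bytes) -> Tuple[str, bytes]:
    """Return (git filename, compressed bytes) for a file's contents."""
    # Git prepends a small header, "blob <size>\0", before hashing.
    store = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
    object_id = hashlib.sha1(store).hexdigest()
    # The same header-plus-contents data is what gets zlib-compressed.
    return object_id, zlib.compress(store)


object_id, compressed = git_blob(b"hello world\n")
print(object_id)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the name is derived only from the contents, two identical files anywhere in the project share one stored blob.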

For each folder in our project, a tree file is created. Viewed with git cat-file -p, it reads as a text file of four columns, one row per file or sub-folder (on disk it is stored in a more compact binary form). The four columns, in order, are: the file system permissions as an octal mode; a type which specifies whether the row represents a file (“blob”) or a sub-folder (“tree”); the “git filename” hash; and the user’s original filename/foldername.

100644 blob 0155eb4229851634a0f03eb265b69f5a2d56f341 README.md
040000 tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579 src
100644 blob fa49b077972391ad58037050f2a75f74e3671e92 tsconfig.json
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a package.json

This tree file is itself treated like a source file: it is compressed and then named by the hash of its own contents. This allows tree files to recurse just like folders do.

All of the above live in the /.git/objects/ folder.

Commits

Most tree files are pointed to by other tree files, but the top-most tree is pointed to by a commit. A commit is a text file with five pieces of information: the “git filename” of a top-most tree file; the parent (previous, older) commit; the author and the committer (from the user.name and user.email git config settings); a datetime for each (as seconds since the epoch of 1 January 1970, midnight UTC, followed by a timezone offset); and a commit message following a blank line. The very first commit has no parent line, and a merge commit has more than one.

tree a1e672eba0d552e704f294625bab398f0e4bef70
parent 5938f809b156e743aa47075befb5ff414ad791ff
author Ron Newcomb <pscion@yahoo.com> 1713826102 -0700
committer Ron Newcomb <pscion@yahoo.com> 1713826102 -0700

wip – it still doesn’t build
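The commit layout above can be sketched in a few lines of Python. The function git_commit and the sample author are illustrative assumptions. For simplicity the sketch uses the same identity and time for both author and committer, which real commits need not do.

```python
import hashlib
from typing import List, Tuple


def git_commit(tree: str, parents: List[str], who: str,
               timestamp: int, tz: str, message: str) -> Tuple[str, str]:
    """Return (git filename, commit text) for the given fields."""
    lines = ["tree " + tree]
    # Zero parents for a first commit, one normally, several for a merge.
    lines += ["parent " + parent for parent in parents]
    lines.append("author {} {} {}".format(who, timestamp, tz))
    lines.append("committer {} {} {}".format(who, timestamp, tz))
    # A blank line separates the header fields from the message.
    text = "\n".join(lines) + "\n\n" + message + "\n"
    data = text.encode("utf-8")
    store = b"commit " + str(len(data)).encode("ascii") + b"\x00" + data
    return hashlib.sha1(store).hexdigest(), text


commit_id, text = git_commit(
    "a1e672eba0d552e704f294625bab398f0e4bef70",
    ["5938f809b156e743aa47075befb5ff414ad791ff"],
    "A Coder <coder@example.com>",  # hypothetical identity
    1713826102, "-0700", "wip")
print(text)
```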

Most commits are pointed to by the next-most recent commit. In other words, each commit’s pointer points backward in time. The newest commit points to the second-newest commit. But what points to the newest commit?

Branches

The newest commit is pointed to by a branch. A branch is a text file in /.git/refs/heads/ whose filename is the branch name and whose content is just the “git filename” of that newest commit.

~/myproject/.git/refs/heads> cat my-temporary-branch
6862ff2bdfdc4953f86cd3ba345e058a01cc0952

So far, we have branches which point to commits which point to (other commits and) tree files which point to (other tree files and) compressed file blobs.

The Word “HEAD”

Some coders consider the “head” of a linked list to be the oldest entry, where newer entries are appended to the “tail” end. An example is a list containing ABCD where A is the oldest entry and the “list pointer” points at A, and D points to nothing. But in git, as in functional programming, it’s the other way around. The same list is represented as DCBA, where the “pointer” points at the newest entry D while A points at nothing. This arrangement allows “the older part of the list” to remain immutable.

For example, if an unknown, second pointer points at DCBA, and the first pointer prepends E to the list, the first pointer can point at E and therefore EDCBA while the second pointer still points at D and hence DCBA. From the second pointer’s point of view the list hasn’t been modified out from underneath itself. The list DCBA is still there completely untouched and unaware of E.
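The two-pointer example can be tried out directly. This is only a toy model with made-up names (Node, history). It uses single letters instead of commits, but the shape matches: each node points backward at the older one and is never modified after creation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    label: str
    older: Optional["Node"]  # points backward in time, like a parent commit


def history(node: Optional[Node]) -> str:
    """Walk from the given node back to the oldest entry."""
    labels = []
    while node is not None:
        labels.append(node.label)
        node = node.older
    return "".join(labels)


a = Node("A", None)
b = Node("B", a)
c = Node("C", b)
d = Node("D", c)

first = d
second = d
first = Node("E", first)  # prepend E; no existing node is touched

print(history(first))   # EDCBA
print(history(second))  # DCBA
```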

With that in mind, the reason branches are stored in a folder called /heads/ might make more sense. Each file is named for a branch and points at the head of a list of commits, DCBA. An older branch on the same path can point to its older commit C, and it still sees the same CBA history as it always did, even as the first branch adds commits over time to become FEDCBA.

HEAD as a Cursor

So, one definition of the word HEAD is the most recent commit in a string of commits. But the other definition, “the” head, is more like a cursor or current pointer. This is the concept of the current working branch, the head. There is a singular file, /.git/HEAD, which is a one-line text file that points at a branch.

ref: refs/heads/my-temporary-branch

Think of the file HEAD as a cursor. If we’re on branch main then the HEAD file has the path to the file refs/heads/main. Switching to branch my-temporary-branch means changing the /.git/HEAD file contents to ref: refs/heads/my-temporary-branch.

Also, HEAD can point directly to a commit instead of indirectly through a branch. This is called the detached HEAD state. For example, git checkout HEAD^^ will point HEAD to the commit located two commits back from wherever HEAD currently points. Quite possibly no branch or tag in the whole system points directly at this commit. But we can still create a new branch to point to it, and create commits that point to that commit. It’s just a directed acyclic graph after all.
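Following the cursor by hand looks roughly like the sketch below. Both function names are ours, and make_toy_git_dir only builds a throwaway folder shaped like .git for the demonstration. The sketch assumes every branch is a plain file under refs/heads, which is a simplification because real repositories can also keep refs in a packed-refs file.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_head(git_dir: Path) -> Tuple[Optional[str], str]:
    """Return (ref path or None, commit hash) for the given .git folder."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        # Follow the cursor to the branch file, which holds the commit hash.
        return ref, (git_dir / ref).read_text().strip()
    # Detached HEAD: the file holds a commit hash directly.
    return None, head


def make_toy_git_dir(root: Path, head_contents: str) -> Path:
    """Create a throwaway folder laid out like .git, for demonstration only."""
    git_dir = root / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "my-temporary-branch").write_text(
        "6862ff2bdfdc4953f86cd3ba345e058a01cc0952\n")
    (git_dir / "HEAD").write_text(head_contents)
    return git_dir


with tempfile.TemporaryDirectory() as tmp:
    toy = make_toy_git_dir(Path(tmp), "ref: refs/heads/my-temporary-branch\n")
    print(resolve_head(toy))
```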

Tags and Annotations

As an aside, there are tags, which come in two types. The lightweight tag is effectively identical to a branch but with the expectation that it is immutable: it never changes which commit it points to. Heavyweight (annotated) tags point to an annotation, which then points to the commit. Annotations are text files that hold the hash of the commit they point to, the tag name, the tag author, the tag creation datetime, and a tag message. Tags are stored in /.git/refs/tags/ and the tag name is also the filename.

object 6862ff2bdfdc4953f86cd3ba345e058a01cc0952
type commit
tag v2.0
tagger Ron Newcomb <pscion@yahoo.com> 1714360500 -0700

typescript added and builds without error; can now turn on ci/cd checks

File System Allowances

As nice as the above setup is, there are two problems with using it on most operating systems’ file systems. First, the above scheme means the folder /.git/objects gets a very large number of files in it very quickly. Most general-purpose file systems do not perform well with that. Second, a zlib-compressed text file is very small, sometimes just a hundred bytes, and there will be a lot of these very small files. Most general-purpose file systems do not perform well with that, either. So git does two things to mitigate these problems.

For the first problem, the first two characters of the “git filename” are broken off to become the name of a subfolder. So instead of /.git/objects/bd37ea4bf8876c306771c8666ec92f4d582135bd we’ll have /.git/objects/bd/37ea4bf8876c306771c8666ec92f4d582135bd. (Note the added slash between objects/bd and 37ea.) This helps bring down the number of items directly contained in /objects to a reasonable number.
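The split is simple enough to write out. The helper name loose_object_path is an assumption for illustration. The path is relative to the project folder.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str) -> PurePosixPath:
    """Map a 40-character hash to its location under .git/objects."""
    # The first two hex characters name the subfolder, the other 38 the file.
    return PurePosixPath(".git/objects") / object_id[:2] / object_id[2:]


print(loose_object_path("bd37ea4bf8876c306771c8666ec92f4d582135bd"))
# .git/objects/bd/37ea4bf8876c306771c8666ec92f4d582135bd
```

Two hex characters give at most 256 subfolders, so the files are spread across them rather than piling up in one folder.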

For the second problem there are packfiles. Once a set of those tiny compressed files is old enough and numerous enough, git will concatenate them into a larger file, called a packfile, and maintain a corresponding index file to remember which file begins where. When packing, git can also use diffs if two files are very similar, further saving space at the cost of increasing the time needed to access a particular old version of a file.
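To make the idea concrete, here is a deliberately simplified toy. It is not git’s real packfile or index format, and it leaves out the diffing entirely. It only shows the principle of one big file plus an index of where each object starts. The names toy_pack and toy_read are invented for this sketch.

```python
import hashlib
import zlib
from typing import Dict, List, Tuple

Index = Dict[str, Tuple[int, int]]  # hash -> (offset, length) within the pack


def toy_pack(contents: List[bytes]) -> Tuple[bytes, Index]:
    """Concatenate compressed blobs and remember where each one begins."""
    pack = b""
    index: Index = {}
    for content in contents:
        store = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
        object_id = hashlib.sha1(store).hexdigest()
        chunk = zlib.compress(store)
        index[object_id] = (len(pack), len(chunk))
        pack += chunk
    return pack, index


def toy_read(pack: bytes, index: Index, object_id: str) -> bytes:
    """Look up one object by hash and return its original contents."""
    offset, length = index[object_id]
    store = zlib.decompress(pack[offset:offset + length])
    return store.split(b"\x00", 1)[1]  # drop the "blob <size>" header


pack, index = toy_pack([b"first file\n", b"second file\n"])
for object_id in index:
    print(object_id, toy_read(pack, index, object_id))
```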

These mitigations don’t fundamentally change any of the concepts in play. They are merely optimizations that the git tools already step around for us.

Losing Work

By now we understand that git rarely deletes anything. Every modified file is hashed, compressed, and stored in its entirety, every time, with no direct reference to its older version. “Diffs” are not stored except for much older work in packfiles, and maybe not even then. But, sometimes a rebase goes wrong, and work goes missing. What now?

The first thing to remember is that our work’s missing modifications are still in git somewhere, but they are almost definitely orphaned. Although git does delete orphans eventually, by default they must be at least a few weeks old, so last week’s rebase nightmare can still be overcome.

The second thing to remember is that, due to the above immutable-full-content-hash way that git works, if we have 4 commits in a row DCBA, and we squash CB into X, DXA can’t just magically happen. Commit D still has commit C as its parent, so it must be “moved” so commit X is its new parent. (This is rebasing: giving it a new “base” to work from.) But if we move commit D to a different parent, then line two of the text file representing D is changed, which changes its contents and therefore its hash and therefore its git filename. (So technically speaking, rebasing D doesn’t actually change D but creates a new commit Y.)
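A small experiment shows why D cannot simply be re-pointed. In this sketch the function commit_id, the identity, and all three hashes are placeholders invented for the demonstration. The only thing that differs between the two calls is the parent line.

```python
import hashlib


def commit_id(tree: str, parent: str, message: str) -> str:
    """Hash a commit's text the way git names a commit file."""
    text = (
        "tree {}\n"
        "parent {}\n"
        "author A Coder <coder@example.com> 1713826102 -0700\n"
        "committer A Coder <coder@example.com> 1713826102 -0700\n"
        "\n"
        "{}\n"
    ).format(tree, parent, message)
    data = text.encode("utf-8")
    store = b"commit " + str(len(data)).encode("ascii") + b"\x00" + data
    return hashlib.sha1(store).hexdigest()


TREE = "a" * 40      # placeholder: D's files, unchanged by the move
PARENT_C = "c" * 40  # placeholder hash standing in for commit C
PARENT_X = "d" * 40  # placeholder hash standing in for squashed commit X

commit_d = commit_id(TREE, PARENT_C, "commit D")
commit_y = commit_id(TREE, PARENT_X, "commit D")
print(commit_d)
print(commit_y)
print(commit_d != commit_y)  # True: same files and message, different commit
```

The original D is still in the object store under its old name, which is why it can be found again after a rebase goes wrong.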

Anyway, when we compare a branch-to-be-rebased to the new main, we’ll see all of the changes in CB on both branch and main, and they conflict, and it’s very confusing.

The first trick comes at the start of the interactive rebase, git rebase -i main, where we see a screen like this.

pick c6a71d7 commit B
pick 898a9b5 commit C
pick c6729d2 commit D

We can change pick to s for squash (or not), but in the case of CB already being squashed and merged into main as commit X, we actually want to completely delete the lines for those commits, leaving only our commit(s) for D, like so.

pick c6729d2 commit D

After saving the file, most times, the rebase will immediately succeed. We completely deleted two commits from our branch, and it improves things remarkably.

On the other hand, if we didn’t already know this and are now in the thick of it with detached HEAD states and a boatload of conflicts, try git rebase --abort and start over, knowing this. While we can’t completely avoid conflicts, merely understanding the linked-list nature of commits, the branches-as-named-pointers that point to a few of them, and why squashing rather than rebasing is causing the issue, can help give us a clear idea of why it’s all going pear-shaped. That is always more powerful than a magical incantation to recover orphaned commits. (But for expediency: create a new backup branch pointing to the same commit we’re about to rebase so nothing gets orphaned, or run git log --graph --reflog afterward to browse around for orphans.)

Gamers Need Not Apply

Git is specialized for text files and immutability. The one thing that git does not do well is incremental changes to large binary files. Binary doesn’t compress nearly as well as text does, and git’s philosophy of whole-content hashing and storage doesn’t work well for incremental changes to large binaries. So images, videos, skeletal animations, tessellated meshes, audio samples and so on would take a very large amount of disk space and time under git’s implementation. This is one reason the videogame industry tends to use Perforce over git: Perforce can diff a file against its old version and store just the latest copy plus a reverse-diff to re-create the older version. Older source control systems also used a diff philosophy, and some poorer ones stored forward diffs from the original, with occasional checkpoints. This is obviously slower, and very brittle: one stray bit flip on disk can instantly corrupt large portions of a repository downwind of itself. Combine that with the tendency of older source control systems to give a local machine only the minimum it needs for a branch, and one can see why they failed so often and have been replaced so totally by git.
