Git Internals

Introduction

There are many version control systems, but git is undoubtedly the most popular and most widely used, thanks in part to online social platforms such as GitHub and GitLab.

Yet it is a tool that is still vastly misunderstood and feared. In this post I aim to take a look at some of the internal moving parts of git, primarily what’s inside the .git directory (including its various subdirectories and files).

My hope is that by better understanding how git works, and the concepts it is built upon, readers will feel more empowered and confident when working with git (especially when they have issues and would normally be unsure of what to do).

Note: this article isn’t an introduction to git, and does presume that the reader is familiar with (i.e. a user of) git.

General Concept

I wanted to take a quick moment just to clarify the terminology associated with the general concepts of how git works (so we’re all on the same page):

  • Working Directory: your project files.
  • Staging Area: a file (the ‘index’) that records which changes will go into your next commit.
  • Repository: the .git directory, where the committed history of your project is stored.

Note: these bullet points are just summaries, but I would like to expand on them slightly: your ‘working directory’ can change depending on which ‘version’ of the project you have ‘checked out’ from the git repository (i.e. this is what happens when you change your ‘branch’ with git checkout <branch_name>).

So for example, commands like git add will copy objects from the working directory into the staging area (aka the ‘index’), while git reset will remove objects from the staging area.

A command such as git diff compares your working directory to your staging area, while using the --staged flag will change this behaviour such that git will compare your staging area to your actual repository state.
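
To make the three areas a little more concrete, here is a toy model in Python. It is only a sketch under loud assumptions: each area is a plain dictionary of path to content (real git stores object hashes, not content), the model only contains tracked files, and the function name changed_paths is made up for this illustration.

```python
# Toy model only: each area is a dict mapping path -> file content.
# Real git stores hashes of objects, and treats untracked files specially.

def changed_paths(old, new):
    """Return the sorted paths whose content differs between two areas."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


repository = {"foo.txt": "foo\n"}   # the last committed snapshot
index = dict(repository)            # the staging area starts out identical
working_dir = dict(repository)      # and so does the checked-out copy

working_dir["foo.txt"] = "foo v2\n"             # edit the file on disk
unstaged = changed_paths(index, working_dir)    # what `git diff` looks at
staged = changed_paths(repository, index)       # what `git diff --staged` looks at
print(unstaged, staged)                         # ['foo.txt'] []

index["foo.txt"] = working_dir["foo.txt"]       # roughly `git add foo.txt`
print(changed_paths(index, working_dir))        # []
print(changed_paths(repository, index))         # ['foo.txt']
```

The edit shows up in the first comparison until it is staged, at which point it moves over to the second comparison.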

Subcommands: Porcelain and Plumbing

The git version control system wasn’t initially designed to be a user-friendly interface, and so alongside the more commonly used subcommands are commands that can carry out very low-level operations.

This has resulted in much confusion around what commands are intended for use by general users and which commands exist for the purpose of internal use.

Note: although used internally, the low-level subcommands are also typically used by systems that require such granular operational control.

The git subcommands are generally split into one of two groups:

  • Porcelain: the user-friendly interface (e.g. git checkout, git pull etc.)
  • Plumbing: low-level interface (e.g. git cat-file, git rev-parse etc.)

Git Subcommands

Below is a list of the git subcommands (as of git version 2.22.0), and knowing which are meant to be ‘porcelain’ and which are meant to be ‘plumbing’ can be difficult.

$ man git-<tab>

git-add                       git-commit-tree               git-fsck                      
git-am                        git-config                    git-fsck-objects              
git-annotate                  git-count-objects             git-gc                        
git-apply                     git-credential                git-get-tar-commit-id         
git-archimport                git-credential-cache          git-grep                      
git-archive                   git-credential-cache--daemon  git-gui                       
git-bisect                    git-credential-store          git-hash-object               
git-blame                     git-cvsexportcommit           git-help                      
git-branch                    git-cvsimport                 git-http-backend              
git-bundle                    git-cvsserver                 git-http-fetch                
git-cat-file                  git-daemon                    git-http-push                 
git-check-attr                git-describe                  git-imap-send                 
git-check-ignore              git-diff                      git-index-pack                
git-check-mailmap             git-diff-files                git-init                      
git-check-ref-format          git-diff-index                git-init-db                   
git-checkout                  git-diff-tree                 git-instaweb                  
git-checkout-index            git-difftool                  git-interpret-trailers        
git-cherry                    git-fast-export               git-log                       
git-cherry-pick               git-fast-import               git-ls-files                  
git-citool                    git-fetch                     git-ls-remote                 
git-clean                     git-fetch-pack                git-ls-tree                   
git-clone                     git-filter-branch             git-mailinfo                  
git-column                    git-fmt-merge-msg             git-mailsplit                 
git-commit                    git-for-each-ref              git-merge                     
git-commit-graph              git-format-patch              git-merge-base                
git-merge-file                git-rebase                    git-show-index
git-merge-index               git-receive-pack              git-show-ref
git-merge-one-file            git-reflog                    git-stage
git-merge-tree                git-remote                    git-stash
git-mergetool                 git-remote-ext                git-status
git-mergetool--lib            git-remote-fd                 git-stripspace
git-mktag                     git-remote-testgit            git-submodule
git-mktree                    git-repack                    git-svn
git-multi-pack-index          git-replace                   git-symbolic-ref
git-mv                        git-request-pull              git-tag
git-name-rev                  git-rerere                    git-unpack-file
git-notes                     git-reset                     git-unpack-objects
git-p4                        git-rev-list                  git-update-index
git-pack-objects              git-rev-parse                 git-update-ref
git-pack-redundant            git-revert                    git-update-server-info
git-pack-refs                 git-rm                        git-upload-archive
git-parse-remote              git-send-email                git-upload-pack
git-patch-id                  git-send-pack                 git-var
git-prune                     git-sh-i18n                   git-verify-commit
git-prune-packed              git-sh-i18n--envsubst         git-verify-pack
git-pull                      git-sh-setup                  git-verify-tag
git-push                      git-shell                     git-web--browse
git-quiltimport               git-shortlog                  git-whatchanged
git-range-diff                git-show                      git-worktree
git-read-tree                 git-show-branch               git-write-tree

But there is a way to find out! Currently the man git page describes which commands are intended as porcelain and which are plumbing. Simply search for GIT COMMANDS and you’ll find the two groupings.

My own generalized way of making a distinction is to consider the day-to-day subcommands I use (e.g. git add, git diff) as porcelain, and the more esoteric subcommands (e.g. git fsck, git multi-pack-index) as more plumbing-oriented.

In practice it doesn’t really matter which subcommands are porcelain and which are plumbing. If there’s a subcommand you feel you need to use, then go ahead and use it. My personal perspective on this is: if you’re unsure of what a subcommand does, you’re unlikely to reach for it in the first place.

Most users do not diverge from the well trodden path of: git add, git commit, git pull, git push, git diff (with an occasional git rebase).

What’s interesting about the plumbing subcommands is that some of them expose the same low-level operations git performs internally when you call the porcelain subcommands (e.g. the work done by git read-tree, git update-index and git update-ref is carried out on your behalf by porcelain commands such as git add or git commit).

Note: although we’ll be looking at a couple of plumbing commands in this article, I’ll refer you to the git book for a look at the different plumbing commands available and how they’re used.

The .git directory

When you start a new project that you want to use version control for, you’ll typically run the git init subcommand:

git init [dir]

Most people will know that there is now a .git directory created in the root of your project directory, but that’s about where their understanding of things stops.

Let’s see what’s initially inside the .git directory of a new project…

$ tree .git/

.git/
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

8 directories, 15 files

OK, so there are some important directories and files here that we need to learn a bit about in order to appreciate how git works.

Note: I’m not going to explain every file and directory, only those necessary to understand the fundamentals.

Here are some interesting ones:

  • HEAD: contains a pointer to the tip of the current branch.
  • config: contains project-specific configuration options.
  • info: contains an exclude file that applies to this repository only †
  • objects: contains four types of ‘objects’ (commit, tree, blob, tag).
  • refs: contains pointers to ‘commit’ objects.

† it works like a .gitignore file, but lives inside .git and so is never committed or shared.

References and Objects

The two most important concepts in git are: references and objects.

For example, your branches, tags and remotes are all references to commits, while your commits, files and directories are all objects.

References

Git is built upon the simple premise of using ‘pointers’ to data, and these pointers are typically referred to as ‘references’ (or ‘refs’ for short).

This is what the .git/refs directory stores: references.

As I mentioned earlier, these references all point to a ‘commit’ object…

remote    branch     tag
  |          |        |
  |          |        |
  |          V        |
  ------> commit <-----
             |
             |
             V
           tree
             |
             |
             V
           blob

Note: you can see from the above ASCII graph that the ‘commit’ object itself points to a ‘tree’ object, and that tree object points to a ‘blob’ object. We’ll dig into these object types in more detail in the “Object Types” section.

It’s worth clarifying now that although we conceptually talk in terms of ‘branches’ in git, the internal directory structure (where references to branches are stored) uses the term ‘heads’ instead. It’s a terrible name (like most things in git’s lexicon), but it’s best to just accept it and move on.

Git uses ‘references’ so that users can refer to a specific commit without having to remember its full SHA-1 hash.

Imagine wanting to check out your master branch but instead of just executing git checkout master you had to remember the specific hash.

git checkout b5d34b608ce697f0d20d011ee569529bca3feee8

Not very practical heh.

The HEAD reference

If you recall from earlier, we said the HEAD file contains a pointer to the tip of the current branch.

If we were to look at the .git/HEAD file we would find that by default it has the following content:

ref: refs/heads/master

You can see it’s a pointer to another location (the reference .git/refs/heads/master), which means it’s a pointer to a pointer!

Remember that refs/heads/master is a reference file (which refers to our master branch), and the content of that file is a commit hash. So this is telling us that ultimately HEAD is pointing to our master branch.

But at this point in time I’ve only executed git init, and so I’ve not actually committed anything into git. This means that there isn’t actually a master file inside of the .git/refs/heads subdirectory.

If we look back at the earlier directory tree (which we printed after running git init), we’ll notice that although there is a .git/refs/heads directory, there is no master file. A file called master won’t exist in that subdirectory until I make my first commit.

Note: if you recall from earlier I said that the refs/heads subdirectory was essentially a synonym for ‘branches’ created locally for this project. Hence, the default file referenced by the HEAD file is master (because it’s referencing the master branch).

Let’s now create a commit so that we can see a refs/heads/master file and what it points to…

$ echo foo > foo.txt
$ git add foo.txt
$ git commit -m "foo"

[master (root-commit) b5d34b6] foo
 1 file changed, 1 insertion(+)
 create mode 100644 foo.txt

Once we do this we’ll find git has created a master file inside of .git/refs/heads and the contents of that file is the hash of my first commit (which indicates that the master reference file, or ‘branch’, is pointing at a specific commit snapshot):

b5d34b608ce697f0d20d011ee569529bca3feee8

When you execute a command (such as) git checkout master, internally git will resolve master into refs/heads/master and that is what tells git which commit object to now point to.
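
The ‘pointer to a pointer’ lookup can be sketched in a few lines of Python. This is a simplified illustration rather than how git itself is implemented: the function name resolve_head is made up, it runs against a hand-made temporary directory laid out like the .git directory above, and it ignores packed references (.git/packed-refs).

```python
import os
import tempfile


def resolve_head(git_dir):
    """Follow HEAD and return (ref_name, commit_hash).

    ref_name is None when HEAD holds a hash directly, and commit_hash is
    None when the branch has no commits yet (its ref file does not exist).
    """
    with open(os.path.join(git_dir, "HEAD")) as f:
        content = f.read().strip()
    if not content.startswith("ref: "):
        return None, content
    ref_name = content[len("ref: "):]
    ref_path = os.path.join(git_dir, *ref_name.split("/"))
    if not os.path.exists(ref_path):
        return ref_name, None
    with open(ref_path) as f:
        return ref_name, f.read().strip()


with tempfile.TemporaryDirectory() as git_dir:
    # State right after `git init`: HEAD exists, refs/heads is empty.
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    before_commit = resolve_head(git_dir)

    # State after the first commit: refs/heads/master holds the commit hash.
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("b5d34b608ce697f0d20d011ee569529bca3feee8\n")
    after_commit = resolve_head(git_dir)

print(before_commit)  # ('refs/heads/master', None)
print(after_commit)   # ('refs/heads/master', 'b5d34b608ce697f0d20d011ee569529bca3feee8')
```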

Subcommands and References

Although a reference is a pointer to a commit hash, that doesn’t mean every form of a reference behaves the same way in every git subcommand.

Here is an example subcommand that works fine with a reference: git log. We can use git log origin/master, and git will know to internally resolve that reference to the fully qualified path .git/refs/remotes/origin/master.

Knowing that, we would also know that it is possible to use a partial reference path such as git log refs/remotes/origin/master or maybe git log remotes/origin/master.

All these variations work fine, but we typically use git log origin/master for convenience (because it’s less typing).
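
The reason all three spellings work is that git tries a fixed list of locations for a name, documented in man gitrevisions. Here is a rough Python sketch of that lookup order. The function name expand_ref is hypothetical, the demo runs against a throwaway directory rather than a real repository, and packed references are ignored.

```python
import os
import tempfile

# Lookup order for a reference name, as listed in `man gitrevisions`.
CANDIDATES = [
    "{name}",
    "refs/{name}",
    "refs/tags/{name}",
    "refs/heads/{name}",
    "refs/remotes/{name}",
    "refs/remotes/{name}/HEAD",
]


def expand_ref(git_dir, name):
    """Return the first candidate path that exists as a file, or None."""
    for pattern in CANDIDATES:
        candidate = pattern.format(name=name)
        if os.path.isfile(os.path.join(git_dir, *candidate.split("/"))):
            return candidate
    return None


with tempfile.TemporaryDirectory() as git_dir:
    origin = os.path.join(git_dir, "refs", "remotes", "origin")
    os.makedirs(origin)
    with open(os.path.join(origin, "master"), "w") as f:
        f.write("c3865b72b019ced930cfc601b09b874685c29e72\n")
    names = ["origin/master", "remotes/origin/master",
             "refs/remotes/origin/master", "origin/missing"]
    results = {name: expand_ref(git_dir, name) for name in names}

for name, full in results.items():
    print(name, "->", full)
```

The first three names all expand to refs/remotes/origin/master, while the last one finds nothing.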

But not every subcommand treats these forms interchangeably; git pull and git checkout are two examples, each for a different reason. With git pull, if we look at man git-pull we see we need to provide a <repository> <refspec>, and the name we give as the refspec is looked up in the remote repository rather than in our local .git directory.

The local view of the remote lives under .git/refs/remotes/. If I look there I’ll see only a single directory origin, and inside of that are the remote-tracking branches for the origin remote. In a cloned repository there’s also a HEAD file inside of that origin directory, and it records the remote’s default branch (so it can point to a different commit from our local HEAD in .git/HEAD)! So if I attempted to do something like git pull origin HEAD, I wouldn’t be pulling the branch I currently have checked out.

Instead we’d end up pulling the changes from the remote master!! This happens because HEAD on the remote is set up to point at the master branch…

$ git remote show origin

* remote origin
  Fetch URL: git@github.com:example/repo.git
  Push  URL: git@github.com:example/repo.git
  HEAD branch: master
  Remote branches:
    ...

So subsequently doing git pull origin HEAD would bring in lots of unexpected changes to your local branch 😬

Note: using HEAD isn’t a problem when doing something like git push origin HEAD because it’s a fundamentally different operation: when pushing, the source of the refspec is resolved in our local repository, so git reads the local HEAD file to work out which commits to push to the remote.

Similarly, git checkout doesn’t treat every form of a reference the same way: giving it a qualified path instead of the short branch name will cause a detached HEAD state (e.g. if you were to do something like git checkout refs/heads/master instead of git checkout master).

Let’s now understand what a ‘detached HEAD’ means, and why git checkout causes one when given a qualified reference…

Detached HEAD

Internally git does recognize the reference and can resolve it to the appropriate .git/refs directory, but the behaviour of the checkout command changes when checking out a reference that is a qualified path such as refs/heads/master. What you would discover is you don’t check out the branch but are placed into a ‘detached HEAD’ at the relevant commit.

Why is that? Well, if we look at the documentation for the checkout subcommand (man git-checkout) we would discover…

if it (the given branch name) refers to a branch (i.e., a name that, when prepended with “refs/heads/”, is a valid ref), then that branch is checked out. Otherwise, if it refers to a valid commit, your HEAD becomes “detached” and you are no longer on any branch.

Running git checkout master means you’ve given an identifier (i.e. master) that git can internally resolve to refs/heads/master and thus git will happily checkout that branch, while git checkout refs/heads/master is a direct reference that git first resolves to a commit.

Hence it’s like you had actually run the subcommand git checkout <commit-hash>, and so git puts you into a detached HEAD state.
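
The rule quoted from the man page can be modelled in a few lines of Python. This is a deliberately narrow sketch: head_after_checkout is a made-up name, refs is a plain dictionary standing in for the .git/refs directory, and only branch names, full reference names and 40-character hashes are handled.

```python
def head_after_checkout(arg, refs):
    """Return the text that would end up in .git/HEAD, per the quoted rule.

    refs maps full reference names to commit hashes.
    """
    if "refs/heads/" + arg in refs:
        # The name is a branch once "refs/heads/" is prepended: stay on it.
        return "ref: refs/heads/" + arg
    if arg in refs:
        # A valid reference, but not a branch *name*: HEAD becomes detached.
        return refs[arg]
    if len(arg) == 40 and all(c in "0123456789abcdef" for c in arg):
        # A raw commit hash also detaches HEAD.
        return arg
    raise ValueError("unknown revision: " + arg)


refs = {"refs/heads/master": "b5d34b608ce697f0d20d011ee569529bca3feee8"}

print(head_after_checkout("master", refs))
# ref: refs/heads/master
print(head_after_checkout("refs/heads/master", refs))
# b5d34b608ce697f0d20d011ee569529bca3feee8
```

Prepending refs/heads/ to refs/heads/master gives refs/heads/refs/heads/master, which doesn’t exist, so the first check fails and the argument is treated as a plain commit.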

If you’re unfamiliar with what a ‘detached HEAD’ state is, it simply means the HEAD file is no longer pointing at a reference such as .git/refs/heads/master but directly at a commit hash. A detached HEAD lets you work ‘off’ a branch: you can look around, and even commit, without moving any branch reference.

I’ve never had a need to work ‘off’ a branch (:shrugs:) and so I can only presume there are situations where you would want to do that.

OK, now that we have our first commit let’s dig a little deeper into the ‘objects’ git defines, and how the .git directory structure has changed…

Object Types

There are four main types of objects in git:

  1. commit
  2. tree
  3. blob
  4. tag

Note: we’ll primarily be covering the first three object types.

Since we committed a single file into git there have been a few new files and directories created:

  • index: a binary file containing a sorted list of path names.

  • COMMIT_EDITMSG: temporary file used to store latest commit message.

  • objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99: the foo.txt file (type: blob)

  • objects/b5/d34b608ce697f0d20d011ee569529bca3feee8: commit message data (type: commit)

  • objects/fc/f0be4d7e45f0ef9592682ad68e42270b0366b4: directory tree (type: tree)

You’ll notice that the new objects are stored in a subdirectory which uses the first two characters from the hash of the object’s contents.

For example, the foo.txt blob object’s content (prefixed with a short header) was hashed into 257cc5642cb1a054f08cc83f2d943e56fd3ebe99. Next git took the first two characters 25 and made a subdirectory, and then stored the object in that directory, naming the object file using the remaining characters (i.e. 7cc5642cb1a054f08cc83f2d943e56fd3ebe99).
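
We can reproduce that hash without git at all. The sketch below hashes the file content prefixed with a small header made of the object type, the content size and a NUL byte, which is how git names its objects. The helper names object_id and loose_object_path are my own for this illustration.

```python
import hashlib


def object_id(obj_type, content):
    """SHA-1 over the header '<type> <size>' + NUL byte, followed by the content."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(oid):
    """First two hex characters name the directory, the rest name the file."""
    return f"objects/{oid[:2]}/{oid[2:]}"


blob_id = object_id("blob", b"foo\n")   # `echo foo > foo.txt` wrote "foo\n"
print(blob_id)                          # 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
print(loose_object_path(blob_id))       # objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
```

Because the ID depends only on the type and content, any file containing exactly foo followed by a newline gets this same blob ID in every repository.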

In order to look at these files you’ll need a couple of different plumbing commands: git ls-files and git cat-file.

Let’s start with the index file.

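The index is a binary file which tracks our staging area: for each path it records the blob hash that is staged, plus file metadata that lets git compare it against the working directory quickly (pass the --stage flag to git ls-files to see the staged modes and hashes). The index enables fast comparisons between the tree object it defines and the working tree.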

We’ll need to use git ls-files in order to read the contents:

$ git ls-files

foo.txt

It only has foo.txt tracked, which is correct. There are no other files or directories at this point in time (we’ll add more as we go).

To look at the different ‘objects’ we’ll use the git cat-file command which decompresses the file and displays the file contents (we’ll use the -t flag to return the ‘type’ and the -p flag to ‘print’ the contents).

Note: we don’t provide the path (e.g. objects/../...) as the argument, but the sha itself (shortened sha is acceptable too).

$ git cat-file -t 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
blob

$ git cat-file -p 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
foo

$ git cat-file -t b5d34b608ce697f0d20d011ee569529bca3feee8
commit

$ git cat-file -p b5d34b608ce697f0d20d011ee569529bca3feee8
tree fcf0be4d7e45f0ef9592682ad68e42270b0366b4
author Integralist <example@gmail.com> 1585480397 +0100
committer Integralist <example@gmail.com> 1585480397 +0100

foo

$ git cat-file -t fcf0be4d7e45f0ef9592682ad68e42270b0366b4
tree

$ git cat-file -p fcf0be4d7e45f0ef9592682ad68e42270b0366b4
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99    foo.txt
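
To get a feel for what cat-file is doing with a loose object file, here is a small Python round trip. It is a sketch under assumptions: the function names are made up, the object is built in memory instead of being read from .git/objects, and objects stored inside pack files use a different format that isn’t covered here.

```python
import zlib


def encode_loose_object(obj_type, content):
    """Build the bytes of a loose object file: zlib(header + NUL + content)."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return zlib.compress(header + content)


def decode_loose_object(raw):
    """Return (type, content), roughly what `cat-file -t` and `-p` report for a blob."""
    data = zlib.decompress(raw)
    header, _, content = data.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


raw = encode_loose_object("blob", b"foo\n")
obj_type, content = decode_loose_object(raw)
print(obj_type)   # blob
print(content)    # b'foo\n'
```

In a real repository you could try passing the bytes of a file under .git/objects to decode_loose_object instead. For commit objects the content is the plain text shown above, whereas tree objects hold binary entries that cat-file -p formats for display.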

What’s also interesting is that when you execute a command such as git add, git will ‘conceptually’ copy the file to your staging area, but internally it has created a ‘blob’ object. A command such as git commit then creates the ‘commit’ and ‘tree’ objects to reference the already existing ‘blob’ object. I mention this because I wanted to be clear that these three objects don’t all get created at the same time.

Snapshots, Not Differences

We saw earlier an ascii graph that indicated the hierarchy of these objects. It showed that git reference types (e.g. remotes, branches and tags) all point to a ‘commit’ object. This commit object will include a pointer to a ‘tree’ object, and the tree object is a list of files (i.e. blobs) and directories (i.e. more trees).

It’s this graph that builds up the entire snapshot of the repository. This is why you shouldn’t think of a git commit as being a patch or set of changes to a bunch of files, but instead should see each commit as a complete snapshot of your entire project at a singular point in time.

If any files or directories change, then their blob and tree hashes will change and thus the HEAD commit will consist of different tree and blob objects (resulting in a different hash-tree graph).

With that in mind, let’s start by looking at the commit object we have (git cat-file -p b5d34b6). We can see the first line says tree followed by a hash (all other information is the typical commit information you’re used to seeing when you run git log).

If we look at the tree object git cat-file -p fcf0be4 (which the commit object linked to), then we can see it consists of a single line: a blob object with its hash and its filename foo.txt (this makes sense as our project only contains this single file).

Lastly, if we look at the blob object git cat-file -p 257cc56 (which the tree object linked to), then we can see the contents of that blob object is the contents of the foo.txt file itself.

OK, so what happens if I add a new file bar.txt and a new subdirectory baz with another file qux.txt within that subdirectory…

$ tree
.
├── bar.txt
├── baz
│   └── qux.txt
└── foo.txt

1 directory, 3 files

Once I add bar.txt and baz/qux.txt and commit them, I inspect the new objects in my .git/objects folder. From there I locate the commit object (I do that by looking at the .git/refs/heads/master and seeing what commit hash it has) and once I cat-file -p that hash, I follow its tree pointer…

$ git cat-file -p edc6771b338b472d901358e530db7cede202c1c7

100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6    bar.txt
040000 tree 3d15e426c95bac2548d7255af9c5e240df786e03    baz
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99    foo.txt

$ git cat-file -p 3d15e426c95bac2548d7255af9c5e240df786e03

100644 blob 100b0dec8c53a40e4de7714b2c612dad5fad9985    qux.txt

We can see from the above output that the tree object not only includes my project files, but now a baz directory (itself a tree object). Looking at that tree object shows there is one file inside of it (a blob object for qux.txt).
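
The way trees nest is what makes each commit a snapshot: a tree’s hash covers the hashes of everything beneath it. The Python sketch below serializes tree entries by hand to show that. It rests on assumptions: the file contents (bar and qux followed by a newline) are guesses on my part, the helper names are made up, and entries are sorted by plain name, which is enough here but is simpler than git’s real ordering rule for directories.

```python
import hashlib


def object_id(obj_type, payload):
    """SHA-1 over the header '<type> <size>' + NUL byte, followed by the payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries):
    """Serialize (mode, name, hex_id) entries: '<mode> <name>' NUL <20 raw bytes>."""
    out = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        out += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return out


def snapshot(qux_content):
    """Build the object IDs for the three-file project with a given qux.txt."""
    foo = object_id("blob", b"foo\n")
    bar = object_id("blob", b"bar\n")          # assumed content
    qux = object_id("blob", qux_content)
    baz = object_id("tree", tree_payload([("100644", "qux.txt", qux)]))
    root = object_id("tree", tree_payload([
        ("100644", "bar.txt", bar),
        ("40000", "baz", baz),                 # stored without the leading zero
        ("100644", "foo.txt", foo),
    ]))
    return {"foo": foo, "baz": baz, "root": root}


before = snapshot(b"qux\n")        # assumed content
after = snapshot(b"qux v2\n")      # edit only baz/qux.txt

print(before["foo"] == after["foo"])     # True: untouched blob keeps its ID
print(before["baz"] == after["baz"])     # False: the subtree changed
print(before["root"] == after["root"])   # False: so the root tree changed too
```

Editing one file deep in the project changes every tree on the path up to the root, while untouched files keep their existing blob IDs and are simply referenced again.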

If we review the index file again we’ll see our new set of files/directories:

$ git ls-files

bar.txt
baz/qux.txt
foo.txt

Tags

Along the way I’ve been tagging my commits. A tag (as far as git internals are concerned) is another ‘object’ type. Let’s look at my tags:

$ git tag -n

v1  foo
v2  an anotated tag

So we can see I have two separate tags, and each one points at a different commit (the v1 tag was a lightweight tag and so the associated foo comes from the commit message, while the v2 tag was an annotated tag and so the message I gave at that point was displayed).

In order to see the commit that a tag is associated with, we’ll need another plumbing subcommand rev-list:

$ git rev-list -n 1 v1

b5d34b608ce697f0d20d011ee569529bca3feee8

$ git rev-list -n 1 v2

0b56156eba23ae9bee8c32137605397cf7c9e88e

But for us to see what the ‘tag’ object type looks like internally, we need to get the hash that the tag reference file is set to:

$ cat .git/refs/tags/v1

b5d34b608ce697f0d20d011ee569529bca3feee8

$ cat .git/refs/tags/v2

75d37b7c37173def7a0a8cd43d674edc8e9ce614

Once we have that hash we can use cat-file to see the ‘tag’ object:

$ git cat-file -t 75d37b7c37173def7a0a8cd43d674edc8e9ce614

tag

$ git cat-file -p 75d37b7c37173def7a0a8cd43d674edc8e9ce614

object 0b56156eba23ae9bee8c32137605397cf7c9e88e
type commit
tag v2
tagger Integralist <example@gmail.com> 1585592962 +0100

an anotated tag

OK, so you may have noticed I used cat-file on the v2 (annotated) tag, but not on the v1 (lightweight) tag. That was not an accidental omission.

A lightweight tag is just a reference to a commit hash, but an annotated tag is more complex and so a ‘tag object’ is created, and we can see that when we inspect the hash inside the v2 tag reference.

We can see the tag object includes a pointer to the ‘commit’ object (0b56156eba23ae9bee8c32137605397cf7c9e88e) as well as information about the ‘tagger’ (in this case me!)
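
The difference between the two kinds of tag can be summed up in a small Python sketch. It works on the text printed by cat-file above rather than on a real repository, the function names are hypothetical, and a tag that points at another tag is not handled.

```python
def parse_tag(text):
    """Split the printed form of a tag object into header fields and message."""
    head, _, message = text.partition("\n\n")
    headers = dict(line.split(" ", 1) for line in head.splitlines())
    return headers, message.strip()


def commit_for_tag(ref_hash, object_type, tag_text=None):
    """Return the commit a tag reference leads to."""
    if object_type == "commit":
        return ref_hash               # lightweight: the ref already is the commit
    headers, _ = parse_tag(tag_text)  # annotated: follow the 'object' field
    return headers["object"]


V2_TAG = """object 0b56156eba23ae9bee8c32137605397cf7c9e88e
type commit
tag v2
tagger Integralist <example@gmail.com> 1585592962 +0100

an anotated tag
"""

v1_commit = commit_for_tag("b5d34b608ce697f0d20d011ee569529bca3feee8", "commit")
v2_commit = commit_for_tag("75d37b7c37173def7a0a8cd43d674edc8e9ce614", "tag", V2_TAG)
print(v1_commit)   # b5d34b608ce697f0d20d011ee569529bca3feee8
print(v2_commit)   # 0b56156eba23ae9bee8c32137605397cf7c9e88e
```

Both results match what git rev-list -n 1 reported for v1 and v2 earlier.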

Remotes

Let’s add a remote like so:

git remote add origin git@github.com:Integralist/dotfiles.git

We can now look at the configuration of our remote:

$ git remote show origin

* remote origin
  Fetch URL: git@github.com:Integralist/dotfiles.git
  Push  URL: git@github.com:Integralist/dotfiles.git
  HEAD branch: master
  Remote branches:
    linux                                new (next fetch will store in remotes/origin)
    master                               new (next fetch will store in remotes/origin)
    minimal-mac-version-of-linux-version new (next fetch will store in remotes/origin)
  Local ref configured for 'git push':
    master pushes to master (local out of date)

You might be confused, though, if you were to look at .git/refs and not see a remotes subdirectory. That subdirectory is created automatically if you clone an existing repository, but it’ll also be created when executing git fetch after manually adding a new remote to an existing repository.

I added my new origin remote (see above), but it was only once I had executed a git fetch that I was able to see a ‘remote’ reference:

refs/
| remotes/
| | origin/
| | | master

If I inspect the .git/refs/remotes/origin/master file, then I’ll see the latest commit my remote master branch is on. It’s also interesting to remember what we mentioned earlier about references that point to commits being interchangeable with commit hashes in various subcommands.

For example, git diff allows you to specify two branches to compare against each other (remember a branch is just a reference file that points to a commit hash), and so you might want to compare your local master against your remote master branch:

git diff master..origin/master

This is just a shortened way of doing:

git diff master..refs/remotes/origin/master

Which itself is just a shortened way of doing:

git diff master..c3865b72b019ced930cfc601b09b874685c29e72

Note: one last thing I wanted to mention (and there was no other place really to mention this) is that git comes with a UI! You can execute the command gitk to use it.