Git Internals
Introduction
There are many version control systems, but git is undoubtedly the most popular, and regularly used, thanks to online social platforms such as GitHub and GitLab.
Yet, it is a tool that is still vastly misunderstood and feared. In this post I aim to take a look at some of the internal moving parts of git, primarily what’s inside the .git directory (including the various subdirectories and files).
My hope is that by better understanding how git works, and the concepts it is built upon, readers will feel more empowered and confident when working with git (especially when they have issues and would normally be unsure of what to do).
Note: this article isn’t an introduction to git, and does presume that the reader is familiar with (i.e. a user of) git.
General Concept
I wanted to take a quick moment just to clarify the terminology associated with the general concepts of how git works (so we’re all on the same page):
- Working Directory: your project files.
- Staging Area: a file that tracks the changes to your project files.
- Repository: the location where your project files are stored.
Note: these bullet points are just summaries, but I’d like to expand on them slightly: your ‘working directory’ can change depending on what ‘version’ of the project you have ‘checked out’ from the git repository (i.e. this is what happens when you change your ‘branch’ with git checkout <branch_name>).
So for example, commands like git add
will copy objects from the working directory into the staging area (aka the ‘index’), while git reset
will remove objects from the staging area.
A command such as git diff compares your working directory to your staging area, while using the --staged flag will change this behaviour such that git will compare your staging area to the most recent commit in your repository (i.e. HEAD).
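As a rough sketch of the three areas in action, the following builds a throwaway repository and shows which comparison notices a change before and after git add. The temporary directory, the file name file.txt and the user details are all made up for illustration.

```shell
# Scratch repository so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"

# The change exists only in the working directory at this point.
echo two > file.txt
unstaged=$(git diff --name-only)           # working directory vs staging area
staged=$(git diff --staged --name-only)    # staging area vs last commit
echo "before add: unstaged='$unstaged' staged='$staged'"

# Copy the change into the staging area.
git add file.txt
unstaged_after=$(git diff --name-only)
staged_after=$(git diff --staged --name-only)
echo "after add:  unstaged='$unstaged_after' staged='$staged_after'"
```

Before the git add only the plain git diff reports file.txt, and afterwards only git diff --staged does.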
Subcommands: Porcelain and Plumbing
The git version control system wasn’t initially designed to be a user-friendly interface, and so alongside the more commonly used subcommands are commands that can carry out very low-level operations.
This has resulted in much confusion around what commands are intended for use by general users and which commands exist for the purpose of internal use.
Note: although used internally, the low-level subcommands are also typically used by systems that require such granular operational control.
The git
subcommands are generally split into one of two groups:
- Porcelain: the user-friendly interface (e.g. git checkout, git pull, etc.)
- Plumbing: the low-level interface (e.g. git cat-file, git rev-parse, etc.)
Git Subcommands
Below is a list of the git
subcommands (as of git version 2.22.0
), and knowing which are meant to be ‘porcelain’ and which are meant to be ‘plumbing’ can be difficult.
$ man git-<tab>
git-add git-commit-tree git-fsck
git-am git-config git-fsck-objects
git-annotate git-count-objects git-gc
git-apply git-credential git-get-tar-commit-id
git-archimport git-credential-cache git-grep
git-archive git-credential-cache--daemon git-gui
git-bisect git-credential-store git-hash-object
git-blame git-cvsexportcommit git-help
git-branch git-cvsimport git-http-backend
git-bundle git-cvsserver git-http-fetch
git-cat-file git-daemon git-http-push
git-check-attr git-describe git-imap-send
git-check-ignore git-diff git-index-pack
git-check-mailmap git-diff-files git-init
git-check-ref-format git-diff-index git-init-db
git-checkout git-diff-tree git-instaweb
git-checkout-index git-difftool git-interpret-trailers
git-cherry git-fast-export git-log
git-cherry-pick git-fast-import git-ls-files
git-citool git-fetch git-ls-remote
git-clean git-fetch-pack git-ls-tree
git-clone git-filter-branch git-mailinfo
git-column git-fmt-merge-msg git-mailsplit
git-commit git-for-each-ref git-merge
git-commit-graph git-format-patch git-merge-base
git-merge-file git-rebase git-show-index
git-merge-index git-receive-pack git-show-ref
git-merge-one-file git-reflog git-stage
git-merge-tree git-remote git-stash
git-mergetool git-remote-ext git-status
git-mergetool--lib git-remote-fd git-stripspace
git-mktag git-remote-testgit git-submodule
git-mktree git-repack git-svn
git-multi-pack-index git-replace git-symbolic-ref
git-mv git-request-pull git-tag
git-name-rev git-rerere git-unpack-file
git-notes git-reset git-unpack-objects
git-p4 git-rev-list git-update-index
git-pack-objects git-rev-parse git-update-ref
git-pack-redundant git-revert git-update-server-info
git-pack-refs git-rm git-upload-archive
git-parse-remote git-send-email git-upload-pack
git-patch-id git-send-pack git-var
git-prune git-sh-i18n git-verify-commit
git-prune-packed git-sh-i18n--envsubst git-verify-pack
git-pull git-sh-setup git-verify-tag
git-push git-shell git-web--browse
git-quiltimport git-shortlog git-whatchanged
git-range-diff git-show git-worktree
git-read-tree git-show-branch git-write-tree
But there is a way to find out! Currently the man git page describes which commands are intended as porcelain and which are plumbing. Simply search for GIT COMMANDS and you’ll find the two groupings.
My own generalized way of making a distinction is to consider the day-to-day subcommands I use (e.g. git add, git diff) as being porcelain, and the more esoteric subcommands (e.g. git fsck, git multi-pack-index) as being more plumbing orientated.
In practice it doesn’t really matter which subcommands are porcelain and which are plumbing. If there’s a subcommand you feel you need to use, then go ahead and use it. My personal perspective on this is: if you’re unsure of what a subcommand does, you’re unlikely to reach for it in the first place.
Most users do not diverge from the well trodden path of: git add
, git commit
, git pull
, git push
, git diff
(with an occasional git rebase
).
What’s interesting about the plumbing subcommands is that some of them are used internally by git when you’re calling the porcelain subcommands (e.g. the work done by git read-tree, git update-index and git update-ref is carried out on your behalf by porcelain commands such as git add or git commit).
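To make that relationship a little more concrete, here is a minimal sketch that creates a commit using only plumbing subcommands. It approximates what git add and git commit do for you, rather than showing exactly how git implements them. The scratch repository, the file foo.txt and the user details are assumptions for the example.

```shell
# Scratch repository so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo foo > foo.txt

# Roughly what 'git add foo.txt' does: store a blob, then record it in the index.
blob=$(git hash-object -w foo.txt)
git update-index --add --cacheinfo 100644,"$blob",foo.txt

# Roughly what 'git commit' does: snapshot the index as a tree, wrap that
# tree in a commit, then move the current branch to the new commit.
tree=$(git write-tree)
commit=$(echo "foo" | git commit-tree "$tree")
git update-ref "$(git symbolic-ref HEAD)" "$commit"

git log --oneline
```

The final git log should list a single commit with the message foo, just as if the porcelain commands had been used.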
Note: although we’ll be looking at a couple of plumbing commands in this article, I’ll refer you to the git book for a look at the different plumbing commands available and how they’re used.
The .git
directory
When you start a new project that you want to use version control for, you’ll typically run the git init
subcommand:
git init [dir]
Most people will know that there is now a .git directory created in the root of your project directory, but that’s about where their understanding of things stops.
Let’s see what’s initially inside the .git
directory of a new project…
$ tree .git/
.git/
├── HEAD
├── config
├── description
├── hooks
│ ├── applypatch-msg.sample
│ ├── commit-msg.sample
│ ├── fsmonitor-watchman.sample
│ ├── post-update.sample
│ ├── pre-applypatch.sample
│ ├── pre-commit.sample
│ ├── pre-push.sample
│ ├── pre-rebase.sample
│ ├── pre-receive.sample
│ ├── prepare-commit-msg.sample
│ └── update.sample
├── info
│ └── exclude
├── objects
│ ├── info
│ └── pack
└── refs
  ├── heads
  └── tags
8 directories, 15 files
OK, so there are some important directories and files here that we need to learn a bit about in order to appreciate how git works.
Note: I’m not going to explain every file and directory, only those necessary to understand the fundamentals.
Here are some interesting ones:
- HEAD: contains a pointer to the tip of the current branch.
- config: contains project-specific configuration options.
- info: contains a repository-wide exclude file †
- objects: contains four types of ‘objects’ (commit, tree, blob, tag).
- refs: contains pointers to ‘commit’ objects.
† this is separate from your project’s .gitignore: it applies only to this copy of the repository and isn’t committed or shared.
References and Objects
The two most important concepts in git are: references and objects.
For example, your branches, tags and remotes are all references to commits, while your commits, your files and your directories are all objects.
References
Git is built upon the simple premise of using ‘pointers’ to data, and these pointers are typically referred to as ‘references’ (or ‘refs’ for short).
This is what the .git/refs
directory stores: references.
As I mentioned earlier, these references all point to a ‘commit’ object…
remote     branch     tag
  |          |         |
  |          |         |
  |          V         |
  -------> commit <------
             |
             |
             V
            tree
             |
             |
             V
            blob
Note: you can see from the above ascii graph that the ‘commit’ object itself points to a ‘tree’ object, and that tree object points to a ‘blob’ object. We’ll dig into these reference ‘object’ types in more detail in the “Object Types” section.
It’s worth clarifying now that although we conceptually talk in terms of ‘branches’ in git, the internal directory structure (where references to branches are stored) uses the term ‘heads’ instead. It’s a terrible name (like most things in git’s lexicon), but it’s best to just accept it and move on.
Git uses ‘references’ so that users can refer to a specific commit without having to remember the full SHA-1 hash.
Imagine wanting to checkout your master branch but instead of just executing git checkout master
you had to remember the specific hash.
git checkout b5d34b608ce697f0d20d011ee569529bca3feee8
Not very practical heh.
The HEAD reference
If you recall from earlier, we said the HEAD
file contains a pointer to the tip of the current branch.
If we were to look at the .git/HEAD
file we would find that by default it has the following content:
ref: refs/heads/master
You can see it’s a pointer to another location (the reference .git/refs/heads/master
), which means it’s a pointer to a pointer!
Remember that refs/heads/master is a reference file (which refers to our master branch), and the contents of that file is a commit hash. So this is telling us that ultimately HEAD is pointing to our master branch.
But at this point in time I’ve only executed git init
, and so I’ve not actually committed anything into git. This means that there isn’t actually a master
file inside of the .git/refs/heads
subdirectory.
If we look back at the earlier directory tree (which we printed after running git init
), we’ll notice that although there is a .git/refs/heads
directory, there is no master
file. A file called master
won’t exist in that subdirectory until I make my first commit.
Note: if you recall from earlier I said that the refs/heads subdirectory was essentially a synonym for ‘branches’ created locally for this project. Hence, the default file referenced by the HEAD file is master (because it’s referencing the master branch).
Let’s now create a commit so that we can see a refs/heads/master
file and what it points to…
$ echo foo > foo.txt
$ git add foo.txt
$ git commit -m "foo"
[master (root-commit) b5d34b6] foo
1 file changed, 1 insertion(+)
create mode 100644 foo.txt
Once we do this we’ll find git has created a master
file inside of .git/refs/heads
and the contents of that file is the hash of my first commit (which indicates that the master
reference file, or ‘branch’, is pointing at a specific commit snapshot):
b5d34b608ce697f0d20d011ee569529bca3feee8
When you execute a command (such as) git checkout master
, internally git will resolve master
into refs/heads/master
and that is what tells git which commit object to now point to.
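The ‘pointer to a pointer’ chain can be followed by hand. This is a small sketch in a throwaway repository (the directory, file and user details are invented for the example). It asks git which reference HEAD names, and then which commit that reference resolves to.

```shell
# Scratch repository so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo foo > foo.txt
git add foo.txt
git commit -q -m "foo"

head_file=$(cat .git/HEAD)                     # raw file, e.g. "ref: refs/heads/master"
branch_ref=$(git symbolic-ref HEAD)            # the reference HEAD points at
branch_commit=$(git rev-parse "$branch_ref")   # the commit that reference points at
head_commit=$(git rev-parse HEAD)              # HEAD resolved all the way down

echo "HEAD file:     $head_file"
echo "branch ref:    $branch_ref"
echo "branch commit: $branch_commit"
echo "HEAD commit:   $head_commit"
```

The branch name printed depends on your git configuration (newer installations may default to something other than master), but the last two lines should always show the same hash.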
Subcommands and References
Although a reference is a pointer to a commit hash, it doesn’t mean every form of a reference behaves the same way in every git subcommand.
Here is an example subcommand that works fine with a reference: git log
. We can use git log origin/master
, and git will know to internally resolve that reference to the fully qualified path .git/refs/remotes/origin/master
.
Knowing that, we would also know that it is possible to use a longer form of the reference such as git log refs/remotes/origin/master or git log remotes/origin/master.
All these variations work fine, but we typically use git log origin/master
for convenience (because it’s less typing).
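A quick way to convince yourself that these spellings are equivalent is to resolve each one with git rev-parse. The sketch below fakes a remote by cloning a local scratch repository. The upstream and clone directory names are made up, and the branch is forced to master only to match the examples in this post.

```shell
# Two scratch repositories: one plays the remote, the other is a clone of it.
work=$(mktemp -d)
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master   # match the branch name used in this post
git config user.name "Example"
git config user.email "example@example.com"
echo foo > foo.txt
git add foo.txt
git commit -q -m "foo"

git clone -q "$work/upstream" "$work/clone"
cd "$work/clone"

# Three spellings of the same remote-tracking reference.
short=$(git rev-parse origin/master)
medium=$(git rev-parse remotes/origin/master)
full=$(git rev-parse refs/remotes/origin/master)

echo "$short"
echo "$medium"
echo "$full"
```

All three lines should print the same commit hash.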
But references aren’t interchangeable like this with commands such as git checkout and git pull, for different reasons. With git pull, if we look at man git-pull we see we need to provide a <repository> <refspec>, which means the refspec we provide is resolved against the remote (whose branches are tracked locally under .git/refs/remotes/).
If I look at .git/refs/remotes/ I’ll see only a single directory origin, and inside of that are references for all the branches of the origin remote. So if I attempted to do something like git pull origin HEAD this wouldn’t do what I might expect, because the remote has its own HEAD (a cloned repository records it as a HEAD file inside of that origin directory), and it can point to a different commit from our local HEAD in .git/HEAD!
This means we’d end up trying to pull the changes from the remote master! This happens because HEAD on the remote is set up to point at the master branch…
$ git remote show origin
* remote origin
Fetch URL: git@github.com:example/repo.git
Push URL: git@github.com:example/repo.git
HEAD branch: master
Remote branches:
...
So subsequently doing git pull origin HEAD
would bring in lots of unexpected changes to your local branch 😬
Note: using HEAD isn’t a problem when doing something like git push origin HEAD because it’s a fundamentally different operation and so git knows to reference the local HEAD file to get the commit range before pushing to the remote.
Similarly, a fully qualified reference isn’t interchangeable with a branch name in a command like git checkout, as its internal logic will cause a detached HEAD state (e.g. if you were to do something like git checkout refs/heads/master instead of git checkout master).
Let’s now understand what a ‘detached HEAD’ means, and why git checkout would cause that when given a fully qualified reference…
Detached HEAD
Internally git does recognize the reference and can resolve it to the appropriate .git/refs
directory, but the behaviour of the checkout command changes when checking out a reference that is a qualified path such as refs/heads/master
. What you would discover is you don’t checkout the branch but are placed into a ‘detached HEAD’ at the relevant commit.
Why is that? Well, if we look at the documentation for the checkout subcommand (man git-checkout
) we would discover…
if it (the given branch name) refers to a branch (i.e., a name that, when prepended with “refs/heads/”, is a valid ref), then that branch is checked out. Otherwise, if it refers to a valid commit, your HEAD becomes “detached” and you are no longer on any branch.
Running git checkout master
means you’ve given an identifier (i.e. master
) that git can internally resolve to refs/heads/master
and thus git will happily checkout that branch, while git checkout refs/heads/master
is a direct reference that git first resolves to a commit.
Hence it’s like you had actually run the subcommand git checkout <commit-hash>
, and so git puts you into a detached HEAD state.
If you’re unfamiliar with what a ‘detached HEAD’ state is, then it simply means the HEAD
file no longer is pointing at a reference such as .git/refs/heads/master
but directly to a commit hash. The purpose of a detached HEAD is to allow you to do work off a branch.
I’ve never had a need to work ‘off’ a branch (:shrugs:
) and so I can only presume there are situations where you would want to do that.
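The difference between the two checkouts is easy to observe. In this sketch (scratch repository, invented file and user details, branch forced to master to match this post) git symbolic-ref -q HEAD prints the branch reference when HEAD is attached and fails quietly when it is detached.

```shell
# Scratch repository so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # match the branch name used in this post
git config user.name "Example"
git config user.email "example@example.com"
echo foo > foo.txt
git add foo.txt
git commit -q -m "foo"

# Checking out the short branch name keeps HEAD attached to the branch.
git checkout -q master
by_name=$(git symbolic-ref -q HEAD || echo "detached")

# Checking out the fully qualified reference resolves to a commit instead.
git checkout -q refs/heads/master
by_full_ref=$(git symbolic-ref -q HEAD || echo "detached")

echo "git checkout master            -> $by_name"
echo "git checkout refs/heads/master -> $by_full_ref"

# Reattach HEAD to the branch again.
git checkout -q master
```

If you do end up detached by accident, checking out the branch by its short name (as the last line does) puts you back on it.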
OK, now that we have our first commit let’s dig a little deeper into the ‘objects’ git defines, and how the .git directory structure has changed…
Object Types
There are four main types of objects in git:
- commit
- tree
- blob
- tag
Note: we’ll primarily be covering the first three object types.
Since we committed a single file into git there have been a few new files and directories created:
- index: a binary file containing a sorted list of path names.
- COMMIT_EDITMSG: temporary file used to store the latest commit message.
- objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99: the foo.txt file (type: blob)
- objects/b5/d34b608ce697f0d20d011ee569529bca3feee8: commit message data (type: commit)
- objects/fc/f0be4d7e45f0ef9592682ad68e42270b0366b4: directory tree (type: tree)
You’ll notice that the new objects are stored in a subdirectory which uses the first two characters from the hash of the object’s contents.
For example, the foo.txt
blob object’s content was hashed into 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
. Next git took the first two characters 25
and made a subdirectory, and then moved the object into that directory while naming the object file using the remaining characters (i.e. 7cc5642cb1a054f08cc83f2d943e56fd3ebe99
).
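The hash isn’t magic: for a blob it is the SHA-1 of a short header (the word blob, the content size in bytes and a NUL byte) followed by the content itself. The sketch below reproduces the foo.txt hash by hand and compares it to what git reports. It assumes the sha1sum utility from GNU coreutils is available (on macOS, shasum is the usual substitute) and that git is using its default SHA-1 object format.

```shell
# "foo\n" is 4 bytes, so the header is "blob 4" followed by a NUL byte.
manual=$(printf 'blob 4\0foo\n' | sha1sum | cut -d' ' -f1)

# Ask git to hash the same content (nothing is written to any repository).
via_git=$(echo foo | git hash-object --stdin)

# Split the hash the way git lays out loose objects on disk.
dir=$(echo "$via_git" | cut -c1-2)
file=$(echo "$via_git" | cut -c3-)

echo "manual:  $manual"
echo "via git: $via_git"
echo "path:    .git/objects/$dir/$file"
```

Both hashes should match the 257cc56… value shown above, which is also why identical file contents always end up as the same object.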
In order to look at these files you’ll need a couple of different plumbing commands: git ls-files and git cat-file.
Let’s start with the index
file.
The index
is a binary file which tracks our working directory and our staging area (use --stage
flag to see staging area). The index enables fast comparisons between the tree object it defines and the working tree.
We’ll need to use git ls-files
in order to read the contents:
$ git ls-files
foo.txt
It only has foo.txt
tracked, which is correct. There are no other files or directories at this point in time (we’ll add more as we go).
To look at the different ‘objects’ we’ll use the git cat-file command which decompresses the file and displays the file contents (we’ll use the -t flag to return the ‘type’ and the -p flag to ‘print’ the contents).
Note: we don’t provide the path (e.g.
objects/../...
) as the argument, but the sha itself (shortened sha is acceptable too).
$ git cat-file -t 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
blob
$ git cat-file -p 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
foo
$ git cat-file -t b5d34b608ce697f0d20d011ee569529bca3feee8
commit
$ git cat-file -p b5d34b608ce697f0d20d011ee569529bca3feee8
tree fcf0be4d7e45f0ef9592682ad68e42270b0366b4
author Integralist <example@gmail.com> 1585480397 +0100
committer Integralist <example@gmail.com> 1585480397 +0100
foo
$ git cat-file -t fcf0be4d7e45f0ef9592682ad68e42270b0366b4
tree
$ git cat-file -p fcf0be4d7e45f0ef9592682ad68e42270b0366b4
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 foo.txt
What’s also interesting is that when you execute a command such as git add, git will ‘conceptually’ copy the file to your staging area, but internally it has created a ‘blob’ object. A command such as git commit then creates the ‘commit’ and ‘tree’ objects to reference the already existing ‘blob’ object. I mention this because I wanted to be clear that these three objects don’t all get created at the same time.
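One way to watch this happen is to count the files under .git/objects at each step. This is a sketch in a fresh scratch repository (directory, file and user details are invented), and the count helper is just a convenience defined here, not a git command.

```shell
# Scratch repository so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Helper (defined for this example only): number of stored object files.
count() { find .git/objects -type f | wc -l | tr -d ' '; }

echo foo > foo.txt
before=$(count)          # nothing stored yet

git add foo.txt
after_add=$(count)       # the blob now exists

git commit -q -m "foo"
after_commit=$(count)    # the tree and commit objects have joined it

echo "before add:   $before"
echo "after add:    $after_add"
echo "after commit: $after_commit"
```

In a brand new repository I would expect the counts to go from 0 to 1 to 3.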
Snapshots, Not Differences
We saw earlier an ascii graph that indicated the hierarchy of these objects. It showed that git reference types (e.g. remotes, branches and tags) all point to a ‘commit’ object. This commit object will include a pointer to a ‘tree’ object, and the tree object is a list of files (i.e. blobs) and directories (i.e. more trees).
It’s this graph that builds up the entire snapshot of the repository. This is why you shouldn’t think of a git commit as being a patch or set of changes to a bunch of files, but instead should see each commit as a complete snapshot of your entire project at a singular point in time.
If any files or directories change, then their hashes will change and thus the HEAD commit will consist of different tree and blob objects (resulting in a different hash-tree graph).
With that in mind, let’s start by looking at the commit object we have (git cat-file -p b5d34b6). We can see the first line says tree followed by a hash (all other information is the typical commit information you’re used to seeing when you run git log).
If we look at the tree object git cat-file -p fcf0be4
(which the commit object linked to), then we can see it consists of a single line: a blob object with its hash and its filename foo.txt
(this makes sense as our project only contains this single file).
Lastly, let’s look at the blob object git cat-file -p 257cc56
(which the tree object linked to), then we can see the contents of that blob object is the contents of the foo.txt
file itself.
OK, so what happens if I add a new file bar.txt
and a new subdirectory baz
with another file qux.txt
within that subdirectory…
$ tree
.
├── bar.txt
├── baz
│ └── qux.txt
└── foo.txt
1 directory, 3 files
Once I add baz/qux.txt
and commit it I then inspect the new objects in my .git/objects
folder. From there I locate the commit object (I do that by looking at the .git/refs/heads/master
and seeing what commit hash it has) and once I cat-file -p
that hash, I follow its tree
pointer…
$ git cat-file -p edc6771b338b472d901358e530db7cede202c1c7
100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6 bar.txt
040000 tree 3d15e426c95bac2548d7255af9c5e240df786e03 baz
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 foo.txt
$ git cat-file -p 3d15e426c95bac2548d7255af9c5e240df786e03
100644 blob 100b0dec8c53a40e4de7714b2c612dad5fad9985 qux.txt
We can see from the above output that the tree object not only includes my project files, but now a baz
directory (itself a tree object). Looking at that tree object shows there is one file inside of it (a blob object for qux.txt
).
If we review the index
file again we’ll see our new set of files/directories:
$ git ls-files
bar.txt
baz/qux.txt
foo.txt
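Rather than following each tree pointer by hand, git ls-tree -r will walk the whole snapshot for a commit. The sketch below recreates a similar layout in a scratch repository (the file contents and user details are assumptions, so the hashes need not match the ones above). It also checks that the unchanged foo.txt is represented by the very same blob in both commits.

```shell
# Scratch repository so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo foo > foo.txt
git add foo.txt
git commit -q -m "foo"
foo_in_first=$(git rev-parse HEAD:foo.txt)    # blob for foo.txt in commit 1

echo bar > bar.txt
mkdir baz
echo qux > baz/qux.txt
git add bar.txt baz/qux.txt
git commit -q -m "bar and qux"
foo_in_second=$(git rev-parse HEAD:foo.txt)   # blob for foo.txt in commit 2

# Recursively list every blob in the latest snapshot.
git ls-tree -r HEAD

echo "foo.txt blob, first commit:  $foo_in_first"
echo "foo.txt blob, second commit: $foo_in_second"
```

Each commit is a full snapshot, but because foo.txt didn’t change the second snapshot simply points at the blob that already existed.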
Tags
Along the way I’ve been tagging my commits. A tag (as far as git internals are concerned) is another ‘object’ type. Let’s look at my tags:
$ git tag -n
v1 foo
v2 an anotated tag
So we can see I have two separate tags, and each one points at a different commit (the v1 tag was a lightweight tag and so the associated foo
comes from the commit message, while the v2 tag was an annotated tag and so the message I gave at that point was displayed).
In order to see the commit that a tag is associated with, we’ll need another plumbing subcommand rev-list
:
$ git rev-list -n 1 v1
b5d34b608ce697f0d20d011ee569529bca3feee8
$ git rev-list -n 1 v2
0b56156eba23ae9bee8c32137605397cf7c9e88e
But for us to see what the ‘tag’ object type looks like internally, we need to get the hash that the tag reference file is set to:
$ cat .git/refs/tags/v1
b5d34b608ce697f0d20d011ee569529bca3feee8
$ cat .git/refs/tags/v2
75d37b7c37173def7a0a8cd43d674edc8e9ce614
Once we have that hash we can use cat-file
to see the ‘tag’ object:
$ git cat-file -t 75d37b7c37173def7a0a8cd43d674edc8e9ce614
tag
$ git cat-file -p 75d37b7c37173def7a0a8cd43d674edc8e9ce614
object 0b56156eba23ae9bee8c32137605397cf7c9e88e
type commit
tag v2
tagger Integralist <example@gmail.com> 1585592962 +0100
an anotated tag
OK, so you may have noticed I used cat-file
on the v2 (annotated) tag, but not on the v1 (lightweight) tag. That was not an accidental omission.
A lightweight tag is just a reference to a commit hash, but an annotated tag is more complex and so a ‘tag object’ is created, and we can see that when we inspect the hash inside the v2 tag reference.
We can see the tag object includes a pointer to the ‘commit’ object (0b56156eba23ae9bee8c32137605397cf7c9e88e
) as well as information about the ‘tagger’ (in this case me!)
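The same distinction can be reproduced in a few lines. This sketch uses a scratch repository with invented user details, and for brevity it puts both tags on the same commit (unlike my repository above, where they point at different commits).

```shell
# Scratch repository so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo foo > foo.txt
git add foo.txt
git commit -q -m "foo"

git tag v1                            # lightweight: just a reference
git tag -a v2 -m "an annotated tag"   # annotated: creates a tag object

v1_type=$(git cat-file -t v1)              # what the v1 reference points at
v2_type=$(git cat-file -t v2)              # what the v2 reference points at
v2_object=$(git rev-parse v2)              # hash of the tag object itself
v2_commit=$(git rev-parse 'v2^{commit}')   # commit the tag object refers to

echo "v1 is a $v1_type, v2 is a $v2_type"
echo "v2 tag object: $v2_object"
echo "v2 commit:     $v2_commit"
```

The ^{commit} suffix asks git to peel the tag object back to the commit it wraps, which saves reading the object line out of cat-file -p by eye.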
Remotes
Let’s say you add a remote like so:
git remote add origin git@github.com:Integralist/dotfiles.git
We can now look at the configuration of our remote:
$ git remote show origin
* remote origin
Fetch URL: git@github.com:Integralist/dotfiles.git
Push URL: git@github.com:Integralist/dotfiles.git
HEAD branch: master
Remote branches:
linux new (next fetch will store in remotes/origin)
master new (next fetch will store in remotes/origin)
minimal-mac-version-of-linux-version new (next fetch will store in remotes/origin)
Local ref configured for 'git push':
master pushes to master (local out of date)
You might be confused though if you were to look at .git/refs and not see a remotes subdirectory. That subdirectory is created automatically if you clone an existing repository, but it’ll also be created when executing git fetch after manually adding a new remote to an existing repository.
I added my new origin remote (see above), but it was only once I had executed a git fetch that I was able to see a ‘remote’ reference:
refs/
| remotes/
| | origin/
| | | master
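Here is a small sketch of that sequence using a local directory as a stand-in for the remote (the upstream and local directory names, the file and the user details are all invented). It lists the references under refs/remotes before and after the fetch.

```shell
# One scratch repository plays the remote, another is the local project.
work=$(mktemp -d)
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master   # match the branch name used in this post
git config user.name "Example"
git config user.email "example@example.com"
echo foo > foo.txt
git add foo.txt
git commit -q -m "foo"

git init -q "$work/local"
cd "$work/local"
git remote add origin "$work/upstream"

# Adding the remote only writes configuration; no references exist yet.
before=$(git for-each-ref refs/remotes | wc -l | tr -d ' ')

git fetch -q origin

# After the fetch the remote-tracking references have been created.
after=$(git for-each-ref --format='%(refname)' refs/remotes)

echo "remote refs before fetch: $before"
echo "remote refs after fetch:"
echo "$after"
```

Depending on your git version the list after the fetch may also include refs/remotes/origin/HEAD alongside refs/remotes/origin/master.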
If I inspect the .git/refs/remotes/origin/master
file, then I’ll see the latest commit my remote master
branch is on. It’s also interesting to remember what we mentioned earlier about references that point to commits being interchangeable with commit hashes in various subcommands.
For example, git diff
allows you to specify two branches to compare against each other (remember a branch is just a reference file that points to a commit hash), and so you might want to compare your local master
against your remote master
branch:
git diff master..origin/master
This is just a shortened way of doing:
git diff master..refs/remotes/origin/master
Which itself is just a shortened way of doing:
git diff master..c3865b72b019ced930cfc601b09b874685c29e72
Note: one last thing I wanted to mention (and there was no other place really to mention this) is that git comes with a UI! You can execute the command gitk to use it.