Wednesday, June 10, 2015

Git by diff

Git is a distributed version control system. It stores the file history and the current state of the repository on the local machine. We can use this to understand the internals of git better. The idea is:

1. Store a copy of the repo folder, including the .git directory where git stores all state information it needs.
2. Do a git operation.
3. Compare the folder from step 1 with the current folder. Learn.
4. Go to 1.

Initial Setup

We create a script which makes a copy of the directory containing a test repo.

#!/usr/bin/env bash
I kept the script in a directory that is on my PATH, so I could call it from anywhere.
I use Beyond Compare as the diff tool, but any good diff tool will do.
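The body of the script is not reproduced above. As a rough stand-in, here is a minimal Python sketch of what such a snapshot step might do. It assumes the copy is kept next to the repo in a folder with a _prev suffix (the gitbydiff_prev name used later suggests this). The function name snapshot and all other details are assumptions for illustration, not the original script.

```python
import shutil
from pathlib import Path


def snapshot(repo_dir):
    """Copy repo_dir (including .git) to a sibling '<name>_prev' folder.

    Hypothetical helper: the '_prev' suffix mirrors the gitbydiff_prev
    folder mentioned in the article; everything else is an assumption.
    """
    repo = Path(repo_dir).resolve()
    prev = repo.with_name(repo.name + "_prev")
    # Drop the previous snapshot so files deleted since then do not linger.
    if prev.exists():
        shutil.rmtree(prev)
    if repo.exists():
        shutil.copytree(repo, prev)
    else:
        # Before the repo exists, the snapshot is just an empty folder.
        prev.mkdir(parents=True)
    return prev
```

Calling snapshot("c:/tmp/gitbydiff") before each git operation would leave the previous state in c:/tmp/gitbydiff_prev, ready to be compared.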

Git by diff

Git init

We use git init to create a repository.

$ mkdir gitbydiff
$ cd gitbydiff
$ git init
Initialized empty Git repository in c:/tmp/gitbydiff
Now let us diff the repo directory gitbydiff and its initial (empty) contents gitbydiff_prev.
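For readers without a graphical diff tool, the comparison can be roughly approximated in a few lines of Python. This is only a sketch, not what Beyond Compare does: list_files and diff_dirs are hypothetical helper names, whole files are read into memory, and empty directories are not reported.

```python
import os


def list_files(root):
    """Map each file's path (relative to root, '/'-separated) to its bytes."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as fh:
                files[rel] = fh.read()
    return files


def diff_dirs(old, new):
    """Return sorted lists of added, removed and changed file paths."""
    before, after = list_files(old), list_files(new)
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(p for p in set(before) & set(after)
                     if before[p] != after[p])
    return added, removed, changed
```

diff_dirs("gitbydiff_prev", "gitbydiff") would then list the files that appeared, disappeared or changed since the last snapshot.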

So git creates a number of entries in the .git directory. Most of them are directories, and we will not dig into them for now. Let us take a quick look at two of the files:

The config file contains the default configuration for the repository. We can update the configuration using the git config command or by editing this file.
The HEAD file contains a reference to the current branch, which is master. The text in the file is ref: refs/heads/master.
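The format of the HEAD file is simple enough to read by hand. A minimal sketch, assuming HEAD holds either a "ref: " line as shown here or a bare commit hash (the detached case, which this article does not exercise); parse_head is a hypothetical helper name:

```python
def parse_head(text):
    """Return ('ref', name) for a symbolic HEAD, else ('detached', hash)."""
    text = text.strip()
    prefix = "ref: "
    if text.startswith(prefix):
        # HEAD names a branch; the branch file holds the actual commit hash.
        return ("ref", text[len(prefix):])
    # Assumption: anything else is a raw commit hash (detached HEAD).
    return ("detached", text)
```

For the freshly initialized repo, parse_head applied to the contents of .git/HEAD would report the branch reference refs/heads/master.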

Git add

Let us create a file, and a directory containing another file, and add them to git.

File - afile
Directory - adir
Another file - adir/bfile

$ echo "afile contents" > afile
$ mkdir adir
$ echo "adir/bfile contents" > adir/bfile
$ git add .

The diff reveals a few new files in the .git folder: an index file, and a few directories and files in the objects directory.

How do we read these files? It turns out they are zlib-compressed, and can be decompressed to reveal their contents. I borrowed a Python 2 one-liner which decompresses a file fed to it on standard input.
$ python -c "import zlib,sys;print repr(zlib.decompress(sys.stdin.read()))" < c:/tmp/gitbydiff/.git/objects/04/4d81646daa6ef55b55630a66cce5c8e098d5b7
'blob 15\x00afile contents\n'
So this was a compressed file holding a 'blob' header followed by the contents of afile! The other object contains the other file we created, bfile.
$ python -c "import zlib,sys;print repr(zlib.decompress(sys.stdin.read()))" < c:/tmp/gitbydiff/.git/objects/b7/404f599ac68db583e53ce9a903ea1c7479f86c
'blob 20\x00adir/bfile contents\n'
Git stores data as objects in a key-value store. The key is a checksum of the object's contents (a short type-and-size header followed by the data), and git generates it with the SHA-1 hash function.
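The key can be recomputed outside git. A minimal sketch, assuming the hashed bytes are exactly what the decompressed output above shows: the object type, a space, the content length, a NUL byte, then the content. The names git_object_payload and git_object_key are hypothetical helpers.

```python
import hashlib


def git_object_payload(obj_type, content):
    """Build the bytes git hashes: '<type> <size>' + NUL + content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return header + content


def git_object_key(obj_type, content):
    """Return the SHA-1 hex digest used as the object's key."""
    return hashlib.sha1(git_object_payload(obj_type, content)).hexdigest()
```

git_object_key("blob", b"afile contents\n") can be compared against the object name seen under .git/objects for afile.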

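In the objects directory, git stores each object under a subdirectory named after the first two characters of its hash, in a file named after the remaining 38 characters. For example, the afile blob lives at .git/objects/04/4d81646daa6ef55b55630a66cce5c8e098d5b7.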
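The paths passed to the Python one-liner earlier show this layout in action. Here is a minimal Python 3 sketch of locating and reading such an object. It only handles loose objects like the ones in this experiment (not packed ones), and object_path and read_loose_object are hypothetical helper names.

```python
import os
import zlib


def object_path(git_dir, sha):
    """Loose object location: objects/<first 2 hex chars>/<other 38>."""
    return os.path.join(git_dir, "objects", sha[:2], sha[2:])


def read_loose_object(git_dir, sha):
    """Return (type, body) for a loose object, e.g. ('blob', b'...')."""
    with open(object_path(git_dir, sha), "rb") as fh:
        raw = zlib.decompress(fh.read())
    # The header and body are separated by the first NUL byte.
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body
```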

The index is a binary file, whose contents we can see with the git ls-files command:

$ git ls-files --stage
100644 b7404f599ac68db583e53ce9a903ea1c7479f86c 0       adir/bfile
100644 044d81646daa6ef55b55630a66cce5c8e098d5b7 0       afile
In essence, the index is a list of paths together with their SHA-1 hashes and permissions (file modes).
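The binary file itself can be probed a little without git. The sketch below reads only the 12-byte header of the index (the signature DIRC, a version number and the entry count, all big-endian) and deliberately leaves the per-entry records to git ls-files. parse_index_header is a hypothetical helper name.

```python
import struct


def parse_index_header(data):
    """Return (version, entry_count) from the first 12 bytes of an index."""
    if len(data) < 12:
        raise ValueError("index header is 12 bytes long")
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    return version, count
```

Feeding it the bytes of .git/index after the git add above should report an entry count that agrees with the two lines printed by git ls-files --stage.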

Git commit

Let us commit our change to git.

$ git commit -m "First commit from branch master"
 2 files changed, 2 insertions(+)
 create mode 100644 adir/bfile
 create mode 100644 afile

In the diff we see three new objects, a logs directory and a refs/heads/master file.

The refs/heads/master file contains the hash of an object: 20c9dc1abcbb5e7e60d5a15da9fc11f3bf83e4c7

What is in this object? Instead of using our Python snippet, we can use the git cat-file command, which reveals the contents of a git object more concisely. Note also that cat-file only needs the first few characters of the object's SHA-1 hash.

$ git cat-file -p 20c9dc
tree 631d18013981d64940cef00c422fcff00044e432
author xyz  1433905500 -0400
committer xyz  1433905500 -0400

First commit from branch master
So refs/heads/master points to a commit object, which in turn contains a pointer to a 'tree' object, the author and committer details, and the commit message.
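The printed form of such an object is easy to pick apart: header lines of the form "key value", a blank line, then the message. A minimal sketch that parses the text shown by git cat-file -p, ignoring multi-line headers such as signatures (a simplification); parse_commit is a hypothetical helper name:

```python
def parse_commit(text):
    """Split 'git cat-file -p <commit>' output into (fields, message)."""
    head, _, message = text.partition("\n\n")
    fields = {}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        # A key may repeat (a merge commit has several 'parent' lines).
        fields.setdefault(key, []).append(value)
    return fields, message.strip()


FIRST_COMMIT = (
    "tree 631d18013981d64940cef00c422fcff00044e432\n"
    "author xyz  1433905500 -0400\n"
    "committer xyz  1433905500 -0400\n"
    "\n"
    "First commit from branch master\n"
)
fields, message = parse_commit(FIRST_COMMIT)
print(fields["tree"][0], "|", message)
```

Note that this first commit has no parent field, since nothing came before it.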

What is in the 'tree' object 631d1...?

$ git cat-file -p 631d1
040000 tree 309fb8dd6b7c0a9bb2234cd6dc0e5d65fe9fccde    adir
100644 blob 044d81646daa6ef55b55630a66cce5c8e098d5b7    afile
A 'tree' adir and a 'blob' afile. But adir and afile are the directory and file we created!
We can guess now that the contents of the 'tree' adir is a 'blob' bfile which is the file we created under adir. Let us check.

$ git cat-file -p 309fb
100644 blob b7404f599ac68db583e53ce9a903ea1c7479f86c    bfile
Right. The bfile and afile objects in turn hold the contents of the files. For example, bfile:

$ git cat-file -p b7404
adir/bfile contents
From this we can infer that git stores a directory structure as a set of pointers, with trees representing directories (including the root) and blobs representing files.

The directory structure we created (afile and adir/bfile) is represented as:
branch master -> commit (with commit message) -> root tree
                                                   -> blob afile -> contents of afile
                                                   -> tree adir -> blob bfile -> contents of bfile

The data structure used here is a Directed Acyclic Graph (DAG).
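To make the graph concrete, here is a toy in-memory model of the five objects seen so far, keyed by the abbreviated hashes from the cat-file output, with a walk from the commit down to the blobs. The OBJECTS dictionary and list_paths are illustrative assumptions, not how git represents objects internally.

```python
# Toy object store: abbreviated hash -> (type, payload).
OBJECTS = {
    "20c9dc": ("commit", {"tree": "631d1"}),
    "631d1": ("tree", {"adir": "309fb", "afile": "044d8"}),
    "309fb": ("tree", {"bfile": "b7404"}),
    "044d8": ("blob", "afile contents\n"),
    "b7404": ("blob", "adir/bfile contents\n"),
}


def list_paths(key, prefix=""):
    """Follow pointers from a commit or tree and return all file paths."""
    kind, value = OBJECTS[key]
    if kind == "commit":
        return list_paths(value["tree"], prefix)
    if kind == "blob":
        return [prefix]
    paths = []
    for name, child in sorted(value.items()):
        child_prefix = f"{prefix}/{name}" if prefix else name
        paths.extend(list_paths(child, child_prefix))
    return paths


print(list_paths("20c9dc"))  # ['adir/bfile', 'afile']
```

Since every edge points from an object to something it contains, and a key is derived from the content it names, the walk can never loop back on itself.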

We ignore the logs directory here. It is related to the reflog, which we will not look into.

Git branch

Let us now create a branch.

$ git checkout -b abranch
Switched to a new branch 'abranch'

The HEAD file now contains ref: refs/heads/abranch.
The refs/heads directory has a new file, abranch, which contains 20c9dc..., the same hash that refs/heads/master contains. That is the commit object which points to our directory structure.

This is logical: a new branch was created, git points its HEAD to it since we switched to it and the branch points to the root of our repo.
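Going by the diff, creating a branch amounts to writing one small file and updating HEAD. The sketch below mimics just those two file changes on a scratch directory. It is a toy model under that assumption, not git's implementation (real git also deals with locking, the reflog and packed refs), and create_branch is a hypothetical helper that should not be pointed at a repository you care about.

```python
import os


def create_branch(git_dir, name):
    """Toy model of 'git checkout -b': add a ref file and repoint HEAD."""
    head_path = os.path.join(git_dir, "HEAD")
    with open(head_path) as fh:
        # Assumes HEAD is symbolic, e.g. 'ref: refs/heads/master'.
        current_ref = fh.read().strip()[len("ref: "):]
    with open(os.path.join(git_dir, current_ref)) as fh:
        commit = fh.read().strip()
    new_ref = "refs/heads/" + name
    # The new branch starts at the commit the current branch points to.
    with open(os.path.join(git_dir, new_ref), "w") as fh:
        fh.write(commit + "\n")
    with open(head_path, "w") as fh:
        fh.write("ref: " + new_ref + "\n")
    return commit
```

No objects are copied or created here, which matches what the diff showed.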

Let us do a commit now.

$ echo "change to a file from branch abranch" > afile
$ git add .
$ git commit -m "Second commit from branch abranch"
 1 file changed, 1 insertion(+), 1 deletion(-)

We see three new objects, and that the refs/heads/abranch and COMMIT_EDITMSG files have changed.

refs/heads/abranch points to one of the new objects, 89008. Following this object we see:

$ git cat-file -p 89008
tree 4bbff3bb19598db5ab1cae521f652a018651221d
parent 20c9dc1abcbb5e7e60d5a15da9fc11f3bf83e4c7
author xyz  1433907011 -0400
committer xyz  1433907011 -0400

Second commit from branch abranch
$ git cat-file -p 4bbff3
040000 tree 309fb8dd6b7c0a9bb2234cd6dc0e5d65fe9fccde    adir
100644 blob 99037eec622590773d235c6abfa44d437fae0c73    afile
$ git cat-file -p 99037
change to a file from branch abranch
What is git doing here?
Since the new branch, abranch, has a modified afile, git updates the state of abranch to reflect this.
It creates a new commit object with a new 'root' tree which points to adir and afile. The afile object it points to is the updated file, which exists only in this branch. master still points to the old afile.

For adir/bfile, both master and abranch use the same object, since that directory has not really changed across the branches. Git does not duplicate data for common objects across branches.
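The sharing is visible by comparing the two root trees entry by entry. A small sketch using the hashes from the cat-file listings above; compare_trees is a hypothetical helper and the two dictionaries are typed in by hand:

```python
MASTER_TREE = {   # tree 631d1..., from the first commit
    "adir": "309fb8dd6b7c0a9bb2234cd6dc0e5d65fe9fccde",
    "afile": "044d81646daa6ef55b55630a66cce5c8e098d5b7",
}
ABRANCH_TREE = {  # tree 4bbff3..., from the second commit
    "adir": "309fb8dd6b7c0a9bb2234cd6dc0e5d65fe9fccde",
    "afile": "99037eec622590773d235c6abfa44d437fae0c73",
}


def compare_trees(old, new):
    """Return (shared, changed): names whose hash is equal / different."""
    shared = sorted(n for n in old if new.get(n) == old[n])
    changed = sorted(n for n in old if n in new and new[n] != old[n])
    return shared, changed


print(compare_trees(MASTER_TREE, ABRANCH_TREE))  # (['adir'], ['afile'])
```

An equal hash means the very same object on disk, so the whole adir subtree is stored once and referenced from both commits.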

Git merge

As a final step in our experiment, let us merge the branch into master.

$ git checkout master
Switched to branch 'master'
$ git merge abranch
Updating 20c9dc1..89008dd
 afile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

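After the checkout, HEAD is ref: refs/heads/master again. The "Updating 20c9dc1..89008dd" line indicates a fast-forward merge: git moves refs/heads/master to the commit that abranch points to (89008...) without creating any new objects. Though abranch is merged into master, its ref and its objects are left intact.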
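Whether a branch can simply be moved forward like this comes down to an ancestry check: the commit master was on (20c9dc1) must be reachable from the commit being merged (89008dd) by following parent pointers. A minimal sketch over a hand-written parent map; PARENTS and is_ancestor are illustrative assumptions, not git's internal code:

```python
# Commit -> list of parent commits, taken from the cat-file output above.
PARENTS = {
    "20c9dc1": [],
    "89008dd": ["20c9dc1"],
}


def is_ancestor(ancestor, commit, parents):
    """True if 'ancestor' is reachable from 'commit' via parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


# master (20c9dc1) is an ancestor of abranch (89008dd), so merging
# abranch into master needs no new commit in this model.
print(is_ancestor("20c9dc1", "89008dd", PARENTS))  # True
```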

I will stop at this point. This method can be used to explore other git commands like rebase, or concepts like remote repositories. Another good way to learn more about git is to peek at the git source code itself.
