Emmanuel Ogunbo
3 min read · May 12, 2021

Git Data Model: A Brief Introduction

Git
Git is a version control system that tracks changes to files over time. It is widely used as a collaborative tool that makes software development easier for developers.

Terminologies
Briefly, I will introduce some Git terminology that makes the Git data model, and in turn the Git interface, easier to understand:

Root: The base directory that is being tracked by Git.

File: A file in Git is an array of bytes. It is called a blob in Git. In pseudocode, we can represent a file as:
type blob = array<byte>

Folder: A folder is a map from names (strings, used as keys) to files or other folders. It is called a tree in Git, and it can contain both blobs and other trees. In pseudocode, we can represent it as:
type tree = map<string, blob | tree>

Commits: A commit is a snapshot of the tracked directory together with some metadata. It records your changes and stores them in the Git object store. In pseudocode, a commit can be modeled as:
type commit = struct {
    parents: array<commit>
    author: string
    message: string
    date: string
    snapshot: tree
}
Note: In practice, parents is an array of commit ids rather than the commits themselves. Similarly, the snapshot is referenced by the id of its tree in the object store.
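
To make the pseudocode above a little more concrete, here is a minimal Python sketch of the same three types. It is only an illustration of the model, not how Git is implemented; the file names, author, and messages are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# A blob is just raw bytes.
Blob = bytes

# A tree maps a name to either a blob (file) or another tree (folder).
Tree = Dict[str, Union[Blob, "Tree"]]


@dataclass(frozen=True)
class Commit:
    parents: Tuple["Commit", ...]  # empty for the very first commit
    author: str
    message: str
    date: str
    snapshot: Tree  # the top-level tree of the tracked directory


# Hypothetical project: one file at the root and one inside a folder.
root: Tree = {
    "README.md": b"hello\n",
    "src": {"main.py": b"print('hi')\n"},
}

first = Commit(parents=(), author="Ada", message="Initial commit",
               date="2021-05-12", snapshot=root)
second = Commit(parents=(first,), author="Ada", message="Second commit",
                date="2021-05-13", snapshot=root)
```

Here the commits hold their parents and snapshot directly, as in the pseudocode. As the note above says, Git itself stores ids instead.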

Git Model

Git models commit history as a directed acyclic graph of immutable snapshots. Each commit is a snapshot of the root directory, and each snapshot points to some number of parents, the snapshots that came immediately before it. Most commits have a single parent, while a merge commit has more than one.

In the diagram above, each circle represents a commit, that is, a snapshot (a copy of the entire root directory). The arrows show that each commit points to the one that came before it. Commits also carry meta information such as the author of the commit, the date and time the commit was made, and the commit message.

Git Commit Model of History

To understand the model better using the diagram above, note that each commit points to a previous one. A commit can also have more than one parent, as in the case of a merge commit. Take a look at commit C: it adds a feature and is preceded by commit B. However, commit B contains a bug, so a new snapshot, commit D, is created with the bug fix. Commit E then merges C and D, so it contains both the feature and the bug fix.
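
The history described above can be sketched as a small graph in Python. This is a simplified illustration under a few assumptions: commits are named by their letters rather than by hash ids, D is taken to build on B, and B is treated as the first commit, which may not match the original diagram.

```python
from typing import Dict, List, Set

# Each commit maps to the list of its parents.
# Assumption for this sketch: C and D both build on B, and E merges them.
parents: Dict[str, List[str]] = {
    "B": [],
    "C": ["B"],       # the feature
    "D": ["B"],       # the bug fix
    "E": ["C", "D"],  # merge commit: feature plus bug fix
}


def history(commit: str) -> Set[str]:
    """Return every commit reachable from `commit` by following parent links."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited through another parent
        seen.add(current)
        stack.extend(parents[current])
    return seen


feature_history = history("C")  # contains B and C, but not the fix in D
merged_history = history("E")   # contains B, C, D and E
```

Following parent links from E reaches both C and D, which is why the merge commit contains the feature and the bug fix.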

Git Storage

Git stores blobs, trees, and commits on disk. It defines an object, which can be a blob, a tree, or a commit.

type object = blob | tree | commit

All objects are content-addressed: each object is addressed by its id, which is a hash of its content. In other words, the object store is a mapping from a key, the object's hash, to the object itself. In pseudocode:

objects = map<string, object>
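
As a rough sketch of content addressing, the store can be written as a Python dictionary keyed by the SHA-1 hash of each object's bytes. The `store` and `load` names are hypothetical helpers for this example. Real Git also hashes a small header along with the content and compresses objects on disk, both of which are left out here.

```python
import hashlib
from typing import Dict

# The object store: hash id -> object bytes.
objects: Dict[str, bytes] = {}


def store(data: bytes) -> str:
    """Save `data` under the hash of its content and return that id."""
    object_id = hashlib.sha1(data).hexdigest()
    objects[object_id] = data
    return object_id


def load(object_id: str) -> bytes:
    """Look an object up by its id."""
    return objects[object_id]


first_id = store(b"hello\n")
second_id = store(b"hello\n")   # same content, so the same id
other_id = store(b"goodbye\n")  # different content, so a different id
```

Because the id is derived from the content, storing the same bytes twice does not create a second entry, and an object cannot be changed without its id changing too.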

Apart from the object store, Git also maintains a set of references. A reference attaches a meaningful name, such as master or HEAD, to a long hexadecimal hash id.
references = map<string, string>
For instance, master -> 4af35…df. Also, note that references are mutable, whereas the commit history is not.
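
A minimal Python sketch of references follows. The ids are placeholders made up for the example, and `update_reference` and `resolve` are hypothetical helper names. HEAD is modeled as a name that points at another reference, which is an assumption of this sketch.

```python
from typing import Dict

# Reference name -> hash id (or, for HEAD in this sketch, another name).
references: Dict[str, str] = {}

# Placeholder ids standing in for real 40-character commit hashes.
OLD_COMMIT = "a" * 40
NEW_COMMIT = "b" * 40


def update_reference(name: str, target: str) -> None:
    """Point `name` at `target`, replacing whatever it pointed at before."""
    references[name] = target


def resolve(name: str) -> str:
    """Follow references until reaching something that is not a reference."""
    target = references[name]
    while target in references:
        target = references[target]
    return target


update_reference("master", OLD_COMMIT)
update_reference("HEAD", "master")  # HEAD follows the master branch
before = resolve("HEAD")

update_reference("master", NEW_COMMIT)  # references are mutable
after = resolve("HEAD")
```

Moving master only changes an entry in the map. The commits themselves stay as they were, which matches the point that references are mutable while the history is not.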

With a good understanding of the Git data model, it becomes easy to understand the various types of Git commands and their functions.

I hope you enjoyed reading this article!
