The Definitive Guide to Git's Code



Git Overview

Over the past 15 years Git has grown from a tiny program written by a single developer to the most popular version control system (VCS) on the planet. Git is an essential tool that developers use to share code and collaborate on software projects, and developers are generally expected to know how to use it before joining a team project.

But what does Git actually do? Git's core functionality can be simplified into 3 basic parts:

  1. Allow developers to keep track of updates to their code over time
  2. Allow developers to easily combine their code updates with previous updates or new updates made by other people
  3. Allow developers to easily share code over the Internet

There are many Git books and tutorials on how to use Git spread across the Internet, and eventually we hope to build a library of those here at Initial Commit, but for now we'll focus on the inner workings of Git's code to help curious developers understand how it functions.

Why Learn How Git's Code Works?

Before diving in, we should address the question 'Who in their right mind would want to spend time learning about how Git's code works?' Here are a few reasons to learn about Git's code:

  1. Git's codebase, at least in its initial form, is a manageable size to wrap your head around. Git's initial commit comes packaged in only 10 files and comprises fewer than 1000 lines in total. This is tiny compared to most codebases of any scope and maximizes the knowledge-to-effort ratio of this endeavor.
  2. Git's code actually runs in its initial form. Later on we'll walk through the steps to download Git's full codebase, retrieve its initial form, and run its original commands.
  3. Git's creator and original author Linus Torvalds is very picky about design principles, so understanding how he built this thing offers useful knowledge for structuring your own software projects in the future.
  4. The code itself is not that hard to grasp. If you have a basic or intermediate knowledge of programming, you should be able to follow along with the detailed inline code comments in our Baby Git project.
  5. Curiosity – In my experience as a software developer, I've found that each new programming language, tool, or project that I've integrated into my repertoire has expanded my skill set and correspondingly the set of opportunities that I have in my professional and hobbyist careers. Sometimes exploring a topic in depth purely due to curiosity is a good enough reason!

What language(s) is Git written in?

As of March 27, 2019, Git's code is made up of the following programming languages, as seen on Git's GitHub page:

Figure 1: Distribution of Git's Programming Languages

From this we can see that almost 50% of Git's code is written in C, so knowledge of the C programming language is important for understanding how Git functions. In fact, Git's original codebase (its initial commit) is written entirely in C, apart from the Makefile. If you're familiar with other statically typed languages like Java or C++, you shouldn't have too many problems reading C code. However, there are 2 major differences between C and Java/C++ that you will need to grasp:

  1. C doesn't have classes. That's right: C is not an object-oriented language, which is why C++ was created. The closest structure C has to a class is the struct. You can read more about this on my guest post here.
  2. C uses pointers often. (C++ does too, so if you are familiar with them, that's great!) You can think of a pointer as a variable that holds the memory address of another variable you are working with. This makes accessing variable addresses and values a bit different than in higher-level languages like Java and Python.

Git's code is free and open source, licensed under the GNU General Public License version 2.

Where to Find Git's Code?

Git's code is stored on Git's GitHub page. You can download the ZIP file directly from GitHub or open a terminal window and clone the repository using the following command:

git clone https://github.com/git/git.git

Navigate into the freshly downloaded git directory and run the git log command to take a peek at the latest commits made by the Git development community.

If you take a look at the files and directories in the project root (i.e. in the main git directory), you'll see a large collection of C header files (files ending in the .h extension, such as blob.h) and source code files (files ending in the .c extension, such as blob.c). The .h files contain declarations to be shared among multiple source files using the #include preprocessing directive. The .c files contain the actual code that makes Git tick.

You'll also notice some files ending in the .sh extension, which are shell scripts, and some files ending in the .perl extension, which are Perl scripts. In general, each of these files corresponds to a particular Git feature, command, or object (more info on Git objects here).

However, analyzing this current version of Git's codebase would get unwieldy fast, simply because there are so many files and folders to go through. Let's break this problem down into one of a more manageable size.

The Initial Version of Git: Git's Initial Commit

As mentioned above, Git's initial commit is small in size, and it actually works. So how do we retrieve it? We can do that by running the following commands in a terminal window in the git directory:

  1. git log --reverse

    This command will display a list of Git's commit history starting at the inception of its development, instead of the most recent commits. Note that the very first commit in the list has an ID of e83c5163316f89bfbde7d9ab23ca2e25604af290.

  2. git checkout e83c5163316f89bfbde7d9ab23ca2e25604af290

    Now if you examine the contents of the `git` directory, you'll notice almost all of the files have disappeared! In fact there are only 10 files left (11 if you include the README):

    • Makefile
    • cache.h
    • init-db.c
    • update-cache.c
    • read-cache.c
    • write-tree.c
    • commit-tree.c
    • read-tree.c
    • cat-file.c
    • show-diff.c
    • README

Feel free to look through these files for a peek at how Git works under the hood. The Initial Commit team has thoroughly documented this codebase with inline code comments.

We also wrote a guidebook for developers containing an in-depth walk-through of the code, which you can start reading for free.

Baby Git Makefile

Studying Git's Makefile is a great way to learn how Makefiles work and how they are implemented in practice. For more details check out my article on the Baby Git Makefile.

Baby Git Header Files

Studying Git's header files is a great way to learn how C header files work and how they are implemented in practice. For more details check out my article on the Baby Git Header Files.

Baby Git Object Database

Git uses an object database to track and store all of the files, folders, changes, and commits that we create. For more details on this check out my article on the Baby Git Object Database.

Baby Git Staging Index

In order to specify files to be committed, Git uses a staging area also referred to as an index or current directory cache. For more details on this check out my article on the Baby Git Staging Index.

Checking Out an Initial Commit

If you're new to Git and interested in learning how to peek under the hood at the first version of your favorite software project, check out my guide on Checking Out Initial Commits with Git.

Next Steps

If you're interested in learning more about how Git works under the hood, check out our Baby Git Guidebook for Developers, which dives into Git's code in an accessible way. We wrote it for curious developers who want to learn how Git works at the code level. To do this, we documented the first version of Git's code and discussed it in detail.

Final Notes