From LinuxMIPS

Why this page?

I wanted to understand what GIT was all about, but all I can see is what-it-does level documentation.

So I've sketched a description based on what I know about SCMs already. If you think it's useful and the inevitable errors are fixable, please go on and add to it.

Repositories and "objects"

Like CVS and more modern SCMs, GIT provides a repository in which you can lodge a 'project' (a hierarchical directory structure and its file data), update it incrementally, and subsequently extract any past version of the whole project, of subsets, or of individual files.

The repository, of course, is itself a hierarchically-structured set of Linux files.

GIT calls the stored images of project files and directories "objects", which I find... well, objectionable: The excessively abstract word "object" is already widely used as in "object file" and "object-orientated". The latter meaning of 'object' is loose, vague and poorly understood too - another good reason to avoid it. Sigh.

I'm going to refer to these things as 'GIT files'; for many purposes GIT looks like a filesystem in its own right. It happens to be a filesystem which automagically stores old versions, and which internally uses hash-indexed data. Filesystems which do this are known.

GIT's not a filesystem in the full Linux sense, because you don't access it strictly through open/read/write/unlink etc; its interface looks more like that of CVS (etc).

The repository is a tree of files on a Linux filesystem: but you are not entitled to believe that GIT files are one-to-one with files in the repository. On the other hand, experience shows that a repository system is a lot more reliable if you have unchanged GIT files represented by unchanged Linux files...

GIT is distributed

Unlike CVS but like some modern SCMs, GIT is "distributed": that is, everyday developer interactions terminate at a local copy of the repository, and the system works well even without a full-time, low-latency or high-bandwidth connection to peer repositories.

That poses an interesting problem: you want to be able to synchronise a pair of repositories without user intervention, and be confident that a set of peer copies which synchronise with each other will evolve (fairly rapidly) towards being identical, so long as the graph of pairings is connected.

That requires that a single pair-synchronisation reliably ends up in a common state which captures all changes from both ends. That sets limits on the kinds of repository evolution which are permitted. You cannot make unsynchronisable changes to the repository: the easiest solution to this is that the thing grows, but no data is ever discarded.
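That append-only property is what makes the convergence argument work: if GIT files are keyed by their content hash and never discarded, synchronising a pair of repositories is just a set union, and both ends necessarily finish in the same state. A minimal sketch (my own illustration, with Python dicts standing in for object stores):

```python
def synchronise(repo_a: dict, repo_b: dict) -> None:
    """Pair-synchronisation sketch: since the store only ever grows
    and entries are keyed by hash, merging is a plain union.  The
    same key always maps to the same data, so there is nothing to
    reconcile and no user intervention is needed."""
    merged = {**repo_a, **repo_b}
    repo_a.clear(); repo_a.update(merged)
    repo_b.clear(); repo_b.update(merged)

a = {"h1": b"one", "h2": b"two"}
b = {"h2": b"two", "h3": b"three"}
synchronise(a, b)
print(a == b, sorted(a))   # True ['h1', 'h2', 'h3']
```

Repeated pairwise unions like this can only converge, which is why a connected graph of pairings drives all the copies towards identity.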

All merge tools (even helpfully automated ones) operate under user control and work locally.

GIT relies on hashes

Like a few other proposed systems, GIT relies on hashes to uniquely identify GIT files based on their data. A 160-bit SHA1 hash is big enough that the chance of two different files in a repository having the same hash is vanishingly remote. [Hmm, some people use SHA1 hashes truncated to 128 bits, GIT has full 160 bits: 8 bits in the ?? and 152 bits in the * of .git/objects/??/*].
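For illustration, here is how a 160-bit hash splits into that directory layout. This is a sketch only: real GIT hashes a small header along with the file data and zlib-compresses what it stores on disk, both of which I'm ignoring here to show just the 2+38 hex-digit split.

```python
import hashlib

def loose_object_path(data: bytes) -> str:
    """Map data to a path the way GIT's object store lays it out:
    the first two hex digits (8 bits) name a directory, and the
    remaining 38 hex digits (152 bits) name the file inside it."""
    h = hashlib.sha1(data).hexdigest()   # 40 hex digits = 160 bits
    return f".git/objects/{h[:2]}/{h[2:]}"

print(loose_object_path(b"hello, GIT\n"))
```

Identical data always lands on the identical path, which is the whole point: the name *is* the content.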

You need to be careful: remember that old thing about "how big does a party have to be before you have a 50% chance that two people have the same birthday?" The answer is 23 or so, which surprises people who haven't heard it before... It turns out the group size where you get a 50% chance is a bit bigger than the square root of the number of possible birthdays (23 is indeed a bit bigger than the square root of 365).

So you have a 50% chance of a false "alias" between SHA-identified files when you have somewhere around 2^80 files in the archive: so far so good, we really don't want a repository as big as that. The chance of an alias in the archive varies as the square of the number of files in the archive, too: so a very large archive of 100M files (that's around 2^27) has about a one in 2^((80-27)*2) or one in 2^106 chance of an alias. I really did mean 'vanishingly remote'...
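The arithmetic above can be checked directly with the standard birthday approximation P ≈ 1 - exp(-n(n-1)/2d) for n values drawn from d possibilities. (This keeps the factor of two that the rough estimate above drops, which is why it comes out nearer one in 2^107 than 2^106 - still vanishingly remote.)

```python
import math

def alias_probability(n: float, d: float) -> float:
    """Birthday-paradox approximation: probability of at least one
    collision among n random values drawn from d possibilities."""
    # expm1 keeps precision when the exponent is astronomically small
    return -math.expm1(-n * (n - 1) / (2 * d))

# The classic party question: 23 people, 365 birthdays -> ~50%.
print(round(alias_probability(23, 365), 2))            # 0.5

# 100M files (~2^27) against 2^160 possible SHA1 values.
p = alias_probability(2 ** 27, 2 ** 160)
print(round(math.log2(p)))                             # about -107
```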

It is not clear to me whether GIT's repository integrity depends on this very unlikely event never happening, or whether it would detect it and refuse a commit.

Indexes, trees and commits - a user view of a project

If you're prepared to keep files 'forever', version management with hash-identified files is just a matter of maintaining appropriate index information to locate the right set of GIT files. For that purpose GIT defines special GIT files called 'trees' and 'commits': a "tree" records the directory structure of the project, while a "commit" snapshots the version seen by a user who's just committed some changes. See #More on trees and commits below.
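As a sketch of how hash-identified files, trees and commits might hang together (the encodings here are my own invention for illustration, not GIT's actual on-disk formats):

```python
import hashlib

store = {}   # hash -> data: the ever-growing object store

def put(data: bytes) -> str:
    """Store data under its own SHA1; identical data, identical hash."""
    h = hashlib.sha1(data).hexdigest()
    store[h] = data
    return h

def put_tree(entries):
    """A 'tree': directory structure as names mapped to hashes."""
    body = "\n".join(f"{h} {name}" for name, h in sorted(entries.items()))
    return put(body.encode())

def put_commit(tree_hash, parent, message):
    """A 'commit': a snapshot pointing at a tree and its ancestry."""
    body = f"tree {tree_hash}\nparent {parent}\n\n{message}"
    return put(body.encode())

readme = put(b"hello\n")
tree = put_tree({"README": readme})
commit = put_commit(tree, None, "first commit")
print(commit in store, len(commit))   # True 40
```

Because trees and commits are themselves hash-identified GIT files, one 40-character commit hash pins down the entire project state it refers to.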

The "commit" GIT file is analogous to a thing called a "view" in other systems.

Storing data

Since GIT is completely in charge of its own data, it can (and does) compress data behind your back using gzip/bzip etc. Further, it can (and already attempts to) "cross-compress" related files - in particular, you can store one GIT file's data in the form of a patch to apply to some other hash-identified GIT file.

That use of patch is wholly private to GIT, and has no logical connection with a user's experience of patch/merge as a way of incorporating others' changes or porting your changes to a different root version.

But when a user commits changes, the system usually knows the differences between the new and previous versions, which identifies pairs of GIT files where cross-compression is practicable.
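A toy version of the idea, with Python's difflib standing in for whatever delta format GIT actually uses internally: the new version is stored as "copy these ranges of the base, insert these literal bytes", and reconstructed on demand.

```python
import difflib

def make_delta(base: str, new: str):
    """Encode new as a delta against base: copy ranges plus literals."""
    sm = difflib.SequenceMatcher(a=base, b=new)
    delta = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            delta.append(("copy", i1, i2))        # reuse bytes from base
        else:
            delta.append(("insert", new[j1:j2]))  # literal new bytes
    return delta

def apply_delta(base: str, delta) -> str:
    """Rebuild the full new version from the base plus the delta."""
    out = []
    for item in delta:
        if item[0] == "copy":
            _, i1, i2 = item
            out.append(base[i1:i2])
        else:
            out.append(item[1])
    return "".join(out)

base = "one\ntwo\nthree\n"
new = "one\ntwo\nTHREE\nfour\n"
print(apply_delta(base, make_delta(base, new)) == new)   # True
```

When, as after a commit, the two versions differ only a little, the delta is far smaller than the new file stored whole.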

More on trees and commits

(if only I understood them better.)

Originally written by

Dominic Sweetman 11:07, 14 Sep 2005 (BST)