Inside the Mercurial repo

The way Mercurial handles its repo is really quite simple.

That's simple, as in 'most things are simple once you know the answer'. I found the explanation helpful10, so this section attempts the 10,000ft (FL100 if you prefer) view of Mercurial.

First remember that any file or component can only have one or two parents. You can't merge more than one other branch at once.

We start with the basic building block, which Mercurial calls a revlog. A revlog is a thing that holds a file and all the changes in the file history11. The revlog stores the differences between successive versions of the file, though it will periodically store a complete version of the file instead of a difference, so that the content of any particular file version can always be reconstructed without excessive effort.

Under the secret-squirrel Mercurial .hg directory at the top of your project is a store which holds a revlog for each file in your project. So you have the complete history of the project locally. No more round trips to the server.

Both the differences between successive versions and the periodic complete versions of a file are compressed before storing. This is surprisingly effective at minimising the storage requirements this entire history of your project. I have a small Java project handy, comprising a little over 300 source modules. There are 5 branches plus the mainline, and some 1920 commits in all. A Subversion checkout of the current mainline takes 51Mb. Converting the project to Mercurial yields a Mercurial repository that takes 60Mb, so a little bigger. Remember, though, that the Mercurial repository includes not just the working copy, but also the entire history of the project.

Any point in the evolution of a revlog can be uniquely identified with a nodeid. This is simply the SHA1 hash of the current file contents concatenated with the nodeids of one or both parents of the current revision. Note that this way, two file states are identical if and only if the file contents are the same *and* the file has the same history.

Here's a dump of a revlog index:

$ hg debugindex .hg/store/data/pome.txt.i
   rev    offset  length   base linkrev nodeid       p1           p2
     0         0      32      0       0 6bbbd5d6cc53 000000000000 000000000000
     1        32      51      0       1 83d266583303 6bbbd5d6cc53 000000000000
     2        83      84      0       2 14a54ec34bb6 83d266583303 000000000000
     3       167      76      3       4 dc4df776b38b 83d266583303 000000000000
$

Note here that a file state can have two parents. If both the parent nodeids are non-null, the file state has two parents, and the state is therefore the result of a merge.

Let's dump out a revlog at a particular revision:

$ hg debugdata .hg/store/data/pome.txt.i 2
There was a gibbon one morning
said "I think I will fly to the moon".
So with two great palms strapped to his arms,
he started his takeoff run.
$

The next component is the manifest. This is simply a list of all the files in the project, together with their current nodeids. The manifest is a file, held in a revlog. The nodeid of the manifest, therefore, identifies the project filesystem at a particular point.

$ hg debugdata .hg/store/00manifest.i 5
poem.txt5168b1a5e2f44aa4e0f164e170820845183f50c8
$

Finally we have the changeset. This is the atomic collection of changes to a repository that leads to a new revision. The changeset info includes the nodeid of the corresponding manifest, the timestamp and committer ID, a list of changed files and a comment. The changeset also includes the nodeid of the parent changeset, or the two parents if the change is a merge. The changeset description is held in a revlog, the changelog.

$ hg debugdata .hg/store/00changelog.i 5
1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e
Jim Hague <jim.hague@acm.org>
1209061793 -3600
poem.txt
pome.txt

Merge first line branch
$

The nodeid of the changeset, therefore, gives us a globally unique identifier for any particular change. Changesets have a Subversion-like incrementing change number, but it is peculiar to that repository. The nodeid, however, is global.

One more detail remains to complete the picture. How do we get back from a particular file change to find the responsible changeset? Each revlog change has a linkrev entry that does just this.

So, now we have a repository with a history of the changes applied to that repository. Each change has a unique identifier. If we find that change in another repository, it means that at the point in the other repository we have exactly the same state; the file contents and history are identical.

At this point we can see how pulling changes from another repository works. Mercurial has to determine which changesets in the source repository are missing in the target repository. To do this, for each head in the source repo it has to find the most recent change in that head that it already present in the target repo, and get any remaining changes after that point. These changes are then copied over and applied.

The Mercurial revlog format has proved remarkably durable. Since the first release of Mercurial in April 2005, there have been a total of 5 changes to the file format. However, of those, all but one have been changes to the handling of file names. The most recent change, in October 2008, and its predecessor in December 2006, were both introduced purely to cope with Windows specific issues. The one change that touched the data structures described above was in April 2006. The format introduced, RevLogNG, changed only the details of index data held, not the overall design. The chief Mercurial developer, Matt Mackall, notes that the code in present-day Mercurial devoted to reading the old format comprises 28 lines of Python. Compared with, say, the early tribulations of Subversion and the switch from bdfs to fsfs, this is an impressive record.



Footnotes

... helpful10
For the curious, Bryan O'Sullivan's excellent Mercurial book has a chapter on the subject, and the Mercurial website has a fair amount of detail too.
... history11
For any non-trivial file, this will actually be two files on the disc, a data file and an index.
Jim Hague 2009-05-22