Version Control Blog Archive
DevCamps news
DevCamps is a system for managing development, integration, staging, and production environments. It was developed by End Point for, and with the help of, some of our ecommerce clients. It grew over the space of several years, and really started to become its own standalone project in 2007.
Camps are a behind-the-scenes workhorse of our web application development at End Point, and don't always get much attention because everyone's too busy using camps to get work done! But this summer a few things are happening.
In early July we unveiled a redesign of the devcamps.org website that features a more whimsical look, a better explanation of what camps are all about, and endorsements by business and developer users. Marko Bijelic of Hipinspire did the design. Take a look:
In less than two weeks, on August 17, I'm going to be giving a talk on camps at YAPC::EU in Riga, Latvia. YAPC::EU is Europe's annual Perl conference, and will be a nice place to talk about camps.
Many Perl developers are doing web applications, which is camps' main focus, so that's reason enough. But camps also started around the Interchange application server, which is written in Perl. And the camp system is currently implemented in Perl as well.
We've set up a lot of camp systems for Perl web applications. So even though we've also set up camp systems for web applications using Ruby on Rails, Sinatra, Django, and PHP, it's a nice homecoming to talk about camps to Perl enthusiasts.
Interactive Git: My New Found Friend(s)
As a software engineer I'm naturally inclined to be at least somewhat introverted :-). Combine that with the fact that End Point is PhysicalWaterCooler challenged and you have a recipe for two things to occur naturally: 1) talking to oneself (but then who doesn't do that really? no, really.), and 2) finding friends in unusual places. Feeling a bit socially lacking after a personal residence move, I set out to find new friends, and I found one. His name is "--interactive", or Mr. git add --interactive.
"How did we meet?" You ask. While working on a rather "long winded" project I started to notice myself sprinkling in TODOs throughout the source code, not a bad habit really (presuming they do actually eventually get fixed), but unfortunately the end result is having a lot of changed files in git that you don't really need to commit, but at the same time don't really need to see every time you want to review code. I'm fairly anal about reviewing code and so I was generally in the habit of running a `git status` followed by a `git diff
"But what about your other old friends?" You then ask. Well, as it turns out my spending so much time with interactive add made `git stash` feel a bit lonely, and it dawned on me that tracking those TODOs in the working tree at all may be a bit silly. What could a guy do, perhaps these two friends might actually like to party together? As it turns out they had already been introduced and do like to party together (not sure why they couldn't have just invited me before, though it might have something to do with my past friendship with SVN and RCS). Either way, to once and for all get those unsightly TODOs out from under my immediate purview while keeping other changes I still needed in the index I found `git stash save --patch --no-keep-index "TODO Tracking"`. "save" instructs git stash to save a new stash, "--patch" tosses it into an interactive mode similar to the one described above for add, "--no-keep-index" instructs stash not to keep the changes in the working tree that are added to the created stash, and the "TODO Tracking" is just a message to make it easy for a human to understand what the stash contains (I made this one up for my specific immediate purpose). This leaves my working tree and index clean for me to do more pressing work and to know that when I have the time/need to restore those past TODOs I can, so that they may be worked on as well. Note that I've not really used this technique much (read: I've just done it now for the first time) so we'll see if it really is that useful, but the interactive patching I've used and it is definitely worth it.
As a further side bar I was discussing multiple commit indexes in a Git repo with someone in the #yui channel, and as soon as I found the above it occurred to me that using multiple stashes where you pop them could work in effect the same way, though I don't know if there is a way to add patches to an already created stash. That might make a neat feature to investigate and/or request from the Git core.
Just so you aren't too concerned, there is still a place in my heart for `git add` and `git status` even if I don't see them as frequently as I once did.
Version Control Visualization and End Point in Open Source
Over the weekend, I discovered an open source tool for version control visualization, Gource. I decided to put together a few videos to showcase End Point's involvement in several open source projects.
Here's a quick legend to help understand the videos below:
The Videos
Interchange from endpoint on Vimeo.
Bucardo from endpoint on Vimeo.
One of the articles that references Gource suggests that the videos can be used to visualize and analyze the community involvement of a project (open source or not). One might also be able to qualitatively analyze the stability of project file architecture from a video, but this won't reveal anything definitive about the code stability since external factors can influence file structure. For example, since I am intimately familiar with the progress of Spree, I can identify when Spree transitioned to Rails 3 in the video, which required reorganization of the Spree core functionality (read more about this here and here).
In the case of this article, I wanted to highlight End Point's involvement in a few open source projects where we've had various levels of involvement. We've contributed to Interchange since 2000. We've been involved in Spree less lately, but had more presence in early 2009. In the smaller projects Bucardo and pgsi, End Point employees have worked on a team to be the primary contributors to the projects in addition to a few external contributors. Open source is important to End Point, and it's great to see our presence demonstrated in these cute videos.
Using "diff" and "git" to locate original revision/source of externally modified files
I recently ran into an issue where I had a source file of unknown version which had been substantially modified from its original form, and I wanted to find the version of the originating software that it had originally come from to compare the changes. This file could have come from any number of the 100 tagged releases in the repository, so obviously a hand-review approach was out of the question. While there were certainly clues in the source file (e.g., copyright dates to narrow down the range of commits to review), I thought up and used this technique:
Here are our considerations:
- We know that the number of changes to the original file is likely small compared to the size of the file overall.
- Since we're trying to uncover a likely match for the purposes of reviewing, exactness is not required; i.e., if there are lines in common with future releases, we're interested in the changes, so a revision with the fewest number of changes is preferred over finding the *exact* version of the file that this was originally based on.
The basic thought, then, is that we want to take the content of the unversioned file (i.e., the file that was changed) and find the revision of the corresponding file in the repository with the least number of changes, which we'll measure as the count of the lines in the source code diff. This struck me as similar to the copy detection that git does, insofar as it can detect content that is similar to some source content with a certain amount of tolerance for changes from the base. The difference in this case is that we're comparing content across a number of refs rather than across all of the blobs in a single ref. This recipe distilled down to the following bash command:
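# Score every tag by the size of its diff against the modified file.
for ref in $(git tag);
do
    # Print the tag plus a separating space so the count lands in column 2.
    echo -n "$ref ";
    # The path is relative to the repository root (no leading slash).
    diff -w <(git show "$ref:path/to/versioned/file" 2>/dev/null) modified_file | wc -l;
done | sort -k2 -n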
The result of running this command is a list of the tags in the repository ordered by how similar they are to the target content (most similar first). A few comments:
- We iterate through all tags in the project; while there could indeed be changes to the relevant file in intermediate versions, due to the way the release worked it's likely the original file was based on a released (aka tagged) version.
- We're using diff's -w option, as the content may have changed spaces to tabs or vice versa, depending on the editor/editing habits of the original user. This helps us ensure that the changes that we're focusing on are the ones that change something substantial.
- We're doing a numeric sort so the lines with the least number of changes show up at the top.
- For the specific case I used this technique with, there were a number of revisions that had the least number of changed lines. Upon reviewing this smaller set of revisions (using the git diff rev1 rev2 -- path/to/content syntax), it turns out that the file in question had remained unchanged in each of these revisions, so any one of them was useful for my purposes.
- The flexibility in the version detection works in this case because this was an isolated part of the system that did not have any changes or dependencies. If there had been important changes to the system as a whole independent of the changes to this file (but which had an effect on the operation of this specific part), we would need to have a more exact method of identifying the file.
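To see the ranking in action, here is a small self-contained sketch. Everything in it is invented for the demonstration: the three tags, the file name lib.txt, and its contents. It pipes `git show` into `diff` rather than using process substitution, only so that it also runs under a plain POSIX shell; the scoring idea is the same as in the command above.

```shell
# Throwaway identity and repository, purely for the demo.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q repo
cd repo

# Three hypothetical tagged releases of the same file.
printf 'alpha\nbeta\ngamma\n' > lib.txt
git add lib.txt
git commit -q -m "release 1"
git tag v1
printf 'alpha\nbeta\ngamma\ndelta\nepsilon\nzeta\neta\ntheta\n' > lib.txt
git commit -q -am "release 2"
git tag v2
printf 'uno\ndos\ntres\ncuatro\ncinco\nseis\nsiete\nocho\n' > lib.txt
git commit -q -am "release 3"
git tag v3

# The file of unknown origin: release 2 with one line edited locally.
printf 'alpha\nbeta\ngamma\ndelta\nepsilon\nzeta\neta\nTHETA (patched)\n' > ../modified_file

# Rank the tags by diff size, smallest (most similar) first.
ranking=$(for ref in $(git tag); do
    printf '%s ' "$ref"
    git show "$ref:lib.txt" 2>/dev/null | diff -w - ../modified_file | wc -l
done | sort -k2 -n)
echo "$ranking"

# The first column of the first row is the best candidate.
best=$(echo "$ranking" | head -n 1 | awk '{print $1}')
echo "closest tag: $best"
```

The diff line count is a rough metric: with very short files, a release that merely lacks lines can score better than the true ancestor, so it is worth eyeballing the top few candidates rather than trusting the first row blindly.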
git branches and rebasing
Around here I have a reputation for finding the tiniest pothole on the path to git happiness, and falling headlong into it while strapped to a bomb ...
But at least I'm dedicated to learning something each time. This time it involved branches, and how git knows whether you have merged that branch into your current HEAD.
My initial workflow looked like this:
$ git checkout -b MY_BRANCH
(some editing)
$ git commit
$ git push origin MY_BRANCH
(later)
$ git checkout origin/master
$ git merge --no-commit origin/MY_BRANCH
(some testing and inspection)
$ git commit
$ git rebase -i origin/master
This last step was the trip-and-fall, although it didn't hurt me so much as launch me off my path into the weeds for a while. Once I did the "git rebase", git no longer knew that MY_BRANCH had been successfully merged into HEAD. So later, when I tried to delete the branch, I got this:
$ git branch -d MY_BRANCH
error: the branch 'MY_BRANCH' is not fully merged.
As I now understand it, the rebase replayed the merged commits as brand new commits, so the history of MY_BRANCH is no longer a subset of the history of HEAD. Git can't tell the two are related and refuses to delete the branch unless you supply it with -D. A relatively harmless situation, but it set off all sorts of alarms for me, as I thought I messed up the merge somehow.
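The situation can be reproduced in a scratch repository. This is a sketch with assumed names: the "integration" branch, the file names, and the demo identity are all hypothetical, and a plain `git rebase` stands in for the interactive one so that nothing prompts for input. It also shows one way to reassure yourself before reaching for -D: `git cherry` marks with a minus sign the commits whose changes already exist on the other side under a different commit ID.

```shell
# Throwaway identity and repository, purely for the demo.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master
echo base > base.txt
git add base.txt
git commit -q -m "base"

# Work on the feature branch.
git checkout -q -b MY_BRANCH
echo feature > feature.txt
git add feature.txt
git commit -q -m "feature work"

# Meanwhile master moves on.
git checkout -q master
echo other > other.txt
git add other.txt
git commit -q -m "unrelated work on master"
upstream=$(git rev-parse HEAD)

# Merge the feature, then rebase: the merge commit is dropped and the
# feature commit is replayed as a new commit with a new ID.
git checkout -q -b integration
git merge -q --no-edit MY_BRANCH
git rebase -q "$upstream"

# Is the branch tip still reachable from HEAD? (This is what "-d" cares about.)
if git merge-base --is-ancestor MY_BRANCH HEAD; then merged=yes; else merged=no; fi

# Lines starting with "-" are commits whose patch already exists in HEAD.
equivalent=$(git cherry HEAD MY_BRANCH)

# The safe delete refuses, just as in the story above.
if git branch -d MY_BRANCH 2>/dev/null; then deleted=yes; else deleted=no; fi

echo "ancestor of HEAD: $merged"
echo "git cherry says:  $equivalent"
echo "deleted with -d:  $deleted"
```

If every line from `git cherry` starts with a minus sign, the branch's changes are all present in HEAD and `git branch -D MY_BRANCH` loses nothing.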
Postgres configuration best practices
This is the first in an occasional series of articles about configuring PostgreSQL. The main way to do this, of course, is the postgresql.conf file, which is read by the Postgres daemon on startup and contains a large number of parameters that affect the database's performance and behavior. Later posts will address specific settings inside this file, but before we do that, there are some global best practices to address.
Version Control
The single most important thing you can do is to put your postgresql.conf file into version control. I care not which one you use, but go do it right now. If you don't already have a version control system on your database box, git is a good choice to use. Barring that, RCS. Doing so is extremely easy. Just change to the directory postgresql.conf is in. The process for git:
- Install git if not there already (e.g. "sudo yum install git")
- Run: git init
- Run: git add postgresql.conf pg_hba.conf
- Run: git commit -a -m "Initial commit"
For RCS:
- Install as needed (e.g. "sudo apt-get install rcs")
- Run: mkdir RCS
- Run: ci -l postgresql.conf pg_hba.conf
Note that we checked in pg_hba.conf as well. You want to check in any file in that directory you may possibly change. For most people, that only means postgresql.conf and pg_hba.conf, but if you use other files (pg_ident.conf) check those in as well.
Ideally you want the version checked in to be the "raw" configuration files that came with the system - in other words, before you started messing with them. Then you make your initial changes and check it in. From then on of course, you commit every time you change the file.
At a bare minimum, the version control system should be telling you:
- Exactly what was changed
- When it was changed
- Who made the change
- Why it was changed
The first two items happen automatically in all version control systems, so you don't have to worry about those. The third item, "who made the change", must be entered manually if on a shared account (e.g. postgres) and using RCS. If you are using git, you can simply set the environment variables GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL. For shared accounts, I have a custom bashrc file called "gregbashrc" that is called when I log in that sets those ENVs as well as a host of other items.
The fourth item, "why it was changed", is generally the content of the commit message. Never leave this blank, and be as descriptive and verbose as possible - someone later on will be grateful you did. It's okay to be repetitive and state the obvious. If this was done as part of a specific ticket number or project name, mention that as well.
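As a rough illustration of the "who" part, a personal rc file for a shared account might look something like the sketch below. The name and email are placeholders, and setting the GIT_COMMITTER_* pair alongside the author variables is my own assumption rather than something required above; it simply keeps both identities in a commit consistent.

```shell
# Hypothetical contents of a per-person rc file sourced after logging in
# to a shared account such as postgres.
export GIT_AUTHOR_NAME="Jane Admin"
export GIT_AUTHOR_EMAIL="jane@example.com"

# Assumption: keep the committer identity in step with the author identity.
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"
export GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Ask git which author identity it would record for a commit made now.
git var GIT_AUTHOR_IDENT
```

Because these are environment variables, they apply only to the current login session and leave the shared account's own git configuration untouched.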
Safe Changes
It's important that the changes you make to the postgresql.conf file (or other files) actually work and don't cause Postgres to be unable to parse the file, or handle a changed setting. Never make changes and go straight to restarting Postgres, because if the changes don't work, you've got a broken config file, no Postgres daemon, and most likely unhappy applications and/or users. At the very least, do a reload first (e.g. /etc/init.d/postgresql reload or just kill -HUP the PID). Check the logs and see if Postgres was happy with your changes. If you are lucky, it won't even require a restart (some changes do, some do not).
A better way to test your changes is to make them on an identical test box. That way, all the wrinkles are ironed out before you make the changes on production and attempt a reload or restart.
Another way I've found handy is to simply start a new Postgres daemon. Sounds like a lot of work, but it's pretty automatic once you've done it a few times. The process generally looks like this, assuming your production postgresql.conf is in the "data" directory, and your changes are in data/postgresql.conf.new:
- cd ..
- initdb testdata
- cp -f data/postgresql.conf.new testdata/postgresql.conf
- echo port=5555 >> testdata/postgresql.conf
- echo max_connections=10 >> testdata/postgresql.conf
The max_connections is not strictly necessary, of course, but unless you are changing something that relies on that setting, it's nicer to keep it (and the resulting memory) low.
- pg_ctl -D testdata -l test.log start
- cat test.log
- pg_ctl -D testdata stop
- rm -fr testdata (or just keep it around for next time)
The test.log file will show you any problems that might have popped up with your changes, and once it works you can be fairly confident it will work for the "main" daemon as well, so to finish up:
- cd data
- mv -f postgresql.conf.new postgresql.conf
- git commit postgresql.conf -m "Adjusted random_page_cost to 2, per bug #4151"
- kill -HUP `head -1 postmaster.pid`
- psql -c 'show random_page_cost'
Keeping it Clean
The postgresql.conf file is fairly long, and can be confusing to read with its mixture of comments, in-line comments, strange wrapping, and the commented out vs. not-commented-out variables. Hence, I recommend this system:
- Put a big notice at the top of the file asking people to make changes to the bottom
- Put all important variables at the bottom, sans comments, one per line
- Line things up
- Put into logical groups.
This avoids having to hunt for settings, prevents the gotcha of when a setting is changed twice in the file, and makes things much easier to read visually. Here's what I put at the top of the postgresql.conf:
##
## PLEASE MAKE ALL CHANGES TO THE BOTTOM OF THIS FILE!
##
I then add a good 20+ empty lines, so anyone viewing the file is forced to focus on the all-caps message above.
The next step is to put all the settings you care about at the bottom of the file. Which ones should you care about? Any setting you have changed (obviously), any setting that you *might* change in the future, and any that you may not have changed, but someone may want to look up. In practice, this means a list of about 25 items. After aligning all the values to the right and breaking things into logical groups, here's what the bottom of the postgresql.conf looks like:
## Connecting
port                            = 5432
listen_addresses                = '*'
max_connections                 = 100

## Memory
shared_buffers                  = 400MB
work_mem                        = 1MB
maintenance_work_mem            = 1GB

## Disk
fsync                           = on
synchronous_commit              = on
full_page_writes                = on
checkpoint_segments             = 100

## PITR
archive_mode                    = off
archive_command                 = ''
archive_timeout                 = 0

## Planner
effective_cache_size            = 18GB
random_page_cost                = 2

## Logging
log_destination                 = 'stderr'
logging_collector               = on
log_filename                    = 'postgres-%Y-%m-%d.log'
log_truncate_on_rotation        = off
log_rotation_age                = 1d
log_rotation_size               = 0
log_min_duration_statement      = 200
log_statement                   = 'ddl'
log_line_prefix                 = '%t %u@%d %p'

## Autovacuum
autovacuum                      = on
autovacuum_vacuum_scale_factor  = 0.1
autovacuum_analyze_scale_factor = 0.3
Because everything is in one place, at the bottom of the file, and not commented out, it's very easy to see what is going on. The groups above are somewhat arbitrary, and you can leave them out or create your own, but at least keep things grouped together as much as possible. When in doubt, use the same order as they appear in the original postgresql.conf.
Sometimes people change important settings in a group, such as for bulk loading of data. In this case, I usually make a separate group for it at the very bottom. This makes it easy to switch back and forth, and helps to prevent people from (for example) forgetting to switch fsync back on:
## Bulk loading only - leave 'on' for everyday use!
autovacuum       = off
fsync            = off
full_page_writes = off
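Since a setting that appears twice is one of the gotchas this layout is meant to avoid (Postgres uses the last occurrence it reads), a quick check for repeated names can help when cleaning up an existing file. The following is only a sketch: the sample file is made up, and the parsing assumes every setting is written as "name = value", which is the common style but not the only one the file format accepts.

```shell
# A small stand-in for a real postgresql.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
#fsync = on                 # commented-out default, ignored by the check
shared_buffers = 32MB
work_mem = 1MB

## Memory
shared_buffers = 400MB
EOF

# Strip comments, then report each setting name the second time it is seen.
duplicates=$(sed -e 's/#.*//' "$conf" |
    awk -F= 'NF > 1 { gsub(/[ \t]/, "", $1); if (++seen[$1] == 2) print $1 }')

echo "settings defined more than once: $duplicates"
```

Note that the deliberate bulk-loading group above will show up in such a report too, which is a useful reminder that it is there.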
Ownership and permissions
All the conf files should be owned by the postgres user, and the configuration files should be world-readable if possible (indeed, on Debian-based systems postgresql.conf must be readable for psql to work!). Be careful about SELinux as well: it can get ornery if you do things like use symlinks.
Backups
One final note - make sure you are backing up your changes as well. PITR and pg_dump won't save your postgresql.conf! If you are checking things in to a remote version control system, then some of the pressure is off, but you should have some sort of policy for backing up all your conf files explicitly. Even if using a local git repo, tarring and copying up the whole thing is usually a very quick and cheap action.
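As a sketch of how cheap the "tar it up" approach is, the example below archives a stand-in configuration directory, local repository included. The directories are temporary placeholders and the archive name is just one possible convention; copying the result to another machine is left out.

```shell
# Placeholders: a fake data directory and a fake backup destination.
datadir=$(mktemp -d)
backupdir=$(mktemp -d)
touch "$datadir/postgresql.conf" "$datadir/pg_hba.conf" "$datadir/pg_ident.conf"
mkdir "$datadir/.git"     # stands in for the local git repository

# Archive the conf files and the repository, stamped with today's date.
archive="$backupdir/pgconf-$(date +%Y%m%d).tar.gz"
tar -czf "$archive" -C "$datadir" postgresql.conf pg_hba.conf pg_ident.conf .git

# List what went into the archive as a sanity check.
contents=$(tar -tzf "$archive")
echo "$contents"
```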
Continuing an interrupted git-svn clone
I've run into the issue before when using git-svn to clone a large svn repo; something interrupts the transfer, and you end up having to restart the git-svn clone process again. Attempting to git-svn clone from a partially transferred svn clone directory results in error messages from git-svn, and it's not immediately clear what you need to do to pick the process back up from where you left off.
In the past I've just blown away the partially-transferred repo and started the clone over, but that's a waste of time and server resources, not to mention extremely frustrating, particularly if you're substantially into the clone process.
Fortunately, this is not necessary; just go into your partially retrieved git-svn repo and execute git svn fetch. This continues fetching the svn revisions from where you left off. When the process completes, you will have an empty directory with just the .git directory present. Looking at git status shows all of the project files deleted (oh noes!), however this is just misdirection. At this point, you just need to issue a git reset --hard to check out the files in the HEAD commit.
More illustratively:
$ git svn clone http://svn.example.com/project/trunk project
# download, download, download, break!
$ cd project; ls -a
.git
$ git svn fetch
# download, download, download, success!
$ ls -a
.git
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD..." to unstage)
#
#       deleted:    foo.c
#       deleted:    foo.h
#
$ git reset --hard; ls -a1
.git
foo.c
foo.h
$
Make git grep recurse into submodules
If you've done any major work with projects that use submodules, you may have been surprised that `git grep` does not return matches located inside a submodule. If you go into the specific submodule directory and run the same `git grep` command, you will be able to see the results, so what to do in that case?
Fortunately, `git submodule` has a subcommand which lets us execute arbitrary commands in all submodule repos, intuitively named `git submodule foreach`.
My first attempt at a command to search in all submodules was:
$ git submodule foreach git grep {pattern}
This worked fine, except when {pattern} was multiple words or otherwise needed shell escaping. My next attempt was:
$ git submodule foreach git grep "{pattern}"
This properly passed the escapes to the shell (ending up with "'multi word phrase'" in my case), however an additional problem surfaced; the return value of the command resulted in an abort of the foreach loop. This was solved via:
$ git submodule foreach "git grep {pattern}; true"
A more refined version could be created as a git alias, automatically escape its arguments, and union with the results of `git grep`, thus providing the submodule-aware `git grep` I'd been hoping existed already. I leave this as an exercise to the reader... :-)
It's also worth noting that the file paths reported are relative to the containing submodule, so you would need to incorporate the `git submodule foreach`-supplied $path variable to pinpoint the full paths of the files in question.
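One possible shape for that exercise is sketched below as a shell function rather than a git alias. The function name `git_grep_all` is made up, and the pattern is handed to the submodule shells through an environment variable (my own choice) to sidestep the quoting problems described above. It uses `$sm_path`, which newer Git documents as the preferred spelling of `$path`. Newer Git releases also offer `git grep --recurse-submodules`, which may make this unnecessary where it is available.

```shell
# Search the superproject and then every submodule, printing submodule
# matches with paths relative to the superproject.
git_grep_all() {
    # Superproject first; "|| true" ignores the no-match exit status.
    git grep -n -e "$1" || true
    # The pattern travels via the environment, so spaces and quotes survive.
    # The trailing "true" keeps a no-match result from aborting the loop.
    pattern="$1" git submodule --quiet foreach \
        'git grep -n -e "$pattern" | sed "s|^|$sm_path/|"; true'
}

# Demo in a throwaway repository (it has no submodules, so only the
# superproject part produces output here).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
demo=$(mktemp -d)
cd "$demo"
git init -q .
printf 'a multi word phrase\n' > notes.txt
git add notes.txt
git commit -q -m "add notes"

result=$(git_grep_all 'multi word')
echo "$result"
```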
Git Submodules: What is the Ideal Workflow?
Last week, I asked some coworkers at End Point about the normal workflow for using git submodules. Brian responded and the discussion turned into an overview on git submodules. I reorganized the content to be presented in a FAQ format:
How do you get started with git submodules?
You should use git submodule add to add a new submodule. So for example you would issue the commands:
git submodule add git://github.com/stephskardal/extension1.git extension
git submodule init
Then you would git add extension (the path of the submodule installation) and git commit.
What does the initial setup of a submodule look like?
The super project repo stores a .gitmodules file. A sample:
[submodule "extension1"]
path = extension
url = git://github.com/stephskardal/extension1.git
[submodule "extension2"]
path = extension_two
url = git://github.com/stephskardal/extension2.git
When you have submodules in a project, do you have to separately clone them from the master project, or does the initial checkout take care of that recursively for you?
Generally, you will issue the commands below when you clone a super project repository. These commands will "install" the submodule under the main repository.
git submodule init
git submodule update
How do you update a git submodule repository?
Given an existing git project in the "project" directory, and a git submodule extension1 in the extension directory:
First, a status check on the main project:
~/project> git status
# On branch master
nothing to commit (working directory clean)
Next, a status check on the git submodule:
~/project> cd extension/
~/project/extension> git status
# Not currently on any branch.
nothing to commit (working directory clean)
Next, an update of the extension:
~/project/extension> git fetch
remote: Counting objects: 30, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 19 (delta 9), reused 0 (delta 0)
Unpacking objects: 100% (19/19), done.
From git://github.com/stephskardal/extension1
   0f0b76b..9cbb6bd  master     -> origin/master
~/project/extension> git checkout master
Previous HEAD position was 0f0b76b... Added before_filter to base controller.
Switched to branch "master"
Your branch is behind 'origin/master' by 5 commits, and can be fast-forwarded.
~/project/extension> git merge origin/master
Updating f95a2d5..9cbb6bd
Fast forward
 extension.rb |   10 +
 README       |   36 +
 TODO         |   11 +-
 ...
~/project/extension> git status
# On branch master
nothing to commit (working directory clean)
Next, back to the main project:
~/project/extension> cd ..
~/project> git status
# On branch master
# Changed but not updated:
#   (use "git add..." to update what will be committed)
#   (use "git checkout -- ..." to discard changes in working directory)
#
#       modified:   extension
#
no changes added to commit (use "git add" and/or "git commit -a")
Now, a commit to include the submodule repository change. Brian has made it a convention to manually include SUBMODULE UPDATE: extension_name in the commit message to inform other developers that a submodule update is required.
~/project> git add extension
~/project> git commit
[master eba52d5] SUBMODULE UPDATE: extension
 1 files changed, 1 insertions(+), 1 deletions(-)
What does git store internally to track the submodule? The HEAD position? That would seem to be the minimal information needed to tie the specific submodule-tracked version with the version used in the superproject.
It stores a specific commit SHA1 so even if HEAD moves the super project's "reference" doesn't, which is why updating to the upstream version must be followed by a commit so that the super project is "pinned" to the same commit across repos. You'll see in the example above that the submodule project was in a detached head state (not on a branch) so HEAD doesn't really make sense.
It is critical that the super project repo store an exact position for the submodule otherwise you would not be able to associate your own code with a particular version of a submodule and ensure that a given submodule is at the same position across repos. For instance, if you updated to an upgraded version of a submodule and committed it not realizing that it broke your own code, you can check out a previous spot in the repository where the code worked with the submodule.
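The pinning behaviour can be observed directly in a pair of scratch repositories. This is a sketch with invented names (the "ext" and "super" repositories, the README file, the demo identity), and it passes `-c protocol.file.allow=always` only because recent Git releases refuse to clone submodules from local paths by default; with a real remote URL that option would not be needed.

```shell
# Throwaway identity, purely for the demo.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"
work=$(mktemp -d)

# A tiny repository to play the role of the extension.
git init -q "$work/ext"
cd "$work/ext"
echo v1 > README
git add README
git commit -q -m "extension v1"

# The super project adds it as a submodule and commits the result.
git init -q "$work/super"
cd "$work/super"
git -c protocol.file.allow=always submodule --quiet add "$work/ext" extension
git commit -q -m "Add extension submodule"

# The super project's tree records a commit ID for the submodule path.
git ls-tree HEAD extension
pinned=$(git rev-parse HEAD:extension)
actual=$(git -C extension rev-parse HEAD)

# Move the submodule forward without committing in the super project.
cd extension
echo v2 >> README
git commit -q -am "extension v2"
cd ..
moved=$(git -C extension rev-parse HEAD)
still_pinned=$(git rev-parse HEAD:extension)

echo "pinned by super project: $still_pinned"
echo "submodule is now at:     $moved"
```

Until the super project runs `git add extension` and commits, its recorded commit stays where it was, which is exactly what makes it possible to go back to a known-good combination.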
Hopefully, this discussion on git submodules begins to show how powerful git and submodules can be for making it easy for non-core developers to start sharing their code on an open source project.
Thanks to Brian Miller and David Christensen for contributing the content for this post! I reference this article in my article on Software Development with Spree - I've found it very useful to use git submodules to install several Spree extensions on recent projects. The Spree extension community has a few valuable extensions that introduce features such as product reviews, faq, blog organization, static pages, and multi-domain setup.
Postgres: Hello git, goodbye CVS
It looks like 2010 *might* be the year that Postgres officially makes the jump to git. Currently, the project uses CVS, with a script that moves things to the now canonical Postgres git repo at git.postgresql.org. This script has been causing problems, and is still continuing to do so, as CVS is not atomic. Once the project flips over, CVS will still be available, but CVS will be the slave and git the master, to put things in database terms. The conversion from git to CVS is trivial compared to the other way around, so there is no reason Postgres cannot continue to offer CVS access to the code for those unwilling or unable to use git.
On that note, I'm happy to see that the number of developers and committers who are using git - and publicly stating their happiness with doing so - has grown sharply in the last couple of years. Peter Eisentraut (with some help from myself) set up git.postgresql.org in 2008, but interest at that time was not terribly high, and there was still a lingering question of whether git was really the replacement for CVS, or if it would be some other version control system. There is little doubt now that git is going to win. Not only for the Postgres project, but across the development world in general (both open and closed source).
To drive the point home, Andrew has announced he is working on git integration with the Postgres build farm. Of course, I submitted a patch to do just that back in March 2008, but I was ahead of my time :). Besides, mine was a simple proof of concept, while it sounds like Andrew is actually going to do it the right way. Go Andrew!
Of all the projects I work on, the great majority are using git now. We've been using git at End Point as our preferred VCS for both internal projects and client work for a while now, and are very happy with our choice. There is only one other project I work on besides Postgres that uses CVS, but it's a small project. I don't know of any other project of Postgres' size that is still using CVS (anyone know of any?). Even emacs recently switched away from CVS, although they went with bazaar instead of git for some reason. Subversion is still being used by a substantial minority of the projects I'm involved with, mostly due to the historical fact that there was a window of time in which CVS was showing its limitations, but subversion was the only viable option. Sure would be nice if perl.org would offer git for Perl modules, as they do for subversion currently (/hint). Finally, there are a few of my projects that use something else (mercurial, monotone, etc.). Overall, git accounts for the lion's share of all my projects, and I'm very happy about that. There is a very steep learning curve with git, but the effort is well worth it.
If you want to try out git with the Postgres project, first start by installing git. Unfortunately, git is still new enough, and actively developed enough, that it may not be available on your distro's packaging system, or worse, the version available may be too old to be useful. Anything older than 1.5 should *not* be used, period, and 1.6 is highly preferred. I'd recommend taking the trouble to install from source if git is older than 1.6. Once installed, here is the command to clone the Postgres repo:
git clone git://git.postgresql.org/git/postgresql.git postgres
This step may take a while, as git is basically putting the entire Postgres project on your computer - history and all! It took me three and a half minutes to run, but your time may vary.
Once that is done, you'll have a directory named "postgres". Change to it, and you can now poke around in the code, just like CVS, but without all the ugly CVS directories. :)
For more information, check out the "Working with git" page on the Postgres wiki.
Here's to 2010 being the year Postgres finally abandons CVS!
RCS vs. Git for quick versioning
As a consultant, I'm often called to make changes on production systems - sometimes in a hurry. One of my rules is to document all changes I make, no matter how small or unimportant they may seem. In addition to local notes, I always check in any files I change, or might change in the future, into version control. In the past, I would always use RCS. However, Jon Jensen challenged me to rethink my automatic use of RCS and give Git a try for this.
This makes sense on some levels. We use Git for most everything here at End Point, and it is our preferred version control system. I still use other systems: there are some clients and projects that require the use of Subversion, Mercurial, and even CVS. The advantage of Git for quick one off checkins is that, similar to RCS, there is no central repository, and setup is extremely easy.
As an example, one of the files I often check into version control is postgresql.conf, the main configuration file for the Postgres database. Before I even edit the file, I'll check it in, so the sequence of events looks like this:
mkdir RCS
ci -l postgresql.conf
edit postgresql.conf
The creation of the RCS directory is optional but recommended. RCS (which stands for Revision Control System) uses a very simple tracking mechanism. A new file that tracks all changes is created for each file. This new file takes the original name of the file and adds a ",v" to the end of it. However, it's annoying to have all those "comma vee" files lying around, so RCS has a nice trick: when a directory named RCS exists, all the comma vee files will be placed into that directory. The "ci -l postgresql.conf" checks in (ci) the file, and the "-l" flag instructs RCS to immediately check it back out again and lock it (as the current user). This is an RCS specific advisory lock, and only gets in the way if you try to check in the file as a different user. The final command above, "edit postgresql.conf", calls up my editor of choice so I can start modifying the file.
Once the file has been modified, checking in the changes made is as simple as once again doing:
ci -l postgresql.conf
Now that it has been checked in, I can perform other common version control tasks against it. To see the complete log of changes:
rlog postgresql.conf
To see the differences between the current version and the last checkin, or against a specific version:
rcsdiff postgresql.conf
rcsdiff -r1.3 postgresql.conf
To find a string in a specific previous version:
co -p -r1.3 postgresql.conf | grep foobar
Using Git for this purpose is fairly similar. The first steps now become:
git init
git add postgresql.conf
git commit postgresql.conf
edit postgresql.conf
Technically, one more step than before, but not really a big deal. Note that we don't need to create a special directory to hold the versioning information: by default, Git puts everything in a ".git" directory. Once we've made changes to the file, we can commit our changes with:
git commit postgresql.conf
To see the log of changes:
git log postgresql.conf
To see the differences between the current version and the last checkin, or against a specific version:
git diff postgresql.conf
git diff 11a049bc80fe4a2f4584465fe13d8bb4ee479f23 postgresql.conf
To find a string in a specific previous version:
git show 11a049bc80fe4a2f4584465fe13d8bb4ee479f23:postgresql.conf | grep foobar
With Git, there is also quite a bit more that can be done now - easy branching, grepping, generating diffs, etc. However, most of it is overkill for the simple purpose of tracking local changes. On the downside, Git does not have the simple version numbering that RCS has, and the syntax can be a bit trickier and non-intuitive.
So, did I make the switch? Well, yes and no. I've been trying to use Git for simple checkins the last few weeks, and have had mixed results. Here's my breakdown of areas in which they differ:
Ease of use
RCS wins this one. All you really need to remember to use RCS is "ci -l filename". The only other commands you might possibly need are "rlog filename" and "rcsdiff filename". On the other hand, Git requires a deeper understanding of objects, trees, add vs. commit, and the use of long, hard to type hexadecimal numbers. It's also not very intuitive, and the command arguments can be complex. To be fair, for this particular use case Git is not really that much more complex, but the advantage still goes to RCS.
Availability
RCS wins this one as well. On many systems, RCS is already installed by default. Even when it is not, a "yum install rcs" or the equivalent works just fine 100% of the time. RCS has been around a long, long time, and it's solid, tested, and very available on any system you run into. In contrast, Git is fairly new, does not come pre-installed on most systems, and is not even available via all packaging systems. This is one factor that would definitely prevent me from using it everywhere. Maybe years from now when it is a standard tool, this will change, but for now, RCS wins this one.
Diffs
The rcsdiff command is handy, but very limited. If all you want is the simplest of bare-bones diffs, all is good. However, Git allows you to view diffs in different formats, add color, generate patches, and many other features that can be nice to have.
Fancy tricks
RCS is designed to be dirt simple and good at what it does: track single files. The design of Git was for a large, distributed project with complex needs. This means that Git has many tricks and features that the designers of RCS did not even dream of. While most of them are not needed when you are simply doing versioning of local files, there are definitely times when the full power of Git is nice to have.
Grouping
RCS has no concept of projects or trees: everything is simply a file. This means that you cannot track relationships between files. The only possible way to do so is to compare the timestamps that two files were checked in. In contrast, Git does not consider files at all, but simply treats everything as objects in a tree. This allows easy grouping of files together in a single logical commit. It also allows for things such as branching and merging.
Versioning
While Git uses SHA1 checksums to name each object with a unique identity, RCS simply uses a "single dot" version number, and increments it for you. Thus, the first time you check in a file, it is set as version 1.1. The second version is 1.2, and so on. This is very useful when you are simply tracking a lone file - you know that version 1.20 is the 20th recorded change, and that comparing or viewing an earlier version is as simple as using the "-r x.y" option. Calling what Git does "versioning" is somewhat of a misnomer - it has a completely different philosophy about how objects are tracked, which lends itself well to distributed and collaborative projects, but not so well to single files.
Blame
Here's one area where Git wins hands down. For RCS, you do a checkin, and the file is locked as the current local user. There is no indication of the actual person doing the checkin (as opposed to the account name), unless you add it to the checkin comment each time, and that gets laborious and annoying. With Git, you can set some standard environment variables (even on a shared account), and Git will record who made the change. Not only can you see who made each commit and when, but you can use the awesome "git blame" command to view who made the last change to each line in a file.
As an aside, how do we do the assignment mentioned above in a shared account? Setting the author for Git commits is as simple as setting environment variables like so:
$ export GIT_AUTHOR_NAME="Greg Sabino Mullane"
$ export GIT_AUTHOR_EMAIL="greg@endpoint.com"
On a shared account, just create an alias. For example:
cat > .gregs_stuff
export GIT_AUTHOR_NAME="Greg Sabino Mullane"
export GIT_AUTHOR_EMAIL="greg@endpoint.com"
<ctrl-D>
cat >> .bashrc
alias greg='source ~/.gregs_stuff'
<ctrl-D>
Editor support
One of the nice things about RCS is that it has been around for so long that many editors have integrated support for it. For example, calling up a file in emacs that has been checked in via RCS shows a display in the status line at the bottom of the screen showing that the file is controlled by RCS, what the current version number is, and whether it is locked or not. While there is Git support as well, it's only available in very new versions of emacs (and other editors). Advantage, RCS.
Bloat
Because Git is a real version control system, and a complicated one at that, it carries a lot of setup baggage. Just creating a repository and checking in a single file creates about 37 files underneath the .git directory. This number grows sharply with every commit you do. By contrast, RCS creates a single file (and one additional for each file you track). This means you can easily ship around the "comma vee" files to other systems.
Final analysis
When looking at all the factors, RCS still wins. It's simple, gets the job done, and most important of all, is available on all systems. I may revisit this in a few years when Git is more widespread.
rsync and bzip2 or gzip compressed data
A few days ago, I learned that gzip has a custom option --rsyncable on Debian (and thus also Ubuntu). This old write-up covers it well, or you can just `man gzip` on a Debian-based system and see the --rsyncable option note.
I hadn't heard of this before and think it's pretty neat. It resets the compression algorithm on block boundaries so that rsync won't view every block subsequent to a change as completely different.
Because bzip2 has such large block sizes, it forces rsync to resend even more data for each plaintext change than plain gzip does, as noted here.
Enter pbzip2. Based on how it works, I suspect that pbzip2 will be friendlier to rsync, because each thread's compressed chunk has to be independent of the others. (However, pbzip2 can only operate on real input files, not stdin streams, so you can't use it with e.g. tar cj directly.)
In the case of gzip --rsyncable and pbzip2, you trade a little lower compression efficiency (< 1% or so worse) for reduced network usage by rsync. This is probably a good tradeoff in many cases.
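To make the resynchronization idea a bit more concrete, here is a toy Python sketch. It is not gzip's actual code: the window size, the modulus, and the function name `chunks` are all made up for illustration. It cuts a byte stream wherever a rolling sum of the last few bytes hits a chosen value, so the block boundaries depend on nearby content rather than on absolute position in the stream.

```python
def chunks(data, window=16, modulus=64):
    """Split data wherever the sum of the last `window` bytes is a
    multiple of `modulus` (hypothetical parameters, chosen for the demo)."""
    out, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if i >= window - 1 and rolling % modulus == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])  # trailing partial chunk
    return out


# Deterministic filler data, then the same data with one byte
# inserted at the very front.
original = bytes((i * i * 31 + i * 7) % 251 for i in range(4000))
modified = b"!" + original

before = chunks(original)
after = set(chunks(modified))
unchanged = [c for c in before[1:] if c in after]
print(len(before), "chunks,", len(unchanged), "survive the insertion")
```

Every chunk after the first one shows up again, byte for byte, in the modified stream. A compressor that restarts at boundaries like these would emit the same compressed bytes for those chunks, which is the property that lets rsync skip them.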
But even more interesting for me, a couple of days ago Avery Pennarun posted an article about his experimental code to use the same principles to more efficiently store deltas of large binaries in Git repositories. It's painful to deal with large binaries in any version control system I've used, and most people simply say, "don't do that". It's too bad, because when you have everything else related to a project in version control, why not some large images or audio files too? It's much more convenient for storage, distribution, complete documentation, and backups.
Avery's experiment gives a bit of hope that someday we'll be able to store big file changes in Git much more efficiently. (Though it doesn't affect the size of the initial large object commits, which will still be bloated.)
Git rebase: Just-Workingness Baked Right In (If you're cool enough)
Reading about rebase makes it seem somewhat abstract and frightening, but it's really pretty intuitive when you use it a bit. In terms of how you deal with merging work and addressing conflicts, rebase and merge are very similar.
Given branch "foo" with a sequence of commits:
foo: D --> C --> B --> A
I can make a branch "bar" off of foo: (git branch bar foo)
foo: D --> C --> B --> A
bar: D --> C --> B --> A
Then I do some development on bar, and commit. Meanwhile, somebody else develops on foo, and commits. Introducing new, unrelated commit structures.
foo: E --> D --> C --> B --> A
bar: X --> D --> C --> B --> A
Now I want to take my "bar" work (in commit X) and put it back upstream in "foo".
- I can't push from local bar to upstream foo directly because it is not a fast-forward operation; foo has a commit (E) that bar does not.
- I therefore have to either merge local bar into local foo and then push local foo upstream, or rebase bar to foo and then push.
A merge will show up as a separate commit. Meaning, merging bar into foo will result in commit history:
foo: M --> X --> D --> C --> B --> A
      \
       E --> D --> C --> B --> A
(The particulars may depend on conflicts in E versus X).
Whereas, from branch "bar", I could "git rebase foo". Rebase would look and see that "foo" and "bar" have commits in common starting from D. Therefore, the commits in "bar" more recent than D would be pulled out and applied on top of the full commit history of "foo". Meaning, you get the history:
bar: X' --> E --> D --> C --> B --> A
This can be pushed directly to "foo" upstream because it contains the full "foo" history and is therefore a fast-forward operation.
Why does X become X' after the rebase? Because it's based on the original commit X, but it's not the same commit; part of a commit's definition is its parent commit, and while X originally referred to commit D, this derivative X' refers instead to E. The important thing to remember is that the content of the X' commit is taken initially from the original X commit. The "diff" you would see from this commit is the same as from X.
If there's a conflict such that E and X changed the same lines in some file, you would need to resolve it as part of rebasing, just like in a regular merge. But those changes for resolution would be part of X', instead of being part of some merge-specific commit.
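Here is a toy Python model of why X turns into X'. It is only a sketch under simplifying assumptions: a "commit" here is just a parent ID plus a label standing in for the diff, whereas a real Git commit ID also covers the tree, author, timestamps, and message. The helper names `commit_id`, `build`, and `rebase` are hypothetical and not part of Git.

```python
import hashlib


def commit_id(parent_id, diff):
    """Hash the parent ID together with the change, so the same change
    on a different parent gets a different ID."""
    return hashlib.sha1(f"{parent_id}\n{diff}".encode()).hexdigest()


def build(diffs, base=""):
    """Return a list of (commit_id, diff) pairs, oldest first."""
    history, parent = [], base
    for diff in diffs:
        parent = commit_id(parent, diff)
        history.append((parent, diff))
    return history


def rebase(branch, upstream):
    """Replay the commits that only `branch` has on top of `upstream`."""
    upstream_ids = {cid for cid, _ in upstream}
    own = [diff for cid, diff in branch if cid not in upstream_ids]
    rebased = list(upstream)
    parent = upstream[-1][0]
    for diff in own:
        parent = commit_id(parent, diff)
        rebased.append((parent, diff))
    return rebased


common = build(["A", "B", "C", "D"])
foo = common + build(["E"], base=common[-1][0])
bar = common + build(["X"], base=common[-1][0])

bar_rebased = rebase(bar, foo)
print("X  id:", bar[-1][0])
print("X' id:", bar_rebased[-1][0])
print("same diff:", bar[-1][1] == bar_rebased[-1][1])
```

The rebased branch starts with the whole of foo's history, which is why pushing it is a fast-forward, and its last commit carries the same change as X under a new ID.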
Considerations for choosing rebase versus merge
Rebasing should generally be the default choice when you're pulling from a remote into your repo.
git pull --rebase
Note that it's possible to make --rebase the default option for pulling for a given branch. From Git's pull docs:
To make this the default for branch <name>, set configuration branch.<name>.rebase to true.
However, as usual with Git, saying "do this by default" only gets you so far. If you assume rebase is always the right choice, you're going to mess something up.
Probably the most important rule for rebasing is: do not rebase a branch that has been pushed upstream, unless you are positive nobody else is using it.
Consider:
- Steph has a Spree fork on Github. So on her laptop, she has a repo that has her Github fork as its "origin" remote.
- She also wants to easily pull in changes from the canonical Spree Github repo, so she has that repo set up as the "canonical" remote in her local repo.
- Steph does work on a branch called "address_book", unique to her Github fork (not in the canonical repo).
- She pushes her stuff up to "address_book" in origin.
- She decides she needs the latest and greatest from canonical. So she fetches canonical. She can then either: rebase canonical/master into address_book, or merge.
The merge makes for an ugly commit history.
The rebase, on the other hand, would make her local address_book branch incompatible with the upstream one she pushed to in her Github repo. Because whatever commits she pushed to origin/address_book that are specific to that branch (i.e. not on canonical/master) will get rebased on top of the latest from canonical/master, meaning they are now different commits with a different commit history. Pushing is now not really an option.
In this case, making a different branch would probably be the best choice.
Ultimately, the changes Steph accumulates in address_book should indeed get rebased with the stuff in canonical/master, as the final step towards making a clean history that could get pulled seamlessly onto canonical/master.
So, in this workflow, a final step for publishing a set of changes intended for upstream consumption and potential merge into the main project would be, from Steph's local address_book branch:
# get the latest from canonical repo
git fetch canonical
# rebase the address book branch onto canonical/master
git rebase canonical/master
# work through any conflicts that may come up, and naturally test
# your conflict fixes before completing
...
git push origin address_book:refs/heads/address_book_release_candidate
That would create a branch named "address_book_release_candidate" on Steph's Github fork, structured to have a nice commit history with canonical/master, meaning that the Spree core folks could easily pull it into the canonical repo if it passes muster.
What you would not ever do is:
git fetch canonical
# make a branch based off of canonical/master
git branch canonical_master canonical/master
# rebase the master onto address_book
git rebase address_book
As that implies messing with the commit history of the canonical master branch, which we all know to be published and therefore must not be subject to history-twiddling.
That Feeling of Liberation? It's Git.
In the last few weeks, a few of us have been working on a project for Puppet involving several lines of concurrent development. We've relied extensively on the distributed nature of Git and the low cost of branching to facilitate this work. Throughout the process, I occasionally find myself pondering a few things:
- How do teams ever coordinate work effectively when their version control system lacks decent branching support?
- The ease with which commits can be sliced and diced and tossed about (merge, rebase, cherry-pick, and so on) is truly delightful
- It is not unreasonable to describe Git as "liberating" in this process: here is a tool with which the logical layer (your commit histories) largely reflects reality, with which the engineer is unencumbered in his/her ability to accomplish the task at hand, and from which the results' cleanliness or messiness is the product of the engineering team's cleanliness or messiness rather than a by-product of the tool's deficiencies
- One "canonical" branch in a particular repository, into which all work is merged by a single individual
- Engineers do work in their own branches/repositories, which they "publish" (in this case, on Github) through occasional pushes
- Different lines of development take place on different branches, keeping the logical threads of development separate until any given piece progresses sufficiently to warrant merging back into the canonical branch
Seemingly-speculative development efforts are worth more in this approach, because the most seemingly-speculative work can go out on an independent branch, starting from the common history, to be used later (or not) according to need. The ease of sharing the work, of keeping it cleanly isolated but generally low-cost to integrate later, all reduce the "speculative" part of speculation.
Much of the public discussion of distributed development in practice, using Git, revolves around Linux kernel development. That's of course a massive project with many contributors and a great many lines of development. It's easy to look at distributed version control and the related development practices and say "this is not necessary; my project isn't that complex and doesn't need all this fanciness." Such a conclusion, while understandable, ignores the most important factor in all software development work: human beings do the work.
Human beings can mentally envision complex structures, relationships, processes with instantaneous ease. While our thought processes on a given thread may move along serially, our general approach to problems often involves a graph or web rather than a single line. Furthermore, concurrent processing is second-nature to all of us, depending on the situation:
- The car driver guides the steering wheel such that over the course of traveling forty feet, the car smoothly achieves a ninety-degree change of direction, while coordinating the changing of gears and acceleration through manipulation of clutch, accelerator, and gear shift, all while chatting with the child in the back seat
- The singer performing a Bach aria manipulates diaphragm, jaw, tongue, lips, etc., to achieve the ideal resonance for the current vowel across an intricate repeated sequence of pitch relationships, while focusing on the sound of the organ for tuning and ensemble, and while envisioning the expansive overarching shape of the phrase to ensure the large-scale dynamic fits the musical expression needed
- The child in the outfield hums quietly, thinking about the cartoons he watched yesterday, while intently watching to see if the tee-ball will ever be coming his way
In my experience, when speaking about development tasks with my peers, the most common situation is for the conversation to be muddied by an excess of ideas and possibilities. Too many topics and ways forward bubble about in our collective head, and development forces us to shed these until we arrive at the stripped-bare essentials. Furthermore, it is similarly common that certain questions cannot be answered in the abstract, and require the rolling-up of sleeves to arrive at a solution. Along the way to that solution, how often does one come upon implementation choices that were not previously considered, the implications of which require further assessment?
We often think, individually or collectively, in webs of relationships. A tool that requires us to develop serially defies our basic humanity. This is the true liberation Git brings: concurrent development -- by a team of many, a few, or one -- can be sanely achieved. Put the new thing in a branch and move on. Merging it later will very possibly be easy, but even if it's not, it is always possible.
To quote a special fella, "freedom's untidy". Development tools that facilitate multiple lines of concurrent development mean that one ends up in the situation of dealing with, well, multiple lines of development. The technical problem (no branching!) becomes a meatspace problem (aagh! branches!). There's no magical elixir for that problem, as it requires social solutions, such as email or a wiki. The meatspace problems exist in any case; Git simply forces you to recognize them and plan for them.
Subverting Subversion for Fun and Profit
One of our clients recently discovered a bug in a little-used but vital portion of the admin functionality of their site. (Stay with me here...) After traditional debugging techniques failed on the code in question, it was time to look to the VCS for identifying the regression.
We fortunately had the code for their site in version control, which is obviously a big win. Unfortunately (for me, at least), the repository was stored in Subversion, which means that my bag o' tricks was significantly diminished compared to my favorite VCS, git. After attempting to use 'svn log/svn diff -c' to help identify the culprit based on what I *thought* the issue might be, I realized that svn was just not up to the task.
Enter git-svn. Using git svn clone file://path/to/repository/trunk, I was able to acquire a git-ized version of the application's repository. For this client, we use DevCamps exclusively, so the entire application stack is stored in the local directory and run locally, including apache instance and postgres cluster. These pieces are necessarily unversioned, and are ignored in the repository setup. I was able to stop all camp services in the old camp directory (svn-based), rsync over all unversioned files to the new git repository (excluding the .svn metadata), replace the svn-based camp with the new git-svn based one, and fire up the camp services again. Started up immediately and worked like a charm. I now had git installed and working in what had previously been svn-capable only.
Now that I had a git installation, I was able to pull one of my favorite tools from my toolbox when fighting regressions: git-bisect. In my previous svn contortions, I had located a previous revision several hundred commits back which did not exhibit the regression, so I was able to start the bisect with the following command: git bisect start bad good. In this case, bad was master and good was the revision I had found previously. Using git svn find-rev rnumber, I found the SHA1 commit for the good ref as git saw it.
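For anyone who hasn't used it, the idea behind bisecting is an ordinary binary search over the commit history. The Python sketch below is only an illustration of that idea, not how git-bisect is implemented: the function name `first_bad`, the numbered stand-in commits, and the `is_bad` test are all hypothetical, and it assumes the history is linear and the bug stays present once introduced.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad.
    Returns the culprit and the number of commits that had to be tested.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression came in after mid
    return commits[bad], tests


# 300 numbered stand-in commits; pretend commit 217 introduced the bug.
history = list(range(300))
culprit, tests = first_bad(history, lambda commit: commit >= 217)
print("culprit:", culprit, "found after", tests, "tests")
```

With several hundred commits between good and bad, the search needs only a handful of test runs, which is what makes the approach practical when each test means exercising the site by hand.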
From this point, I was able to quickly identify the commit which introduced the regression. In reviewing the diff, there was nothing that I would have expected to cause the issue at hand; the code did not touch any of the affected area of the admin. But git had never lied to me before. I compared the code currently in master with that introduced in the implicated commit and saw that most of it was still in place. I began selectively commenting out pieces of the code the commit introduced, and was able to enable/disable the bug with increasingly fine granularity. Finally, I was able to identify the single line which when removed caused the issue to evaporate. This was a line in an innocuous template which had a simple variable interpolation (inside an HTML comment, nonetheless); however, this line (which was in a file which was included with every document, added in the implicated commit) revealed a bug in the parser of the app-server which was causing the symptoms in the unrelated admin area.
It's certain that I would never have been able to find the source of this issue without git-bisect, as manual bisection with svn would have been too tedious to even consider. I am able to happily interact with the rest of the development team with git being my secret weapon; git svn dcommit enables me to push my commits upstream, and git svn fetch/git svn rebase enable me to pull in the upstream changes. I'll never need to tell my subversive secret (except, you know, on the company blog), and my own happiness and productivity have increased. Profit!!11 all around.
Emacs Tip of the Day: ediff-revision
I recently discovered a cool feature of emacs: M-x ediff-revision. This launches the excellent ediff-mode with the defined version control system's concept of revision spelling. In my case, I was wanting to compare all changes between two git branches introduced several commits ago relative to each branch's head. M-x ediff-revision prompted for a filename (defaulting to the current buffer's file) and two revision arguments, which in vc-git's case ends up being anything recognized by git rev-parse. So I was able to provide the simple revisions master^ and otherbranch^{4} and have it Do What I Mean™.
I limited the diff hunks in question to those matching specific regexes (different for each buffer) and was able to quickly and easily verify that all of the needed changes had been made between each of the branches.
As usual, C-h f ediff-revision is a good jumping off point for finding more about this useful editor command, as is C-h f ediff-mode for finding more about ediff-mode in general.
Bare git repositories and newspapers
During a recent discussion about git, I realized yet again that previous knowledge of a Version Control System (VCS) actively hinders understanding of git: this is especially challenging when trying to understand the difference between bare vs non-bare repositories.
An analogy might be helpful: assume a modern newspaper, where the actual contents of the physical pages are stored in a database; i.e., the database might store contents of articles in one table, author information in another, page layout information in yet another table, and information on how an edition is built in yet another table, or perhaps in an external program. Any particular edition of the paper just happens to be a particular instantiation of items that live in the database.
Suppose an editor walks in and tells the staff "Create a special edition that consists of the front pages of the past week's papers." That edition could easily be created by taking all the front page articles from the past week from the database. No new content would be needed in the content tables themselves, just some metadata changes to label the new edition and description of how to build it.
One could consider the database, then, to be the actual newspaper.
Let's apply that analogy to git:
A git repository is the newspaper database. A particular git branch is the equivalent of a particular day's paper: e.g., the edition for February 5, 2009 consisting of a set of articles, glued together by a layout specification, tied to a label 'February 5, 2009'. In git terms, that would be blobs of data, glued together by references, perhaps labeled by either a branch or a tag.
A bare git repository, then, is the newspaper database itself, not a huge stack of all the editions ever printed. That's a large contrast to some other VCSs where a repository is the first edition ever printed, with diffs stored on top of that. Running git clone is equivalent to a database copy of all the tables of the database. Doing a git checkout of a branch is the equivalent of asking the newspaper factory to read in the metadata and content from the database and produce a physical paper instance of the newspaper.
Test::Database Postgres support
At our recent company meeting, we organized a 'hackathon' at which the company was split into small groups to work on specific projects. My group was Postgres-focused and we chose to add Postgres support to the new Perl module Test::Database.
This turned out to be a decent sized task for the few hours we had to accomplish it. The team consisted of myself (Greg Sabino Mullane), Mark Johnson, Selena Deckelmann, and Josh Tolley. While I undertook the task of downloading the latest version and putting it into a local git repository, others were assigned to get an overview of how it worked, examine the API, and start writing some unit tests.
In a nutshell, the Test::Database module allows an easy interface to creating and destroying test databases. This can be a non-trivial task on some systems, so putting it all into a module makes sense (as well as the benefits of preventing everyone from reinventing this particular wheel). Once we had a basic understanding of how it worked, we were off.
While all of our tasks overlapped to some degree, we managed to get the job done without too much trouble, and in a fairly efficient manner. We made a new file for Postgres, added in all the required API methods, wrote tests for each one, and documented everything as we went along. The basic method to create a test database is to use the initdb program to create a new Postgres cluster, then modify the cluster to use a local Unix socket in the newly created directory (thus side-stepping completely the problem of using an already occupied port). Then we can start up the new cluster via the pg_ctl command, and create a new database.
At the end of the day, we had a working module that passed all of its tests. We combined our git patches into a single one and mailed it to the author of the module, so hopefully you'll soon see a new version of Test::Database with Postgres support!
Git it in your head
Git is an interesting piece of software. For some, it comes pretty naturally. For others, it's not so straightforward.
Comprehension and appreciation of Git are not functions of intellectual capacity. However, the lack of comprehension/appreciation may well indicate one of the following:
- Mistakenly assuming that concepts/procedures from other VCSes (particularly non-distributed "traditional" ones like CVS or Subversion) are actually relevant when using Git
- Not adequately appreciating the degree to which Git's conception of content and history represents a logical layer, as opposed to implementation details
CVS and Subversion both invite the casual user to basically equate the version control repository and all operations around it to the file system itself. They ask you to understand how files and directories are treated and tracked within their respective models, but that model is basically oriented around files and directories, period. Yes, there are branches and tags. Branches in particular are entirely inadequate in both systems. They don't really account for branching as a core possibility that should be structured into the logical model itself; consequently, both systems can keep things simple (the model basically amounts to files and directories), neither one challenges the user mentally, and neither one does much for you when you have real problems to solve that involve branching.
With such a low barrier to entry, where the logical model is barely distinguishable for day-to-day use from the file system, it's easy for engineers to think of the VCS as a taken-for-granted utility, that should "Just Work" and be really easy and not challenge assumptions, etc. Then Git comes along and punishes anyone who takes that view; if you try to treat Git as a simple utility to drop in the place of CVS/SVN, you will eventually suffer. The engineer must grasp the logical layer in order to make effective use of the tool on anything beyond the shallowest of levels.
So, here's the deal: when they say that Git tracks content, not files, they mean it. They're telling you that Git isn't just a nice versioned history of your file system. Rather, Git offers you a deal: you take an hour or two to learn its logical layer (its object model, in a sense), and in exchange Git will give you branching and distributed workflow as a basic way of life. It's a good deal.
Consequently, those new to Git or those having trouble with Git may do well to throw out any assumptions. Instead, memorize this:
- Objects are just things stored in Git that have a type, some data, some Git-oriented headers, and a unique ID consisting of a SHA1 hash of the components of the object. There aren't very many object types to learn
- Blobs are simple objects that just contain some data and could be thought of as "leaves" on a tree. They might represent text files, binary files, symlinks, etc. But they are "blobs", not files, because the blob only represents the data, not any real-world identity of that data ("content, not files")
- Trees are objects that contain a list of blob ids (and ids of other trees) paired with some properties and a real-world identity (relative to the tree) for each entry. File system directories map to trees, but they aren't the same thing.
- Commits are objects that reference a single tree object (the top level tree of the repository) and some arbitrary number of parent commits.
- Refs aren't standard Git objects, in that they aren't storing versioned data or anything like that; rather, they are simply named pointers, each of which references a particular commit object. Branches are refs. Magic things like HEAD are refs. Again, all the ref needs to do is specify a particular commit object.
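To make the "unique ID" idea concrete, here is a minimal Python sketch (Python is chosen purely for illustration; it is not from the original post). It assumes the usual layout of a Git object: a short header naming the type and the payload size, a NUL byte, then the payload, all fed through SHA1. The helper name git_object_id is made up for this example.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA1 ID for an object of the given type and payload.

    Illustrative helper (not part of Git itself): hashes the header
    "<type> <size>" plus a NUL byte, followed by the payload.
    """
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


# The ID depends only on the bytes, never on a file name or location,
# so two "files" with the same content are one and the same blob.
first = git_object_id("blob", b"hello world\n")
second = git_object_id("blob", b"hello world\n")
print(first)
print(first == second)
```

If Git is installed, the first printed value can be compared against what `git hash-object` reports for a file containing the same bytes.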
That's really not all that much stuff to remember. Then, before thinking about how it relates to files, think through the implications of the object model above. You'll never get how it works with the file system if you don't get it as a standalone model first:
- Blobs aren't versioned. They are standalone pieces of content that are referenced by trees. They represent content state, period.
- A tree's identity is determined by the content of its blobs and their identity relative to that tree
- Consequently, two trees may ultimately differ by only one blob (tree A has blob X under the name Z while tree B instead has blob Y under the name Z), but they are two unique trees that happen to reference some arbitrary number of common trees/blobs (the member trees/blobs that are the same between both will literally be "the same" between both, as they are identified by SHA1)
- Since the state of the tree determines its identity, it's easy for Git to determine where differences between two trees occur.
- Since the tree and the parent commits make up a commit object, it's easy for Git to determine whether a specific state (combination of tree state and revision history) exists in a given history; this is what allows for flexible branching and distributed operations
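The implications above can be sketched in a few lines of Python. This is a rough model, not Git's own code: the function names are invented, the tree layout (mode, name, NUL, raw 20-byte ID, sorted by name) is the standard one as far as I know, and the author identity and timestamp in the commit are fixed placeholder values so that the example is reproducible.

```python
import hashlib


def object_id(obj_type: str, payload: bytes) -> str:
    """SHA1 over "<type> <size>", a NUL byte, and the payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def tree_id(entries: dict) -> str:
    """entries maps a name to (mode, object id in hex)."""
    def sort_key(item):
        name, (mode, _oid) = item
        # Subtrees (mode 40000) sort as if their name ended in "/".
        return (name + "/" if mode == "40000" else name).encode()

    payload = b""
    for name, (mode, oid) in sorted(entries.items(), key=sort_key):
        payload += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
    return object_id("tree", payload)


def commit_id(tree: str, parents: list, message: str) -> str:
    # Placeholder identity and timestamp (assumptions for the example).
    who = "Example Author <author@example.com> 0 +0000"
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    return object_id("commit", ("\n".join(lines) + "\n").encode())


blob_w = object_id("blob", b"shared content\n")
blob_x = object_id("blob", b"version X\n")
blob_y = object_id("blob", b"version Y\n")

# Two trees that differ only in what sits under the name "Z".
tree_a = tree_id({"W": ("100644", blob_w), "Z": ("100644", blob_x)})
tree_b = tree_id({"W": ("100644", blob_w), "Z": ("100644", blob_y)})
print(tree_a != tree_b)   # different state, different identity

# A commit's identity covers both its tree and its parents.
root = commit_id(tree_a, [], "first")
child = commit_id(tree_b, [root], "second")
orphan = commit_id(tree_b, [], "second")
print(child != orphan)    # same tree, different history
```

Both trees reference the very same blob_w ID, which is the sense in which shared members are literally "the same" object.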
Once you get all that in your head, then map it to the filesystem.
- Files map to blobs
- Directories map to trees
- Your "checkout" is the mapping of a particular commit's tree to your file system. That's where your working tree starts.
- Changing a file means introducing a new blob, which introduces a new tree, which cascades up to the top of your working tree. That's how the magic happens.
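Here is a small Python sketch of that cascade, under the same assumptions about object layout as Git's standard format. The nested dictionaries standing in for a working tree, the in-memory store, and the helper names are all hypothetical; nothing here touches a real repository.

```python
import hashlib


def object_id(obj_type: str, payload: bytes) -> str:
    """SHA1 over "<type> <size>", a NUL byte, and the payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def write_tree(directory: dict, store: dict) -> str:
    """Store a nested dict (bytes = file, dict = subdirectory); return its tree ID."""
    entries = []
    for name, value in directory.items():
        if isinstance(value, dict):
            entries.append(("40000", name, write_tree(value, store)))
        else:
            oid = object_id("blob", value)
            store[oid] = ("blob", value)
            entries.append(("100644", name, oid))
    # Subtrees sort as if their name ended in "/".
    entries.sort(key=lambda e: (e[1] + "/" if e[0] == "40000" else e[1]).encode())
    payload = b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
        for mode, name, oid in entries
    )
    oid = object_id("tree", payload)
    store[oid] = ("tree", entries)
    return oid


def subtree(store: dict, root: str, name: str) -> str:
    """Look up the ID recorded under `name` in the tree `root`."""
    return next(oid for _mode, n, oid in store[root][1] if n == name)


store = {}
before = {"docs": {"intro.txt": b"hello\n"}, "src": {"main.pl": b"print 1;\n"}}
after = {"docs": {"intro.txt": b"hello\n"}, "src": {"main.pl": b"print 2;\n"}}

root_before = write_tree(before, store)
root_after = write_tree(after, store)

print(root_before != root_after)                                        # new top-level tree
print(subtree(store, root_before, "src") != subtree(store, root_after, "src"))    # new "src" tree
print(subtree(store, root_before, "docs") == subtree(store, root_after, "docs"))  # "docs" reused
```

Only the changed blob and the trees on the path above it get new IDs; everything else is shared between the two snapshots.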
Study the object model, the logical layer, whatever you want to call it. It's not an implementation detail; it's the model that makes everything possible. You have to understand Git's concept of revision history and whatnot if you're going to make it work for you. Just like you need to learn something new and idiomatic whenever you pick up a new piece of sophisticated software.
Perl 5 now on Git
It's awesome to see that the Perl 5 source code repository has been migrated from Perforce to Git, and is now active at http://perl5.git.perl.org/. Congratulations to all those who worked hard to migrate the entire version control history, all the way back to the beginning with Perl 1.0!
Skimming through the history turns up some fun things:
- The last Perforce commit appears to have been on 16 December 2008.
- Perl 5 is still under very active development! (It seems a lot of people are missing this simple fact, so I don't feel bad stating it.)
- Perl 5.8.0 was released on 18 July 2002, and 5.6.0 on 23 March 2000. Those both seem so recent ...
- Perl 5.000 was released on 17 October 1994.
- Perl 4.0.00 was released 21 March 1991, and the last Perl 4 release, 4.0.36, was released on 4 February 1993. For having an active lifespan of only 4 or so years till Perl 5 became popular, Perl 4 code sure kicked around on servers a lot longer than that.
- Perl 1.0 was announced by Larry Wall on 18 December 1987. He called Perl a "replacement" for awk and sed. That first release included 49 regression tests.
- Some of the patches are from people whose contact information is long gone, rendered in Git commits as e.g. Dan Faigin, Doug Landauer <unknown@longtimeago>.
- The modern Internet hadn't yet completely taken over, as evidenced by email addresses such as isis!aburt and arnold@emoryu2.arpa.
- The first Larry Wall entry with email address larry@wall.org was 28 June 1988, though he continued to use his jpl.nasa.gov after that sometimes too.
- There are some weird things in the commit notices. For example, it's hard to believe the snippet of Perl code in the following change notice wasn't somehow mangled in the conversion process:
commit d23b30860e3e4c1bd7e12ed5a35d1b90e7fa214c
Author: Larry Wall <lwall@scalpel.netlabs.com>
Date: Wed Jan 11 11:01:09 1995 -0800
duplicate DESTROY
In order to fix the duplicate DESTROY bug, I need to remove [the
modified] lines from sv_setsv.
Basically, copying an object shouldn't produce another object without an
explicit blessing. I'm not sure if this will break anything. If Ilya
and anyone else so inclined would apply this patch and see if it breaks
anything related to overloading (or anything else object-oriented), I'd
be much obliged.
By the way, here's a test script for the duplicate DESTROY. You'll note
that it prints DESTROYED twice, once for , and once for . I don't
think an object should be considered an object unless viewed through
a reference. When accessed directly it should behave as a builtin type.
#!./perl
= new main;
= '';
sub new {
my ;
local /tmp/ssh-vaEzm16429/agent.16429 = bless $a;
local = ; # Bogusly makes an object.
/tmp/ssh-vaEzm16429/agent.16429;
}
sub DESTROY {
print "DESTROYED\n";
}
Larry
sv.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)
Yes, it really is that weird. Check it out for yourself.
The Easy Git summary information from eg info has some interesting trivia:
Total commits: 36647
Number of contributors: 926
Number of files: 4439
Number of directories: 657
Biggest file size, in bytes: 4176496 (Changes5.8)
Commits: 31178
And there's a nice new POD document explaining how to work with the Perl repository using Git: perlrepository.
In other news, maintenance release Perl 5.8.9 is out, expected to be the last 5.8.x release. The change log shows most bundled modules have been updated.
Finally, use Perl also notes that Booking.com is donating $50,000 to further Perl development, specifically Perl 5.10 development and maintenance. They're also hosting the new Git master repository. Thanks!
Google Sponsored AFS Hack-A-Thon
Day One:
Woke up an hour early, due to having had a bit of confusion as to the start time (the initial email was a bit optimistic as to what time AFS developers wanted to wake up for the conference).
Met up with Mike Meffie (an AFS Developer of Sine Nomine) and got a shuttle from the hotel to the 'Visitors Lobby'; only to find out that each building has a visitors lobby. One neat thing, Google provides free bikes (beach cruisers) to anyone who needs them. According to the receptionist, any bike that isn't locked down is considered public property at Google. However, it's hard to pedal a bike and hold a briefcase; so off we went hiking several blocks to the correct building. Mike was smart enough to use a backpack, but hiked with me regardless.
The food was quite good, a reasonably healthy breakfast including fresh fruit (very ripe kiwi, and a good assortment). The coffee was decent as well! After much discussion, it was decided that Mike & I would work towards migrating the community CVS repository over to git. Because git sees the world as 'patch sets' instead of just individual file changes, migrating it from a view of the 'Deltas' makes the most sense. The new git repo. (when complete) should match 1:1 to the Delta history. There was a good amount of teasing as to whether Mike and I could make any measurable progress in 2 days. Derrick was able to provide pre-processed delta patches and the bare CVS repo. (though we spent a good amount of the day just transferring things around and determining what machine should be used for development).
Lunch (rather tasty sandwiches) and after lunch snacks were provided; Google definitely doesn't skimp on the catering. Made good progress for one day of combined work, we now have a clear strategy for processing the deltas and initial code that is showing strong promise. Much teasing ensued that Mike & I should not be allowed to eat if we did not have the git repo. ready for use. Dinner was a big group affair of food, beer, and Kerberos.
Day Two:
After arriving with Mike Meffie via the shuttle, we found out that Tom Keiser (also of Sine Nomine) had been left behind! The shuttle driver was kind enough to go pick up Tom (who ended up at a related, but different hotel than the conference recommended) and bring him for questioning (or development, as the case may be). Determined that the major issue in applying the deltas was simply due to inconsistencies in what the 'base' import should consist of... After several rounds of cleanup, all but a few of the deltas (and those were fixed by hand) applied cleanly!
On the food side, Google outdid itself with these cornbread 'pizzas' that were extremely good. Once we started having a few branches to play with, things came together quickly... generating much buzz and excitement (at least, for us). We all split off for dinner, with a few of us escorting Tom to his train then getting some Indian food (on a rather busy day, as it was the 'Festival of Lights').
In Conclusion:
We were able to get a clean specification with consensus for how we want to produce the public git repository. The specifications are even available on the OpenAFS wiki. The tools (found at '/afs/sinenomine.net/public/openafs/projects/git_work/') to produce this repo. are all in a rough working form, with only the 'merge' tool still needing some development effort. All of these efforts were definitely facilitated by Google providing a comfortable work environment, a solid internet connection and good food to keep us fueled through it all.
Things to do now:
- Clean up and document the existing tools
- Improve the merge process to simplify folding the branches
- Actually produce the Git repository
- Validate the consistency of the Git repository against the CVS repository
- Determine how tags are to be ported over and apply them
- Publish repo. publicly
Know your tools under the hood
Git supports many workflows; one common model that we use here at End Point is having a shared central bare repository that all developers clone from. When changes are made, the developer pushes the commit to the central repository, and other developers see the relevant changes on subsequent pulls.
We ran into an issue today where after a commit/push cycle, suddenly pulls from the shared repository were broken for downstream developers. It turns out that one of the commits had been created by root and pushed to the shared repository. This worked fine to push, as root had read-write privileges to the filesystem, however it meant that the loose objects which the commit created were in turn owned by root as well; fs permissions on the loose objects and the updated refs/heads/branch prevented the read of the appropriate files, and hence broke the pull behavior downstream.
Trying to debug this purely on the reported messages from the tool itself would have resulted in more downtime at a critical time in the client's release cycle.
There are a couple of morals here:
- Don't do anything as root that doesn't need root privileges. :-)
- Understanding how git works at a low level enabled a speedy detection of the (*ahem*) root cause of the problem and led to quick correction of the underlying permissions/ownership issues.
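For anyone wanting to check for this kind of problem, the following Python sketch walks a repository directory and reports entries that have an unexpected owner or lack the group-read bit. It is only one plausible way to look for the issue described above; the repository path, the set of developer uids, and the function names are all assumptions made for the example.

```python
import os
import stat


def is_problem(owner_uid: int, mode: int, allowed_uids: set) -> bool:
    """True when an entry has an unexpected owner or lacks the group-read bit."""
    return owner_uid not in allowed_uids or not (mode & stat.S_IRGRP)


def find_permission_problems(git_dir: str, allowed_uids: set) -> list:
    """Walk git_dir and return (path, uid, octal mode) for each suspicious entry."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(git_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            if is_problem(info.st_uid, info.st_mode, allowed_uids):
                mode = oct(stat.S_IMODE(info.st_mode))
                problems.append((path, info.st_uid, mode))
    return sorted(problems)


if __name__ == "__main__":
    # Hypothetical repository path and developer uids; substitute your own.
    repo = "/srv/git/project.git"
    for path, uid, mode in find_permission_problems(repo, {1001, 1002}):
        print(f"{path}: owner uid {uid}, mode {mode}")
```

In a case like ours, root-owned loose objects under objects/ and the updated file under refs/heads/ would be the entries to look for in the report.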
Stepping into version control
It's no little secret that we here at End Point love and encourage the use of version control systems to generally make life easier both on ourselves as well as our clients. While a full-fledged development environment is ideal for maintaining/developing new client code, not everyone has the time to be able to implement these quickly.
A situation we've sometimes found is clients editing/updating production data directly. This can be through a variety of means: direct server access, scp/sftp, or web-based editing tools which save directly to the file system.
I recently implemented a script for a client who uses a web-based tool for managing their content in order to provide transparent version control. While they are still making changes to their site directly, we now have the ability to roll back any changes on a file-by-file basis as they are created, modified, or deleted.
I wanted something that was: 1) fast, 2) useful, and 3) stayed out of the user's way. I turned naturally to git.
In the user's account, I executed git init to create a new git repository in their home directory. I then git added the relevant parts that we definitely wanted under version control. This included all of the relevant static content, the app server files, and associated configuration: basically anything we might want to track changes to.
Finally, I determined the list of directories in which we would like to automatically detect any newly created files. These corresponded to the usual places where new content was apt to show up. I codified the automatic update of the git repo in a script called git_heartbeat, which is called periodically from cron.
The basic listing for git_heartbeat:
#!/bin/bash

# automatically add any new files in these space-separated directories
AUTO_ADD_DIRS="catalogs/acme/pages htdocs"

# make sure we're in the proper git root directory; bail out if it is missing
cd /home/acme || exit 1

# actually add any newly created files in $AUTO_ADD_DIRS
# (left unquoted on purpose so it splits into separate directories)
find $AUTO_ADD_DIRS -print0 | xargs -0 git add

DATE=$(date)
git commit -q -a -m "Acme Co git heartbeat - $DATE" > /dev/null
A couple notes:
- git commit -a takes care of the modification/deletion of any already tracked files. The git add ensures that any newly created files are currently in the index and will be included with the commit.
- If no files have been added, modified, or deleted, no checkpoint is created. This ensures that every commit in the log is meaningful and corresponds to an actual change to the site itself.
- Compared to other VCSs which keep metadata in each versioned subdirectory (such as Subversion), this approach stays out of the user's way; we don't have to worry about the user accidentally overwriting/deleting data in their upload directories and thus corrupting the repository.
- This approach is fast; it runs near instantaneously for thousands of files, so we could even push the cron interval to every minute if desired. For our purposes, this system works great as is.
- Once the git tools are installed, there is no need to set up a central repository; git repos are very cheap to create/use and for a use case such as this, require little to no maintenance beyond the initial setup.
Areas of improvement/known issues:
- This script could definitely be improved by reporting which files were added/modified/deleted. However, git's own tools come in quite useful here; for instance, git log --stat will show the files which each heartbeat commit affected.
- Since this is set up as a general cron job running every hour (the period is configurable, obviously), it does preclude extended stagings for non-heartbeat commits; basically, anything which takes longer than the heartbeat interval will be inadvertently committed.
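As one possible take on the first improvement, here is a Python sketch that turns the output of git status --porcelain into a short summary suitable for the heartbeat commit message. It is not part of the script above: the function name and sample paths are invented, it assumes the two-character status code format of porcelain output, and it counts untracked files ("??") as added on the assumption that the heartbeat is about to git add them.

```python
from collections import Counter

# Map porcelain status letters to human-readable labels.
LABELS = {"A": "added", "M": "modified", "D": "deleted", "?": "added"}


def summarize_status(porcelain: str) -> str:
    """Summarize `git status --porcelain` output as counts per kind of change."""
    counts = Counter()
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # not a "XY path" status line
        index_state, worktree_state = line[0], line[1]
        # Prefer the staged (index) column; fall back to the working tree column.
        code = index_state if index_state != " " else worktree_state
        counts[LABELS.get(code, "other")] += 1
    kinds = ("added", "modified", "deleted", "other")
    parts = [f"{counts[kind]} {kind}" for kind in kinds if counts[kind]]
    return ", ".join(parts) or "no changes"


# Made-up sample of what the status output might look like.
sample = (
    "A  htdocs/new.html\n"
    " M catalogs/acme/pages/index.html\n"
    " D htdocs/old.html\n"
    "?? htdocs/draft.html\n"
)
print("Acme Co git heartbeat - " + summarize_status(sample))
```

In a real run the sample string would be replaced by the captured output of git status --porcelain, taken just before the commit.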






