Version Control Blog Archive
I was charged with cleaning up a particularly large, sprawling set of files comprising a git repository. One whole "wing" of that structure consisted of files that needed to stay around in production: various PDFs, PowerPoint presentations, and Windows EXEs that were only ever needed by the customer's partners, and downloaded from the live site. Our developer camps never wanted to have local copies of these files, which amounted to over 280 MB. Since we have dozens of camps shadowing this repository, all on the same server, removing them will save a few GB at least.
I should point out that our preferred deployment is to have production, QA, and development all be working clones of a central repository. Yes, we even push from production, especially when clients are the ones making changes there. (Gasp!)
So: the aim here is to make the stuff vanish from all the other clones (when they are updated), but to preserve the stuff in one particular clone (production). Also, we want to ensure that no future updates in that "wing" are tracked.
# From the "production" clone:
$ cd stuff
$ git rm -r --cached .
$ cd ..
$ echo "stuff" >> .gitignore
$ git commit ...
$ git push ...

Now, everything that was in the "stuff" tree remains for "production", but every other clone will remove these files when it updates from the central repository:
$ git pull origin master
...
 delete mode 100644 stuff/aaa
 delete mode 100644 stuff/aab
...
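For the curious, the whole dance can be verified end to end in a throwaway setup. This is a minimal sketch; all paths and file names here are invented for the demo:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A bare "central" repository, plus a "production" clone with a stuff/ tree.
git init -q --bare central.git
git clone -q central.git production
cd production
git config user.email demo@example.com
git config user.name Demo
mkdir stuff
echo binary-ish > stuff/aaa
echo binary-ish > stuff/aab
git add . && git commit -qm "initial import"
git push -q origin HEAD
cd ..

# A developer camp clone, made while stuff/ was still tracked.
git clone -q central.git camp

# In production: untrack stuff/ but keep the files on disk.
cd production/stuff
git rm -rq --cached .
cd ..
echo "stuff" >> .gitignore
git add .gitignore
git commit -qm "stop tracking stuff/"
git push -q origin HEAD
cd ..

# The camp clone drops the files on its next pull; production keeps them.
git -C camp pull -q
test -f production/stuff/aaa && echo "production still has stuff/aaa"
test ! -e camp/stuff/aaa && echo "camp no longer has stuff/aaa"
```

Note that after the `git rm --cached`, the now-untracked files in production are covered by the `.gitignore` entry, so `git status` there stays clean.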
DevCamps is a system for managing development, integration, staging, and production environments. It was developed by End Point for, and with the help of, some of our ecommerce clients. It grew over the space of several years, and really started to become its own standalone project in 2007.
Camps are a behind-the-scenes workhorse of our web application development at End Point, and don't always get much attention because everyone's too busy using camps to get work done! But this summer a few things are happening.
In early July we unveiled a redesign of the devcamps.org website that features a more whimsical look, a better explanation of what camps are all about, and endorsements by business and developer users. Marko Bijelic of Hipinspire did the design. Take a look:
In less than two weeks, on August 17, I'm going to be giving a talk on camps at YAPC::EU in Riga, Latvia. YAPC::EU is Europe's annual Perl conference, and will be a nice place to talk about camps.
Many Perl developers are doing web applications, which is camps' main focus, so that's reason enough. But camps also started around the Interchange application server, which is written in Perl. And the camp system is currently implemented in Perl as well.
We've set up a lot of camp systems for Perl web applications. So even though we've also set up camp systems for web applications using Ruby on Rails, Sinatra, Django, and PHP, it's a nice homecoming to talk about camps to Perl enthusiasts.
As a software engineer I'm naturally inclined to be at least somewhat introverted :-). Combine that with the fact that End Point is PhysicalWaterCooler-challenged, and you have a recipe for two things to occur naturally: 1) talking to oneself (but then, who doesn't do that really? No, really.), and 2) finding friends in unusual places. Feeling a bit socially lacking after a personal residence move, I was determined to set out to find new friends. And find one I did: his name is "--interactive", or Mr. git add --interactive.
"How did we meet?" you ask. While working on a rather "long winded" project I started to notice myself sprinkling TODOs throughout the source code. Not a bad habit really (presuming they do actually eventually get fixed), but unfortunately the end result is having a lot of changed files in git that you don't really need to commit, but at the same time don't really need to see every time you want to review code. I'm fairly anal about reviewing code, so I was generally in the habit of running a `git status` followed by a `git diff` before each commit, and the TODO noise made that tiresome. `git add --interactive` (and in particular its patch mode) let me step through my changes hunk by hunk and stage only the ones I actually wanted to commit.
"But what about your other old friends?" you then ask. Well, as it turns out, my spending so much time with interactive add made `git stash` feel a bit lonely, and it dawned on me that tracking those TODOs in the working tree at all may be a bit silly. What could a guy do? Perhaps these two friends might actually like to party together? As it turns out they had already been introduced and do like to party together (not sure why they couldn't have just invited me before, though it might have something to do with my past friendship with SVN and RCS).

Either way, to get those unsightly TODOs out from under my immediate purview once and for all, while keeping other changes I still needed in the index, I found `git stash save --patch --no-keep-index "TODO Tracking"`. "save" instructs git stash to save a new stash, "--patch" tosses it into an interactive mode similar to the one described above for add, "--no-keep-index" instructs stash not to keep in the working tree the changes that are added to the created stash, and "TODO Tracking" is just a message to make it easy for a human to understand what the stash contains (I made this one up for my specific immediate purpose).

This leaves my working tree and index clean for more pressing work, and I know that when I have the time or need to restore those past TODOs I can, so that they may be worked on as well. Note that I've not really used this technique much (read: I've just done it now for the first time), so we'll see if it really is that useful, but the interactive patching I've used, and it is definitely worth it.
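Because `--patch` drops you into an interactive prompt that's hard to show in a transcript, here is a minimal non-interactive sketch of the same idea, using `git stash push` (the newer spelling of `save`) with a pathspec to select whole files instead of individual hunks. All file names are invented:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email demo@example.com
git config user.name Demo

printf 'real work\n' > app.pl
printf 'notes\n' > todo.pl
git add . && git commit -qm "initial"

echo 'bug fix' >> app.pl                  # a change to keep working on
echo '# TODO: clean this up' >> todo.pl   # noise to park for later

# Park only the TODO edits in a labeled stash. (--patch would let you
# pick individual hunks interactively; a pathspec grabs whole files.)
git stash push -q -m "TODO Tracking" -- todo.pl

git stash list          # the stash carries its human-readable label
git diff --name-only    # only app.pl is still modified
```

When the time comes to actually fix those TODOs, `git stash pop` brings them back into the working tree.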
As a further sidebar, I was discussing multiple commit indexes in a Git repo with someone in the #yui channel, and as soon as I found the above it occurred to me that using multiple stashes that you pop could work in effect the same way, though I don't know if there is a way to add patches to an already created stash. That might make a neat feature to investigate and/or request from the Git core.
Just so you aren't too concerned, there is still a place in my heart for `git add` and `git status` even if I don't see them as frequently as I once did.
Over the weekend, I discovered an open source tool for version control visualization, Gource. I decided to put together a few videos to showcase End Point's involvement in several open source projects.
Here's a quick legend to help understand the videos below:
One of the articles that references Gource suggests that the videos can be used to visualize and analyze the community involvement of a project (open source or not). One might also be able to qualitatively analyze the stability of project file architecture from a video, but this won't reveal anything definitive about the code stability since external factors can influence file structure. For example, since I am intimately familiar with the progress of Spree, I can identify when Spree transitioned to Rails 3 in the video, which required reorganization of the Spree core functionality (read more about this here and here).
In this article, I wanted to highlight a few open source projects where End Point has had various levels of involvement. We've contributed to Interchange since 2000. We've been less involved in Spree lately, but had more of a presence in early 2009. In the smaller projects Bucardo and pgsi, End Point employees have been the primary contributors, in addition to a few external contributors. Open source is important to End Point, and it's great to see our presence demonstrated in these cute videos.
I recently ran into an issue where I had a source file of unknown version which had been substantially modified from its original form, and I wanted to find the version of the originating software it had originally come from, to compare the changes. This file could have come from any of the 100 tagged releases in the repository, so obviously a hand-review approach was out of the question. While there were certainly clues in the source file (e.g., copyright dates to narrow down the range of commits to review), I thought up and used the following technique.
Here are our considerations:
- We know that the number of changes to the original file is likely small compared to the size of the file overall.
- Since we're trying to uncover a likely match for the purposes of reviewing, exactness is not required; i.e., if there are lines in common with future releases, we're interested in the changes, so a revision with the fewest number of changes is preferred over finding the *exact* version of the file that this was originally based on.
The basic thought, then, is that we want to take the content of the unversioned file (i.e., the file that was changed) and find the revision of the corresponding file in the repository with the least number of changes, which we'll measure as the count of the lines in the source code diff. This struck me as similar to the copy detection that git does, insofar as it can detect content that is similar to some source content with a certain amount of tolerance for changes from the base. The difference in this case is that we're comparing content across a number of refs rather than across all of the blobs in a single ref. This recipe distilled down to the following bash command:
for ref in $(git tag); do
    echo -n "$ref "
    diff -w <(git show $ref:/path/to/versioned/file 2>/dev/null) modified_file | wc -l
done | sort -k2 -n
The result of running this command is a list of the tags in the repository, ordered by how similar the corresponding file is to the target content (most similar first). A few comments:
- We iterate through all tags in the project; while there could indeed be changes to the relevant file in intermediate versions, due to the way the release worked it's likely the original file was based on a released (aka tagged) version.
- We're using diff's -w option, as the content may have changed spaces to tabs or vice versa, depending on the editor/editing habits of the original user. This helps us ensure that the changes that we're focusing on are the ones that change something substantial.
- We're doing a numeric sort so the lines with the least number of changes show up at the top.
- For the specific case I used this technique with, there were a number of revisions that had the least number of changed lines. Upon reviewing this smaller set of revisions (using the git diff rev1 rev2 -- path/to/content syntax), it turns out that the file in question had remained unchanged in each of these revisions, so any one of them was useful for my purposes.
- The flexibility in the version detection works in this case because this was an isolated part of the system that did not have any changes or dependencies. If there had been important changes to the system as a whole independent of the changes to this file (but which had an effect on the operation of this specific part), we would need a more exact method of identifying the file.
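The recipe can be reproduced end to end on a toy repository. This is a sketch with invented file contents and tag names; the "mystery" file is built from the v2.0 release plus one local change, so v2.0 should rank first:

```shell
set -e
tmp=$(mktemp -d)
mkdir "$tmp/repo" && cd "$tmp/repo"
git init -q .
git config user.email demo@example.com
git config user.name Demo

# Three tagged "releases" of a made-up source file, lib.c.
printf 'alpha\nbeta\ngamma\n'    > lib.c
git add lib.c && git commit -qm "release 1" && git tag v1.0
printf 'alpha\nbeta2\ngamma\n'   > lib.c
git commit -qam "release 2" && git tag v2.0
printf 'alpha2\nbeta2\ngamma2\n' > lib.c
git commit -qam "release 3" && git tag v3.0

# The mystery file: v2.0's content plus one local customization.
printf 'alpha\nbeta2\ngamma\nlocal hack\n' > "$tmp/modified_file"

# Rank tags by diff size, smallest (most similar) first.
ranking=$(for ref in $(git tag); do
    echo -n "$ref "
    diff -w <(git show "$ref:lib.c" 2>/dev/null) "$tmp/modified_file" | wc -l
done | sort -k2 -n)
echo "$ranking"
best=$(echo "$ranking" | head -n1 | awk '{print $1}')
echo "closest tagged release: $best"
```

The process substitution (`<(...)`) means this needs bash rather than plain sh, same as the original one-liner.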
Around here I have a reputation for finding the tiniest pothole on the path to git happiness, and falling headlong into it while strapped to a bomb ...
But at least I'm dedicated to learning something each time. This time it involved branches, and how git knows whether you have merged that branch into your current HEAD.
My initial workflow looked like this:
$ git checkout -b MY_BRANCH
(some editing)
$ git commit
$ git push origin MY_BRANCH
(later)
$ git checkout origin/master
$ git merge --no-commit origin/MY_BRANCH
(some testing and inspection)
$ git commit
$ git rebase -i origin/master
This last step was the trip-and-fall, although it didn't hurt me so much as launch me off my path into the weeds for a while. Once I did the "git rebase", git no longer knows that MY_BRANCH has been successfully merged into HEAD. So later, when I did this:
$ git branch -d MY_BRANCH
error: the branch 'MY_BRANCH' is not fully merged.
As I now understand it, the commits on MY_BRANCH are no longer ancestors of my current HEAD (the rebase replayed their changes as new commits with new IDs), so git can't tell the two are related and refuses to delete the branch unless you supply -D. A relatively harmless situation, but it set off all sorts of alarms for me, as I thought I had messed up the merge somehow.
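The trap is easy to reproduce in a scratch repository. A sketch, with invented branch and file names; `GIT_SEQUENCE_EDITOR=:` accepts the rebase todo list unedited so the interactive rebase runs non-interactively:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email demo@example.com
git config user.name Demo
git commit -q --allow-empty -m "base"
git branch -M master

git checkout -q -b MY_BRANCH
echo feature > f.txt && git add f.txt && git commit -qm "feature work"

# Diverge master so the merge is a real (non-fast-forward) merge.
git checkout -q master
echo other > g.txt && git add g.txt && git commit -qm "master work"
git merge -q --no-commit MY_BRANCH
git commit -qm "merge MY_BRANCH"

# At this point the branch would delete cleanly; the rebase is what
# breaks it, by rewriting the merged commits with new IDs.
GIT_SEQUENCE_EDITOR=: git rebase -i HEAD~2 >/dev/null 2>&1

git branch -d MY_BRANCH && echo "deleted" || echo "refused: not fully merged"
```

After the rebase, `git branch -d` fails even though every change from MY_BRANCH is present in HEAD; only `-D` (or deleting before rebasing) gets rid of the branch.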
This is the first in an occasional series of articles about configuring PostgreSQL. The main way to do this, of course, is the postgresql.conf file, which is read by the Postgres daemon on startup and contains a large number of parameters that affect the database's performance and behavior. Later posts will address specific settings inside this file, but before we do that, there are some global best practices to address.
The single most important thing you can do is to put your postgresql.conf file into version control. I care not which one you use, but go do it right now. If you don't already have a version control system on your database box, git is a good choice to use. Barring that, RCS. Doing so is extremely easy. Just change to the directory postgresql.conf is in. The process for git:
- Install git if not there already (e.g. "sudo yum install git")
- Run: git init
- Run: git add postgresql.conf pg_hba.conf
- Run: git commit -a -m "Initial commit"
The process for RCS:
- Install RCS if not there already (e.g. "sudo apt-get install rcs")
- Run: mkdir RCS
- Run: ci -l postgresql.conf pg_hba.conf
Note that we checked in pg_hba.conf as well. You want to check in any file in that directory you might possibly change. For most people, that means only postgresql.conf and pg_hba.conf, but if you use other files (e.g. pg_ident.conf), check those in too.
Ideally you want the version checked in to be the "raw" configuration files that came with the system - in other words, before you started messing with them. Then you make your initial changes and check them in. From then on, of course, you commit every time you change the file.
At a bare minimum, the version control system should be telling you:
- Exactly what was changed
- When it was changed
- Who made the change
- Why it was changed
The first two items happen automatically in all version control systems, so you don't have to worry about those. The third item, "who made the change", must be entered manually if on a shared account (e.g. postgres) and using RCS. If you are using git, you can simply set the environment variables GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL. For shared accounts, I have a custom bashrc file called "gregbashrc" that is called when I log in that sets those ENVs as well as a host of other items.
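As a sketch of how the author override works on a shared account (the name, email, commit message, and ticket number below are invented for illustration):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email postgres@db.example.com
git config user.name "postgres (shared account)"
echo "shared_buffers = 400MB" > postgresql.conf
git add postgresql.conf && git commit -qm "Initial commit"

# Per-person environment variables override the shared-account identity
# for the author fields of any commits made in this session.
export GIT_AUTHOR_NAME="Greg Sabino Mullane"
export GIT_AUTHOR_EMAIL="greg@example.com"

echo "work_mem = 2MB" >> postgresql.conf
git commit -qam "Raise work_mem to 2MB for reporting queries, per ticket #1234"

git log -1 --format='%an <%ae>'   # prints: Greg Sabino Mullane <greg@example.com>
```

The committer fields still reflect the shared account, so `git log --format=full` shows both who authored the change and which account recorded it.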
The fourth item, "why it was changed", is generally the content of the commit message. Never leave this blank, and be as descriptive and verbose as possible - someone later on will be grateful you did. It's okay to be repetitive and state the obvious. If this was done as part of a specific ticket number or project name, mention that as well.
It's important that the changes you make to the postgresql.conf file (or other files) actually work, and don't leave Postgres unable to parse the file or handle a changed setting. Never make a change and immediately restart Postgres: if it doesn't work, you've got a broken config file, no Postgres daemon, and most likely unhappy applications and/or users. At the very least, do a reload first (e.g. /etc/init.d/postgresql reload, or just kill -HUP the PID). Then check the logs and see if Postgres was happy with your changes. If you are lucky, the change won't even require a restart (some settings do, some do not).
A better way to test your changes is to make it on an identical test box. That way, all the wrinkles are ironed out before you make the changes on production and attempt a reload or restart.
Another way I've found handy is to simply start a new Postgres daemon. Sounds like a lot of work, but it's pretty automatic once you've done it a few times. The process generally looks like this, assuming your production postgresql.conf is in the "data" directory, and your changes are in data/postgresql.conf.new:
- cd ..
- initdb testdata
- cp -f data/postgresql.conf.new testdata/
- echo port=5555 >> testdata/postgresql.conf
- echo max_connections=10 >> testdata/postgresql.conf
The max_connections is not strictly necessary, of course, but unless you are changing something that relies on that setting, it's nicer to keep it (and the resulting memory) low.
- pg_ctl -D testdata -l test.log start
- cat test.log
- pg_ctl -D testdata stop
- rm -fr testdata (or just keep it around for next time)
The test.log file will show you any problems that might have popped up with your changes, and once it works you can be fairly confident it will work for the "main" daemon as well, so to finish up:
- cd data
- mv -f postgresql.conf.new postgresql.conf
- git commit postgresql.conf -m "Adjusted random_page_cost to 2, per bug #4151"
- kill -HUP `head -1 postmaster.pid`
- psql -c 'show random_page_cost'
Keeping it Clean
The postgresql.conf file is fairly long, and can be confusing to read with its mixture of comments, in-line comments, strange wrapping, and the commented out vs. not-commented-out variables. Hence, I recommend this system:
- Put a big notice at the top of the file asking people to make changes to the bottom
- Put all important variables at the bottom, sans comments, one per line
- Line things up
- Put into logical groups.
This avoids having to hunt for settings, prevents the gotcha of when a setting is changed twice in the file, and makes things much easier to read visually. Here's what I put at the top of the postgresql.conf:
##
## PLEASE MAKE ALL CHANGES TO THE BOTTOM OF THIS FILE!
##
I then add a good 20+ empty lines, so anyone viewing the file is forced to focus on the all-caps message above.
The next step is to put all the settings you care about at the bottom of the file. Which ones should you care about? Any setting you have changed (obviously), any setting that you *might* change in the future, and any that you may not have changed, but someone may want to look up. In practice, this means a list of about 25 items. After aligning all the values to the right and breaking things into logical groups, here's what the bottom of the postgresql.conf looks like:
## Connecting
port                     = 5432
listen_addresses         = '*'
max_connections          = 100

## Memory
shared_buffers           = 400MB
work_mem                 = 1MB
maintenance_work_mem     = 1GB

## Disk
fsync                    = on
synchronous_commit       = on
full_page_writes         = on
checkpoint_segments      = 100

## PITR
archive_mode             = off
archive_command          = ''
archive_timeout          = 0

## Planner
effective_cache_size     = 18GB
random_page_cost         = 2

## Logging
log_destination          = 'stderr'
logging_collector        = on
log_filename             = 'postgres-%Y-%m-%d.log'
log_truncate_on_rotation = off
log_rotation_age         = 1d
log_rotation_size        = 0
log_min_duration_statement = 200
log_statement            = 'ddl'
log_line_prefix          = '%t %u@%d %p'

## Autovacuum
autovacuum               = on
autovacuum_vacuum_scale_factor  = 0.1
autovacuum_analyze_scale_factor = 0.3
Because everything is in one place, at the bottom of the file, and not commented out, it's very easy to see what is going on. The groups above are somewhat arbitrary, and you can leave them out or create your own, but at least keep things grouped together as much as possible. When in doubt, use the same order as they appear in the original postgresql.conf.
Sometimes people change important settings in a group, such as for bulk loading of data. In this case, I usually make a separate group for it at the very bottom. This makes it easy to switch back and forth, and helps to prevent people from (for example) forgetting to switch fsync back on:
## Bulk loading only - leave 'on' for everyday use!
autovacuum       = off
fsync            = off
full_page_writes = off
Ownership and permissions
All the conf files should be owned by the postgres user, and the configuration files should be world-readable if possible (indeed, on Debian-based systems postgresql.conf must be readable for psql to work!). Be careful with SELinux as well: it can get ornery if you do things like use symlinks.
One final note: make sure you are backing up your changes as well. PITR and pg_dump won't save your postgresql.conf! If you are checking things in to a remote version control system, some of the pressure is off, but you should still have a policy for backing up all your conf files explicitly. Even if using a local git repo, tarring up and copying the whole thing is usually a quick and cheap action.