Thursday, July 31, 2008

The Holy War of Tool Choice

Tools are vital to any open-source project. Because of that, tool choice is critical. You really don't want to be switching tools frequently, because doing so is more work than you'll ever be able to put into the project. By the same token, you don't want to pick tools no one in the project wants to use. This leads to the problem: tool choice is often a "Holy War", so to speak, because everyone has a pet tool and thinks everything else is inferior.

Let's use Pidgin as a practical example. Prior to our most recent legal problems (which prompted our rename), we had been evaluating new version control software for some time. At that time, we were using CVS. Almost everyone who has developed on a project as large as Pidgin will agree that CVS sucks. You can't rename files unless you rename them in the repository itself, branching and merging are a pain, etc. I wasn't involved with the project at this point in time, but I do know that every distributed VCS in existence at the time was evaluated. After the evaluation, Monotone was the preferred choice. Keep in mind that this was over two years ago, more likely closer to three.

We continued to use CVS, and later made an ill-advised switch to Subversion. We stuck with these tools mainly because of our complete reliance on SourceForge, which offered only CVS and SVN. Finally, when it came time to rename our project to Pidgin, we had a donated virtual server on which we could run (essentially) whatever we wanted. Ethan Blanton used tailor to convert our SVN repository into a monotone database.

We also set up Trac for bug, patch, and feature request tracking. The built-in wiki was a bonus. We ended up installing a number of plugins for Trac to customize it and provide some additional features, such as restricting some ticket fields from those who we feel should not be able to modify them.

All was well initially. We had a few complaints because we chose to use monotone instead of sticking with SVN, but these were users who had no intention of contributing patches, so we were (mostly) happy to see them angrily go back to using our "blessed" releases. We did, however, have a lot of confusion initially about what monotone was and how it related to Pidgin. A few hundred explanations later and those questions faded too.

Trac proved to have some scalability and performance issues for us on numerous occasions. Eventually we were able to narrow the biggest issues down to their causes and implement fixes. Some of this involved a LOT of poking and prodding, as well as some assistance from Trac developers, whom we thank profusely for their time in resolving the issues. I am quite surprised, overall, that we haven't had any real complaints about Trac itself. For a while it was frequently unusable, and of course we saw a number of complaints related to that, but since the issues have been resolved we've seen no real complaints.

As for monotone, fast-forward a year from the announcement of the rename. We consistently get complaints about having to download a "huge" database just to get the latest development source for Pidgin. We do also get a few complaints that it's hard to use, but almost all of these are solved simply by pointing people at the UsingPidginMonotone wiki page. As for the huge database, we do acknowledge that it is inconvenient for some people, but "shallow-pull" support is coming to monotone. A shallow pull would be similar to an svn checkout.

I don't mind the database, myself. I have 11 working copies (checkouts) from my single pidgin database (8 distinct branches, plus duplicates of the last three branches I worked on or tested with). Each clean checkout (that is, a checkout prior to running autogen.sh and building) is approximately 61 MB. If this were SVN, each working copy would be approximately 122 MB, because svn keeps a pristine copy of every file to facilitate 'svn diff' and 'svn revert' without needing to contact the server the working copy was pulled from. Now, let's add that up. For SVN, I would have 11 times 122 MB, or 1342 MB, just in working copies. For monotone, I have 11 times 61 MB for the working copies (671 MB), plus 229 MB for the database, for a grand total of 900 MB. For me, this is an excellent bargain, as I save 442 MB of disk space thanks to the monotone model.

For another compelling comparison that's sure to ruffle a few feathers, let's compare to git. If I clone the git mirror of our monotone repository, I find a checkout size of 148 MB after git-repack. Running git-gc increased the size by 2 MB, but I'll stick with the initial checkout size for fairness. If I multiply this by my 11 checkouts, I will have 1628 MB. This is even more compelling for me, as I now save 728 MB of disk space with monotone.
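For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The sizes are the ones quoted above; the function and variable names are made up for illustration, and the SVN figure uses the same rough approximation as above (each working copy is doubled by its pristine copies).

```python
# All sizes are in MB and come from the figures quoted in the post.
WORKING_COPIES = 11
CHECKOUT_MB = 61         # one clean monotone checkout
MTN_DATABASE_MB = 229    # the single shared monotone database
GIT_CLONE_MB = 148       # one full clone of the git mirror after git-repack


def svn_total(copies, checkout_mb):
    # Approximation: svn stores a pristine copy of every file,
    # so each working copy takes roughly twice the checkout size.
    return copies * checkout_mb * 2


def monotone_total(copies, checkout_mb, database_mb):
    # One database shared by every working copy.
    return copies * checkout_mb + database_mb


def git_total_separate_clones(copies, clone_mb):
    # Assumes one full clone per working copy, with nothing shared.
    return copies * clone_mb


svn = svn_total(WORKING_COPIES, CHECKOUT_MB)
mtn = monotone_total(WORKING_COPIES, CHECKOUT_MB, MTN_DATABASE_MB)
git = git_total_separate_clones(WORKING_COPIES, GIT_CLONE_MB)

print("svn total:", svn, "MB")
print("monotone total:", mtn, "MB")
print("git total:", git, "MB")
print("monotone saves", svn - mtn, "MB over svn")
print("monotone saves", git - mtn, "MB over git")
```

Changing WORKING_COPIES shows how the comparison shifts: the shared database is a fixed cost, so it matters less the more checkouts you keep around.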

Richard Laager also took the time to point out an interesting but as yet unexploited feature of monotone: the use of certs allows us to do a number of useful things while being backed by the cryptographic signing of certs. For one, we could implement something similar to the kernel's Signed-Off-By markers. We could also have an automated test suite run compiles of each revision and add a cert indicating whether or not the test succeeded. We could also include a test suite using check (we do already have some tests) and use more certs to indicate success or failure. The extensibility of monotone using lua and the certificates is truly mind-boggling.

I think I've picked on version control enough here, but before I move on, I'd like to say that I don't specifically hate any one version control system. Ok, maybe I hate CVS and Subversion, but I'm certainly not alone in that. I don't, however, hate git, bzr, hg, etc. In fact, I think hg looks quite promising--the Adium folks have decided to switch to it, and are currently in the process of trying to convert from SVN. However, given that for us monotone was the best choice when we made our decision, and given the extensibility of monotone, I think it would be foolish of us to choose another tool at this point. Perhaps in a few more years when all the DVCSes have had time to mature further it will make sense to revisit this decision.

Another tool choice involves communication between developers and users. Specifically, forums. We have elected not to have forums for Pidgin, although as a legacy of our continued involvement with SourceForge for download hosting we are stuck with the forums they provided us. We are quite frequently belittled for our choice to forgo widely visible forums. At least four of us that I am aware of have a strong distaste for forums. I find them to be worse than the ticket system for attracting duplicate questions, since most users won't search beforehand. All the useless extra features, such as emoticons, formatting, etc., are a waste, as well.

It's also been my experience with forums that one or two people will start answering others' questions with misinformation that will sound legitimate to those receiving the bad answers. I've seen this numerous times, but it comes to mind so quickly because I've seen it quite recently on Ubuntu's Launchpad "Questions" forum or tracker or whatever you call it for Pidgin.

Instead of forums, we have chosen to use mailing lists. We receive complaints about this too. These complaints run the gamut from "mailing lists are hard to use" to "but you can't search the mailing list!" In reality, the mailing lists are searchable--Google indexes our mailing lists, as do a number of other search engines. Perhaps we could provide a nice search box or something that does the necessary Google magic (make the search query "terms site:pidgin.im") to make it a bit friendlier, but searchability exists. Not that it matters much in my experience, but hey, I'm just a developer; what do I know?
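To give a rough idea of how little that "magic" would take, here is a small Python sketch that builds such a query. The function name and the example search terms are hypothetical, and the Google search URL is my assumption about where a search box would send people; nothing like this exists on our site today.

```python
from urllib.parse import urlencode

# Assumption: Google's public search page is the target of the search box.
SEARCH_ENDPOINT = "https://www.google.com/search"


def site_search_url(terms, site="pidgin.im"):
    """Build a search URL limited to one site with the "site:" operator."""
    query = terms + " site:" + site
    # urlencode takes care of escaping spaces and the colon.
    params = urlencode({"q": query})
    return SEARCH_ENDPOINT + "?" + params


# Hypothetical example terms a user might type into the box.
print(site_search_url("proxy settings"))
```

A search box on the site would only need to send the user's terms through something like this and redirect to the result.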

I'd also argue that mailing lists are easier to use than forums, as all you have to do is, y'know, send an e-mail. With a forum, you have to register, log in, find the right forum if there is more than one, then post your question, and check back for answers. With the mailing list, those genuinely interested in providing assistance will helpfully use the "Reply All" feature of their mail client to provide the answer both to the user requesting help and the mailing list's archive, thus providing an answer that is both helpful and searchable.

Even so, all the choices made during the evolution of a project will result in people trying to start a crusade, holy war, revolt, ..., over the chosen tool. Unfortunately, it's a fact of life. The best we can do is take it in stride, justify our choices, and move on.

Note: It's come to my attention that I had missed the ability to share a git database across multiple working copies. In that scenario, the total size of the database and 11 working copies is slightly under 750 MB, and thus a space savings in the neighborhood of 150 MB over monotone. It had been my understanding that I needed a copy of the database per working copy. I stand corrected. I don't use git on a daily basis, as the projects I work with currently use CVS, SVN, or monotone, so I am bound to miss finer details of git here and there. There are other reasons I prefer to stick with monotone, but I won't get into them here, as they're not important to the point of this post.