Migrating the JRuby wiki - the history

posted by qmx on 21 January 2011

When I chose JRuby as my open source contribution target, I really meant to contribute.

If you really want to contribute, the best way to do it is to look for what the project needs. (This sounds dumb, but most of the time we lean toward the more enjoyable tasks.)

After Oracle acquired Sun, Kenai's future got cloudy, and JRuby's wiki was hosted there! We would need to migrate away from Kenai.

I started looking at migration strategies, following advice from Nick and Charlie. Initially we wanted to convert the (ugly) MediaWiki source to markdown and put it on GitHub's wiki. I started by trying to improve Charlie's mediakiller, and later realized that the wiki's source was totally messed up. MediaWiki lets you mix its markup with plain HTML, and we had done that on almost every page, making a clean migration nearly impossible.

I went ahead and tried converting the rendered HTML to markdown instead, and got stuck on the lack of markdown-extra, which made the horde of anchor links unusable. With perfect timing, Charlie introduced me to Consiliens, who was trying to migrate away from MediaWiki too.

We agreed this time that a full dump from Kenai would keep the wiki safe, so I wrote an ugly script and dumped all the pages to my machine.

Consiliens helped me with the investigation, and together with Charlie we discovered many things:

  • pandoc's html2markdown would break badly.
  • Kenai hadn't released its internal changes to the mediacloth gem (which it used to render the pages).
  • The wikicloth gem was breaking on multiple sections (which are used extensively throughout the wiki).

Lacking better alternatives, I dug deeper into wikicloth and found that the master version was actually rendering fine! I then asked David (wikicloth's owner) to release all the fixes he had made (and he was kind and responsive)! Finally, all we would need was an HTML parser to convert the rendered pages to markdown!

As I began to create yet another html2markdown tool, after a good glass of Chilean carmenère, it clicked:

Why not use the wikicloth gem to render the wiki as-is?

Thrilled with this idea, I started reading the source code of gollum (GitHub's wiki engine) and found it was using the github-markup gem. I sent a three-line pull request enabling MediaWiki rendering via wikicloth, and with a little help from the interwebs, technoweenie merged it and deployed it on GitHub's infrastructure! He found a bug in link handling (those ugly [edit] links) that was a no-brainer to fix, and finally we got it working!

After the deploy I started to prepare the final migration run, taking live data from Kenai straight into gollum's repository. I asked Charlie about keeping the history linear (Kenai only stored revisions per page), and he suggested sorting it by timestamp. Redis' sorted sets came to mind, and I quickly stored the pages using the timestamps as scores. Less than an hour later the dump finished, and I started to work on the gollum import process.
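To illustrate the idea of turning per-page revision lists into one linear history, here is a minimal sketch. It is not the script I actually ran: a plain in-memory sort stands in for the Redis sorted set, and the `Revision` struct, the `linearize` helper, and the sample data are all made up for the example.

```ruby
# Hypothetical record for one saved revision of one wiki page.
Revision = Struct.new(:page, :author, :timestamp)

# Merge the per-page revision lists and order everything by timestamp,
# using the page name as a tie-breaker so the result is deterministic.
# The timestamp plays the role that the score plays in a sorted set.
def linearize(revisions_by_page)
  revisions_by_page
    .flat_map { |_page, revisions| revisions }
    .sort_by { |rev| [rev.timestamp, rev.page] }
end

# Made-up sample data, for illustration only.
sample = {
  "Home" => [
    Revision.new("Home", "alice", Time.utc(2009, 3, 1)),
    Revision.new("Home", "bob",   Time.utc(2010, 6, 15))
  ],
  "FAQ" => [
    Revision.new("FAQ", "carol", Time.utc(2009, 11, 20))
  ]
}

linearize(sample).each do |rev|
  puts "#{rev.timestamp.strftime('%Y-%m-%d')} #{rev.page} (#{rev.author})"
end
```

Each entry in the resulting list can then become one commit, in order, which is what keeps the imported history linear.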

Gollum is backed by a git repository, so I needed to drive git programmatically to insert the page additions while keeping authorship and ordering. Thanks to the grit gem, this task was relatively easy.

After fiddling with the repository and making all the most bizarre git history mistakes, I got a clean history, ready to be imported into the final version on GitHub.

Charlie pushed the changes to the official repository, and we got an initial version. With Murphy's help, of course, Charlie discovered that MediaWiki uses [[Link|Description]] where gollum uses [[Description|Link]]. So essentially all the links were broken. To make things worse, whitespace was being handled differently by Kenai, markdown, and github-markup, making a fully automated migration nearly impossible.
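Since the two sides of a piped link are simply swapped between the two syntaxes, the problem can be pictured with a small sketch like the one below. This is only an illustration, not the fix that went into gollum: `swap_piped_links` and `PIPED_LINK` are names invented here, and the regular expression assumes simple links with no nested brackets.

```ruby
# Matches [[left|right]] where neither side contains brackets or a pipe.
# Links without a pipe, such as [[FAQ]], do not match and are left alone.
PIPED_LINK = /\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]/

# Swap the two sides of every piped link in the given text.
def swap_piped_links(text)
  text.gsub(PIPED_LINK) do
    "[[#{Regexp.last_match(2)}|#{Regexp.last_match(1)}]]"
  end
end

puts swap_piped_links("See [[Getting Started|the guide]] and [[FAQ]].")
# prints "See [[the guide|Getting Started]] and [[FAQ]]."
```

A swap like this says nothing about the whitespace differences, which is part of why a purely mechanical pass was not enough.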

I made a fix for this behavior, and thanks to technoweenie, it made it into gollum. We rushed to fix the majority of the links, and finally the wiki went live.

After that, I began looking into how to make the old wiki link to the new infrastructure. After fiddling with a sandboxless system, I finished marking all pages with a DO NOT EDIT link. Nick then changed the site's redirect function to send all wiki links to GitHub's address.

The work on the wiki continues, as I'm gradually migrating all the MediaWiki pages to a cleaner markdown format (with appropriate updates to the content, where possible).

In the end, this was an awesome experience in true open source fashion. We can do more with less by helping each other. I learned tons of things, helped my favorite open source project, and as a side effect got help from awesome people and helped polish many things in several projects.

Many thanks to all the people involved, you all made this possible.

You rock!