Printing a Blog

I got interested in applying my new XSLT wizardry to the task of printing an entire blog. Like making every post into a big PDF and sending it off to a print-on-demand service. Digital backups = bad. Paper backups = good.

I thought it would be easy: do a “Wordpress export”, write the XSLT to turn the export XML file into FO, and process that into PDF. Well, it is easy in principle. In practice it’s not.

The first problem is that posts have HTML in them. Not a lot. It’s almost entirely in-line stuff, like bold and italics. But it’s there, and you have to format it. Which means that you have to get it into the XML document tree so that XSLT can walk the tree and convert the HTML into FO one tag at a time. (By the way, a great resource on how to do the HTML-to-FO conversion is at IBM developerWorks.)
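To make "one tag at a time" concrete, here is a minimal sketch of the same tree walk in Python rather than XSLT. It is not the stylesheet from developerWorks. The tag-to-attribute table and the html_to_fo function are made up for illustration, and a real stylesheet would handle many more tags.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

# Hypothetical mapping for illustration: inline HTML tag -> fo:inline attributes.
INLINE_STYLES = {
    "b": {"font-weight": "bold"},
    "strong": {"font-weight": "bold"},
    "i": {"font-style": "italic"},
    "em": {"font-style": "italic"},
}


def html_to_fo(node):
    """Recursively convert one parsed HTML element into an FO element."""
    if node.tag == "p":
        out = ET.Element(f"{{{FO_NS}}}block")
    else:
        # Tags missing from the table become a plain fo:inline that keeps their text.
        out = ET.Element(f"{{{FO_NS}}}inline", INLINE_STYLES.get(node.tag, {}))
    out.text = node.text
    for child in node:
        converted = html_to_fo(child)
        converted.tail = child.tail  # text that follows the child inside its parent
        out.append(converted)
    return out


post = ET.fromstring("<p>Paper is <b>good</b>, disks are <i>bad</i>.</p>")
fo = html_to_fo(post)
print(ET.tostring(fo, encoding="unicode"))
```

The walk only works because the HTML is already a tree of nodes.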

But the problem is that Wordpress exports blog entries using the content:encoded tag, and the stuff inside that tag is a CDATA string. That means your XSLT processor gets the chunk of HTML as a string, not as nodes in the XML tree, so you can’t process it. But if you pull off the CDATA and send the Wordpress version of the blog posts directly into your XML parser, bad things happen. It’s not well-formed XHTML, so the parser rejects it. Lots and lots of error messages, like 20000 for my blog.
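You can see the difference with any XML parser. Here is a small sketch using Python's standard xml.etree.ElementTree. The sample items are invented, not taken from a real export, and load_content is just a helper name for this example.

```python
import xml.etree.ElementTree as ET

NS = {"content": "http://purl.org/rss/1.0/modules/content/"}


def load_content(item_xml):
    """Parse one <item> and return its content:encoded element."""
    return ET.fromstring(item_xml).find("content:encoded", NS)


# Invented sample items: the same post body, with and without the CDATA wrapper.
wrapped = (
    '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    "<content:encoded><![CDATA[<p>Some <b>bold</b> text</p>]]></content:encoded>"
    "</item>"
)
bare = (
    '<item xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    "<content:encoded><p>Some <b>bold</b> text</p></content:encoded>"
    "</item>"
)

cdata_node = load_content(wrapped)
tree_node = load_content(bare)

# Inside CDATA the markup is just text: the element has no child nodes.
print(len(cdata_node), repr(cdata_node.text))
# Without CDATA the parser builds real child elements you can walk.
print(len(tree_node), tree_node[0].tag)

# Unwrapped markup that is not well-formed (an unclosed <br>) is rejected outright.
broken = bare.replace("<b>bold</b>", "bold<br>")
try:
    ET.fromstring(broken)
    broken_ok = True
except ET.ParseError as err:
    broken_ok = False
    print("parse error:", err)
```

The last case is the one that produces the flood of errors: a single sloppy tag is enough to make the parser give up.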

And like that, like a flash of lightning, I finally understood what all the fuss about XHTML is. Data has to start out structured so that if you want to process it later, you won’t have to clean it up. Duh.

The first step is to hack your Wordpress so that export does not hide the XHTML inside of a CDATA. Here’s the patch:

*** wp-admin/includes/export.php-       2009-06-16 02:22:19.000000000 -0700
--- wp-admin/includes/export.php        2009-06-17 03:35:16.000000000 -0700
***************
*** 14,24 ****
  header("Content-Disposition: attachment; filename=$filename");
  header('Content-Type: text/xml; charset=' . get_option('blog_charset'), true);

! $where = '';
! if ( $author and $author != 'all' ) {
!       $author_id = (int) $author;
!       $where = $wpdb->prepare(" WHERE post_author = %d ", $author_id);
! }

  // grab a snapshot of post IDs, just in case it changes during the export
  $post_ids = $wpdb->get_col("SELECT ID FROM $wpdb->posts $where ORDER BY post_date_gmt ASC");
--- 14,25 ----
  header("Content-Disposition: attachment; filename=$filename");
  header('Content-Type: text/xml; charset=' . get_option('blog_charset'), true);

! $where = $wpdb->prepare("WHERE post_type != 'revision'");
!
! #if ( $author and $author != 'all' ) {
! #     $author_id = (int) $author;
! #     $where = $wpdb->prepare("WHERE post_author = %d ", $author_id);
! #}

  // grab a snapshot of post IDs, just in case it changes during the export
  $post_ids = $wpdb->get_col("SELECT ID FROM $wpdb->posts $where ORDER BY post_date_gmt ASC");
***************
*** 160,165 ****
--- 161,167 ----
  <?php the_generator('export');?>
  <rss version="2.0"
        xmlns:content="http://purl.org/rss/1.0/modules/content/"
+       xmlns:excerpt="http://purl.org/rss/1.0/modules/excerpt/"
        xmlns:wfw="http://wellformedweb.org/CommentAPI/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:wp="http://wordpress.org/export/<?php echo WXR_VERSION; ?>/"
***************
*** 200,206 ****

  <guid isPermaLink="false"><?php the_guid(); ?></guid>
  <description></description>
! <content:encoded><?php echo wxr_cdata( apply_filters('the_content_export', $post->post_content) ); ?></content:encoded>
  <excerpt:encoded><?php echo wxr_cdata( apply_filters('the_excerpt_export', $post->post_excerpt) ); ?></excerpt:encoded>
  <wp:post_id><?php echo $post->ID; ?></wp:post_id>
  <wp:post_date><?php echo $post->post_date; ?></wp:post_date>
--- 202,208 ----

  <guid isPermaLink="false"><?php the_guid(); ?></guid>
  <description></description>
! <content:encoded><?php echo apply_filters('the_content_export', $post->post_content); ?></content:encoded>
  <excerpt:encoded><?php echo wxr_cdata( apply_filters('the_excerpt_export', $post->post_excerpt) ); ?></excerpt:encoded>
  <wp:post_id><?php echo $post->ID; ?></wp:post_id>
  <wp:post_date><?php echo $post->post_date; ?></wp:post_date>

Wow, lots of changes. Why? Here’s the rundown:

You have to filter out all the auto-saves that Wordpress maintains in the database. I was lazy and broke the author filtering when I made my hack. YMMV.
You have to add some namespaces so that your XML parser does not flip out.
You have to remove the CDATA thingie.
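The namespace item is easy to reproduce. In this sketch (Python's standard xml.etree.ElementTree, with invented one-line documents and a helper I named parses), the parser rejects an excerpt:encoded element until the excerpt prefix is declared, which is the declaration the patch adds to the rss element.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is well-formed, namespace prefixes included."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Invented minimal documents; the namespace URI is the one used in the patch.
without_ns = '<rss version="2.0"><excerpt:encoded>hi</excerpt:encoded></rss>'
with_ns = (
    '<rss version="2.0" xmlns:excerpt="http://purl.org/rss/1.0/modules/excerpt/">'
    "<excerpt:encoded>hi</excerpt:encoded></rss>"
)

print(parses(without_ns))  # the excerpt prefix is not bound to any namespace
print(parses(with_ns))
```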

Even once you have the XHTML outside the CDATA, there are two more layers of problems. First, Wordpress has a set of filters it uses to turn the simple input it allows into HTML, by doing stuff like adding the <p>’s for you, etc. So you need to get those filters to trigger during your export. To do that, you use the following plugin:

<?php
/*
Plugin Name: XHTML Export
Version: 1.0
Plugin URI: http://nella.org
Description: Ensures output is XHTML
Author: Jeff R. Allen
Author URI: http://nella.org
Disclaimer: This Plugin comes with no warranty, expressed or implied
*/

function xhtml_export( $content )
{
  $content = apply_filters('the_content', $content);
  // there's another way to do this: see start of the stylesheet from IBM developerWorks
  $content = str_replace('&nbsp;', '&#160;', $content);
  return $content;
}

add_filter( 'the_excerpt_export', 'xhtml_export' );
add_filter( 'the_content_export', 'xhtml_export' );
?>

Put that into your wp-content/plugins directory, then go to your plugins page and activate it.
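In case the str_replace in the plugin looks odd: &nbsp; is an HTML entity that plain XML does not define, so an XML parser refuses it unless a DTD declares it. The numeric form &#160; means the same character and needs no declaration. A quick sketch in Python, where to_numeric_nbsp is my own stand-in for the plugin's replacement and the sample paragraph is invented:

```python
import xml.etree.ElementTree as ET


def to_numeric_nbsp(content):
    """Mirror the plugin's str_replace: swap the HTML-only entity for a numeric one."""
    return content.replace("&nbsp;", "&#160;")


raw = "<p>10&nbsp;km</p>"  # invented sample paragraph

try:
    ET.fromstring(raw)
    raw_parsed = True
except ET.ParseError as err:
    raw_parsed = False
    print("rejected:", err)

fixed = ET.fromstring(to_numeric_nbsp(raw))
print(repr(fixed.text))  # the text now holds a real non-breaking space
```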

The output of the Wordpress filtering step is XHTML, so if you never ever resort to the HTML tab to hack at the HTML yourself, you’re OK. But I do that a lot, and I also imported posts from Movable Type, which wasn’t as XHTML-aware back in the day. So I still had a massive mess on my hands. Lots and lots of XML errors.

Right about this time, in flies HTML Tidy to save the day. It can turn HTML into XHTML pretty reliably. And the Tidy-up plugin for Wordpress arranges for Tidy to do its work on each of your posts in turn and makes it super-easy to edit the posts and fix the problems.

So that’s how far I have gotten. I have a tidy blog, and I have a big file called blog.xml which is well-formed XML, including all the XHTML child nodes of the content:encoded node. Now I am hacking away on the XSLT to turn the blog into FO to make a beautiful book. As I’m not a designer, that’s the part where things fall apart for me. But I’m hoping to find a nice design template from Lulu or someone and follow that.

Also… Tidy is useful if you are trying to debug your FO output. Use a Makefile like this so that the error messages you get from FOP will have line numbers you can go look at instead of character offsets into the one long line that your XSLT processor emits. Tidy stupidly defaults to non-UTF8, even when you tell it that it is reading an XML file, and even when the XML file names the right encoding. So be sure to use the -utf8 flag. And don’t forget the -q flag, which shuts up the stupid advertisement the Tidy developers saw fit to put in their software. (Hello, W3C people? Yeah, we don’t care. Shut up. Thanks. okloveyabye.)

# Recipe lines must start with a tab character, not spaces.
.PHONY: all view clean

all: view

view: blog.pdf
	gnome-open $<

blog.fo: blog.xml rss-to-fop.xslt
	xsltproc --output $@ rss-to-fop.xslt $<

blog.fob: blog.fo
	tidy -utf8 -q -i -xml -o $@ $<

blog.pdf: blog.fob
	fop -fo $< $@

clean:
	rm -f blog.fob blog.fo blog.pdf
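
If you want to see what the re-indenting step buys you without installing Tidy, here is a rough stand-in using Python's standard xml.dom.minidom. It is not what Tidy does internally. The reindent helper and the tiny FO document are invented for the demo. Note that pretty-printing adds whitespace between elements, which can matter for mixed content.

```python
from xml.dom import minidom


def reindent(xml_text):
    """Return xml_text re-indented so that elements land on separate lines."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


# Invented one-line FO document, the kind an XSLT processor might emit.
one_line = (
    '<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">'
    "<fo:layout-master-set/><fo:page-sequence><fo:flow>"
    "<fo:block>Hello</fo:block></fo:flow></fo:page-sequence></fo:root>"
)

pretty = reindent(one_line)
print(len(one_line.splitlines()), "line in,", len(pretty.splitlines()), "lines out")
print(pretty)
```

Once every element sits on its own line, an error reported against a line number points at something you can actually find.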