<nudge>, <nudge>, <wink>, <wink>

FOP can get confused and do stupid things. Giving it a nudge in the right direction fixes it. For example, when a table fills the page exactly, from where it starts to the end of the region available to it, FOP does OK. But add a footnote to that table, and the footnote ends up dangling at the top of the next page.

I thought a bit about why this happens, and about whether adding a keep to the footnote would help. What I ended up doing was nudging down something higher on the page, so that it created a bit of whitespace in a place where the reader wouldn’t notice. That made the table not fit, which made the table break, which put the table footnote where I expected it to be, instead of hanging in space.

Here’s how you do it. Add this to your customization layer:

  <!-- a hack to let us push down the next section when we know it
       is going to leave a badly formatted table (or whatever)
       where it lies as-is. -->
  <xsl:template match="nudge">
    <fo:block>
      <xsl:attribute name="space-after">
        <xsl:value-of select="@how-much"/>
      </xsl:attribute>
    </fo:block>
  </xsl:template>

Then add this to your DocBook document: <nudge how-much="2cm"/>

Note: no space between the 2 and the cm, or else it thinks you mean 2 points and 0 centimeters or something. Whatever. It doesn’t work with the space, so don’t put it in.
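If you end up with more than a couple of nudges, a quick check can catch the stray space before FOP does. What follows is a minimal Python sketch, not part of the setup described above. It assumes a DocBook 4 style document with no XML namespace; the name bad_nudges and the sample document are made up for illustration, and the unit list is my understanding of what XSL-FO accepts.

```python
import re
import xml.etree.ElementTree as ET

# A number immediately followed by a unit, with no space in between.
# Assumption: these are the length units an FO processor will accept.
LENGTH_RE = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)(cm|mm|in|pt|pc|px|em)")

def bad_nudges(docbook_text):
    """Return the how-much values of <nudge> elements that are not valid lengths."""
    root = ET.fromstring(docbook_text)
    return [
        nudge.get("how-much", "")
        for nudge in root.iter("nudge")
        if not LENGTH_RE.fullmatch(nudge.get("how-much", ""))
    ]

# Hypothetical document: one good nudge, one with the forbidden space.
SAMPLE = '<section><nudge how-much="2cm"/><nudge how-much="2 cm"/></section>'
print(bad_nudges(SAMPLE))
```

Run it over your document source before building, and fix whatever it reports.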

Problem solved… but understand that if anything moves, you’ll have to go re-nudge it. Think very carefully before liberally sprinkling these hither and yon. I did it because I have a hard page break a few paragraphs before, and I knew my table was basically always going to have this problem.

Dealing with a clogged link

A friend asked me a question that reminded me of some great resources I want to mention here (in case I ever need to find them again…)

They are:

In my response to my friend, I also touched on something interesting I learned from Cricket, and from years as a consultant. Here’s the big lesson: Infrastructure projects don’t sell themselves.

Why? Because they are not sexy. Their payback is hidden down in the noise of day-to-day operations and (apparently) fixed costs. Once you get used to “we have to have 256 kbit/sec VSATs everywhere, and the WAN is too slow to do file sharing”, it appears to be the unchanging reality, the fixed cost that people just suck up and pay. But if you measure a problem and show before and after pictures, then you can market the solution. It’s no secret that the farther up the management ladder you go, the harder it is to explain what’s broken, why, and why it will cost $X to fix it. The charitable explanation is that “those people are busy, they are big picture people”. The less charitable explanation is that the Peter Principle moves incompetent people up out of day-to-day work into management, where they need to be pandered to like children in kindergarten: “Ooooh, look at the pretty colors! OK, now sign here.”

We used to call it “the manager test”. Cricket passed the manager test because even a manager could understand Cricket’s output and agree to fund the changes necessary to improve link utilization. At Microsoft, I can remember one time when Cricket even managed to pass “the executive test”: a link upgrade was going to be so expensive that a VP had to sign for it, and he only agreed to spend the money after seeing the graphs with his own eyes.

Pictures work. Everything else is a waste of time.

Here’s something else I wrote to my friend that I’d like to share:

It is the “measuring and marketing” part you need right now… Improving your bandwidth situation needs to be a stealth project, hidden under the covers of some other project.

This is how you raise the profile of and fund infrastructure projects, at least in dysfunctional environments where the hamsters in the wheelhouse need and want to be pandered to by bright colors and tender morsels of “synergy”.

PS: I’m keeping my friend anonymous, since we wouldn’t want his bosses to find this, on the off chance they might understand that the last sentence is talking about them. 🙂

Disabling hyphenation in DocBook

When you are using the chain “DocBook -> FO -> PDF”, it is the FO processor that decides on the hyphenation of your words. This is because it knows the lengths of the lines it is making.

In FOP, hyphenation can only be turned on and off at the level of an <fo:block>. For some dumb reason, DocBook doesn’t have a way to turn off hyphenation in one region of the document.

Here’s a way to add a hyphenate flag onto the DocBook <para> tag:

  <!-- make it possible to turn off hyphenation when it's giving us probs -->
  <xsl:template match="para[@hyphenate='false']">
    <fo:block hyphenate="false" xsl:use-attribute-sets="normal.para.spacing">
      <xsl:call-template name="anchor"/>
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

Put that into your customization stylesheet. Most of that garbage is copied from block.xsl in the DocBook stylesheets. The key is the match pattern, which makes this template fire if and only if you have a para that looks like this: <para hyphenate="false">.

Happy non-hyphenation!

Printing a Blog

I got interested in applying my new XSLT wizardry to the task of printing an entire blog. Like making every post into a big PDF and sending it off to a print-on-demand service. Digital backups = bad. Paper backups = good.

I thought it would be easy, just do “WordPress export”, then write the XSLT to turn the export XML file into FO, and process into PDF. Well, it is easy, in principle. But in practice it’s not.

The first problem is that posts have HTML in them. Not a lot. It’s almost entirely in-line stuff, like bold and italics. But it’s there, and you have to format it. Which means that you have to get it into the XML document tree so that XSLT can walk the tree and convert the HTML into FO one tag at a time. (By the way, a great resource on how to do the HTML to FO is at IBM developerWorks.)

But the problem is that WordPress exports blog entries using the <content:encoded> tag, and the stuff inside that tag is a CDATA string. That means your XSLT processor gets the chunk of HTML as a string, not as nodes in the XML tree, so you can’t process it. But if you pull off the CDATA and send the WordPress version of the blog posts directly into your XML parser, bad things happen. It’s not valid XHTML, so it won’t parse. Lots and lots of error messages, like 20000 for my blog.

And like that, like a flash of lightning, I finally understood what all the fuss about XHTML is. Data has to start structured so that if you want to process it later, you won’t have to clean it up. Duh.
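To see the CDATA problem concretely, here is a small Python sketch using the standard library XML parser instead of an XSLT processor (the tree it builds is the same idea). The two one-post fragments are invented for illustration and are not real WordPress output. It shows that with CDATA the post body arrives as one string with no child nodes, and without CDATA it arrives as elements you can walk.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Hypothetical export fragments: the same post body, with and without CDATA.
WITH_CDATA = (
    f'<item xmlns:content="{CONTENT_NS}">'
    "<content:encoded><![CDATA[<p>Hello <b>world</b></p>]]></content:encoded>"
    "</item>"
)
WITHOUT_CDATA = (
    f'<item xmlns:content="{CONTENT_NS}">'
    "<content:encoded><p>Hello <b>world</b></p></content:encoded>"
    "</item>"
)

def describe(xml_text):
    """Return (text, child tag names) for the content:encoded element."""
    item = ET.fromstring(xml_text)
    encoded = item.find(f"{{{CONTENT_NS}}}encoded")
    return encoded.text, [child.tag for child in encoded]

cdata_text, cdata_children = describe(WITH_CDATA)
plain_text, plain_children = describe(WITHOUT_CDATA)

print("with CDATA:   ", repr(cdata_text), cdata_children)
print("without CDATA:", repr(plain_text), plain_children)
```

In the CDATA case there is nothing for a template like match="b" to fire on, because the <b> is just characters inside a text node.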

The first step is to hack your WordPress so that export does not hide the XHTML inside of a CDATA. Here’s the patch:

*** wp-admin/includes/export.php-       2009-06-16 02:22:19.000000000 -0700
--- wp-admin/includes/export.php        2009-06-17 03:35:16.000000000 -0700
***************
*** 14,24 ****
  header("Content-Disposition: attachment; filename=$filename");
  header('Content-Type: text/xml; charset=' . get_option('blog_charset'), true);

! $where = '';
! if ( $author and $author != 'all' ) {
!       $author_id = (int) $author;
!       $where = $wpdb->prepare(" WHERE post_author = %d ", $author_id);
! }

  // grab a snapshot of post IDs, just in case it changes during the export
  $post_ids = $wpdb->get_col("SELECT ID FROM $wpdb->posts $where ORDER BY post_date_gmt ASC");
--- 14,25 ----
  header("Content-Disposition: attachment; filename=$filename");
  header('Content-Type: text/xml; charset=' . get_option('blog_charset'), true);

! $where = $wpdb->prepare("WHERE post_type != 'revision'");
!
! #if ( $author and $author != 'all' ) {
! #     $author_id = (int) $author;
! #     $where = $wpdb->prepare("WHERE post_author = %d ", $author_id);
! #}

  // grab a snapshot of post IDs, just in case it changes during the export
  $post_ids = $wpdb->get_col("SELECT ID FROM $wpdb->posts $where ORDER BY post_date_gmt ASC");
***************
*** 160,165 ****
--- 161,167 ----
  <?php the_generator('export');?>
  <rss version="2.0"
        xmlns:content="http://purl.org/rss/1.0/modules/content/"
+       xmlns:excerpt="http://purl.org/rss/1.0/modules/excerpt/"
        xmlns:wfw="http://wellformedweb.org/CommentAPI/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:wp="http://wordpress.org/export/<?php echo WXR_VERSION; ?>/"
***************
*** 200,206 ****

  <guid isPermaLink="false"><?php the_guid(); ?></guid>
  <description></description>
! <content:encoded><?php echo wxr_cdata( apply_filters('the_content_export', $post->post_content) ); ?></content:encoded>
  <excerpt:encoded><?php echo wxr_cdata( apply_filters('the_excerpt_export', $post->post_excerpt) ); ?></excerpt:encoded>
  <wp:post_id><?php echo $post->ID; ?></wp:post_id>
  <wp:post_date><?php echo $post->post_date; ?></wp:post_date>
--- 202,208 ----

  <guid isPermaLink="false"><?php the_guid(); ?></guid>
  <description></description>
! <content:encoded><?php echo apply_filters('the_content_export', $post->post_content); ?></content:encoded>
  <excerpt:encoded><?php echo wxr_cdata( apply_filters('the_excerpt_export', $post->post_excerpt) ); ?></excerpt:encoded>
  <wp:post_id><?php echo $post->ID; ?></wp:post_id>
  <wp:post_date><?php echo $post->post_date; ?></wp:post_date>

Wow, lots of changes. Why? Here’s the rundown:

  1. You have to filter out all the auto-saves that WordPress maintains in the database. I was lazy and broke the author filtering when I made my hack. YMMV.
  2. You have to add some namespaces so that your XML parser does not flip out.
  3. You have to remove the CDATA thingie.

Even once you have the XHTML outside the CDATA, there are two more layers of problems, actually. The first is that WordPress has a set of filters it uses to turn the simple input it allows into HTML, by doing stuff like adding the <p>’s for you, etc. So you need to get that filter to trigger during your export. To do that, you use the following plugin:

<?php
/*
Plugin Name: XHTML Export
Version: 1.0
Plugin URI: http://nella.org
Description: Ensures output is XHTML
Author: Jeff R. Allen
Author URI: http://nella.org
Disclaimer: This Plugin comes with no warranty, expressed or implied
*/

function xhtml_export( $content )
{
  $content = apply_filters('the_content', $content);
  // there's another way to do this: see start of the stylesheet from IBM developerWorks
  $content = str_replace('&nbsp;', '&#160;', $content);
  return $content;
}

add_filter( 'the_excerpt_export', 'xhtml_export' );
add_filter( 'the_content_export', 'xhtml_export' );
?>

Put that into your wp-content/plugins directory, then go to your plugins page and activate it.

The output of the WordPress filtering step is XHTML, so if you never ever resort to the HTML tab to hack at the HTML yourself, you’d be OK. But I do that a lot, and I also imported posts from Movable Type, which wasn’t as XHTML-aware back in the day. So I still had a massive mess on my hands. Lots and lots of XML errors.

Right about this time, in flies HTML Tidy to save the day. It can turn HTML into XHTML pretty reliably. And the Tidy-up plugin for WordPress arranges for Tidy to do its work on each of your posts in turn and makes it super-easy to edit the posts and fix the problems.
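Before handing everything to Tidy, it can help to know which posts are actually broken. Here is a minimal Python sketch of that idea; it is an illustration, not part of the workflow described here. It assumes you already have each post body as a string (the posts dictionary and the function name find_broken_posts are made up), wraps each body in a dummy <div>, and asks the standard library XML parser whether the result is well-formed.

```python
import xml.etree.ElementTree as ET

def find_broken_posts(posts):
    """Map post title -> parser error for every body that is not well-formed XML."""
    broken = {}
    for title, body in posts.items():
        try:
            # Wrap in a dummy root so a body with several top-level <p>s still parses.
            ET.fromstring(f"<div>{body}</div>")
        except ET.ParseError as err:
            broken[title] = str(err)
    return broken

# Hypothetical post bodies, for illustration only.
posts = {
    "fine": "<p>Hello <b>world</b></p>",
    "unclosed-br": "<p>line one<br>line two</p>",
    "nbsp": "<p>a&nbsp;b</p>",
}
broken = find_broken_posts(posts)
for title, error in sorted(broken.items()):
    print(f"{title}: {error}")
```

The &nbsp; case fails because a plain XML parser does not know HTML’s named entities, which is why the plugin above rewrites it as &#160;.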

So that’s how far I have gotten. I have a tidy blog, and I have a big file called blog.xml which is well-formed XML, including all the XHTML child nodes of the <content:encoded> node. Now I am hacking away on the XSLT to turn the blog into FO to make a beautiful book. As I’m not a designer, that’s the part where things fall apart for me. But I’m hoping to find a nice design template from Lulu or someone and follow that.

Also… Tidy is useful if you are trying to debug your FO output. Use a Makefile like this so that the error messages you get from FOP will have line numbers you can go look at instead of character offsets into the one long line that your XSLT processor emits. Tidy stupidly defaults to non-UTF8, even when you tell it that it is reading an XML file, and even when the XML file has the right encoding named. So be sure to use the -utf8 flag. And don’t forget the -q flag, which shuts up the stupid advertisement the Tidy developers saw fit to put in their software. (Hello, W3C people? Yeah, we don’t care. Shut up. Thanks. okloveyabye.)

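# note: recipe lines must start with a tab character, not spaces
.PHONY: all view clean

all: view

# open the finished PDF in the desktop viewer
view: blog.pdf
	gnome-open $<

# WordPress export -> FO
blog.fo: blog.xml rss-to-fop.xslt
	xsltproc --output $@ rss-to-fop.xslt $<

# re-indent the FO so FOP's error messages point at useful line numbers
blog.fob: blog.fo
	tidy -utf8 -q -i -xml -o $@ $<

blog.pdf: blog.fob
	fop -fo $< $@

clean:
	rm -f blog.fob blog.fo blog.pdf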

Just Married

I’ve been away from the web for a while because I was in Olivone, Ticino, Switzerland getting married!

Thanks to friends and family who came from so far away to witness such a special day.

And thanks also to our wonderful vendors, who made the day go so well. If you are thinking of putting on a wedding in Blenio, give these guys a call:

It feels different to be married. It seems like it shouldn’t… the house is the same, we still sleep on the same sides of the bed. But it’s different. Good. And safe. And happy. And… different.

Viva gli sposi, all of them, wherever they are in the world. This is why getting married is special! Now I know!