The SQL backlash

I remember sitting in my databases class years ago and thinking, “This can’t possibly be the right way to store data.” It was a strange class, because it mixed theory and practice in a way that was anathema to the way I think. The theory part bored me to tears, and seemed ludicrously useless (first normal form? third normal form? who cares!). The practice part seemed ridiculously complicated and pointless. Why go to all that effort to write something down? Why not just write it? Yes, yes, I know, ACID and all that. I grudgingly learned it all, and I passed the class. But it never felt right to me. It felt like the students were putting on a performance to satisfy the teacher. It didn’t feel like Theory of Computing (i.e. Turing machines, regular expressions, the halting problem, etc), which just Felt Right.

So, you can imagine that it warms my heart immensely to read about NoSQL, a growing backlash against SQL/ACID data management.

What’s especially heart-warming about the backlash is that it’s coming from honest-to-god working computer scientists, not the academics, and not some businessmen who are trying to move SQL aside for their own profit. This is simply the next evolution of how to think about storing data: there are ways to process data where ACID is not a useful feature, and you need to have tools in your toolbox that respect that reality.

PS: A particularly philosophical reader might notice that I’m commenting here on the fundamental difference between the mathematics of computers and software engineering. Where the mathematicians lead us, the path (though steep and rocky) is usually an interesting trip. Where the software engineers lead us sometimes takes us into the malarial swamps of SQL. And the artisans of software, those who rightly reject the clunky boots and balky toolbelts that the software engineers wear? They put the sweet-smelling flowers into the meadows that the mathematical path takes us meandering by….

<nudge>, <nudge>, <wink>, <wink>

FOP can get confused and do stupid things. Giving it a nudge in the right direction fixes it. For example, when a table is going to fill the page right up from where it starts to the end of the region available for it, it does ok. But add a footnote onto that table, and the footnote ends up dangling at the top of the next page.

I thought a bit about why this happens, and I considered fixes such as adding a keep to the footnote. What I ended up wanting to do is to nudge down something higher on the page so that it created a bit of whitespace in a place where the reader wouldn’t notice. That made the table not fit, which made the table break, which put the table footnote where I expected it to be, instead of hanging in space.

Here’s how you do it. Add this to your customization layer:

  <!-- a hack to let us push down the next section when we know it
       is going to leave a badly formatted table (or whatever)
       where it lies as-is. -->
  <xsl:template match="nudge">
    <fo:block>
      <xsl:attribute name="space-after">
        <xsl:value-of select="@how-much"/>
      </xsl:attribute>
    </fo:block>
  </xsl:template>

Then add this to your DocBook document: <nudge how-much="2cm"/>

Note: no space between the 2 and the cm, or else it thinks you mean 2 points and 0 centimeters or something. Whatever. It doesn’t work with the space, so don’t put it in.

Problem solved… but understand that if anything moves, you’ll have to go re-nudge it. Think very carefully before liberally sprinkling these hither and yon. I did it because I have a hard page break a few paragraphs before, and I knew my table was basically always going to have this problem.
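To see what the template actually emits, here is a minimal Java sketch that runs it over a lone <nudge> element using the XSLT processor bundled with the JDK (not FOP, and not the DocBook stylesheets). The class name NudgeDemo, the render method and the bare stylesheet wrapper are assumptions made up for the illustration; only the template body comes from the customization layer above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: applies the nudge template to a single element.
final class NudgeDemo {
    // Stand-alone wrapper around the template. A real customization layer
    // would import the DocBook FO stylesheets instead of standing alone.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='nudge'>"
        + "<fo:block>"
        + "<xsl:attribute name='space-after'>"
        + "<xsl:value-of select='@how-much'/>"
        + "</xsl:attribute>"
        + "</fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Returns the FO fragment produced for <nudge how-much="..."/>.
    static String render(String howMuch) throws Exception {
        String input = "<nudge how-much='" + howMuch + "'/>";
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(input)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints an empty fo:block whose space-after attribute is 2cm.
        System.out.println(render("2cm"));
    }
}
```

The result is an empty block that carries nothing but the space-after attribute, which is all FOP needs to push the following content down the page.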

Dealing with a clogged link

A friend asked me a question that reminded me of some great resources I want to mention here (in case I ever need to find them again…)

They are:

In my response to my friend, I also touched on something interesting I learned from Cricket, and from years as a consultant. Here’s the big lesson: Infrastructure projects don’t sell themselves.

Why? Because they are not sexy. Their payback is hidden down in the noise of day to day operations and (apparently) fixed costs. Once you get used to “we have to have 256 kbit/sec VSATs everywhere, and the WAN is too slow to do file sharing”, it appears to be the unchanging reality, the fixed costs that people just suck up and pay. But if you measure a problem and show before and after pictures, then you can market the solution. It’s no secret that the farther up the management ladder you go, the harder it is to explain what’s broken, why, and why it will cost $X to fix it. The charitable explanation is that “those people are busy, they are big picture people”. The less charitable explanation is that the Peter Principle moves incompetent people up out of day-to-day work into management, where they need to be pandered to like children in kindergarten: “Ooooh, look at the pretty colors! OK, now sign here.”

We used to call it “the manager test”. Cricket passed the manager test because even a manager could understand Cricket’s output and agree to fund changes necessary to improve link utilization. At Microsoft, I can remember one time that Cricket even managed to pass “the executive test”, when a link upgrade was going to be so expensive that a VP had to sign for it, and he only agreed to spend the money after seeing the graphs with his own eyes.

Pictures work. Everything else is a waste of time.

Here’s something else I wrote to my friend that I’d like to share:

It is the “measuring and marketing” part you need right now… Improving your bandwidth situation needs to be a stealth project hidden under the covers of some other project.

This is how you raise the profile of and fund infrastructure projects, at least in dysfunctional environments where the hamsters in the wheelhouse need and want to be pandered to by bright colors and tender morsels of “synergy”.

PS: I’m keeping my friend anonymous, since we wouldn’t want his bosses to find this, on the off chance they might understand that the last sentence is talking about them. :)

Disabling hyphenation in DocBook

When you are using the chain “DocBook -> FO -> PDF”, it is the FO processor that decides on the hyphenation of your words. This is because it knows the lengths of the lines it is making.

In FOP, hyphenation can only be turned on and off at the level of an <fo:block>. For some dumb reason, DocBook doesn’t have a way to turn off hyphenation in one region of the document.

Here’s a way to add a hyphenate flag onto the DocBook <para> tag:

  <!-- make it possible to turn off hyphenation when it's giving us probs -->
  <xsl:template match="para[@hyphenate='false']">
    <fo:block hyphenate="false" xsl:use-attribute-sets="normal.para.spacing">
      <xsl:call-template name="anchor"/>
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

Put that into your customization stylesheet. Most of that garbage is copied from block.xsl in the DocBook stylesheets. The key is the selector, which makes this template fire if and only if you have a para that looks like this: <para hyphenate="false">.
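Why does the more specific selector win over the stock para template? In XSLT 1.0 a pattern with a predicate gets a default priority of 0.5, while a bare element name gets 0 (and a customization layer that imports the DocBook stylesheets also has higher import precedence). The Java sketch below demonstrates the priority rule with the JDK's built-in XSLT processor. It is a simplified stand-in, not the DocBook stylesheets: the class name HyphenateDemo is made up, and the attribute set and the anchor call from the real template are left out because they only exist inside DocBook XSL.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: two competing templates for <para>.
final class HyphenateDemo {
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        // Stand-in for the stock template: default priority 0.
        + "<xsl:template match='para'>"
        + "<fo:block><xsl:apply-templates/></fo:block>"
        + "</xsl:template>"
        // The override: the predicate raises the default priority to 0.5.
        + "<xsl:template match=\"para[@hyphenate='false']\">"
        + "<fo:block hyphenate='false'><xsl:apply-templates/></fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Transforms one <para> element and returns the resulting FO fragment.
    static String render(String paraXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(paraXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("<para>ordinary paragraph</para>"));
        System.out.println(render("<para hyphenate='false'>no hyphens here</para>"));
    }
}
```

Only the second paragraph comes out with hyphenate="false" on its block; every other para keeps going through the ordinary template.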

Happy non-hyphenation!

Printing a Blog

I got interested in applying my new XSLT wizardry to the task of printing an entire blog. Like making every post into a big PDF and sending it off to a print-on-demand service. Digital backups = bad. Paper backups = good.

I thought it would be easy, just do “Wordpress export”, then write the XSLT to turn the export XML file into FO, and process into PDF. Well, it is easy, in principle. But in practice it’s not.

The first problem is that posts have HTML in them. Not a lot. It’s almost entirely in-line stuff, like bold and italics. But it’s there, and you have to format it. Which means that you have to get it into the XML document tree so that XSLT can walk the tree and convert the HTML into FO one tag at a time. (By the way, a great resource on how to do the HTML to FO is at IBM developerWorks.)

But the problem is that Wordpress exports blog entries using the <content:encoded> tag, and the stuff inside that tag is a CDATA string. Which means that your XSLT processor gets the chunk of HTML as a string, and not as nodes in the XML tree, which means you can’t process it. But if you pull off the CDATA and send the Wordpress version of the blog posts directly into your XML parser, bad things happen. It’s not valid XHTML, so you can’t process it. Lots and lots of error messages, like 20000 for my blog.

And like that, like a flash of lightning, I finally understood what all the fuss is about XHTML, etc. Data has to start structured so that if you want to process it later, you won’t have to clean it up. Duh.
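Any XML parser will show the difference between the two cases. Here is a small Java sketch using the DOM parser that ships with the JDK. The class name CdataDemo and the un-namespaced <content> element are stand-ins invented for the example (the real export uses <content:encoded>), so treat it as an illustration of the principle rather than of the Wordpress export format.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: markup inside CDATA versus markup in the tree.
final class CdataDemo {
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of logging them.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Counts the element nodes with the given name anywhere in the document.
    static int countElements(String xml, String tagName) throws Exception {
        return parse(xml).getElementsByTagName(tagName).getLength();
    }

    // True if the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) throws Exception {
        try {
            parse(xml);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String wrapped = "<content><![CDATA[<b>bold</b>]]></content>";
        String inline = "<content><b>bold</b></content>";
        String sloppy = "<content><p>one<br>two</p></content>";

        // Inside CDATA the <b> is just characters, so no element is found.
        System.out.println("b elements, CDATA:  " + countElements(wrapped, "b"));
        // Without CDATA the <b> becomes a node that XSLT can match.
        System.out.println("b elements, inline: " + countElements(inline, "b"));
        // Old-style HTML such as an unclosed <br> is rejected outright.
        System.out.println("sloppy HTML parses: " + isWellFormed(sloppy));
    }
}
```

The third case is the one that produces the flood of errors: one unclosed tag anywhere in a post is enough to make the whole export unparseable.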

The first step is to hack your Wordpress so that export does not hide the XHTML inside of a CDATA. Here’s the patch:

*** wp-admin/includes/export.php-       2009-06-16 02:22:19.000000000 -0700
--- wp-admin/includes/export.php        2009-06-17 03:35:16.000000000 -0700
***************
*** 14,24 ****
  header("Content-Disposition: attachment; filename=$filename");
  header('Content-Type: text/xml; charset=' . get_option('blog_charset'), true);

! $where = '';
! if ( $author and $author != 'all' ) {
!       $author_id = (int) $author;
!       $where = $wpdb->prepare(" WHERE post_author = %d ", $author_id);
! }

  // grab a snapshot of post IDs, just in case it changes during the export
  $post_ids = $wpdb->get_col("SELECT ID FROM $wpdb->posts $where ORDER BY post_date_gmt ASC");
--- 14,25 ----
  header("Content-Disposition: attachment; filename=$filename");
  header('Content-Type: text/xml; charset=' . get_option('blog_charset'), true);

! $where = $wpdb->prepare("WHERE post_type != 'revision'");
!
! #if ( $author and $author != 'all' ) {
! #     $author_id = (int) $author;
! #     $where = $wpdb->prepare("WHERE post_author = %d ", $author_id);
! #}

  // grab a snapshot of post IDs, just in case it changes during the export
  $post_ids = $wpdb->get_col("SELECT ID FROM $wpdb->posts $where ORDER BY post_date_gmt ASC");
***************
*** 160,165 ****
--- 161,167 ----
  <?php the_generator('export');?>
  <rss version="2.0"
        xmlns:content="http://purl.org/rss/1.0/modules/content/"
+       xmlns:excerpt="http://purl.org/rss/1.0/modules/excerpt/"
        xmlns:wfw="http://wellformedweb.org/CommentAPI/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:wp="http://wordpress.org/export/<?php echo WXR_VERSION; ?>/"
***************
*** 200,206 ****

  <guid isPermaLink="false"><?php the_guid(); ?></guid>
  <description></description>
! <content:encoded><?php echo wxr_cdata( apply_filters('the_content_export', $post->post_content) ); ?></content:encoded>
  <excerpt:encoded><?php echo wxr_cdata( apply_filters('the_excerpt_export', $post->post_excerpt) ); ?></excerpt:encoded>
  <wp:post_id><?php echo $post->ID; ?></wp:post_id>
  <wp:post_date><?php echo $post->post_date; ?></wp:post_date>
--- 202,208 ----

  <guid isPermaLink="false"><?php the_guid(); ?></guid>
  <description></description>
! <content:encoded><?php echo apply_filters('the_content_export', $post->post_content); ?></content:encoded>
  <excerpt:encoded><?php echo wxr_cdata( apply_filters('the_excerpt_export', $post->post_excerpt) ); ?></excerpt:encoded>
  <wp:post_id><?php echo $post->ID; ?></wp:post_id>
  <wp:post_date><?php echo $post->post_date; ?></wp:post_date>

Wow, lots of changes. Why? Here’s the rundown:

  1. You have to filter out all the auto-saves that Wordpress maintains in the database. I was lazy and broke the author filtering when I made my hack. YMMV.
  2. You have to add some namespaces so that your XML parser does not flip out.
  3. You have to remove the CDATA thingie.

Even once you have the XHTML outside the CDATA, there are two more layers of problems, actually. First is that Wordpress has a set of filters it uses to turn the simple input it allows into HTML, by doing stuff like adding the <p>’s for you, etc. So you need to get that filter to trigger during your export. To do that, you use the following plugin:

<?php
/*
Plugin Name: XHTML Export
Version: 1.0
Plugin URI: http://nella.org
Description: Ensures output is XHTML
Author: Jeff R. Allen
Author URI: http://nella.org
Disclaimer: This Plugin comes with no warranty, expressed or implied
*/

function xhtml_export( $content )
{
  $content = apply_filters('the_content', $content);
  // there's another way to do this: see start of the stylesheet from IBM developerWorks
  $content = str_replace('&nbsp;', '&#160;', $content);
  return $content;
}

add_filter( 'the_excerpt_export', 'xhtml_export' );
add_filter( 'the_content_export', 'xhtml_export' );
?>

Put that into your wp-content/plugins directory, then go to your plugins page and activate it.

The output of the Wordpress filtering step is XHTML, so if you never ever resort to the HTML tab to hack at the HTML yourself, you’d be OK. But I do that a lot, and I also imported posts from Movable Type, which wasn’t as XHTML-aware back in the day. So I still had a massive mess on my hands. Lots and lots of XML errors.

Right about this time, in flies HTML Tidy to save the day. It can turn HTML into XHTML pretty reliably. And the Tidy-up plugin for Wordpress arranges for Tidy to do its work on each of your posts in turn and makes it super-easy to edit the posts and fix the problems.

So that’s how far I have gotten. I have a tidy blog, and I have a big file called blog.xml which is valid XML, including all the XHTML child nodes of the <content:encoded> node. Now I am hacking away on the XSLT to turn the blog into FO to make a beautiful book. As I’m not a designer, that’s the part where things fall apart for me. But I’m hoping to find a nice design template from Lulu or someone and follow that.

Tidy is also useful if you are trying to debug your FO output. Use a Makefile like this so that the error messages you get from FOP will have line numbers you can go look at instead of character offsets into the one long line that your XSLT processor emits. Tidy stupidly defaults to non-UTF8, even when you tell it it is reading an XML file, and even when the XML file has the right encoding named. So be sure to use the -utf8 flag. And don’t forget the -q flag, which shuts up the stupid advertisement the Tidy developers saw fit to put in their software. (Hello, W3C people? Yeah, we don’t care. Shut up. Thanks. okloveyabye.)

# Recipe lines must start with a tab character, not spaces.
.PHONY: all view clean

all: view

view: blog.pdf
	gnome-open $<

blog.fo: blog.xml rss-to-fop.xslt
	xsltproc --output $@ rss-to-fop.xslt $<

blog.fob: blog.fo
	tidy -utf8 -q -i -xml -o $@ $<

blog.pdf: blog.fob
	fop -fo $< $@

clean:
	rm -f blog.fob blog.fo blog.pdf

Just Married

I’ve been away from the web for a while because I was in Olivone, Ticino, Switzerland getting married!

Thanks to friends and family who came from so far away to witness such a special day.

And thanks also to our wonderful vendors, who made the day go so well. If you are thinking of putting on a wedding in Blenio, give these guys a call:

It feels different to be married. It seems like it shouldn’t… the house is the same, we still sleep on the same sides of the bed. But it’s different. Good. And safe. And happy. And… different.

Viva gli sposi, all of them, wherever they are in the world. This is why getting married is special! Now I know!

Apache FOP and document properties

I am a little bit obsessive about checking out the document properties in PDF files I read. I can’t explain why, but there you have it.

I was sad when I noticed the PDF file being emitted by my XML has no document properties. So I figured, no problem, I can just go find the right FO tags, grep for them in the DocBook XSL and reverse engineer what stuff I need to put into my DocBook to get them set right. I figured it would be something obscure like, <author> or something. In fact, DocBook already knows who the author is, in order to format the title page nicely, so no dice there.

What I found is this page on the Apache FOP FAQ that explains that FOP can’t do it. WTF? This can’t be that hard, and it really seems like a natural thing that would make it into version 1.0. (Of course, FOP is version 0.94, which might explain something as well.) In their defense, it seems like this is also braindeadness in the FO spec, because I found a commercial implementation of FO that says you have to resort to special extension namespaces to specify metadata in their implementation. But Apache FOP already has the “fox” namespace for this purpose, so no big deal.

The solution is to write your own Java program (real user friendly there, guys) that uses a nifty PDF library called iText to add on the metadata in a post processing step.

In case someone else needs it, here’s what I came up with:

/* based on example here:
   http://itextdocs.lowagie.com/tutorial/general/copystamp/index.php
*/

import java.io.FileOutputStream;
import java.util.HashMap;

import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;

public class Stamp {
    public static void main(String[] args) {
        try {
            if (args.length != 6) {
                System.out.println("Arguments expected:");
                System.out.println("  pdf-in pdf-out title subject author keywords");
            } else {
                // we create a reader for a certain document
                PdfReader reader = new PdfReader(args[0]);

                // we create a stamper that will copy the document to a new file
                PdfStamper stamp =
                    new PdfStamper(reader,
                                   new FileOutputStream(args[1]));

                // adding the metadata
                HashMap moreInfo = new HashMap();
                moreInfo.put("Title", args[2]);
                moreInfo.put("Subject", args[3]);
                moreInfo.put("Author", args[4]);
                moreInfo.put("Keywords", args[5]);
                stamp.setMoreInfo(moreInfo);

                // closing PdfStamper will generate the new PDF file
                stamp.close();
            }
        }
        catch (Exception de) {
            de.printStackTrace();
        }
    }
}

The <xen> of <xslt>

For a project I am doing right now, I descended into DocBook hell. Not completely unscathed, I made it through the learning curve (why don’t they call it what it is: The Unfathomable and Horrific Tunnel of Learning) and blinked slowly in the light of day.

I realized DocBook is nice, but it’s not actually what I wanted. Doh.

What I wanted was a structured way to represent my data, and I want two things to happen to it. Today, I want to publish my data with DocBook. Tomorrow I want someone to be able to suck the brains out of my DocBook document (leaving it to wander the earth as a zombie) and to put them into a wiki so that my project can become a community-maintained database, instead of being a single DocBook-formatted document that lives in my Subversion repository.

The way to do that is to step back from DocBook as a primary source, and instead generate DocBook from my data. I organize my original data into XML, then reformat it with XSLT from my data structure into DocBook, then DocBook reformats it into XML-FO or HTML, and those become readable documents.
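As a rough sketch of the first hop in that pipeline (my own data to DocBook), here is a tiny Java example using the XSLT processor built into the JDK. Everything specific in it is invented for illustration: the <recipes>/<recipe> data format, the class name PipelineDemo and the choice of <article>/<section> as the DocBook target. In practice I drive this step with xsltproc from the command line; the Java wrapper is only there to make the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: custom data XML -> DocBook via XSLT.
final class PipelineDemo {
    // Maps the made-up <recipes> format onto DocBook sections.
    static final String TO_DOCBOOK =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/recipes'>"
        + "<article><xsl:apply-templates select='recipe'/></article>"
        + "</xsl:template>"
        + "<xsl:template match='recipe'>"
        + "<section>"
        + "<title><xsl:value-of select='@name'/></title>"
        + "<para><xsl:value-of select='.'/></para>"
        + "</section>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Converts a document in the made-up data format into DocBook markup.
    static String toDocBook(String dataXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(TO_DOCBOOK)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(dataXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String data = "<recipes>"
            + "<recipe name='Fondue'>Melt cheese.</recipe>"
            + "<recipe name='Rosti'>Fry potatoes.</recipe>"
            + "</recipes>";
        // The DocBook output would then go through the DocBook
        // stylesheets to become FO or HTML.
        System.out.println(toDocBook(data));
    }
}
```

The point of the split is that the data file knows nothing about DocBook. A second stylesheet could turn the same data into wiki markup later without touching the source.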

To other people contemplating the same idea, I’d say “go for it”. But beware the startup cost is huge. The results are nice. Here are the things you’ll need to learn:

  • How to use your XML editor (Emacs in nXML-mode for me). You cannot just limp along with Notepad or vi. Won’t work, don’t try it.
  • How to write a schema for your new data structure. There are like 12 schema formats to choose from, but I chose Relax NG Compact Syntax because nXML-mode prefers it. Don’t use DTD or else your brain will melt. SGML = Bad. Everything post-SGML = slightly less bad. Farther from SGML is better (thus Relax NG compact = best).
  • How to load your schema into your editor. (Tip: C-c C-s C-f for nXML)
  • XSLT (which is the ugliest, stupidest, most verbose language ever foisted by computer science on users)
  • How to use XInclude to build up XML documents from pieces. Don’t skimp on this. Figure it out and use it, because the alternative is the supreme ugliness that is SGML external entities. Remember: SGML = bad, XML = slightly less bad.
  • How to make your XSLT processor work (and which of the 12 to choose — go with xsltproc, it’s super fast and stable, and doesn’t care which exact point release of Java you have. Remember: C good. Java bad. Write once, test everywhere…)

Don’t try to do it without a schema. You need to be 100% sure your data is in the right format before you go too far hacking on the XSLT, or else you’ll get confused and sad. Just suck it up and get the schema right, so that your editor can whack you with a clue-by-four before XSLT starts wasting your time going off into tag soup never-never land.

Trafigura’s West African dumping

Here’s an interesting story, well told, about an industrial process that takes refinery waste from the United States (derived from high-sulfur Mexican crude oil), cycles it through Europe, then dumps the result in West Africa.

The company running this racket (or “innovative commodity exchange”, as they call it) is Trafigura.

Learn more here:

Here’s a quick summary:

  • An arbitrage opportunity exists for energy traders based on differing regulatory frameworks in rich countries and poor ones.
  • A chemical process can turn a waste product in one jurisdiction into two outputs, gasoline usable in a lenient jurisdiction (West Africa), and the waste extracted from the original product.
  • If you buy one ship of high quality gas, you can dissolve the waste stream from several other tankers of coke gasoline into it, meaning that you can dispose of the waste stream by getting your customers to burn it for you in their cars.
  • A clever and immoral company can take advantage to squeeze profits where others just saw costs. The profits come from the externalities of burning high-sulfur gasoline (decreased longevity due to sulfur-rich smog)
  • None of this is precisely against the law. Trafigura and its contractors made minor infractions here and there, playing fast and loose with the rules. But what they are doing is fundamentally not illegal — though it should be.
  • Trafigura was working towards, or achieved, the ability to reprocess this stuff at sea, likely to further reduce the power of regulators over their work.

How much other stuff like this is going on? Who are the people that organize and operate this kind of thing? How do they sleep at night?

Gates Foundation vs the Lancet

The Lancet has published an academic paper analyzing the deployment of funds at the Gates Foundation against a backdrop of the actual burden of disease. The bottom line is the Gates Foundation does not come out looking too good, seemingly interested in whizbang gadgets and not in focusing on the job at hand. Another really interesting and sad note was the extent to which being near the Gates Foundation, geographically or culturally, gets you in the money. PATH and the University of Washington raked in the cash. African researchers? Not so much.

At the same time, the Lancet published an editorial and a commentary. Of course, being academics, you know the knives are going to come out and some serious backstabbing is going to happen. (“They fight so hard because the stakes are so low.” Sigh.) They saved the really rude things for the editorial, a particularly cowardly form of academic infighting, as that way no one has to put their name on the insults. At least the commentary is signed, though in keeping with the fact that they will be held accountable for their words, its authors are much more restrained.

The thing that most pissed me off about the Lancet’s editorial is the stuff about transparency. They whine and moan about how the Gates Foundation didn’t come ask them what to do. You know what, all you Masters of the Public Health Universe? You had your chance. You wasted 100 fucking years, and things just got worse and worse. Some of you were wanking, writing useless papers. Some of you were too busy teaching the next generation of wankers to go out and find out what it’s like to be poor. The rest of you were on public health tourism packages, in business class and five star hotels. There are no poor people in the Addis Ababa Sheraton… except the waiters, but you don’t notice them anyway, I suppose.

If the Gates Foundation wants to know what works, the only way to know is to go ask the people doing it, those laboring in obscurity in tiny, underfunded local NGOs, and those laboring in sweaty, dirty, dangerous, uncomfortable places with overfunded and overexposed NGOs like MSF.

As for the commentary, its major point (made three times over, according to my underlines on the copy I read on the bus) is that the Gates Foundation should be investing in putting into practice things that we already know work, instead of whizbang things for the future. The whizzy MPH speak for this is “service delivery”: i.e. making sure the pharmacy wasn’t cleaned out by thieves the night before the sick baby arrives in the ambulance that someone remembered to put fuel into.

I would be amenable to this argument, except that we already know why service delivery is so bad. It’s because a few people in this world are corrupt assholes, and something is wrong with the cultures where service delivery is bad that lets the corrupt assholes ruin things for those who just want to be healthy. The fact that people are corrupt assholes is not a problem. England has plenty of corrupt assholes (in fact, they seem to be in charge of the parliament here), but the NHS keeps running anyway. In Switzerland every year there are probably two or three doctors who lose their license for insurance fraud, billing for stuff they didn’t do. What’s the difference between the corrupt assholes in Switzerland and places where service delivery is failing patients? It’s good governance and accountability.

Even a short little career in humanitarian aid like I have had can make you cynical, and I’ll freely admit I am cynical. But I see hope everywhere I look, too. Good people trying to make their health system work get torn down by the system, and the system is made of a thousand corrupt assholes, from big corruption (the Minister of Health of Uganda for example) all the way down to little corruption (the numerous minor staff problems we faced every single week in Saclepea, Liberia).

The answer is that people won’t be healthy until they and their neighbors take responsibility to make a health system that works. It doesn’t matter how much the Lancet whines to the Gates Foundation, and it doesn’t really matter what the Gates Foundation invests in anyway. The demand for healthy communities needs to come from educated, organized, and disciplined communities. Whatever helps get us there, we should invest in. Whatever is unrelated to that is a distraction and not an ethical use of time and money.