A good summary of Damian Conway’s talk at OSCON, in which he walks through some really evil code.
If it takes me longer than five minutes or so to remember something or to look it up, I’ll probably blog it.
I wanted to expand character ranges like ‘a-z’ into their explicit ‘abcdefghijklmnopqrstuvwxyz’ versions, and I wanted to do it in a single regexp. I figured this would need a call to a subroutine to do the actual expansion, but couldn’t remember how to get the regexp to do that. I wasted a fair bit of time reading the stuff about (?{ code }) expressions in the perlre man page, but that’s for running arbitrary code on the left-hand side, in the pattern to be matched.
What I really wanted was the /e flag, which makes the right-hand side evaluate as Perl:
$string =~ s/([^\\])-(.)/expandpair($1, $2)/eg;
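For completeness, here is a minimal sketch of what the subroutine behind that substitution might look like. The body of expandpair is my assumption (the post only shows the call), and it simply walks the character codes between the two endpoints:

```perl
use strict;
use warnings;

# Hypothetical implementation: expand a pair such as ('a', 'e') into 'abcde'.
sub expandpair {
    my ($from, $to) = @_;

    # A reversed range has no sensible expansion, so hand the text back unchanged.
    return "$from-$to" if ord($from) > ord($to);

    return join '', map { chr($_) } ord($from) .. ord($to);
}

my $string = 'a-f0-3_';

# /e evaluates the replacement as Perl code, /g repeats for every range found.
$string =~ s/([^\\])-(.)/expandpair($1, $2)/eg;

print "$string\n";    # prints "abcdef0123_"
```

The [^\\] in the pattern means a hyphen preceded by a backslash is left alone, so an escaped hyphen still stands for a literal hyphen.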
So, after mirod’s comments on my last post, I realised that my XML::Twig was a couple of releases behind, and that the params to att_xml_string are going to change in the near future. Whenever Fedora’s RPMs catch up, that would break my PDFs all over again.
So I stopped using att_xml_string to re-entity-reference attribute contents. Instead, I’m using HTML::Entities::encode_entities, which is probably what I should have been doing from the start, since it’s not XML::Twig’s job to make sure the XML I’m building from its output is well-formed.
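Roughly, the replacement looks like the sketch below. It assumes HTML::Entities is installed from CPAN; the variable names and the explicit list of characters to escape are illustrative choices, not necessarily what the real code uses:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Attribute value as the parser hands it back, with the entity already expanded.
my $publisher = 'Smith & Jones';

# Limit the "unsafe" set to the characters that matter inside a double-quoted
# XML attribute. Without the second argument, encode_entities also converts
# non-ASCII characters to named HTML entities, which plain XML does not define.
my $safe = encode_entities($publisher, '<>&"');

print qq{<reference publisher="$safe" />\n};
# prints: <reference publisher="Smith &amp; Jones" />
```

Passing the character list explicitly keeps the output limited to the handful of entities that every XML parser understands.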
I think the lesson here is: don’t always look for the solution at the source of the problem. And that blogs work.
XML::Twig is my preferred XML parsing module, but I’ve had a bit of a fight with it this week, because it started making my PDF generation code explode.
The short story is that this was because some of the source XML docs had entity refs in their attribute text, like this:
<reference publisher="Smith &amp; Jones" />
When XML::Twig parsed this, it returned “Smith & Jones” as the value of the ‘publisher’ attribute. This was then being written out into an intermediate XML document, which xsltproc quite rightly refused to deal with.
According to the XML spec, XML::Twig is doing the right thing:
When an entity reference appears in an attribute value, or a parameter entity reference appears in a literal entity value, its replacement text MUST be processed in place of the reference itself as though it were part of the document at the location the reference was recognized
It’s really the job of my post-parsing software to make sure that its XML output has all its entities properly referenced on the way out, but XML::Twig provides a method for me. Instead of
$element->{att}{publisher};
I’m using this:
$element->att_xml_string('publisher', '"', 1);
which makes sure that ‘&’, ‘"’, ‘<’ and ‘>’ characters in attributes remain XML-safe.
The second and third arguments aren’t very well documented. The second one tells the method which type of quote to escape (single or double), and the third switches on quoting of ‘>’, which strictly speaking doesn’t need to be quoted in XML most of the time. But why risk it?
The way I found out about the undocumented arguments: my tests started working again but were spitting out dozens of ‘Uninitialised variable’ warnings about the arguments I was not passing in.
[Edit - I'm using v3.29 of XML::Twig, which is a couple of releases behind. This post only really applies to that version - see the comments for more info]
The project I’m currently working on is based around a set of classes representing documents. These get some of their methods exposed in a simple REST web service interface, which is how the AJAX-y web front-end drives them.
Back when it started, I had two document test scripts: one which called them directly, and the other via RPC. The scripts had a tendency to get out of sync, as features would be added to the documents, incorporated into the direct test script, and left out of the RPC script.
Eventually the document script was so fat that I chopped it up into ten more behavioural scripts, which made things much easier to manage. Only now my RPC script was beyond all hope, which was a bit of a problem: if the objects didn’t work under RPC, the app would be useless.
I’ve never used mock objects for testing, but the technique I came up with to bring the RPC test up to speed reminded me a little of what I know about them. So I named them mock-documents, which is probably a misnomer.
The mock-document is initialised with the same parameters as a real document: all it contains, apart from the constructor, is an AUTOLOAD method. When any method is called on it, AUTOLOAD converts it and its parameters into a URL and tries to invoke it via RPC, returning the results in the same format as the real document would. There’s a bit of fooling around needed at this stage – most of the methods return hashrefs, but a well-defined subset return straight XML.
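A stripped-down sketch of the idea follows. Everything specific in it is an assumption made for illustration: the MockDocument package name, the base_url and id parameters, the URL layout, and the transport coderef that stands in for the real HTTP request so the example runs without a server. It also skips URL-escaping and the hashref-versus-XML handling mentioned above.

```perl
use strict;
use warnings;

package MockDocument;

our $AUTOLOAD;

# Takes the same kind of named parameters a real document might, plus a
# hypothetical 'transport' coderef that performs the actual request.
sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Any method not defined above lands here and is turned into a URL.
sub AUTOLOAD {
    my ($self, %params) = @_;

    (my $method = $AUTOLOAD) =~ s/.*:://;    # strip the package name
    return if $method eq 'DESTROY';          # never forward object destruction

    my $query = join '&', map { "$_=$params{$_}" } sort keys %params;
    my $url   = "$self->{base_url}/$self->{id}/$method";
    $url .= "?$query" if length $query;

    return $self->{transport}->($url);
}

package main;

my @requested;
my $doc = MockDocument->new(
    base_url  => 'http://localhost/doc',
    id        => 42,
    transport => sub { push @requested, $_[0]; return { ok => 1 } },
);

my $result = $doc->title(lang => 'en');
print "$requested[0]\n";    # prints "http://localhost/doc/42/title?lang=en"
```

Because the caller only ever sees ordinary method calls and return values, a test script cannot tell this object apart from a real document unless it pokes at the internals.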
The test scripts now run twice: once on the real documents, and once on the mock-document versions. All the RPC stuff is transparent to the tests, and any changes from now on will apply to both the direct and RPC documents.
All this was made much easier by the fact that all the document test scripts get their document objects from a single test document factory class – so it’s the only thing that needs to even know about the mock-documents. But that’s a topic for another post.
The reason mock-documents is a misnomer is that there really is a real document in there, behind the RPC layer.
Another advantage of this technique: anything that looks under the hood of the objects under test will break as soon as it’s run on the RPC version, so it’s a way to tell if your tests are really as black-box as you thought they were. Mine aren’t.