Do not meddle in the affairs of wizards, for you are crunchy and good with ketchup.
This article is about XS. It explains what it is, why it is, how it works, and how to use it. It includes complete, working examples of XS modules that you can use as models for your own code. It is an express goal of this article to provide the background and information necessary for you to write your own XS modules.
This article is in five parts
| November | Introduction | motivation, definitions, examples |
| December | Architecture | the Perl interpreter, calling conventions, data representation |
| January | Tools | h2xs, xsubpp, DynaLoader |
| February | Modules | Math::Ackermann, Set::Bit |
| March | Align::NW |
Needleman-Wunsch global optimal sequence alignment |
Align::NW
Four months ago, we set out to write an XS module. Now we have the concepts and tools that we need to complete the task. This month, we write Align::NW, an XS module to do global optimal sequence alignment.
We need to describe the sequence alignment problem in some detail so that we can talk about the methods and data structures that we are going to implement in XS.
Given two sequences
abcdeffh bdefghijkl
we want to find the best way to align them over their entire lengths: something like this
abcdeffh | ||| | b.defghijkl
This alignment has 5 matches, one mismatch, and one gap of length 1. We score the alignment according to a payoff matrix
| match | 4 |
| mismatch | -3 |
| gap open | -2 |
| gap extend | -1 |
and find that the alignment has a score of
5*4 + 1*(-3) + 1*(-2-1) = 20 - 3 - 3 = 14
Out of all possible alignments, we want to find the one with the highest score for a given payoff matrix.
We'll refer to the sequences as A and B, and to their lengths as lA and lB, respectively. The number of possible alignments is exponential in lA and lB, so enumerating and scoring them all in order to find the best one is obviously intractable.
The Needleman-Wunsch (NW) dynamic programming algorithm sets up a score matrix, with one sequence across the top and the other down the left. The entries in the matrix are called cells.
| b | c | d | e | |
| a | ||||
| b | ||||
| c | ||||
| d |
Then it fills in the matrix with scores, working from top left to bottom right.
| b | c | d | e | |
| a | -3 | -3 | -3 | -3 |
| b | 4 | 1 | 0 | -1 |
| c | 1 | 8 | 5 | 4 |
| d | 0 | 5 | 12 | 9 |
Finally, it searches for the highest scoring cell on the bottom and right edges, and backtracks from that cell through the matrix to recover the alignment.
abcd ||| bcde
The rules for scoring the matrix and backtracking aren't too difficult to explain, but we needn't concern ourselves with them. Interested readers can glean them from the Align::NW sources, or read them directly in the references given in the module POD. Proving that these rules actually lead to the optimal alignment is harder: Needleman and Wunsch got their names on the algorithm for doing that.
Here is our overall plan for implementing the NW algorithm as an XS module.
The first thing we need for any module is a name. We call this one Align::NW.
A Perl implementation of an XS module isn't always feasible, but
having one can be very useful. It allows us to address design issues in Perl, without getting tangled up in C language coding. Here is a straight Perl implementation of Align::NW.
The new method creates a new Align::NW object, like this
my $nw = { a => $a,
b => $b,
rows => $rows,
cols => $cols,
dp => $dp,
payoff => $payoff,
options => \%options };
$a and $b are the two sequences to be aligned. $dp is the score matrix; it is indexed like this
$cell = $dp->[$row][$col];
The entries in the score matrix are cells. Each cell looks like this
$cell = { row => $row,
col => $col,
score => $score,
prev => $prev };
A cell knows its own row and column. It also has a score, and a reference to a previous cell in the matrix. The prev references are used to backtrack through the matrix after all the scores are filled in.
The score method fills in the matrix. There are lA*lB cells in the matrix. Furthermore, computing the score for each cell requires a loop over rows and a loop over columns, so filling in the score matrix requires O(lA*lB*(lA+lB)) operations. This is much better than exponential, but still bad enough that it is worth implementing score in C.
The align method backtracks through the matrix and generates the alignment. Backtracking is conceptually simple, but rather intricate to implement. Furthermore, it runs in O(lA+lB). Implementing align in C would be difficult, and wouldn't improve performance.
Here is a C implementation of the score method, together with a main() to drive it. It doesn't align the sequences; it doesn't handle input or output; it doesn't have options. All it does is fill in the score matrix.
Converting a method from Perl to C raises two distinct issues
We could convert the code and not the data. C code can operate directly on Perl data: that's what the Perl C API is for. However, accessing Perl data is expensive, whether the Perl interpreter does it or our own C code does it. To realize big performance gains in the NW algorithm, we need to convert the data to C, as well.
The data structures translate directly from Perl to C:
typedef struct
{
int match;
int mismatch;
int open;
int extend;
} Payoff;
typedef struct Cell
{
int row;
int col;
int score;
struct Cell *prev;
} Cell;
typedef struct
{
Payoff payoff;
char *pszA;
char *pszB;
int nRows;
int nCols;
Cell **ppCell;
} NW;
The Cell and NW structs absolutely have to be in C for performance. The Payoff struct doesn't: it's values are constant, so we could leave it in Perl and have score make a local copy.
If we leave Payoff in Perl, then score has to access it through the Perl C API; if we convert it to C, then our Perl code has to access it through an XS interface. This is ultimately a question of style and convenience. In this implementation, I converted Payoff to C.
When data is in C, we need XS to operate on it from Perl. This means that we have to add an interface for each of our C structs. Here is an interface for the Payoff struct.
Payoff *new (int match, int mismatch, int open, int extend); void DESTROY(Payoff *pPayoff);
Since we are coding in C, and not Perl, we are responsible for our own memory management. The new routine allocates storage for the struct and initializes it; the DESTROY routine frees the storage.
The interface for the Matrix struct is similar
Matrix *new (char *pszA, char *pszB, Payoff *pPayoff); void DESTROY(Matrix *pMatrix); void score (Matrix *pMatrix); Cell *cell (Matrix *pMatrix, int nRow, int nCol); void dump (Matrix *pMatrix);
In addition to new and DESTROY routines, it has score to fill in the score matrix, cell to return a pointer to a particular cell in the matrix, and dump for debugging.
The interface for the Cell struct is very simple. The new routine for the Matrix struct also allocates and frees the cells in the matrix. All that remains for the Cell interface is accessors
int row (Cell *pCell); int col (Cell *pCell); int score(Cell *pCell); Cell *prev (Cell *pCell);
These interfaces can't stand as written, because there are two routines named new and two named DESTROY. We can fix this with the common C hack of prefixing each routine with the name of the struct on which it operates
Payoff *payoff_new (int match, int mismatch, int open, int extend); void payoff_DESTROY(Payoff *pPayoff); int cell_row (Cell *pCell); int cell_col (Cell *pCell); int cell_score (Cell *pCell); Cell *cell_prev (Cell *pCell); Matrix *matrix_new (char *pszA, char *pszB, Payoff *pPayoff); void matrix_DESTROY(Matrix *pMatrix); void matrix_score (Matrix *pMatrix); void matrix_dump (Matrix *pMatrix); Cell *matrix_cell (Matrix *pMatrix, int nRow, int nCol);
Let's review our situation. We started with a Perl module named Align::NW. To improve performance, we decided to reimplement the score method in C. The score method needs to bring its data along with it, so we created three C language structs. Each struct needs an interface so that we can manage it from Perl. Among them, these interfaces have 11 entry points. Each entry point will need XS code to connect it to Perl.
We could simply install all 11 entry points in the Align::NW package, but that lacks structure. Instead, we'll treat each struct as an object, and regard the subroutines that operate on them as methods.
In Perl, a package provides a namespace for the methods of a class. We create a separate package for each struct, and install the routines for each struct in the corresponding package. These structs are all part of the Align::NW module, so we nest the packages in the Align::NW namespace
Align::NW::MatrixAlign::NW::CellAlign::NW::Payoff
Next, we write XS code to connect our C language routines to Perl. As always, we begin with h2xs
.../development>h2xs -A -n Align::NW .../development>cp score.c Align/NW/ .../development>cp score.h Align/NW/
h2xs generates NW.xs for us
#include "EXTERN.h" #include "perl.h" #include "XSUB.h" MODULE = Align::NW PACKAGE = Align::NW
and we add
#include "score.h"
to bring in our header file.
MODULE and PACKAGE
Now consider the MODULE and PACKAGE directives. The MODULE is correct as generated. The PACKAGE directive is not. We actually need three different PACKAGE directives, one for each of our C structs. We also enable prototypes for each package.
MODULE = Align::NW PACKAGE = Align::NW::Matrix PROTOTYPES: ENABLE MODULE = Align::NW PACKAGE = Align::NW::Cell PROTOTYPES: ENABLE MODULE = Align::NW PACKAGE = Align::NW::Payoff PROTOTYPES: ENABLE
The xsubs for each package will be placed after the corresponding PACKAGE directive. Even though we are writing the Align::NW module, we don't need an Align::NW PACKAGE directive, because we aren't installing any xsubs in the Align::NW package.
We make entries in the typemap to tell xsubpp how to convert between our C data types and Perl data types. A review of the C interface shows that we aren't actually passing any structs across the interface, but rather pointers to structs. As discussed last month, we use the XS type T_PTROBJ to convert C struct pointers to Perl objects
Cell * T_PTROBJ Matrix * T_PTROBJ Payoff * T_PTROBJ
Now we're ready to start writing xsubs. The signature for matrix_new is
Matrix *matrix_new(char *pszA, char *pszB, Payoff *pPayoff);
In XS, we write this as
Matrix * matrix_new(pszA, pszB, pPayoff) char * pszA char * pszB Payoff * pPayoff
As written, this xsub will work, but it won't do everything that we want.
We want to call matrix_new as a constructor
$matrix = matrix_new Align::NW::Matrix $a, $b, $payoff;
When we do this, Perl passes the package name as the first argument. We need to declare this argument, and then add a CODE directive so that we can ignore it
Matrix * matrix_new(package, pszA, pszB, pPayoff) char * package char * pszA char * pszB Payoff * pPayoff CODE: RETVAL = matrix_new(pszA, pszB, pPayoff); OUTPUT: RETVAL
xsubpp will generate code to install matrix_new in the Align::NW::Matrix package, because we placed the matrix_new xsub after the
MODULE = Align::NW PACKAGE = Align::NW::Matrixline in
NW.xs. However, the pointer that matrix_new returns is blessed into a package that is named after the return type of the matrix_new xsub. As written, that type is Matrix *, and the corresponding Perl package is MatrixPtr, not Align::NW::Matrix.
There are two issues here
Align::NW:: prefixPtr suffix
For an external interface—one used by applications—I would want to suppress the Ptr suffix. It adds visual clutter and exposes an implementation detail. However, this is an internal interface: it is used only by the Align::NW module. In this context, the Ptr suffix isn't entirely extraneous: it helps us remember what sort of objects we are dealing with. I'll let it stand.
We definitely want the Align::NW:: prefix. Otherwise, we pollute the top-level package namespace with the names of our C structs.
Taking both the prefix and the suffix, we get a package name of Align::NW::MatrixPtr. To get our Matrix * blessed into this package, we declare the xsub like this
Align::NW::Matrix * matrix_new(package, pszA, pszB, pPayoff) ...
xsubpp will convert this to the C language declaration
Align__NW__Matrix *matrix_new(....);
We can accommodate this in our C code by adding a typedef to score.h
typedef Matrix Align__NW__Matrix;
We want our methods to be in the same package as our objects, so we change the PACKAGE directive to match
MODULE = Align::NW PACKAGE = Align::NW::MatrixPtr
All the same arguments apply to the Score and Payoff structs, and we use corresponding typedefs and PACKAGE directives for them.
We distinguish the names of the subroutines in our C interface by prefixing them with the names of the structs on which they operate. However, we don't need to do this in Perl code: the methods for different structs are segregated into different packages. When we write
$matrix = matrix_new Align::NW::MatrixPtr $a, $b, $payoff;
the matrix_ prefix is ugly and redundant. Furthermore, we have a method named matrix_DESTROY that serves as a destructor. This won't work at all: Perl expects the destructor for an Align::NW::MatrixPtr object to be named Align::NW::MatrixPtr::DESTROY, not Align::NW::MatrixPtr::matrix_DESTROY.
We use a PREFIX directive to suppress the matrix_ prefix in Perl code. The PREFIX directive goes on the same line as the MODULE and PACKAGE directives. If we write
MODULE = Align::NW PACKAGE = Align::NW::MatrixPtr PREFIX = matrix_
then xsubpp deletes the string matrix_ from the beginning of each xsub name before it installs it. The name of our constructor is then Align::NW::MatrixPtr::new, and we can write
$matrix = new Align::NW::MatrixPtr $a, $b, $payoff;
Here are the final MODULE lines for our three C structs
MODULE = Align::NW PACKAGE = Align::NW::MatrixPtr PREFIX = matrix_ MODULE = Align::NW PACKAGE = Align::NW::CellPtr PREFIX = cell_ MODULE = Align::NW PACKAGE = Align::NW::PayoffPtr PREFIX = payoff_
The remaining xsubs are straightforward. Align::NW::PayoffPtr::payoff_new is a constructor, so it needs a CODE directive, like Align::NW::MatrixPtr::matrix_new. For the other xsubs, all we have to write is the signature.
We changed the names of our datatypes when we added the Align::NW:: prefixes and accepted the Ptr suffix. We need to change the entries in the typemap to match
Align::NW::Cell * T_PTROBJ Align::NW::Matrix * T_PTROBJ Align::NW::Payoff * T_PTROBJ
Now we can compile and build our module. Add an OBJECT key to the WriteMakefile() call in Makefile.PL
OBJECT => 'NW.o score.o',
and do
.../development/Align/NW>perl Makefile.PL .../development/Align/NW>make test
The results should be
t/nw................ok All tests successful.
Here are sources and distribution for the XS implementation of Align::NW
For 4 months, the introduction to this article promised a stub module that you could use a starting point for your own XS modules. It didn't work out that way. As always, there is more than one way to do it. Math::Ackermann, Bit::Vector, and Align::NW illustrate three different ways to write XS modules. All of them can serve as models for your modules.
dynamic programming
Align::NW
Align:: is probably too generic a term for a top-level name in the CPAN, but it is adequate for our purposes. If we implemented the related Smith-Waterman algorithm for local optimal sequence alignment, we could call it Align::SW.
dp
dp stands for dynamic programming.