Deprecated: The each() function is deprecated. This message will be suppressed on further calls in /home/zhenxiangba/zhenxiangba.com/public_html/phproxy-improved-master/index.php on line 456
XS Mechanics

Do not meddle in the affairs of wizards, for you are crunchy and good with ketchup.

XS Mechanics

This article is about XS. It explains what it is, why it is, how it works, and how to use it. It includes complete, working examples of XS modules that you can use as models for your own code. It is an express goal of this article to provide the background and information necessary for you to write your own XS modules.

This article is in five parts

November	Introduction	motivation, definitions, examples
December	Architecture	the Perl interpreter, calling conventions, data representation
January	Tools	h2xs, xsubpp, DynaLoader
February	Modules	`Math::Ackermann`, `Set::Bit`
March	`Align::NW`	Needleman-Wunsch global optimal sequence alignment

`Align::NW`

Four months ago, we set out to write an XS module. Now we have the concepts and tools that we need to complete the task. This month, we write Align::NW, an XS module to do global optimal sequence alignment.

Sequence Alignment

We need to describe the sequence alignment problem in some detail so that we can talk about the methods and data structures that we are going to implement in XS.

Given two sequences

	abcdeffh
	bdefghijkl

we want to find the best way to align them over their entire lengths: something like this

	abcdeffh
	 | ||| |
	 b.defghijkl

This alignment has 5 matches, one mismatch, and one gap of length 1. We score the alignment according to a payoff matrix

match	4
mismatch	-3
gap open	-2
gap extend	-1

and find that the alignment has a score of

	5*4 + 1*(-3) + 1*(-2-1) = 20 - 3 - 3 = 14

Out of all possible alignments, we want to find the one with the highest score for a given payoff matrix.

We'll refer to the sequences as A and B, and to their lengths as lA and lB, respectively. The number of possible alignments is exponential in lA and lB, so enumerating and scoring them all in order to find the best one is obviously intractable.

The Needleman-Wunsch Algorithm

The Needleman-Wunsch (NW) dynamic programming algorithm sets up a score matrix, with one sequence across the top and the other down the left. The entries in the matrix are called cells.

	b	c	d	e
a
b
c
d

Then it fills in the matrix with scores, working from top left to bottom right.

	b	c	d	e
a	-3	-3	-3	-3
b	4	1	0	-1
c	1	8	5	4
d	0	5	12	9

Finally, it searches for the highest scoring cell on the bottom and right edges, and backtracks from that cell through the matrix to recover the alignment.

	abcd
	 |||
	 bcde

The rules for scoring the matrix and backtracking aren't too difficult to explain, but we needn't concern ourselves with them. Interested readers can glean them from the Align::NW sources, or read them directly in the references given in the module POD. Proving that these rules actually lead to the optimal alignment is harder: Needleman and Wunsch got their names on the algorithm for doing that.

The Plan

Here is our overall plan for implementing the NW algorithm as an XS module.

Implement the algorithm as a straight Perl module
Analyze (or benchmark) the code for performance
Reimplement performance-critical methods in C
Write XS to connect the C routines to the Perl module

Perl Implementation

The first thing we need for any module is a name. We call this one Align::NW.

A Perl implementation of an XS module isn't always feasible, but having one can be very useful. It allows us to address design issues in Perl, without getting tangled up in C language coding. Here is a straight Perl implementation of Align::NW.

The new method creates a new Align::NW object, like this

    my $nw = { a       => $a,
	       b       => $b,
	       rows    => $rows,
	       cols    => $cols,
	       dp      => $dp,
	       payoff  => $payoff,
	       options => \%options };

$a and $b are the two sequences to be aligned. $dp is the score matrix; it is indexed like this

	$cell = $dp->[$row][$col];

The entries in the score matrix are cells. Each cell looks like this

	$cell = { row   => $row,
		  col   => $col,
		  score => $score,
		  prev  => $prev };

A cell knows its own row and column. It also has a score, and a reference to a previous cell in the matrix. The prev references are used to backtrack through the matrix after all the scores are filled in.

Performance analysis

The score method fills in the matrix. There are lA*lB cells in the matrix. Furthermore, computing the score for each cell requires a loop over rows and a loop over columns, so filling in the score matrix requires O(lA*lB*(lA+lB)) operations. This is much better than exponential, but still bad enough that it is worth implementing score in C.

The align method backtracks through the matrix and generates the alignment. Backtracking is conceptually simple, but rather intricate to implement. Furthermore, it runs in O(lA+lB). Implementing align in C would be difficult, and wouldn't improve performance.

C Implementation

Here is a C implementation of the score method, together with a main() to drive it. It doesn't align the sequences; it doesn't handle input or output; it doesn't have options. All it does is fill in the score matrix.

Converting a method from Perl to C raises two distinct issues

converting code
converting data

We could convert the code and not the data. C code can operate directly on Perl data: that's what the Perl C API is for. However, accessing Perl data is expensive, whether the Perl interpreter does it or our own C code does it. To realize big performance gains in the NW algorithm, we need to convert the data to C, as well.

C structs

The data structures translate directly from Perl to C:

	typedef struct
	{
	    int match;
	    int mismatch;
	    int open;
	    int extend;
	} Payoff;

	typedef struct Cell
	{
	    int          row;
	    int          col;
	    int          score;
	    struct Cell *prev;
	} Cell;

	typedef struct
	{
	    Payoff   payoff;
	    char    *pszA;
	    char    *pszB;
	    int      nRows;
	    int      nCols;
	    Cell   **ppCell;
	} NW;

The Cell and NW structs absolutely have to be in C for performance. The Payoff struct doesn't: it's values are constant, so we could leave it in Perl and have score make a local copy.

If we leave Payoff in Perl, then score has to access it through the Perl C API; if we convert it to C, then our Perl code has to access it through an XS interface. This is ultimately a question of style and convenience. In this implementation, I converted Payoff to C.

C struct interfaces

When data is in C, we need XS to operate on it from Perl. This means that we have to add an interface for each of our C structs. Here is an interface for the Payoff struct.

	Payoff *new    (int match, int mismatch, int open, int extend);
	void    DESTROY(Payoff *pPayoff);

Since we are coding in C, and not Perl, we are responsible for our own memory management. The new routine allocates storage for the struct and initializes it; the DESTROY routine frees the storage.

The interface for the Matrix struct is similar

	Matrix *new    (char *pszA, char *pszB, Payoff *pPayoff);
	void    DESTROY(Matrix *pMatrix);
	void    score  (Matrix *pMatrix);
	Cell   *cell   (Matrix *pMatrix, int nRow, int nCol);
	void    dump   (Matrix *pMatrix);

In addition to new and DESTROY routines, it has score to fill in the score matrix, cell to return a pointer to a particular cell in the matrix, and dump for debugging.

The interface for the Cell struct is very simple. The new routine for the Matrix struct also allocates and frees the cells in the matrix. All that remains for the Cell interface is accessors

	int   row  (Cell *pCell);
	int   col  (Cell *pCell);
	int   score(Cell *pCell);
	Cell *prev (Cell *pCell);

These interfaces can't stand as written, because there are two routines named new and two named DESTROY. We can fix this with the common C hack of prefixing each routine with the name of the struct on which it operates

	Payoff *payoff_new    (int match, int mismatch, int open, int extend);
	void    payoff_DESTROY(Payoff *pPayoff);
	
	int     cell_row      (Cell *pCell);
	int     cell_col      (Cell *pCell);
	int     cell_score    (Cell *pCell);
	Cell   *cell_prev     (Cell *pCell);
	
	Matrix *matrix_new    (char *pszA, char *pszB, Payoff *pPayoff);
	void    matrix_DESTROY(Matrix *pMatrix);
	void    matrix_score  (Matrix *pMatrix);
	void    matrix_dump   (Matrix *pMatrix);
	Cell   *matrix_cell   (Matrix *pMatrix, int nRow, int nCol);

From structs to objects

Let's review our situation. We started with a Perl module named Align::NW. To improve performance, we decided to reimplement the score method in C. The score method needs to bring its data along with it, so we created three C language structs. Each struct needs an interface so that we can manage it from Perl. Among them, these interfaces have 11 entry points. Each entry point will need XS code to connect it to Perl.

We could simply install all 11 entry points in the Align::NW package, but that lacks structure. Instead, we'll treat each struct as an object, and regard the subroutines that operate on them as methods.

In Perl, a package provides a namespace for the methods of a class. We create a separate package for each struct, and install the routines for each struct in the corresponding package. These structs are all part of the Align::NW module, so we nest the packages in the Align::NW namespace

Align::NW::Matrix
Align::NW::Cell
Align::NW::Payoff

XS code

Next, we write XS code to connect our C language routines to Perl. As always, we begin with h2xs

	.../development>h2xs -A -n Align::NW
	.../development>cp score.c Align/NW/
	.../development>cp score.h Align/NW/

h2xs generates NW.xs for us

	#include "EXTERN.h"
	#include "perl.h"
	#include "XSUB.h"

	MODULE = Align::NW              PACKAGE = Align::NW

and we add

	#include "score.h"

to bring in our header file.

`MODULE` and `PACKAGE`

Now consider the MODULE and PACKAGE directives. The MODULE is correct as generated. The PACKAGE directive is not. We actually need three different PACKAGE directives, one for each of our C structs. We also enable prototypes for each package.

	MODULE = Align::NW   PACKAGE = Align::NW::Matrix
	PROTOTYPES: ENABLE
	
	MODULE = Align::NW   PACKAGE = Align::NW::Cell
	PROTOTYPES: ENABLE
	
	MODULE = Align::NW   PACKAGE = Align::NW::Payoff
	PROTOTYPES: ENABLE

The xsubs for each package will be placed after the corresponding PACKAGE directive. Even though we are writing the Align::NW module, we don't need an Align::NW PACKAGE directive, because we aren't installing any xsubs in the Align::NW package.

typemap

We make entries in the typemap to tell xsubpp how to convert between our C data types and Perl data types. A review of the C interface shows that we aren't actually passing any structs across the interface, but rather pointers to structs. As discussed last month, we use the XS type T_PTROBJ to convert C struct pointers to Perl objects

	Cell   *       T_PTROBJ
	Matrix *       T_PTROBJ
	Payoff *       T_PTROBJ

xsubs

Now we're ready to start writing xsubs. The signature for matrix_new is

	Matrix *matrix_new(char *pszA, char *pszB, Payoff *pPayoff);

In XS, we write this as

	Matrix *
	matrix_new(pszA, pszB, pPayoff)
	        char   * pszA
	        char   * pszB
	        Payoff * pPayoff

As written, this xsub will work, but it won't do everything that we want.

constructor

We want to call matrix_new as a constructor

	$matrix = matrix_new Align::NW::Matrix $a, $b, $payoff;

When we do this, Perl passes the package name as the first argument. We need to declare this argument, and then add a CODE directive so that we can ignore it

	Matrix *
	matrix_new(package, pszA, pszB, pPayoff)
	        char   * package
	        char   * pszA
	        char   * pszB
	        Payoff * pPayoff
	        CODE:
	        RETVAL = matrix_new(pszA, pszB, pPayoff);
	        OUTPUT:
	        RETVAL

package

xsubpp will generate code to install matrix_new in the Align::NW::Matrix package, because we placed the matrix_new xsub after the

	MODULE = Align::NW   PACKAGE = Align::NW::Matrix

line in NW.xs. However, the pointer that matrix_new returns is blessed into a package that is named after the return type of the matrix_new xsub. As written, that type is Matrix *, and the corresponding Perl package is MatrixPtr, not Align::NW::Matrix.

There are two issues here

getting an Align::NW:: prefix
suppressing the Ptr suffix

For an external interface—one used by applications—I would want to suppress the Ptr suffix. It adds visual clutter and exposes an implementation detail. However, this is an internal interface: it is used only by the Align::NW module. In this context, the Ptr suffix isn't entirely extraneous: it helps us remember what sort of objects we are dealing with. I'll let it stand.

We definitely want the Align::NW:: prefix. Otherwise, we pollute the top-level package namespace with the names of our C structs.

Taking both the prefix and the suffix, we get a package name of Align::NW::MatrixPtr. To get our Matrix * blessed into this package, we declare the xsub like this

	Align::NW::Matrix *
	matrix_new(package, pszA, pszB, pPayoff)
	...

xsubpp will convert this to the C language declaration

	Align__NW__Matrix *matrix_new(....);

We can accommodate this in our C code by adding a typedef to score.h

	typedef Matrix Align__NW__Matrix;

We want our methods to be in the same package as our objects, so we change the PACKAGE directive to match

	MODULE = Align::NW      PACKAGE = Align::NW::MatrixPtr

All the same arguments apply to the Score and Payoff structs, and we use corresponding typedefs and PACKAGE directives for them.

Method name

We distinguish the names of the subroutines in our C interface by prefixing them with the names of the structs on which they operate. However, we don't need to do this in Perl code: the methods for different structs are segregated into different packages. When we write

	$matrix = matrix_new Align::NW::MatrixPtr $a, $b, $payoff;

the matrix_ prefix is ugly and redundant. Furthermore, we have a method named matrix_DESTROY that serves as a destructor. This won't work at all: Perl expects the destructor for an Align::NW::MatrixPtr object to be named Align::NW::MatrixPtr::DESTROY, not Align::NW::MatrixPtr::matrix_DESTROY.

We use a PREFIX directive to suppress the matrix_ prefix in Perl code. The PREFIX directive goes on the same line as the MODULE and PACKAGE directives. If we write

	MODULE = Align::NW  PACKAGE = Align::NW::MatrixPtr  PREFIX = matrix_

then xsubpp deletes the string matrix_ from the beginning of each xsub name before it installs it. The name of our constructor is then Align::NW::MatrixPtr::new, and we can write

	$matrix = new Align::NW::MatrixPtr $a, $b, $payoff;

Here are the final MODULE lines for our three C structs

	MODULE = Align::NW  PACKAGE = Align::NW::MatrixPtr  PREFIX = matrix_
		
	MODULE = Align::NW  PACKAGE = Align::NW::CellPtr    PREFIX = cell_
		
	MODULE = Align::NW  PACKAGE = Align::NW::PayoffPtr  PREFIX = payoff_

The remaining xsubs are straightforward. Align::NW::PayoffPtr::payoff_new is a constructor, so it needs a CODE directive, like Align::NW::MatrixPtr::matrix_new. For the other xsubs, all we have to write is the signature.

Typemap revisited

We changed the names of our datatypes when we added the Align::NW:: prefixes and accepted the Ptr suffix. We need to change the entries in the typemap to match

	Align::NW::Cell   *	T_PTROBJ
	Align::NW::Matrix *	T_PTROBJ
	Align::NW::Payoff *	T_PTROBJ

Test

Now we can compile and build our module. Add an OBJECT key to the WriteMakefile() call in Makefile.PL

	OBJECT => 'NW.o score.o',

and do

	.../development/Align/NW>perl Makefile.PL
	.../development/Align/NW>make test

The results should be

	t/nw................ok
	All tests successful.

Distribution

Here are sources and distribution for the XS implementation of Align::NW

Stub Modules

For 4 months, the introduction to this article promised a stub module that you could use a starting point for your own XS modules. It didn't work out that way. As always, there is more than one way to do it. Math::Ackermann, Bit::Vector, and Align::NW illustrate three different ways to write XS modules. All of them can serve as models for your modules.

NOTES

dynamic programming: Here is a brief description of dynamic programming, in the context of a simpler problem.
Align::NW: Align:: is probably too generic a term for a top-level name in the CPAN, but it is adequate for our purposes. If we implemented the related Smith-Waterman algorithm for local optimal sequence alignment, we could call it Align::SW.
dp: dp stands for dynamic programming.

Steven McDougall / resume / swmcd@world.std.com / 2000 March 10