------------------------------------------------------------------------ Throwing Bioinformatics a [W-]curve: Perl for high-speed, high-volume DNA sequence comparison. ------------------------------------------------------------------------ A funny thing happend on the way to a petri dish: watching things grow isn't enough any more. You have to look inside them, further than a microscope can go. Recent problems in biology can no longer be handled by cell cultures alone. In particular, using the information contained in DNA involves managing huge amounts of raw data in increasingly complex ways. Academics are persuing questions on the causes of disease and origin of species; lots of money is being thrown at improving the cycle of drug discovery by pharmaceutical companies. All of this centers around finding faster, more flexable ways of comparing sequences of DNA that make up our genes. By now, everyone has seen the representation of DNA as "base pairs" as the letters "C", "G", "T", and "A". Their sequences make up the genetic code. A common question, then, is why are the comparisons so difficult, when all the DNA is just a bunch of strings? Answer: they aren't. DNA is not read as a single sequence but in chunks of 3 units (see next page). Given the [CATG] units, there are 64 combinations of DNA -- but only 20 amino acids they encode. This leaves the third base pair in each sequence partially redundent. The number of mappings ranges from 2 (ARG, SER) to one (TRP) with most having 4 encodings. Good news: biologically, we can survive copying errors. Bad news: variations make it really difficult to make the comparisons. The obvious fix would be to convert the DNA to regexen like /CA[TG]CT[CG]/ Catch: it doesn't work. The simple reason is that genes are thousands of bases long, which leaves the regexen unmanagable. The real reason is that eucaryote genes (pretty much evertyhing from yeast on up) have non-copying pieces of DNA named "interons" that don't have any affect on the resulting protein. Of the 450_000 bases in hemogloben, about 85% of the DNA never gets "used": it's snipped out of the RNA copy before being used to make proteins. The obvious fix here would be to ignore the interons -- which would work if only we could find them reliably. There are also repeats: short sequences of DNA repeated multiple times. These affect the protein without changing it too much. One example of this is a TATA repeat in hemogloben that gives us A, B, and O blood. As protiens get larger -- and their DNA encoding longer -- the range of changes that lead to largely equivalent but still different proteins also grows. Now try comparing genes between species... ... which is exactly what you need to do next. Microarrays are good for showing what is active in a cell -- for example active genes in human liver cancer cells. The problem is that you probably don't want to grow humans in order to test genetic variations and cures. You want to call the company in Maine that makes the funny hairless mice. The problem now is finding the gene in a mouse that matches most closely the gene in a human that when it goes haywire gives humans liver cancer. This requires scanning the mouse chromosome for similar-enough matches that they may perform the same function in a mouse as a person. Existing techniques for making these comparisons perform alignments between the various genes. This involves paring up bases in the seqence, introducing "gaps" where the two genes don't seem to fit: ...CATCTT---CATTGCGCC... ...CATCTTGCCCATTGCGCC... The problem here is that DNA repeats the same short sequences many times and two genes may not align at three-unit boundries: ...CATCTGTA--CTGATA... ...CATCTGTATACTGATA... Gaps are also recursive: adding one offsets all the bases downstream, which may add new gaps or may require a larger gap to push everythign back into its proper place. The most effective methods today -- BLAST, Fasta, Clustal -- cannot handle long genes or repeats well, inerons at all. Face it: for comparing genes, strings just don't cut it. Curves, however, may. Approximation techiques work by trading degrees of freedom in the problem set for a more inclusive outcome. With only one dimension to start with, strings simply don't have anything to give up. One way of converting a string of DNA into a usable curve is by using it as input to a finite state machine -- similar to Dr. Wolfram's cellular automatia. At each step the "bug" does whatever the next base in the DNA sequence tells it to. This can convert a linear string into whatever convouted shape the rules tell it to. Given a useful set of rules, the shape produced can be approached with approximation techniques. One example is to take a square with corners at [+/-1,+/-1] Label one set of opposing corners "A" and "T" the others "C" and "G". Label the bases along the gene [1..N], start the curve at [0,0,0] and let the bug walk halfway to the corner labeled with the next base. Comparing the genes then involves comparing the curves generated by two bugs walking their respective genes. Which is where I finally saw a camel peeking in under my tent. The tent was parked over a C++ based visualazation system called the W-curve (described above). The technique is pretty good for viewing gened within strings of DNA, but comparing any long set of DNA using 3D curves on a monitor can just as easily induce blidness as instanity. My mission was to find a way for the machine to make comparisons instead of increasingly feeble-eyed humans. Our goal is to boldly going were no nerd had gone before: comparing large sequences of DNA quickly. Making it effecient gives a few good examples of how to look at problems and adapt them for use with Perl. The first steps were looking at the algorithem. One of the more common things that done to a W-curve is rotating it about the z-axis. Cartesian co-ordinates make this a three step operation, with three 4x4 matrices involved. Ouch. First fix: convert the coord's to cylindrical: [r, a, Z], leaving rotation as a simple addition. Now the coordinates work, but the maximum radius is sqrt 2. Second fix: start the corners at [0,1], which gives a unit radius: >From This: A+------+------+ C (1,1) | | | | | | | | | | | | --+------+------+-- | | | | | | | | | | | | G+------+------+ T To this: +A /|\ / | \ / | \ / | \ / | \ G / | \ C (1,0) --+------+------+-- \ | / \ | / \ | / \ | / \ | / \|/ +T At this point the only thing left is to compute the half interval. Halving the distance is one way, but this requires a square root, which is expensive. Taking the arctan of the half-intervals is simple enough and involves mostly table lookups. The unit radius helps since addition and by one is simple and division by one even easier. The last modification was to notice that rotations are simple addition, and the math works equally well for any of the corners. This allows computing the half-interval from any point to (1,0) and rotating it to the appropriate corner. The sequence is: find the angle for a corner (e.g., C == 0, A == PI/2) and subtract it from the current vector's angle. After that the math for a point with respect to (1,0) is the same, with a single addition required to move the point around to where it belongs. The point is that two rather minor looking geometric shifts simplified my life on this enormously. Result: generating W-curves in Perl is a snap. Using map, of course, to convert split //, $dnastring into an array of [$radius, $angle]. Comparing the curves is where the real fun begins, and introduces one way to deal with Perl's lack of pointers for effeciently handling multiple sequences of data. People used to C bemoan the lack of pointers in Perl: "arrays just don't cut it for comparisons", they say, "becuase you have to scan the entire array to access the offsets." This is partly due to programmers misusing structures in Perl -- or not using them at all. The usual approach to comparing two lists in C is a pair of pointers: foo_t *a = list1; foo_t *b = list2; while( *a != (foo_t *)NULL && *b != (foo_t *)NULL ) { foo_compare( a, b ); } Lacking pointers in Perl, one fix is to use arrays with offsets to walk the lists: for( 0..$#a ) { foo_compare $a[$_], $b[$_]; } the problem here is that you end up walking a list for each offset, with a serious performance hit. Another approach uses a for-loop to walk one of the lists, indexing the other. This still has to walk the second list for its data: my $i = 0; for( @a ) { foo_compare $_, $b[$i++]; } Perl can iterate one list quite effectively, it's the second list that hurts. Walking the second list can be avoided if it can be consumed for processing (or duplicating it is cheap): my @c = (); for( @a ) { my $b = shift @b; foo_compare $_, $b; push @c, $b; } @b = @c; The catch here is that @b has to be regenerated for each use or cached by pushing it onto a second list in the loop. Using a referent for @b helps a bit, but can run into issues with the copy back from \@c if $b is blessed or its value used directly (e.g., as the key to a dbi-style metadata hash). My way out of this was to store the data in a single list. Instead of generating separate arrays, the W-curve uses a single list with offsets zero and one for the resulting data: my @dna = @_[0,1]; my $i = length $dna[0]; my $j = length $dna[1]; my @curvz = (); $#curvz = $i > $j ? $i : $j; for( 0, 1 ) { my @a = split //, $dna[$_]; for my $bucket ( @curvz ) { last unless @a; $bucket->[$_] = next_w_curve_value shift @a; } } This leaves the two results merged into a single array @curvz with the values being compared at offsets 0 and 1 in $curves[$i]. Comparing the curves now involves only a single pass over @curvz to complete: compare_two_curve_points $_ for( @curvz ); ... compare_two_curvepoints { my( $a, $b ) = @{$_[0]}; # or just use the referent with offsets: # # my $a = shift; # # $a->[0] and $a->[1] # # $a and $b are the values being compared. } This works well for our use becuase one of the curves will be re-used for each pass. Generating a new curve then requires only clipping the combined array to the two curves' maximum length and generating a curve with offset 1: my $string1 = shift; my $length1 = length $string1; $#curvz = $length0 > length $string1 ? $length0 : $length1; my $a = split //, $string1; $curves[0] = [ 0, 0 ]; for my $bucket ( @curvz ) { $bucket->[1] = next_w_curve_value shift @a; } ... compare_two_curve_points @$_ for( @curvz ); Looks familiar, eh? Re-merging the second curve into a retained array saves re-generating the first curve and leaves comparing them simple. My first-pass comparision function sums a value computed separately at each point and averages it over the length of the longer array. This simple average has the advantage that an array can be partitioned and the sections compared separately. For two points [r0,a0], [r1,a0] the mesasure is computed with: ( $r0, $r1 ) = ( $r1, $r0 ) if $r1 > $r0; my $score = $r0 - $r1 * cos( $a0 - $a1 ); Nice thing about cosines: cos(-a) == cos(a), so the angles don't have to be reversed. The result is a measure that includes distance along the longer radius in the difference but not deviation from it. If the two radii are at less than 90 degrees to one another then their differences will subtract, otherwise they will add; if both radii are small then their angle does not matter much. This allows the two curves to wander along in roughly the same direction with minor separations and still have a fairly low score. The idea here is to handle for variatioins caused by repeats without having to explicitly align the curves. Note that this makes no attempt to handle interons. The current method is only useful for procaryotes or cDNA samples -- which is fine since cDNA is what gets used in microarray expiraments. Using the techniques here I was able to compare the whole genomes of m.genatalaum and m.pneumonia in around 7 minutes in a single process, Perl-5.8 job on linux. This is an improvement over alignment techniques, but does not go very far towards handling larger procaryotes or anything using a eucaryotic-ish quantity of genes to run itself. The next step required a bit if cut and run, so to speak: by slicing up the DNA strings, generation and comparison tasks could hopefully be done in reasonable time. This is workable with W-curves since -- unlike alignment -- they do not modify their input data by introducing gaps: the method can be used for additive piecewise comparison. So, my next step was to partition the generation and comparison so that they could be handled in parallel. The eventual goal of this was to slurp up some time on a local 8-node Beowulf Cluster ("BC") to get the really big jobs done. BC's in their various incarnations currently make up the most powerful number-crunching systems in the known universe. The trick to using them well involves going from "embarassingly parallel" task scheduling to truly-parallel execution using message passing and distributed data. Contrary to most notions, this is an area where Perl shines: most of the nasty work in a BC involves bookkeeping to maintain list of available nodes, jobs pending and dispatched, and associate the return results from individual nodes with their total results. These tasks are nicely suited to Perl's arrays and hashes; Perl's exception handling works well with the kind of error recovery required for network errors and adaptive scheduling. The PVM module makes multi-node dispatch simple enough, with the underlying PVM library handling many of the message passing details automatically. The first step in designing message-driven code is to figure out what the messages will be. In this case reading in a specie's data, selecting a gene for the first and seconcd w-curve lists, initiating computation and comparison, partitioning, and passing back a result seemed like a decent collection of tasks. The code then had to be split into chunks to handle the various messages. This left it looking like event-driven code used in GUI design: handlers for the various messages that call processing routines to really get the work done. Fortunately for me, the messages and their handlers are fairly straightforward: pick a GenBank file for the specie's genome, pick a gene for a species and merge it into the curve array, make the comparison, return the results. Aside from the GenBank file path (which can be made short with default paths), the remaining messages are short strings or pairs of integers. One issue at this point is how to deal with partitioned comparisons. What we came up with is having the nodes generate their own segments of the W-curves using sections of the source data. Since the curve only uses one curve point and the next base, curve generation can be daisy-chained: each node returns a message of the w-curve at the end of its section, then the dispatcher kicks off another job from that point onward. The alternative is to have each node generate the entire w-curve and then compare only sections of them. The choice between these is largely a matter of memory management and time saved generating part of the curved. At this point we are still converting the code for execution in the BC and testing how well my first -- admittedly simplistic -- comparison algorithem works. Once we have worked out the kinks for m.genatalaum and m.pneumonae we can start on procaryotes with longer genomes (e.g., e.coli) and cDNA from eucaryotes. The main point of all this is that Perl can be used for serious number crunching without resoring to XS. It can also be used for programs in messy, complicated situations like genome reseasrch. The main issues are making sure that the problem space is properly defined for handling by Perl and that optimizations are designed into the process carefully instead of being ad hoc hacks. Bioinformatics is a growing, changing field. Perl's fits nicely into this darwinian environment because it adapts. Gracefully handling nested and dynamic data structures is one of its strengths, effecient integration with networking libraries is another. The trick to finding effective solutions is basing them on effective Perl rather than recycled C. The most interesting thing in projects like these is what we find in them: the origins of genes, animal models for human cancer, starting points for drug research. The fun part as a Perl hacker is seeing just how easily Perl handles it all. ------------------------------------------------------- Other places to look for information on Bioinformatics: ------------------------------------------------------- http://www.bioinfo.org/ General information, links to everything else. http://www.bioperl.org/ The bio* projects were intended to build a consistent set of bioinformatics tools in various high-level languages, Perl, Python, Java among them. The BioPerl project has gone the furthest and is the most used of these. The site describes using BioPerl and has good links to other sites. Developing Bioinformatics Computer Skills Gibas & Jambeck, O'Reilly Press. The content covers where biology and informatics meet, including both the chemestry of DNA and basics of Perl. This is a really good place for anyone to start since it covers both biology, Perl, and how they work together. http://www.ora.com/ O'Reilly's bioinformatics conference is detailed on their site, with good references to presenter information and papers. Their site also has a number of good books in the bioinformatics catagory. Genes, VII This is the standard text on genetics, and covers everything from basepairs to ribosome chemestry. Even a cursory reading of the first few chapters can be a big help in understanding gene-speak. ------------------------------------------------------------------------