How Perl Saved the Human Genome Project

(drdobbs.com)

40 points · p3ll0n · 15y ago · 26 comments

hackermom · 15y ago · 3 in thread

Alternate title: How it happened to be Perl instead of any other just as capable language that saved the Human Genome Project (in the land of Dangling Participles and Allusion Errors).

jbert · 15y ago

Python, Perl and Ruby are roughly the same language. The differences between them are primarily cultural rather than technical.

I suspect the reason Perl flourished here was a combination of luck and cultural fit. Culture here includes the newbie-friendly online help (e.g. PerlMonks) and the ease of publishing and re-using components (CPAN).

adorton · 15y ago

Also, remember that when the project started, Python and Ruby didn't exist yet. Perl still wasn't the only dynamic scripting language on the block, but it was probably the most mature and best suited to this problem domain.

I wonder if Perl would still be used if the project were started today.

draegtun · 15y ago

I think a better alternate title would be: How it happened a dynamic language was used to save the Human Genome Project

Because at the time Perl was probably the only capable dynamic/scripting language.

draegtun · 15y ago · 2 in thread

Slightly related: the video presentation "Curing Cancer with Perl" by David Dooling of the Washington University Genome Center:

* part 1 - http://blip.tv/file/1997719/

* part 2 - http://blip.tv/file/1998152

* part 3 - http://blip.tv/file/2000983/

ben1040 · 15y ago

Wow, I filmed that video!

I cannot speak officially for the Genome Center, but I'll throw out there that the ORM that powers much of the GC's analysis platform is out on GitHub and CPAN.

It's actually more than an ORM in that it also supports features like automated creation/smart rewriting of class files based on database tables, quick and easy command modules that get turned into hierarchical command-line tools for free, and an automated test harness that can even parallelize onto an LSF cluster if you've got one.

GitHub: http://github.com/sakoht/UR

CPAN w/ documentation: http://search.cpan.org/dist/UR/lib/UR.pm

draegtun · 15y ago

Thanks for recording the talk. I enjoyed watching it.

pasbesoin · 15y ago · 2 in thread

In the same vein, though sometimes with less detail:

http://oreilly.com/pub/a/oreilly/perl/news/success_stories.h...

O'Reilly also published some of these in at least two folded/stapled pamphlets that were handed out for free, e.g. at conferences. I recall a finance-centered application where the Perl prototype far outperformed the subsequent implementation and ended up taking over the production role.

It looks like maintenance at that URL stopped in about 2004, but in googling "perl success stories" I saw a few more recent articles that might qualify.

rmoriz · 15y ago

IIRC, O'Reilly stopped maintaining perl.com content around that time and finally handed the domain over to the Perl Foundation a few weeks ago…

http://www.perl.com/pub/2010/07/relaunching-perlcom.html

chromatic · 15y ago

Minor nit: the abandoning occurred in 2008.

p3ll0n (OP) · 15y ago · 2 in thread

In addition to Lincoln's thoughts, I think one of the main reasons bioinformaticians are attracted to Perl is that it is forgiving. Biological data is often incomplete, fields can be missing, or a field that is expected to be present once occurs several times (because, for example, an experiment was run in duplicate), or the data was entered by hand and doesn't quite fit the expected format. Perl doesn't particularly mind if a value is empty or contains odd characters. Regular expressions can be written to pick up and correct a variety of common errors in data entry. Of course, this flexibility can also be a curse.
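To make the "forgiving" point concrete, here is a minimal sketch in Python rather than Perl. The record format (semicolon-separated `key=value` pairs) and the names `PAIR_RE` and `parse_record` are assumptions made up for illustration; they do not come from the article or from any genome project code. The sketch tolerates stray whitespace, inconsistent case, `:` used in place of `=`, repeated fields, and empty values.

```python
import re

# Hypothetical hand-entered record format: "key=value" pairs separated by ";".
# The regex accepts either "=" or ":" and ignores surrounding whitespace.
PAIR_RE = re.compile(r"^\s*([A-Za-z_]+)\s*[=:]\s*(.*?)\s*$")

def parse_record(line):
    """Parse one record into {field: [values]}, skipping unparseable chunks."""
    fields = {}
    for chunk in line.split(";"):
        m = PAIR_RE.match(chunk)
        if m is None:
            continue  # not a key/value pair at all: ignore rather than fail
        key, value = m.group(1).lower(), m.group(2)
        values = fields.setdefault(key, [])  # field is recorded even if empty
        if value:
            values.append(value)  # repeated fields accumulate in order
    return fields

record = parse_record("id=S1; Length = 120 ;length:118; ???; note=")
print(record)
```

Keeping every value of a repeated field in a list, instead of silently overwriting, is one way to limit the "curse" side of this flexibility: the duplicates stay visible to whoever consumes the data.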

chronomex · 15y ago

A paragraph very similar to this one occurs in the article.

pjscott · 15y ago

From the article:

"Perl is forgiving. Biological data is often incomplete, fields can be missing, a field that is expected to be present once occurs several times (because, for example, an experiment was run in triplicate) or the data gets entered by hand and doesn't quite fit the expected format. Perl doesn't particularly mind if a value is empty or contains odd characters. Regular expressions can be written to detect and correct a variety of common errors in data entry. Of course, this flexibility can also be a curse, as I'll discuss in more detail later."

A few words are different. The article says triplicate, and p3ll0n says duplicate, for example. But they are similar enough to use as testing input to a diff algorithm.

EDIT: Also from this guy's comment history:

http://news.ycombinator.com/item?id=1456105

Some of the phrasing looks to have been copied and pasted from this article by Jonathan Ellis:

http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosyste...

I bet if you could make a bot to do this -- go out and find relevant information, and summarize it -- you could actually provide a serious public service. As long as you cited your sources, so it's not a plagiarism-bot.
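The diff remark above is easy to try. The following is a rough sketch using Python's standard `difflib` on the one phrase where the comment and the article visibly differ; the helper name `changed_words` is made up for this illustration, and comparing word by word (not character by character) is just one reasonable choice.

```python
import difflib

def changed_words(a, b):
    """Return (old, new) pairs for every word-level difference between a and b."""
    wa, wb = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, wa, wb)
    return [
        (" ".join(wa[i1:i2]), " ".join(wb[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

comment = "an experiment was run in duplicate"
article = "an experiment was run in triplicate"
print(changed_words(comment, article))
```

For these two phrases the only reported difference is the pair ("duplicate", "triplicate").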

westbywest · 15y ago

Many moons ago, I worked on an FPGA-based platform that was among several research projects targeted at the Genome Project. The general idea was to offload BLAST-style sequence alignment to purpose-compiled FPGAs, such that sequencing across the entire dataset could be performed in an order of magnitude less time. It really wasn't all that complex (I just implemented Smith-Waterman directly, as a demonstration), only intended to perform fuzzy matches at Gbps speeds to winnow the working dataset down to a size more palatable to a desktop workstation.

My understanding is that all these projects (mine included) were cast adrift when the funding for them evaporated in the post-9/11 climate. In the intervening years, I was aware that Perl was being picked up rapidly at the genomics labs in the nearby university hospital (i.e. since we never delivered them the FPGA platform), and I'm happy to read Perl has risen to fill this niche.
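For readers who have not met Smith-Waterman, here is a minimal software sketch of its scoring recurrence in Python. It is not the FPGA design described above, which is not shown in the thread. The scoring values (match 2, mismatch -1, linear gap -2) and the function name `smith_waterman_score` are assumptions chosen for illustration. It returns only the best local alignment score, keeping two rows of the table in memory.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch ...
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or open a gap in either sequence; never drop below zero,
            # which is what makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("GGACGTGG", "CCACGTCC"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal can be computed in parallel. That data dependency pattern is the usual reason the algorithm is considered a good fit for hardware.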

dstorrs · 15y ago

The part that made me smile was when he said "In all, between one and TERAbytes of data would generated!!!!" [exaggerated emphasis mine]

I've got 3-4 terabytes of storage within a dozen feet of me as I type this; it really drives home the pace of change in computing.

blahedo · 15y ago

(1997)
