Importing GenBank Data, Quickly

NCBI currently lists 1668 bacterial genome projects, counting both complete and incomplete ones. With the advent of high-throughput sequencing technologies, this number looks set to mushroom even further.

This is great news, but for bioinformaticians it poses serious challenges.

When it came to updating the xBASE database, we found ourselves in a spot of bother. Not only are there more genomes than ever to import and process, but we've also decided to include plasmids sequenced outside of genome projects, as well as viral and mitochondrial sequences.

All told, this amounts to over 12GB of raw sequence data, which was putting a major strain on our import scripts.

We estimated several weeks of crunching to get the sequences imported, but thanks to the Biopython project and some nifty database tricks, we have got the time down to under a day, which means we can scale much better in the future.

For the technically minded, the details are on the biopython-dev list.
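
This post does not spell out which database tricks were involved, so the following is only a rough illustration of one common way to speed up large imports: grouping many inserts into a few transactions instead of committing row by row. It is a minimal sketch, not the xBASE import code. The table name `seqs`, the function `bulk_insert`, the `HYPO...` accessions and the use of SQLite are all assumptions made up for the example.

```python
import sqlite3


def bulk_insert(conn, records, batch_size=1000):
    """Insert (accession, sequence) pairs in batches, committing once per batch.

    Hypothetical helper for illustration; not taken from the xBASE scripts.
    """
    cur = conn.cursor()
    sql = "INSERT INTO seqs (accession, sequence) VALUES (?, ?)"
    batch = []
    total = 0
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            cur.executemany(sql, batch)  # one call for the whole batch
            conn.commit()                # one commit per batch, not per row
            total += len(batch)
            batch = []
    if batch:  # flush whatever is left over
        cur.executemany(sql, batch)
        conn.commit()
        total += len(batch)
    return total


# In-memory database and made-up records, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seqs (accession TEXT PRIMARY KEY, sequence TEXT)")
records = ((f"HYPO{i:05d}", "ACGT" * 10) for i in range(2500))
inserted = bulk_insert(conn, records)
row_count = conn.execute("SELECT COUNT(*) FROM seqs").fetchone()[0]
```

Because `records` is a generator, the rows are streamed through in batches rather than held in memory all at once, which matters when the input runs to gigabytes.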