Biopython – Creating Simple Application
Let us create a simple Biopython application that parses a bioinformatics file and prints its content. This will help us understand the general concepts of Biopython and how it helps in the field of bioinformatics.
Step 1 − First, create a sample sequence file, “example.fasta” and put the below content into it.
>sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin)
MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAV
NNFEAHTINTVVHTNDSDKGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITID
SNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTAGQYQGLVSIILTKSTTTTTTTKGT
>sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin)
MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVS
NTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDK
NAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGNYRANITITSTIKGGGTKKGTTDKK
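The extension, fasta, refers to the file format of the sequence file. The format takes its name from the FASTA bioinformatics software in which it originated. A FASTA file holds multiple sequences arranged one after another. Each sequence starts with a header line beginning with “>” that carries its id and description, followed by one or more lines of sequence data. Biopython exposes each sequence as a record with its own id, name, description and the actual sequence data.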
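To make the layout of the format concrete, here is a minimal sketch of a FASTA reader written in plain Python. It is only an illustration of the header-plus-sequence structure, not how Biopython itself is implemented; the names read_fasta, _make_record and SAMPLE are hypothetical, and the two records are made-up data.

```python
from io import StringIO

# Two short, made-up records used only for illustration.
SAMPLE = """>seq1 first example protein
MKLKKTIGAM
ALATLFATMG
>seq2 second example protein
MLKIKYLLIG
"""


def _make_record(header, chunks):
    # The identifier is the first whitespace-delimited word of the header;
    # the full header text serves as the description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    return identifier, header, "".join(chunks)


def read_fasta(handle):
    """Yield (identifier, description, sequence) tuples from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield _make_record(header, chunks)


records = list(read_fasta(StringIO(SAMPLE)))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```

In practice it is better to rely on Biopython for this job, as the rest of this chapter does, because it covers many more formats and edge cases than this sketch.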
Step 2 − Create a new Python script, “simple_example.py”, enter the below code and save it.
from Bio.SeqIO import parse
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

file = open("example.fasta")

records = parse(file, "fasta")

for record in records:
    print("Id: %s" % record.id)
    print("Name: %s" % record.name)
    print("Description: %s" % record.description)
    print("Annotations: %s" % record.annotations)
    print("Sequence Data: %s" % record.seq)
    print("Sequence Alphabet: %s" % record.seq.alphabet)
Let us take a little deeper look into the code −
Line 1 imports the parse function available in the Bio.SeqIO module. The Bio.SeqIO module is used to read and write sequence files in different formats, and the parse function is used to parse the content of a sequence file.
Line 2 imports the SeqRecord class available in the Bio.SeqRecord module. This module is used to manipulate sequence records, and the SeqRecord class represents a particular sequence available in the sequence file. The script does not create SeqRecord objects itself; parse creates them.
Line 3 imports the Seq class available in the Bio.Seq module. This module is used to manipulate sequence data, and the Seq class represents the sequence data of a particular sequence record available in the sequence file.
Line 5 opens the “example.fasta” file using the built-in Python function, open.
Line 7 parses the content of the sequence file and returns an iterator that yields one SeqRecord object per sequence.
Lines 9-15 loop over the records using a Python for loop and print the attributes of each sequence record (SeqRecord) such as id, name, description, sequence data, etc.
Line 15 prints the type of the sequence using its alphabet attribute. This attribute exists only in Biopython versions before 1.78. Later versions removed the Bio.Alphabet module, so there this line raises an AttributeError and should be left out.
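For readers on Biopython 1.78 or later, the following is a hedged sketch of the same loop without the alphabet line. It assumes Biopython is installed, and it reads from an in-memory string (the hypothetical FASTA_TEXT, holding made-up records) instead of “example.fasta” so that it runs on its own. The with statement also closes the handle once parsing is finished.

```python
from io import StringIO

from Bio import SeqIO

# Made-up in-memory FASTA data standing in for example.fasta, so the
# sketch runs without the file; real code would open the file instead.
FASTA_TEXT = """>demo1 hypothetical first record
MKLKKTIGAMALATLF
>demo2 hypothetical second record
MLKIKYLLIG
"""

summaries = []
with StringIO(FASTA_TEXT) as handle:
    # SeqIO.parse returns an iterator, so records are read one at a time.
    for record in SeqIO.parse(handle, "fasta"):
        print("Id: %s" % record.id)
        print("Description: %s" % record.description)
        print("Sequence Data: %s" % record.seq)
        summaries.append((record.id, record.description, len(record.seq)))
```

To read the real file, replace StringIO(FASTA_TEXT) with open("example.fasta"). Because parse yields records lazily, wrap the call in list(...) if you need to go over the records more than once.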
Step 3 − Open a command prompt, go to the folder containing the sequence file, “example.fasta”, and run the below command −
> python simple_example.py
Step 4 − Python runs the script and prints all the sequence data available in the sample file, “example.fasta”. The output will be similar to the following content.
Id: sp|P25730|FMS1_ECOLI
Name: sp|P25730|FMS1_ECOLI
Description: sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin)
Annotations: {}
Sequence Data: MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAVNNFEAHTINTVVHTNDSD
KGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITIDSNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTA
GQYQGLVSIILTKSTTTTTTTKGT
Sequence Alphabet: SingleLetterAlphabet()
Id: sp|P15488|FMS3_ECOLI
Name: sp|P15488|FMS3_ECOLI
Description: sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin)
Annotations: {}
Sequence Data: MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVSNTLVGVLTLSNTSIDTVS
IASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDKNAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGN
YRANITITSTIKGGGTKKGTTDKK
Sequence Alphabet: SingleLetterAlphabet()
We have seen the parse function and two classes, SeqRecord and Seq, in this example. Together they provide most of the functionality, and we will learn more about them in the coming sections.