Biopython Tutorial
Biopython is an open-source Python package used mainly in bioinformatics. This tutorial covers the basics of the Biopython package, an overview of bioinformatics, sequence manipulation and plotting, population genetics, cluster analysis, genome analysis and connecting to BioSQL databases, and concludes with some examples.

Audience
This tutorial is prepared for professionals who want to build a career in bioinformatics programming with Python. It is intended to make you comfortable with the core Biopython concepts and functions.

Prerequisites
This tutorial assumes that readers are already familiar with bioinformatics. A sound knowledge of Python will also be very helpful.
Advanced Sequence Operations
In this chapter, we discuss some of the advanced sequence features provided by Biopython. The examples that import Bio.Alphabet require an older Biopython release: the module was removed in Biopython 1.78, and newer releases construct Seq objects without an alphabet argument.

Complement and Reverse Complement
A nucleotide sequence can be reverse complemented to get a new sequence, and reverse complementing that result again returns the original sequence. Biopython provides two methods for this: complement and reverse_complement. The code is given below:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> nucleotide = Seq("TCGAAGTCAGTC", IUPAC.ambiguous_dna)
>>> nucleotide.complement()
Seq('AGCTTCAGTCAG', IUPACAmbiguousDNA())
>>>
Here, the complement() method complements a DNA or RNA sequence base by base. The reverse_complement() method complements the sequence and then reverses the result. It is shown below:
>>> nucleotide.reverse_complement()
Seq('GACTGACTTCGA', IUPACAmbiguousDNA())
Biopython uses the ambiguous_dna_complement dictionary provided by Bio.Data.IUPACData to do the complement operation.
>>> from Bio.Data import IUPACData
>>> import pprint
>>> pprint.pprint(IUPACData.ambiguous_dna_complement)
{'A': 'T',
 'B': 'V',
 'C': 'G',
 'D': 'H',
 'G': 'C',
 'H': 'D',
 'K': 'M',
 'M': 'K',
 'N': 'N',
 'R': 'Y',
 'S': 'S',
 'T': 'A',
 'V': 'B',
 'W': 'W',
 'X': 'X',
 'Y': 'R'}
>>>

GC Content
Genomic DNA base composition (GC content) is predicted to significantly affect genome functioning and species ecology. The GC content is the number of G and C nucleotides divided by the total number of nucleotides; the GC function reports it as a percentage. In newer Biopython releases, GC has been replaced by gc_fraction, which returns a fraction between 0 and 1. To get the GC content, import the function and apply it to a sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> nucleotide = Seq("GACTGACTTCGA", IUPAC.unambiguous_dna)
>>> GC(nucleotide)
50.0

Transcription
Transcription is the process of converting a DNA sequence into an RNA sequence. Biologically, the mRNA is produced as the reverse complement of the DNA template strand (TCAG → CUGA).
However, in bioinformatics, and so in Biopython, we typically work directly with the coding strand and get the mRNA sequence by changing the letter T to U. A simple example is as follows:
>>> from Bio.Seq import Seq
>>> from Bio.Seq import transcribe
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ATGCCGATCGTAT", IUPAC.unambiguous_dna)
>>> transcribe(dna_seq)
Seq('AUGCCGAUCGUAU', IUPACUnambiguousRNA())
>>>
To reverse the transcription, U is changed back to T as shown in the code below:
>>> rna_seq = transcribe(dna_seq)
>>> rna_seq.back_transcribe()
Seq('ATGCCGATCGTAT', IUPACUnambiguousDNA())
To get the DNA template strand, reverse complement the back-transcribed RNA as given below:
>>> rna_seq.back_transcribe().reverse_complement()
Seq('ATACGATCGGCAT', IUPACUnambiguousDNA())

Translation
Translation is the process of converting an RNA sequence into a protein sequence. Consider the RNA sequence shown below, whose length is a multiple of three so that it splits into whole codons:
>>> rna_seq = Seq("AUGGCCAUUGUA", IUPAC.unambiguous_rna)
>>> rna_seq
Seq('AUGGCCAUUGUA', IUPACUnambiguousRNA())
Now, apply the translate() method to this sequence:
>>> rna_seq.translate()
Seq('MAIV', IUPACProtein())
The RNA sequence above is simple and contains no stop codon. Now consider the RNA sequence AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGA and apply translate():
>>> rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGA", IUPAC.unambiguous_rna)
>>> rna.translate()
Seq('MAIVMGR*KGAR', HasStopCodon(IUPACProtein(), '*'))
Here, the stop codon is indicated with an asterisk '*'. The translate() method can also stop at the first stop codon. To do this, pass to_stop=True to translate() as follows:
>>> rna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
Here, translation ends at the first stop codon, and the stop codon itself is not included in the resulting sequence, so the result carries no asterisk.

Translation Table
The Genetic Codes page of the NCBI provides the full list of translation tables used by Biopython.
Let us look at the standard table to visualize the code:
>>> from Bio.Data import CodonTable
>>> table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> print(table)
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
>>>
In this table, "Stop" marks the stop codons and "(s)" marks codons that can act as start codons. Biopython uses this table to translate DNA to protein as well as to find the stop codons.
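
The operations covered in this chapter can be sketched in plain Python to show what each one computes. This is only an illustration under stated assumptions, not Biopython's implementation: the function names merely mirror the Biopython ones, only the four unambiguous DNA letters are handled, the codon table is built from the standard genetic code (NCBI table 1) listed in TCAG order, and a trailing partial codon is silently dropped.

```python
from itertools import product

# Watson-Crick pairs for unambiguous DNA only; ambiguity codes (N, R, Y, ...)
# are deliberately left out of this sketch.
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code (NCBI table 1). The amino acids are listed in the
# conventional TCAG codon order: TTT, TTC, TTA, TTG, TCT, ... GGG.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def complement(dna):
    """Swap every base for its pairing partner, keeping the order."""
    return dna.translate(_DNA_COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


def gc_percent(seq):
    """Percentage of letters that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def transcribe(coding_dna):
    """Coding strand to mRNA: replace T with U."""
    return coding_dna.replace("T", "U")


def back_transcribe(rna):
    """mRNA back to the coding strand: replace U with T."""
    return rna.replace("U", "T")


def translate(rna, to_stop=False):
    """Translate codon by codon; '*' marks a stop codon.

    With to_stop=True, translation ends before the first stop codon.
    A trailing partial codon is ignored (an assumption of this sketch).
    """
    dna = back_transcribe(rna)  # the table above is keyed by DNA codons
    protein = []
    for start in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


print(reverse_complement("TCGAAGTCAGTC"))    # GACTGACTTCGA
print(gc_percent("GACTGACTTCGA"))            # 50.0
print(transcribe("ATGCCGATCGTAT"))           # AUGCCGAUCGUAU
rna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGA"
print(translate(rna))                        # MAIVMGR*KGAR
print(translate(rna, to_stop=True))          # MAIVMGR
```

The printed values match the Biopython results shown in this chapter for the same inputs. A codon containing any letter outside T, C, A, G raises a KeyError here, whereas a real library has to handle such cases explicitly.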
Biopython – Sequence
A sequence is a series of letters used to represent an organism's protein, DNA or RNA. It is represented by the Seq class, which is defined in the Bio.Seq module. Let's create a simple sequence in Biopython as shown below:
>>> from Bio.Seq import Seq
>>> seq = Seq("AGCT")
>>> seq
Seq('AGCT')
>>> print(seq)
AGCT
Here, we have created a simple sequence AGCT. Because no alphabet is given, Biopython does not know what the letters mean: read as a protein, they stand for Alanine, Glycine, Cysteine and Threonine; read as DNA, they stand for adenine, guanine, cytosine and thymine.

Each Seq object has two important attributes:
data - the actual sequence string (AGCT)
alphabet - the type of sequence, e.g. DNA sequence, RNA sequence, etc. By default, it does not represent any particular sequence type and is generic in nature.

Alphabet Module
Seq objects carry an alphabet attribute that specifies the sequence type, its letters and the possible operations. Alphabets are defined in the Bio.Alphabet module. Note that Bio.Alphabet was removed in Biopython 1.78, so the examples in this chapter that use it require an older release. The default alphabet can be inspected as below:
>>> from Bio.Seq import Seq
>>> myseq = Seq("AGCT")
>>> myseq
Seq('AGCT')
>>> myseq.alphabet
Alphabet()
The Alphabet module provides the classes below to represent different types of sequences.

Alphabet - Base class for all types of alphabets.

SingleLetterAlphabet - Generic alphabet with letters of size one. It derives from Alphabet, and all other alphabet types derive from it.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import single_letter_alphabet
>>> test_seq = Seq("AGTACACTGGT", single_letter_alphabet)
>>> test_seq
Seq('AGTACACTGGT', SingleLetterAlphabet())

ProteinAlphabet - Generic single-letter protein alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> test_seq = Seq("AGTACACTGGT", generic_protein)
>>> test_seq
Seq('AGTACACTGGT', ProteinAlphabet())

NucleotideAlphabet - Generic single-letter nucleotide alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> test_seq = Seq("AGTACACTGGT", generic_nucleotide)
>>> test_seq
Seq('AGTACACTGGT', NucleotideAlphabet())

DNAAlphabet - Generic single-letter DNA alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> test_seq = Seq("AGTACACTGGT", generic_dna)
>>> test_seq
Seq('AGTACACTGGT', DNAAlphabet())

RNAAlphabet - Generic single-letter RNA alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_rna
>>> test_seq = Seq("AGUACACUGGU", generic_rna)
>>> test_seq
Seq('AGUACACUGGU', RNAAlphabet())

The Bio.Alphabet.IUPAC module provides the basic sequence types defined by IUPAC. It contains the following classes:
IUPACProtein (protein) - IUPAC protein alphabet of the 20 standard amino acids.
ExtendedIUPACProtein (extended_protein) - Extended uppercase IUPAC protein single-letter alphabet including X.
IUPACAmbiguousDNA (ambiguous_dna) - Uppercase IUPAC ambiguous DNA.
IUPACUnambiguousDNA (unambiguous_dna) - Uppercase IUPAC unambiguous DNA (GATC).
ExtendedIUPACDNA (extended_dna) - Extended IUPAC DNA alphabet.
IUPACAmbiguousRNA (ambiguous_rna) - Uppercase IUPAC ambiguous RNA.
IUPACUnambiguousRNA (unambiguous_rna) - Uppercase IUPAC unambiguous RNA (GAUC).

Consider a simple example for the IUPACProtein class as shown below:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("AGCT", IUPAC.protein)
>>> protein_seq
Seq('AGCT', IUPACProtein())
>>> protein_seq.alphabet
IUPACProtein()
Biopython also exposes its bioinformatics-related configuration data through the Bio.Data module. For example, IUPACData.protein_letters holds the possible letters of the IUPACProtein alphabet.
>>> from Bio.Data import IUPACData
>>> IUPACData.protein_letters
'ACDEFGHIKLMNPQRSTVWY'

Basic Operations
This section briefly explains the basic operations available in the Seq class. Sequences are similar to Python strings.
We can perform Python string operations such as slicing, counting, concatenation, find, split and strip on sequences. The snippets below show each of them.

To get the first value in a sequence:
>>> seq_string = Seq("AGCTAGCT")
>>> seq_string[0]
'A'
To get the first two values:
>>> seq_string[0:2]
Seq('AG')
To get all the values:
>>> seq_string[:]
Seq('AGCTAGCT')
To perform length and count operations:
>>> len(seq_string)
8
>>> seq_string.count("A")
2
To add two sequences:
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> seq1 = Seq("AGCT", generic_dna)
>>> seq2 = Seq("TCGA", generic_dna)
>>> seq1 + seq2
Seq('AGCTTCGA', DNAAlphabet())
Here, the two sequence objects seq1 and seq2 are both generic DNA sequences, so you can add them to produce a new sequence. You can't add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence, as shown below:
>>> dna_seq = Seq("AGTACACTGGT", generic_dna)
>>> protein_seq = Seq("AGUACACUGGU", generic_protein)
>>> dna_seq + protein_seq
...
...
TypeError: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
>>>
To add more than two sequences, first store them in a Python list, then iterate over the list with a for loop and add each sequence to a running total that starts out empty, as shown below:
>>> from Bio.Alphabet import generic_dna
>>> seq_list = [Seq("AGCT", generic_dna), Seq("TCGA", generic_dna), Seq("AAA", generic_dna)]
>>> for s in seq_list:
...     print(s)
...
AGCT
TCGA
AAA
>>> final_seq = Seq("", generic_dna)
>>> for s in seq_list:
...     final_seq = final_seq + s
...
>>> final_seq
Seq('AGCTTCGAAAA', DNAAlphabet())
The snippets below cover a few more common operations.

To change the case of a sequence:
>>> from Bio.Alphabet import generic_rna
>>> rna = Seq("agcu", generic_rna)
>>> rna.upper()
Seq('AGCU', RNAAlphabet())
To check the Python membership and identity operators:
>>> rna = Seq("agcu", generic_rna)
>>> "a" in rna
True
>>> "A" in rna
False
>>> rna1 = Seq("AGCT", generic_dna)
>>> rna is rna1
False
To find a single letter or a run of letters inside a given sequence:
>>> protein_seq = Seq("AGUACACUGGU", generic_protein)
>>> protein_seq.find("G")
1
>>> protein_seq.find("GG")
8
To split a sequence:
>>> protein_seq = Seq("AGUACACUGGU", generic_protein)
>>> protein_seq.split("A")
[Seq('', ProteinAlphabet()), Seq('GU', ProteinAlphabet()), Seq('C', ProteinAlphabet()), Seq('CUGGU', ProteinAlphabet())]
To strip whitespace from a sequence:
>>> strip_seq = Seq(" AGCT ")
>>> strip_seq
Seq(' AGCT ')
>>> strip_seq.strip()
Seq('AGCT')
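
The alphabet check on addition and the loop-based concatenation described in this section can be mimicked with a small plain-Python class. This is a hypothetical sketch, not Biopython code: TypedSeq, its kind tag and the concat helper are names invented for illustration, and the rule that kinds must match exactly is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedSeq:
    """Hypothetical stand-in for a sequence that carries a type tag."""

    letters: str
    kind: str  # e.g. "dna", "rna" or "protein" (labels chosen for this sketch)

    def __add__(self, other):
        if not isinstance(other, TypedSeq):
            return NotImplemented
        # Assumption of this sketch: only sequences of the same kind may be joined.
        if self.kind != other.kind:
            raise TypeError(
                f"Incompatible sequence kinds {self.kind!r} and {other.kind!r}"
            )
        return TypedSeq(self.letters + other.letters, self.kind)


def concat(parts, kind):
    """Join many sequences by repeated addition, starting from an empty one."""
    total = TypedSeq("", kind)
    for part in parts:
        total = total + part
    return total


parts = [TypedSeq("AGCT", "dna"), TypedSeq("TCGA", "dna"), TypedSeq("AAA", "dna")]
print(concat(parts, "dna").letters)  # AGCTTCGAAAA

try:
    TypedSeq("AGTACACTGGT", "dna") + TypedSeq("AGUACACUGGU", "protein")
except TypeError as error:
    print("TypeError:", error)
```

Putting the check inside the addition operator means every concatenation path, including the loop in concat, is protected without extra code at the call site.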