Biopython – Phenotype Microarray

A phenotype is an observable character or trait exhibited by an organism in response to a particular chemical or environment. A phenotype microarray measures the reaction of an organism to a large number of chemicals and environments simultaneously, and the resulting data is analyzed to understand gene mutations, gene characteristics, etc. Biopython provides the Bio.phenotype module to analyze phenotypic data. Let us learn how to parse, interpolate, extract and analyze phenotype microarray data in this chapter.

Parsing

Phenotype microarray data can be in two formats: CSV and JSON. Biopython supports both. The parser reads the phenotype microarray data and returns it as a collection of PlateRecord objects. Each PlateRecord object contains a collection of WellRecord objects arranged in 8 rows and 12 columns. The eight rows are labelled A to H and the 12 columns are labelled 01 to 12. For example, the well in the 4th row and 6th column is D06.

Let us understand the format and the concept of parsing with the following example.

Step 1: Download the Plates.csv file provided by the Biopython team: https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/Plates.csv

Step 2: Load the phenotype module as below:

>>> from Bio import phenotype

Step 3: Invoke the phenotype.parse method, passing the data file and the format option ("pm-csv").
It returns an iterable of PlateRecord objects, which we collect into a list as below:

>>> plates = list(phenotype.parse("Plates.csv", "pm-csv"))
>>> plates
[PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['H12']'),
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['H12']'),
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['H12']'),
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['H12']')]
>>>

Step 4: Access the first plate from the list as below:

>>> plate = plates[0]
>>> plate
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['H12']')
>>>

Step 5: As discussed earlier, a plate contains 8 rows, each having 12 wells. A WellRecord can be accessed in two ways, by its identifier or by zero-based (row, column) indices. Both lines below select well A04:

>>> well = plate["A04"]
>>> well = plate[0, 3]
>>> well
WellRecord('(0.0, 0.0), (0.25, 0.0), (0.5, 0.0), (0.75, 0.0), (1.0, 0.0), ..., (71.75, 388.0)')
>>>

Step 6: Each well holds a series of measurements at different time points, which can be accessed with a for loop as below:

>>> for v1, v2 in well:
...     print(v1, v2)
...
0.0 0.0
0.25 0.0
0.5 0.0
0.75 0.0
1.0 0.0
...
71.25 388.0
71.5 388.0
71.75 388.0
>>>

Interpolation

Interpolation estimates the signal at time points that were not measured. Biopython provides methods to interpolate WellRecord data for such intermediate time points. The syntax is similar to list indexing and is therefore easy to learn.

To get the data at 20.1 hours, pass the time as the index value:

>>> well[20.10]
69.40000000000003
>>>

We can also pass a start time point and an end time point:

>>> well[20:30]
[67.0, 84.0, 102.0, 119.0, 135.0, 147.0, 158.0, 168.0, 179.0, 186.0]
>>>

The above command interpolates data from 20 hours to 30 hours at 1 hour intervals. By default, the interval is 1 hour and we can change it to any value.
For example, let us use a 15 minute (0.25 hour) interval as below:

>>> well[20:21:0.25]
[67.0, 73.0, 75.0, 81.0]
>>>

Analyze and Extract

Biopython provides a fit method to analyze WellRecord data using the Gompertz, Logistic and Richards sigmoid functions. By default, the fit method tries the Gompertz function. We call the fit method of the WellRecord object to get the task done. The session below was run on a system without scipy installed:

>>> well.fit()
Traceback (most recent call last):
...
Bio.MissingPythonDependencyError: Install scipy to extract curve parameters.
>>> well.model
>>> getattr(well, 'min')
0.0
>>> getattr(well, 'max')
388.0
>>> getattr(well, 'average_height')
205.42708333333334
>>>

Biopython depends on the scipy module for the curve fitting itself. As the session shows, the min, max and average_height values are still available without scipy, while the fitted model is not.
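The well naming scheme and the interpolation described in this chapter can be pictured with a few lines of plain Python. This is only a sketch and not Biopython's implementation: the helper names well_id and interpolate are hypothetical, and the sketch assumes straight-line interpolation between the two neighbouring measurements. The two sample points are the values shown above for 20.0 and 20.25 hours.

```python
import string

ROWS = string.ascii_uppercase[:8]  # "ABCDEFGH", the eight plate rows


def well_id(row, col):
    """Convert zero-based (row, col) indices to a well ID such as 'D06'."""
    if not (0 <= row < 8 and 0 <= col < 12):
        raise IndexError("row must be 0-7 and col must be 0-11")
    return f"{ROWS[row]}{col + 1:02d}"


def interpolate(points, t):
    """Linearly interpolate the signal at time t.

    points is a list of (time, signal) pairs sorted by time.
    Assumption of this sketch: a straight line between neighbours.
    """
    for (t0, y0), (t1, y1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return y0 + (y1 - y0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the measured time range")


print(well_id(3, 5))  # 4th row, 6th column
print(well_id(0, 3))  # 1st row, 4th column
print(interpolate([(20.0, 67.0), (20.25, 73.0)], 20.1))
```

The last line prints a value very close to 69.4, which agrees with the well[20.10] result shown earlier.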

Biopython – Population Genetics

Population genetics plays an important role in evolutionary theory. It analyzes the genetic differences between species as well as between two or more individuals within the same species. Biopython provides the Bio.PopGen module for population genetics. It mainly supports GenePop, a popular population genetics package developed by Michel Raymond and Francois Rousset.

A simple parser

Let us write a simple application to parse the GenePop format and understand the concept.

Download the GenePop file provided by the Biopython team from the link given below: https://raw.githubusercontent.com/biopython/biopython/master/Tests/PopGen/c3line.gen

Load the GenePop module using the below code snippet:

>>> from Bio.PopGen import GenePop

Parse the file using the GenePop.read method as below:

>>> record = GenePop.read(open("c3line.gen"))

Show the loci and population information as given below:

>>> record.loci_list
['136255903', '136257048', '136257636']
>>> record.pop_list
['4', 'b3', '5']
>>> record.populations
[[('1', [(3, 3), (4, 4), (2, 2)]), ('2', [(3, 3), (3, 4), (2, 2)]), ('3', [(3, 3), (4, 4), (2, 2)]), ('4', [(3, 3), (4, 3), (None, None)])],
[('b1', [(None, None), (4, 4), (2, 2)]), ('b2', [(None, None), (4, 4), (2, 2)]), ('b3', [(None, None), (4, 4), (2, 2)])],
[('1', [(3, 3), (4, 4), (2, 2)]), ('2', [(3, 3), (1, 4), (2, 2)]), ('3', [(3, 2), (1, 1), (2, 2)]), ('4', [(None, None), (4, 4), (2, 2)]), ('5', [(3, 3), (4, 4), (2, 2)])]]
>>>

Here, the file contains three loci and three populations: the first population has 4 individuals, the second has 3 and the third has 5. record.populations shows every population with the allele data of each individual for each locus. Each entry in record.pop_list is the name of the last individual in the corresponding population.

Manipulate the GenePop file

Biopython provides options to remove locus and population data.
Remove a population by position:

>>> record.remove_population(0)
>>> record.populations
[[('b1', [(None, None), (4, 4), (2, 2)]), ('b2', [(None, None), (4, 4), (2, 2)]), ('b3', [(None, None), (4, 4), (2, 2)])],
[('1', [(3, 3), (4, 4), (2, 2)]), ('2', [(3, 3), (1, 4), (2, 2)]), ('3', [(3, 2), (1, 1), (2, 2)]), ('4', [(None, None), (4, 4), (2, 2)]), ('5', [(3, 3), (4, 4), (2, 2)])]]
>>>

Remove a locus by position:

>>> record.remove_locus_by_position(0)
>>> record.loci_list
['136257048', '136257636']
>>> record.populations
[[('b1', [(4, 4), (2, 2)]), ('b2', [(4, 4), (2, 2)]), ('b3', [(4, 4), (2, 2)])],
[('1', [(4, 4), (2, 2)]), ('2', [(1, 4), (2, 2)]), ('3', [(1, 1), (2, 2)]), ('4', [(4, 4), (2, 2)]), ('5', [(4, 4), (2, 2)])]]
>>>

Remove a locus by name:

>>> record.remove_locus_by_name('136257636')
>>> record.loci_list
['136257048']
>>> record.populations
[[('b1', [(4, 4)]), ('b2', [(4, 4)]), ('b3', [(4, 4)])],
[('1', [(4, 4)]), ('2', [(1, 4)]), ('3', [(1, 1)]), ('4', [(4, 4)]), ('5', [(4, 4)])]]
>>>

Interface with GenePop Software

Biopython provides interfaces to the GenePop software and thereby exposes a lot of its functionality. The Bio.PopGen.GenePop module is used for this purpose. One such easy-to-use interface is EasyController. Let us check how to parse a GenePop file and do some analysis using EasyController.

First, install the GenePop software and add its installation folder to the system path. To get basic information about a GenePop file, create an EasyController object and then call the get_basic_info method as below:

>>> from Bio.PopGen.GenePop.EasyController import EasyController
>>> ec = EasyController('c3line.gen')
>>> print(ec.get_basic_info())
(['4', 'b3', '5'], ['136255903', '136257048', '136257636'])
>>>

Here, the first item is the population list and the second item is the loci list.
To get the list of all alleles of a particular locus, call the get_alleles_all_pops method, passing the locus name:

>>> allele_list = ec.get_alleles_all_pops("136255903")
>>> print(allele_list)
[2, 3]

To get the allele list for a specific population and locus, call get_alleles, passing the population position and the locus name:

>>> allele_list = ec.get_alleles(0, "136255903")
>>> print(allele_list)
[]
>>> allele_list = ec.get_alleles(1, "136255903")
>>> print(allele_list)
[]
>>> allele_list = ec.get_alleles(2, "136255903")
>>> print(allele_list)
[2, 3]
>>>

Similarly, EasyController exposes many other functions: allele frequency, genotype frequency, multilocus F statistics, Hardy-Weinberg equilibrium, linkage disequilibrium, etc.
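To make the record.populations structure and the idea of an allele frequency more concrete, the following sketch counts alleles by hand in plain Python. It does not call GenePop or EasyController, and allele_frequencies is a hypothetical helper written for this illustration. The data is the third population from the record.populations output shown earlier in this chapter.

```python
from collections import Counter

# Third population copied from the record.populations output above:
# each entry is (individual name, [genotype at locus 0, locus 1, locus 2]).
population = [
    ("1", [(3, 3), (4, 4), (2, 2)]),
    ("2", [(3, 3), (1, 4), (2, 2)]),
    ("3", [(3, 2), (1, 1), (2, 2)]),
    ("4", [(None, None), (4, 4), (2, 2)]),
    ("5", [(3, 3), (4, 4), (2, 2)]),
]


def allele_frequencies(pop, locus_index):
    """Return {allele: frequency} for one locus, skipping missing data (None)."""
    counts = Counter()
    for _name, genotypes in pop:
        for allele in genotypes[locus_index]:
            if allele is not None:
                counts[allele] += 1
    total = sum(counts.values())
    return {allele: n / total for allele, n in sorted(counts.items())}


freqs = allele_frequencies(population, 0)
print(sorted(freqs))  # [2, 3]
print(freqs)          # {2: 0.125, 3: 0.875}
```

The allele set [2, 3] matches what get_alleles(2, "136255903") reports above. The frequencies are simple proportions of the eight non-missing allele copies.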

Biopython – Sequence Alignments

Sequence alignment is the process of arranging two or more sequences (DNA, RNA or protein) in a specific order to identify the regions of similarity between them. Identifying the similar regions lets us infer a lot of information, such as which traits are conserved between species, how genetically close different species are, how species evolve, etc. Biopython provides extensive support for sequence alignment.

Let us learn some of the important features provided by Biopython in this chapter.

Parsing Sequence Alignment

Biopython provides a module, Bio.AlignIO, to read and write sequence alignments. In bioinformatics, there are many formats for sequence alignment data, just as there are for the sequence data learned earlier. Bio.AlignIO provides an API similar to Bio.SeqIO, except that Bio.SeqIO works on sequence data and Bio.AlignIO works on sequence alignment data.

Before starting to learn, let us download a sample sequence alignment file from the Internet. To download the sample file, follow the steps below.

Step 1: Open your favorite browser and go to the http://pfam.xfam.org/family/browse website. It shows all the Pfam families in alphabetical order.

Step 2: Choose any family with a small number of seed sequences. It contains minimal data and lets us work easily with the alignment. Here, we have selected PF18225, which opens http://pfam.xfam.org/family/PF18225 and shows complete details about the family, including its sequence alignments.

Step 3: Go to the alignment section and download the sequence alignment file in Stockholm format (PF18225_seed.txt).

Let us try to read the downloaded sequence alignment file using Bio.AlignIO as below.

Import the Bio.AlignIO module:

>>> from Bio import AlignIO

Read the alignment using the read method. The read method reads a file that contains exactly one alignment. If the file contains several alignments, use the parse method instead.
The parse method returns an iterable of alignment objects, similar to the parse method in the Bio.SeqIO module.

>>> alignment = AlignIO.read(open("PF18225_seed.txt"), "stockholm")

Print the alignment object:

>>> print(alignment)
SingleLetterAlphabet() alignment with 6 rows and 65 columns
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADIT...KKP A0A0C3NPG9_9PROT/58-119
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_9GAMM/57-121
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125
>>>

We can also check the sequences (SeqRecord objects) in the alignment as below:

>>> for align in alignment:
...     print(align.seq)
...
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVATVANQLRGRKRRAFARHREGP
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADITA---RLDRRREHGEHGVRKKP
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMAPMLIALNYRNRESHAQVDKKP
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMAPLFKVLSFRNREDQGLVNNKP
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIMVLAPRLTAKHPYDKVQDRNRK
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVADLMRKLDLDRPFKKLERKNRT
>>>

Multiple Alignments

In general, most sequence alignment files contain a single alignment, and the read method is enough to parse them. Some files, however, store more than one alignment.
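To make the Stockholm format used above less opaque, here is a minimal plain-Python reader. It is a sketch for illustration only and is not how Bio.AlignIO works internally: read_stockholm is a hypothetical helper, the toy alignment is invented, and the reader ignores all markup lines and stops at the first alignment. For real files, AlignIO.read remains the right tool.

```python
def read_stockholm(text):
    """Return {sequence_id: aligned_sequence} for the first alignment in text."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the header line and #=GF/#=GS/#=GR/#=GC markup
        if line == "//":
            break  # "//" terminates one alignment
        name, chunk = line.split(None, 1)
        # Interleaved files repeat the name; concatenate the pieces.
        rows[name] = rows.get(name, "") + chunk.replace(" ", "")
    return rows


# Invented toy alignment, not taken from Pfam.
example = """# STOCKHOLM 1.0
#=GF ID toy_family
seq1/1-8   ACDEFGHI
seq2/1-7   ACD-FGHI
seq3/1-6   AC--FGHI
//
"""

alignment = read_stockholm(example)
columns = {len(seq) for seq in alignment.values()}
print(len(alignment), "rows")        # 3 rows
print(columns)                       # {8}
```

Every row of an alignment has the same length, which is why the set of row lengths contains a single value.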
If the input file contains more than one sequence alignment, we need to use the parse method instead of the read method, as below:

>>> from Bio import AlignIO
>>> alignments = AlignIO.parse(open("PF18225_seed.txt"), "stockholm")
>>> print(alignments)
<generator object parse at 0x000001CD1C7E0360>
>>> for alignment in alignments:
...     print(alignment)
...
SingleLetterAlphabet() alignment with 6 rows and 65 columns
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADIT...KKP A0A0C3NPG9_9PROT/58-119
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_9GAMM/57-121
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125
>>>

Here, the parse method returns a generator, which is iterated to get the actual alignments. The sample file holds only one alignment, so the loop runs once.

Pairwise Sequence Alignment

Pairwise sequence alignment compares only two sequences at a time and provides the best possible alignments between them. Pairwise alignments are easy to understand and easy to interpret.

Biopython provides a special module, Bio.pairwise2, to find alignments using the pairwise method. It uses a dynamic programming algorithm, and its results are on par with other software.

Let us write an example to find the alignment of two simple, hypothetical sequences using the pairwise module. This will help us understand the concept of sequence alignment and how to program it using Biopython.
Step 1

Import the pairwise2 module with the command given below:

>>> from Bio import pairwise2

Step 2

Create two sequences, seq1 and seq2:

>>> from Bio.Seq import Seq
>>> seq1 = Seq("ACCGGT")
>>> seq2 = Seq("ACGT")

Step 3

Call the pairwise2.align.globalxx method with seq1 and seq2 to find the alignments:

>>> alignments = pairwise2.align.globalxx(seq1, seq2)

Here, the globalxx method performs the actual work and finds all the best possible alignments of the given sequences. Bio.pairwise2 provides a whole set of such methods, named according to the convention below, to find alignments in different scenarios.

<sequence alignment type>XY

Here, the sequence alignment type is either global or local. A global alignment takes the entire length of both sequences into consideration. A local alignment looks for the best-matching subsequences of the given sequences, which is more useful when only parts of the sequences are similar.

X refers to the match score. The possible values are x (a match scores 1 and a mismatch scores 0), m (user-supplied match and mismatch scores), d (a user-provided dictionary that maps character pairs to scores) and c (a user-defined function that implements a custom scoring algorithm).

Y refers to the gap penalty. The possible values are x (no gap penalties), s (the same penalties for both sequences), d (different penalties for each sequence) and c (a user-defined function that provides custom gap penalties).

So, localds is also a valid method. It finds a local alignment using a user-provided dictionary for match scores and user-provided gap penalties that apply to both sequences.

>>> test_alignments = pairwise2.align.localds(seq1, seq2, blosum62, -10, -1)

Here, blosum62 refers to the BLOSUM62 substitution matrix, which supplies the match scores. It is not defined by pairwise2 itself and has to be loaded first, for example with substitution_matrices.load("BLOSUM62") from Bio.Align in recent Biopython releases. -10 is the gap open penalty and -1 is the gap extension penalty.
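The scoring behind a global alignment can be sketched with a small dynamic programming routine in plain Python. This is an illustration of the general Needleman-Wunsch idea, not the pairwise2 source code. The function global_score is hypothetical, it returns only the best score rather than the alignments, and it assumes a single linear gap penalty instead of separate open and extend penalties. With the default values (match 1, mismatch 0, gap 0) it mirrors the xx scoring described above.

```python
def global_score(a, b, match=1, mismatch=0, gap=0):
    """Best global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            ))
        prev = curr
    return prev[-1]


print(global_score("ACCGGT", "ACGT"))                               # 4
print(global_score("ACCGGT", "ACGT", match=1, mismatch=-1, gap=-1)) # 2
```

With xx-style scoring, the result is simply the largest number of positions that can be matched while keeping both sequences in order. Penalizing each of the two unavoidable gap positions by 1 lowers the score from 4 to 2.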
Step 4

Loop over the alignments object and print each individual alignment:

>>> for alignment in alignments:
...     print(alignment)
...
('ACCGGT', 'A-C-GT', 4.0, 0, 6)
('ACCGGT', 'AC--GT', 4.0, 0, 6)
('ACCGGT', 'A-CG-T', 4.0, 0, 6)
('ACCGGT', 'AC-G-T', 4.0, 0, 6)

Each tuple holds the two aligned sequences, the score, and the start and end positions of the alignment.

Step 5

The Bio.pairwise2 module provides a formatting method, format_alignment, to better visualize the result:

>>> from Bio.pairwise2 import format_alignment
>>> alignments

Biopython – Overview of BLAST

BLAST stands for Basic Local Alignment Search Tool. It finds regions of similarity between biological sequences. Biopython provides the Bio.Blast module to work with NCBI BLAST. You can run BLAST either locally or over the Internet.

Let us understand these two options in brief in the following sections.

Running over Internet

Biopython provides the Bio.Blast.NCBIWWW module to call the online version of BLAST. To do this, we need to import the following module:

>>> from Bio.Blast import NCBIWWW

The NCBIWWW module provides the qblast function to query the online version of BLAST, https://blast.ncbi.nlm.nih.gov/Blast.cgi. qblast supports all the parameters supported by the online version.

To get help on this function and its features, use the below command:

>>> help(NCBIWWW.qblast)
Help on function qblast in module Bio.Blast.NCBIWWW:

qblast(
   program, database, sequence,
   url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi',
   auto_format=None, composition_based_statistics=None,
   db_genetic_code=None, endpoints=None, entrez_query='(none)',
   expect=10.0, filter=None, gapcosts=None, genetic_code=None,
   hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None,
   matrix_name=None, nucl_penalty=None, nucl_reward=None,
   other_advanced=None, perc_ident=None, phi_pattern=None,
   query_file=None, query_believe_defline=None, query_from=None,
   query_to=None, searchsp_eff=None, service=None, threshold=None,
   ungapped_alignment=None, word_size=None, alignments=500,
   alignment_view=None, descriptions=500, entrez_links_new_window=None,
   expect_low=None, expect_high=None, format_entrez_query=None,
   format_object=None, format_type='XML', ncbi_gi=None,
   results_file=None, show_overview=None, megablast=None,
   template_type=None, template_length=None
)

   BLAST search using NCBI's QBLAST server or a cloud service provider.
   Supports all parameters of the qblast API for Put and Get.

   Please note that BLAST on the cloud supports the NCBI-BLAST Common URL API
   (http://ncbi.github.io/blast-cloud/dev/api.html). To use this feature,
   please set url_base to 'http://host.my.cloud.service.provider.com/cgi-bin/blast.cgi'
   and format_object='Alignment'. For more details, please see
   https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=CloudBlast

   Some useful parameters:

   - program        blastn, blastp, blastx, tblastn, or tblastx (lower case)
   - database       Which database to search against (e.g. "nr").
   - sequence       The sequence to search.
   - ncbi_gi        TRUE/FALSE whether to give 'gi' identifier.
   - descriptions   Number of descriptions to show. Def 500.
   - alignments     Number of alignments to show. Def 500.
   - expect         An expect value cutoff. Def 10.0.
   - matrix_name    Specify an alt. matrix (PAM30, PAM70, BLOSUM80, BLOSUM45).
   - filter         "none" turns off filtering. Default no filtering
   - format_type    "HTML", "Text", "ASN.1", or "XML". Def. "XML".
   - entrez_query   Entrez query to limit Blast search
   - hitlist_size   Number of hits to return. Default 50
   - megablast      TRUE/FALSE whether to use MEga BLAST algorithm (blastn only)
   - service        plain, psi, phi, rpsblast, megablast (lower case)

   This function does no checking of the validity of the parameters
   and passes the values to the server as is. More help is available at:
   https://ncbi.github.io/blast-cloud/dev/api.html

The arguments of the qblast function are largely analogous to the parameters that you can set on the BLAST web page. This makes the qblast function easy to understand and reduces the learning curve.

Connecting and Searching

To understand the process of connecting to and searching the online version of BLAST, let us run a simple sequence search (using a sequence from a local file) against the online BLAST server through Biopython.
Step 1: Create a file named blast_example.fasta in the Biopython directory with the below sequence information as its content:

Example of a single sequence in FASTA/Pearson format:

>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc

>sequence B
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattca
tattctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc

Step 2: Import the NCBIWWW module.

>>> from Bio.Blast import NCBIWWW

Step 3: Open the sequence file, blast_example.fasta, using Python's built-in open function and read its content.

>>> sequence_data = open("blast_example.fasta").read()
>>> sequence_data
'Example of a single sequence in FASTA/Pearson format:\n\n\n> sequence A\nggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat\ntctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc\n\n'

Step 4: Now, call the qblast function, passing the sequence data as the main parameter. The other parameters are the program (blastn) and the database (nt).

>>> result_handle = NCBIWWW.qblast("blastn", "nt", sequence_data)
>>> result_handle
<_io.StringIO object at 0x000001EC9FAA4558>

result_handle holds the result of our search. It can be saved to a file for later use and also parsed to get the details. We will learn how to do that in the coming section.

Step 5: The same search can be run with a Seq object rather than the whole FASTA file, as shown below:

>>> from Bio import SeqIO
>>> seq_record = next(SeqIO.parse(open('blast_example.fasta'), 'fasta'))
>>> seq_record.id
'sequence'
>>> seq_record.seq
Seq('ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatat...gtc', SingleLetterAlphabet())

Now, call the qblast function, passing the Seq object, seq_record.seq, as the main parameter.

>>> result_handle = NCBIWWW.qblast("blastn", "nt", seq_record.seq)
>>> print(result_handle)
<_io.StringIO object at 0x000001EC9FAA4558>

BLAST will assign an identifier for your sequence automatically.
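Step 5 relies on SeqIO to split the FASTA text into records, and the record id it reports is the first whitespace-delimited word of the header line, which is why the header "sequence A" yields the id 'sequence'. The plain-Python sketch below shows that splitting logic. It is not Biopython's parser: read_fasta is a hypothetical helper and the two short sequences are shortened for the example.

```python
def read_fasta(text):
    """Return a list of dicts with 'id', 'description' and 'seq' keys."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None and line:
            chunks.append(line)  # sequence lines may wrap; join them later
        # lines before the first ">" are ignored in this sketch
    if header is not None:
        records.append((header, "".join(chunks)))
    return [
        {"id": h.split()[0] if h else "", "description": h, "seq": s}
        for h, s in records
    ]


example = """>sequence A
ggtaagtcct
ctagtacaaa
>sequence B
ggtaagtc
"""

records = read_fasta(example)
print(records[0]["id"], len(records[0]["seq"]))  # sequence 20
print(records[1]["description"])                 # sequence B
```

Because both headers start with the same word, both records get the same id here. Giving each sequence a unique first word in its header avoids that ambiguity.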
Step 6: The result_handle object holds the entire result and can be saved into a file for later use. Note that a handle can be read only once.

>>> with open('results.xml', 'w') as save_file:
...     blast_results = result_handle.read()
...     save_file.write(blast_results)

We will see how to parse the result file in a later section.

Running Standalone BLAST

This section explains how to run BLAST on a local system. Running BLAST locally may be faster, and it also allows you to create your own database to search sequences against.

Connecting BLAST

In general, running BLAST locally is not recommended because of its large size, the extra effort needed to run the software, and the cost involved. Online BLAST is sufficient for basic and advanced purposes. Of course, sometimes you may need to install it locally.

If you run frequent online searches that take a lot of time and network traffic, or if you have proprietary sequence data or IP-related concerns, then installing BLAST locally is recommended.

To do this, we need to follow the steps below.

Step 1: Download and install the latest BLAST binary using the given link: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Step 2: Download and unpack the latest and necessary database using the below link −

Biopython – PDB Module

Biopython provides the Bio.PDB module to manipulate polypeptide structures. The PDB (Protein Data Bank) is the largest protein structure resource available online. It hosts many distinct protein structures, including protein-protein, protein-DNA and protein-RNA complexes.

To load the module, type the below command:

>>> from Bio.PDB import *

Protein Structure File Formats

The PDB distributes protein structures in three different formats: an XML-based file format, which is not supported by Biopython; the pdb file format, which is a specially formatted text file; and the PDBx/mmCIF file format.

PDB files distributed by the Protein Data Bank may contain formatting errors that make them ambiguous or difficult to parse. The Bio.PDB module attempts to deal with these errors automatically.

The Bio.PDB module implements two different parsers, one for the mmCIF format and one for the pdb format.

Let us learn how to parse each of the formats in detail.

mmCIF Parser

Let us download an example structure in mmCIF format from the PDB server using the below commands:

>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file('2FAT', pdir='.', file_format='mmCif')

This downloads the specified file (2fat.cif) from the server and stores it in the current working directory.

Here, PDBList provides options to list and download files from the online PDB FTP server. The retrieve_pdb_file method needs the identifier of the structure to be downloaded, without an extension. retrieve_pdb_file also has options to specify the download directory, pdir, and the format of the file, file_format.
The possible values of file_format are as follows: "mmCif" (default, PDBx/mmCif file), "pdb" (PDB format), "xml" (PDBML/XML format), "mmtf" (highly compressed) and "bundle" (PDB formatted archive for large structures).

To load a cif file, use Bio.PDB.MMCIFParser as below:

>>> parser = MMCIFParser(QUIET=True)
>>> data = parser.get_structure("2FAT", "2fat.cif")

Here, QUIET suppresses the warnings issued while parsing the file. get_structure parses the file and returns the structure with the id 2FAT (the first argument). Without QUIET, any warnings would be printed while the file is parsed.

Now, check the structure using the below command:

>>> data
<Structure id=2FAT>

To get the type, use the type function as below:

>>> print(type(data))
<class 'Bio.PDB.Structure.Structure'>

We have successfully parsed the file and obtained the structure of the protein. We will learn the details of the protein structure and how to access them in a later section.

PDB Parser

Let us download an example structure in PDB format from the PDB server using the below commands:

>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file('2FAT', pdir='.', file_format='pdb')

This downloads the specified file (pdb2fat.ent) from the server and stores it in the current working directory.

To load a pdb file, use Bio.PDB.PDBParser as below:

>>> parser = PDBParser(PERMISSIVE=True, QUIET=True)
>>> data = parser.get_structure("2fat", "pdb2fat.ent")

Here, get_structure works as it does for MMCIFParser. The PERMISSIVE option makes the parser as tolerant as possible of problems in the protein data.

Now, check the structure and its type with the code snippet given below:

>>> data
<Structure id=2fat>
>>> print(type(data))
<class 'Bio.PDB.Structure.Structure'>

The header attribute of the structure is a dictionary that stores information about the entry.
To list its keys, type the below command:

>>> print(data.header.keys())
dict_keys(['name', 'head', 'deposition_date', 'release_date', 'structure_method', 'resolution', 'structure_reference', 'journal_reference', 'author', 'compound', 'source', 'keywords', 'journal'])
>>>

To get the name, use the following code:

>>> print(data.header["name"])
an anti-urokinase plasminogen activator receptor (upar) antibody: crystal structure and binding epitope
>>>

You can also check the date and resolution with the below code:

>>> print(data.header["release_date"])
2006-11-14
>>> print(data.header["resolution"])
1.77

PDB Structure

The 2FAT structure is composed of a single model containing two chains, chain L and chain H, each of which contains a number of residues. Each residue is composed of multiple atoms, each having a 3D position represented by (x, y, z) coordinates.

Let us learn how to walk through the structure down to the atoms in the sections below.

Model

The Structure.get_models() method returns an iterator over the models:

>>> model = data.get_models()
>>> model
<generator object get_models at 0x103fa1c80>
>>> models = list(model)
>>> models
[<Model id=0>]
>>> type(models[0])
<class 'Bio.PDB.Model.Model'>

Here, a Model describes exactly one 3D conformation. It contains one or more chains.

Chain

The Model.get_chains() method returns an iterator over the chains:

>>> chains = list(models[0].get_chains())
>>> chains
[<Chain id=L>, <Chain id=H>]
>>> type(chains[0])
<class 'Bio.PDB.Chain.Chain'>

Here, a Chain describes a proper polypeptide structure, i.e., a consecutive sequence of bound residues.

Residue

The Chain.get_residues() method returns an iterator over the residues:

>>> residue = list(chains[0].get_residues())
>>> len(residue)
293
>>> residue1 = list(chains[1].get_residues())
>>> len(residue1)
311

A Residue holds the atoms that belong to an amino acid.
Atoms

The Residue.get_atoms() method returns an iterator over the atoms:

>>> atoms = list(residue[0].get_atoms())
>>> atoms
[<Atom N>, <Atom CA>, <Atom C>, <Atom O>, <Atom CB>, <Atom CG>, <Atom OD1>, <Atom OD2>]

An Atom holds the 3D coordinates of an atom, which are available as a Vector:

>>> atoms[0].get_vector()
<Vector 18.49, 73.26, 44.16>

The three numbers are the x, y and z coordinate values.
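The chain, residue and atom levels described above come straight from the fixed-column ATOM records of a pdb file. The plain-Python sketch below groups such records into nested dictionaries to show where each level comes from. It is not how Bio.PDB is implemented: atom_line and parse_atoms are hypothetical helpers, the sample records are synthetic, and the sketch ignores models, alternate locations, insertion codes and HETATM records.

```python
from collections import defaultdict


def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Format one synthetic ATOM record using the fixed PDB column layout."""
    return (f"ATOM  {serial:>5}  {name:<3} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(lines):
    """Group ATOM records as {chain: {(res_name, res_seq): {atom: (x, y, z)}}}."""
    chains = defaultdict(lambda: defaultdict(dict))
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()      # columns 13-16: atom name
        res_name = line[17:20].strip()  # columns 18-20: residue name
        chain = line[21]                # column 22: chain identifier
        res_seq = int(line[22:26])      # columns 23-26: residue number
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        chains[chain][(res_name, res_seq)][name] = xyz
    return chains


# Synthetic sample records (made-up residues and coordinates).
sample = [
    atom_line(1, "N", "ASP", "L", 1, 18.49, 73.26, 44.16),
    atom_line(2, "CA", "ASP", "L", 1, 19.10, 72.00, 44.50),
    atom_line(3, "N", "ILE", "L", 2, 20.30, 71.50, 43.90),
    atom_line(4, "N", "GLU", "H", 1, 5.00, 6.00, 7.00),
]

structure = parse_atoms(sample)
for chain_id, residues in structure.items():
    for (res_name, res_seq), atoms in residues.items():
        print(chain_id, res_name, res_seq, sorted(atoms))
```

In practice, PDBParser handles the many special cases of the format, so a hand-written reader like this is only useful for understanding the layout.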

Biopython – Entrez Database

Biopython – Entrez Database ”; Previous Next Entrez is an online search system provided by NCBI. It provides access to nearly all known molecular biology databases with an integrated global query supporting Boolean operators and field search. It returns results from all the databases with information like the number of hits from each databases, records with links to the originating database, etc. Some of the popular databases which can be accessed through Entrez are listed below − Pubmed Pubmed Central Nucleotide (GenBank Sequence Database) Protein (Sequence Database) Genome (Whole Genome Database) Structure (Three Dimensional Macromolecular Structure) Taxonomy (Organisms in GenBank) SNP (Single Nucleotide Polymorphism) UniGene (Gene Oriented Clusters of Transcript Sequences) CDD (Conserved Protein Domain Database) 3D Domains (Domains from Entrez Structure) In addition to the above databases, Entrez provides many more databases to perform the field search. Biopython provides an Entrez specific module, Bio.Entrez to access Entrez database. Let us learn how to access Entrez using Biopython in this chapter − Database Connection Steps To add the features of Entrez, import the following module − >>> from Bio import Entrez Next set your email to identify who is connected with the code given below − >>> Entrez.email = ”<youremail>” Then, set the Entrez tool parameter and by default, it is Biopython. 
>>> Entrez.tool = 'Demoscript'

Now, call the einfo function. Given a database name, it reports index term counts, last update, and available links for that database; called with no arguments, it lists the available databases:

>>> info = Entrez.einfo()

The einfo function returns a handle, which provides access to the information through its read method as shown below:

>>> data = info.read()
>>> print(data)
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nuccore</DbName>
<DbName>ipg</DbName>
<DbName>nucleotide</DbName>
<DbName>nucgss</DbName>
<DbName>nucest</DbName>
<DbName>structure</DbName>
<DbName>sparcle</DbName>
<DbName>genome</DbName>
<DbName>annotinfo</DbName>
<DbName>assembly</DbName>
<DbName>bioproject</DbName>
<DbName>biosample</DbName>
<DbName>blastdbinfo</DbName>
<DbName>books</DbName>
<DbName>cdd</DbName>
<DbName>clinvar</DbName>
<DbName>clone</DbName>
<DbName>gap</DbName>
<DbName>gapplus</DbName>
<DbName>grasp</DbName>
<DbName>dbvar</DbName>
<DbName>gene</DbName>
<DbName>gds</DbName>
<DbName>geoprofiles</DbName>
<DbName>homologene</DbName>
<DbName>medgen</DbName>
<DbName>mesh</DbName>
<DbName>ncbisearch</DbName>
<DbName>nlmcatalog</DbName>
<DbName>omim</DbName>
<DbName>orgtrack</DbName>
<DbName>pmc</DbName>
<DbName>popset</DbName>
<DbName>probe</DbName>
<DbName>proteinclusters</DbName>
<DbName>pcassay</DbName>
<DbName>biosystems</DbName>
<DbName>pccompound</DbName>
<DbName>pcsubstance</DbName>
<DbName>pubmedhealth</DbName>
<DbName>seqannot</DbName>
<DbName>snp</DbName>
<DbName>sra</DbName>
<DbName>taxonomy</DbName>
<DbName>biocollections</DbName>
<DbName>unigene</DbName>
<DbName>gencoll</DbName>
<DbName>gtr</DbName>
</DbList>
</eInfoResult>

The data is in XML format. To get the data as a Python object, pass the handle returned by Entrez.einfo() to Entrez.read before reading from it yourself:

>>> info = Entrez.einfo()
>>> record = Entrez.read(info)

Here, record is a dictionary which has one key, DbList, as shown below:

>>> record.keys()
[u'DbList']

Accessing the DbList key returns the list of database names shown below:

>>> record[u'DbList']
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure',
'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo',
'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds',
'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim',
'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems',
'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy',
'biocollections', 'unigene', 'gencoll', 'gtr']
>>>

Basically, the Entrez module parses the XML returned by the Entrez search system and provides it as Python dictionaries and lists.

Search Database

To search any one of the Entrez databases, we can use the Bio.Entrez.esearch() function. It is used as shown below:

>>> info = Entrez.esearch(db = "pubmed", term = "genome")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '1146113', u'RetMax': '20', u'IdList':
['30347444', '30347404', '30347317', '30347292', '30347286', '30347249', '30347194',
'30347187', '30347172', '30347088', '30347075', '30346992', '30346990', '30346982',
'30346980', '30346969', '30346962', '30346954', '30346941', '30346939'],
u'TranslationStack': [DictElement({u'Count': '927819', u'Field': 'MeSH Terms',
u'Term': '"genome"[MeSH Terms]', u'Explode': 'Y'}, attributes = {}),
DictElement({u'Count': '422712', u'Field': 'All Fields',
u'Term': '"genome"[All Fields]', u'Explode': 'N'}, attributes = {}), 'OR', 'GROUP'],
u'TranslationSet': [DictElement({u'To': '"genome"[MeSH Terms] OR "genome"[All Fields]',
u'From': 'genome'}, attributes = {})], u'RetStart': '0',
u'QueryTranslation': '"genome"[MeSH Terms] OR "genome"[All Fields]'}, attributes = {})
>>>

If the term has no matches in the database you specify, the result is empty and carries a warning:

>>> info = Entrez.esearch(db = "blastdbinfo", term = "books")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '0', u'RetMax': '0', u'IdList': [],
u'WarningList': DictElement({u'OutputMessage': ['No items found.'],
u'PhraseIgnored': [], u'QuotedPhraseNotFound': []}, attributes = {}),
u'ErrorList': DictElement({u'FieldNotFound': [], u'PhraseNotFound': ['books']}, attributes = {}),
u'TranslationSet': [], u'RetStart': '0',
u'QueryTranslation': '(books[All Fields])'}, attributes = {})

If you want to search across databases, you can use Entrez.egquery. It is similar to Entrez.esearch, except that you specify only the keyword and omit the database parameter.

>>> info = Entrez.egquery(term = "entrez")
>>> record = Entrez.read(info)
>>> for row in record["eGQueryResult"]:
...     print(row["DbName"], row["Count"])
...
pubmed 458
pmc 12779
mesh 1
…
…
…
biosample 7
biocollections 0

Fetch Records

Entrez provides a special function, efetch, to download the full details of a record from Entrez. Consider the following simple example:

>>> handle = Entrez.efetch(db = "nucleotide", id = "EU490707", rettype = "fasta")

Now, we can simply read the record using the SeqIO module:

>>> from Bio import SeqIO
>>> record = SeqIO.read(handle, "fasta")
>>> record
SeqRecord(seq = Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA…GAA',
SingleLetterAlphabet()), id = 'EU490707.1', name = 'EU490707.1',
description = 'EU490707.1 Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast',
dbxrefs = [])
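To make the XML-to-dictionary step in this chapter more concrete, here is a minimal sketch that uses only the Python standard library. It is not how Bio.Entrez is implemented; the helper name parse_db_list is hypothetical, and the XML text is a shortened, hand-written copy of the eInfo reply shown above, so no network access is needed.

```python
import xml.etree.ElementTree as ET

# A shortened eInfo reply; the real reply lists many more databases.
EINFO_XML = """<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
  </DbList>
</eInfoResult>"""


def parse_db_list(xml_text):
    """Return {'DbList': [...]} built from the <DbName> elements (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {"DbList": [node.text for node in root.iter("DbName")]}


record = parse_db_list(EINFO_XML)
print(record["DbList"])
```

The result has the same shape as the record returned by Entrez.read for an eInfo reply: one DbList key holding a list of database names.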

Biopython – Sequence

A sequence is a series of letters used to represent an organism's protein, DNA or RNA. It is represented by the Seq class, which is defined in the Bio.Seq module.

Let's create a simple sequence in Biopython as shown below:

>>> from Bio.Seq import Seq
>>> seq = Seq("AGCT")
>>> seq
Seq('AGCT')
>>> print(seq)
AGCT

Here, we have created a simple protein sequence AGCT, in which the letters represent Alanine, Glycine, Cysteine and Threonine.

Each Seq object has two important attributes:

data: the actual sequence string (AGCT).
alphabet: the type of sequence, e.g. DNA sequence, RNA sequence, etc. By default, it does not represent any particular sequence type and is generic in nature.

Alphabet Module

Seq objects contain an alphabet attribute to specify the sequence type, letters and possible operations. It is defined in the Bio.Alphabet module. Note that Bio.Alphabet was removed in Biopython 1.78, so the alphabet examples in this chapter apply only to earlier releases. Alphabet can be defined as below:

>>> from Bio.Seq import Seq
>>> myseq = Seq("AGCT")
>>> myseq
Seq('AGCT')
>>> myseq.alphabet
Alphabet()

The Alphabet module provides the classes below to represent different types of sequences.

Alphabet: base class for all types of alphabets.

SingleLetterAlphabet: generic alphabet with letters of size one. It derives from Alphabet, and all other alphabet types derive from it.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import single_letter_alphabet
>>> test_seq = Seq('AGTACACTGGT', single_letter_alphabet)
>>> test_seq
Seq('AGTACACTGGT', SingleLetterAlphabet())

ProteinAlphabet: generic single letter protein alphabet.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> test_seq = Seq('AGTACACTGGT', generic_protein)
>>> test_seq
Seq('AGTACACTGGT', ProteinAlphabet())

NucleotideAlphabet: generic single letter nucleotide alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> test_seq = Seq('AGTACACTGGT', generic_nucleotide)
>>> test_seq
Seq('AGTACACTGGT', NucleotideAlphabet())

DNAAlphabet: generic single letter DNA alphabet.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> test_seq = Seq('AGTACACTGGT', generic_dna)
>>> test_seq
Seq('AGTACACTGGT', DNAAlphabet())

RNAAlphabet: generic single letter RNA alphabet.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_rna
>>> test_seq = Seq('AGTACACTGGT', generic_rna)
>>> test_seq
Seq('AGTACACTGGT', RNAAlphabet())

The Biopython module Bio.Alphabet.IUPAC provides the basic sequence types defined by the IUPAC community. It contains the following classes:

IUPACProtein (protein): IUPAC protein alphabet of 20 standard amino acids.
ExtendedIUPACProtein (extended_protein): extended uppercase IUPAC protein single letter alphabet including X.
IUPACAmbiguousDNA (ambiguous_dna): uppercase IUPAC ambiguous DNA.
IUPACUnambiguousDNA (unambiguous_dna): uppercase IUPAC unambiguous DNA (GATC).
ExtendedIUPACDNA (extended_dna): extended IUPAC DNA alphabet.
IUPACAmbiguousRNA (ambiguous_rna): uppercase IUPAC ambiguous RNA.
IUPACUnambiguousRNA (unambiguous_rna): uppercase IUPAC unambiguous RNA (GAUC).

Consider a simple example for the IUPACProtein class as shown below:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("AGCT", IUPAC.protein)
>>> protein_seq
Seq('AGCT', IUPACProtein())
>>> protein_seq.alphabet
IUPACProtein()

Also, Biopython exposes all the bioinformatics-related configuration data through the Bio.Data module. For example, IUPACData.protein_letters has the possible letters of the IUPACProtein alphabet.

>>> from Bio.Data import IUPACData
>>> IUPACData.protein_letters
'ACDEFGHIKLMNPQRSTVWY'

Basic Operations

This section briefly explains the basic operations available in the Seq class. Sequences are similar to Python strings.
We can perform Python string operations like slicing, counting, concatenation, find, split and strip on sequences. The examples below show each operation and its output.

To get the first value in a sequence:

>>> seq_string = Seq("AGCTAGCT")
>>> seq_string[0]
'A'

To print the first two values:

>>> seq_string[0:2]
Seq('AG')

To print all the values:

>>> seq_string[ : ]
Seq('AGCTAGCT')

To perform length and count operations:

>>> len(seq_string)
8
>>> seq_string.count('A')
2

To add two sequences:

>>> from Bio.Alphabet import generic_dna, generic_protein
>>> seq1 = Seq("AGCT", generic_dna)
>>> seq2 = Seq("TCGA", generic_dna)
>>> seq1+seq2
Seq('AGCTTCGA', DNAAlphabet())

Here, the two sequence objects, seq1 and seq2, are generic DNA sequences, so you can add them to produce a new sequence. You can't add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence, as shown below:

>>> dna_seq = Seq('AGTACACTGGT', generic_dna)
>>> protein_seq = Seq('AGUACACUGGU', generic_protein)
>>> dna_seq + protein_seq
.....
.....
TypeError: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
>>>

To add more than two sequences, first store them in a Python list, then loop over the list and add each one to an initially empty sequence, as shown below:

>>> from Bio.Alphabet import generic_dna
>>> seq_list = [Seq("AGCT", generic_dna), Seq("TCGA", generic_dna), Seq("AAA", generic_dna)]
>>> for s in seq_list:
...     print(s)
...
AGCT
TCGA
AAA
>>> final_seq = Seq("", generic_dna)
>>> for s in seq_list:
...     final_seq = final_seq + s
...
>>> final_seq
Seq('AGCTTCGAAAA', DNAAlphabet())

The following examples cover further operations.

To change the case of a sequence:

>>> from Bio.Alphabet import generic_rna
>>> rna = Seq("agct", generic_rna)
>>> rna.upper()
Seq('AGCT', RNAAlphabet())

To check the Python membership and identity operators:
>>> rna = Seq("agct", generic_rna)
>>> 'a' in rna
True
>>> 'A' in rna
False
>>> rna1 = Seq("AGCT", generic_dna)
>>> rna is rna1
False

To find a single letter or a sequence of letters inside the given sequence:

>>> protein_seq = Seq('AGUACACUGGU', generic_protein)
>>> protein_seq.find('G')
1
>>> protein_seq.find('GG')
8

To perform a splitting operation:

>>> protein_seq = Seq('AGUACACUGGU', generic_protein)
>>> protein_seq.split('A')
[Seq('', ProteinAlphabet()), Seq('GU', ProteinAlphabet()), Seq('C', ProteinAlphabet()), Seq('CUGGU', ProteinAlphabet())]

To perform strip operations on the sequence:

>>> strip_seq = Seq(" AGCT ")
>>> strip_seq
Seq(' AGCT ')
>>> strip_seq.strip()
Seq('AGCT')
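Because Seq behaves much like a Python string, the operations in this chapter can be tried on a plain str first. The sketch below uses only the standard library and is not Biopython code. The helper name is_iupac_protein is hypothetical, and the letter set is the IUPACData.protein_letters value shown earlier in this chapter.

```python
# The 20 standard amino acid letters, as listed in IUPACData.protein_letters.
PROTEIN_LETTERS = "ACDEFGHIKLMNPQRSTVWY"


def is_iupac_protein(sequence):
    """Return True if every letter is one of the 20 standard amino acids (hypothetical helper)."""
    return all(letter in PROTEIN_LETTERS for letter in sequence.upper())


text = "AGUACACUGGU"

print(text.find("GG"))            # index of the first "GG"
print(text.split("A"))            # pieces between each "A"
print(text.count("A"))            # number of "A" letters
print(is_iupac_protein("AGCT"))   # A, G, C and T are all amino acid letters
print(is_iupac_protein(text))     # "U" is not among the 20 standard letters
```

A check like this shows why the letters alone do not identify the sequence type: "AGCT" is valid both as DNA and as a protein, which is the ambiguity the alphabet argument was meant to resolve.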

Biopython – Introduction

Biopython is the largest and most popular bioinformatics package for Python. It contains a number of different sub-modules for common bioinformatics tasks. It was developed by Chapman and Chang and is mainly written in Python. It also contains C code to optimize the computationally heavy parts of the software. It runs on Windows, Linux, Mac OS X, etc.

Basically, Biopython is a collection of Python modules that provide functions to deal with DNA, RNA and protein sequence operations such as reverse complementing a DNA string, finding motifs in protein sequences, etc. It provides many parsers to read all major genetic database formats like GenBank, SwissProt, FASTA, etc., as well as wrappers/interfaces to run other popular bioinformatics software/tools like NCBI BLASTN, Entrez, etc., inside the Python environment. It has sibling projects like BioPerl, BioJava and BioRuby.

Features

Biopython is portable, clear and has an easy-to-learn syntax. Some of the salient features are listed below:

Interpreted, interactive and object oriented.
Supports FASTA, PDB, GenBank, Blast, SCOP, PubMed/Medline, ExPASy-related formats.
Option to deal with sequence formats.
Tools to manage protein structures.
BioSQL: standard set of SQL tables for storing sequences plus features and annotations.
Access to online services and databases, including NCBI services (Blast, Entrez, PubMed) and ExPASy services (SwissProt, Prosite).
Access to local services, including Blast, Clustalw, EMBOSS.

Goals

The goal of Biopython is to provide simple, standard and extensive access to bioinformatics through the Python language. The specific goals of Biopython are listed below:

Providing standardized access to bioinformatics resources.
High-quality, reusable modules and scripts.
Fast array manipulation that can be used in Cluster code, PDB, NaiveBayes and Markov Model.
Genomic data analysis.

Advantages

Biopython requires very little code and comes with the following advantages:

Provides a microarray data type used in clustering.
Reads and writes Tree-View type files.
Supports structure data used for PDB parsing, representation and analysis.
Supports journal data used in Medline applications.
Supports the BioSQL database, which is a widely used standard database amongst bioinformatics projects.
Supports parser development by providing modules to parse a bioinformatics file into a format-specific record object or a generic class of sequence plus features.
Clear, cookbook-style documentation.

Sample Case Study

Let us check some of the use cases (population genetics, RNA structure, etc.) and try to understand how Biopython plays an important role in these fields.

Population Genetics

Population genetics is the study of genetic variation within a population, and involves the examination and modeling of changes in the frequencies of genes and alleles in populations over space and time.

Biopython provides the Bio.PopGen module for population genetics. This module contains the functions needed to gather information about classic population genetics.

RNA Structure

Three major biological macromolecules that are essential for life are DNA, RNA and protein. Proteins are the workhorses of the cell and play an important role as enzymes. DNA (deoxyribonucleic acid) is considered the "blueprint" of the cell. It carries all the genetic information required for the cell to grow, take in nutrients, and propagate. RNA (ribonucleic acid) acts as a "DNA photocopy" in the cell.

Biopython provides Seq objects, defined in the Bio.Seq module, that represent sequences of nucleotides, the building blocks of DNA and RNA.

Sequence I/O Operations

Biopython provides a module, Bio.SeqIO, to read sequences from and write sequences to a file (or any stream). It supports nearly all file formats available in bioinformatics. Most software provides a different approach for each file format, but Biopython consciously follows a single approach and presents the parsed sequence data to the user through its SeqRecord object.

Let us learn more about SeqRecord in the following section.

SeqRecord

The Bio.SeqRecord module provides SeqRecord to hold meta information about the sequence as well as the sequence data itself, as given below:

seq: the actual sequence.
id: the primary identifier of the given sequence. The default type is string.
name: the name of the sequence. The default type is string.
description: human readable information about the sequence.
annotations: a dictionary of additional information about the sequence.

The SeqRecord can be imported as specified below:

from Bio.SeqRecord import SeqRecord

Let us understand the nuances of parsing a sequence file using real sequence files in the coming sections.

Parsing Sequence File Formats

This section explains how to parse two of the most popular sequence file formats, FASTA and GenBank.

FASTA

FASTA is the most basic file format for storing sequence data. Originally, FASTA was a software package for sequence alignment of DNA and protein, developed during the early evolution of bioinformatics and used mostly to search for sequence similarity.

Biopython provides an example FASTA file, which can be accessed at https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.fasta.

Download and save this file into your Biopython sample directory as 'orchid.fasta'.

The Bio.SeqIO module provides the parse() method to process sequence files. It can be imported as follows:

from Bio import SeqIO
from Bio.SeqIO import parse

The parse() method takes two arguments: the first is the file handle and the second is the file format.

>>> file = open('path/to/biopython/sample/orchid.fasta')
>>> for record in parse(file, "fasta"):
...     print(record.id)
...
gi|2765658|emb|Z78533.1|CIZ78533
gi|2765657|emb|Z78532.1|CCZ78532
..........
..........
gi|2765565|emb|Z78440.1|PPZ78440
gi|2765564|emb|Z78439.1|PBZ78439
>>>

Here, the parse() method returns an iterable object which returns a SeqRecord on every iteration. Because it is iterable, it works with many of Python's built-in tools; let us see some of them.

next()

next() returns the next item available in the iterable object, which we can use to get the first sequence as given below:

>>> first_seq_record = next(SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta'))
>>> first_seq_record.id
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> first_seq_record.name
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> first_seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG…CGC', SingleLetterAlphabet())
>>> first_seq_record.description
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> first_seq_record.annotations
{}
>>>

Here, first_seq_record.annotations is empty because the FASTA format does not support sequence annotations.

list comprehension

We can convert the iterable object into a list using a list comprehension as given below:

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta')
>>> all_seq = [seq_record for seq_record in seq_iter]
>>> len(all_seq)
94
>>>

Here, we have used the len function to get the total count.
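
The iterator behaviour described above can be seen with a small stand-in for parse(). This is a minimal sketch using only the standard library, not Biopython's parser. The helper read_fasta and the two-record sample text are hypothetical, and the sketch assumes every header line carries an id.

```python
import io

# Hand-written sample data standing in for a FASTA file.
FASTA_TEXT = """>seq1 first example record
ACGTACGT
ACGT
>seq2 second example record
GGCC
"""


def read_fasta(handle):
    """Yield (id, description, sequence) tuples from a FASTA handle (hypothetical helper)."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header.split()[0], header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header.split()[0], header, "".join(chunks)


records = read_fasta(io.StringIO(FASTA_TEXT))
first = next(records)        # consumes the first record only
remaining = list(records)    # consumes whatever is left
leftover = list(records)     # the iterator is now exhausted

print(first)
print(len(remaining), len(leftover))
```

An iterator can be consumed only once, which is why the examples in this chapter call SeqIO.parse again on a freshly opened file before each new pass over the records.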
We can get the length of the longest sequence as follows:

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta')
>>> max_seq = max(len(seq_record.seq) for seq_record in seq_iter)
>>> max_seq
789
>>>

We can filter the sequences as well, using the code below:

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta')
>>> seq_under_600 = [seq_record for seq_record in seq_iter if len(seq_record.seq) < 600]
>>> for seq in seq_under_600:
...     print(seq.id)
...
gi|2765606|emb|Z78481.1|PIZ78481
gi|2765605|emb|Z78480.1|PGZ78480
gi|2765601|emb|Z78476.1|PGZ78476
gi|2765595|emb|Z78470.1|PPZ78470
gi|2765594|emb|Z78469.1|PHZ78469
gi|2765564|emb|Z78439.1|PBZ78439
>>>

Writing a collection of SeqRecord objects (parsed data) into a file is as simple as calling the SeqIO.write method as below:

file = open("converted.fasta", "w")
SeqIO.write(seq_record, file, "fasta")
file.close()

This method can also be used to convert between formats, as shown below. GenBank output needs to know the molecule type, so records parsed from a FASTA file may need that information added before writing.

file = open("converted.gbk", "w")
SeqIO.write(seq_record, file, "genbank")
file.close()

GenBank

GenBank is a richer sequence format for genes and includes fields for various kinds of annotations. Biopython provides an example GenBank file, which can be accessed at https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.gbk.

Download and save the file into your Biopython sample directory as 'orchid.gbk'.

Since Biopython provides a single function, parse, to parse all bioinformatics formats, parsing the GenBank format is as simple as changing the format option in the parse method.

The code for the same is given below:

>>> from Bio import SeqIO
>>> from Bio.SeqIO import parse
>>> seq_record = next(parse(open('path/to/biopython/sample/orchid.gbk'), 'genbank'))
>>> seq_record.id
'Z78533.1'
>>> seq_record.name
'Z78533'
>>> seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG…CGC', IUPACAmbiguousDNA())
>>> seq_record.description
'C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> seq_record.annotations
{
   'molecule_type': 'DNA',
   'topology': 'linear',
   'data_file_division': 'PLN',
   'date': '30-NOV-2006',
   'accessions': ['Z78533'],
   'sequence_version': 1,
   'gi': '2765658',
   'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'],
   'source': 'Cypripedium irapeanum',
   'organism': 'Cypripedium irapeanum',
   'taxonomy': [
      'Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta',
      'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae',
      'Cypripedioideae', 'Cypripedium'],
   'references': [
      Reference(title = 'Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', …),
      Reference(title = 'Direct Submission', …)
   ]
}
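
Since annotations is an ordinary Python dictionary, its entries can be read with normal dictionary and list operations. The sketch below works on a hand-copied subset of the annotations shown above rather than on a parsed file, so it runs without Biopython. The variable names are illustrative only.

```python
# A hand-copied subset of the annotations shown above (not read from the file).
annotations = {
    "molecule_type": "DNA",
    "topology": "linear",
    "accessions": ["Z78533"],
    "organism": "Cypripedium irapeanum",
    "taxonomy": ["Eukaryota", "Viridiplantae", "Streptophyta", "Embryophyta",
                 "Tracheophyta", "Spermatophyta", "Magnoliophyta", "Liliopsida",
                 "Asparagales", "Orchidaceae", "Cypripedioideae", "Cypripedium"],
}

genus = annotations["taxonomy"][-1]                  # last, most specific entry
family = annotations["taxonomy"][-3]                 # third entry from the end
lineage = "; ".join(annotations["taxonomy"][:3])     # first three entries
comment = annotations.get("comment", "no comment")   # .get avoids a KeyError for absent keys

print(genus, family)
print(lineage)
print(comment)
```

Using .get with a default is a reasonable habit here, because the set of keys depends on what the source file provides; a FASTA record, for example, has no annotations at all.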

Creating Simple Application

Let us create a simple Biopython application to parse a bioinformatics file and print the content. This will help us understand the general concept of Biopython and how it helps in the field of bioinformatics.

Step 1: First, create a sample sequence file, "example.fasta", and put the content below into it.

>sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin)
MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAV
NNFEAHTINTVVHTNDSDKGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITID
SNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTAGQYQGLVSIILTKSTTTTTTTKGT
>sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin)
MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVS
NTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDK
NAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGNYRANITITSTIKGGGTKKGTTDKK

The extension, fasta, refers to the file format of the sequence file. The format takes its name from the bioinformatics software FASTA, where it originated. A FASTA file holds multiple sequences arranged one after another, and each sequence has its own id, name, description and the actual sequence data.

Step 2: Create a new Python script, "simple_example.py", enter the code below and save it.

from Bio.SeqIO import parse
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

file = open("example.fasta")

records = parse(file, "fasta")

for record in records:
    print("Id: %s" % record.id)
    print("Name: %s" % record.name)
    print("Description: %s" % record.description)
    print("Annotations: %s" % record.annotations)
    print("Sequence Data: %s" % record.seq)
    print("Sequence Alphabet: %s" % record.seq.alphabet)

Let us take a closer look at the code:

Line 1 imports the parse function available in the Bio.SeqIO module. The Bio.SeqIO module is used to read and write sequence files in different formats, and the parse function is used to parse the content of the sequence file.

Line 2 imports the SeqRecord class available in the Bio.SeqRecord module. This module is used to manipulate sequence records, and the SeqRecord class is used to represent a particular sequence available in the sequence file.

Line 3 imports the Seq class available in the Bio.Seq module. This module is used to manipulate sequence data, and the Seq class is used to represent the sequence data of a particular sequence record available in the sequence file.

Line 5 opens the "example.fasta" file using the regular Python function, open.

Line 7 parses the content of the sequence file and returns it as an iterator of SeqRecord objects.

Lines 9-15 loop over the records using a Python for loop and print the attributes of the sequence record (SeqRecord) such as id, name, description, sequence data, etc.

Line 15 prints the sequence's type using the Alphabet class. Bio.Alphabet was removed in Biopython 1.78, so this line works only on earlier releases.

Step 3: Open a command prompt, go to the folder containing the sequence file, "example.fasta", and run the command below:

> python simple_example.py

Step 4: Python runs the script and prints all the sequence data available in the sample file, "example.fasta". The output will be similar to the following content.

Id: sp|P25730|FMS1_ECOLI
Name: sp|P25730|FMS1_ECOLI
Description: sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin)
Annotations: {}
Sequence Data: MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAVNNFEAHTINTVVHTNDSDKGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITIDSNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTAGQYQGLVSIILTKSTTTTTTTKGT
Sequence Alphabet: SingleLetterAlphabet()
Id: sp|P15488|FMS3_ECOLI
Name: sp|P15488|FMS3_ECOLI
Description: sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin)
Annotations: {}
Sequence Data: MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVSNTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDKNAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGNYRANITITSTIKGGGTKKGTTDKK
Sequence Alphabet: SingleLetterAlphabet()

We have seen the parse function and two classes, SeqRecord and Seq, in this example. Together they provide most of the functionality, and we will learn about them in the coming sections.
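
The ids printed in this example pack several fields into one string separated by "|". As a small follow-up, the sketch below splits such an id using plain Python. The helper split_uniprot_id is hypothetical, and it assumes the three-field layout (database, accession, entry name) seen in the two sample records; ids with a different layout would need other handling.

```python
def split_uniprot_id(record_id):
    """Split an id such as 'sp|P25730|FMS1_ECOLI' into its three fields (hypothetical helper)."""
    database, accession, entry_name = record_id.split("|")
    return {"database": database, "accession": accession, "entry_name": entry_name}


for record_id in ("sp|P25730|FMS1_ECOLI", "sp|P15488|FMS3_ECOLI"):
    fields = split_uniprot_id(record_id)
    print(fields["accession"], fields["entry_name"])
```

In the script above, the same helper could be applied to record.id inside the for loop.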