Biopython – Useful Resources

The following resources contain additional information on Biopython. Use them to gain more in-depth knowledge of the topic.

- BioInformatics with Python: 39 lectures, 11 hours, by Jesse E. Agbe
- NCBI Tools in Linux: 15 lectures, 1.5 hours, by Matthew Cserhati

Biopython – Testing Techniques

Biopython ships with an extensive set of test scripts that exercise the software under different conditions and help catch regressions. To run the whole suite, download the Biopython source code, change into its Tests directory and run the following command:

python run_tests.py

This runs all the test scripts and produces output similar to the following:

Python version: 2.7.12 (v2.7.12:d33e0cf91556, Jun 26 2016, 12:10:39)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
Operating system: posix darwin
test_Ace ... ok
test_Affy ... ok
test_AlignIO ... ok
test_AlignIO_ClustalIO ... ok
test_AlignIO_EmbossIO ... ok
test_AlignIO_FastaIO ... ok
test_AlignIO_MauveIO ... ok
test_AlignIO_PhylipIO ... ok
test_AlignIO_convert ... ok
...

An individual test script can be run in the same way:

python test_AlignIO.py

Conclusion

As we have learned, Biopython is one of the important software packages in the field of bioinformatics. Because it is written in Python, which is easy to learn and write, it offers extensive functionality for the common computations and operations of the field. It also provides an easy and flexible interface to almost all the popular bioinformatics software, so that their functionality can be used from Python as well.

Biopython – Cluster Analysis

In general, cluster analysis groups a set of objects so that objects in the same group (a cluster) are more similar to each other than to objects in other groups. The concept is used in data mining, statistical data analysis, machine learning, pattern recognition, image analysis, bioinformatics and other fields, and it can be carried out by a variety of algorithms. In bioinformatics, cluster analysis is mainly used on gene expression data to find groups of genes with similar expression.

In this chapter, we look at the important clustering algorithms in Biopython to understand the fundamentals of clustering on a small dataset. Biopython implements all of them in the Bio.Cluster module, which supports the following algorithms:

- Hierarchical Clustering
- K-Clustering
- Self-Organizing Maps
- Principal Component Analysis

Let us have a brief introduction to each of them.

Hierarchical Clustering

Hierarchical clustering repeatedly links the two nearest items or clusters, as judged by a distance measure, into a new cluster until everything is joined in a single tree. A Bio.Cluster Node describes one such link and has three attributes: left, right and distance. Let us create a simple node as shown below:

>>> from Bio.Cluster import Node
>>> n = Node(1, 10)
>>> n.left = 11
>>> n.right = 0
>>> n.distance = 1
>>> print(n)
(11, 0): 1

To build a tree from a list of nodes, use the Tree class:

>>> from Bio.Cluster import Node, Tree
>>> n1 = [Node(1, 2, 0.2), Node(0, -1, 0.5)]
>>> n1_tree = Tree(n1)
>>> print(n1_tree)
(1, 2): 0.2
(0, -1): 0.5
>>> print(n1_tree[0])
(1, 2): 0.2

Now let us perform hierarchical clustering with the Bio.Cluster module. Consider the following array, in which each row is one item:

>>> from numpy import array
>>> distance = array([[1, 2, 3], [4, 5, 6], [3, 5, 7]])

Pass the array to treecluster:

>>> from Bio.Cluster import treecluster
>>> cluster = treecluster(distance)
>>> print(cluster)
(2, 1): 0.666667
(-1, 0): 9.66667

Although the variable is named distance, an array passed as the first argument is treated as a data matrix, and treecluster computes the pairwise distances between its rows itself. The printed values are consistent with a mean squared difference: rows 1 and 2 differ by ((4-3)^2 + (5-5)^2 + (6-7)^2) / 3 = 0.666667. A precomputed distance matrix is passed through the distancematrix keyword instead.

The function returns a Tree object. For n clustered items the tree holds n - 1 nodes. A negative index such as -1 refers to a previously created node rather than to an original item, so (-1, 0) joins the first node with item 0.

K-Clustering

K-clustering is a family of partitioning algorithms comprising k-means, k-medians and k-medoids clustering. Let us look at each of them in brief.

K-means Clustering

This approach is popular in data mining. The goal of the algorithm is to find groups in the data, with the number of groups given by the variable K. The algorithm works iteratively, assigning each data point to one of the K groups based on the features provided, so that data points are clustered by feature similarity.

>>> from Bio.Cluster import kcluster
>>> from numpy import array
>>> data = array([[1, 2], [3, 4], [5, 6]])
>>> clusterid, error, found = kcluster(data)
>>> print(clusterid)
[0 0 1]
>>> print(found)
1

Here, clusterid gives the cluster assigned to each row, error is the within-cluster sum of distances for the best solution, and found is the number of times that solution was found. The algorithm starts from a random assignment, so the cluster labels can differ between runs.

K-medians Clustering

This is a variant of k-means that uses the median, rather than the mean, of each cluster to determine its centroid.

K-medoids Clustering

This approach works on a distance matrix between the items and the number of clusters passed by the user. Each cluster is represented by one of its own items, the medoid. Consider the distance matrix defined below:

>>> from numpy import array
>>> distance = array([[1, 2, 3], [4, 5, 6], [3, 5, 7]])

We can calculate k-medoids clustering using the below command:

>>> from Bio.Cluster import kmedoids
>>> clusterid, error, found = kmedoids(distance)

Let us consider an example with sequences. The kcluster function takes a numerical data matrix as input, not Seq instances, so sequences must first be converted to a matrix. One way of doing this is to translate each letter of a sequence to its ASCII code. The original version of this example used numpy.fromstring for that purpose; recent NumPy releases deprecate that use, and numpy.frombuffer applied to the ASCII-encoded bytes gives the same result. This creates a 2D array of encoded sequences, one row per sequence, that the kcluster function can cluster. All sequences must have the same length, and because ASCII codes are an arbitrary numerical encoding, the resulting clusters are only illustrative.

>>> from Bio.Cluster import kcluster
>>> import numpy as np
>>> sequence = ["AGCT", "CGTA", "AAGT", "TCCG"]
>>> matrix = np.asarray([np.frombuffer(s.encode("ascii"), dtype=np.uint8) for s in sequence])
>>> clusterid, error, found = kcluster(matrix)
>>> print(clusterid)
[1 0 0 1]

Self-Organizing Maps

This approach is a type of artificial neural network. It was developed by Kohonen and is often called a Kohonen map. It organizes items into clusters arranged on a rectangular grid. Let us create a simple map using the same data array as in the k-means example:

>>> from Bio.Cluster import somcluster
>>> from numpy import array
>>> data = array([[1, 2], [3, 4], [5, 6]])
>>> clusterid, celldata = somcluster(data)
>>> print(celldata)
[[[-1.36032469 0.38667395]]
 [[-0.41170578 1.35295911]]]
>>> print(clusterid)
[[1 0]
 [1 0]
 [1 0]]

Here, clusterid is an array with two columns and one row per clustered item; each row gives the grid coordinates of the cell to which the item was assigned. celldata holds the centroid of each grid cell. As with k-means, the starting point is random, so the numbers vary between runs.

Principal Component Analysis

Principal Component Analysis is useful for visualizing high-dimensional data. It uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number of dimensions or fewer. The pca function returns a tuple of columnmean, coordinates, components and eigenvalues. Let us first look into the basics of the concept using NumPy:

>>> from numpy import array
>>> from numpy import mean
>>> from numpy import cov
>>> from numpy.linalg import eig
>>> # define a matrix
>>> A = array([[1, 2], [3, 4], [5, 6]])
>>> print(A)
[[1 2]
 [3 4]
 [5 6]]
>>> # calculate the mean of each column
>>> M = mean(A.T, axis=1)
>>> print(M)
[ 3. 4.]
>>> # center columns by subtracting column means
>>> C = A - M
>>> print(C)
[[-2. -2.]
 [ 0. 0.]
 [ 2. 2.]]
>>> # calculate covariance matrix of centered matrix
>>> V = cov(C.T)
>>> print(V)
[[ 4. 4.]
 [ 4. 4.]]
>>> # eigendecomposition of covariance matrix
>>> values, vectors = eig(V)
>>> print(vectors)
[[ 0.70710678 -0.70710678]
 [ 0.70710678 0.70710678]]
>>> print(values)
[ 8. 0.]

Let us apply the same rectangular data matrix to the Bio.Cluster module as defined below:

>>> from Bio.Cluster import pca
>>> from numpy import array
>>> data = array([[1, 2], [3, 4], [5, 6]])
>>> columnmean, coordinates, components, eigenvalues = pca(data)
>>> print(columnmean)
[ 3. 4.]
>>> print(coordinates)
[[-2.82842712 0. ]
 [ 0. 0. ]
 [ 2.82842712 0. ]]
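The principal component steps in this chapter (center the columns, build the covariance matrix, take its eigendecomposition, project the data) can also be traced by hand for two-column data. The following is a minimal plain-Python sketch, assuming exactly two columns and using the closed-form eigenvalues of a symmetric 2x2 matrix. The function name pca_2d is hypothetical and this is not how Bio.Cluster implements pca.

```python
import math

def pca_2d(rows):
    """PCA for two-column data via the closed-form 2x2 eigendecomposition."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_y = sum(r[1] for r in rows) / n
    # center the columns by subtracting the column means
    centered = [(x - mean_x, y - mean_y) for x, y in rows]
    # sample covariance matrix [[a, b], [b, d]] (divides by n - 1, like numpy.cov)
    a = sum(cx * cx for cx, _ in centered) / (n - 1)
    b = sum(cx * cy for cx, cy in centered) / (n - 1)
    d = sum(cy * cy for _, cy in centered) / (n - 1)
    # eigenvalues of a symmetric 2x2 matrix
    half_trace = (a + d) / 2
    spread = math.sqrt(((a - d) / 2) ** 2 + b * b)
    largest, smallest = half_trace + spread, half_trace - spread
    # eigenvector belonging to the largest eigenvalue
    if b != 0:
        vx, vy = b, largest - a
    elif a >= d:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    axis = (vx / norm, vy / norm)
    # coordinate of every centered row along the principal axis
    projected = [cx * axis[0] + cy * axis[1] for cx, cy in centered]
    return (mean_x, mean_y), (largest, smallest), axis, projected

means, eigenvalues, axis, projected = pca_2d([(1, 2), (3, 4), (5, 6)])
print(means)
print(eigenvalues)
print(axis)
print(projected)
```

For the example matrix, the eigenvalues and the projected coordinates agree with the values and the first column of coordinates printed in the sessions above. The sign of the axis is arbitrary in general, so other inputs may give mirrored coordinates.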

Biopython – Discussion

Biopython is an open-source Python tool mainly used in the field of bioinformatics. This tutorial walks through the basics of the Biopython package, an overview of bioinformatics, sequence manipulation and plotting, population genetics, cluster analysis, genome analysis and connecting to BioSQL databases, and concludes with some examples.

Biopython – Machine Learning

Bioinformatics is an excellent area in which to apply machine learning algorithms. We have genetic information for a large number of organisms, and it is not possible to analyze all of it manually. With a suitable machine learning algorithm, we can extract a lot of useful information from these data. Biopython provides a useful set of algorithms for supervised machine learning.

Supervised learning is based on an input variable (X) and an output variable (Y). It uses an algorithm to learn the mapping function from the input to the output:

Y = f(X)

The main objective is to approximate the mapping function well enough that, when you have new input data (X), you can predict the output variable (Y) for that data.

Logistic Regression Model

Logistic regression is a supervised machine learning algorithm. It distinguishes between K classes using a weighted sum of predictor variables. It computes the probability that an event occurs and can be used, for example, in cancer detection. Biopython provides the Bio.LogisticRegression module for prediction based on logistic regression. Currently, Biopython implements logistic regression for two classes only (K = 2).

k-Nearest Neighbors

k-nearest neighbors is also a supervised machine learning algorithm. It classifies a data point according to the classes of its nearest neighbors in the training data. Biopython provides the Bio.kNN module for prediction based on the k-nearest neighbors algorithm.

Naive Bayes

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that share a common principle: the features are assumed to be independent of each other given the class. Biopython provides the Bio.NaiveBayes module to work with the Naive Bayes algorithm.
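To make the k-nearest neighbors idea from this chapter concrete, here is a minimal plain-Python sketch. It does not use Bio.kNN; the function knn_predict, the toy points and the labels are all invented for illustration, and Euclidean distance with a simple majority vote is an assumption of the sketch.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep the k closest points and let them vote
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy training data: two features per sample, two classes
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.0, 5.0)))
```

The choice of k trades noise sensitivity against locality: a small k follows individual neighbors closely, while a large k smooths the decision over a wider region.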
Markov Model

A Markov model is a mathematical system, defined as a collection of random variables, that moves from one state to another according to certain probabilistic rules. Biopython provides the Bio.MarkovModel and Bio.HMM.MarkovModel modules to work with Markov models.
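The state-to-state transitions of a Markov model can be illustrated with a small first-order chain over the four nucleotides. This is a plain-Python sketch and does not use Bio.MarkovModel; the transition probabilities are made up for the example, and the helper names simulate and sequence_probability are hypothetical.

```python
import random

# hypothetical transition probabilities; each row sums to 1
transitions = {
    "A": {"A": 0.5, "C": 0.2, "G": 0.2, "T": 0.1},
    "C": {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    "G": {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.1, "T": 0.5},
}

def simulate(start, length, rng):
    """Generate a sequence by repeatedly sampling the next state."""
    state = start
    chain = [state]
    for _ in range(length - 1):
        options = transitions[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        chain.append(state)
    return "".join(chain)

def sequence_probability(seq):
    """Probability of the transitions in seq, given its first state."""
    probability = 1.0
    for current, following in zip(seq, seq[1:]):
        probability *= transitions[current][following]
    return probability

rng = random.Random(0)  # fixed seed so the run is repeatable
print(simulate("A", 10, rng))
print(sequence_probability("AAC"))
```

The next state depends only on the current one, which is the defining property of a first-order Markov chain.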

Biopython – Quick Guide

Biopython – Introduction

Biopython is the largest and most popular bioinformatics package for Python. It contains a number of sub-modules for common bioinformatics tasks. It was developed by Chapman and Chang and is written mainly in Python, with some C code to speed up the computationally intensive parts of the software. It runs on Windows, Linux, Mac OS X, etc.

Basically, Biopython is a collection of Python modules that provide functions for working with DNA, RNA and protein sequences, such as reverse complementing a DNA string or finding motifs in protein sequences. It provides many parsers for reading the major genetic database formats such as GenBank, SwissProt and FASTA, as well as wrappers/interfaces for running other popular bioinformatics software/tools such as NCBI BLASTN and Entrez from inside the Python environment. It has sibling projects such as BioPerl, BioJava and BioRuby.

Features

Biopython is portable and clear, with an easy-to-learn syntax. Some of the salient features are listed below:

- Interpreted, interactive and object oriented.
- Supports FASTA, PDB, GenBank, Blast, SCOP, PubMed/Medline and ExPASy-related formats.
- Option to deal with sequence formats.
- Tools to manage protein structures.
- BioSQL: a standard set of SQL tables for storing sequences plus features and annotations.
- Access to online services and databases, including NCBI services (Blast, Entrez, PubMed) and ExPASy services (SwissProt, Prosite).
- Access to local services, including Blast, Clustalw and EMBOSS.

Goals

The goal of Biopython is to provide simple, standard and extensive access to bioinformatics through the Python language. The specific goals of Biopython are listed below:

- Providing standardized access to bioinformatics resources.
- High-quality, reusable modules and scripts.
- Fast array manipulation that can be used in Cluster code, PDB, NaiveBayes and Markov Model.
- Genomic data analysis.
Advantages

Biopython lets you accomplish a lot with very little code and comes with the following advantages:

- Provides the microarray data type used in clustering.
- Reads and writes Tree-View type files.
- Supports structure data used for PDB parsing, representation and analysis.
- Supports journal data used in Medline applications.
- Supports the BioSQL database, a widely used standard database among bioinformatics projects.
- Supports parser development by providing modules to parse a bioinformatics file into a format-specific record object or a generic class of sequence plus features.
- Clear, cookbook-style documentation.

Sample Case Study

Let us check some of the use cases (population genetics, RNA structure, etc.) and try to understand how Biopython plays an important role in these fields.

Population Genetics

Population genetics is the study of genetic variation within a population. It involves the examination and modeling of changes in the frequencies of genes and alleles in populations over space and time. Biopython provides the Bio.PopGen module for population genetics. This module contains the functions needed to gather information about classic population genetics.

RNA Structure

The three major biological macromolecules essential for life are DNA, RNA and protein. Proteins are the workhorses of the cell and play an important role as enzymes. DNA (deoxyribonucleic acid) is considered the "blueprint" of the cell. It carries all the genetic information required for the cell to grow, take in nutrients and propagate. RNA (ribonucleic acid) acts as a "DNA photocopy" in the cell. Biopython provides Bio.Seq objects that represent sequences of nucleotides, the building blocks of DNA and RNA.

Biopython – Installation

This section explains how to install Biopython on your machine. It is very easy to install and should not take more than five minutes.

Step 1: Verifying Python Installation

Biopython was designed to work with Python 2.5 or higher (current Biopython releases require Python 3), so Python must be installed first. Run the below command in your command prompt:

> python --version

If Python is installed properly, the command shows its version. Otherwise, download the latest version of Python, install it and then run the command again.

Step 2: Installing Biopython using pip

It is easy to install Biopython using pip from the command line on all platforms. Type the below command:

> pip install biopython

To update an older version of Biopython:

> pip install biopython --upgrade

After executing this command, the older versions of Biopython and NumPy (on which Biopython depends) are removed before the recent versions are installed.

Step 3: Verifying Biopython Installation

Now you have successfully installed Biopython on your machine. To verify that Biopython is installed properly, type the below commands in your Python console:

>>> import Bio
>>> print(Bio.__version__)

This shows the version of Biopython.

Alternate Way: Installing Biopython using Source

To install Biopython from source code, follow the below instructions.

Download the recent release of Biopython from the following link: https://biopython.org/wiki/Download

As of now, the latest version is biopython-1.72.

Download the file, unpack the compressed archive, move into the source code folder and type the below command:

> python setup.py build

This builds Biopython from the source code. Now, test the code using the below command:

> python setup.py test

Finally, install using the below command:

> python setup.py install

Biopython – Creating Simple Application

Let us create a simple Biopython application to parse a bioinformatics file and print its content. This will help us understand the general concept of Biopython and how it helps in the field of bioinformatics.
Step 1: First, create a sample sequence file, "example.fasta", and put the below content into it.

>sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin)
MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAV
NNFEAHTINTVVHTNDSDKGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITID
SNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTAGQYQGLVSIILTKSTTTTTTTKGT
>sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin)
MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVS
NTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDK
NAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGNYRANITITSTIKGGGTKKGTTDKK

The extension fasta refers to the file format of the sequence file. The FASTA format takes its name from the bioinformatics software FASTA, where it originated. A FASTA file holds multiple sequences arranged one after another, and each sequence has its own id, name, description and the actual sequence data.

Step 2: Create a new Python script, simple_example.py
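It can help to see how little structure the FASTA format described above actually has. The sketch below is a plain-Python reader written only for illustration; it is not Biopython's parser, and the function name read_fasta and the toy records are hypothetical.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # sequence data may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# toy input standing in for a file opened with open("example.fasta")
sample = io.StringIO(
    ">seq1 first toy record\nMKLKK\nTIGAM\n>seq2 second toy record\nMLKIK\n"
)
for header, sequence in read_fasta(sample):
    identifier = header.split()[0]  # the id is the first word of the header
    print(identifier, len(sequence))
```

A real parser also has to deal with malformed input and with very large files, which is part of what a library such as Biopython takes care of.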

Biopython – BioSQL Module

BioSQL is a generic database schema designed mainly to store sequences and their related data in any RDBMS engine. It is designed so that it can hold the data from all popular bioinformatics databases such as GenBank and SwissProt. It can be used to store in-house data as well.

BioSQL currently provides specific schemas for the below databases:

- MySQL (biosqldb-mysql.sql)
- PostgreSQL (biosqldb-pg.sql)
- Oracle (biosqldb-ora/*.sql)
- SQLite (biosqldb-sqlite.sql)

It also provides minimal support for the Java-based HSQLDB and Derby databases.

BioPython provides simple yet advanced ORM capabilities for working with a BioSQL-based database. It provides a module, BioSQL, with the following functionality:

- Create/remove a BioSQL database
- Connect to a BioSQL database
- Parse a sequence database such as GenBank, SwissProt, BLAST results, Entrez results, etc., and load it directly into the BioSQL database
- Fetch sequence data from the BioSQL database
- Fetch taxonomy data from NCBI and store it in the BioSQL database
- Run any SQL query against the BioSQL database

Overview of BioSQL Database Schema

Before going deep into BioSQL, let us understand the basics of the BioSQL schema. The schema provides 25+ tables to hold sequence data, sequence features, sequence category/ontology and taxonomy information. Some of the important tables are as follows:

- biodatabase
- bioentry
- biosequence
- seqfeature
- taxon
- taxon_name
- ontology
- term
- dbxref

Creating a BioSQL Database

In this section, let us create a sample BioSQL database using the schema provided by the BioSQL team. We shall work with SQLite, as it is easy to get started with and does not need a complex setup.

Here, we shall create a SQLite-based BioSQL database using the below steps.

Step 1: Download the SQLite database engine and install it.

Step 2: Download the BioSQL project from the GitHub URL: https://github.com/biosql/biosql

Step 3: Open a console, create a directory using mkdir and enter it.

cd /path/to/your/biopython/sample
mkdir sqlite-biosql
cd sqlite-biosql

Step 4: Run the below command to create a new SQLite database. The file is named orchid.db here because that is the file the loading script and the queries later in this chapter open.

> sqlite3.exe orchid.db
SQLite version 3.25.2 2018-09-25 19:08:10
Enter ".help" for usage hints.
sqlite>

Step 5: Copy the biosqldb-sqlite.sql file from the BioSQL project (/sql/biosqldb-sqlite.sql) and store it in the current directory.

Step 6: Run the below command to create all the tables.

sqlite> .read biosqldb-sqlite.sql

Now, all tables are created in our new database.

Step 7: Run the below commands to see all the new tables in our database.

sqlite> .headers on
sqlite> .mode column
sqlite> .separator ROW "\n"
sqlite> SELECT name FROM sqlite_master WHERE type = 'table';
biodatabase
taxon
taxon_name
ontology
term
term_synonym
term_dbxref
term_relationship
term_relationship_term
term_path
bioentry
bioentry_relationship
bioentry_path
biosequence
dbxref
dbxref_qualifier_value
bioentry_dbxref
reference
bioentry_reference
comment
bioentry_qualifier_value
seqfeature
seqfeature_relationship
seqfeature_path
seqfeature_qualifier_value
seqfeature_dbxref
location
location_qualifier_value
sqlite>

The first three commands are configuration commands that make SQLite show the result in a formatted manner.

Step 8: Copy the sample GenBank file, ls_orchid.gbk, provided by the BioPython team at https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk into the current directory and save it as orchid.gbk.

Step 9: Create a Python script, load_orchid.py, using the below code and execute it.
from Bio import SeqIO
from BioSQL import BioSeqDatabase

# open the SQLite file that already contains the BioSQL tables
server = BioSeqDatabase.open_database(driver="sqlite3", db="orchid.db")
# create a virtual database (a biodatabase entry) named "orchid"
db = server.new_database("orchid")
# parse the GenBank file and load every record; the second argument
# asks the loader to fetch taxonomy details from NCBI
count = db.load(SeqIO.parse("orchid.gbk", "gb"), True)
server.commit()
server.close()

The above code parses the records in the file, converts them into Python objects and inserts them into the BioSQL database. We will analyze the code in a later section.

We have now created a new BioSQL database and loaded some sample data into it. We shall discuss the important tables next.

Simple ER Diagram

The biodatabase table is at the top of the hierarchy. Its main purpose is to organize a set of sequence data into a single group/virtual database. Every entry in biodatabase refers to a separate database and does not mingle with another database. All the related tables in the BioSQL database have references to a biodatabase entry.

The bioentry table holds all the details about a sequence except the sequence data. The sequence data of a particular bioentry is stored in the biosequence table.

The taxon and taxon_name tables hold taxonomy details, and every entry refers to them to specify its taxon information.

After understanding the schema, let us look into some queries in the next section.

BioSQL Queries

Let us delve into some SQL queries to better understand how the data are organized and how the tables are related to each other. Before proceeding, let us open the database using the below command and set some formatting options:

> sqlite3 orchid.db
SQLite version 3.25.2 2018-09-25 19:08:10
Enter ".help" for usage hints.
sqlite> .header on
sqlite> .mode columns

.header and .mode are formatting options to better visualize the data. You can also use any SQLite editor to run the queries.
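The queries in this section can also be run from Python's built-in sqlite3 module. The sketch below is self-contained and therefore does not open orchid.db: it builds a drastically simplified, hypothetical two-table stand-in for the BioSQL schema in memory, with invented accessions, and then runs a join of the same shape as the bioentry query in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# simplified stand-in tables; the real BioSQL tables have many more columns
conn.executescript("""
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    biodatabase_id INTEGER REFERENCES biodatabase(biodatabase_id),
    accession TEXT
);
INSERT INTO biodatabase VALUES (1, 'orchid');
INSERT INTO bioentry VALUES
    (1, 1, 'TOY0001'), (2, 1, 'TOY0002'), (3, 1, 'TOY0003'),
    (4, 1, 'TOY0004'), (5, 1, 'TOY0005');
""")

# "LIMIT 3 OFFSET 1" means the same as SQLite's "LIMIT 1, 3":
# skip one row, then return the next three
query = """
SELECT be.bioentry_id, be.accession, bd.name
FROM bioentry be
INNER JOIN biodatabase bd ON bd.biodatabase_id = be.biodatabase_id
WHERE bd.name = ?
ORDER BY be.bioentry_id
LIMIT 3 OFFSET 1
"""
rows = conn.execute(query, ("orchid",)).fetchall()
for row in rows:
    print(row)
conn.close()
```

Passing the database name as a bound parameter (the ? placeholder) avoids assembling SQL strings by hand. Against the real database you would connect to orchid.db and select the actual bioentry columns.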
List the virtual sequence databases available in the system as given below:

select * from biodatabase;

*** Result ***

sqlite> .width 15 15 15 15
sqlite> select * from biodatabase;
biodatabase_id   name             authority        description
---------------  ---------------  ---------------  ---------------
1                orchid
sqlite>

Here, we have only one database, orchid.

List three entries from the orchid database with the query given below. Note that Limit 1, 3 skips the first row and returns the next three, which is why the result starts at bioentry_id 2.

select be.*, bd.name
from bioentry be
inner join biodatabase bd on bd.biodatabase_id = be.biodatabase_id
where bd.name = 'orchid' Limit 1, 3;

*** Result ***

sqlite> .width 15 15 10 10 10 10 10 50 10 10
sqlite> select be.*, bd.name from bioentry be inner join biodatabase bd on bd.biodatabase_id = be.biodatabase_id where bd.name = 'orchid' Limit 1,3;
bioentry_id  biodatabase_id  taxon_id  name    accession  identifier  division  description                                         version  name
-----------  --------------  --------  ------  ---------  ----------  --------  --------------------------------------------------  -------  ------
2            1               19        Z78532  Z78532     2765657     PLN       C.californicum 5.8S rRNA gene and ITS1 and ITS2 DN  1        orchid
3            1               20        Z78531  Z78531     2765656     PLN       C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DN  1        orchid
4            1               21        Z78530  Z78530     2765655     PLN       C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 D  1        orchid
sqlite>

List

Biopython – Genome Analysis

A genome is the complete set of DNA of an organism, including all of its genes. Genome analysis studies the genome as a whole, whereas genetics refers to the study of individual genes and their roles in inheritance.

Genome Diagram

A genome diagram represents genetic information as charts. Biopython uses the Bio.Graphics.GenomeDiagram module to draw such diagrams. The GenomeDiagram module requires ReportLab to be installed.

Steps for creating a diagram

The process of creating a diagram generally follows the below simple pattern:

- Create a FeatureSet for each separate set of features you want to display, and add Bio.SeqFeature objects to them.
- Create a GraphSet for each graph you want to display, and add graph data to them.
- Create a Track for each track you want on the diagram, and add GraphSets and FeatureSets to the tracks you require.
- Create a Diagram, and add the Tracks to it.
- Tell the Diagram to draw the image.
- Write the image to a file.

Let us take a GenBank file as input, for example https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk, read a SeqRecord object from it and finally draw a genome diagram.

We shall import all the modules first as shown below:

>>> from reportlab.lib import colors
>>> from reportlab.lib.units import cm
>>> from Bio.Graphics import GenomeDiagram

Now, import the SeqIO module to read the data:

>>> from Bio import SeqIO
>>> record = SeqIO.read("example.gb", "genbank")

Here, record holds the sequence read from the GenBank file. SeqIO.read expects a file that contains exactly one record, so example.gb stands for a single-record GenBank file. For a multi-record file such as ls_orchid.gbk, take a single record with next(SeqIO.parse("ls_orchid.gbk", "genbank")) instead.
Now, create an empty diagram and add a track and a feature set to it:

>>> diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")
>>> track = diagram.new_track(1, name="Annotated Features")
>>> feature_set = track.new_set()

Now, add every gene feature of the record to the feature set, alternating between blue and red. The feature set has its own name (feature_set) so that the loop variable does not overwrite it, and the color is chosen from the number of features added so far:

>>> for feature in record.features:
...     if feature.type != "gene":
...         continue
...     if len(feature_set) % 2 == 0:
...         color = colors.blue
...     else:
...         color = colors.red
...     feature_set.add_feature(feature, color=color, label=True)
...

In an interactive session, each call to add_feature echoes the object it returns, so you will see a response like the following on your screen:

<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d3dc90>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d3dfd0>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x1007627d0>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d57290>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d57050>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d57390>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d57590>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d57410>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d57490>
<Bio.Graphics.GenomeDiagram._Feature.Feature object at 0x105d574d0>

Let us draw a diagram for the above input record:

>>> diagram.draw(
...     format="linear", orientation="landscape", pagesize="A4",
...     fragments=4, start=0, end=len(record))
>>> diagram.write("orchid.pdf", "PDF")
>>> diagram.write("orchid.eps", "EPS")
>>> diagram.write("orchid.svg", "SVG")
>>> diagram.write("orchid.png", "PNG")

After executing the above commands, the image files are saved in your Biopython directory.
[Figure: genome.png, the resulting linear genome diagram]

You can also draw the image in circular format by making the below changes:

>>> diagram.draw(
...     format="circular", circular=True, pagesize=(20*cm, 20*cm),
...     start=0, end=len(record), circle_core=0.7)
>>> diagram.write("circular.pdf", "PDF")

Chromosomes Overview

A DNA molecule is packaged into thread-like structures called chromosomes. Each chromosome is made up of DNA tightly coiled many times around proteins called histones, which support its structure.

Chromosomes are not visible in the cell's nucleus, not even under a microscope, when the cell is not dividing. However, the DNA that makes up chromosomes becomes more tightly packed during cell division and is then visible under a microscope.

In humans, each cell normally contains 23 pairs of chromosomes, for a total of 46. Twenty-two of these pairs, called autosomes, look the same in both males and females. The 23rd pair, the sex chromosomes, differs between males and females. Females have two copies of the X chromosome, while males have one X and one Y chromosome.

Biopython – Motif Objects

A sequence motif is a nucleotide or amino-acid sequence pattern. In proteins, the residues that make up a motif in the three-dimensional structure may not be adjacent in the sequence. Biopython provides a separate module, Bio.motifs, to access the functionality of sequence motifs as specified below:

from Bio import motifs

Creating Simple DNA Motif

Let us create a simple DNA motif using the below commands:

>>> from Bio import motifs
>>> from Bio.Seq import Seq
>>> DNA_motif = [Seq("AGCT"),
...              Seq("TCGA"),
...              Seq("AACT"),
...             ]
>>> seq = motifs.create(DNA_motif)
>>> print(seq)
AGCT
TCGA
AACT

To count the letters at each position, use the below command:

>>> print(seq.counts)
        0      1      2      3
A:   2.00   1.00   0.00   1.00
C:   0.00   1.00   2.00   0.00
G:   0.00   1.00   1.00   0.00
T:   1.00   0.00   0.00   2.00

Use the following code to get the counts of 'A' at every position:

>>> seq.counts["A", :]
(2, 1, 0, 1)

If you want to access a column of counts, use the below command:

>>> seq.counts[:, 3]
{'A': 1, 'C': 0, 'T': 2, 'G': 0}

Creating a Sequence Logo

We shall now discuss how to create a sequence logo. Consider the below sequences:

AGCTTACG
ATCGTACC
TTCCGAAT
GGTACGTA
AAGCTTGG

You can create a logo by hand using the following link: http://weblogo.berkeley.edu/

Add the above sequences there, create a new logo and save the image as seq.png in your Biopython folder.

Alternatively, a motif object can request its own logo. The weblogo method sends the motif to the WebLogo service and saves the resulting image under the given file name, so it needs an internet connection:

>>> seq.weblogo("seq.png")

The saved image shows the motif as a sequence logo.

JASPAR Database

JASPAR is one of the most popular motif databases. Bio.motifs supports reading and writing motifs in several JASPAR formats and scanning sequences with them. JASPAR stores meta-information for each motif, and the module Bio.motifs contains a specialized class, jaspar.Motif, to represent these meta-information attributes. Its notable attributes are:

- matrix_id: the unique JASPAR motif ID
- name: the name of the motif
- tf_family: the family of the motif, e.g. 'Helix-Loop-Helix'
- data_type: the type of data used in the motif

Let us create a file in the JASPAR sites format, named sample.sites, in the Biopython folder. It is defined below:

sample.sites

>MA0001 ARNT 1
AACGTGatgtccta
>MA0001 ARNT 2
CAGGTGggatgtac
>MA0001 ARNT 3
TACGTAgctcatgc
>MA0001 ARNT 4
AACGTGacagcgct
>MA0001 ARNT 5
CACGTGcacgtcgt
>MA0001 ARNT 6
cggcctCGCGTGc

In the above file, the motif instances are the upper-case parts of the sequences. Now, let us create a motif object from these instances:

>>> from Bio import motifs
>>> with open("sample.sites") as handle:
...     data = motifs.read(handle, "sites")
...
>>> print(data)
TF name None
Matrix ID None
Matrix:
        0      1      2      3      4      5
A:   2.00   5.00   0.00   0.00   0.00   1.00
C:   3.00   0.00   5.00   0.00   0.00   0.00
G:   0.00   1.00   1.00   6.00   0.00   5.00
T:   1.00   0.00   0.00   0.00   6.00   0.00

Here, data holds all the motif instances read from the sample.sites file.

To print all the instances from data, use the below command:

>>> for instance in data.instances:
...     print(instance)
...
AACGTG
CAGGTG
TACGTA
AACGTG
CACGTG
CGCGTG

Use the below command to print the counts for all positions:

>>> print(data.counts)
        0      1      2      3      4      5
A:   2.00   5.00   0.00   0.00   0.00   1.00
C:   3.00   0.00   5.00   0.00   0.00   0.00
G:   0.00   1.00   1.00   6.00   0.00   5.00
T:   1.00   0.00   0.00   0.00   6.00   0.00
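The counts tables in this chapter simply record how often each letter occurs at each position of the aligned motif instances. The following plain-Python sketch makes that explicit; it is written for illustration, the function name count_matrix is hypothetical, and it is not the Bio.motifs implementation.

```python
def count_matrix(instances, alphabet="ACGT"):
    """Count how often each letter occurs at each position of the instances."""
    length = len(instances[0])
    if any(len(instance) != length for instance in instances):
        raise ValueError("all motif instances must have the same length")
    # one row of zeros per letter, one column per motif position
    counts = {letter: [0] * length for letter in alphabet}
    for instance in instances:
        for position, letter in enumerate(instance.upper()):
            counts[letter][position] += 1
    return counts

# the three instances of the simple DNA motif from this chapter
counts = count_matrix(["AGCT", "TCGA", "AACT"])
for letter, row in counts.items():
    print(letter, row)
```

Each row printed here corresponds to one row of the seq.counts table for the simple DNA motif, with integer counts instead of formatted floats.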

Biopython – Plotting

Biopython – Plotting ”; Previous Next This chapter explains about how to plot sequences. Before moving to this topic, let us understand the basics of plotting. Plotting Matplotlib is a Python plotting library which produces quality figures in a variety of formats. We can create different types of plots like line chart, histograms, bar chart, pie chart, scatter chart, etc. pyLab is a module that belongs to the matplotlib which combines the numerical module numpy with the graphical plotting module pyplot.Biopython uses pylab module for plotting sequences. To do this, we need to import the below code − import pylab Before importing, we need to install the matplotlib package using pip command with the command given below − pip install matplotlib Sample Input File Create a sample file named plot.fasta in your Biopython directory and add the following changes − >seq0 FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF >seq1 KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME >seq2 EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK >seq3 MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDV >seq4 EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL >seq5 SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR >seq6 FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI >seq7 SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF >seq8 SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM >seq9 KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK >seq10 FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK Line Plot Now, let us create a simple line plot for the above fasta file. Step 1 − Import SeqIO module to read fasta file. >>> from Bio import SeqIO Step 2 − Parse the input file. 
>>> records = [len(rec) for rec in SeqIO.parse("plot.fasta", "fasta")]
>>> len(records)
11
>>> max(records)
72
>>> min(records)
57

Step 3 − Let us import the pylab module.

>>> import pylab

Step 4 − Configure the line chart by assigning x and y axis labels. Since plot draws each value against its position in the list, the x axis is the sequence index and the y axis is the sequence length.

>>> pylab.xlabel("sequence index")
Text(0.5, 0, 'sequence index')
>>> pylab.ylabel("sequence length")
Text(0, 0.5, 'sequence length')

Step 5 − Configure the line chart by setting grid display.

>>> pylab.grid()

Step 6 − Draw a simple line chart by calling the plot method and supplying records as input.

>>> pylab.plot(records)
[<matplotlib.lines.Line2D object at 0x10b6869d0>]

Step 7 − Finally save the chart using the below command.

>>> pylab.savefig("lines.png")

Result

After executing the above command, you can see the chart saved as lines.png in your Biopython directory.

Histogram Chart

A histogram is used for continuous data, where the bins represent ranges of data. Drawing a histogram is the same as drawing a line chart, except that instead of pylab.plot we call the hist method of the pylab module with records and a custom value for bins (5). If you continue in the same session as the line plot, call pylab.clf() first so that the earlier line is not drawn into the histogram. The complete code is as follows −

Step 1 − Import the SeqIO module to read the fasta file.

>>> from Bio import SeqIO

Step 2 − Parse the input file.

>>> records = [len(rec) for rec in SeqIO.parse("plot.fasta", "fasta")]
>>> len(records)
11
>>> max(records)
72
>>> min(records)
57

Step 3 − Let us import the pylab module.

>>> import pylab

Step 4 − Configure the chart by assigning x and y axis labels.

>>> pylab.xlabel("sequence length")
Text(0.5, 0, 'sequence length')
>>> pylab.ylabel("count")
Text(0, 0.5, 'count')

Step 5 − Configure the chart by setting grid display.

>>> pylab.grid()

Step 6 − Draw the histogram by calling the hist method and supplying records as input.

>>> pylab.hist(records, bins=5)
(array([2., 3., 1., 3., 2.]), array([57., 60., 63., 66., 69., 72.]), <a list of 5 Patch objects>)

Step 7 − Finally save the chart using the below command.
>>> pylab.savefig("hist.png")

Result

After executing the above command, you can see the chart saved as hist.png in your Biopython directory.

GC Percentage in Sequence

GC percentage is one of the commonly used measures for comparing different sequences. We can draw a simple line chart of the GC percentage of a set of sequences and compare them at a glance. Here, we just change the data from sequence length to GC percentage. Note that GC percentage is meaningful for nucleotide sequences; plot.fasta contains protein sequences, so it serves here only to demonstrate the plotting steps. The complete code is given below −

Step 1 − Import the SeqIO module to read the fasta file.

>>> from Bio import SeqIO

Step 2 − Import the GC function and compute the sorted GC percentage of each sequence in the input file. In recent Biopython releases, GC has been replaced by gc_fraction, which returns a fraction between 0 and 1 instead of a percentage.

>>> from Bio.SeqUtils import GC
>>> gc = sorted(GC(rec.seq) for rec in SeqIO.parse("plot.fasta", "fasta"))

Step 3 − Let us import the pylab module.

>>> import pylab

Step 4 − Configure the line chart by assigning x and y axis labels.

>>> pylab.xlabel("Genes")
Text(0.5, 0, 'Genes')
>>> pylab.ylabel("GC Percentage")
Text(0, 0.5, 'GC Percentage')

Step 5 − Configure the line chart by setting grid display.

>>> pylab.grid()

Step 6 − Draw a simple line chart by calling the plot method and supplying gc as input.

>>> pylab.plot(gc)
[<matplotlib.lines.Line2D object at 0x10b6869d0>]

Step 7 − Finally save the chart using the below command.

>>> pylab.savefig("gc.png")

Result

After executing the above command, you can see the chart saved as gc.png in your Biopython directory.
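The bin counts returned by pylab.hist(records, bins=5) in the Histogram Chart section can be reproduced by hand, which shows what "bins represent ranges of data" means in practice. The following is a minimal sketch in plain Python, not matplotlib's implementation: the function name equal_width_histogram is hypothetical, and the LENGTHS list was counted by hand from plot.fasta for this illustration (SeqIO.parse would give the same values directly).

```python
# Lengths of the 11 sequences in plot.fasta, in file order (seq0 .. seq10).
# Assumption: counted by hand for this sketch.
LENGTHS = [62, 72, 67, 57, 62, 66, 70, 65, 68, 57, 61]


def equal_width_histogram(values, bins):
    """Return (counts, edges) for `bins` equal-width bins spanning min..max."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("values must not all be equal")
    width = (high - low) / bins
    # bins + 1 edges: low, low + width, ..., high
    edges = [low + i * width for i in range(bins + 1)]

    counts = [0] * bins
    for value in values:
        # Every bin is half-open [left, right) except the last one,
        # which also includes the maximum value.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts, edges


counts, edges = equal_width_histogram(LENGTHS, bins=5)
print(counts)
print(edges)
```

With the lengths above, the range 57 to 72 is split into five bins of width 3, and the two shortest sequences (both of length 57) fall into the first bin.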