Welcome to Peking University MOOC “Bioinformatics: Introduction and Methods”. I’m Liping Wei from the Center for Bioinformatics at Peking University. In the previous weeks, we spent most of the time talking about bioinformatics methods and concepts. We hope that you have learned not only the methods themselves but also the knowledge behind them. For those of you who are interested in DEVELOPING new bioinformatics methods, I hope that our previous lectures have given you some ideas of how to identify an important and unsolved biological question, how to formulate it into a computational problem, how to come up with the idea for an algorithm to solve the problem, how to implement the algorithm, and how to evaluate it. For those of you who are interested in USING bioinformatics methods in your biological research, I hope that our previous lectures have taught you how to appreciate the power and limitations of a bioinformatics method: What biological question does it address? What are the different parameters and what do they do? What are the underlying assumptions? How accurate is the method? What are its limitations? Ever since the early days of bioinformatics, this field has had a tradition of making software and databases freely available to the whole scientific community. Today the vast majority of bioinformatics methods and databases are freely available online. This week, we’ll try to walk you through some of the online bioinformatics resources so that you know where to find data and software. There are many more good resources than we can possibly cover in a week. We hope that the review this week and the methods in previous weeks can provide you with a good map, a map that you can take with you for guidance when you embark on the exciting journey of bioinformatics research. Of course, studying a map and really going to a destination are two different things. Please try the resources yourself. Get your hands dirty, so to speak. 
Using the resources is where the real fun begins. In the first unit, I’ll try to give you an overview of bioinformatics resources. I don’t expect you to memorize every detail, as the lists are long. Instead I hope that you can get the flavor of it. I provide you with these long lists so that you have something to refer back to in your future research when you need to find data or software to address a particular biological problem. In Units 2 to 5, I’ll show you some example resources in a little more detail. The large number of bioinformatics resources may seem overwhelming and confusing at first glance. Don’t worry. The conceptual frameworks that I showed you earlier in the course can help bring some clarity. From the angle of “-informatics” in bioinformatics, the resources can be roughly divided into databases and software. Some databases contain original raw data, such as GenBank and dbSNP. Other databases contain secondary data, which are generated from original data by bioinformatics analyses and manual curation, such as the Gene Ontology. Software includes standalone programs that are run on the command line and web servers that have a web-based user interface. From the angle of “bio-” in bioinformatics, the databases and software address diverse biological problems from genotype to phenotype: DNA sequences such as genes and genomes, RNA sequences, protein sequences and structures, pathways, networks, diseases, and so on. From the organizational point of view, there are large centralized resources as well as individual databases and software tools. Let’s start with the centralized resources. They are like large shopping malls with lots of different stores, where you can get a large variety of different stuff in one stop. I will briefly mention three of them here and describe them in a little more detail in Units 2 to 4. 
One of the largest centralized bioinformatics resources is maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) in the US. This is the front page of NCBI. NCBI has lots of database resources covering DNA, RNA, proteins, domains and structures, expression, variation, literature, and so on, as well as software tools, especially sequence analysis tools. Another large centralized bioinformatics resource is the European Bioinformatics Institute (EBI). It has a large collection of resources similar to NCBI’s. A good example of a centralized genome resource is University of California, Santa Cruz (UCSC) Genome Bioinformatics, the main part of which is the Genome Browser. As of the end of 2013, it integrates over two hundred tracks of data onto the whole-genome sequences, including expression, variation, conservation, and so on. Each track consists of many experiments. The Center for Bioinformatics (CBI) at Peking University, where Dr. Gao and I work, is the first bioinformatics center in China. The faculty and students have developed and maintain dozens of bioinformatics databases and software tools. We are much smaller than NCBI, EBI, or UCSC, of course, and can’t call ourselves a centralized resource yet, but there are many useful tools on our web site, which will be briefly reviewed by a student presentation in a supplementary learning video. Other than the centralized resources, there are also thousands of individual bioinformatics databases and software tools. They are like boutique stores, each offering something unique. This slide shows some examples of individual resources. Tools included in the centralized resources are not repeated here. For example, if you want to predict the genes in a genome, you could use tools such as GENSCAN and Glimmer. 
You can find human genetic variation data in HGMD and cancer somatic mutations in COSMIC, and you can predict the functional effects of genetic variations using SIFT, PolyPhen, and SAPRED. If you are interested in studying expression regulation, you can find data on transcription factors and transcription factor binding sites in TRANSFAC and PlantTFDB, noncoding RNA families in Rfam, and microRNA annotations in miRBase. For epigenetic research, you can find data on DNA methylation in MethylomeDB. To study molecular pathways, you can find pathway data in the KEGG, PANTHER, BioCyc, and REACTOME databases, and use software tools such as KOBAS and DAVID. In addition, PID and STRING are very useful for studying protein interaction networks. Finally, to study evolutionary conservation, you can use tools such as GERP++ and PhyML. Because of the limited time, we didn’t cover enough of mass spectrometry proteomics analyses in this MOOC, so I’d like to show you some of the databases and tools on this slide. Other than the large mass spec data repository PRIDE at EBI, the Global Proteome Machine (GPM) also contains lots of mass spec data. There are two main types of methods to identify peptides from mass spec data. The first and most commonly used approach identifies peptides by searching existing databases of proteins; tools include Sequest, Mascot, ProteinProspector, pFind, PEAKS, Byonic, Proteome Discoverer, Spectrum Mill, MassLynx, and X!Tandem. What if your protein is a brand-new protein that has never been seen before? A database search wouldn’t help you much in this case. If you suspect that your protein of interest might be new, you need to use de novo peptide identification methods such as pNovo, PEAKS, and PepNovo. Finally, sometimes you want to know not only about the presence of a protein but also its quantity. Quantitation software such as MaxQuant and Census may help. Another area that we didn’t cover enough of is protein structure analysis. 
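Pathway tools such as KOBAS and DAVID are commonly used for enrichment analysis: given a list of genes of interest, they test whether a pathway’s members are over-represented in that list, typically with a hypergeometric (Fisher’s exact) test. The toy sketch below shows just that core calculation using only the Python standard library; the function name and parameters are my own illustration, not the actual interface of either tool.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """Hypergeometric upper-tail p-value P(X >= k).

    N genes in the background, K of them in the pathway,
    n genes in the study list, k of those fall in the pathway.
    A small p-value suggests the pathway is over-represented.
    """
    total = comb(N, n)
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    ) / total

# Hypothetical example: background of 20 genes, 5 in the pathway,
# and all 5 genes in our study list land in the pathway.
p = enrichment_pvalue(k=5, n=5, K=5, N=20)  # 1/15504, clearly enriched
```

Real tools add multiple-testing correction (e.g. Benjamini–Hochberg) on top of this per-pathway test, since many pathways are tested at once.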
On this slide I’ll show you some of the resources for protein 3D structures. The largest database of protein 3D structures is the Protein Data Bank, or PDB. It also contains 3D structures of nucleic acids and complexes. The speed at which DNA and protein sequences are determined far exceeds the speed at which protein structures can be determined. Fortunately, there are computational methods that can predict protein 3D structures from protein sequences with varying accuracy. The most accurate predictions can be obtained when there is another protein with a similar sequence whose structure is known. Examples of this type of homology modeling method include Modeller, SWISS-MODEL, I-TASSER, and so on. Sometimes, even though it is not possible to find another protein with high sequence similarity to your protein of interest, you can still use all the known structures as templates, “thread” your protein on top of each structure, and calculate an energy function to find the best fit. Fold recognition methods of this type include 3D-PSSM, Phyre2, PROSPECT, etc. Today many prediction methods integrate homology modeling with fold recognition and no longer make a clear distinction between the two. Finally, there are new proteins which may have a completely new and previously unseen fold. For these proteins you can only make template-free ab initio folding predictions by searching for possible folds in the fold space and calculating the free energy, using methods such as QUARK and Rosetta. As you can imagine, the difficulty and challenge increase from homology modeling to ab initio folding. Not surprisingly, the prediction accuracy decreases dramatically from homology modeling to ab initio folding. Try ab initio predictions when you have to, but be very careful with the results. 
Because protein structure prediction can be very computationally intensive, some bioinformatics groups have built predicted models of newly determined protein sequences and made the results freely available on the web. A good example is the SWISS-MODEL Repository, a database of homology models built by the SWISS-MODEL software for human and a number of model organisms. In addition to protein structures, there are also databases and tools such as Mfold and the PDB for nucleic acid structures and protein–nucleic acid complexes. Finally, RNA molecules may interact with each other; RNAhybrid is a tool for finding the minimum free energy hybridization of a long and a short RNA. In the past few years, next-generation sequencing technologies have been producing astronomical amounts of new and noisier data that defeat traditional sequence analysis methods. Where there are challenges, there are opportunities, and lots of new methods have been developed. For instance, to map reads to the reference genome you could use BWA or Bowtie for DNA sequences and TopHat for RNA sequences, and there are other useful utilities for quality checking, alignment processing, etc., such as GATK, FastQC, RNA-SeQC, SAMtools, and Picard. For de novo assembly of a genome without using the reference genome, you could use tools such as Velvet and SOAPdenovo. For de novo transcriptome assembly you could use Trinity or Velvet+Oases. For reference-based transcriptome assembly, the most commonly used tools include TopHat+Cufflinks. To visualize the reads on the whole-genome framework, there are genome browsers such as GBrowse, JBrowse, and IGV. For the important task of calling genetic variants from the aligned reads, different tools have been developed to call variants at different scales. To call SNPs there are tools such as GATK, SOAPsnp, and SAMtools. To call small indels, Pindel is a good choice. 
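Conceptually, SNP callers examine the bases of all reads aligned over each reference position and ask whether a non-reference allele is supported strongly enough to call. The toy Python sketch below illustrates this on a single pileup column. Be aware that the function name and thresholds are mine, for illustration only; real callers such as GATK and SAMtools use genotype-likelihood models and base-quality information, not a simple fraction cutoff.

```python
from collections import Counter

def call_snp(ref_base, pileup_bases, min_depth=10, min_alt_frac=0.2):
    """Toy single-site SNP call.

    pileup_bases: string of read bases aligned over one reference position.
    Returns the alternate allele, or None if coverage is too low or no
    non-reference allele passes the threshold.
    """
    counts = Counter(pileup_bases.upper())
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # too few reads to call anything
    # best-supported non-reference allele at this position
    alt, alt_count = max(
        ((base, n) for base, n in counts.items() if base != ref_base.upper()),
        key=lambda item: item[1],
        default=(None, 0),
    )
    if alt is not None and alt_count / depth >= min_alt_frac:
        return alt
    return None
```

A heterozygous site might look like `call_snp("A", "AAAAAGGGGG")`, which returns `"G"`; a column of pure reference bases returns `None`.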
To call CNVs and structural variations there are tools such as CNVnator, BIC-seq, SVMerge, mrCaNaVaR, ExomeCNV, CoNIFER, HMMcopy, Control-FREEC, etc. To find differentially expressed genes from RNA-seq data, you can use Cuffdiff or DESeq. To call peaks in ChIP-seq data, MACS is a good choice. There is a large variety of other useful individual resources as well. For instance, there are dedicated databases for almost every model organism. A few examples, which are by no means exhaustive, include FlyBase, WormBase, ZFIN for zebrafish, TAIR for Arabidopsis, etc. Some large-scale projects have their own databases, such as the cancer databases TCGA and CGP, epigenetic data from the Roadmap Epigenomics Project, and the brain structure and connectivity databases Allen Brain Atlas and Human Connectome Project. There are a number of tools that assist wet-lab experiments; for instance, to assist primer design there are tools such as Primer3Plus and Electronic PCR. Another type of very useful resource is the software programming libraries that contain modules you can download and directly incorporate into your own programs. EMBOSS has lots of sequence analysis tools that you can download and use. Bioconductor distributes R packages. BioPerl has lots of useful Perl modules, and Biopython has lots of Python modules. These readily available modules can make your programming tasks much easier, so before you spend a lot of time writing a program, make sure to check out these resources first. Finally, there are tools that help you manage workflows, such as Galaxy and Taverna. It is important to mention that the journal Nucleic Acids Research publishes an annual Database Issue every January. In the 2013 issue, about 90 new databases were reported and a similar number of databases had significant updates. This annual issue is a place where bioinformaticians from around the world show off their latest exciting databases. 
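Primer-design tools like the Primer3Plus mentioned above must, among other things, estimate the melting temperature (Tm) of each candidate primer. For short oligos, the classic Wallace rule gives a rough estimate: Tm ≈ 2 °C per A/T plus 4 °C per G/C. Real tools use more accurate nearest-neighbor thermodynamic models, but the rule is a handy sanity check. A minimal sketch:

```python
def wallace_tm(primer):
    """Rough melting temperature (deg C) of a short oligo (< ~14 nt)
    by the Wallace rule: 2 degrees per A/T, 4 degrees per G/C."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc

tm = wallace_tm("ATGCATGC")  # 2*4 + 4*4 = 24 deg C
```

The rule breaks down for longer primers, which is one reason production tools switched to nearest-neighbor models with salt corrections.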
We read this free issue every year and follow the links to check out the databases. I suggest that you do so as well. Following the success of the Database Issue, Nucleic Acids Research initiated an annual Web Server Issue, published every July. I suggest that you read this issue every July to keep up with the latest exciting web servers. All of these thousands of useful resources have been developed by bioinformaticians around the world, and they have made a significant impact on the life sciences. In the next few units, I will show you some examples of these resources in a little more detail, starting with NCBI, the National Center for Biotechnology Information. See you at the next unit!