===============================================================
INSTALLATION AND CONFIGURATION OF THE EST2uni ANALYSIS PIPELINE
===============================================================

1.- Download EST2uni:

    > wget http://bioinf.comav.upv.es/est2uni/est2uni_0.27.tar.gz

2.- Unpack:

    > tar -xvzf est2uni_0.27.tar.gz

3.- Move the perl directory to a convenient place:

    > mv perl /usr/local/est2uni

4.- Create the environment variable EST2UNI_PERL and set it to the
    installation directory:

    > echo "export EST2UNI_PERL=/usr/local/est2uni" >> ~/.bashrc

5.- Add that path to your PERL5LIB environment variable (the single
    quotes keep $PERL5LIB from being expanded before the line is
    written to ~/.bashrc):

    > echo 'export PERL5LIB=$PERL5LIB:/usr/local/est2uni' >> ~/.bashrc

6.- Add that directory to your PATH as well:

    > echo 'export PATH=$PATH:/usr/local/est2uni' >> ~/.bashrc

7.- Open a new shell (or run 'source ~/.bashrc') and check that the
    executable is found and that it works correctly:

    > est2uni

----------------------------------------------------------------------
MySQL CONFIGURATION
----------------------------------------------------------------------

1.- The Perl script needs a MySQL account with privileges to create and
    drop databases on the host where the database resides (e.g., user
    'estpipe', password 'secure', host 'localhost'):

    > mysql -u root
    mysql> GRANT ALL PRIVILEGES ON *.* TO 'estpipe'@'localhost' IDENTIFIED BY 'secure';

2.- Check that you can connect to the MySQL server with that user and
    password:

    > mysql -u estpipe -psecure

    (NOTE: With some versions of MySQL and DBI, authentication problems
    may appear. They are solved by using the OLD_PASSWORD function to
    set the MySQL passwords.)

3.- The Perl script connects to the MySQL server through the Perl DBI
    interface, so the user and password should be placed in the
    environment variables DBI_USER and DBI_PASS:

    > echo "export DBI_USER=estpipe" >> ~/.bashrc
    > echo "export DBI_PASS=secure" >> ~/.bashrc

----------------------------------------------------------------------
GENERAL ANALYSIS CONFIGURATION
----------------------------------------------------------------------

1.- Create a directory for the analysis:

    > mkdir test
    > cd test

2.- Create the library directory inside the main analysis directory:

    > mkdir libraries

3.- Put the EST sequences and/or chromatograms in that directory.
    Every file or subdirectory in that directory is treated as a
    separate library. So put the chromatograms of each library in
    their own subdirectory inside 'libraries' (named after the
    library), and/or put the FASTA sequences of each library in a
    multiFASTA file inside 'libraries' (named after the library).
    If you also have quality scores for the FASTA sequences, you can
    put them in files with the same name as the corresponding sequence
    file but with the extension .qual.

    There are some example libraries in the 'test_data' directory of
    the EST2uni distribution. You can simply copy those libraries to
    check the software. Be aware that the sgn chromatograms may need a
    modification in the phredpar.dat file (see the installation
    instructions for phred in docs/installation_external_software.txt).

4.- Copy the est_pipe.conf.example file from the installation
    directory into the analysis directory:

    > cp /usr/local/est2uni/est_pipe.conf.example est_pipe.conf

5.- Edit est_pipe.conf to adapt the analysis parameters and options to
    your local needs. This is the main configuration file of the
    analysis. All the parameters are explained and configured there.
    It is highly recommended that you read it and customize the
    analysis to your needs.
    If you just want to run a minimal first analysis, you can edit only
    the 'analysis_home' and 'estscan_matrix_dat' parameters and run the
    analysis; everything else can be added later.

6.- Run the est2uni script:

    > est2uni

----------------------------------------------------------------------
SPECIFIC CONFIGURATION FOR THE DIFFERENT ANALYSIS METHODS
----------------------------------------------------------------------

LUCY CLEANING
-------------

Lucy can remove known vector sequence, but it needs a file with the
vector FASTA sequence and a file defining where the insert is located.
These files should be defined for each library. EST2uni looks for the
paths of these files in the file given by the 'assignments_dat'
parameter. When this parameter is not set, the Lucy analysis is not
done. When set, EST2uni expects a file with the following format:

    assignment
    library: fake
    vector file: /path/to/the/vector/file/in/fasta/format/fake_vector.fasta
    vector splice site file: /path/to/the/vector/file/in/fasta/format/fake_splice.fasta

These four lines should be defined in the assignments_dat file for
each library to be cleaned with Lucy. Please check the Lucy
documentation for how the splice site file should be defined. There is
an example of these files in the test_data/data directory.

RFLP MARKERS
------------

RFLP markers can be related to the ESTs through the clone name. They
are described in the file defined by the 'rflp_dat' parameter. They
are added to the database if the parameters 'do_prior_data_population'
and 'populate_rflp_markers' are set to 1. An example of this file
format is:

    line 1 -> rflp_name,clone,enzyme,researcher,bibliographic_ref,institute
    line 2 -> MC022,MC022,none,moliver,none,IRTA
    line n -> ...

PCR MARKERS
-----------

PCR markers are defined by a pair of primers and a name. That
information must be located in the file defined by the parameter
pcr_marker_dat.
This file is added to the database if the parameters
'do_prior_data_population' and 'populate_pcr_markers' are set to 1.
An example of this file format is:

    line 1 -> name,researcher,method,fwd_primer_name,fwd_primer_sequence,rev_primer_name,rev_primer_sequence
    line 2 -> pcr_marker_23,jblanca,snp,23f,AATTTCTGATGAACTACT,23r,GTCGTCTTCATCATCAGAT
    line 3 -> pcr_marker_27,jblanca,snp,27f,CAATCACTTTAAGATTATTTC,27r,ATGCATTTGGTTCACCTA

Once the PCR markers are in the database, an 'in silico' PCR
experiment can be run if the 'do_ipcress_annot' parameter is set to 1.
The relationships found are stored in the database and can be browsed
on the web page. The experiment is done using the ipcress executable
from the exonerate package.

EST ASSEMBLY
------------

ESTs are assembled to get the unigene set and consensus sequences
using either tgicl (http://www.tigr.org/software/other.shtml) or cap3
(http://seq.cs.iastate.edu), according to the parameter 'clusterer'.
Command line options for tgicl and cap3 can be defined in the
'tgicl_options' and 'cap3_options' parameters, respectively, of the
configuration file. If set, they must be enclosed in single quotation
marks. The cap3 options used when running tgicl can be defined in
either parameter (if both are defined, 'tgicl_options' takes
precedence when running tgicl). If these parameters are set to 'none',
default options are used (check the help of the respective program).

MICROARRAY CLONE SELECTION
--------------------------

EST2uni can select a representative clone for each unigene to be
printed on a microarray. That analysis is done when the
'do_est_micro_sel' parameter is set to 1. Some of the default
parameters (minimum EST length, minimum percentage of EST sequence
represented in the unigene consensus sequence, maximum GC percentage,
etc.) can be tuned. You can also forbid some clones from being chosen
and/or force (obligate) others to be chosen.
The lists of forbidden and obligate clones should be in the files
defined by the parameters 'est_forbid_clones_dat' and
'sel_obligate_clones_dat'. If you want to use these lists you should
also set 'populate_est_sel_forbid_clones' and
'populate_est_sel_obligate_clones' to 1. There are examples of these
files in the 'test_data' directory.

SUPERUNIGENE ANNOTATION
-----------------------

Although the clustering software may consider two unigenes not similar
enough to be merged, it is useful to know which unigenes are similar
to each other. Accordingly, EST2uni clusters the unigenes, and these
clusters are named 'superunigenes'. This analysis is done through a
BLAST search, followed by parsing of the results. Unigenes whose
similarity is above user-defined threshold parameters are clustered
into a superunigene. To enable this analysis, set the
'do_super_uni_annot' parameter to 1. Write permission is required on
the directory used as the BLAST-formatted database repository (set by
the 'blast_dir' parameter or the BLASTDB environment variable).

DATABASES DEFINITION FILE
-------------------------

In EST2uni a database can have several aspects, such as the BLAST
indexed files, the FASTA sequence files or the related GO files. This
information should be placed in the file referenced by the parameter
databases_dat. This file is loaded into the database by setting the
parameters populate_database_table and do_prior_data_population to 1.

In that file, the name of the database is defined in the field "name"
and can differ from the BLAST name. For instance, the database name
could be "arabidopsis" and the local BLAST name could be "tair6". When
a BLAST against the database "arabidopsis" is requested, EST2uni looks
up the local BLAST name in the database. In this example the BLAST
should be requested against "arabidopsis", not against "tair6", even
though the local BLAST database is named "tair6".
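
The name lookup described above can be pictured with a short Python
sketch. It is an illustration only, not EST2uni code: the dictionary
standing in for the databases table and the function
resolve_blast_name are hypothetical, and only the
"arabidopsis"/"tair6" pair comes from the example in the text.

```python
# Hypothetical stand-in for the databases table:
# EST2uni database name -> local BLAST name.
DATABASES = {"arabidopsis": "tair6"}


def resolve_blast_name(requested, databases=DATABASES):
    """Return the local BLAST name registered for a database name.

    Raises KeyError when the request uses a name that is not defined,
    for example the local BLAST name itself.
    """
    if requested not in databases:
        raise KeyError(f"database '{requested}' is not defined")
    return databases[requested]


if __name__ == "__main__":
    # The request is made with the EST2uni name, not the BLAST name.
    print(resolve_blast_name("arabidopsis"))
```

Requesting "tair6" directly raises an error in this sketch, which
mirrors the rule that searches are requested by database name.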

BLAST ANNOTATION
----------------

Unigenes can be blasted against several databases selected by the
user. The databases should be available in a local BLAST installation.
The selected databases should be specified in the blast_dbs parameter
as a comma-separated list, for instance:

    blast_dbs=arabidopsis,unigenes

Each name should match one of the databases defined in the file given
by the databases_dat parameter. This requirement does not apply to the
unigenes database. In this example the arabidopsis database should be
defined in the databases_dat file. In that file the field
local_blast_name should be the name given to that database in the
local BLAST installation; for instance, it could be named tair6. That
means that the name in the blast_dbs parameter and the name of the
database in the local BLAST installation can differ.

Another important field in the databases file is "kind", which should
be set to dna or pep. If a BLAST database needs to be searched with a
specific BLAST program (blastn, blastp, blastx, tblastn, or tblastx),
it should be indicated in the field "blast_program". If empty, the
appropriate program is guessed from the field "kind" (see above).

If you want the reciprocal BLAST annotation, the field local_seq_file
in the databases file should be set to the multiFASTA file containing
the sequences of the database. If link_pre and/or link_post are
defined in the databases file, they are used to create the links for
the sequences belonging to the external databases.

It is recommended to set the blast_dir parameter to the directory
where the BLAST databases are installed.

To load the databases file information into the database, the
parameters do_prior_data_population and populate_database_table should
be set to 1 at least once. Finally, do_est_annotation and
do_blast_annot should be set to 1 whenever the BLAST annotation should
be done.
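
The naming rule for blast_dbs can be checked before a run with a small
Python sketch. It is not part of EST2uni: the function
undefined_blast_dbs and the way the defined names are passed in (a
plain set) are assumptions made for illustration; the exemption for
the unigenes database follows the text above.

```python
def undefined_blast_dbs(blast_dbs, defined_names):
    """Return the names in a blast_dbs value that are not defined.

    blast_dbs     -- the comma-separated value, e.g. "arabidopsis,unigenes"
    defined_names -- names found in the "name" field of the databases file
    The "unigenes" database is exempt from the requirement.
    """
    requested = [name.strip() for name in blast_dbs.split(",") if name.strip()]
    return [
        name
        for name in requested
        if name != "unigenes" and name not in defined_names
    ]


if __name__ == "__main__":
    # "tair6" is a local BLAST name, not a database name, so it is reported.
    print(undefined_blast_dbs("arabidopsis,tair6,unigenes", {"arabidopsis"}))
```

An empty result means every requested database is either defined or is
the unigenes database.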

Once the BLAST searches against the external databases are done, the
results are stored in the database. If the same BLAST is requested
again it is not rerun; the stored results are retrieved from the
database instead. To avoid this behaviour, set the
do_clean_blast_results parameter to 1 and all the BLAST results will
be removed from the database.

The BLAST results are used to annotate the sequences with a sensible
description. The databases to be considered for this annotation should
be specified in the blast_dbs_annot_order parameter as an ordered,
comma-separated list. The first one is scanned for a non-anonymous hit
with a good e-value; if none is found the second one is scanned, and
so on. The list of terms that mark a hit as anonymous is specified in
the parameter unknown_tags.

RECIPROCAL BLAST ANNOTATION
---------------------------

The reciprocal BLAST annotation can be used to look for orthologs.
Besides the usual BLAST search of the unigenes against the external
database, a search of the external database against the unigenes is
also carried out. The results are compared and the sequences with
reciprocal top hits are named orthologs. This search can also be used
to look for synonymous sequences in different databases of the same
species.

The multiFASTA file with all the sequences of the external database
should be defined in the local_seq_file field of the databases file
given by the databases_dat parameter. In the same file, rec_sim_type
should be set to ortholog or synonymous according to the kind of
relationship between the unigenes database and the external database.

Several external databases can be used in this annotation. They should
be listed, comma-separated, in the reci_dbs parameter.
The names of the databases should be defined in the databases
information file.

COMPLETE CDS ANNOTATION
-----------------------

The aim of this annotation is to identify unigenes with an ORF
covering the N-terminus of the encoded protein. Unigenes are annotated
by comparing them to a reference set of presumably full-length
proteins (e.g. Arabidopsis proteins). Whenever a protein encoded by a
unigene is found by BLAST to be similar to the N-terminus of one of
the proteins in the reference set, the unigene is marked as full
length.

To carry out this annotation the databases information file should be
set up as explained in the BLAST annotation section, the
do_complete_cds_annot parameter should be set to 1, and the reference
database should be set in the parameter blast_dbs_annot_complete_cds.

CONTAMINATION ANNOTATION
------------------------

Contaminant sequences can be tagged using BLAST. To do this, the
unigenes are blasted against a single reference database with
sequences from both the contaminant organism and the non-contaminant
one. For instance, if you are looking for ESTs coming from a fungus in
a plant EST library, the reference database should be created by
merging a fungal database and an Arabidopsis database. EST2uni
distinguishes the two kinds of sequence by accession name: in that
database all the contaminant sequence accessions should begin with
'c', and the non-contaminant host sequence accessions should begin
with 'h'. The name of this database should be set in the
contamination_db parameter. Unigenes more similar to the contaminant
sequences are marked as contamination. Unigenes equally similar to
both are not considered contamination.

GO ANNOTATION
-------------

The GO annotation is done by comparing the sequences with a reference
organism previously annotated with GO. There are several steps
involved:

- Creation of the slimmed GO annotation for the reference organism.
- Preparation of the tables needed for the annotation:
    - Loading of the reference GO annotation into a temporary database.
    - Loading of the slimmed and standard GO annotation into the
      temporary database.
    - Loading of the GO text definitions into the database.
- Blasting of the sequences against the reference organism.
- GO annotation of the sequences taking into account the BLAST result.

The preparation and loading of the slimmed GO annotation need to be
done just once for each database. You don't need to rerun this part of
the analysis every time. To skip this part of the pipeline, set the
do_go_preparation parameter to 0.

The sequences are annotated by copying the GO annotation of the first
BLAST hit in the search against the reference organism.

Several external files are needed to do the annotation:

- The GO ontology definition from the Gene Ontology site. The file is
  named gene_ontology.obo. The location of this file should be set in
  the go_def_dat parameter.
- The GO Perl package should be installed, especially the map2slim
  utility.
- The GO association file for the database selected as reference, and
  the GO slim obo file for the same database. The paths for these
  files should be specified in the local_go_assoc_file and
  local_go_slim_obo_file fields of the databases information file.
  The GO association file for the Arabidopsis database can be
  downloaded from the www.arabidopsis.org site. At the time of this
  writing the file is located at:
      ftp://ftp.arabidopsis.org/home/tair/Ontologies/Gene_Ontology/ATH_GO_GOSLIM.20061021.txt
  The GO slim obo file for arabidopsis and goa is on the Gene Ontology
  ftp:
      ftp://ftp.geneontology.org/pub/go/GO_slims/go_slim_plant.obo
- The GO terms and ids definition file (GO.terms_and_ids) can be
  downloaded from the Gene Ontology site. The path to this file should
  be set in the go_id_text_dat parameter.

Arabidopsis databases sometimes have a dot-number extension in their
gene names.
If you are using Arabidopsis as the reference for the GO annotation,
that extension may exist in the GO association file but not in the
local Arabidopsis BLAST database used for the GO annotation. In that
case, make sure that the arabi_hack field in the databases info file
is set to 1. Leave it empty if the GO association file and the local
Arabidopsis BLAST database both have, or both lack, the dot-number
extension.

HMMER ANNOTATION
----------------

You just need to set the parameter do_hmmer_annot to 1 and to set the
name of the HMMER database in the parameter pfam_db. HMMER uses the
ORFs as a starting point, so the annotated ORFs must already be in the
database, or the do_estscan_annot analysis must be requested as well.

EST MARKER RELATIONSHIP
-----------------------

Genetic markers and ESTs are related through several mechanisms,
depending on the nature of the marker. RFLPs and ESTs are related
using the clone information, and PCR markers are associated with
unigenes based on the result of an in silico PCR performed by ipcress.
There is another way of associating ESTs and genetic markers: the user
can establish an arbitrary relationship between them using a CSV file.
That file should be defined in the parameter marker_est_dat. Each line
of this file should relate one EST-marker pair. This information is
loaded into the database when populate_marker_est_table and
do_prior_data_population are set to 1.

EST RESEARCHER RELATIONSHIP
---------------------------

Every researcher can be associated with several ESTs using the CSV
file defined in the working_on_dat parameter. Those relationships are
loaded when the parameters populate_workin_on_table and
do_prior_data_population are set to 1.
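
Several sections of this document pair a populate_* parameter with
do_prior_data_population. A minimal Python sketch of that consistency
check follows. It is not EST2uni code: the function name and the
dictionary used to hold the parameters are hypothetical, and nothing
is assumed here about how est_pipe.conf is actually parsed.

```python
# populate_* parameters that this document says are loaded only together
# with do_prior_data_population = 1 (names copied as written in the text).
PRIOR_DATA_FLAGS = (
    "populate_rflp_markers",
    "populate_pcr_markers",
    "populate_database_table",
    "populate_marker_est_table",
    "populate_workin_on_table",
)


def populate_flags_missing_prior(params):
    """Return populate_* flags set to 1 while do_prior_data_population is not.

    params is a hypothetical mapping of parameter name to value; values
    are compared as strings so that 1 and "1" are treated alike.
    """
    if str(params.get("do_prior_data_population", "0")) == "1":
        return []
    return [flag for flag in PRIOR_DATA_FLAGS if str(params.get(flag, "0")) == "1"]


if __name__ == "__main__":
    # do_prior_data_population is absent, so the PCR flag is reported.
    print(populate_flags_missing_prior({"populate_pcr_markers": "1"}))
```

A non-empty result points at data files that, according to the
sections above, would not be loaded until do_prior_data_population is
also set to 1.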