SAT : SSR Analysis Tool - Help
Pipeline
Introduction
This pipeline was designed to analyse nucleotide sequences obtained from SSR (Simple Sequence Repeat) enriched libraries.
It processes a set of sequences from the original chromatograms through to consensus sequences associated with SSR primer pairs.
This software combines various bioinformatics tools such as Phred, Lucy, Phrap, CAP3 and e-Primer3 with scripts of its own. It can thus successively eliminate vector portions, low-quality sequences and chimeric sequences, assemble the forward and reverse sequencing products into one consensus sequence, assemble similar sequences into contigs, search for SSR positions and finally design SSR primer pairs.
Scientists who use this pipeline only need to choose analysis tools and parameters, and then the software will present the detailed information.
Steps
The process is a pipeline that runs through the analysis modules in sequence.
It consists of a dozen tools or modules for designing SSR primers from the initial chromatograms, and calls several external programs.
The automated process chain comprises 8 fundamental analysis steps, but users are not forced to run the whole pipeline and can choose which steps to perform :
Base-calling : This step is performed by the Phred program (about Phred).
It translates sequencing chromatograms to nucleotide sequences.
Data are read from chromatogram files, in SCF or ABI formats, and sequences are written in a multiFASTA file.
At the same time, it assigns a confidence index to each peak, written to a quality file, reflecting the probability that the called base is the right one.
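The relation between a Phred quality value and the error probability it encodes is conventionally Q = -10 log10(P). The sketch below illustrates that conversion; the function names are hypothetical and are not part of the pipeline.

```python
import math


def phred_to_error_probability(quality):
    """Estimated probability that a base with this Phred quality is wrong."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred quality value corresponding to an error probability."""
    return -10 * math.log10(probability)
```

For example, a quality value of 20 corresponds to an estimated 1 error in 100 base calls, and a value of 30 to 1 error in 1000.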
Vector removal : This step is performed by the Lucy software. (about Lucy).
It consists of identifying the position of the cloning vector and adapters in the sequence (the sequence of the vector and adapters can be provided by the user), and removing them to keep only the sequence of interest.
In addition to detecting the position of the vector, this software also eliminates low-quality regions according to the quality data provided by Phred.
Lucy takes as input a Multifasta sequence file together with the corresponding quality file, and provides sequence and quality files as output.
Users can change the quality threshold of filtered sequences.
Restriction digestion : This step performs a virtual restriction digestion with the enzyme(s) used for library construction, in order to solve the problem of chimeras or partial digestion.
After digestion, sequences are renamed (identifier extension) to differentiate the different restriction fragments.
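A virtual digestion of this kind can be sketched as follows. This is only an illustration, not the pipeline's own script: the function name, the `cut_offset` parameter and the `_1`, `_2` suffix used as the identifier extension are assumptions.

```python
def virtual_digest(seq_id, sequence, site, cut_offset=0):
    """Cut `sequence` at every occurrence of the recognition `site`.

    `cut_offset` is the position of the cut inside the site (hypothetical
    parameter). Fragments are renamed by appending _1, _2, ... to the
    identifier; this suffix format is an assumption.
    """
    sequence = sequence.upper()
    site = site.upper()
    cut_points = []
    start = sequence.find(site)
    while start != -1:
        cut_points.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cut_points + [len(sequence)]
    # Keep only non-empty fragments between consecutive cut points.
    fragments = [sequence[b:e] for b, e in zip(bounds, bounds[1:]) if e > b]
    return [(f"{seq_id}_{i}", frag) for i, frag in enumerate(fragments, start=1)]
```

With a blunt cutter such as RsaI (GT^AC), `virtual_digest("clone1", "AAAGTACTTTGTACCC", "GTAC", cut_offset=2)` yields three fragments named clone1_1, clone1_2 and clone1_3.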
Microsatellite search: The SSR identification makes it possible to select and keep the fragments that contain repeated sequences.
The pipeline can use either the SSRIT script, developed in Perl by Temnykh (http://www.gramene.org/db/searches/ssrtool), which lets users choose the minimal number of repeats for each pattern (di-, tri- or tetranucleotide), or the SPUTNIK software (http://espressosoftware.com/pages/sputnik.jsp), which can detect interrupted SSRs.
The program provides the following information for each sequence : SSR number, pattern length, pattern type, number of repeats, start and end positions, sequence length.
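As an illustration of a search for perfect repeats with a minimal repeat count per pattern length, here is a minimal regular-expression sketch. It is not the SSRIT or SPUTNIK code; the function names and the default thresholds are hypothetical, and interrupted SSRs are not handled.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(sequence, min_repeats=None):
    """Report perfect di-, tri- and tetranucleotide repeats.

    `min_repeats` maps pattern length to the minimal number of repeats;
    the default values below are example values only.
    """
    if min_repeats is None:
        min_repeats = {2: 5, 3: 4, 4: 3}
    sequence = sequence.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if not is_primitive(motif):
                continue  # e.g. AAAA, or ACAC which is a dinucleotide repeat
            hits.append({
                "motif": motif,
                "motif_length": motif_len,
                "repeats": len(match.group(0)) // motif_len,
                "start": match.start() + 1,  # 1-based, inclusive
                "end": match.end(),
                "sequence_length": len(sequence),
            })
    hits.sort(key=lambda hit: hit["start"])
    return hits
```

Each hit carries the same kind of information as listed above: pattern, pattern length, number of repeats, start and end positions, and sequence length.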
Microsatellite masking: This step replaces SSR nucleotides with N, and the corresponding sequencing quality values with 10 in the quality file.
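The masking described here can be sketched in a few lines. The function name and the representation of SSR positions as 1-based inclusive (start, end) pairs are assumptions made for the example.

```python
def mask_ssrs(sequence, qualities, ssr_ranges, mask_quality=10):
    """Replace SSR bases with N and their quality values with `mask_quality`.

    `ssr_ranges` is a list of (start, end) positions, 1-based and inclusive
    (assumed convention).
    """
    bases = list(sequence)
    quals = list(qualities)
    for start, end in ssr_ranges:
        for i in range(start - 1, end):
            bases[i] = "N"
            quals[i] = mask_quality
    return "".join(bases), quals
```

For instance, masking positions 3 to 8 of TTACACACGG gives TTNNNNNNGG, with the six corresponding quality values set to 10.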
Clustering : Clustering is done by the d2_cluster program (included in the Stackpack pipeline).
d2_cluster uses an agglomerative algorithm to quickly determine the relative similarity of a large set of sequences, and groups the closest sequences into clusters (Stackpack denomination: cl). It identifies sequences having more than 96% identity over a 150-base window.
Microsatellites are masked for the clustering step to prevent them from influencing the sequence grouping, and are then unmasked to run the assembly. If users decide to skip the clustering step and run only the alignment, microsatellites will not be masked.
Alignments : Alignments can be carried out either by the Phrap program (included in the Stackpack pipeline), or by CAP3.
These are DNA sequence assembly programs that align the sequences grouped by the clustering in order to generate contigs (Stackpack denomination: ct) representative of the original cloned sequences. The alignment also refines the grouping by separating out individual sequences : singletons.
The simplest configuration is that of 2 sequences, forward and reverse, grouped within a cluster.
In the case of an alignment by Phrap, the contigs produced are kept only if the subsequent analysis by Craw accepts them : the resulting sequences are called consensus sequences (Stackpack denomination: cn).
In the case of an alignment by CAP3, the analysis step is not performed (consensus Cap3 denomination: cl1Contig2).
Phrap has the advantage of being already integrated into a pipeline devoted to assembly, which provides an additional alignment analysis stage by Craw. However, this pipeline, developed for EST assembly, is not perfectly suited to the study of SSRs. Although it gives correct assembly results, its fixed parameter settings seem much too stringent and separate very similar sequences within the same cluster.
Cap3 is less stringent and provides a correct assembly, but it has to be launched separately for each cluster and it runs more slowly.
Biologists can choose the assembly tool best suited to their needs.
Primer design: This step is executed by the e-Primer3 program from EMBOSS. (about ePrimer3).
Primers are designed with a defined set of constraints, such as oligonucleotide melting temperature, length, GC content, PCR product size and positional constraints around the SSR, to identify the optimal forward and reverse primers for the SSR flanking region.
In addition, if no primer pair is found, an option reports the reason for the failure.
Note that a new SSR identification step must be run on the consensus sequence before performing the PCR primer design. If the consensus sequence contains several SSRs, each of them is given as a target to the program to define several local SSR primer pairs, and the entire SSR zone is also used to design a global primer pair.
Virtual PCR : The whole set of primers is blasted against all the final sequences (consensus and singletons).
If both the forward and reverse primers match a sequence other than the one on which they were defined (we fixed a threshold of 80% hybridization), the amplification is no longer locus-specific. Further analysis and alignment of the sequences concerned is then necessary.
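The specificity rule above can be illustrated with a small filter over BLAST hits. This is a sketch under stated assumptions: the hits are already parsed into (primer, subject, identity) tuples, the 80% threshold is expressed as an identity fraction, and all names are hypothetical.

```python
def off_target_sequences(pair, hits, threshold=0.80):
    """Sequences, other than the source, matched by both primers of a pair.

    `pair` is a dict with keys 'forward', 'reverse' and 'source'.
    `hits` is a list of (primer_id, subject_id, identity) tuples, where
    identity is a fraction between 0 and 1 (assumed representation).
    """
    def subjects(primer_id):
        return {
            subject
            for primer, subject, identity in hits
            if primer == primer_id
            and subject != pair["source"]
            and identity >= threshold
        }

    return subjects(pair["forward"]) & subjects(pair["reverse"])
```

An empty result means the pair is considered locus-specific under this rule; any sequence returned would need the further analysis mentioned above.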
Parameters
Project : Enter a brief one-word project name. (use "_" instead of " ").
Archive : You have to create an archive (with extension .tar, .zip or .tar.gz) containing all your chromatograms.
This archive will be uploaded to the server, then extracted to be analysed by the program.
Sequences : You have to upload your Multifasta sequence and quality files. The quality file is optional if the vector removal has already been done; otherwise you have to provide it.
Note also that if you want to perform the vector removal step, each identifier in the sequence and quality files must end with the extension .SCF, .scf, .ab1 or .AB1, and must not contain special characters such as "|", "/"...
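Before uploading, the identifiers of a Multifasta file can be checked against these rules with a short script such as the one below. It is a sketch only: the function name is hypothetical, and the set of forbidden characters contains just the two examples given above, since the full list is not specified.

```python
ALLOWED_EXTENSIONS = (".SCF", ".scf", ".ab1", ".AB1")
# Only the two characters named in the text; the full list is not specified.
FORBIDDEN_CHARACTERS = set("|/")


def invalid_identifiers(fasta_text):
    """Return the identifiers that break the naming rules for vector removal."""
    bad = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        parts = line[1:].split()
        identifier = parts[0] if parts else ""
        has_extension = identifier.endswith(ALLOWED_EXTENSIONS)
        has_forbidden = any(c in FORBIDDEN_CHARACTERS for c in identifier)
        if not has_extension or has_forbidden:
            bad.append(identifier)
    return bad
```

An empty list means every header line passed both checks.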
Vector file : This file must contain the cloning vector sequence, including adapters, in Fasta format.
An example of pgemT vector is available here.
Vector splice file : This file must contain two splice site sequences before and after the insertion point on the vector.
An example of pgemT splice site sequences is available here.
Quality threshold for Lucy : Lucy can remove bad-quality sequences, but you have to give it the thresholds beyond which it considers a sequence to be of bad quality.
The first parameter to give is the maximum acceptable average probability of error (calculated from the quality values assigned by PHRED) over the final clean range.
The second parameter is the maximum probability of error that is allowed for the 2 bases at each end of the final clean range.
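To make the two thresholds concrete, the sketch below tests whether a given clean range satisfies them, using the conventional Phred relation P = 10^(-Q/10). It does not reproduce Lucy's trimming algorithm; the function name is hypothetical and the threshold values in the usage note are example values only.

```python
def clean_range_passes(qualities, max_average_error, max_end_error):
    """Check a clean range against the two error thresholds.

    `qualities` holds the Phred quality values of the clean range.
    """
    if not qualities:
        return False
    probabilities = [10 ** (-q / 10) for q in qualities]
    average_ok = sum(probabilities) / len(probabilities) <= max_average_error
    # The 2 bases at each end of the range are checked individually.
    end_bases = probabilities[:2] + probabilities[-2:]
    ends_ok = all(p <= max_end_error for p in end_bases)
    return average_ok and ends_ok
```

With example thresholds of 0.025 and 0.02, the range with qualities [30, 30, 20, 20, 30, 30] passes, whereas [10, 30, 30, 30, 30, 30] fails because its first base has an error probability of 0.1.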
Primer Tm : Minimum, Optimum, and Maximum melting temperatures (°C) for a primer. e-Primer3 will not pick oligonucleotides with temperatures smaller than Min or larger than Max, and, with default conditions, will try to pick primers with melting temperatures close to Opt.
Max Tm difference : Maximum acceptable (unsigned) difference between the melting temperatures of the forward and reverse primers.
Primer GC% : Minimum, Optimum, and Maximum percentage of Gs and Cs in any primer.
Primer length : Minimum, Optimum, and Maximum lengths (in bases) of a primer oligo. e-Primer3 will not pick primers shorter than Min or longer than Max, and with default arguments will attempt to pick primers with size close to Opt. Min cannot be smaller than 1. Max cannot be larger than 36. (This limit is governed by maximum oligo size for which melting-temperature calculations are valid.) Min cannot be greater than Max.
Max polyX : The maximum allowable length of a mononucleotide repeat, for example 6 for AAAAAA.
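The length, GC% and polyX constraints can be illustrated with a simple check on a candidate primer. This is not e-Primer3's own logic: the function name and default limits are example assumptions, and the melting temperature is left out because it depends on the thermodynamic model used.

```python
import re


def check_primer(primer, min_len=18, max_len=27,
                 min_gc=20.0, max_gc=80.0, max_polyx=5):
    """Return the list of constraints that the primer violates.

    All default limits are example values, not the pipeline's settings.
    """
    primer = primer.upper()
    if not primer:
        return ["length"]
    problems = []
    if not min_len <= len(primer) <= max_len:
        problems.append("length")
    gc_percent = 100.0 * (primer.count("G") + primer.count("C")) / len(primer)
    if not min_gc <= gc_percent <= max_gc:
        problems.append("GC%")
    # Longest run of a single repeated nucleotide, e.g. 6 for AAAAAA.
    longest_run = max(len(m.group(0)) for m in re.finditer(r"(.)\1*", primer))
    if longest_run > max_polyx:
        problems.append("polyX")
    return problems
```

An empty list means the primer satisfies all three constraints; a primer starting with AAAAAAA would be reported with "polyX" under the example limit of 5.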
Entry and exit in the pipeline
The pipeline lets users select the steps they wish to perform, and therefore enter or exit the pipeline at an earlier or later point.
Users can stop wherever they wish; however, they must enter the pipeline at one of the following points:
- Start at the beginning with base-calling. They will have to provide as input an archive or a zip file containing all the chromatograms to analyse.
- Start with the sequence cleaning if they already have the raw sequences. They will then have to provide the sequence and corresponding quality files.
- Start after cleaning to perform restriction digestion. They will then have to provide the cleaned sequence file and, optionally, the corresponding quality file.
- Start after restriction digestion. They will likewise have to provide the sequence file and, optionally, the corresponding quality file.
User management : To launch the pipeline and integrate your results into the database, or to access the data in the database, you have to be registered as a user. To create an account, please contact the administrator : manuel.ruiz@cirad.fr
Once registered, you will have access only to your projects.
Further analysis : Following the virtual PCR, the sequences are classified into 3 categories:
- Sequences for which the primers are specific, i.e. they do not match any other sequence with more than 80% identity.
For these sequences, the primers are validated automatically. Of the 5 proposed primer pairs, only the first pair is validated (since it is supposed to be the best). If there are several SSRs, the primer pair that amplifies all the SSRs is validated, unless no primer pair was found for the whole sequence; in that case, one primer pair per SSR is validated.
- Sequences for which the primers are not specific: among the proposed primer pairs, at least one primer (forward, reverse, or both) matches at least one other sequence with more than 80% identity.
For these sequences, a new regrouping of sequences is undertaken according to the primer blast (virtual PCR), using the Graph9 program (about Graph9).
If a primer defined on one sequence matches another sequence with more than 80% identity, these two sequences are progressively clustered into the same subgroup. Within each subgroup, users are invited to rerun the sequence alignment with Cap3 and to decide whether or not to keep the newly formed consensus sequence; the database is then updated automatically.
- Sequences for which no primer was found because of the constraints applied to e-Primer3.
For these sequences, no primers are available and the system proposes to rerun e-Primer3 with a modified set-up, for example with less stringent parameters. After the rerun, the interface proposes the newly detected primers. Users can accept them and integrate them into the database; these primers are then validated automatically. For the remaining sequences that still have no primers, one table shows statistics on the main reasons for the ePrimer3 failure for each sequence (low Tm, primer found in excluded region) and another table presents the overall incidence of each of these reasons. Users can still manually enter a primer pair for each remaining sequence and record it in the database.
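The regrouping of non-specific sequences into subgroups, described in the second category above, amounts to linking sequences that share a primer match. The sketch below shows one way to compute such subgroups as connected components with a union-find structure. It is not the Graph9 program, whose algorithm is not described here, and the function and sequence names are hypothetical.

```python
def group_by_primer_matches(links):
    """Group sequences connected, directly or not, by primer matches.

    `links` is a list of (source_sequence, matched_sequence) pairs, one per
    primer match above the identity threshold (assumed input format).
    """
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for first, second in links:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())
```

For example, the links (cn1, cn2), (cn2, s5) and (cn7, cn8) produce two subgroups: cn1, cn2 and s5 together, and cn7 with cn8.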
For each sequence (consensus and singletons), users can :
1) Display the nucleotide sequence
2) Display the alignment (if consensus)
3) Display information about SSR
4) Display the different proposed primer pairs, and the results of the virtual PCR.
Users can manually check the results and validate a primer pair to be kept.
5) Run a BlastN search against the nt database
Import of existing primer pairs: Users provide all the primers in a Fasta file; these are blasted against all the final sequences of the project (consensus and singletons). Primers are recorded in the database and associated with the sequences that both the forward and reverse primers match with 100% identity. These primers are validated automatically: if other primers were already validated for a sequence, their validation is removed, because priority is given to the newly imported primers.
Various output files are available:
- Primers indexed in a csv format (Excel compatible)
- Blast output file
- File showing the association between validated primers and amplified sequences (if both primers match with 100% identity).
If a forward primer and its corresponding reverse primer match on the same sequence over all their length, this primer pair is added in this file.
- Amplified sequences in fasta format
Export data :
- For each cluster, users can get the sequences gathered in this cluster in Fasta format. Same thing for each contig.
- All final sequences in fasta format (consensus + singletons)
- All primers in fasta format
- All validated primer pairs in an Excel file
- Summary file for each sequence