#
## Script modified from https://www.biostars.org/p/230441/
# This script extract CDS nucleotide sequences from mitochondrial refseq genomes
# usage: python extract_CDS.py input_file.gb output_file.fas
#
from Bio import SeqIO
import sys
with open(sys.argv[2], 'w') as nfh:
for rec in SeqIO.parse(sys.argv[1], "genbank"):
if rec.features:
for feature in rec.features:
if feature.type == "CDS":
nfh.write(">%s from %s\n%s\n" % (
feature.qualifiers['gene'][0],
rec.name,
feature.location.extract(rec).seq))
The code below extracts aminoacids sequences of CDS from *.gb files and outputs the CDS found in fasta format.
#
## Script modified from https://www.biostars.org/p/230441/
# This script extract CDS sequences of aminoacids from mitochondrial refseq genomes
# usage: python extract_CDSaa.py input_file.gb output_file.fas
#
from Bio import SeqIO
import sys
with open(sys.argv[2], 'w') as ofh:
for seq_record in SeqIO.parse(sys.argv[1], 'genbank'):
for seq_feature in seq_record.features:
if seq_feature.type=="CDS":
assert len(seq_feature.qualifiers['translation'])==1
ofh.write(">%s from %s\n%s\n" % (
seq_feature.qualifiers['gene'][0],
seq_record.name,
seq_feature.qualifiers['translation'][0]))
The code below extracts tRNA sequences from *.gb files and outputs the tRNA sequences found in fasta format.
#
## Script modified from https://www.biostars.org/p/230441/
# This script extract tRNA sequences from mitochondrial refseq genomes
# usage: python extract_tRNA.py input_file.gb output_file.fas
#
from Bio import SeqIO
import sys
with open(sys.argv[2], 'w') as nfh:
for rec in SeqIO.parse(sys.argv[1], "genbank"):
if rec.features:
for feature in rec.features:
if feature.type == "tRNA":
nfh.write(">%s from %s\n%s\n" % (
feature.qualifiers['product'][0],
rec.name,
feature.location.extract(rec).seq))
The code below extracts rRNA sequences from *.gb files and outputs the rRNA sequences found in fasta format.
#
## Script modified from https://www.biostars.org/p/230441/
# This script extract rRNA sequences from mitochondrial refseq genomes
# usage: python extract_rDNA.py input_file.gb output_file.fas
#
from Bio import SeqIO
import sys
with open(sys.argv[2], 'w') as nfh:
for rec in SeqIO.parse(sys.argv[1], "genbank"):
if rec.features:
for feature in rec.features:
if feature.type == "rRNA":
nfh.write(">%s from %s\n%s\n" % (
feature.qualifiers['product'][0],
rec.name,
feature.location.extract(rec).seq))
As the script requires individual files, you can automate sequencial runs by using the following script:
for file in *.gb ; do python extract_*.py ${file} ${file%.*}.fas ; done
This code extracts individual genomes from concatenated genomes in GenBank (*.gb) format.
#!/usr/bin/perl
#
# usage:
# perl get_individual_genomes.pl input_file.gb
#
#
$file = $ARGV[0];
open (FILE, $file);
while () {
if ($_ =~ m/^LOCUS\s+(\w+).*/){
$output_file = "$1\.gb";
open($out, '>', $output_file);
print $out "$_";
}
if (($_ =~ m/^[\w|\s|\d].*/) ne ($_ =~ m/^LOCUS.*/)){
print $out "$_";
}
if ($_ =~ m/^\/\/\n/){
print $out "$_";
close ($out);
}
}
close (FILE);
===== Sequence submission to the NCBI/GenBank =====
Here I will provide some information on how to submit sequences to NCBI/GenBank using [[http://www.ncbi.nlm.nih.gov/Sequin/|Sequin]]. It is the goal of this tutorial to make sure we standardize the information we associate with the sequences. This is important because not all the sequences deposited at NCBI for our group provide the same information (//i.e//, host data, voucher; among others) and even if they do, they are not in a standard format - that is, under the same source modifier (see bellow).
The information associated with sequences in the NCBI/GenBank is done by [[http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html|Source Modifier]]. The list of source modifiers provided by the NCBI/GenBank is long and among those we think that we should use at least these: Organism, specimen-voucher, isolate, dev_stage host, collected-by, collection-date, country, lat_lon, and note.
==== Table of modifiers ====
The first thing you have to build is a table with the following content:
^ SeqId ^ Organism ^ specimen-voucher ^ isolate ^ dev_stage ^ host ^ collected-by ^ collection-date ^ country ^ lat_lon ^ note ^
| Pararhin_EC_14.1 | Pararhinebothroides hobergi | MZUSP 7754 | EC-14.1 | adults | Urobatis tumbesensis | J. N. Caira & F. Marques | 24-May-2014 | Ecuador | S 2.19967, W 80.9780 | Host field number EC-14 |
| Pararhin_EC_56.1 | Pararhinebothroides hobergi | MZUSP 7755 | EC-56.1 | adults | Urobatis tumbesensis | J. N. Caira & F. Marques | 3-Jun-2014 | Ecuador | S 2.19967, W 80.9780 | Host field number EC-56 |
| Pararhin_EC_56.2 | Pararhinebothroides hobergi | MZUSP 7756 | EC-56.2 | adults | Urobatis tumbesensis | J. N. Caira & F. Marques | 3-Jun-2014 | Ecuador | S 2.19967, W 80.9780 | Host field number EC-56 |
| Pararhin_EC_56.3 | Pararhinebothroides hobergi | MZUSP 7757 | EC-56.3 | adults | Urobatis tumbesensis | J. N. Caira & F. Marques | 3-Jun-2014 | Ecuador | S 2.19967, W 80.9780 | Host field number EC-56 |
Here, the first row contains the source modifiers. The Sequence ID (**SeqId**) is the name of your sequence in the [[http://en.wikipedia.org/wiki/FASTA_format|FASTA]] file. Thus remember, it is mandatory that the first column name matches the name of your sequences in the FASTA file. The other columns are pretty intuitive, but I should comment some modifiers. The **isolate** should receive your molecular code. The date of collection (**date**) has to be in the format: DD-month-YYYY (//i.e.//, "24-May-2014"). The latitude and longitude data (**lat_lon**) should be inserted in a "semi-decimal" format (//i.e.//, "S 2.19967, W 80.9780"). I say "semi-decimal" format because in LAT/LONG in an decimal format uses +/- for N/S and E/W, respectively. If your georeference data is in Degrees, Minutes, Seconds format, you can covert using [[http://www.rcn.montana.edu/Resources/Converter.aspx|Montana State University]] webpage. Finally, the host field number should be inserted under the source modifiers **note**, as in "Host field number XXXX".
You can build this table in any spreadsheet application, but you will have to save that as a TAB delimited text file. **You can download the table above in a TAB delimited format {{:wiki:techniques:sequin:example.zip|here}}.**
==== Running Sequin ====
Once you have your table built, you can run [[http://www.ncbi.nlm.nih.gov/Sequin/|Sequin]]. I will go through all the steps of **Sequin** using the sequence FASTA file below, which I will call **example.fas**. It is important to degap your sequences, that is, you should always submit **unaligned nucleotide sequences**.
>Pararhin_EC_14.1
ACAGGATGAGCCAGGATTCGCTCGGACACGATGGGCACGAGATTAGGACACAG
>Pararhin_EC_56.1
ACAGGATGAGCCAGGATTCGCTCGGACACGATGGGCACGAGATTAGGACACAG
>Pararhin_EC_56.2
ACAGGATGAGCCAGGATTCGCTCGGACACGATGGGCACGAGATTAGGACACAG
>Pararhin_EC_56.3
ACAGGATGAGCCAGGATTCGCTCGGACACGATGGGCACGAGATTAGGACACAG
So, as expected, the first thing to do is to open **Sequin**. You should get the following window:
{{ :wiki:techniques:sequin:sequin_1.png?400 |}}
In this window, you should click on "Start New Submission". If you have already done one and want to use some of your data from the previous record (//i.e.//, address, affiliation, among other) you can click on "Read Existing Record". Regardless of your choice, you should get the following window:
{{ :wiki:techniques:sequin:sequin_2_init.png?400 |}}
Here you will insert all the information concerning the publication to which your sequences will be related, your contact, authorship, and affiliation information.
{{ :wiki:techniques:sequin:sequin_2_sub.png?400 |}}
**Note:** In the window above, I selected the option in which I define the date in which the data should be public. By default, NCBI will give you one year. But you can make it longer or shorter as well as you can change the status of you sequence at any time by contacting the NCBI personal.
This step is pretty intuitive, after filling up all the fields in "Contact", "Authors", and "Affiliation" just click on the button "next page". By doing so, you will obtain:
{{ :wiki:techniques:sequin:sequin_3.png?400 |}}
In this example, I am submitting a rDNA 28S sequence. Thus, I chose the option accordingly. Press "Next" and you will get the following window:
{{ :wiki:techniques:sequin:sequin_4.png?600 |}}
This is the window that allows you to import your [[http://en.wikipedia.org/wiki/FASTA_format|External FASTA]] file. If you click on "Import Nucleotide FASTA", you will get the option to choose the file you want to import (**Remember:** your file must have unaligned sequences!):
{{ :wiki:techniques:sequin:sequin_4b.png?500 |}}
If all went well, you should get a summary of your imported sequences:
{{ :wiki:techniques:sequin:sequin_4c.png?500 |}}
In this window, you will also have to provide the methodology you used to obtain your sequences under the tab "Sequencing Method":
{{ :wiki:techniques:sequin:sequin_4d.png?400 |}}
Here I selected sanger sequencing and indicated the they are "raw sequence reads".
The next window will ask you the type of submission you are about to do. In this particular case, I want to associate the sequences with a phylogenetic study. Therefore, here is the option I selected:
{{ :wiki:techniques:sequin:sequin_5.png?500 |}}
Go to the next window.
Here, Sequin will ask from which organism you got your sequences. As you can see, there is no option for worms! Thus, I selected "something else"!
{{ :wiki:techniques:sequin:sequin_6.png?300 |}}
Move on to the next:
{{ :wiki:techniques:sequin:sequin_7.png?600 |}}
Here is an important step, since you will now use the table with the source modifiers you did as the first part of this tutorial. You should click on "import source table" and select the file. Once accomplished this step, go to the next window.
{{ :wiki:techniques:sequin:sequin_8.png?300 |}}
This window will ask you from which genome you obtained your sequences. Since we are submitting sequences of rDNA 28S, I selected **Nuclear** genome. Once you selected an option, go to the next.
{{ :wiki:techniques:sequin:sequin_9.png?300 |}}
Now you will have to select the type of fragment or genomic region you sequenced. In that case, I selected "single rRNA or ITS". Move to the next.
{{ :wiki:techniques:sequin:sequin_10.png?300 |}}
Because I selected "single rRNA or ITS", this window will ask some more details about my sequences. Select accordingly. In this case, I selected the option for rDNA 28S. Once you are done, go to the next:
{{ :wiki:techniques:sequin:sequin_11.png?500 |}}
Now you are basically done! You should select "Open Record Viewer", because you will need to verify whether there are errors in your data and you will have to save the file to be sent to NCBI.
When you press "Open Record Viewer", if no errors are found and reported, you should get the following:
{{ :wiki:techniques:sequin:sequin_12.png?400 |}}
**Note:** If you got an error message, please see the section on **Error Reports** below.
By default, the record viewer shows only the first sequence of your file. You could inspect one by one, to make sure all the information you desired is associated with your sequence. To finally export the file to be submitted to NCBI, you should select "ALL SEQUENCES" in "Target sequences" and click on "Done".
{{ :wiki:techniques:sequin:sequin_13.png?400 |}}
**Sequin** will then ask if you want to save the record. Press "Yes", and you will be presented with the following window:
{{ :wiki:techniques:sequin:sequin_14.png?400 |}}
Here you should give a name to the file and press "Save".
Congratulations! You are done! Well, in part. **Sequin** finished its job, which was to generate a file compatible to the input files NCBI receive the sequences. After you saved the file, you will get the following warning message:
{{ :wiki:techniques:sequin:sequin_15.png?500 |}}
Pay attention to that! Here is the email to which you should send the file **Sequin** just generated. So, I guess, your job will be done once you email the file!
==== Error Reports in Sequin ====
According to [[http://www.ncbi.nlm.nih.gov/projects/Sequin/sequin.hlp.html#SubmittingtheFinishedRecordtotheDatabase|NCBI]], "[t]here are four types of error messages, Info, Warning, Error, and Reject. Info is the least severe, and Reject is the most severe. You may submit the record even if it does contain errors. However, we encourage you to fix as many problems as possible. Note that some messages may be merely suggestions, not discrepancies. A possible Warning message is that a splice site does not match the consensus. This may be a legitimate result, but you may wish to recheck the sequence. A possible Error message is that the conceptual translation of the sequence that you supplied does not encode an open reading frame. In this case, you should check that you translated the sequence in the correct reading frame. A possible Reject message is that you neglected to include the name of the organism from which the sequence was derived. The name of the organism is absolutely required for a database entry. If Sequin does not detect any problems with the format of your record, you will see a message that "Validation test succeeded".
If you got an error message and was able to identify the error, you can reopen **Sequin**, start a new submission, but instead of following all the steps we discuss above, you select "Click here to import a template". See figure below:
{{ :wiki:techniques:sequin:figure_16.png?400 |}}
After importing the data, follow the steps of the program, which were discussed above.
Good luck!
===== Tree editing and manipulation =====
Below I will give you some tips on how to deal with file manipulation during your phylogenetic analysis in order to avoid a lot of editing of file. As a rule of thumb, I like to keep the name of my terminals short and without characters that are bound to give you troubles as you migrate files from one application to the other (//i.e.//, "-", "&", "#", among others). Underscore (//i.e.//, "_") is your only option most of the time.\\
Consider the following example of data file:
>Potamotrygonodestus_fitzgeraldae_amazon_PA03-53.1
ATTTACTATCATTCACATATCAAGAATGATGACAGATGAGCATGAC...
>Potamotrygonodestus_fitzgeraldae_amazon_PA03-53.2
ATTTATTAACATACACATATCAAGAATGATGACAGATGAGCATGAC...
>Potamotrygonodestus_marajoara_amazon_PA03-13.1
ATTTATTAACATACACATATCAAGAATGATGACAGATGAGCATGAC...
>Potamotrygonodestus_marajoara_amazon_PA03-13.2
ATTTATTAACATACACATATCAAGAATGATGATTTTTGAGCATTTC...
...
This format, encompasses the name of the species, its distribution and a code for the specimen sequenced. Some programs, and TNT is an example, would truncate names like this to 32 characters. Therefore, these programs would consider each of these taxon names as:
>Potamotrygonodestus_fitzgeraldae
ATTTACTATCATTCACATATCAAGAATGATGACAGATGAGCATGAC...
>Potamotrygonodestus_fitzgeraldae
ATTTATTAACATACACATATCAAGAATGATGACAGATGAGCATGAC...
>Potamotrygonodestus_marajoara_am
ATTTATTAACATACACATATCAAGAATGATGACAGATGAGCATGAC...
>Potamotrygonodestus_marajoara_am
ATTTATTAACATACACATATCAAGAATGATGATTTTTGAGCATTTC
...
Note that the two initial taxa would have the same name, the same is true for the last two, and that would be enough impede you to analyze this data.
I am aware we give names that would allow us to recognize the taxa as fast as possible. That I will try to convince you is that you can keep code names in your input files and use a simple tool to make the translations required whenever you need to visually examine the result of your analysis.
Here is how I would deal with that:
Usually, I use a code for each specimen. If I am dealing with **hundreds** of terminal, I use few letters and at **least three digits**. For instance, for the example above I would have something like:
>Pot001
ATTTACTATCATTCACATATCAAGAATGATGACAGATGAGCATGAC...
>Pot002
ATTTATTAACATACACATATCAAGAATGATGACAGATGAGCATGAC...
>Pot003
ATTTATTAACATACACATATCAAGAATGATGACAGATGAGCATGAC...
>Pot004
ATTTATTAACATACACATATCAAGAATGATGATTTTTGAGCATTTC...
...
Those taxon names will never give you a hard time in any application I am aware of! But of course, you need to keep track of what those codes means and most of the time, we do that by having a spreadsheet that document all our specimens or terminals. Let us then consider a two column table:
^ Code ^ Specimen ^
| Pot001 | Potamotrygonodestus_fitzgeraldae_amazon_PA03-53.1 |
| Pot002 | Potamotrygonodestus_fitzgeraldae_amazon_PA03-53.2 |
| Pot003 | Potamotrygonodestus_marajoara_amazon_PA03-13.1 |
| Pot004 | Potamotrygonodestus_marajoara_amazon_PA03-13.2 |
From this table, you can generate a text file called **rename.sed** in the following format:
s/Pot001/Potamotrygonodestus_fitzgeraldae_amazon_PA03_53_1/
s/Pot002/Potamotrygonodestus_fitzgeraldae_amazon_PA03_53_2/
s/Pot003/Potamotrygonodestus_marajoara_amazon_PA03_13_1/
s/Pot004/Potamotrygonodestus_marajoara_amazon_PA03_13_2/
This is a file formatted for [[https://www.gnu.org/software/sed/manual/sed.html|Sed]]; a stream editor native to UNIX/LINUX systems, therefore also available in your Mac. **Sed** is a powerful toll when it comes to make substitutions in text files - remember, most of the files we manipulate during phylogenetic analyses are text files! **Note:** I did avoid to use "-" and ".", because depending on what you want to do, those characters will not work!
Here it how you can apply this tool. Consider that you for a tree output file, which we will call **example.tre**, with the following content:
(Pot003:0.00000,Pot004:0.04311,(Pot002:0.00000,Pot001:0.07118)0.98:0.09396);
Now, if you open a terminal window, move to the directory in which your files are, and execute the following command line:
sed -f rename.sed example.tre
You will get:
{{ :wiki:techniques:phylogenetics:sed.png?800 |}}
However, you may desire to redirect the output of the commmand line above to a file so that you can open the file in something like [[http://tree.bio.ed.ac.uk/software/figtree/|FigTree]]. In that case, you should execute the following command line:
sed -f rename.sed example.tre > example_renamed.tre
In this command line, the option "-f" of sed implied that all rules of substitution are in a file (//i.e.//, **rename.sed**) and that the file to be translated is **example.tre** and the output of this **sed** command line should be redirected (the sign ">" is what does it) to the file **example_renamed.tre**. The content of the file **example_renamed.tre** should be:
(Potamotrygonodestus_marajoara_amazon_PA03_13_1:0.00000,Potamotrygonodestus_marajoara_amazon_PA03_13_2:0.04311,(Potamotrygonodestus_fitzgeraldae_amazon_PA03_53_2:0.00000,Potamotrygonodestus_fitzgeraldae_amazon_PA03_53_1:0.07118)0.98:0.09396);
Now you should be able to examine this tree in a more informative way. If you open that in [[http://tree.bio.ed.ac.uk/software/figtree/|FigTree]], you would have:
{{ :wiki:techniques:phylogenetics:figtree_1.png?600 |}}
But you can do better, especially if you are looking at a final tree, which would be used to genetate a figure for publication. Considere this substitution file, which we will call **rename_2.sed**:
s/Pot001/"Potamotrygonodestus fitzgeraldae ex P. motoro PA03-53.1 [Amazon\/AJ51651]"/
s/Pot002/"Potamotrygonodestus fitzgeraldae ex P. motoro PA03-53.2 [Amazon\/AJ51652]"/
s/Pot003/"Potamotrygonodestus marajoara ex Pl. iwamae PA03-13.1 [Amazon\/AJ51653]"/
s/Pot004/"Potamotrygonodestus marajoara ex Pl. iwamae PA03-13.1 [Amazon\/AJ51654]"/
Note that I am using all kinds of characters that are bound to give you a lot of hassle when using in input files. One of the special characters (//i.e.//, "/") in "Amazon/AJ51651" requires a back slash, since **sed** consider "/" a special character; hence, "Amazon**\/**AJ51651". There I am enclosing them between quotes, because [[http://tree.bio.ed.ac.uk/software/figtree/|FigTree]] allows me to do that. If I execute the command line above using this substitution file, I would get:
("Potamotrygonodestus marajoara ex Pl. iwamae PA03-13.1 [Amazon/AJ51653]":0.00000,"Potamotrygonodestus marajoara ex Pl. iwamae PA03-13.1 [Amazon/AJ51654]":0.04311,("Potamotrygonodestus fitzgeraldae ex P. motoro PA03-53.2 [Amazon/AJ51652]":0.00000,"Potamotrygonodestus fitzgeraldae ex P. motoro PA03-53.1 [Amazon/AJ51651]":0.07118)0.98:0.09396);
In [[http://tree.bio.ed.ac.uk/software/figtree/|FigTree]], this tree would be:
{{ :wiki:techniques:phylogenetics:figtree_2.png?600 |}}
======= Tools for morphometric analyses =======
--------
===== Wormbox v. 1.0 =====
WormBox (Vellutini & Marques, 2011-14) is a plugin to FIJI/ImageJ that was written with the intent to help you to automate the uptake of measurements from sets of images. This macro is based on plotting landmarks upon images from which a table with linear distances will be generated. You can also use it to count structures (//i.e.//, meristic variables). We have not tested how effective timewise this plugin will be in comparison to the traditional methods most people obtain measurements from specimens. We expect it will vary a lot among cases. Ultimately, it will all depend on how easy (and fast) you can produce images from which you can extract the measurements you need. If that is not an issue, we believe you will find that this application will speed up data gathering. You should also consider that, even if you do not speed up your work by using this application, using it will provide you with full documentation of your measurements -- which usually is not the case by traditional methods. Be that as it may, give it a try. We will be happy if it turned out to be a good tool for your research. To download the most recent version of the plugin, [[https://github.com/nelas/WormBox|press here]].
\\
\\
=== System requirements ===
As this application makes full usage of Fiji/ImageJ, which can be downloaded from [[http://fiji.sc/wiki/index.php/Fiji|Fiji's website]], as long as you meet the system requirements of Fiji/ImageJ, this application will work. Thus, before you start you should install Fiji/ImageJ. After installing Fiji/ImageJ, you should verify whether there are updates for Fiji and/or ImageJ available. To accomplish that, open the casting menus of Help in the main menu of Fiji/ImageJ and select the options to update both applications (Figure 1).
{{ :wiki:techniques:wormbox_images:fig_1.jpg?direct&150 |}}
internal_handle:1,3
internal_prong:2,4
lateral_handle:5,7
lateral_prong:6,8
Each line should contain the name of the variable, followed by “:” and at least two landmarks separated by comma defining the two points from which WormBox will calculate the distance.
**NOTE:** You can use more than two land marks for a single variable. For instance, your configuration file could have a line like this: “variable_x:1,2,3,4”. In this case, the measurement for “variable_x” will be the sum of the distances between 1--2, 2--3, and 3--4.
Once you have written your configuration file (//e.g.//, hooks_measurements.conf) and saved it, run the WormBox Analyzer (see definition of icons above). First, WormBox will ask you the target folder in which your images have been processed (Figure 10A); and then the name of your configuration file (Figure 10B).
{{ :wiki:techniques:wormbox_images:fig_10.jpg?direct&450 |}}
#1 length:1,2,3
#2 width:4,5
#3 progl_ratio:${length}//${width}
#4 genital_pore_distance:2,3
#5 genital_pore_position:${genital_pore_distance}*100/${length}
#6 testes_dia:6,7
#7 testes_dia:8,9
#8 aporal_testes:count
#9 poral_testes:count
#10 total_testes:${aporal_testes}+${poral_testes}
Here is what this configuration file will do with the data you compiled for each image:
* Lines 1 and 2 (i.e., #1 and #2) will calculate the length and width of your proglottid, which except for the fact that the length is being calculated by the sum of two distances (i.e., 1--2 + 2--3), they are the kind of distances you calculated before in the first part of this tutorial.
* Line 3 provides something new. Here we defined a new morphometric variable,”progl_ratio“, which is equal to the floored quotient of length and width of the proglottid; that is, its ratio. To operate with variables there are few rules you must follow:\\
* You can only operate with variables that were defined before. That is, line 3 could not be written prior to lines #1 and #2.\\
* The syntax for operating variables should be “${name_of_variable}”.\\
* The operator used in WormBox are basically those available in Python:\\
^ Operation ^ Result ^
| x + y | sum of x and y |
| x - y | difference of x and y |
| x * y | product of x and y |
| x / y | quotient of x and y |
| x %%//%% y | floored quotient of x and y |
| pow(x, y) | x to the power y |
| x ** y | x to the power y |
* Other lines that make use of this property are lines 5 and 10 in which we calculate the genital pore position and the total number of testes, respectively.\\
* Next, we should comment lines 8 and 9. These two redundant lines specify the distances for testes diameter (//i.e.//, 6--7 and 8--9). For those, when WormBox parse the information of an image, it compiles the mean value between 2 or more pseudo-replicates (or repetitive measures).\\
Finally, meristic variable are defined by its name followed by “:count”. Examples of that are the testes counts defined in “aporal_testes” and “poral_testes”, lines 8 and 9, respectively.
The final results should be something like this:
{{ :wiki:techniques:wormbox_images:fig_16.jpg?direct&600 |}}
Cirrus sac widely pyriform, 140–200 long and 90–170 wide, its length representing only 14–21% of proglottis width (x = 17%, n = 26) ...
Ovary width represents 72–79% (x = 76%, n = 25, CV = 3%) of proglottis width ...";
"..., prebulbar organs present, small; bulbs elongate (Fig. 3), bent,
thick-walled, 600-760 (649 ± 53; n = 22) long by 100-130 (117 ± 10; n = 22) wide, ...";
specimen n_testes
specimen_1 2
specimen_2 3
specimen_3 3
specimen_4 5
specimen_5 5
specimen_6 6
specimen_7 6
specimen_8 7
specimen_9 9
specimen_10 10
specimen_11 12
specimen_12 12
specimen_13 13
specimen_14 13
specimen_15 14
If you are familiar with R, you could just execute:
>n_testes <- c(2,3,3,5,5,6,6,7,9,10,12,12,13,13,14)
> summary(n_testes)
and you would obtain:
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 5 7 8 12 14
This summary shows that the minimum valuer for your variable is 2, the maximum is 14, that the distribution is centered in the median 8 and that 50% of your samples have from 5 to 12.
If you are interested in looking at how the box plot for this variable looks like, run:
>boxplot(n_testes)
and R would return you the following plot:
{{ :wiki:techniques:morphometrics:images:five_number_example.jpg?400 |}}
This is the graphical representation of the five-number summary above.
The script available {{:wiki:techniques:morphometrics:compile5numbers.zip|here}} does somewhat the same, but allows you to explore your data in a more effective way. To work with this script, there are few things you need to know. Here they are:
== Structure of the input file ==
The input file must be a text format file, TAB delimited, and have at least three columns: specimen, group and variables. It should look like this:
specimen group n_testes total_length
specimen_1 host 2 2.1
specimen_2 host 3 2.2
specimen_3 host 3 2.2
specimen_4 host 5 2.3
specimen_5 host 5 2.4
specimen_6 host 6 2.5
specimen_7 host 6 2.6
specimen_8 host 7 1.2
specimen_9 host 9 5.5
specimen_10 host 10 6.7
specimen_11 host 12 6.7
specimen_12 host 12 7.8
specimen_13 host 13 8.0
specimen_14 host 13 8.0
specimen_15 host 14 9.0
You can have as many variables you want as well as many groups. Groups are identification tokens for specimens, let us say, host species or localities. One important thing to remember, the script averages out pseudoreplication. Thus, if the first column has more than a single specimen under the same name within a given group, the script will compute the mean value for the specimen and consider a single data entry -- for the reasons discussed above.
== Modes of execution ==
In its simplest form, the script should be executed in a shell terminal with the following sintax:
Rscript compile5numbers.r input_file
For example, the data above was in a file called example.txt, the execution would be:
Rscript compile5numbers.r example
This would return the file called my_out.csv, which is a file in a CSV (comma separated value) format -- therefore you can open it in a spreadsheet application of your choice, with the following content:
^ ^ stats ^ n_testes ^ total_length ^
| 1 |LowerValue | 2 | 1.2 |
| 2 |1stQ | 5 | 2.25 |
| 3 |Median | 7 | 2.6 |
| 4 |3rdQ | 12 | 7.25 |
| 5 |UpperValue | 14 | 9 |
| 6 |N | 15 | 15 |
The problem with this syntax is that you run the risk of overwriting output files, Thus the safest execution syntax would be
Rscript compile5numbers.r example -prefix any_word
The argument argument "any_word" for the option "-prefix" attaches the term at the beginning of each output file. For example, is you execute
Rscript compile5numbers.r example -prefix fernando
The same output file as before would be called fernando_out.csv.
This script can include as output files boxplots. To do that you have to include the option -boxplot into the command line, like:
Rscript compile5numbers.r example.txt -boxplot -prefix fernando
For the data above, this command line would generate a CSV file (Scalable Vector Graphics) called fernando_boxplot001.svg with the following box plots:
{{ :wiki:techniques:morphometrics:images:fernando_boxplot001.jpg?300 |}}
The last feature of this script is its ability to compute statistics and draw plots comparing groups and variables. To illustrate that, let us modify the input file attributing the specimens to two different hosts. Something like:
specimen group n_testes total_length
specimen_1 host_1 2 2.1
specimen_2 host_1 3 2.2
specimen_3 host_1 3 2.2
specimen_4 host_1 5 2.3
specimen_5 host_1 5 2.4
specimen_6 host_1 6 2.5
specimen_7 host_1 6 2.6
specimen_8 host_2 7 1.2
specimen_9 host_2 9 5.5
specimen_10 host_2 10 6.7
specimen_11 host_2 12 6.7
specimen_12 host_2 12 7.8
specimen_13 host_2 13 8.0
specimen_14 host_2 13 8.0
specimen_15 host_2 14 9.0
The option that allows you to compare groups per variable is called -groups, obviously! If not evoked, the script will treat the data as a single class group and the results would be the same as before even if your dataset has class groups defined. Here is the an example of command line evoking the comparative analysis:
Rscript compile5numbers.r example_hosts_groups.txt -boxplot -groups -prefix fernando_2
This analysis generates 3 files: fernando_2_out.csv, fernando_2_boxplot001.svg, and fernando_2_boxplot002.svg. The first one contains the summary statistics for each groups, and the other two files are SVG figures containing box plots for each variable.
Here is the structure of the summary statistics:
| ^ group ^ stats ^ n_testes ^ total_length ^
| 1 | host_1 | LowerValue | 2 | 2.1 |
| 2 | host_1 | 1stQ | 3 | 2.2 |
| 3 | host_1 | Median | 5 | 2.3 |
| 4 | host_1 | 3rdQ | 5.5 | 2.45 |
| 5 | host_1 | UpperValue | 6 | 2.6 |
| 6 | host_1 | N | 7 | 7 |
| 7 | host_2 | LowerValue | 7 | 5.5 |
| 8 | host_2 | 1stQ | 9.5 | 6.1 |
| 9 | host_2 | Median | 12 | 7.25 |
| 10 | host_2 | 3rdQ | 13 | 8 |
| 11 | host_2 | UpperValue | 14 | 9 |
| 12 | host_2 | N | 8 | 8 |
Here are the plots generated((You can try yourself downloading the input files used in the examples above from {{:wiki:techniques:morphometrics:input_files.zip|here}}.)):
{{ :wiki:techniques:morphometrics:images:fernando_2_boxplot001_and_2.jpg |}}
The results of this analysis is two plots as shown above. On you left you have the comparative distribution of the number of testes of these two groups of worms. Note that the ranges do not overlap. What does that mean? Well, we would say that the ranges of testes numbers of those specimens do not overlap. That is it. On your right, you have the distribution of the total length of those two groups. There is something worth a comment on that graph. Observe that the box plot recognized an outlier for the specimens attributed to Host 2. This is a useful tool of this script -- of the box plot to be fair -- that would allow you to recognize in an objective was data points that deviate considerably from the variability of your data. Whether that value a wrong class attribution, measurement error, or even a possible expectation for the distribution (see below) will require you to go back to the specimen, evaluate you data distribution, and think about it.
Before we move to our next section, we want to think a little bit more about the plots above. When you compare the distributions of these two variables, you realize that whereas for number of testes the ranges between specimens from the two hosts are closer to each other than when you compare the distribution reported for total length. The question we want you to ask yourselves is whether you would be willing to use any of those variables to state that specimens from Host 1 differ from those found in Host 2. That is, from what you have at hand, would you be save making generalizations about population parameter estimates? Remember that inferential statistics require large sample sizes and random sampling. How many worms you have? Are they from a single host (per group)? Is your sample biogeographycally representative or are they from a single locality? Did you selected a random sample of the specimens you had, or you selected them based on some criterion that might influence the results of your statistics? Depending of your answer, your analysis ends here, as a description of your data.
=== Inferential statistics ===
This section deserves a detailed treatment, especially in light of the tools that are available to us to make a better, and perhaps, a proper use of the morphometric data we intent to use in systematics of cestodes. This will come in a later version of this document. For the time being, we would like to pinpoint some to the limitations of the methods described above. We think that the example below will demonstrate to you, what you can and cannot do with descriptive statistics. Consider the following two adjacent normal distribution:
{{ :wiki:techniques:morphometrics:images:two_normal_dists.jpg?600 |}}