In the beginning the mobile genetic elements were associated with pathogenic strains; however, this has changed with time Dobrindt et al.2004. For example, an immense diversity of phages has been described in the ocean, suggesting that these mobile elements might play a relevant role in microbial communities working as seed banks for adaptive genes which any community member can potentially acquire (Fancello et al.2011; Modi et al.2013; Subirats et al.2016).
I started my search by looking for these three types of mobile elements.
conda create -n isescan_env
conda activate isescan_env
conda config --add channels defaultsconda config --add channels biocondaconda config --add channels conda-forge
conda install -n isescan_env -c conda isescan
nohup isescan.py --seqfile 0118A1023.contigs.fa --output isescan_gut_microbiota --nthread 20 &
grep -c ">" 0118A1023.contigs.fa.faa
Number of sequences: 15,164
wc -l 0118A1023.contigs.fa.tsv
Number of elements: 71
seqID | family | cluster | isBegin | isEnd | isLen | ncopy4is | start1 | end1 | start2 | end2 | score | irId | irLen | nGaps | orfBegin | orfEnd | strand | orfLen | E-value | E-value4copy | type | ov | tir |
---|
seqID | family | cluster | isBegin | isEnd | isLen | ncopy4is | start1 | end1 | start2 | end2 | score | irId | irLen | nGaps | orfBegin | orfEnd | strand | orfLen | E-value | E-value4copy | type | ov | tir |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
qb307082014_I_scaffold_0 | ISNCY | ISNCY_229 | 362714 | 363978 | 1265 | 1 | 362714 | 362733 | 363959 | 363978 | 24 | 16 | 20 | 0 | 362856 | 363845 | + | 990 | 2e-136 | 2e-136 | c | 1 | CCTGCTTGAAGAGAATCAGC:CCAGCTTGAAGTGCATCTGC |
qb307082014_I_scaffold_1 | IS3 | IS3_277 | 29932 | 31150 | 1219 | 1 | 29932 | 29947 | 31135 | 31150 | 20 | 13 | 16 | 0 | 30066 | 31134 | + | 1069 | 2.7e-69 | 2.7e-69 | c | 1 | TAATCTAGACACTTCT:TAATATGGACACTGCT |
qb307082014_I_scaffold_1 | IS481 | IS481_240 | 562014 | 563317 | 1304 | 1 | 562014 | 562026 | 563305 | 563317 | 22 | 12 | 13 | 0 | 562384 | 563229 | - | 846 | 1.7e-13 | 1.7e-13 | p | 1 | CGAAAGGAGAAAT:CGAAAGGTGAAAT |
qb307082014_I_scaffold_1038 | IS200/IS605 | IS200/IS605_96 | 1 | 110 | 110 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 110 | + | 110 | 8.1e-13 | 8.1e-13 | p | 1 | -:- |
qb307082014_I_scaffold_1051 | IS21 | IS21_122 | 126 | 1394 | 1269 | 1 | 126 | 146 | 1374 | 1394 | 18 | 15 | 21 | 0 | 1 | 1478 | + | 1478 | 4e-46 | 4e-46 | p | 1 | GAAAGTTATGTCTGTGACATT:GAAAGTCATCTCGTTCATATT |
qb307082014_I_scaffold_1052 | IS21 | IS21_122 | 131 | 1452 | 1322 | 1 | 131 | 140 | 1443 | 1452 | 16 | 9 | 10 | 0 | 1 | 1478 | + | 1478 | 1.4e-47 | 1.4e-47 | p | 1 | TTATGTCTGT:TTATATCTGT |
qb307082014_I_scaffold_1076 | IS982 | IS982_263|IS982||protein:plasmid:140967 | 519 | 1284 | 766 | 1 | 519 | 529 | 1274 | 1284 | 18 | 10 | 11 | 0 | 937 | 1413 | + | 477 | 8.8e-39 | 8.8e-39 | p | 1 | ACTGTTGCCGG:ACTGATGCCGG |
qb307082014_I_scaffold_110 | IS256 | IS256_162 | 20411 | 21536 | 1126 | 1 | 20411 | 20423 | 21524 | 21536 | 22 | 12 | 13 | 0 | 20659 | 21634 | - | 976 | 7.7e-150 | 7.7e-150 | p | 1 | TGACAAAATCATT:TGACTAAATCATT |
more_abundant<-Insertion %>% count(family) %>% # Count the number of elements in the family column arrange(-n) # Arrange the number from the from largest to smallesthead(more_abundant) # Show the top six## # A tibble: 6 × 2## family n## <chr> <int>## 1 IS21 15## 2 IS3 9## 3 IS256 6## 4 IS66 6## 5 IS982 6## 6 IS1182 5
The most abundant element is:
Some of the IS elements can be incomplete, which may reflect ancestral transposition events (Siguier et al., 2014) suggesting that the rest of the key sequence features were lost over time.
Therefore, I focused at first on the complete elements, considering they are more likely to be recent and active in the community.
ISEScan has an extra column indicating if the element is complete or partial; therefore, I filtered the output file to find the most abundant complete elements in the metagenome.
Complete<-Insertion %>% filter(type == "c") %>% # Filter the rows that had a "c" in the type column, that stands for complete count(family) %>% arrange(-n)head(Complete)## # A tibble: 6 × 2## family n## <chr> <int>## 1 IS1182 4## 2 IS3 4## 3 IS30 3## 4 IS66 3## 5 IS256 2## 6 IS4 2
What if there are partial sequences that also contain transposases?
Maybe the rest of the sequence is highly divergent, and that is why we can not find any matches to the databases; therefore, the sequences are classified as partial?
If they have transposases, that might mean that they must be elements that could be actively moving something in the community, but since they are divergent and new, the program is not able to classify them (properly).
interproscan.sh -cpu 20 -goterms -pa -i 0118A1023.contigs.fa.orf_interpr.faa > Log_Interpro_Scan_$i.txt
list_partial<-Insertion %>% filter(type == "p") %>% # Here I am filtering the output table to extract the "partial" hits. select(seqID, orfBegin, orfEnd, strand) %>% # Select certain columns on the table unite("seq_name", c("seqID", "orfBegin", "orfEnd", "strand"), sep="_") %>% # The output of ISEScan uses as sequence ID the name of the scaffold so I create a new colum merging the information from the start and end of the sequence to be able to match the results with InterproScan distinct() %>% # Remove duplicated pull() # Create a vector
interpro<-read_delim("../data/0118A1023.contigs.fa.orf_interpro.faa.tsv", delim="\t", col_names = F) %>% filter(X1 %in% list_partial) %>% # Filter the table using the IDs extracted previously filter(X4 == "Pfam") %>% # Filter the output to get the PFAM domains select(X1, X5, X6) %>% filter(str_detect(X6, "Transposase")) # Search for elements annotated as transposases
X1 | X5 | X6 |
---|
X1 | X5 | X6 |
---|---|---|
qb307082014_I_scaffold_1602_331_1806_+ | PF03050 | Transposase IS66 family |
qb307082014_I_scaffold_737_6236_6776_+ | PF00872 | Transposase, Mutator family |
qb307082014_I_scaffold_1076_937_1413_+ | PF13612 | Transposase DDE domain |
qb307082014_I_scaffold_34_1124_1628_+ | PF00872 | Transposase, Mutator family |
qb307082014_I_scaffold_110_20659_21634_- | PF00872 | Transposase, Mutator family |
qb307082014_I_scaffold_497_1_926_- | PF00872 | Transposase, Mutator family |
The most abundant element in the dataset is partial IS21.
The elements IS1182 and IS3 are the most abundant complete elements.
The analysis showed that there are 17 potential new mobile elements in the dataset.
Integrons are assembly platforms — DNA elements that acquire open reading frames embedded in exogenous gene cassettes and convert them to functional genes by ensuring their correct expression (Mazel., 2006).
I used Integron Finder to identify the integrons in the metagenome.
conda create -n integronFinder
conda activate integronFinder
conda install integron_finder -n integronFinder
integron_finder --local-max --func-annot 0118A1023.contigs.fa
The output of the program contain 3 files:
0118A1023.contigs.integrons: A file with all integrons and their elements detected in all sequences in the input file.
0118A1023.contigs.summary: A summary file with the number and type of integrons per sequence.
integron_finder.out: A copy standard output.
Lets see the 0118A1023.contigs.integrons file:
integrons<-read_delim("../data/0118A1023.contigs.integrons", delim = "\t", skip = 1)
ID_integron | ID_replicon | element | pos_beg | pos_end | strand | evalue | type_elt | annotation | model | type | default | distance_2attC | considered_topology |
---|
ID_integron | ID_replicon | element | pos_beg | pos_end | strand | evalue | type_elt | annotation | model | type | default | distance_2attC | considered_topology |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
integron_01 | qb307082014_I_scaffold_18 | attc_001 | 120781 | 120823 | -1 | 0.43 | attC | attC | attc_4 | CALIN | No | lin | |
integron_01 | qb307082014_I_scaffold_18 | attc_002 | 122724 | 122771 | -1 | 0.43 | attC | attC | attc_4 | CALIN | No | 1901 | lin |
Integron finder can predict three types of elements described in Cury et al., 2016:
In this metagenome, I was only able to identify two CALIN elements.
What genes are present in qb307082014_I_scaffold_18?
Exploring the neighborhood genes to the CALIN element would tell us maybe more about what are these sequences doing and why are they potentially important for the environment?
I used OperonMapper to predict the genes that are near the CALIN element predicted.
The output files from Operon Mapper are:
The list_of_operons_417647 file contains the annotation of the elements and the number of operons predicted.
Exploring the list_of_operons_417647 file.
Operon_mapper<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>% fill(Operon) # Fill NA values with the previous value in a column
Operon | IdGene | Type | COGgene | PosLeft | postRight | Strand | Function |
---|
Operon | IdGene | Type | COGgene | PosLeft | postRight | Strand | Function |
---|---|---|---|---|---|---|---|
1 | |||||||
1 | LOOBNNKK_00001 | CDS | COG1284 | 514 | 1410 | + | [S] Uncharacterized conserved protein |
2 | |||||||
2 | LOOBNNKK_00002 | CDS | COG3607 | 1461 | 1859 | - | [R] Predicted lactoylglutathione lyase |
3 | |||||||
3 | LOOBNNKK_00003 | CDS | COG3542 | 1979 | 2464 | - | [S] Uncharacterized conserved protein |
By manually looking at the list_of_operons_417647 file I found a sequence that is associated to a transposase is COG2826, and was found in the qb307082014_I_scaffold_18 scaffold. Suggesting is the one inferred by Integron Finder.
COG2826<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>% fill(Operon) %>% drop_na(Type) %>% filter(COGgene == "COG2826")
I look at the operons that where up and down to see which genes would be there and I found something interesting.
operons_to_look<-c("7", "9")neighbour_genes<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>% fill(Operon) %>% drop_na(Type) %>% filter(Operon %in% operons_to_look)
Operon | IdGene | Type | COGgene | PosLeft | postRight | Strand | Function |
---|
Operon | IdGene | Type | COGgene | PosLeft | postRight | Strand | Function |
---|---|---|---|---|---|---|---|
7 | LOOBNNKK_00018 | CDS | COG3279 | 21499 | 22218 | - | [KT] Response regulator of the LytR/AlgR family |
7 | LOOBNNKK_00019 | CDS | COG3290 | 22220 | 23050 | - | [T] Signal transduction histidine kinase regulating citrate/malate |
9 | LOOBNNKK_00021 | CDS | ROG4413 | 24272 | 24613 | - |
The gene in operon 9 has a non identidied function, I extracted the sequence and make a web blast search.
The best hit at the NCBI database was accessory regulator AgrC. The same protein at UniProt was A0A0K2MC93, which is classified by Gene Ontology database as GO:0000155:
Catalysis of the phosphorylation of a histidine residue in response to detection of an extracellular signal such as a chemical ligand or change in environment, to initiate a change in cell state or activity. The two-component sensor is a histidine kinase that autophosphorylates a histidine residue in its active site. The phosphate is then transferred to an aspartate residue in a downstream response regulator, to trigger a response.
In this metagenome I was only able to identify two CALIN elements.
These elements were identified in the same scaffold and the Operon Mapper analysis revealed that there is a hypothetical protein cluster nearby involved in signal transduction from the environment.
In this metagenome I was only able to identify two CALIN elements.
These elements were identified in the same scaffold and the Operon Mapper analysis revealed that there is a hypothetical protein cluster nearby involved in signal transduction from the environment.
I think that the genes within this integron cassette are proteins related to some environmental response to stress, meaning that in this community microbes exchange genes that allows them cope with a changing environment.
I searched for lysogenic phages in the metagenome as these elements are one of the most efficient mechanisms of HGT (Jiang and Paul 1998; Weinbauer and Rassouldzadegan 2004; Moon et al.2016).
PHASTER. A software that performs a number of database comparisons as well as phage 'cornerstone' feature identification steps to locate, annotate and display prophage sequences and prophage features (Zhou et al., 2011).
You can run a bunch of jobs within a loop using the API. I have done it before for one of my previous publications.
Here I just run it like that
wget --post-file="0118A1023.contigs.fa" "http://phaster.ca/phaster_api" -O 0118A1023.contigs.fa.phaster
And then you have to wait... And I still waiting
This way you can access to the job list and see how long is last for your job to run:
wget "http://phaster.ca/phaster_api?acc=ZZ_cd44018c8e" -O Output_filename
After the files were processed, we downloaded the results:
wget -O ID_FROM.phaster "http://phaster.ca/jobs/ID_FROM.phaster/summary.txt"
Then I would extract the number or intact prophages in the metagenome.
for i in $(*.fasta.phaster); do grep "intact prophage" $i >>number_phages; done
Reconstruct genomes. Reconstructing the genomes would allow me to know in which organism this elements are present and build more accurate hypothesis regarding the ecology and evolution of the mobile elements.
Over the MAGs I could run other programs and build a more accurate inferences of the operons:
OperonMapper to predict the genes that are potentially being transfer within the community.
IslandViewer to identify elements that are actively moving.
In the beginning the mobile genetic elements were associated with pathogenic strains; however, this has changed with time Dobrindt et al.2004. For example, an immense diversity of phages has been described in the ocean, suggesting that these mobile elements might play a relevant role in microbial communities working as seed banks for adaptive genes which any community member can potentially acquire (Fancello et al.2011; Modi et al.2013; Subirats et al.2016).
I started my search by looking for these three types of mobile elements.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
In the beginning the mobile genetic elements were associated with pathogenic strains; however, this has changed with time Dobrindt et al.2004. For example, an immense diversity of phages has been described in the ocean, suggesting that these mobile elements might play a relevant role in microbial communities working as seed banks for adaptive genes which any community member can potentially acquire (Fancello et al.2011; Modi et al.2013; Subirats et al.2016).
I started my search by looking for these three types of mobile elements.
conda create -n isescan_env
conda activate isescan_env
conda config --add channels defaultsconda config --add channels biocondaconda config --add channels conda-forge
conda install -n isescan_env -c conda isescan
nohup isescan.py --seqfile 0118A1023.contigs.fa --output isescan_gut_microbiota --nthread 20 &
grep -c ">" 0118A1023.contigs.fa.faa
Number of sequences: 15,164
wc -l 0118A1023.contigs.fa.tsv
Number of elements: 71
more_abundant<-Insertion %>% count(family) %>% # Count the number of elements in the family column arrange(-n) # Arrange the number from the from largest to smallesthead(more_abundant) # Show the top six## # A tibble: 6 × 2## family n## <chr> <int>## 1 IS21 15## 2 IS3 9## 3 IS256 6## 4 IS66 6## 5 IS982 6## 6 IS1182 5
The most abundant element is:
Some of the IS elements can be incomplete, which may reflect ancestral transposition events (Siguier et al., 2014) suggesting that the rest of the key sequence features were lost over time.
Therefore, I focused at first on the complete elements, considering they are more likely to be recent and active in the community.
ISEScan has an extra column indicating if the element is complete or partial; therefore, I filtered the output file to find the most abundant complete elements in the metagenome.
Complete<-Insertion %>% filter(type == "c") %>% # Filter the rows that had a "c" in the type column, that stands for complete count(family) %>% arrange(-n)head(Complete)## # A tibble: 6 × 2## family n## <chr> <int>## 1 IS1182 4## 2 IS3 4## 3 IS30 3## 4 IS66 3## 5 IS256 2## 6 IS4 2
What if there are partial sequences that also contain transposases?
Maybe the rest of the sequence is highly divergent, and that is why we can not find any matches to the databases; therefore, the sequences are classified as partial?
If they have transposases, that might mean that they must be elements that could be actively moving something in the community, but since they are divergent and new, the program is not able to classify them (properly).
interproscan.sh -cpu 20 -goterms -pa -i 0118A1023.contigs.fa.orf_interpr.faa > Log_Interpro_Scan_$i.txt
list_partial<-Insertion %>% filter(type == "p") %>% # Here I am filtering the output table to extract the "partial" hits. select(seqID, orfBegin, orfEnd, strand) %>% # Select certain columns on the table unite("seq_name", c("seqID", "orfBegin", "orfEnd", "strand"), sep="_") %>% # The output of ISEScan uses as sequence ID the name of the scaffold so I create a new colum merging the information from the start and end of the sequence to be able to match the results with InterproScan distinct() %>% # Remove duplicated pull() # Create a vector
interpro<-read_delim("../data/0118A1023.contigs.fa.orf_interpro.faa.tsv", delim="\t", col_names = F) %>% filter(X1 %in% list_partial) %>% # Filter the table using the IDs extracted previously filter(X4 == "Pfam") %>% # Filter the output to get the PFAM domains select(X1, X5, X6) %>% filter(str_detect(X6, "Transposase")) # Search for elements annotated as transposases
The most abundant element in the dataset is partial IS21.
The elements IS1182 and IS3 are the most abundant complete elements.
The analysis showed that there are 17 potential new mobile elements in the dataset.
Integrons are assembly platforms — DNA elements that acquire open reading frames embedded in exogenous gene cassettes and convert them to functional genes by ensuring their correct expression (Mazel., 2006).
I used Integron Finder to identify the integrons in the metagenome.
conda create -n integronFinder
conda activate integronFinder
conda install integron_finder -n integronFinder
integron_finder --local-max --func-annot 0118A1023.contigs.fa
The output of the program contain 3 files:
0118A1023.contigs.integrons: A file with all integrons and their elements detected in all sequences in the input file.
0118A1023.contigs.summary: A summary file with the number and type of integrons per sequence.
integron_finder.out: A copy standard output.
Lets see the 0118A1023.contigs.integrons file:
integrons<-read_delim("../data/0118A1023.contigs.integrons", delim = "\t", skip = 1)
Integron finder can predict three types of elements described in Cury et al., 2016:
In this metagenome, I was only able to identify two CALIN elements.
What genes are present in qb307082014_I_scaffold_18?
Exploring the neighborhood genes to the CALIN element would tell us maybe more about what are these sequences doing and why are they potentially important for the environment?
I used OperonMapper to predict the genes that are near the CALIN element predicted.
The output files from Operon Mapper are:
The list_of_operons_417647 file contains the annotation of the elements and the number of operons predicted.
Exploring the list_of_operons_417647 file.
Operon_mapper<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>% fill(Operon) # Fill NA values with the previous value in a column
By manually looking at the list_of_operons_417647 file I found a sequence that is associated to a transposase is COG2826, and was found in the qb307082014_I_scaffold_18 scaffold. Suggesting is the one inferred by Integron Finder.
COG2826<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>% fill(Operon) %>% drop_na(Type) %>% filter(COGgene == "COG2826")
I look at the operons that where up and down to see which genes would be there and I found something interesting.
operons_to_look<-c("7", "9")neighbour_genes<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>% fill(Operon) %>% drop_na(Type) %>% filter(Operon %in% operons_to_look)
The gene in operon 9 has a non identidied function, I extracted the sequence and make a web blast search.
The best hit at the NCBI database was accessory regulator AgrC. The same protein at UniProt was A0A0K2MC93, which is classified by Gene Ontology database as GO:0000155:
Catalysis of the phosphorylation of a histidine residue in response to detection of an extracellular signal such as a chemical ligand or change in environment, to initiate a change in cell state or activity. The two-component sensor is a histidine kinase that autophosphorylates a histidine residue in its active site. The phosphate is then transferred to an aspartate residue in a downstream response regulator, to trigger a response.
In this metagenome I was only able to identify two CALIN elements.
These elements were identified in the same scaffold and the Operon Mapper analysis revealed that there is a hypothetical protein cluster nearby involved in signal transduction from the environment.
In this metagenome I was only able to identify two CALIN elements.
These elements were identified in the same scaffold and the Operon Mapper analysis revealed that there is a hypothetical protein cluster nearby involved in signal transduction from the environment.
I think that the genes within this integron cassette are proteins related to some environmental response to stress, meaning that in this community microbes exchange genes that allows them cope with a changing environment.
I searched for lysogenic phages in the metagenome as these elements are one of the most efficient mechanisms of HGT (Jiang and Paul 1998; Weinbauer and Rassouldzadegan 2004; Moon et al.2016).
PHASTER. A software that performs a number of database comparisons as well as phage 'cornerstone' feature identification steps to locate, annotate and display prophage sequences and prophage features (Zhou et al., 2011).
You can run a bunch of jobs within a loop using the API. I have done it before for one of my previous publications.
Here I just run it like that
wget --post-file="0118A1023.contigs.fa" "http://phaster.ca/phaster_api" -O 0118A1023.contigs.fa.phaster
And then you have to wait... And I still waiting
This way you can access to the job list and see how long is last for your job to run:
wget "http://phaster.ca/phaster_api?acc=ZZ_cd44018c8e" -O Output_filename
After the files were processed, we downloaded the results:
wget -O ID_FROM.phaster "http://phaster.ca/jobs/ID_FROM.phaster/summary.txt"
Then I would extract the number or intact prophages in the metagenome.
for i in $(*.fasta.phaster); do grep "intact prophage" $i >>number_phages; done
Reconstruct genomes. Reconstructing the genomes would allow me to know in which organism this elements are present and build more accurate hypothesis regarding the ecology and evolution of the mobile elements.
Over the MAGs I could run other programs and build a more accurate inferences of the operons:
OperonMapper to predict the genes that are potentially being transfer within the community.
IslandViewer to identify elements that are actively moving.