Mobile genetic elements (MGEs) in the gut microbiota.Mirna Vázquez Rosas Landa2022/04/041 / 32

Mobile genetic elements (MGEs) in gut metagenomes

In the beginning the mobile genetic elements were associated with pathogenic strains; however, this has changed with time Dobrindt et al.2004. For example, an immense diversity of phages has been described in the ocean, suggesting that these mobile elements might play a relevant role in microbial communities working as seed banks for adaptive genes which any community member can potentially acquire (Fancello et al.2011; Modi et al.2013; Subirats et al.2016).

I started my search by looking for these three types of mobile elements.

Insertion sequences
Integrons
Phages

2 / 32

Insertion sequence elements in the gut metagenome

To search for insertion elements I used ISEScan. This program is a python pipeline that identifies elements requiring a nucleotide sequence as the input and identifies both complete and partial Insertion Sequence (IS).

3 / 32

Install ISEScan

Create a conda environment

conda create -n isescan_env

Activate the conda environment

conda activate isescan_env

Add Bioconda channels

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Conda install

conda install  -n isescan_env -c conda isescan

4 / 32

Run ISEScan

nohup isescan.py --seqfile 0118A1023.contigs.fa --output isescan_gut_microbiota --nthread 20 &

5 / 32

Explore the output of ISEScan

The program uses FragGeneScan to make the protein prediction, so one first exciting thing to look at is how many proteins were inferred?

grep -c ">" 0118A1023.contigs.fa.faa

Number of sequences: 15,164

How many elements were inferred? One easy way to do it is to count the number of rows in the output file.

 wc -l 0118A1023.contigs.fa.tsv

Number of elements: 71

6 / 32

Explore the output of ISEScan

We can take a quick look to the output file.

seqID	family	cluster	isBegin	isEnd	isLen	ncopy4is	start1	end1	start2	end2	score	irId	irLen	orfBegin	orfEnd	strand	orfLen	E-value	E-value4copy	type	ov	tir
qb307082014_I_scaffold_0	ISNCY	ISNCY_229	362714	363978	1265	1	362714	362733	363959	363978	24	16	20	362856	363845	+	990	2e-136	2e-136	c	1	CCTGCTTGAAGAGAATCAGC:CCAGCTTGAAGTGCATCTGC
qb307082014_I_scaffold_1	IS3	IS3_277	29932	31150	1219	1	29932	29947	31135	31150	20	13	16	30066	31134	+	1069	2.7e-69	2.7e-69	c	1	TAATCTAGACACTTCT:TAATATGGACACTGCT
qb307082014_I_scaffold_1	IS481	IS481_240	562014	563317	1304	1	562014	562026	563305	563317	22	12	13	562384	563229	-	846	1.7e-13	1.7e-13	p	1	CGAAAGGAGAAAT:CGAAAGGTGAAAT
qb307082014_I_scaffold_1038	IS200/IS605	IS200/IS605_96	1	110	110	1	0	0	0	0	0	0	0	1	110	+	110	8.1e-13	8.1e-13	p	1	-:-
qb307082014_I_scaffold_1051	IS21	IS21_122	126	1394	1269	1	126	146	1374	1394	18	15	21	1	1478	+	1478	4e-46	4e-46	p	1	GAAAGTTATGTCTGTGACATT:GAAAGTCATCTCGTTCATATT
qb307082014_I_scaffold_1052	IS21	IS21_122	131	1452	1322	1	131	140	1443	1452	16	9	10	1	1478	+	1478	1.4e-47	1.4e-47	p	1	TTATGTCTGT:TTATATCTGT
qb307082014_I_scaffold_1076	IS982	IS982_263\|IS982\|\|protein:plasmid:140967	519	1284	766	1	519	529	1274	1284	18	10	11	937	1413	+	477	8.8e-39	8.8e-39	p	1	ACTGTTGCCGG:ACTGATGCCGG
qb307082014_I_scaffold_110	IS256	IS256_162	20411	21536	1126	1	20411	20423	21524	21536	22	12	13	20659	21634	-	976	7.7e-150	7.7e-150	p	1	TGACAAAATCATT:TGACTAAATCATT

Previous1 2 3 4 5…9Next

7 / 32

Explore the output of ISEScan

Which families are more abundant?

more_abundant<-Insertion %>%
  count(family) %>% # Count the number of elements in the family column
  arrange(-n) # Arrange the number from the from largest to smallest
head(more_abundant) # Show the top six
## # A tibble: 6 × 2
##   family     n
##   <chr>  <int>
## 1 IS21      15
## 2 IS3        9
## 3 IS256      6
## 4 IS66       6
## 5 IS982      6
## 6 IS1182     5

The most abundant element is:

IS21

8 / 32

Complete and partial elements

Some of the IS elements can be incomplete, which may reflect ancestral transposition events (Siguier et al., 2014) suggesting that the rest of the key sequence features were lost over time.

Therefore, I focused at first on the complete elements, considering they are more likely to be recent and active in the community.

9 / 32

Explore the output ISEScan

ISEScan has an extra column indicating if the element is complete or partial; therefore, I filtered the output file to find the most abundant complete elements in the metagenome.

Which ISs are complete?

Complete<-Insertion %>%
  filter(type == "c") %>% # Filter the rows that had a "c" in the type column, that stands for complete
  count(family) %>%
  arrange(-n)
head(Complete)
## # A tibble: 6 × 2
##   family     n
##   <chr>  <int>
## 1 IS1182     4
## 2 IS3        4
## 3 IS30       3
## 4 IS66       3
## 5 IS256      2
## 6 IS4        2

The most abundant elements that are also complete and probably more recent are:

IS1182
IS3

10 / 32

New hypothesis!!!

What if there are partial sequences that also contain transposases?
Maybe the rest of the sequence is highly divergent, and that is why we can not find any matches to the databases; therefore, the sequences are classified as partial?
If they have transposases, that might mean that they must be elements that could be actively moving something in the community, but since they are divergent and new, the program is not able to classify them (properly).

11 / 32

Explore the output of ISEScan to look for potential new elements.

I ran InterproScan to identify transposases domains in the sequences that were predicted with FragGeneScan by ISEScan.

interproscan.sh -cpu 20 -goterms -pa -i  0118A1023.contigs.fa.orf_interpr.faa > Log_Interpro_Scan_$i.txt

Then I extracted a list of the IDs of the partial sequences.

list_partial<-Insertion %>%
  filter(type == "p") %>% # Here I am filtering the output table to extract the "partial" hits.
  select(seqID, orfBegin, orfEnd, strand) %>% # Select certain columns on the table
  unite("seq_name", c("seqID", "orfBegin", "orfEnd", "strand"), sep="_") %>% # The output of ISEScan uses as sequence ID the name of the scaffold so I create a new colum merging the information from the start and end of the sequence to be able to match the results with InterproScan
  distinct() %>% # Remove duplicated
  pull() # Create a vector

12 / 32

Explore the output of ISEScan to look for potential new elements.

I read the output and filtered the sequences based on the list.

interpro<-read_delim("../data/0118A1023.contigs.fa.orf_interpro.faa.tsv", delim="\t", col_names = F) %>%
  filter(X1 %in% list_partial) %>% # Filter the table using the IDs extracted previously
  filter(X4 == "Pfam") %>% # Filter the output to get the PFAM domains
  select(X1, X5, X6) %>%
  filter(str_detect(X6, "Transposase")) # Search for elements annotated as transposases

X1	X5	X6
qb307082014_I_scaffold_1602_331_1806_+	PF03050	Transposase IS66 family
qb307082014_I_scaffold_737_6236_6776_+	PF00872	Transposase, Mutator family
qb307082014_I_scaffold_1076_937_1413_+	PF13612	Transposase DDE domain
qb307082014_I_scaffold_34_1124_1628_+	PF00872	Transposase, Mutator family
qb307082014_I_scaffold_110_20659_21634_-	PF00872	Transposase, Mutator family
qb307082014_I_scaffold_497_1_926_-	PF00872	Transposase, Mutator family

Previous1 2 3Next

13 / 32

Conclusion of ISEScan

The most abundant element in the dataset is partial IS21.
The elements IS1182 and IS3 are the most abundant complete elements.
The analysis showed that there are 17 potential new mobile elements in the dataset.

14 / 32

Integron prediction in the gut metagenome.

Integrons are assembly platforms — DNA elements that acquire open reading frames embedded in exogenous gene cassettes and convert them to functional genes by ensuring their correct expression (Mazel., 2006).
I used Integron Finder to identify the integrons in the metagenome.

15 / 32

Install Integron Finder

Create a conda environment

conda create -n integronFinder

Activate the conda environment

conda activate integronFinder

Conda install

 conda install integron_finder -n integronFinder

16 / 32

Run Integron Finder

integron_finder --local-max --func-annot 0118A1023.contigs.fa

17 / 32

Explore Integron Finder output

The output of the program contain 3 files:

0118A1023.contigs.integrons: A file with all integrons and their elements detected in all sequences in the input file.
0118A1023.contigs.summary: A summary file with the number and type of integrons per sequence.
integron_finder.out: A copy standard output.

18 / 32

Explore Integron Finder output

Lets see the 0118A1023.contigs.integrons file:

integrons<-read_delim("../data/0118A1023.contigs.integrons", delim = "\t",
           skip = 1)

ID_integron	ID_replicon	element	pos_beg	pos_end	strand	evalue	type_elt	annotation	model	type	default	distance_2attC	considered_topology
integron_01	qb307082014_I_scaffold_18	attc_001	120781	120823	-1	0.43	attC	attC	attc_4	CALIN	No		lin
integron_01	qb307082014_I_scaffold_18	attc_002	122724	122771	-1	0.43	attC	attC	attc_4	CALIN	No	1901	lin

Previous1Next

Integron finder can predict three types of elements described in Cury et al., 2016:

Complete integrons: Include an integrase and at least one attC site
CALIN: The clusters of attC sites lacking integron-integrases (CALIN) are composed of at least two attC sites.
In0: The In0 elements are composed of an integron integrase and no attC sites.

In this metagenome, I was only able to identify two CALIN elements.

19 / 32

New hypothesis!!!

What genes are present in qb307082014_I_scaffold_18?
Exploring the neighborhood genes to the CALIN element would tell us maybe more about what are these sequences doing and why are they potentially important for the environment?
I used OperonMapper to predict the genes that are near the CALIN element predicted.

20 / 32

Operon Mapper output

The output files from Operon Mapper are:

predicted_protein_sequences_417647
predicted_orfs_417647
predicted_COGs_417647
operonic_gene_pairs_417647
list_of_operons_417647
functional_descriptions_417647
ORFs_coordinates_417647

The list_of_operons_417647 file contains the annotation of the elements and the number of operons predicted.

21 / 32

Operon Mapper output

Exploring the list_of_operons_417647 file.

Operon_mapper<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>%
  fill(Operon) # Fill NA values with the previous value in a column

Operon	IdGene	Type	COGgene	PosLeft	postRight	Strand	Function
1
1	LOOBNNKK_00001	CDS	COG1284	514	1410	+	[S] Uncharacterized conserved protein
2
2	LOOBNNKK_00002	CDS	COG3607	1461	1859	-	[R] Predicted lactoylglutathione lyase
3
3	LOOBNNKK_00003	CDS	COG3542	1979	2464	-	[S] Uncharacterized conserved protein

Previous1 2 3 4 5…37Next

22 / 32

Operon Mapper output

By manually looking at the list_of_operons_417647 file I found a sequence that is associated to a transposase is COG2826, and was found in the qb307082014_I_scaffold_18 scaffold. Suggesting is the one inferred by Integron Finder.

COG2826<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>%
  fill(Operon) %>%
  drop_na(Type) %>%
  filter(COGgene == "COG2826")

Operon	IdGene	Type	COGgene	PosLeft	postRight	Strand	Function
8	LOOBNNKK_00020	CDS	COG2826	23250	24209	-	[L] Transposase and inactivated derivatives, IS30 family

Previous1Next

23 / 32

Operon Mapper output

I look at the operons that where up and down to see which genes would be there and I found something interesting.

operons_to_look<-c("7", "9")
neighbour_genes<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>%
  fill(Operon) %>%
  drop_na(Type) %>%
  filter(Operon %in% operons_to_look)

Operon	IdGene	Type	COGgene	PosLeft	postRight	Strand	Function
7	LOOBNNKK_00018	CDS	COG3279	21499	22218	-	[KT] Response regulator of the LytR/AlgR family
7	LOOBNNKK_00019	CDS	COG3290	22220	23050	-	[T] Signal transduction histidine kinase regulating citrate/malate
9	LOOBNNKK_00021	CDS	ROG4413	24272	24613	-

Previous1Next

24 / 32

Blast search results

The gene in operon 9 has a non identidied function, I extracted the sequence and make a web blast search.
The best hit at the NCBI database was accessory regulator AgrC. The same protein at UniProt was A0A0K2MC93, which is classified by Gene Ontology database as GO:0000155:

Catalysis of the phosphorylation of a histidine residue in response to detection of an extracellular signal such as a chemical ligand or change in environment, to initiate a change in cell state or activity. The two-component sensor is a histidine kinase that autophosphorylates a histidine residue in its active site. The phosphate is then transferred to an aspartate residue in a downstream response regulator, to trigger a response.

25 / 32

Conclusion

In this metagenome I was only able to identify two CALIN elements.
These elements were identified in the same scaffold and the Operon Mapper analysis revealed that there is a hypothetical protein cluster nearby involved in signal transduction from the environment.

26 / 32

Conclusion

In this metagenome I was only able to identify two CALIN elements.
These elements were identified in the same scaffold and the Operon Mapper analysis revealed that there is a hypothetical protein cluster nearby involved in signal transduction from the environment.

I think that the genes within this integron cassette are proteins related to some environmental response to stress, meaning that in this community microbes exchange genes that allows them cope with a changing environment.

26 / 32

Phages in the gut metagenome.

I searched for lysogenic phages in the metagenome as these elements are one of the most efficient mechanisms of HGT (Jiang and Paul 1998; Weinbauer and Rassouldzadegan 2004; Moon et al.2016).
PHASTER. A software that performs a number of database comparisons as well as phage 'cornerstone' feature identification steps to locate, annotate and display prophage sequences and prophage features (Zhou et al., 2011).

27 / 32

Run PHASTER

You can run a bunch of jobs within a loop using the API. I have done it before for one of my previous publications.

Here I just run it like that

wget --post-file="0118A1023.contigs.fa" "http://phaster.ca/phaster_api" -O 0118A1023.contigs.fa.phaster

And then you have to wait... And I still waiting

This way you can access to the job list and see how long is last for your job to run:

wget "http://phaster.ca/phaster_api?acc=ZZ_cd44018c8e" -O Output_filename

28 / 32

How would I analyze PHASTER results?

After the files were processed, we downloaded the results:

wget -O ID_FROM.phaster  "http://phaster.ca/jobs/ID_FROM.phaster/summary.txt"

Then I would extract the number or intact prophages in the metagenome.

for i in $(*.fasta.phaster); do grep "intact prophage" $i >>number_phages; done

29 / 32

Future directionsDiversity. To determine how divergent are all those elements elements I would do phylogenetic reconstructions of the transposases and the genes that cluster closely to the mobile elements.
30 / 32

Future directions

Reconstruct genomes. Reconstructing the genomes would allow me to know in which organism this elements are present and build more accurate hypothesis regarding the ecology and evolution of the mobile elements.

Over the MAGs I could run other programs and build a more accurate inferences of the operons:

OperonMapper to predict the genes that are potentially being transfer within the community.
IslandViewer to identify elements that are actively moving.

31 / 32

32 / 32

Mobile genetic elements (MGEs) in gut metagenomes

I started my search by looking for these three types of mobile elements.

Insertion sequences

Integrons

Phages

2 / 32

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Tile View: Overview of Slides

Mobile genetic elements (MGEs) in the gut microbiota.Mirna Vázquez Rosas Landa2022/04/041 / 32

Mobile genetic elements (MGEs) in gut metagenomes

I started my search by looking for these three types of mobile elements.

Insertion sequences
Integrons
Phages

2 / 32

Insertion sequence elements in the gut metagenome

To search for insertion elements I used ISEScan. This program is a python pipeline that identifies elements requiring a nucleotide sequence as the input and identifies both complete and partial Insertion Sequence (IS).

3 / 32

Install ISEScan

Create a conda environment

conda create -n isescan_env

Activate the conda environment

conda activate isescan_env

Add Bioconda channels

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Conda install

conda install  -n isescan_env -c conda isescan

4 / 32

Run ISEScan

nohup isescan.py --seqfile 0118A1023.contigs.fa --output isescan_gut_microbiota --nthread 20 &

5 / 32

Explore the output of ISEScan

The program uses FragGeneScan to make the protein prediction, so one first exciting thing to look at is how many proteins were inferred?

grep -c ">" 0118A1023.contigs.fa.faa

Number of sequences: 15,164

How many elements were inferred? One easy way to do it is to count the number of rows in the output file.

 wc -l 0118A1023.contigs.fa.tsv

Number of elements: 71

6 / 32

Explore the output of ISEScanWe can take a quick look to the output file.
 
7 / 32

Explore the output of ISEScan

Which families are more abundant?

more_abundant<-Insertion %>%
  count(family) %>% # Count the number of elements in the family column
  arrange(-n) # Arrange the number from the from largest to smallest
head(more_abundant) # Show the top six
## # A tibble: 6 × 2
##   family     n
##   <chr>  <int>
## 1 IS21      15
## 2 IS3        9
## 3 IS256      6
## 4 IS66       6
## 5 IS982      6
## 6 IS1182     5

The most abundant element is:

IS21

8 / 32

Complete and partial elements

Some of the IS elements can be incomplete, which may reflect ancestral transposition events (Siguier et al., 2014) suggesting that the rest of the key sequence features were lost over time.

Therefore, I focused at first on the complete elements, considering they are more likely to be recent and active in the community.

9 / 32

Explore the output ISEScan

ISEScan has an extra column indicating if the element is complete or partial; therefore, I filtered the output file to find the most abundant complete elements in the metagenome.

Which ISs are complete?

Complete<-Insertion %>%
  filter(type == "c") %>% # Filter the rows that had a "c" in the type column, that stands for complete
  count(family) %>%
  arrange(-n)
head(Complete)
## # A tibble: 6 × 2
##   family     n
##   <chr>  <int>
## 1 IS1182     4
## 2 IS3        4
## 3 IS30       3
## 4 IS66       3
## 5 IS256      2
## 6 IS4        2

The most abundant elements that are also complete and probably more recent are:

IS1182
IS3

10 / 32

New hypothesis!!!

What if there are partial sequences that also contain transposases?
Maybe the rest of the sequence is highly divergent, and that is why we can not find any matches to the databases; therefore, the sequences are classified as partial?
If they have transposases, that might mean that they must be elements that could be actively moving something in the community, but since they are divergent and new, the program is not able to classify them (properly).

11 / 32

Explore the output of ISEScan to look for potential new elements.

I ran InterproScan to identify transposases domains in the sequences that were predicted with FragGeneScan by ISEScan.

interproscan.sh -cpu 20 -goterms -pa -i  0118A1023.contigs.fa.orf_interpr.faa > Log_Interpro_Scan_$i.txt

Then I extracted a list of the IDs of the partial sequences.

list_partial<-Insertion %>%
  filter(type == "p") %>% # Here I am filtering the output table to extract the "partial" hits.
  select(seqID, orfBegin, orfEnd, strand) %>% # Select certain columns on the table
  unite("seq_name", c("seqID", "orfBegin", "orfEnd", "strand"), sep="_") %>% # The output of ISEScan uses as sequence ID the name of the scaffold so I create a new colum merging the information from the start and end of the sequence to be able to match the results with InterproScan
  distinct() %>% # Remove duplicated
  pull() # Create a vector

12 / 32

Explore the output of ISEScan to look for potential new elements.

I read the output and filtered the sequences based on the list.

interpro<-read_delim("../data/0118A1023.contigs.fa.orf_interpro.faa.tsv", delim="\t", col_names = F) %>%
  filter(X1 %in% list_partial) %>% # Filter the table using the IDs extracted previously
  filter(X4 == "Pfam") %>% # Filter the output to get the PFAM domains
  select(X1, X5, X6) %>%
  filter(str_detect(X6, "Transposase")) # Search for elements annotated as transposases

13 / 32

Conclusion of ISEScan

The most abundant element in the dataset is partial IS21.
The elements IS1182 and IS3 are the most abundant complete elements.
The analysis showed that there are 17 potential new mobile elements in the dataset.

14 / 32

Integron prediction in the gut metagenome.

Integrons are assembly platforms — DNA elements that acquire open reading frames embedded in exogenous gene cassettes and convert them to functional genes by ensuring their correct expression (Mazel., 2006).
I used Integron Finder to identify the integrons in the metagenome.

15 / 32

Install Integron Finder

Create a conda environment

conda create -n integronFinder

Activate the conda environment

conda activate integronFinder

Conda install

 conda install integron_finder -n integronFinder

16 / 32

Run Integron Finder

integron_finder --local-max --func-annot 0118A1023.contigs.fa

17 / 32

Explore Integron Finder output

The output of the program contain 3 files:

0118A1023.contigs.integrons: A file with all integrons and their elements detected in all sequences in the input file.
0118A1023.contigs.summary: A summary file with the number and type of integrons per sequence.
integron_finder.out: A copy standard output.

18 / 32

Explore Integron Finder output

Lets see the 0118A1023.contigs.integrons file:

integrons<-read_delim("../data/0118A1023.contigs.integrons", delim = "\t",
           skip = 1)

Integron finder can predict three types of elements described in Cury et al., 2016:

Complete integrons: Include an integrase and at least one attC site
CALIN: The clusters of attC sites lacking integron-integrases (CALIN) are composed of at least two attC sites.
In0: The In0 elements are composed of an integron integrase and no attC sites.

In this metagenome, I was only able to identify two CALIN elements.

19 / 32

New hypothesis!!!

What genes are present in qb307082014_I_scaffold_18?
Exploring the neighborhood genes to the CALIN element would tell us maybe more about what are these sequences doing and why are they potentially important for the environment?
I used OperonMapper to predict the genes that are near the CALIN element predicted.

20 / 32

Operon Mapper output

The output files from Operon Mapper are:

predicted_protein_sequences_417647
predicted_orfs_417647
predicted_COGs_417647
operonic_gene_pairs_417647
list_of_operons_417647
functional_descriptions_417647
ORFs_coordinates_417647

The list_of_operons_417647 file contains the annotation of the elements and the number of operons predicted.

21 / 32

Operon Mapper output

Exploring the list_of_operons_417647 file.

Operon_mapper<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>%
  fill(Operon) # Fill NA values with the previous value in a column

22 / 32

Operon Mapper output

COG2826<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>%
  fill(Operon) %>%
  drop_na(Type) %>%
  filter(COGgene == "COG2826")

23 / 32

Operon Mapper output

I look at the operons that where up and down to see which genes would be there and I found something interesting.

operons_to_look<-c("7", "9")
neighbour_genes<-read_delim("../data/417647/list_of_operons_417647", delim="\t") %>%
  fill(Operon) %>%
  drop_na(Type) %>%
  filter(Operon %in% operons_to_look)

24 / 32

Blast search results

The gene in operon 9 has a non identidied function, I extracted the sequence and make a web blast search.
The best hit at the NCBI database was accessory regulator AgrC. The same protein at UniProt was A0A0K2MC93, which is classified by Gene Ontology database as GO:0000155:

25 / 32

Conclusion

In this metagenome I was only able to identify two CALIN elements.
These elements were identified in the same scaffold and the Operon Mapper analysis revealed that there is a hypothetical protein cluster nearby involved in signal transduction from the environment.

26 / 32

Conclusion

In this metagenome I was only able to identify two CALIN elements.
These elements were identified in the same scaffold and the Operon Mapper analysis revealed that there is a hypothetical protein cluster nearby involved in signal transduction from the environment.

26 / 32

Phages in the gut metagenome.

I searched for lysogenic phages in the metagenome as these elements are one of the most efficient mechanisms of HGT (Jiang and Paul 1998; Weinbauer and Rassouldzadegan 2004; Moon et al.2016).
PHASTER. A software that performs a number of database comparisons as well as phage 'cornerstone' feature identification steps to locate, annotate and display prophage sequences and prophage features (Zhou et al., 2011).

27 / 32

Run PHASTER

You can run a bunch of jobs within a loop using the API. I have done it before for one of my previous publications.

Here I just run it like that

wget --post-file="0118A1023.contigs.fa" "http://phaster.ca/phaster_api" -O 0118A1023.contigs.fa.phaster

And then you have to wait... And I still waiting

This way you can access to the job list and see how long is last for your job to run:

wget "http://phaster.ca/phaster_api?acc=ZZ_cd44018c8e" -O Output_filename

28 / 32

How would I analyze PHASTER results?

After the files were processed, we downloaded the results:

wget -O ID_FROM.phaster  "http://phaster.ca/jobs/ID_FROM.phaster/summary.txt"

Then I would extract the number or intact prophages in the metagenome.

for i in $(*.fasta.phaster); do grep "intact prophage" $i >>number_phages; done

29 / 32

Future directionsDiversity. To determine how divergent are all those elements elements I would do phylogenetic reconstructions of the transposases and the genes that cluster closely to the mobile elements.
30 / 32

Future directions

Over the MAGs I could run other programs and build a more accurate inferences of the operons:

OperonMapper to predict the genes that are potentially being transfer within the community.
IslandViewer to identify elements that are actively moving.

31 / 32

32 / 32