Kiepas_et_al_2024_16S: 16S taxonomy and clustering is not a proxy for taxonomy in Streptomyces

This repository contains all supplementary information for analyses reported in Kiepas et al. (2024) describing inconsistencies between taxonomies inferred using 16S and whole-genome identities in Streptomyces.


This repository is provided to enable both reproduction and independent exploration of the analysis reported in this manuscript.

Table of contents

  1. Reporting Problems
  2. Contributors
  3. Contact Us
  4. Downloading Repository
  5. Set Up
  6. Repository Files
  7. Reproducing analyses

Reporting Problems

Please report any issues or problems with this repository at the Issues page.


Contributors

This manuscript has the following contributors:

Contact Us

How to reach us:

Downloading Repository

If you wish to independently explore, reproduce and/or validate the analyses reported in the manuscript, you can use git to clone this repository to your machine.

git clone

Alternatively, click here to download the current state of this repository as a .zip file and expand it in the usual way for your operating system. In either case, change directory to the repository root:

cd Kiepas_et_al_2024_16S

Set Up

We strongly recommend creating a conda environment specifically for this analysis. For example, if you have cloned or downloaded the repository and navigated to its root directory, the commands below should set up an appropriate environment:

conda create --name streptomyces python=3.8 -y
conda activate streptomyces
conda install --file requirements.txt -y

You will also need to install the following software within the environment, following the installation instructions appropriate for each program:

Due to repository size limits at GitHub, we are unable to provide the complete set of 16S sequences and genomes used in this manuscript in this repository. These FASTA and GenBank files are available from Zenodo at DOI, and should be placed in the appropriate directories. The NCBI reference taxonomy is also available from Zenodo.

Repository Files

Here you can find a list of all supplementary files provided in this repository. The current set of subfolders and files includes:

Supplementary file 1: Generate figures using Python and R. Directory containing all data, Python and R scripts to generate figures for this manuscript. (93MB)

Supplementary file 2: Raw 16S rRNA public databases. Directory containing four separate .txt files with sequence IDs for the public 16S rRNA databases used in this manuscript, an additional .txt file with Greengenes sequence taxonomy information, and a Python script used to map taxonomy information to sequences found in Greengenes v13.5. (82.2MB)

Supplementary file 3: Filtration of 16S rRNA public databases. Directory containing the Python script used to filter the raw databases, and the generated outputs. (84.2MB)

Supplementary file 4: Cleaning of the filtered 16S rRNA local databases. Directory containing all bash and Python scripts used to clean the local full-length 16S rRNA databases by removing redundant and poor-quality 16S rRNA sequences. (109.8MB)
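The cleaning step described for Supplementary file 4 can be illustrated with a minimal sketch. This is not the repository's script: the function names, the `min_length` cut-off, and the definition of "poor quality" (too short, or containing any symbol other than A, C, G, T) are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def clean_sequences(records: Iterable[Tuple[str, str]],
                    min_length: int = 1200) -> Dict[str, str]:
    """Keep the first copy of each unambiguous sequence of at least min_length.

    min_length is a hypothetical threshold, not the value used in the manuscript.
    """
    kept: Dict[str, str] = {}
    seen = set()
    for seq_id, seq in records:
        if len(seq) < min_length:      # too short to count as full-length
            continue
        if set(seq) - set("ACGT"):     # contains an ambiguity symbol (N, R, Y, ...)
            continue
        if seq in seen:                # exact duplicate of a sequence already kept
            continue
        seen.add(seq)
        kept[seq_id] = seq
    return kept


# Toy input: seq2 duplicates seq1, seq3 contains N, seq4 is too short.
example = """>seq1
ACGTACGT
>seq2
ACGTACGT
>seq3
ACGTNCGT
>seq4
ACG
""".splitlines()

cleaned = clean_sequences(parse_fasta(example), min_length=8)
print(cleaned)
```

Only exact string duplicates are treated as redundant here; the scripts in the directory may apply different criteria.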

Supplementary file 5: Sequence Clustering. Directory containing a bash script used to cluster the cleaned full-length local 16S rRNA Streptomyces databases at various thresholds, along with .txt files listing accessions for representative sequences and cluster members at each clustering threshold. (471.7MB)

Supplementary file 6: Analysis of taxonomic composition for each clustering threshold. Directory containing Python scripts, NCBI taxonomy input and all outputs generated to determine the taxonomic composition at each clustering threshold. (52.7MB)

Supplementary file 7: Cluster sizes. Empirical cumulative plot showing cluster size generated for all clustering thresholds. (PDF 44KB)

Supplementary file 8: Cluster taxID abundance. Empirical cumulative plot of the number of unique taxIDs present in each cluster, for all clustering thresholds. (PDF 9KB)
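The quantities behind the two empirical cumulative plots (cluster size and number of unique taxIDs per cluster) can be sketched as follows. The data structures, accessions and taxID values are hypothetical placeholders, and the functions are an illustration rather than the code in Supplementary file 6.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def unique_taxids_per_cluster(clusters: Dict[str, List[str]],
                              taxid_of: Dict[str, int]) -> Dict[str, int]:
    """Count the distinct taxIDs among the members of each cluster."""
    return {rep: len({taxid_of[member] for member in members})
            for rep, members in clusters.items()}


def ecdf(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return sorted unique values and the fraction of observations <= each."""
    counts = Counter(values)
    total = len(values)
    xs, ys, running = [], [], 0
    for value in sorted(counts):
        running += counts[value]
        xs.append(value)
        ys.append(running / total)
    return xs, ys


# Hypothetical clusters: representative accession -> member accessions.
clusters = {"A": ["a1", "a2", "a3"], "B": ["b1"], "C": ["c1", "c2"]}
# Hypothetical taxID assignments (placeholder integers, not real NCBI taxIDs).
taxid_of = {"a1": 101, "a2": 101, "a3": 102, "b1": 101, "c1": 103, "c2": 103}

taxid_counts = unique_taxids_per_cluster(clusters, taxid_of)
size_x, size_y = ecdf([len(m) for m in clusters.values()])
taxid_x, taxid_y = ecdf(list(taxid_counts.values()))
print(taxid_counts, size_x, size_y, taxid_x, taxid_y)
```

Plotting `size_x` against `size_y` as a step function gives an empirical cumulative plot of cluster size for one clustering threshold; repeating this per threshold would give one curve each.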

Supplementary file 9: MSA. Directory containing all Python and bash scripts, and additional data needed to generate and clean the multiple sequence alignment (MSA) for phylogenetic analysis. (69.2MB)

Supplementary file 10: Phylogenetic reconstruction. Directory containing bash scripts used for phylogenetic reconstruction, and all generated outputs and log files. (76MB)

Supplementary file 11: Collapse branches. Directory containing the Jupyter notebook used for collapsing branches with the same species names, and the collapsed tree in Newick format. (3.5MB)

Supplementary file 12: Phylogenetic tree. PDF file showing the collapsed phylogenetic tree, in which branches with transfer bootstrap expectation (TBE) support of >= 50% are marked. (PDF 224KB)

Supplementary file 13: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of Streptomyces albus and Streptomyces griseus. (PDF 229KB)

Supplementary file 14: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of Streptomyces albulus, Streptomyces lydicus and Streptomyces venezuelae. (PDF 228KB)

Supplementary file 15: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of Streptomyces clavuligerus and Streptomyces coelicolor. (PDF 227KB)

Supplementary file 16: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of Streptomyces lavendulae, Streptomyces rimosus and Streptomyces scabiei. (PDF 228KB)

Supplementary file 17: Streptomyces genomes. Directory containing bash scripts used to download Streptomyces genomes, and Python scripts used to check assembly status. The directory also contains two separate .txt files with Streptomyces genomes used in this manuscript: one file with all initial candidates, and a second file with replaced genomes. (20.2MB)

Supplementary file 18: Extraction of full-length and ambiguity-free 16S rRNA sequences from Streptomyces genomes. Directory containing all Python and bash scripts used to extract full-length sequences from the filtered Streptomyces genomes. The directory also contains a single FASTA file with all extracted 16S rRNA sequences, a single FASTA file with filtered sequences, and a .txt file with accessions of the genomes retained in the analysis. (28.2MB)

Supplementary file 19: ANI analysis among Streptomyces genomes with identical 16S rRNA sequences. Directory containing all bash and Python scripts used to determine taxonomic boundaries among Streptomyces genomes sharing identical full-length 16S rRNA sequences, together with all output and pyANI log files. (52.8MB)

Supplementary file 20: Network analysis of genomes based on shared 16S sequences. Directory containing a Jupyter notebook with the NetworkX analysis and all associated output files, including a bash script for pyANI analysis runs on all connected components and all associated matrices, heatmaps and log files. (106.9MB)

Supplementary file 21: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S sequences with each node colour corresponding to the number of connections/degrees. (HTML 4.7MB)

Supplementary file 22: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S sequences showing clique (blue) and non-clique (green) components. (HTML 4.7MB)
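The ideas behind the network graphs (nodes are genomes, edges join genomes that share a full-length 16S sequence, and each connected component is either a clique or not) can be sketched as below. The repository's notebook uses NetworkX; this illustration uses only the Python standard library, and the genome names, sequence labels and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set


def build_graph(genome_seqs: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Connect two genomes if they share at least one identical 16S sequence."""
    genomes_by_seq = defaultdict(set)
    for genome, seqs in genome_seqs.items():
        for seq in seqs:
            genomes_by_seq[seq].add(genome)
    adjacency: Dict[str, Set[str]] = {genome: set() for genome in genome_seqs}
    for genomes in genomes_by_seq.values():
        for a, b in combinations(sorted(genomes), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_components(adjacency: Dict[str, Set[str]]) -> List[Set[str]]:
    """Find connected components with an iterative depth-first search."""
    seen: Set[str] = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def is_clique(adjacency: Dict[str, Set[str]], component: Set[str]) -> bool:
    """A component is a clique if every node is adjacent to all other nodes."""
    return all(adjacency[node] >= component - {node} for node in component)


# Hypothetical genomes and the (labelled) 16S sequences each contains.
genome_seqs = {
    "G1": {"x"}, "G2": {"x", "y"}, "G3": {"y"},   # chain: G1-G2-G3
    "G4": {"z"}, "G5": {"z"},                     # pair sharing one sequence
    "G6": {"w"},                                  # singleton
}
adjacency = build_graph(genome_seqs)
degrees = {genome: len(neighbours) for genome, neighbours in adjacency.items()}
for component in connected_components(adjacency):
    print(sorted(component), "clique" if is_clique(adjacency, component) else "non-clique")
```

In this toy input, G1 and G3 each share a sequence with G2 but not with each other, so their component is connected without being a clique.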

Supplementary file 23: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S sequences showing number of unique genera within each connected component. Each candidate genus is represented as a single node colour within a connected component. (HTML 4.7MB)

Supplementary file 24: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S rRNA sequences showing number of unique species within each connected component. Each candidate species is represented as a single node colour within a connected component. (HTML 4.7MB)

Supplementary file 25: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S rRNA sequences showing number of unique NCBI names within each connected component. Each NCBI assigned name is represented as a single node colour within a connected component. Gray nodes represent genomes currently lacking assigned species names. (HTML 4.7MB)

Supplementary file 26: Intragenomic 16S rRNA heterogeneity within 1,369 Streptomyces genomes that contain only full-length, ambiguity symbol-free 16S rRNA sequences. A total of 811 genomes containing single 16S rRNA sequences are not shown. (PDF 8KB)

Supplementary file 27: Distribution of 16S copies per genome, distinguishing unique from total copies, for genomes at the "complete" and "chromosome" assembly levels. (PDF 7KB)
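The distinction between total and unique 16S copies per genome can be made concrete with a short sketch. The input structure, genome names and sequences are hypothetical, and this is an illustration rather than the code used to produce the figures.

```python
from collections import Counter
from typing import Dict, List, Tuple


def copy_counts(genome_copies: Dict[str, List[str]]) -> Dict[str, Tuple[int, int]]:
    """Map each genome to (total 16S copies, number of distinct 16S sequences)."""
    return {genome: (len(copies), len(set(copies)))
            for genome, copies in genome_copies.items()}


# Hypothetical genomes with toy 16S sequences (one list entry per gene copy).
genome_copies = {
    "G1": ["ACGT", "ACGT", "ACGT"],   # three identical copies
    "G2": ["ACGT", "ACGA", "ACGT"],   # three copies, two distinct sequences
    "G3": ["TTGA"],                   # single copy
}

counts = copy_counts(genome_copies)
# Genomes whose copies are not all identical show intragenomic heterogeneity.
heterogeneous = sorted(g for g, (total, unique) in counts.items() if unique > 1)
# Distribution of (total, unique) pairs across genomes.
distribution = Counter(counts.values())
print(counts, heterogeneous, distribution)
```

A genome with a single 16S copy cannot show intragenomic heterogeneity, so G3 is never reported as heterogeneous.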

Supplementary file 28: Schematic workflow for construction of the full-length 16S rRNA Streptomyces phylogeny. Each arrow represents a process and is annotated with script used and corresponding supplementary file. Output/data files, and the number of remaining sequences after each step, are indicated by rectangles. The green shading represents a single processing step of collecting and collating 16S database sequences. (PDF 91KB)

Supplementary file 29: Schematic representation of the pipeline used to filter publicly available Streptomyces genomes. (PDF 59KB)

Supplementary file 30: Sankey plot showing counts of taxonomic names in source databases, assigned at ranks from phylum to genus, to sequences identified with the keyword ‘Streptomyces’ in the taxonomy field. Note that Actinobacteria and Actinobacteriota are synonyms in LPSN for the correct Phylum name Actinomycetota, but that Actinomycetales and Streptomycetales are not taxonomic synonyms for each other. Streptomycetales is synonymous in LPSN with the correct name Kitasatosporales; Actinomycetales is a distinct taxonomic Order. The parent order of the Family Streptomycetaceae in LPSN is Kitasatosporales. (PDF 64KB)

Supplementary file 31: Rectangular phylogram of the comprehensive maximum-likelihood tree of the genus Streptomyces based on the 16S sequence diversity of all 5,064 full-length 16S rRNA sequences with 100 TBE values. (PDF 194KB)

Supplementary file 32: Genomes sharing identical 16S rRNA sequences are assigned different names in NCBI. A total of 1,030 singleton clusters are not shown. (PDF 8KB)

Supplementary file 33: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Actinacidiphila genus. (PDF 228KB)

Supplementary file 34: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Phaeacidiphilus genus. (PDF 228KB)

Supplementary file 35: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Mangrovactinospora genus. (PDF 228KB)

Supplementary file 36: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Wenjunlia genus. (PDF 228KB)

Supplementary file 37: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Streptantibioticus genus. (PDF 228KB)

Reproducing analyses (quick guide)

Analysis of 16S sequences from SILVA, Greengenes, RDP and NCBI

To reproduce the analyses and the phylogenetic tree using 16S sequences downloaded from SILVA, Greengenes, RDP and NCBI, please run the following scripts in this order:

Analysis of 16S sequences from Streptomyces genomes

To reproduce the analyses using 16S sequences extracted from Streptomyces genomes, please run the following scripts in this order: