Title: | Single-Cell Analysis of Host-Microbiome Interactions |
---|---|
Description: | A computational resource designed to accurately detect microbial nucleic acids while filtering out contaminants and false-positive taxonomic assignments from standard transcriptomic sequencing of mammalian tissues. For more details, see Ghaddar (2023) <doi:10.1038/s43588-023-00507-1>. This implementation leverages the 'polars' package for fast and systematic microbial signal recovery and denoising from host tissue genomic sequencing. |
Authors: | Yun Peng [aut, cre] |
Maintainer: | Yun Peng <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.2.9000 |
Built: | 2025-03-27 03:36:45 UTC |
Source: | https://github.com/Yunuuuu/rsahmi |
True taxa are detected on multiple barcodes and with a proprotional number of
total and unique k-mer sequences across barcodes, measured as a significant
Spearman correlation between the number of total and unique k-mers across
barcodes. (padj < 0.05
)
blsd( kmer, method = "spearman", ..., p.adjust = "BH", min_kmer_len = 3L, min_number = 3L )
blsd( kmer, method = "spearman", ..., p.adjust = "BH", min_kmer_len = 3L, min_number = 3L )
kmer |
kmer data returned by |
method |
A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated. |
... |
Other arguments passed to cor.test. |
p.adjust |
Pvalue correction method, a character string. Can be abbreviated. Details see p.adjust. |
min_kmer_len |
An integer, the minimal number of kmer to filter taxa.
SAHMI use |
min_number |
An integer, the minimal number of cell per taxid. SAHMI use
|
A polars DataFrame
https://github.com/sjdlabgroup/SAHMI
## Not run: # 1. `sahmi_datasets` should be the output of all samples from `prep_dataset()` # 2. `real_taxids_slsd` should be the output of `slsd()` umi_list <- lapply(sahmi_datasets, function(dataset) { # Barcode level signal denoising (barcode k-mer correlation test) blsd <- blsd(dataset$kmer) real_taxids <- blsd$filter(pl$col("padj")$lt(0.05))$get_column("taxid") # only keep taxids pass Sample level signal denoising real_taxids <- real_taxids$filter(real_taxids$is_in(real_taxids_slsd)) # remove contaminants real_taxids <- real_taxids$filter( real_taxids$is_in(attr(truly_microbe, "truly")) ) # filter UMI data dataset$umi$filter(pl$col("taxid")$is_in(real_taxids)) }) ## End(Not run)
## Not run: # 1. `sahmi_datasets` should be the output of all samples from `prep_dataset()` # 2. `real_taxids_slsd` should be the output of `slsd()` umi_list <- lapply(sahmi_datasets, function(dataset) { # Barcode level signal denoising (barcode k-mer correlation test) blsd <- blsd(dataset$kmer) real_taxids <- blsd$filter(pl$col("padj")$lt(0.05))$get_column("taxid") # only keep taxids pass Sample level signal denoising real_taxids <- real_taxids$filter(real_taxids$is_in(real_taxids_slsd)) # remove contaminants real_taxids <- real_taxids$filter( real_taxids$is_in(attr(truly_microbe, "truly")) ) # filter UMI data dataset$umi$filter(pl$col("taxid")$is_in(real_taxids)) }) ## End(Not run)
Extract reads and output from Kraken
extract_taxids( kraken_report, taxon = c("d__Bacteria", "d__Fungi", "d__Viruses") ) extract_kraken_output( kraken_out, taxids, odir, ofile = "kraken_microbiome_output.txt", ... ) extract_kraken_reads( kraken_out, reads, ofile = NULL, odir = getwd(), threads = NULL, ..., envpath = NULL, seqkit = NULL )
extract_taxids( kraken_report, taxon = c("d__Bacteria", "d__Fungi", "d__Viruses") ) extract_kraken_output( kraken_out, taxids, odir, ofile = "kraken_microbiome_output.txt", ... ) extract_kraken_reads( kraken_out, reads, ofile = NULL, odir = getwd(), threads = NULL, ..., envpath = NULL, seqkit = NULL )
kraken_report |
The path to kraken report file. |
taxon |
An atomic character specify the taxa name wanted. Should follow the kraken style, connected by rank codes, two underscores, and the scientific name of the taxon (e.g., "d__Viruses") |
kraken_out |
The path to kraken output file. |
taxids |
A character specify NCBI taxonony identifier to extract. |
odir |
A string of directory to save the |
ofile |
A string of file save the kraken output of specified |
... |
|
reads |
The original fastq files (input in |
threads |
Number of threads to use, see
|
envpath |
A string of path to be added to the environment variable
|
seqkit |
A string of path to |
extract_taxids
: An atomic character vector of taxon identifiers.
extract_kraken_output
: A polars DataFrame.
extract_kraken_reads
: Exit status invisiblely.
https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
## Not run: # For 10x Genomic data, `fq1` only contain barcode and umi, but the official # didn't give any information for this. In this way, I prefer using # `umi-tools` to transform the `umi` into fq2 and then run `rsahmi` with # only fq2. blit::kraken2( fq1 = fq1, fq2 = fq2, classified_out = "classified.fq", # Number of threads to use blit::arg("--threads", 10L, format = "%d"), # the kraken database blit::arg("--db", kraken_db), "--use-names", "--report-minimizer-data", ) |> blit::cmd_run() # `kraken_report` should be the output of `blit::kraken2()` taxids <- extract_taxids(kraken_report = "kraken_report.txt") # 1. `kraken_out` should be the output of `blit::kraken2()` # 2. `taxids` should be the output of `extract_taxids()` # 3. `odir`: the output directory extract_kraken_output( kraken_out = "kraken_output.txt", taxids = taxids, odir = # specify the output directory ) # 1. `kraken_out` should be the output of `extract_kraken_output()` # 2. `fq1` and `fq2` should be the same with `blit::kraken2()` extract_kraken_reads( kraken_out = "kraken_microbiome_output.txt", reads = c(fq1, fq2), threads = 10L, # Number of threads to use # try to change `seqkit` argument into your seqkit path. If `NULL`, the # internal will detect it in your `PATH` environment variable seqkit = NULL ) ## End(Not run)
## Not run: # For 10x Genomic data, `fq1` only contain barcode and umi, but the official # didn't give any information for this. In this way, I prefer using # `umi-tools` to transform the `umi` into fq2 and then run `rsahmi` with # only fq2. blit::kraken2( fq1 = fq1, fq2 = fq2, classified_out = "classified.fq", # Number of threads to use blit::arg("--threads", 10L, format = "%d"), # the kraken database blit::arg("--db", kraken_db), "--use-names", "--report-minimizer-data", ) |> blit::cmd_run() # `kraken_report` should be the output of `blit::kraken2()` taxids <- extract_taxids(kraken_report = "kraken_report.txt") # 1. `kraken_out` should be the output of `blit::kraken2()` # 2. `taxids` should be the output of `extract_taxids()` # 3. `odir`: the output directory extract_kraken_output( kraken_out = "kraken_output.txt", taxids = taxids, odir = # specify the output directory ) # 1. `kraken_out` should be the output of `extract_kraken_output()` # 2. `fq1` and `fq2` should be the same with `blit::kraken2()` extract_kraken_reads( kraken_out = "kraken_microbiome_output.txt", reads = c(fq1, fq2), threads = 10L, # Number of threads to use # try to change `seqkit` argument into your seqkit path. If `NULL`, the # internal will detect it in your `PATH` environment variable seqkit = NULL ) ## End(Not run)
Parse kraken report file
parse_kraken_report(kraken_report, intermediate_ranks = TRUE, mpa = FALSE)
parse_kraken_report(kraken_report, intermediate_ranks = TRUE, mpa = FALSE)
kraken_report |
The path to kraken report file. |
intermediate_ranks |
A bool indicates whether to include non-traditional taxonomic ranks in output. |
mpa |
A bool indicates whether to use mpa-style. |
A polars DataFrame.
https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
Three elements returned by this function:
kreport
: Used by slsd()
.
kmer
: Used by blsd()
. The function count the number of k-mers and
unique k-mers assigned to a taxon across barcodes. The cell barcode
and unique molecular identifier (UMI) are used to identify unique
barcodes and reads. Data is reported for taxa of pre-specified ranks
(default genus + species) taking into account all subsequently higher
resolution ranks. The output is a table of barcodes, taxonomic IDs,
number of k-mers, and number of unique k-mers.
umi
: Used by taxa_counts()
.
prep_dataset( fa1, kraken_report, kraken_out, fa2 = NULL, cb_and_umi = function(sequence_id, read1, read2) { list(substring(read1, 1L, 16L), substring(read1, 17L, 28L)) }, ranks = c("G", "S"), kmer_len = 35L, min_frac = 0.5, exclude = "9606", threads = 10L, overwrite = TRUE, odir = NULL ) read_dataset(dir)
prep_dataset( fa1, kraken_report, kraken_out, fa2 = NULL, cb_and_umi = function(sequence_id, read1, read2) { list(substring(read1, 1L, 16L), substring(read1, 17L, 28L)) }, ranks = c("G", "S"), kmer_len = 35L, min_frac = 0.5, exclude = "9606", threads = 10L, overwrite = TRUE, odir = NULL ) read_dataset(dir)
fa1 , fa2
|
The path to microbiome fasta 1 and 2 file (returned by
|
kraken_report |
The path to kraken report file. |
kraken_out |
The path of microbiome output file. Usually should be
filtered with |
cb_and_umi |
A function takes sequence id, read1, read2 and return a list of 2 corresponding to cell barcode and UMI respectively., each should have the same length of the input. |
ranks |
Taxa ranks to analyze. |
kmer_len |
Kraken kmer length. Default: |
min_frac |
Minimum fraction of kmers directly assigned to taxid to use
read. Reads with |
exclude |
A character of taxid to exclude, for |
threads |
Number of threads to use. |
overwrite |
A bool indicates whether to overwrite the files in |
odir |
A string of directory to save the results. |
dir |
A string of directory containing the files returned by
|
A list of three polars DataFrame:
kreport: Used by slsd()
.
kmer: Used by blsd()
.
umi: Used by taxa_counts()
.
https://github.com/sjdlabgroup/SAHMI
# for sequence from `umi-tools`, we can use following function cb_and_umi <- function(sequence_id, read1, read2) { out <- lapply( strsplit(sequence_id, "_", fixed = TRUE), `[`, 2:3 ) lapply(1:2, function(i) { vapply(out, function(o) as.character(.subset2(o, i)), character(1L)) }) } ## Not run: # 1. `fa1` and `fa2` should be the output of `extract_kraken_reads()` # 2. `kraken_report` should be the output of `blit::kraken2()` # 3. `kraken_out` should be the output of `extract_kraken_output()` # 4. `dir`: you may want to specify the output directory since this process # is time-consuming sahmi_dataset <- prep_dataset( fa1 = "kraken_microbiome_reads.fa", # if you have paired sequence, please also specify `fa2`, # !!! Also pay attention to the file name of `fa1` (add suffix `_1`) # if you use paired reads. # fa2 = "kraken_microbiome_reads_2.fa", kraken_report = "kraken_report.txt", kraken_out = "kraken_microbiome_output.txt", odir = NULL ) # you may want to prepare all datasets for subsequent workflows. # `paths` should be the output directory for each sample from # `blit::kraken2()`, `extract_kraken_output()` and `extract_kraken_reads()`. sahmi_datasets <- lapply(paths, function(dir) { prep_dataset( fa1 = file.path(dir, "kraken_microbiome_reads.fa"), # fa2 = file.path(dir, "kraken_microbiome_reads_2.fa"), kraken_report = file.path(dir, "kraken_report.txt"), kraken_out = file.path(dir, "kraken_microbiome_output.txt"), odir = dir ) }) ## End(Not run)
# for sequence from `umi-tools`, we can use following function cb_and_umi <- function(sequence_id, read1, read2) { out <- lapply( strsplit(sequence_id, "_", fixed = TRUE), `[`, 2:3 ) lapply(1:2, function(i) { vapply(out, function(o) as.character(.subset2(o, i)), character(1L)) }) } ## Not run: # 1. `fa1` and `fa2` should be the output of `extract_kraken_reads()` # 2. `kraken_report` should be the output of `blit::kraken2()` # 3. `kraken_out` should be the output of `extract_kraken_output()` # 4. `dir`: you may want to specify the output directory since this process # is time-consuming sahmi_dataset <- prep_dataset( fa1 = "kraken_microbiome_reads.fa", # if you have paired sequence, please also specify `fa2`, # !!! Also pay attention to the file name of `fa1` (add suffix `_1`) # if you use paired reads. # fa2 = "kraken_microbiome_reads_2.fa", kraken_report = "kraken_report.txt", kraken_out = "kraken_microbiome_output.txt", odir = NULL ) # you may want to prepare all datasets for subsequent workflows. # `paths` should be the output directory for each sample from # `blit::kraken2()`, `extract_kraken_output()` and `extract_kraken_reads()`. sahmi_datasets <- lapply(paths, function(dir) { prep_dataset( fa1 = file.path(dir, "kraken_microbiome_reads.fa"), # fa2 = file.path(dir, "kraken_microbiome_reads_2.fa"), kraken_report = file.path(dir, "kraken_report.txt"), kraken_out = file.path(dir, "kraken_microbiome_output.txt"), odir = dir ) }) ## End(Not run)
Identifying contaminants and false positives taxa (cell line quantile test)
remove_contaminants( kraken_reports, study = "current study", taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"), quantile = 0.95, alpha = 0.05, alternative = "greater", exclusive = FALSE )
remove_contaminants( kraken_reports, study = "current study", taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"), quantile = 0.95, alpha = 0.05, alternative = "greater", exclusive = FALSE )
kraken_reports |
A character of path to all kraken report files. |
study |
A string of the study name, used to differentiate with cell line data. |
taxon |
An atomic character specify the taxa name wanted. Should follow the kraken style, connected by rank codes, two underscores, and the scientific name of the taxon (e.g., "d__Viruses") |
quantile |
Probabilities with values in |
alpha |
Level of significance. |
alternative |
A string specifying the alternative hypothesis, must be one of "two.sided", "greater" (default) or "less". You can specify just the initial letter. |
exclusive |
A boolean value, indicates whether taxa not found in
celllines data should be regarded as truly. Default: |
A polars DataFrame with following attributes:
pvalues
: Quantile test pvalue.
exclusive
: taxids in current study but not found in cellline data.
significant
: significant taxids with pvalues < alpha
.
truly
: truly taxids based on alpha
and exclusive
. If exclusive
is
TRUE
, this should be the union of exclusive
and significant
,
otherwise, this should be the same with significant
.
## Not run: # `paths` should be the output directory for each sample from # `blit::kraken2()` truly_microbe <- remove_contaminants( kraken_reports = file.path(paths, "kraken_report.txt"), quantile = 0.99, exclusive = FALSE ) microbe_for_plot <- attr(truly_microbe, "truly")[ order(attr(truly_microbe, "pvalue")[attr(truly_microbe, "truly")]) ] microbe_for_plot <- microbe_for_plot[ !microbe_for_plot %in% attr(truly_microbe, "exclusive") ] ggplot( truly_microbe$filter(pl$col("taxid")$is_in(microbe_for_plot))$ to_data_frame(), aes(rpmm), ) + geom_density(aes(fill = study), alpha = 0.5) + scale_x_log10() + facet_wrap(facets = vars(taxa), scales = "free") + theme( strip.clip = "off", axis.text = element_blank(), axis.ticks = element_blank(), legend.position = "inside", legend.position.inside = c(1, 0), legend.justification.inside = c(1, 0) ) ## End(Not run)
## Not run: # `paths` should be the output directory for each sample from # `blit::kraken2()` truly_microbe <- remove_contaminants( kraken_reports = file.path(paths, "kraken_report.txt"), quantile = 0.99, exclusive = FALSE ) microbe_for_plot <- attr(truly_microbe, "truly")[ order(attr(truly_microbe, "pvalue")[attr(truly_microbe, "truly")]) ] microbe_for_plot <- microbe_for_plot[ !microbe_for_plot %in% attr(truly_microbe, "exclusive") ] ggplot( truly_microbe$filter(pl$col("taxid")$is_in(microbe_for_plot))$ to_data_frame(), aes(rpmm), ) + geom_density(aes(fill = study), alpha = 0.5) + scale_x_log10() + facet_wrap(facets = vars(taxa), scales = "free") + theme( strip.clip = "off", axis.text = element_blank(), axis.ticks = element_blank(), legend.position = "inside", legend.position.inside = c(1, 0), legend.justification.inside = c(1, 0) ) ## End(Not run)
In the low-microbiome biomass setting, real microbes also exhibit a
proportional number of total k-mers, number of unique k-mers, as well as
number of total assigned sequencing reads across samples; i.e. the following
three Spearman correlations are significant when tested using sample-level
data provided in Kraken reports: cor(minimizer_len, minimizer_n_unique)
,
cor(minimizer_len, total_reads)
and cor(total_reads, minimizer_n_unique)
.
(r1>0 & r2>0 & r3>0 & p1<0.05 & p2<0.05 & p3<0.05
).
slsd( kreports, method = "spearman", ..., min_reads = 3L, min_minimizer_n_unique = 3L, min_number = 3L )
slsd( kreports, method = "spearman", ..., min_reads = 3L, min_minimizer_n_unique = 3L, min_number = 3L )
kreports |
kreports data returned by |
method |
A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated. |
... |
Other arguments passed to cor.test. |
min_reads |
An integer, the minimal number of the total reads to filter
taxa. SAHMI use |
min_minimizer_n_unique |
An integer, the minimal number of the unique
number of minimizer to filter taxa. SAHMI use |
min_number |
An integer, the minimal number of samples per taxid. SAHMI
use |
A polars DataFrame of correlation
coefficient and pvalue for cor(minimizer_len, minimizer_n_unique)
(r1 and
p1), cor(minimizer_len, total_reads)
(r2 and p2) and cor(total_reads, minimizer_n_unique)
(r3 and p3).
## Not run: # `sahmi_datasets` should be the output of all samples from `prep_dataset()` slsd <- slsd(lapply(sahmi_datasets, `[[`, "kreport")) real_taxids_slsd <- slsd$filter( pl$col("r1")$gt(0), pl$col("r2")$gt(0), pl$col("r3")$gt(0), pl$col("p1")$lt(0.05), pl$col("p2")$lt(0.05), pl$col("p3")$lt(0.05) )$get_column("taxid") ## End(Not run)
## Not run: # `sahmi_datasets` should be the output of all samples from `prep_dataset()` slsd <- slsd(lapply(sahmi_datasets, `[[`, "kreport")) real_taxids_slsd <- slsd$filter( pl$col("r1")$gt(0), pl$col("r2")$gt(0), pl$col("r3")$gt(0), pl$col("p1")$lt(0.05), pl$col("p2")$lt(0.05), pl$col("p3")$lt(0.05) )$get_column("taxid") ## End(Not run)
After identifying true taxa, reads assigned to those taxa are extracted and then undergo a series of filters. The cell barcode and UMI are used to demultiplex the reads and create a barcode x taxa counts matrix. The full taxonomic classification of all resulting barcodes and the number of counts assigned to each clade are tabulated.
taxa_counts(umi_list, samples = NULL)
taxa_counts(umi_list, samples = NULL)
umi_list |
A list of polars DataFrame for UMI data returned by prep_dataset. |
samples |
A character of sample identifier for each element in
|
A polars DataFrame.
https://github.com/sjdlabgroup/SAHMI
## Not run: # `umi_list` should be the output of all samples from `prep_dataset()`, and # filtered by `slsd()` and `blsd()` taxa_counts(umi_list, basename(names(umi_list))) ## End(Not run)
## Not run: # `umi_list` should be the output of all samples from `prep_dataset()`, and # filtered by `slsd()` and `blsd()` taxa_counts(umi_list, basename(names(umi_list))) ## End(Not run)