Package 'rsahmi' reference manual

Title:	Single-Cell Analysis of Host-Microbiome Interactions
Description:	A computational resource designed to accurately detect microbial nucleic acids while filtering out contaminants and false-positive taxonomic assignments from standard transcriptomic sequencing of mammalian tissues. For more details, see Ghaddar (2023) <doi:10.1038/s43588-023-00507-1>. This implementation leverages the 'polars' package for fast and systematic microbial signal recovery and denoising from host tissue genomic sequencing.
Authors:	Yun Peng [aut, cre]
Maintainer:	Yun Peng <[email protected]>
License:	MIT + file LICENSE
Version:	0.0.2.9000
Built:	2025-03-27 03:36:45 UTC
Source:	https://github.com/Yunuuuu/rsahmi

Barcode level signal denoising

Description

True taxa are detected on multiple barcodes and with a proprotional number of total and unique k-mer sequences across barcodes, measured as a significant Spearman correlation between the number of total and unique k-mers across barcodes. (padj < 0.05)

Usage

blsd(
  kmer,
  method = "spearman",
  ...,
  p.adjust = "BH",
  min_kmer_len = 3L,
  min_number = 3L
)
blsd(
  kmer,
  method = "spearman",
  ...,
  p.adjust = "BH",
  min_kmer_len = 3L,
  min_number = 3L
)

Arguments

`kmer`	kmer data returned by `prep_dataset()`.
`method`	A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated.
`...`	Other arguments passed to cor.test.
`p.adjust`	Pvalue correction method, a character string. Can be abbreviated. Details see p.adjust.
`min_kmer_len`	An integer, the minimal number of kmer to filter taxa. SAHMI use `2`.
`min_number`	An integer, the minimal number of cell per taxid. SAHMI use `4`.

Value

A polars DataFrame

Examples

## Not run: 
# 1. `sahmi_datasets` should be the output of all samples from 
      `prep_dataset()` 
# 2. `real_taxids_slsd` should be the output of `slsd()`
umi_list <- lapply(sahmi_datasets, function(dataset) {
    # Barcode level signal denoising (barcode k-mer correlation test)
    blsd <- blsd(dataset$kmer)
    real_taxids <- blsd$filter(pl$col("padj")$lt(0.05))$get_column("taxid")
    # only keep taxids pass Sample level signal denoising
    real_taxids <- real_taxids$filter(real_taxids$is_in(real_taxids_slsd))
    # remove contaminants
    real_taxids <- real_taxids$filter(
        real_taxids$is_in(attr(truly_microbe, "truly"))
    )
    # filter UMI data
    dataset$umi$filter(pl$col("taxid")$is_in(real_taxids))
})

## End(Not run)
## Not run: 
# 1. `sahmi_datasets` should be the output of all samples from 
      `prep_dataset()` 
# 2. `real_taxids_slsd` should be the output of `slsd()`
umi_list <- lapply(sahmi_datasets, function(dataset) {
    # Barcode level signal denoising (barcode k-mer correlation test)
    blsd <- blsd(dataset$kmer)
    real_taxids <- blsd$filter(pl$col("padj")$lt(0.05))$get_column("taxid")
    # only keep taxids pass Sample level signal denoising
    real_taxids <- real_taxids$filter(real_taxids$is_in(real_taxids_slsd))
    # remove contaminants
    real_taxids <- real_taxids$filter(
        real_taxids$is_in(attr(truly_microbe, "truly"))
    )
    # filter UMI data
    dataset$umi$filter(pl$col("taxid")$is_in(real_taxids))
})

## End(Not run)

Extract reads and output from Kraken

Description

Extract reads and output from Kraken

Usage

extract_taxids(
  kraken_report,
  taxon = c("d__Bacteria", "d__Fungi", "d__Viruses")
)

extract_kraken_output(
  kraken_out,
  taxids,
  odir,
  ofile = "kraken_microbiome_output.txt",
  ...
)

extract_kraken_reads(
  kraken_out,
  reads,
  ofile = NULL,
  odir = getwd(),
  threads = NULL,
  ...,
  envpath = NULL,
  seqkit = NULL
)
extract_taxids(
  kraken_report,
  taxon = c("d__Bacteria", "d__Fungi", "d__Viruses")
)

extract_kraken_output(
  kraken_out,
  taxids,
  odir,
  ofile = "kraken_microbiome_output.txt",
  ...
)

extract_kraken_reads(
  kraken_out,
  reads,
  ofile = NULL,
  odir = getwd(),
  threads = NULL,
  ...,
  envpath = NULL,
  seqkit = NULL
)

Arguments

`kraken_report`	The path to kraken report file.
`taxon`	An atomic character specify the taxa name wanted. Should follow the kraken style, connected by rank codes, two underscores, and the scientific name of the taxon (e.g., "d__Viruses")
`kraken_out`	The path to kraken output file.
`taxids`	A character specify NCBI taxonony identifier to extract.
`odir`	A string of directory to save the `ofile`.
`ofile`	A string of file save the kraken output of specified `taxids`.
`...`	`extract_kraken_output`: Additional arguments passed to `sink_csv()`. `extract_kraken_reads`: Additional arguments passed to `cmd_run()` method.
`reads`	The original fastq files (input in `kraken2`). You can pass two paired-end files directly.
`threads`	Number of threads to use, see `blit::cmd_help(blit::seqkit("grep"))`.
`envpath`	A string of path to be added to the environment variable `PATH`.
`seqkit`	A string of path to `seqkit` command.

Value

extract_taxids: An atomic character vector of taxon identifiers.

extract_kraken_output: A polars DataFrame.

extract_kraken_reads: Exit status invisiblely.

Examples

## Not run: 
# For 10x Genomic data, `fq1` only contain barcode and umi, but the official
# didn't give any information for this. In this way, I prefer using
# `umi-tools` to transform the `umi` into fq2 and then run `rsahmi` with
# only fq2.
blit::kraken2(
    fq1 = fq1,
    fq2 = fq2,
    classified_out = "classified.fq",
    # Number of threads to use
    blit::arg("--threads", 10L, format = "%d"),
    # the kraken database
    blit::arg("--db", kraken_db),
    "--use-names", "--report-minimizer-data",
) |> blit::cmd_run()

# `kraken_report` should be the output of `blit::kraken2()`
taxids <- extract_taxids(kraken_report = "kraken_report.txt")

# 1. `kraken_out` should be the output of `blit::kraken2()`
# 2. `taxids` should be the output of `extract_taxids()`
# 3. `odir`: the output directory
extract_kraken_output(
    kraken_out = "kraken_output.txt",
    taxids = taxids,
    odir = # specify the output directory
)

# 1. `kraken_out` should be the output of `extract_kraken_output()`
# 2. `fq1` and `fq2` should be the same with `blit::kraken2()`
extract_kraken_reads(
    kraken_out = "kraken_microbiome_output.txt",
    reads = c(fq1, fq2),
    threads = 10L, # Number of threads to use
    # try to change `seqkit` argument into your seqkit path. If `NULL`, the
    # internal will detect it in your `PATH` environment variable
    seqkit = NULL
)

## End(Not run)
## Not run: 
# For 10x Genomic data, `fq1` only contain barcode and umi, but the official
# didn't give any information for this. In this way, I prefer using
# `umi-tools` to transform the `umi` into fq2 and then run `rsahmi` with
# only fq2.
blit::kraken2(
    fq1 = fq1,
    fq2 = fq2,
    classified_out = "classified.fq",
    # Number of threads to use
    blit::arg("--threads", 10L, format = "%d"),
    # the kraken database
    blit::arg("--db", kraken_db),
    "--use-names", "--report-minimizer-data",
) |> blit::cmd_run()

# `kraken_report` should be the output of `blit::kraken2()`
taxids <- extract_taxids(kraken_report = "kraken_report.txt")

# 1. `kraken_out` should be the output of `blit::kraken2()`
# 2. `taxids` should be the output of `extract_taxids()`
# 3. `odir`: the output directory
extract_kraken_output(
    kraken_out = "kraken_output.txt",
    taxids = taxids,
    odir = # specify the output directory
)

# 1. `kraken_out` should be the output of `extract_kraken_output()`
# 2. `fq1` and `fq2` should be the same with `blit::kraken2()`
extract_kraken_reads(
    kraken_out = "kraken_microbiome_output.txt",
    reads = c(fq1, fq2),
    threads = 10L, # Number of threads to use
    # try to change `seqkit` argument into your seqkit path. If `NULL`, the
    # internal will detect it in your `PATH` environment variable
    seqkit = NULL
)

## End(Not run)

Parse kraken report file

Description

Parse kraken report file

Usage

parse_kraken_report(kraken_report, intermediate_ranks = TRUE, mpa = FALSE)
parse_kraken_report(kraken_report, intermediate_ranks = TRUE, mpa = FALSE)

Arguments

`kraken_report`	The path to kraken report file.
`intermediate_ranks`	A bool indicates whether to include non-traditional taxonomic ranks in output.
`mpa`	A bool indicates whether to use mpa-style.

Value

A polars DataFrame.

Prepare kraken report, k-mer statistics, UMI data

Description

Three elements returned by this function:

kreport: Used by slsd().
kmer: Used by blsd(). The function count the number of k-mers and unique k-mers assigned to a taxon across barcodes. The cell barcode and unique molecular identifier (UMI) are used to identify unique barcodes and reads. Data is reported for taxa of pre-specified ranks (default genus + species) taking into account all subsequently higher resolution ranks. The output is a table of barcodes, taxonomic IDs, number of k-mers, and number of unique k-mers.
umi: Used by taxa_counts().

Usage

prep_dataset(
  fa1,
  kraken_report,
  kraken_out,
  fa2 = NULL,
  cb_and_umi = function(sequence_id, read1, read2) {
     list(substring(read1, 1L, 16L),
    substring(read1, 17L, 28L))
 },
  ranks = c("G", "S"),
  kmer_len = 35L,
  min_frac = 0.5,
  exclude = "9606",
  threads = 10L,
  overwrite = TRUE,
  odir = NULL
)

read_dataset(dir)
prep_dataset(
  fa1,
  kraken_report,
  kraken_out,
  fa2 = NULL,
  cb_and_umi = function(sequence_id, read1, read2) {
     list(substring(read1, 1L, 16L),
    substring(read1, 17L, 28L))
 },
  ranks = c("G", "S"),
  kmer_len = 35L,
  min_frac = 0.5,
  exclude = "9606",
  threads = 10L,
  overwrite = TRUE,
  odir = NULL
)

read_dataset(dir)

Arguments

`fa1`, `fa2`	The path to microbiome fasta 1 and 2 file (returned by `extract_kraken_reads()`).
`kraken_report`	The path to kraken report file.
`kraken_out`	The path of microbiome output file. Usually should be filtered with `extract_kraken_output()`.
`cb_and_umi`	A function takes sequence id, read1, read2 and return a list of 2 corresponding to cell barcode and UMI respectively., each should have the same length of the input.
`ranks`	Taxa ranks to analyze.
`kmer_len`	Kraken kmer length. Default: `35L`, which is the default kmer size of kraken2.
`min_frac`	Minimum fraction of kmers directly assigned to taxid to use read. Reads with `⁠<=min_frac⁠` of the k-mers map inside the taxon's lineage are also discarded.
`exclude`	A character of taxid to exclude, for `SAHMI`, the host taxid. Reads with any k-mers mapped to the `exclude` are discarded.
`threads`	Number of threads to use.
`overwrite`	A bool indicates whether to overwrite the files in `odir`.
`odir`	A string of directory to save the results.
`dir`	A string of directory containing the files returned by `prep_dataset`.

Value

A list of three polars DataFrame:

kreport: Used by slsd().
kmer: Used by blsd().
umi: Used by taxa_counts().

Examples

# for sequence from `umi-tools`, we can use following function
cb_and_umi <- function(sequence_id, read1, read2) {
    out <- lapply(
        strsplit(sequence_id, "_", fixed = TRUE),
        `[`, 2:3
    )
    lapply(1:2, function(i) {
        vapply(out, function(o) as.character(.subset2(o, i)), character(1L))
    })
}

## Not run: 
# 1. `fa1` and `fa2` should be the output of `extract_kraken_reads()`
# 2. `kraken_report` should be the output of `blit::kraken2()`
# 3. `kraken_out` should be the output of `extract_kraken_output()`
# 4. `dir`: you may want to specify the output directory since this process 
#           is time-consuming
sahmi_dataset <- prep_dataset(
    fa1 = "kraken_microbiome_reads.fa",
    # if you have paired sequence, please also specify `fa2`,
    # !!! Also pay attention to the file name of `fa1` (add suffix `_1`)
    # if you use paired reads.
    # fa2 = "kraken_microbiome_reads_2.fa",
    kraken_report = "kraken_report.txt",
    kraken_out = "kraken_microbiome_output.txt",
    odir = NULL
)
# you may want to prepare all datasets for subsequent workflows.
# `paths` should be the output directory for each sample from
# `blit::kraken2()`, `extract_kraken_output()` and `extract_kraken_reads()`.
sahmi_datasets <- lapply(paths, function(dir) {
    prep_dataset(
        fa1 = file.path(dir, "kraken_microbiome_reads.fa"),
        # fa2 = file.path(dir, "kraken_microbiome_reads_2.fa"),
        kraken_report = file.path(dir, "kraken_report.txt"),
        kraken_out = file.path(dir, "kraken_microbiome_output.txt"),
        odir = dir
    )
})

## End(Not run)
# for sequence from `umi-tools`, we can use following function
cb_and_umi <- function(sequence_id, read1, read2) {
    out <- lapply(
        strsplit(sequence_id, "_", fixed = TRUE),
        `[`, 2:3
    )
    lapply(1:2, function(i) {
        vapply(out, function(o) as.character(.subset2(o, i)), character(1L))
    })
}

## Not run: 
# 1. `fa1` and `fa2` should be the output of `extract_kraken_reads()`
# 2. `kraken_report` should be the output of `blit::kraken2()`
# 3. `kraken_out` should be the output of `extract_kraken_output()`
# 4. `dir`: you may want to specify the output directory since this process 
#           is time-consuming
sahmi_dataset <- prep_dataset(
    fa1 = "kraken_microbiome_reads.fa",
    # if you have paired sequence, please also specify `fa2`,
    # !!! Also pay attention to the file name of `fa1` (add suffix `_1`)
    # if you use paired reads.
    # fa2 = "kraken_microbiome_reads_2.fa",
    kraken_report = "kraken_report.txt",
    kraken_out = "kraken_microbiome_output.txt",
    odir = NULL
)
# you may want to prepare all datasets for subsequent workflows.
# `paths` should be the output directory for each sample from
# `blit::kraken2()`, `extract_kraken_output()` and `extract_kraken_reads()`.
sahmi_datasets <- lapply(paths, function(dir) {
    prep_dataset(
        fa1 = file.path(dir, "kraken_microbiome_reads.fa"),
        # fa2 = file.path(dir, "kraken_microbiome_reads_2.fa"),
        kraken_report = file.path(dir, "kraken_report.txt"),
        kraken_out = file.path(dir, "kraken_microbiome_output.txt"),
        odir = dir
    )
})

## End(Not run)

Identifying contaminants and false positives taxa (cell line quantile test)

Description

Identifying contaminants and false positives taxa (cell line quantile test)

Usage

remove_contaminants(
  kraken_reports,
  study = "current study",
  taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"),
  quantile = 0.95,
  alpha = 0.05,
  alternative = "greater",
  exclusive = FALSE
)
remove_contaminants(
  kraken_reports,
  study = "current study",
  taxon = c("d__Bacteria", "d__Fungi", "d__Viruses"),
  quantile = 0.95,
  alpha = 0.05,
  alternative = "greater",
  exclusive = FALSE
)

Arguments

`kraken_reports`	A character of path to all kraken report files.
`study`	A string of the study name, used to differentiate with cell line data.
`taxon`	An atomic character specify the taxa name wanted. Should follow the kraken style, connected by rank codes, two underscores, and the scientific name of the taxon (e.g., "d__Viruses")
`quantile`	Probabilities with values in `⁠[0, 1]⁠` specifying the quantile to calculate.
`alpha`	Level of significance.
`alternative`	A string specifying the alternative hypothesis, must be one of "two.sided", "greater" (default) or "less". You can specify just the initial letter.
`exclusive`	A boolean value, indicates whether taxa not found in celllines data should be regarded as truly. Default: `FALSE`.

Value

A polars DataFrame with following attributes:

pvalues: Quantile test pvalue.
exclusive: taxids in current study but not found in cellline data.
significant: significant taxids with pvalues < alpha.
truly: truly taxids based on alpha and exclusive. If exclusive is TRUE, this should be the union of exclusive and significant, otherwise, this should be the same with significant.

Examples

## Not run: 
# `paths` should be the output directory for each sample from
# `blit::kraken2()`
truly_microbe <- remove_contaminants(
    kraken_reports = file.path(paths, "kraken_report.txt"),
    quantile = 0.99, exclusive = FALSE
)
microbe_for_plot <- attr(truly_microbe, "truly")[
    order(attr(truly_microbe, "pvalue")[attr(truly_microbe, "truly")])
]
microbe_for_plot <- microbe_for_plot[
    !microbe_for_plot %in% attr(truly_microbe, "exclusive")
]
ggplot(
    truly_microbe$filter(pl$col("taxid")$is_in(microbe_for_plot))$
        to_data_frame(),
    aes(rpmm),
) +
    geom_density(aes(fill = study), alpha = 0.5) +
    scale_x_log10() +
    facet_wrap(facets = vars(taxa), scales = "free") +
    theme(
        strip.clip = "off",
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "inside",
        legend.position.inside = c(1, 0),
        legend.justification.inside = c(1, 0)
    )

## End(Not run)
## Not run: 
# `paths` should be the output directory for each sample from
# `blit::kraken2()`
truly_microbe <- remove_contaminants(
    kraken_reports = file.path(paths, "kraken_report.txt"),
    quantile = 0.99, exclusive = FALSE
)
microbe_for_plot <- attr(truly_microbe, "truly")[
    order(attr(truly_microbe, "pvalue")[attr(truly_microbe, "truly")])
]
microbe_for_plot <- microbe_for_plot[
    !microbe_for_plot %in% attr(truly_microbe, "exclusive")
]
ggplot(
    truly_microbe$filter(pl$col("taxid")$is_in(microbe_for_plot))$
        to_data_frame(),
    aes(rpmm),
) +
    geom_density(aes(fill = study), alpha = 0.5) +
    scale_x_log10() +
    facet_wrap(facets = vars(taxa), scales = "free") +
    theme(
        strip.clip = "off",
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "inside",
        legend.position.inside = c(1, 0),
        legend.justification.inside = c(1, 0)
    )

## End(Not run)

Sample level signal denoising

Description

In the low-microbiome biomass setting, real microbes also exhibit a proportional number of total k-mers, number of unique k-mers, as well as number of total assigned sequencing reads across samples; i.e. the following three Spearman correlations are significant when tested using sample-level data provided in Kraken reports: cor(minimizer_len, minimizer_n_unique), cor(minimizer_len, total_reads) and cor(total_reads, minimizer_n_unique). (r1>0 & r2>0 & r3>0 & p1<0.05 & p2<0.05 & p3<0.05).

Usage

slsd(
  kreports,
  method = "spearman",
  ...,
  min_reads = 3L,
  min_minimizer_n_unique = 3L,
  min_number = 3L
)
slsd(
  kreports,
  method = "spearman",
  ...,
  min_reads = 3L,
  min_minimizer_n_unique = 3L,
  min_number = 3L
)

Arguments

`kreports`	kreports data returned by `prep_dataset()` for all samples.
`method`	A character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated.
`...`	Other arguments passed to cor.test.
`min_reads`	An integer, the minimal number of the total reads to filter taxa. SAHMI use `2`.
`min_minimizer_n_unique`	An integer, the minimal number of the unique number of minimizer to filter taxa. SAHMI use `2`.
`min_number`	An integer, the minimal number of samples per taxid. SAHMI use `4`.

Value

A polars DataFrame of correlation coefficient and pvalue for cor(minimizer_len, minimizer_n_unique) (r1 and p1), cor(minimizer_len, total_reads) (r2 and p2) and cor(total_reads, minimizer_n_unique) (r3 and p3).

Examples

## Not run: 
# `sahmi_datasets` should be the output of all samples from `prep_dataset()`
slsd <- slsd(lapply(sahmi_datasets, `[[`, "kreport"))
real_taxids_slsd <- slsd$filter(
    pl$col("r1")$gt(0),
    pl$col("r2")$gt(0),
    pl$col("r3")$gt(0),
    pl$col("p1")$lt(0.05),
    pl$col("p2")$lt(0.05),
    pl$col("p3")$lt(0.05)
)$get_column("taxid")

## End(Not run)
## Not run: 
# `sahmi_datasets` should be the output of all samples from `prep_dataset()`
slsd <- slsd(lapply(sahmi_datasets, `[[`, "kreport"))
real_taxids_slsd <- slsd$filter(
    pl$col("r1")$gt(0),
    pl$col("r2")$gt(0),
    pl$col("r3")$gt(0),
    pl$col("p1")$lt(0.05),
    pl$col("p2")$lt(0.05),
    pl$col("p3")$lt(0.05)
)$get_column("taxid")

## End(Not run)

Quantitation of microbes

Description

After identifying true taxa, reads assigned to those taxa are extracted and then undergo a series of filters. The cell barcode and UMI are used to demultiplex the reads and create a barcode x taxa counts matrix. The full taxonomic classification of all resulting barcodes and the number of counts assigned to each clade are tabulated.

Usage

taxa_counts(umi_list, samples = NULL)
taxa_counts(umi_list, samples = NULL)

Arguments

`umi_list`	A list of polars DataFrame for UMI data returned by prep_dataset.
`samples`	A character of sample identifier for each element in `umi_list`.

Value

A polars DataFrame.

Examples

## Not run: 
# `umi_list` should be the output of all samples from `prep_dataset()`, and 
# filtered by `slsd()` and `blsd()`
taxa_counts(umi_list, basename(names(umi_list)))

## End(Not run)
## Not run: 
# `umi_list` should be the output of all samples from `prep_dataset()`, and 
# filtered by `slsd()` and `blsd()`
taxa_counts(umi_list, basename(names(umi_list)))

## End(Not run)

Package 'rsahmi'

Help Index

Barcode level signal denoising

Description

Usage

Arguments

Value

See Also

Examples

Extract reads and output from Kraken

Description

Usage

Arguments

Value

See Also

Examples

Parse kraken report file

Description

Usage

Arguments

Value

See Also

Prepare kraken report, k-mer statistics, UMI data

Description

Usage

Arguments

Value

See Also

Examples

Identifying contaminants and false positives taxa (cell line quantile test)

Description

Usage

Arguments

Value

Examples

Sample level signal denoising

Description

Usage

Arguments

Value

Examples

Quantitation of microbes

Description

Usage

Arguments

Value

See Also

Examples