Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Introduction

Nanalogue = Nucleic Acid Analogue

A common pain point in the genomics community is that BAM files are information-dense which makes it difficult to gain insight from them. Nanalogue hopes to make it easy to extract and process this information, and forms a companion to other tools such as samtools and modkit. Although nanalogue’s primary focus is on DNA/RNA modifications on a single-molecule level, some of its functions are quite general and can be applied to almost any BAM file. Nanalogue is open-source and its code can be found on Github. The code of a companion package pynanalogue can be found here.

If you are a developer who needs BAM files with defined single-molecule modification patterns to help develop/test your tool, nanalogue can also help you create BAM files from scratch using artificial data created using parameters defined by you.

This documentation site is under active development.

Usage

This book is divided into two parts, based on two out of the following three ways to use nanalogue:

  • as a command line interface i.e. a tool that can be run from the terminal. See here.
  • as a python library i.e. if you write python code, you can use pynanalogue, a wrapper around a subset of nanalogue’s functions. See here.
  • as a rust library i.e. if you write rust code, you can benefit from nanalogue’s functions. If you are a rust developer looking to use nanalogue as a rust library, please head over to docs.rs.

Installation

General installation methods such as cargo install and pip install should work. For the most up to date information, please consult the README files from the Github links above.

Command line usage of nanalogue

For detailed documentation of all CLI commands and options, see the CLI Commands Reference.

The CLI Commands Reference is automatically generated from the latest version of the nanalogue package and provides comprehensive help text for all commands and subcommands.

Common options

Most if not all CLI commands have options to filter by:

  • minimum sequence or alignment length
  • alignment quality
  • alignment type (primary/secondary/supplementary/unmapped)
  • read ids
  • mod quality
  • base quality

You can also perform subsampling, trim by read ends etc. Please see the CLI commands reference shown above.

Illustration of commands

nanalogue CLI Commands Reference

Note: This file is auto-generated.

Main Command

BAM/Mod BAM parsing and analysis tool with a single-molecule focus

Usage: nanalogue <COMMAND>

Commands:
  read-table-show-mods  Prints basecalled len, align len, mod count per molecule
  read-table-hide-mods  Prints basecalled len, align len per molecule
  read-stats            Calculates various summary statistics on all reads
  read-info             Prints information about reads
  find-modified-reads   Find names of modified reads through criteria specified by sub commands
  window-dens           Output windowed densities of all reads
  window-grad           Output windowed gradients of all reads
  peek                  Display BAM file contigs, contig lengths, and mod types from a "peek" at the
                        header and first 100 records
  help                  Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Subcommands

read-table-show-mods

Prints basecalled len, align len, mod count per molecule

Usage: nanalogue read-table-show-mods [OPTIONS] <BAM_PATH> [SEQ_SUMM_FILE]

Arguments:
  <BAM_PATH>       Input BAM file. Set to a local file path, or set to - to read from stdin, or set
                   to a URL to read from a remote file. If using stdin and piping in from `samtools
                   view`, always include the header with the `-h` option
  [SEQ_SUMM_FILE]  Input sequence summary file from Guppy/Dorado (optional) [default: ]

Options:
      --min-seq-len <MIN_SEQ_LEN>
          Exclude reads whose sequence length in the BAM file is below this value. Defaults to 0
          [default: 0]
      --min-align-len <MIN_ALIGN_LEN>
          Exclude reads whose alignment length in the BAM file is below this value. Defaults to
          unused
      --read-id <READ_ID>
          Only include this read id, defaults to unused i.e. all reads are used. NOTE: if there are
          multiple alignments corresponding to this read id, all of them are used
      --read-id-list <READ_ID_LIST>
          Path to file containing list of read IDs (one per line). Lines starting with '#' are
          treated as comments and ignored. Cannot be used together with --read-id
      --threads <THREADS>
          Number of threads used during some aspects of program execution [default: 2]
      --include-zero-len
          Include "zero-length" sequences e.g. sequences with "*" in the sequence field. By default,
          these sequences are excluded to avoid processing errors. If this flag is set, these reads
          are included irrespective of any minimum sequence or align length criteria the user may
          have set. WARNINGS: (1) Some functions of the codebase may break or produce incorrect
          results if you use this flag. (2) due to a technical reason, we need a DNA sequence in the
          sequence field and cannot infer sequence length from other sources e.g. CIGAR strings
      --read-filter <READ_FILTER>
          Only retain reads of this type. Allowed types are `primary_forward`, `primary_reverse`,
          `secondary_forward`, `secondary_reverse`, `supplementary_forward`, `supplementary_reverse`
          and unmapped. Specify more than one type if needed separated by commas, in which case
          reads of any type in list are retained. Defaults to retain reads of all types
  -s, --sample-fraction <SAMPLE_FRACTION>
          Subsample BAM to retain only this fraction of total number of reads, defaults to 1.0. The
          sampling algorithm considers every read according to the specified probability, so due to
          this, you may not always get the same number of reads e.g. if you set `-s 0.05` in a file
          with 1000 reads, you will get 50 +- sqrt(50) reads. NOTE: a new subsample is drawn every
          time as the seed is not fixed. If you want reproducibility, consider piping the output of
          `samtools view -s` to our program [default: 1]
      --mapq-filter <MAPQ_FILTER>
          Exclude reads whose MAPQ (Mapping quality of position) is below this value. Defaults to
          zero i.e. do not exclude any read [default: 0]
      --exclude-mapq-unavail
          Exclude sequences with MAPQ unavailable. In the BAM format, a value of 255 in this column
          means MAPQ is unavailable. These reads are allowed by default, set this flag to exclude
      --region <REGION>
          Only keep reads passing through this region. If a BAM index is available with a name same
          as the BAM file but with the .bai suffix, the operation of selecting such reads will be
          faster. If you are using standard input as your input e.g. you are piping in the output
          from samtools, then you cannot use an index as a BAM filename is not available
      --full-region
          Only keep reads if they pass through the specified region in full. Related to the input
          `--region`; has no effect if that is not set
      --tag <TAG>
          modified tag
      --mod-strand <MOD_STRAND>
          modified strand, set this to `bc` or `bc_comp`, meaning on basecalled strand or its
          complement. Some technologies like `PacBio` or `ONT` duplex can call mod data on both a
          strand and its complementary DNA and store it in the record corresponding to the strand,
          so you can use this filter to select only for mod data on a strand or its complement.
          Please note that this filter is different from selecting for forward or reverse aligned
          reads using the BAM flags
      --mod-prob-filter <MOD_PROB_FILTER>
          Filter to reject mods before analysis. Specify as low,high where both are fractions to
          reject modifications where the probabilities (p) are in this range e.g. "0.4,0.6" rejects
          0.4 <= p <= 0.6. You can use this to reject 'weak' modification calls before analysis i.e.
          those with probabilities close to 0.5. NOTE: (1) Whether this filtration is applied or
          not, mods < 0.5 are considered unmodified and >= 0.5 are considered modified by our
          program. (2) mod probabilities are stored as a number from 0-255 in the modBAM format, so
          we internally convert 0.0-1.0 to 0-255. Default: reject nothing [default: ]
      --trim-read-ends-mod <TRIM_READ_ENDS_MOD>
          Filter this many bp at the start and end of a read before any mod operations. Please note
          that the units here are bp and not units of base being queried [default: 0]
      --base-qual-filter-mod <BASE_QUAL_FILTER_MOD>
          Exclude bases whose base quality is below this threshold before any mod operation,
          defaults to 0 i.e. unused. NOTE: (1) This step is only applied before modification
          operations, and not before any other operations. (2) No offsets such as +33 are needed
          here. (3) Modifications on reads where base quality information is not available are all
          rejected if this is non-zero [default: 0]
      --mod-region <MOD_REGION>
          Only keep modification data from this region
      --seq-region <SEQ_REGION>
          Genomic region from which basecalled sequences are displayed (optional)
      --seq-full
          Displays entire basecalled sequence (optional)
      --show-base-qual
          Displays basecalling qualities (optional)
      --show-ins-lowercase
          Show insertions in lower case
      --show-mod-z
          Shows modified bases as Z (or z depending on other options)
  -h, --help
          Print help

read-table-hide-mods

Prints basecalled len, align len per molecule

Usage: nanalogue read-table-hide-mods [OPTIONS] <BAM_PATH> [SEQ_SUMM_FILE]

Arguments:
  <BAM_PATH>       Input BAM file. Set to a local file path, or set to - to read from stdin, or set
                   to a URL to read from a remote file. If using stdin and piping in from `samtools
                   view`, always include the header with the `-h` option
  [SEQ_SUMM_FILE]  Input sequence summary file from Guppy/Dorado (optional) [default: ]

Options:
      --min-seq-len <MIN_SEQ_LEN>
          Exclude reads whose sequence length in the BAM file is below this value. Defaults to 0
          [default: 0]
      --min-align-len <MIN_ALIGN_LEN>
          Exclude reads whose alignment length in the BAM file is below this value. Defaults to
          unused
      --read-id <READ_ID>
          Only include this read id, defaults to unused i.e. all reads are used. NOTE: if there are
          multiple alignments corresponding to this read id, all of them are used
      --read-id-list <READ_ID_LIST>
          Path to file containing list of read IDs (one per line). Lines starting with '#' are
          treated as comments and ignored. Cannot be used together with --read-id
      --threads <THREADS>
          Number of threads used during some aspects of program execution [default: 2]
      --include-zero-len
          Include "zero-length" sequences e.g. sequences with "*" in the sequence field. By default,
          these sequences are excluded to avoid processing errors. If this flag is set, these reads
          are included irrespective of any minimum sequence or align length criteria the user may
          have set. WARNINGS: (1) Some functions of the codebase may break or produce incorrect
          results if you use this flag. (2) due to a technical reason, we need a DNA sequence in the
          sequence field and cannot infer sequence length from other sources e.g. CIGAR strings
      --read-filter <READ_FILTER>
          Only retain reads of this type. Allowed types are `primary_forward`, `primary_reverse`,
          `secondary_forward`, `secondary_reverse`, `supplementary_forward`, `supplementary_reverse`
          and unmapped. Specify more than one type if needed separated by commas, in which case
          reads of any type in list are retained. Defaults to retain reads of all types
  -s, --sample-fraction <SAMPLE_FRACTION>
          Subsample BAM to retain only this fraction of total number of reads, defaults to 1.0. The
          sampling algorithm considers every read according to the specified probability, so due to
          this, you may not always get the same number of reads e.g. if you set `-s 0.05` in a file
          with 1000 reads, you will get 50 +- sqrt(50) reads. NOTE: a new subsample is drawn every
          time as the seed is not fixed. If you want reproducibility, consider piping the output of
          `samtools view -s` to our program [default: 1]
      --mapq-filter <MAPQ_FILTER>
          Exclude reads whose MAPQ (Mapping quality of position) is below this value. Defaults to
          zero i.e. do not exclude any read [default: 0]
      --exclude-mapq-unavail
          Exclude sequences with MAPQ unavailable. In the BAM format, a value of 255 in this column
          means MAPQ is unavailable. These reads are allowed by default, set this flag to exclude
      --region <REGION>
          Only keep reads passing through this region. If a BAM index is available with a name same
          as the BAM file but with the .bai suffix, the operation of selecting such reads will be
          faster. If you are using standard input as your input e.g. you are piping in the output
          from samtools, then you cannot use an index as a BAM filename is not available
      --full-region
          Only keep reads if they pass through the specified region in full. Related to the input
          `--region`; has no effect if that is not set
      --seq-region <SEQ_REGION>
          Genomic region from which basecalled sequences are displayed (optional)
      --seq-full
          Displays entire basecalled sequence (optional)
      --show-base-qual
          Displays basecalling qualities (optional)
      --show-ins-lowercase
          Show insertions in lower case
  -h, --help
          Print help

read-stats

Calculates various summary statistics on all reads

Usage: nanalogue read-stats [OPTIONS] <BAM_PATH>

Arguments:
  <BAM_PATH>  Input BAM file. Set to a local file path, or set to - to read from stdin, or set to a
              URL to read from a remote file. If using stdin and piping in from `samtools view`,
              always include the header with the `-h` option

Options:
      --min-seq-len <MIN_SEQ_LEN>
          Exclude reads whose sequence length in the BAM file is below this value. Defaults to 0
          [default: 0]
      --min-align-len <MIN_ALIGN_LEN>
          Exclude reads whose alignment length in the BAM file is below this value. Defaults to
          unused
      --read-id <READ_ID>
          Only include this read id, defaults to unused i.e. all reads are used. NOTE: if there are
          multiple alignments corresponding to this read id, all of them are used
      --read-id-list <READ_ID_LIST>
          Path to file containing list of read IDs (one per line). Lines starting with '#' are
          treated as comments and ignored. Cannot be used together with --read-id
      --threads <THREADS>
          Number of threads used during some aspects of program execution [default: 2]
      --include-zero-len
          Include "zero-length" sequences e.g. sequences with "*" in the sequence field. By default,
          these sequences are excluded to avoid processing errors. If this flag is set, these reads
          are included irrespective of any minimum sequence or align length criteria the user may
          have set. WARNINGS: (1) Some functions of the codebase may break or produce incorrect
          results if you use this flag. (2) due to a technical reason, we need a DNA sequence in the
          sequence field and cannot infer sequence length from other sources e.g. CIGAR strings
      --read-filter <READ_FILTER>
          Only retain reads of this type. Allowed types are `primary_forward`, `primary_reverse`,
          `secondary_forward`, `secondary_reverse`, `supplementary_forward`, `supplementary_reverse`
          and unmapped. Specify more than one type if needed separated by commas, in which case
          reads of any type in list are retained. Defaults to retain reads of all types
  -s, --sample-fraction <SAMPLE_FRACTION>
          Subsample BAM to retain only this fraction of total number of reads, defaults to 1.0. The
          sampling algorithm considers every read according to the specified probability, so due to
          this, you may not always get the same number of reads e.g. if you set `-s 0.05` in a file
          with 1000 reads, you will get 50 +- sqrt(50) reads. NOTE: a new subsample is drawn every
          time as the seed is not fixed. If you want reproducibility, consider piping the output of
          `samtools view -s` to our program [default: 1]
      --mapq-filter <MAPQ_FILTER>
          Exclude reads whose MAPQ (Mapping quality of position) is below this value. Defaults to
          zero i.e. do not exclude any read [default: 0]
      --exclude-mapq-unavail
          Exclude sequences with MAPQ unavailable. In the BAM format, a value of 255 in this column
          means MAPQ is unavailable. These reads are allowed by default, set this flag to exclude
      --region <REGION>
          Only keep reads passing through this region. If a BAM index is available with a name same
          as the BAM file but with the .bai suffix, the operation of selecting such reads will be
          faster. If you are using standard input as your input e.g. you are piping in the output
          from samtools, then you cannot use an index as a BAM filename is not available
      --full-region
          Only keep reads if they pass through the specified region in full. Related to the input
          `--region`; has no effect if that is not set
  -h, --help
          Print help

read-info

Prints information about reads

Usage: nanalogue read-info [OPTIONS] <BAM_PATH>

Arguments:
  <BAM_PATH>  Input BAM file. Set to a local file path, or set to - to read from stdin, or set to a
              URL to read from a remote file. If using stdin and piping in from `samtools view`,
              always include the header with the `-h` option

Options:
      --min-seq-len <MIN_SEQ_LEN>
          Exclude reads whose sequence length in the BAM file is below this value. Defaults to 0
          [default: 0]
      --min-align-len <MIN_ALIGN_LEN>
          Exclude reads whose alignment length in the BAM file is below this value. Defaults to
          unused
      --read-id <READ_ID>
          Only include this read id, defaults to unused i.e. all reads are used. NOTE: if there are
          multiple alignments corresponding to this read id, all of them are used
      --read-id-list <READ_ID_LIST>
          Path to file containing list of read IDs (one per line). Lines starting with '#' are
          treated as comments and ignored. Cannot be used together with --read-id
      --threads <THREADS>
          Number of threads used during some aspects of program execution [default: 2]
      --include-zero-len
          Include "zero-length" sequences e.g. sequences with "*" in the sequence field. By default,
          these sequences are excluded to avoid processing errors. If this flag is set, these reads
          are included irrespective of any minimum sequence or align length criteria the user may
          have set. WARNINGS: (1) Some functions of the codebase may break or produce incorrect
          results if you use this flag. (2) due to a technical reason, we need a DNA sequence in the
          sequence field and cannot infer sequence length from other sources e.g. CIGAR strings
      --read-filter <READ_FILTER>
          Only retain reads of this type. Allowed types are `primary_forward`, `primary_reverse`,
          `secondary_forward`, `secondary_reverse`, `supplementary_forward`, `supplementary_reverse`
          and unmapped. Specify more than one type if needed separated by commas, in which case
          reads of any type in list are retained. Defaults to retain reads of all types
  -s, --sample-fraction <SAMPLE_FRACTION>
          Subsample BAM to retain only this fraction of total number of reads, defaults to 1.0. The
          sampling algorithm considers every read according to the specified probability, so due to
          this, you may not always get the same number of reads e.g. if you set `-s 0.05` in a file
          with 1000 reads, you will get 50 +- sqrt(50) reads. NOTE: a new subsample is drawn every
          time as the seed is not fixed. If you want reproducibility, consider piping the output of
          `samtools view -s` to our program [default: 1]
      --mapq-filter <MAPQ_FILTER>
          Exclude reads whose MAPQ (Mapping quality of position) is below this value. Defaults to
          zero i.e. do not exclude any read [default: 0]
      --exclude-mapq-unavail
          Exclude sequences with MAPQ unavailable. In the BAM format, a value of 255 in this column
          means MAPQ is unavailable. These reads are allowed by default, set this flag to exclude
      --region <REGION>
          Only keep reads passing through this region. If a BAM index is available with a name same
          as the BAM file but with the .bai suffix, the operation of selecting such reads will be
          faster. If you are using standard input as your input e.g. you are piping in the output
          from samtools, then you cannot use an index as a BAM filename is not available
      --full-region
          Only keep reads if they pass through the specified region in full. Related to the input
          `--region`; has no effect if that is not set
      --tag <TAG>
          modified tag
      --mod-strand <MOD_STRAND>
          modified strand, set this to `bc` or `bc_comp`, meaning on basecalled strand or its
          complement. Some technologies like `PacBio` or `ONT` duplex can call mod data on both a
          strand and its complementary DNA and store it in the record corresponding to the strand,
          so you can use this filter to select only for mod data on a strand or its complement.
          Please note that this filter is different from selecting for forward or reverse aligned
          reads using the BAM flags
      --mod-prob-filter <MOD_PROB_FILTER>
          Filter to reject mods before analysis. Specify as low,high where both are fractions to
          reject modifications where the probabilities (p) are in this range e.g. "0.4,0.6" rejects
          0.4 <= p <= 0.6. You can use this to reject 'weak' modification calls before analysis i.e.
          those with probabilities close to 0.5. NOTE: (1) Whether this filtration is applied or
          not, mods < 0.5 are considered unmodified and >= 0.5 are considered modified by our
          program. (2) mod probabilities are stored as a number from 0-255 in the modBAM format, so
          we internally convert 0.0-1.0 to 0-255. Default: reject nothing [default: ]
      --trim-read-ends-mod <TRIM_READ_ENDS_MOD>
          Filter this many bp at the start and end of a read before any mod operations. Please note
          that the units here are bp and not units of base being queried [default: 0]
      --base-qual-filter-mod <BASE_QUAL_FILTER_MOD>
          Exclude bases whose base quality is below this threshold before any mod operation,
          defaults to 0 i.e. unused. NOTE: (1) This step is only applied before modification
          operations, and not before any other operations. (2) No offsets such as +33 are needed
          here. (3) Modifications on reads where base quality information is not available are all
          rejected if this is non-zero [default: 0]
      --mod-region <MOD_REGION>
          Only keep modification data from this region
      --detailed
          Print detailed modification data (JSON)
      --detailed-pretty
          Pretty-print detailed modification data (JSON)
  -h, --help
          Print help

find-modified-reads

Find names of modified reads through criteria specified by sub commands

Usage: nanalogue find-modified-reads <COMMAND>

Commands:
  all-dens-between                   Find reads with all windowed modification densities within
                                     specified limits
  any-dens-above                     Find reads with windowed modification density such that at
                                     least one window is at or above the high value
  any-dens-below                     Find reads with windowed modification density such that at
                                     least one window is at or below the low value
  any-dens-below-and-any-dens-above  Find reads with windowed modification density such that at
                                     least one window is at or below the low value and at least one
                                     window is at or above the high value. This operation may enrich
                                     for reads with spatial gradients in modification density
  dens-range-above                   Find reads with windowed modification density such that max of
                                     all densities minus min of all densities is at least the value
                                     specified. This operation may enrich for reads with spatial
                                     gradients in modification density
  any-abs-grad-above                 Find reads such that absolute value of gradient in modification
                                     density measured in windows is at least the value specified.
                                     This operation enriches for reads with spatial gradients in
                                     modification density
  help                               Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

window-dens

Output windowed densities of all reads

Usage: nanalogue window-dens [OPTIONS] --win <WIN> --step <STEP> <BAM_PATH>

Arguments:
  <BAM_PATH>  Input BAM file. Set to a local file path, or set to - to read from stdin, or set to a
              URL to read from a remote file. If using stdin and piping in from `samtools view`,
              always include the header with the `-h` option

Options:
      --min-seq-len <MIN_SEQ_LEN>
          Exclude reads whose sequence length in the BAM file is below this value. Defaults to 0
          [default: 0]
      --min-align-len <MIN_ALIGN_LEN>
          Exclude reads whose alignment length in the BAM file is below this value. Defaults to
          unused
      --read-id <READ_ID>
          Only include this read id, defaults to unused i.e. all reads are used. NOTE: if there are
          multiple alignments corresponding to this read id, all of them are used
      --read-id-list <READ_ID_LIST>
          Path to file containing list of read IDs (one per line). Lines starting with '#' are
          treated as comments and ignored. Cannot be used together with --read-id
      --threads <THREADS>
          Number of threads used during some aspects of program execution [default: 2]
      --include-zero-len
          Include "zero-length" sequences e.g. sequences with "*" in the sequence field. By default,
          these sequences are excluded to avoid processing errors. If this flag is set, these reads
          are included irrespective of any minimum sequence or align length criteria the user may
          have set. WARNINGS: (1) Some functions of the codebase may break or produce incorrect
          results if you use this flag. (2) due to a technical reason, we need a DNA sequence in the
          sequence field and cannot infer sequence length from other sources e.g. CIGAR strings
      --read-filter <READ_FILTER>
          Only retain reads of this type. Allowed types are `primary_forward`, `primary_reverse`,
          `secondary_forward`, `secondary_reverse`, `supplementary_forward`, `supplementary_reverse`
          and unmapped. Specify more than one type if needed separated by commas, in which case
          reads of any type in list are retained. Defaults to retain reads of all types
  -s, --sample-fraction <SAMPLE_FRACTION>
          Subsample BAM to retain only this fraction of total number of reads, defaults to 1.0. The
          sampling algorithm considers every read according to the specified probability, so due to
          this, you may not always get the same number of reads e.g. if you set `-s 0.05` in a file
          with 1000 reads, you will get 50 +- sqrt(50) reads. NOTE: a new subsample is drawn every
          time as the seed is not fixed. If you want reproducibility, consider piping the output of
          `samtools view -s` to our program [default: 1]
      --mapq-filter <MAPQ_FILTER>
          Exclude reads whose MAPQ (Mapping quality of position) is below this value. Defaults to
          zero i.e. do not exclude any read [default: 0]
      --exclude-mapq-unavail
          Exclude sequences with MAPQ unavailable. In the BAM format, a value of 255 in this column
          means MAPQ is unavailable. These reads are allowed by default, set this flag to exclude
      --region <REGION>
          Only keep reads passing through this region. If a BAM index is available with a name same
          as the BAM file but with the .bai suffix, the operation of selecting such reads will be
          faster. If you are using standard input as your input e.g. you are piping in the output
          from samtools, then you cannot use an index as a BAM filename is not available
      --full-region
          Only keep reads if they pass through the specified region in full. Related to the input
          `--region`; has no effect if that is not set
      --win <WIN>
          size of window in units of base being queried i.e. if you are looking for cytosine
          modifications, then a window of a value 300 means create windows each with 300 cytosines
          irrespective of their modification status
      --step <STEP>
          step window by this size in units of base being queried
      --tag <TAG>
          modified tag
      --mod-strand <MOD_STRAND>
          modified strand, set this to `bc` or `bc_comp`, meaning on basecalled strand or its
          complement. Some technologies like `PacBio` or `ONT` duplex can call mod data on both a
          strand and its complementary DNA and store it in the record corresponding to the strand,
          so you can use this filter to select only for mod data on a strand or its complement.
          Please note that this filter is different from selecting for forward or reverse aligned
          reads using the BAM flags
      --mod-prob-filter <MOD_PROB_FILTER>
          Filter to reject mods before analysis. Specify as low,high where both are fractions to
          reject modifications where the probabilities (p) are in this range e.g. "0.4,0.6" rejects
          0.4 <= p <= 0.6. You can use this to reject 'weak' modification calls before analysis i.e.
          those with probabilities close to 0.5. NOTE: (1) Whether this filtration is applied or
          not, mods < 0.5 are considered unmodified and >= 0.5 are considered modified by our
          program. (2) mod probabilities are stored as a number from 0-255 in the modBAM format, so
          we internally convert 0.0-1.0 to 0-255. Default: reject nothing [default: ]
      --trim-read-ends-mod <TRIM_READ_ENDS_MOD>
          Filter this many bp at the start and end of a read before any mod operations. Please note
          that the units here are bp and not units of base being queried [default: 0]
      --base-qual-filter-mod <BASE_QUAL_FILTER_MOD>
          Exclude bases whose base quality is below this threshold before any mod operation,
          defaults to 0 i.e. unused. NOTE: (1) This step is only applied before modification
          operations, and not before any other operations. (2) No offsets such as +33 are needed
          here. (3) Modifications on reads where base quality information is not available are all
          rejected if this is non-zero [default: 0]
      --mod-region <MOD_REGION>
          Only keep modification data from this region
  -h, --help
          Print help

window-grad

Output windowed gradients of all reads

Usage: nanalogue window-grad [OPTIONS] --win <WIN> --step <STEP> <BAM_PATH>

Arguments:
  <BAM_PATH>  Input BAM file. Set to a local file path, or set to - to read from stdin, or set to a
              URL to read from a remote file. If using stdin and piping in from `samtools view`,
              always include the header with the `-h` option

Options:
      --min-seq-len <MIN_SEQ_LEN>
          Exclude reads whose sequence length in the BAM file is below this value. Defaults to 0
          [default: 0]
      --min-align-len <MIN_ALIGN_LEN>
          Exclude reads whose alignment length in the BAM file is below this value. Defaults to
          unused
      --read-id <READ_ID>
          Only include this read id, defaults to unused i.e. all reads are used. NOTE: if there are
          multiple alignments corresponding to this read id, all of them are used
      --read-id-list <READ_ID_LIST>
          Path to file containing list of read IDs (one per line). Lines starting with '#' are
          treated as comments and ignored. Cannot be used together with --read-id
      --threads <THREADS>
          Number of threads used during some aspects of program execution [default: 2]
      --include-zero-len
          Include "zero-length" sequences e.g. sequences with "*" in the sequence field. By default,
          these sequences are excluded to avoid processing errors. If this flag is set, these reads
          are included irrespective of any minimum sequence or align length criteria the user may
          have set. WARNINGS: (1) Some functions of the codebase may break or produce incorrect
          results if you use this flag. (2) due to a technical reason, we need a DNA sequence in the
          sequence field and cannot infer sequence length from other sources e.g. CIGAR strings
      --read-filter <READ_FILTER>
          Only retain reads of this type. Allowed types are `primary_forward`, `primary_reverse`,
          `secondary_forward`, `secondary_reverse`, `supplementary_forward`, `supplementary_reverse`
          and unmapped. Specify more than one type if needed separated by commas, in which case
          reads of any type in list are retained. Defaults to retain reads of all types
  -s, --sample-fraction <SAMPLE_FRACTION>
          Subsample BAM to retain only this fraction of total number of reads, defaults to 1.0. The
          sampling algorithm considers every read according to the specified probability, so due to
          this, you may not always get the same number of reads e.g. if you set `-s 0.05` in a file
          with 1000 reads, you will get 50 +- sqrt(50) reads. NOTE: a new subsample is drawn every
          time as the seed is not fixed. If you want reproducibility, consider piping the output of
          `samtools view -s` to our program [default: 1]
      --mapq-filter <MAPQ_FILTER>
          Exclude reads whose MAPQ (Mapping quality of position) is below this value. Defaults to
          zero i.e. do not exclude any read [default: 0]
      --exclude-mapq-unavail
          Exclude sequences with MAPQ unavailable. In the BAM format, a value of 255 in this column
          means MAPQ is unavailable. These reads are allowed by default, set this flag to exclude
      --region <REGION>
          Only keep reads passing through this region. If a BAM index is available with a name same
          as the BAM file but with the .bai suffix, the operation of selecting such reads will be
          faster. If you are using standard input as your input e.g. you are piping in the output
          from samtools, then you cannot use an index as a BAM filename is not available
      --full-region
          Only keep reads if they pass through the specified region in full. Related to the input
          `--region`; has no effect if that is not set
      --win <WIN>
          size of window in units of base being queried i.e. if you are looking for cytosine
          modifications, then a window of a value 300 means create windows each with 300 cytosines
          irrespective of their modification status
      --step <STEP>
          step window by this size in units of base being queried
      --tag <TAG>
          modified tag
      --mod-strand <MOD_STRAND>
          modified strand, set this to `bc` or `bc_comp`, meaning on basecalled strand or its
          complement. Some technologies like `PacBio` or `ONT` duplex can call mod data on both a
          strand and its complementary DNA and store it in the record corresponding to the strand,
          so you can use this filter to select only for mod data on a strand or its complement.
          Please note that this filter is different from selecting for forward or reverse aligned
          reads using the BAM flags
      --mod-prob-filter <MOD_PROB_FILTER>
          Filter to reject mods before analysis. Specify as low,high where both are fractions to
          reject modifications where the probabilities (p) are in this range e.g. "0.4,0.6" rejects
          0.4 <= p <= 0.6. You can use this to reject 'weak' modification calls before analysis i.e.
          those with probabilities close to 0.5. NOTE: (1) Whether this filtration is applied or
          not, mods < 0.5 are considered unmodified and >= 0.5 are considered modified by our
          program. (2) mod probabilities are stored as a number from 0-255 in the modBAM format, so
          we internally convert 0.0-1.0 to 0-255. Default: reject nothing [default: ]
      --trim-read-ends-mod <TRIM_READ_ENDS_MOD>
          Filter this many bp at the start and end of a read before any mod operations. Please note
          that the units here are bp and not units of base being queried [default: 0]
      --base-qual-filter-mod <BASE_QUAL_FILTER_MOD>
          Exclude bases whose base quality is below this threshold before any mod operation,
          defaults to 0 i.e. unused. NOTE: (1) This step is only applied before modification
          operations, and not before any other operations. (2) No offsets such as +33 are needed
          here. (3) Modifications on reads where base quality information is not available are all
          rejected if this is non-zero [default: 0]
      --mod-region <MOD_REGION>
          Only keep modification data from this region
  -h, --help
          Print help

peek

Display BAM file contigs, contig lengths, and mod types from a "peek" at the header and first 100
records

Usage: nanalogue peek <BAM>

Arguments:
  <BAM>  Input BAM file (path, URL, or '-' for stdin)

Options:
  -h, --help  Print help

help

error: unrecognized subcommand '--help'

Usage: nanalogue <COMMAND>

For more information, try '--help'.

Extracting raw mod calls

If you are interested in extracting raw mod calls, please use the following command. You need to have the jq package installed.

nanalogue read-info --detailed $bam_resource | jq '.[].mod_table[].data[][2]'

An example output from this command can look like the following. These are mod calls from all positions in all molecules in the BAM resource. The mod calls range from 0-255, which is just a rescaling of 0-100% probability of being modified.

4
7
9
6
221
242
3
47
239
3
3
4
3
182
0
0
0
0
77
0
221
242
0
47
239

This is a variant of the command above that prints read id, contig, position along reference, position along read and modification quality.

nanalogue read-info --detailed $bam_resource |\
    jq -r '.[] | .read_id as $rid | (.alignment.contig // "N/A") as $contig | .mod_table[].data[] | [$rid, $contig, .[1], .[0], .[2]] | @tsv'  
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    dummyI  9       0       4
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    dummyI  12      3       7
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    dummyI  13      4       9
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    dummyI  16      7       6
a4f36092-b4d5-47a9-813e-c22c3b477a0c    dummyIII        26      3       221
a4f36092-b4d5-47a9-813e-c22c3b477a0c    dummyIII        31      8       242
a4f36092-b4d5-47a9-813e-c22c3b477a0c    dummyIII        50      27      3
a4f36092-b4d5-47a9-813e-c22c3b477a0c    dummyIII        62      39      47
a4f36092-b4d5-47a9-813e-c22c3b477a0c    dummyIII        70      47      239
fffffff1-10d2-49cb-8ca3-e8d48979001b    dummyII 15      12      3
fffffff1-10d2-49cb-8ca3-e8d48979001b    dummyII 16      13      3
fffffff1-10d2-49cb-8ca3-e8d48979001b    dummyII 19      16      4
fffffff1-10d2-49cb-8ca3-e8d48979001b    dummyII 22      19      3
fffffff1-10d2-49cb-8ca3-e8d48979001b    dummyII 23      20      182
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      28      0
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      29      0
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      30      0
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      32      0
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      43      77
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      44      0
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      3       221
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      8       242
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      27      0
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      39      47
b1a36092-b4d5-47a9-813e-c22c3b477a0c    N/A     -1      47      239

Python usage of nanalogue

For detailed API documentation of all Python functions and classes, see the Python API Reference.

The Python API Reference is automatically generated from the latest version of the pynanalogue package and provides comprehensive documentation for all public functions, classes, and methods.

pynanalogue Python API Reference

Note: This file is auto-generated.

Functions

peek

peek(bam_path, treat_as_url=False)

Peeks at a BAM file to extract contig information and detected modifications.

This function reads the BAM header and examines up to 100 records to determine the contigs present in the file and any DNA/RNA modifications detected.

Args

  • bam_path (str): Path to the BAM file.

  • treat_as_url (optional, bool): If True, treat bam_path as a URL, default False.

Returns

A dictionary with two keys:

  • contigs: A dictionary mapping contig names to their lengths
  • modifications: A list of modifications, where each modification is a list of three strings: [base, strand, mod_code]. For example: [[“T”, “+”, “T”], [“G”, “-”, “7200”]]

Example

import pynanalogue

result = pynanalogue.peek("example.bam")
print(result["contigs"])  # {"chr1": 248956422, "chr2": 242193529, ...}
print(result["modifications"])  # [["C", "+", "m"], ["A", "+", "28871"]]

Errors

If the BAM file cannot be read or parsed

polars_bam_mods

polars_bam_mods(
    bam_path,
    treat_as_url=False,
    min_seq_len=0,
    min_align_len=0,
    read_ids=Ellipsis,
    threads=2,
    include_zero_len=False,
    read_filter='',
    sample_fraction=1.0,
    mapq_filter=0,
    exclude_mapq_unavail=False,
    region='',
    full_region=False,
    tag='',
    mod_strand='',
    min_mod_qual=0,
    reject_mod_qual_non_inclusive=Ellipsis,
    trim_read_ends_mod=0,
    base_qual_filter_mod=0,
    mod_region=''
)

Converts modification data in mod BAM files into a Polars Dataframe. Columns are read_id, seq_len, alignment_type, align_start, align_end, contig, contig_id, base, is_strand_plus, mod_code, position, ref_position, and mod_quality.

Sets various options through builder functions before running the function and capturing the output. Parses records, collects them, and runs nanalogue_core::read_utils::curr_reads_to_dataframe.

Args

  • bam_path (str): Path to the BAM file. Must be associated with a BAM index.

  • treat_as_url (optional, bool): If True, treat bam_path as a URL, default False.

  • min_seq_len (optional, int): Only retain sequences above this length, default 0.

  • min_align_len (optional, int): Only retain sequences with an alignment length above this value. Defaults to unused.

  • read_ids (optional, set of str): Only retrieve these read ids, defaults to unused.

  • threads (optional, int): Number of threads used in some aspects of program execution, defaults to 2.

  • include_zero_len (optional, bool): Include sequences of zero length. WARNING: our program may crash if you do this. Defaults to False. Helps to check if sequences of zero length exist in our BAM file.

  • read_filter (optional, str): Comma-separated sequence of one to many of the following strings: primary_forward, primary_reverse, secondary_forward, secondary_reverse, supplementary_forward, supplementary_reverse, unmapped. If specified, only reads with a mapping belonging to this set are retained. Defaults to no filter.

  • sample_fraction (optional, float): Set to between 0 and 1 to subsample BAM file. WARNING: seeds are not set, so you may get a new set of reads every time. WARNING: we sample every read with the given probability, so the total number of reads fluctuates according to standard counting statistics.

  • mapq_filter (optional, int): Exclude reads with mapping quality below this number. defaults to unused.

  • exclude_mapq_unavail (optional, bool): Exclude reads where mapping quality is unavailable. defaults to false.

  • region (optional, str): Only include reads with at least one mapped base from this region. Use the format “contig”, “contig:start-”, or “contig:start-end”. These are 0-based, half-open intervals. Defaults to read entire BAM file. Can be used in combination with mod_region.

  • full_region (optional, bool): Only include reads if they pass through the region above in full. Defaults to false.

  • tag (optional, str): If set, only this type of modification is processed. Input is a string type, for example a single-letter code “m”, a number as a string “76792” etc. Defaults to processing all modifications.

  • mod_strand (optional, str): Set this to bc or bc_comp to retrieve information about mods only from the basecalled strand or only from its complement. Some sequencing technologies like PacBio or ONT duplex record mod information both on a strand and its complement. It may be useful in some scenarios to separate this information. Defaults to not filter.

  • min_mod_qual (optional, int): Set to a number 0-255. Reject modification calls whose probability is below this value (0, 255 correspond to a probability of 0 and 1 respectively). Defaults to 0.

  • reject_mod_qual_non_inclusive (optional, (int, int)): Reject modification calls whose probability is such that int_low < prob < int_high. Set both to a number between 0-255 and such that the first entry is <= the second (if they are equal, no filtering is performed). Defaults to no filtering. Also see comments under min_mod_qual.

  • trim_read_ends_mod (optional, int): Reject modification information within so many bp of either end of the read. Defaults to 0.

  • base_qual_filter_mod (optional, int): Reject modification information on any base whose basecalling quality is below this number. Defaults to 0.

  • mod_region (optional, str): Genomic region in the format “contig”, “contig:start-” or “contig:start-end”. Reject any modification information outside this region. These are half-open, 0-based intervals. Can be used in combination with region.

Returns

A Polars dataframe.

Example output

If the polars dataframe were converted to a plain-text table, it might look like the following:

read_id seq_len alignment_type  align_start     align_end       contig  contig_id       base    is_strand_plus  mod_code        position        ref_position    mod_quality
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    8       primary_forward 9       17      dummyI  0       T       true    T       0       9       4
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    8       primary_forward 9       17      dummyI  0       T       true    T       3       12      7
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    8       primary_forward 9       17      dummyI  0       T       true    T       4       13      9
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    8       primary_forward 9       17      dummyI  0       T       true    T       7       16      6
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      primary_forward 23      71      dummyIII        2       T       true    T       3       26      221
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      primary_forward 23      71      dummyIII        2       T       true    T       8       31      242
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      primary_forward 23      71      dummyIII        2       T       true    T       27      50      3
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      primary_forward 23      71      dummyIII        2       T       true    T       39      62      47
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      primary_forward 23      71      dummyIII        2       T       true    T       47      70      239
fffffff1-10d2-49cb-8ca3-e8d48979001b    33      primary_reverse 3       36      dummyII 1       T       true    T       12      15      3
fffffff1-10d2-49cb-8ca3-e8d48979001b    33      primary_reverse 3       36      dummyII 1       T       true    T       13      16      3
fffffff1-10d2-49cb-8ca3-e8d48979001b    33      primary_reverse 3       36      dummyII 1       T       true    T       16      19      4
fffffff1-10d2-49cb-8ca3-e8d48979001b    33      primary_reverse 3       36      dummyII 1       T       true    T       19      22      3
fffffff1-10d2-49cb-8ca3-e8d48979001b    33      primary_reverse 3       36      dummyII 1       T       true    T       20      23      182
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        G       false   7200    28      -1      0
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        G       false   7200    29      -1      0
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        G       false   7200    30      -1      0
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        G       false   7200    32      -1      0
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        G       false   7200    43      -1      77
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        G       false   7200    44      -1      0
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        T       true    T       3       -1      221
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        T       true    T       8       -1      242
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        T       true    T       27      -1      0
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        T       true    T       39      -1      47
a4f36092-b4d5-47a9-813e-c22c3b477a0c    48      unmapped                                        T       true    T       47      -1      239

Errors

If building of option-related structs fails, if BAM input cannot be obtained, if preparing records fails, or running the nanalogue_core commands fail

read_info

read_info(
    bam_path,
    treat_as_url=False,
    min_seq_len=0,
    min_align_len=0,
    read_ids=Ellipsis,
    threads=2,
    include_zero_len=False,
    read_filter='',
    sample_fraction=1.0,
    mapq_filter=0,
    exclude_mapq_unavail=False,
    region='',
    full_region=False,
    tag='',
    mod_strand='',
    min_mod_qual=0,
    reject_mod_qual_non_inclusive=Ellipsis,
    trim_read_ends_mod=0,
    base_qual_filter_mod=0,
    mod_region=''
)

Produces bytes which can be decoded to JSON format, with one JSON record per BAM record with information per record such as alignment length, sequence length, read id, mod counts etc.

Sets various options through builder functions before running the function and capturing the output as a stream of bytes which can be decoded to JSON (see the Example output section below). Runs nanalogue_core::read_info::run.

This function can be used to count how many reads are above a threshold length or modification count or are primary mappings. The function can also be used to analyze relationships such as alignment length vs basecalled length. The arguments can be used to filter the BAM data (e.g. passing through a specific region etc.).

Args

  • bam_path (str): Path to the BAM file. Must be associated with a BAM index.

  • treat_as_url (optional, bool): If True, treat bam_path as a URL, default False.

  • min_seq_len (optional, int): Only retain sequences above this length, default 0.

  • min_align_len (optional, int): Only retain sequences with an alignment length above this value. Defaults to unused.

  • read_ids (optional, set of str): Only retrieve these read ids, defaults to unused.

  • threads (optional, int): Number of threads used in some aspects of program execution, defaults to 2.

  • include_zero_len (optional, bool): Include sequences of zero length. WARNING: our program may crash if you do this. Defaults to False. Helps to check if sequences of zero length exist in our BAM file.

  • read_filter (optional, str): Comma-separated sequence of one to many of the following strings: primary_forward, primary_reverse, secondary_forward, secondary_reverse, supplementary_forward, supplementary_reverse, unmapped. If specified, only reads with a mapping belonging to this set are retained. Defaults to no filter.

  • sample_fraction (optional, float): Set to between 0 and 1 to subsample BAM file. WARNING: seeds are not set, so you may get a new set of reads every time. WARNING: we sample every read with the given probability, so the total number of reads fluctuates according to standard counting statistics.

  • mapq_filter (optional, int): Exclude reads with mapping quality below this number. defaults to unused.

  • exclude_mapq_unavail (optional, bool): Exclude reads where mapping quality is unavailable. defaults to false.

  • region (optional, str): Only include reads with at least one mapped base from this region. Use the format “contig”, “contig:start-”, or “contig:start-end”. These are 0-based, half-open intervals. Defaults to read entire BAM file. Can be used in combination with mod_region.

  • full_region (optional, bool): Only include reads if they pass through the region above in full. Defaults to false.

  • tag (optional, str): If set, only this type of modification is processed. Input is a string type, for example a single-letter code “m”, a number as a string “76792” etc. Defaults to processing all modifications.

  • mod_strand (optional, str): Set this to bc or bc_comp to retrieve information about mods only from the basecalled strand or only from its complement. Some sequencing technologies like PacBio or ONT duplex record mod information both on a strand and its complement. It may be useful in some scenarios to separate this information. Defaults to not filter.

  • min_mod_qual (optional, int): Set to a number 0-255. Reject modification calls whose probability is below this value (0, 255 correspond to a probability of 0 and 1 respectively). Defaults to 0.

  • reject_mod_qual_non_inclusive (optional, (int, int)): Reject modification calls whose probability is such that int_low < prob < int_high. Set both to a number between 0-255 and such that the first entry is <= the second (if they are equal, no filtering is performed). Defaults to no filtering. Also see comments under min_mod_qual.

  • trim_read_ends_mod (optional, int): Reject modification information within so many bp of either end of the read. Defaults to 0.

  • base_qual_filter_mod (optional, int): Reject modification information on any base whose basecalling quality is below this number. Defaults to 0.

  • mod_region (optional, str): Genomic region in the format “contig”, “contig:start-” or “contig:start-end”. Reject any modification information outside this region. These are half-open, 0-based intervals. Can be used in combination with region.

Returns

A stream of bytes that can be decoded to JSON (See the snippet from Example output).

Example output

You’ve to decode the output of the function using something like:

import json
# Assume the function output is in 'result_bytes'
decoded_output = json.loads(result_bytes)

A record from the decoded output might look like

[
{
   "read_id": "cd623d4a-510d-4c6c-9d88-10eb475ac59d",
   "sequence_length": 2104,
   "contig": "contig_0",
   "reference_start": 7369,
   "reference_end": 9473,
   "alignment_length": 2104,
   "alignment_type": "primary_reverse",
   "mod_count": "C-m:263;N+N:2104;(probabilities >= 0.5020, PHRED base qual >= 0)"
}
]

When mods are not available, you will see NA in the mod_count field.

Errors

If building of option-related structs fails, if BAM input cannot be obtained, if preparing records fails, or running the nanalogue_core::read_info::run function fails

seq_table

seq_table(
    bam_path,
    region,
    treat_as_url=False,
    min_seq_len=0,
    min_align_len=0,
    read_ids=Ellipsis,
    threads=2,
    include_zero_len=False,
    read_filter='',
    sample_fraction=1.0,
    mapq_filter=0,
    exclude_mapq_unavail=False,
    tag='',
    mod_strand='',
    min_mod_qual=0,
    reject_mod_qual_non_inclusive=Ellipsis,
    trim_read_ends_mod=0,
    base_qual_filter_mod=0
)

Returns a Polars DataFrame with read sequence and quality information for a genomic region. Represents insertions as lowercase, deletions as periods, and modifications as ‘Z’. Basecalling qualities represented as a string of numbers separated by periods, with value set to ‘255’ at a deleted base.

Calls nanalogue_core::reads_table::run_df to extract read sequences and qualities from a specified genomic region. The output DataFrame contains columns for read ID, sequence (with modifications shown as ‘Z’), and base qualities.

Args

  • bam_path (str): Path to the BAM file. Must be associated with a BAM index.

  • region (str): Genomic region from which to extract sequences. Required. Use the format “contig”, “contig:start-”, or “contig:start-end”. These are 0-based, half-open intervals.

  • treat_as_url (optional, bool): If True, treat bam_path as a URL, default False.

  • min_seq_len (optional, int): Only retain sequences above this length, default 0.

  • min_align_len (optional, int): Only retain sequences with an alignment length above this value. Defaults to unused.

  • read_ids (optional, set of str): Only retrieve these read ids, defaults to unused.

  • threads (optional, int): Number of threads used in some aspects of program execution, defaults to 2.

  • include_zero_len (optional, bool): Include sequences of zero length. WARNING: our program may crash if you do this. Defaults to False.

  • read_filter (optional, str): Comma-separated sequence of one to many of the following strings: primary_forward, primary_reverse, secondary_forward, secondary_reverse, supplementary_forward, supplementary_reverse, unmapped. If specified, only reads with a mapping belonging to this set are retained. Defaults to no filter.

  • sample_fraction (optional, float): Set to between 0 and 1 to subsample BAM file. WARNING: seeds are not set, so you may get a new set of reads every time.

  • mapq_filter (optional, int): Exclude reads with mapping quality below this number. defaults to unused.

  • exclude_mapq_unavail (optional, bool): Exclude reads where mapping quality is unavailable. defaults to false.

  • tag (optional, str): If set, only this type of modification is processed. Input is a string type, for example a single-letter code “m”, a number as a string “76792” etc. Defaults to processing all modifications.

  • mod_strand (optional, str): Set this to bc or bc_comp to retrieve information about mods only from the basecalled strand or only from its complement. Defaults to not filter.

  • min_mod_qual (optional, int): Set to a number 0-255. Reject modification calls whose probability is below this value. Defaults to 0.

  • reject_mod_qual_non_inclusive (optional, (int, int)): Reject modification calls whose probability is such that int_low < prob < int_high. Defaults to no filtering.

  • trim_read_ends_mod (optional, int): Reject modification information within so many bp of either end of the read. Defaults to 0.

  • base_qual_filter_mod (optional, int): Reject modification information on any base whose basecalling quality is below this number. Defaults to 0.

Returns

A Polars dataframe with columns: read_id, sequence, qualities.

Sequence column conventions:

  • Insertions are shown in lowercase
  • Deletions are shown as periods (.)
  • Modified bases on the reference are shown as Z
  • Modified bases in an insertion are shown as z

Qualities column: a period-separated string of integer quality scores, with one score per base in the sequence.

Example output

If the polars dataframe were converted to a plain-text table, it might look like:

read_id sequence        qualities
5d10eb9a-aae1-4db8-8ec6-7ebb34d32575    ACGTZac..TZgt   30.35.40.38.42.20.22.255.255.41.25.28.30
a4f36092-b4d5-47a9-813e-c22c3b477a0c    TGCAZz.ATGCA    28.33.39.41.40.18.255.36.42.38.35.30

Errors

If the region parameter is empty, if building of option-related structs fails, if BAM input cannot be obtained, if preparing records fails, or if running the nanalogue_core::reads_table::run_df function fails

simulate_mod_bam

simulate_mod_bam(json_config, bam_path, fasta_path)

Simulates a BAM file with or without modifications based on a JSON configuration. Creates both a BAM file and a corresponding FASTA reference file.

This function takes a JSON configuration string that specifies how to generate synthetic sequencing data, including contig specifications, read parameters, and optional modification patterns. The output includes a sorted, indexed BAM file and a FASTA reference file.

Args

  • json_config (str): JSON string containing the simulation configuration. Must conform to the SimulationConfig schema. The JSON should specify

  • (some inputs may be optional): - contigs: Configuration for generating reference sequences - number: Number of contigs to generate - len_range: Range of contig lengths [min, max] - repeated_seq: Sequence pattern to repeat for contig generation. If not specified, random DNA sequences are generated. - reads: Array of read group specifications, each containing: - number: Number of reads to generate - mapq_range: Range of mapping quality values [min, max] - base_qual_range: Range of base quality values [min, max] - len_range: Range of read lengths as fraction of contig [min, max] - barcode: Optional barcode sequence - mods: Optional array of modification specifications

  • bam_path (str): Output path for the BAM file. If the file exists, it will be overwritten. A corresponding .bai index file will be created automatically.

  • fasta_path (str): Output path for the FASTA reference file. If the file exists, it will be overwritten.

Returns

Returns None on success. The function creates two files on disk:

  • A sorted, indexed BAM file at bam_path (with accompanying .bai index)
  • A FASTA reference file at fasta_path

Example

import pynanalogue

json_config = '''
{
"contigs": {
    "number": 2,
    "len_range": [100, 200],
    "repeated_seq": "ACGTACGT"
},
"reads": [
    {
        "number": 10,
        "mapq_range": [10, 30],
        "base_qual_range": [20, 40],
        "len_range": [0.1, 0.9],
        "barcode": "ACGTAA",
        "mods": [{
            "base": "C",
            "is_strand_plus": true,
            "mod_code": "m",
            "win": [5, 3],
            "mod_range": [[0.3, 0.7], [0.1, 0.5]]
        }]
    }
]
}
'''

pynanalogue.simulate_mod_bam(
json_config=json_config,
bam_path="output.bam",
fasta_path="output.fasta"
)

Errors

Returns a Python exception if:

  • The JSON configuration is invalid or cannot be parsed
  • The JSON structure doesn’t match the expected SimulationConfig schema
  • File I/O operations fail (e.g., permission issues, disk full)
  • BAM or FASTA generation fails due to invalid configuration parameters

window_reads

window_reads(
    bam_path,
    win,
    step,
    win_op='density',
    treat_as_url=False,
    min_seq_len=0,
    min_align_len=0,
    read_ids=Ellipsis,
    threads=2,
    include_zero_len=False,
    read_filter='',
    sample_fraction=1.0,
    mapq_filter=0,
    exclude_mapq_unavail=False,
    region='',
    full_region=False,
    tag='',
    mod_strand='',
    min_mod_qual=0,
    reject_mod_qual_non_inclusive=Ellipsis,
    trim_read_ends_mod=0,
    base_qual_filter_mod=0,
    mod_region=''
)

Window modification density across single molecules and return a Polars DataFrame. With the gradient option, the gradient in mod density within each window is reported. The output is a BED format, with windowed densities per window per read. Details: runs nanalogue_core::window_reads::run_df.

Sets various options through builder functions before running the function and capturing the output (see Example Output).

Args

  • bam_path (str): Path to the BAM file. Must be associated with a BAM index.

  • win (int): Size of window in number of bases whose mod is being queried. i.e. let’s say a read contains cytosine mods and win is set to 10, then each window is chosen so that there are 10 cytosines in it. If a read has multiple mods, then multiple windows are set up such that each window has the specified number of bases of that type in it.

  • step (int): Length by which the window is slid in the same units as win above.

  • win_op (optional, str): Type of windowing operation to use, allows “density” and “grad_density” i.e. measure modification density or the gradient of it within each window. Default is “density”.

  • treat_as_url (optional, bool): If True, treat bam_path as a URL, default False.

  • min_seq_len (optional, int): Only retain sequences above this length, default 0.

  • min_align_len (optional, int): Only retain sequences with an alignment length above this value. Defaults to unused.

  • read_ids (optional, set of str): Only retrieve these read ids, defaults to unused.

  • threads (optional, int): Number of threads used in some aspects of program execution, defaults to 2.

  • include_zero_len (optional, bool): Include sequences of zero length. WARNING: our program may crash if you do this. Defaults to False. Helps to check if sequences of zero length exist in our BAM file.

  • read_filter (optional, str): Comma-separated sequence of one to many of the following strings: primary_forward, primary_reverse, secondary_forward, secondary_reverse, supplementary_forward, supplementary_reverse, unmapped. If specified, only reads with a mapping belonging to this set are retained. Defaults to no filter.

  • sample_fraction (optional, float): Set to between 0 and 1 to subsample BAM file. WARNING: seeds are not set, so you may get a new set of reads every time. WARNING: we sample every read with the given probability, so the total number of reads fluctuates according to standard counting statistics.

  • mapq_filter (optional, int): Exclude reads with mapping quality below this number. defaults to unused.

  • exclude_mapq_unavail (optional, bool): Exclude reads where mapping quality is unavailable. defaults to false.

  • region (optional, str): Only include reads with at least one mapped base from this region. Use the format “contig”, “contig:start-”, or “contig:start-end”. These are 0-based, half-open intervals. Defaults to read entire BAM file. Can be used in combination with mod_region.

  • full_region (optional, bool): Only include reads if they pass through the region above in full. Defaults to false.

  • tag (optional, str): If set, only this type of modification is processed. Input is a string type, for example a single-letter code “m”, a number as a string “76792” etc. Defaults to processing all modifications.

  • mod_strand (optional, str): Set this to bc or bc_comp to retrieve information about mods only from the basecalled strand or only from its complement. Some sequencing technologies like PacBio or ONT duplex record mod information both on a strand and its complement. It may be useful in some scenarios to separate this information. Defaults to not filter.

  • min_mod_qual (optional, int): Set to a number 0-255. Reject modification calls whose probability is below this value (0, 255 correspond to a probability of 0 and 1 respectively). Defaults to 0.

  • reject_mod_qual_non_inclusive (optional, (int, int)): Reject modification calls whose probability is such that int_low < prob < int_high. Set both to a number between 0-255 and such that the first entry is <= the second (if they are equal, no filtering is performed). Defaults to no filtering. Also see comments under min_mod_qual.

  • trim_read_ends_mod (optional, int): Reject modification information within so many bp of either end of the read. Defaults to 0.

  • base_qual_filter_mod (optional, int): Reject modification information on any base whose basecalling quality is below this number. Defaults to 0.

  • mod_region (optional, str): Genomic region in the format “contig”, “contig:start-” or “contig:start-end”. Reject any modification information outside this region. These are half-open, 0-based intervals. Can be used in combination with region.

Returns

A Polars dataframe (See the snippet from Example output).

Example output

If the dataframe were converted to a TSV format, it might look like the following. The basecalling qualities are all 255 i.e. unknown because this is a BAM file where basecalling qualities haven’t been recorded.

#contig ref_win_start   ref_win_end     read_id win_val strand  base    mod_strand      mod_type        win_start       win_end basecall_qual
dummyIII        26      32      a4f36092-b4d5-47a9-813e-c22c3b477a0c    1       +       T       +       T       3       9       255
dummyIII        31      51      a4f36092-b4d5-47a9-813e-c22c3b477a0c    0.5     +       T       +       T       8       28      255
dummyIII        62      71      a4f36092-b4d5-47a9-813e-c22c3b477a0c    0.5     +       T       +       T       39      48      255
dummyII 22      24      fffffff1-10d2-49cb-8ca3-e8d48979001b    0.5     -       T       +       T       19      21      255
.       -1      -1      a4f36092-b4d5-47a9-813e-c22c3b477a0c    1       .       T       +       T       3       9       255
.       -1      -1      a4f36092-b4d5-47a9-813e-c22c3b477a0c    0.5     .       T       +       T       8       28      255
.       -1      -1      a4f36092-b4d5-47a9-813e-c22c3b477a0c    0.5     .       T       +       T       39      48      255

Under the gradient option, win_val reports the gradient in modification density within each window.

#contig ref_win_start   ref_win_end     read_id win_val strand  base    mod_strand      mod_type       win_start        win_end basecall_qual
dummyIII    40  50  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.054545455 +   N   +   N   17  27  255
dummyIII    41  51  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.096969694 +   N   +   N   18  28  255
dummyIII    42  52  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.12727273  +   N   +   N   19  29  255
dummyIII    43  53  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.14545454  +   N   +   N   20  30  255
dummyIII    44  54  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.15151516  +   N   +   N   21  31  255
dummyIII    45  55  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.14545454  +   N   +   N   22  32  255
dummyIII    46  56  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.12727273  +   N   +   N   23  33  255
dummyIII    47  57  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.096969694 +   N   +   N   24  34  255
dummyIII    48  58  c4f36092-b4d5-47a9-813e-c22c3b477a0c    0.054545455 +   N   +   N   25  35  255

Errors

If building of option-related structs fails, if BAM input cannot be obtained, if preparing records fails, or running the nanalogue_core::window_reads::run_df function fails

Acknowledgments

This software was developed at the Earlham Institute in the UK. This work was supported by the Biotechnology and Biological Sciences Research Council (BBSRC), part of UK Research and Innovation, through the Core Capability Grant BB/CCG2220/1 at the Earlham Institute and the Earlham Institute Strategic Programme Grant Cellular Genomics BBX011070/1 and its constituent work packages BBS/E/ER/230001B (CellGen WP2 Consequences of somatic genome variation on traits). The work was also supported by the following response-mode project grants: BB/W006014/1 (Single molecule detection of DNA replication errors) and BB/Y00549X/1 (Single molecule analysis of Human DNA replication). This research was supported in part by NBI Research Computing through use of the High-Performance Computing system and Isilon storage.