Reference

Data Input/Output

Functions:

load_csv(filename[, cell_axis, delimiter, …])

Load a csv file.

load_tsv(filename[, cell_axis, delimiter, …])

Load a tsv file.

load_fcs(filename[, gene_names, cell_names, …])

Load a fcs file.

load_mtx(mtx_file[, cell_axis, gene_names, …])

Load a mtx file.

save_mtx(data, destination[, cell_names, …])

Save a mtx file.

load_10X(data_dir[, sparse, gene_labels, …])

Load data produced from the 10X Cellranger pipeline.

load_10X_HDF5(filename[, genome, sparse, …])

Load HDF5 10X data produced from the 10X Cellranger pipeline.

load_10X_zip(filename[, sparse, …])

Load zipped 10X data produced from the 10X Cellranger pipeline.

scprep.io.load_csv(filename, cell_axis='row', delimiter=',', gene_names=True, cell_names=True, sparse=False, chunksize=10000, **kwargs)[source]

Load a csv file.

Parameters
  • filename (str) – The name of the csv file to be loaded

  • cell_axis ({'row', 'column'}, optional (default: 'row')) – If your data has genes on the rows and cells on the columns, use cell_axis=’column’

  • delimiter (str, optional (default: ',')) – Use ‘\t’ for tab separated values (tsv)

  • gene_names (bool, str, array-like, or None (default: True)) – If True, we assume gene names are in the first row/column. Otherwise expects a filename or an array containing a list of gene symbols or ids

  • cell_names (bool, str, array-like, or None (default: True)) – If True, we assume cell names are in the first row/column. Otherwise expects a filename or an array containing a list of cell barcodes.

  • sparse (bool, optional (default: False)) – If True, loads the data as a pd.DataFrame[pd.SparseArray]. This uses less memory but more CPU.

  • chunksize (int, optional (default: 10000)) – If sparse=True, read this many lines of dense data at a time before converting to sparse.

  • **kwargs (optional arguments for pd.read_csv.) –

Returns

data – If either gene or cell names are given, data will be a pd.DataFrame or pd.DataFrame[pd.SparseArray]. If no names are given, data will be a np.ndarray or scipy.sparse.spmatrix

Return type

array-like, shape=[n_samples, n_features]
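The effect of cell_axis can be illustrated without scprep. The following is a minimal sketch that uses pandas directly, with made-up file contents and gene names; it is not scprep's implementation. A file with genes on the rows is transposed so that the result has the documented shape [n_samples, n_features].

```python
import io

import pandas as pd

# Hypothetical file contents: genes on the rows, cells on the columns,
# i.e. the layout for which the docs suggest cell_axis='column'.
text = "gene,cell1,cell2\nGENE1,1,0\nGENE2,3,4\nGENE3,0,2\n"

genes_by_cells = pd.read_csv(io.StringIO(text), index_col=0)

# Transposing gives cells on the rows and genes on the columns.
cells_by_genes = genes_by_cells.T
print(cells_by_genes.shape)
```

With cell_axis='row' (the default) no transpose would be needed, because the file already has one cell per line.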

scprep.io.load_tsv(filename, cell_axis='row', delimiter='\t', gene_names=True, cell_names=True, sparse=False, **kwargs)[source]

Load a tsv file.

Parameters
  • filename (str) – The name of the tsv file to be loaded

  • cell_axis ({'row', 'column'}, optional (default: 'row')) – If your data has genes on the rows and cells on the columns, use cell_axis=’column’

  • delimiter (str, optional (default: '\t')) – Use ‘,’ for comma separated values (csv)

  • gene_names (bool, str, array-like, or None (default: True)) – If True, we assume gene names are in the first row/column. Otherwise expects a filename or an array containing a list of gene symbols or ids

  • cell_names (bool, str, array-like, or None (default: True)) – If True, we assume cell names are in the first row/column. Otherwise expects a filename or an array containing a list of cell barcodes.

  • sparse (bool, optional (default: False)) – If True, loads the data as a pd.DataFrame[pd.SparseArray]. This uses less memory but more CPU.

  • **kwargs (optional arguments for pd.read_csv.) –

Returns

data – If either gene or cell names are given, data will be a pd.DataFrame or pd.DataFrame[pd.SparseArray]. If no names are given, data will be a np.ndarray or scipy.sparse.spmatrix

Return type

array-like, shape=[n_samples, n_features]

scprep.io.load_fcs(filename, gene_names=True, cell_names=True, sparse=None, metadata_channels=['Time', 'Event_length', 'DNA1', 'DNA2', 'Cisplatin', 'beadDist', 'bead1'], channel_naming='$PnS', reformat_meta=True, override=False, **kwargs)[source]

Load a fcs file.

Parameters
  • filename (str) – The name of the fcs file to be loaded

  • gene_names (bool, str, array-like, or None (default: True)) – If True, we assume gene names are contained in the file. Otherwise expects a filename or an array containing a list of gene symbols or ids

  • cell_names (bool, str, array-like, or None (default: True)) – If True, we assume cell names are contained in the file. Otherwise expects a filename or an array containing a list of cell barcodes.

  • sparse (bool, optional (default: None)) – If True, loads the data as a pd.DataFrame[SparseArray]. This uses less memory but more CPU.

  • metadata_channels (list-like, optional, shape=[n_meta] (default: [‘Time’, ‘Event_length’, ‘DNA1’, ‘DNA2’, ‘Cisplatin’, ‘beadDist’, ‘bead1’])) – Channels to be excluded from the data

  • channel_naming ('$PnS' | '$PnN') – Determines which metadata field is used for naming the channels. The default should be $PnS, even though it is not guaranteed to be unique. $PnN stands for the short name (guaranteed to be unique) and will look like ‘FL1-H’. $PnS stands for the actual name (not guaranteed to be unique) and will look like ‘FSC-H’ (forward scatter). The chosen field is used to populate self.channels. Note: these names are not flipped in the implementation; it looks like they were swapped for some reason in the official FCS specification.

  • reformat_meta (bool, optional (default: True)) – If true, the meta data is reformatted with the channel information organized into a DataFrame and moved into the ‘_channels_’ key

  • override (bool, optional (default: False)) – If true, uses an experimental override of fcsparser. Should only be used in cases where fcsparser fails to load the file, likely due to a malformed header. Credit to https://github.com/pontikos/fcstools

  • **kwargs (optional arguments for fcsparser.parse.) –

Returns

  • channel_metadata (dict) – FCS metadata

  • cell_metadata (array-like, shape=[n_samples, n_meta]) – Values from metadata channels

  • data (array-like, shape=[n_samples, n_features]) – If either gene or cell names are given, data will be a pd.DataFrame or pd.DataFrame[SparseArray]. If no names are given, data will be a np.ndarray or scipy.sparse.spmatrix

scprep.io.load_mtx(mtx_file, cell_axis='row', gene_names=None, cell_names=None, sparse=None)[source]

Load a mtx file.

Parameters
  • mtx_file (str) – The name of the mtx file to be loaded

  • cell_axis ({'row', 'column'}, optional (default: 'row')) – If your data has genes on the rows and cells on the columns, use cell_axis=’column’

  • gene_names (str, array-like, or None (default: None)) – Expects a filename or an array containing a list of gene symbols or ids

  • cell_names (str, array-like, or None (default: None)) – Expects a filename or an array containing a list of cell barcodes.

  • sparse (bool, optional (default: None)) – If True, loads the data as a pd.DataFrame[pd.SparseArray]. This uses less memory but more CPU.

Returns

data – If either gene or cell names are given, data will be a pd.DataFrame or pd.DataFrame[pd.SparseArray]. If no names are given, data will be a np.ndarray or scipy.sparse.spmatrix

Return type

array-like, shape=[n_samples, n_features]

scprep.io.save_mtx(data, destination, cell_names=None, gene_names=None)[source]

Save a mtx file.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data, saved to destination/matrix.mtx

  • destination (str) – Directory in which to save the data

  • cell_names (list-like, shape=[n_samples], optional (default: None)) – Cell names associated with rows, saved to destination/cell_names.tsv. If data is a pandas DataFrame and cell_names is None, these are autopopulated from data.index.

  • gene_names (list-like, shape=[n_features], optional (default: None)) – Gene names associated with columns, saved to destination/gene_names.tsv. If data is a pandas DataFrame and gene_names is None, these are autopopulated from data.columns.

Examples

>>> import scprep
>>> scprep.io.save_mtx(data, destination="my_data")
>>> reload = scprep.io.load_mtx("my_data/matrix.mtx",
...                             cell_names="my_data/cell_names.tsv",
...                             gene_names="my_data/gene_names.tsv")

scprep.io.load_10X(data_dir, sparse=True, gene_labels='symbol', allow_duplicates=None)[source]

Load data produced from the 10X Cellranger pipeline.

A default run of the cellranger count command will generate gene-barcode matrices for secondary analysis. For both “raw” and “filtered” output, directories are created containing three files: ‘matrix.mtx’, ‘barcodes.tsv’, ‘genes.tsv’. Running scprep.io.load_10X(data_dir) will return a Pandas DataFrame with genes as columns and cells as rows.

Parameters
  • data_dir (string) – Path to the input data directory. Expects ‘matrix.mtx(.gz)’, ‘[genes/features].tsv(.gz)’ and ‘barcodes.tsv(.gz)’ to be present and raises an error otherwise.

  • sparse (boolean) – If True, a sparse Pandas DataFrame is returned.

  • gene_labels (string, {'id', 'symbol', 'both'} optional, default: 'symbol') – Whether the columns of the dataframe should contain gene ids or gene symbols. If ‘both’, returns symbols followed by ids in parentheses.

  • allow_duplicates (bool, optional (default: None)) – Whether or not to allow duplicate gene names. If None, duplicates are allowed for dense input but not for sparse input.

Returns

data – If sparse, data will be a pd.DataFrame[pd.SparseArray]. Otherwise, data will be a pd.DataFrame.

Return type

array-like, shape=[n_samples, n_features]

scprep.io.load_10X_HDF5(filename, genome=None, sparse=True, gene_labels='symbol', allow_duplicates=None, backend=None)[source]

Load HDF5 10X data produced from the 10X Cellranger pipeline.

Equivalent to load_10X but for HDF5 format.

Parameters
  • filename (string) – path to HDF5 input data

  • genome (str or None, optional (default: None)) – Name of the genome against which Cellranger ran the analysis. If None, selects the first available genome, and prints all available genomes if more than one is available. Invalid for Cellranger 3.0 HDF5 files.

  • sparse (boolean) – If True, a sparse Pandas DataFrame is returned.

  • gene_labels (string, {'id', 'symbol', 'both'} optional, default: 'symbol') – Whether the columns of the dataframe should contain gene ids or gene symbols. If ‘both’, returns symbols followed by ids in parentheses.

  • allow_duplicates (bool, optional (default: None)) – Whether or not to allow duplicate gene names. If None, duplicates are allowed for dense input but not for sparse input.

  • backend (string, {'tables', 'h5py' or None} optional, default: None) – Selects the HDF5 backend. By default, selects whichever is available, using tables if both are available.

Returns

data – If sparse, data will be a pd.DataFrame[pd.SparseArray]. Otherwise, data will be a pd.DataFrame.

Return type

array-like, shape=[n_samples, n_features]

scprep.io.load_10X_zip(filename, sparse=True, gene_labels='symbol', allow_duplicates=None)[source]

Load zipped 10X data produced from the 10X Cellranger pipeline.

Runs load_10X after unzipping the data contained in filename.

Parameters
  • filename (string) – Path to the zipped input data directory. Expects ‘matrix.mtx’, ‘genes.tsv’ and ‘barcodes.tsv’ to be present and raises an error otherwise.

  • sparse (boolean) – If True, a sparse Pandas DataFrame is returned.

  • gene_labels (string, {'id', 'symbol', 'both'} optional, default: 'symbol') – Whether the columns of the dataframe should contain gene ids or gene symbols. If ‘both’, returns symbols followed by ids in parentheses.

  • allow_duplicates (bool, optional (default: None)) – Whether or not to allow duplicate gene names. If None, duplicates are allowed for dense input but not for sparse input.

Returns

data – If sparse, data will be a pd.DataFrame[pd.SparseArray]. Otherwise, data will be a pd.DataFrame.

Return type

array-like, shape=[n_samples, n_features]

HDF5

Functions:

get_node(f, node)

Get a subnode from a HDF5 file or group.

get_values(dataset)

Read values from a HDF5 dataset.

list_nodes(f)

List all first-level nodes in a HDF5 file.

open_file(filename[, mode, backend])

Open an HDF5 file with either tables or h5py.

scprep.io.hdf5.get_node(f, node)[source]

Get a subnode from a HDF5 file or group.

Parameters
  • f (tables.File, h5py.File, tables.Group or h5py.Group) – Open HDF5 file handle or node

  • node (str) – Name of subnode to retrieve

Returns

g – Requested HDF5 node.

Return type

tables.Group, h5py.Group, tables.CArray or h5py.Dataset

scprep.io.hdf5.get_values(dataset)[source]

Read values from a HDF5 dataset.

Parameters

dataset (tables.CArray or h5py.Dataset) –

Returns

data – Data read from HDF5 dataset

Return type

np.ndarray

scprep.io.hdf5.list_nodes(f)[source]

List all first-level nodes in a HDF5 file.

Parameters

f (tables.File or h5py.File) – Open HDF5 file handle.

Returns

nodes – List of names of first-level nodes below f

Return type

list

scprep.io.hdf5.open_file(filename, mode='r', backend=None)[source]

Open an HDF5 file with either tables or h5py.

Gives a simple, unified interface for both tables and h5py

Parameters
  • filename (str) – Name of the HDF5 file

  • mode (str, optional (default: 'r')) – Read/write mode. Choose from [‘r’, ‘w’, ‘a’, ‘r+’]

  • backend (str, optional (default: None)) – HDF5 backend to use. Choose from [‘h5py’, ‘tables’]. If not given, scprep will detect which backend is available, using tables if both are installed.

Returns

f – Open HDF5 file handle.

Return type

tables.File or h5py.File

Download

Functions:

download_and_extract_zip(url, destination)

Download a .zip file from a URL and extract it.

download_google_drive(id, destination)

Download a file from Google Drive.

download_url(url, destination)

Download a file from a URL.

unzip(filename[, destination, delete])

Extract a .zip file and optionally remove the archived version.

scprep.io.download.download_and_extract_zip(url, destination)[source]

Download a .zip file from a URL and extract it.

Parameters
  • url (string) – URL of file to be downloaded

  • destination (string) – Directory in which to extract the downloaded zip

scprep.io.download.download_google_drive(id, destination)[source]

Download a file from Google Drive.

Requires the file to be available to view by anyone with the URL.

Parameters
  • id (string) – Google Drive ID string. You can access this by clicking ‘Get Shareable Link’, which will give a URL of the form <https://drive.google.com/file/d/your_file_id/view?usp=sharing>

  • destination (string or file) – File to which to save the downloaded data
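The file ID is the path component between /file/d/ and /view in the shareable link. As a small sketch, the hypothetical helper below (not part of scprep) extracts it with the standard library, assuming the link has exactly the form shown above.

```python
from urllib.parse import urlparse


def drive_id_from_link(link):
    """Hypothetical helper: pull the file ID out of a shareable link."""
    parts = urlparse(link).path.strip("/").split("/")
    # Expected path layout: file/d/<id>/view
    if len(parts) >= 3 and parts[0] == "file" and parts[1] == "d":
        return parts[2]
    raise ValueError("not a Google Drive file link: {!r}".format(link))


file_id = drive_id_from_link(
    "https://drive.google.com/file/d/your_file_id/view?usp=sharing"
)
print(file_id)
```

The resulting string is what would be passed as id to download_google_drive.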

scprep.io.download.download_url(url, destination)[source]

Download a file from a URL.

Parameters
  • url (string) – URL of file to be downloaded

  • destination (string or file) – File to which to save the downloaded data

scprep.io.download.unzip(filename, destination=None, delete=True)[source]

Extract a .zip file and optionally remove the archived version.

Parameters
  • filename (string) – Path to the zip file

  • destination (string, optional (default: None)) – Path to the folder in which to extract the zip. If None, extracts to the same directory the archive is in.

  • delete (boolean, optional (default: True)) – If True, deletes the zip file after extraction
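The documented behaviour of destination and delete can be sketched with the standard library alone. The function below is an illustrative stand-in, not scprep's implementation; the archive and file names are made up.

```python
import os
import tempfile
import zipfile


def unzip_sketch(filename, destination=None, delete=True):
    """Illustrative stand-in for the documented unzip behaviour."""
    if destination is None:
        # Extract next to the archive when no destination is given.
        destination = os.path.dirname(os.path.abspath(filename))
    with zipfile.ZipFile(filename) as archive:
        archive.extractall(destination)
    if delete:
        # Remove the archive once its contents have been extracted.
        os.remove(filename)
    return destination


# Build a small throwaway archive to exercise the sketch.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "example.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("matrix.txt", "1 0 2\n")

out_dir = unzip_sketch(archive_path)
extracted = os.path.exists(os.path.join(out_dir, "matrix.txt"))
archive_removed = not os.path.exists(archive_path)
```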

Filtering

Functions:

filter_duplicates(data, *extra_data[, …])

Filter all duplicate cells.

filter_empty_cells(data, *extra_data[, …])

Remove all cells with zero library size.

filter_empty_genes(data, *extra_data)

Filter all genes with zero counts across all cells.

filter_gene_set_expression(data, *extra_data)

Remove cells with total expression of a gene set above or below a threshold.

filter_library_size(data, *extra_data[, …])

Remove all cells with library size above or below a certain threshold.

filter_rare_genes(data, *extra_data[, …])

Filter all genes with negligible counts in all but a few cells.

filter_values(data, *extra_data[, values, …])

Remove all cells with values above or below a certain threshold.

scprep.filter.filter_duplicates(data, *extra_data, sample_labels=None)[source]

Filter all duplicate cells.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows

  • sample_labels (Deprecated) –

Returns

  • data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples

  • extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.
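As a rough illustration of duplicate filtering, the NumPy sketch below keeps one copy of each distinct row and applies the same row selection to an extra array. Keeping the first occurrence and preserving the original order are assumptions made for this example; the reference above does not say which copy scprep retains.

```python
import numpy as np

data = np.array([
    [1, 2],
    [3, 4],
    [1, 2],  # duplicate of the first cell
    [5, 6],
])
labels = np.array(["a", "b", "c", "d"])  # plays the role of extra_data

# Index of the first occurrence of every distinct row.
_, first_idx = np.unique(data, axis=0, return_index=True)
keep = np.sort(first_idx)  # restore the original row order

deduplicated = data[keep]
labels_kept = labels[keep]
```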

scprep.filter.filter_empty_cells(data, *extra_data, sample_labels=None)[source]

Remove all cells with zero library size.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows

  • sample_labels (Deprecated) –

Returns

  • data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples

  • extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.

scprep.filter.filter_empty_genes(data, *extra_data)[source]

Filter all genes with zero counts across all cells.

This is equivalent to filter_rare_genes(data, cutoff=0, min_cells=1) but should be faster.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[any, n_features], optional) – Optional additional data objects from which to select the same genes

Returns

  • data (array-like, shape=[n_samples, m_features]) – Filtered output data, where m_features <= n_features

  • extra_data (array-like, shape=[any, m_features]) – Filtered extra data, if passed.

scprep.filter.filter_gene_set_expression(data, *extra_data, genes=None, starts_with=None, ends_with=None, exact_word=None, regex=None, cutoff=None, percentile=None, library_size_normalize=False, keep_cells=None, return_expression=False, sample_labels=None, filter_per_sample=None)[source]

Remove cells with total expression of a gene set above or below a threshold.

It is recommended to use plot_gene_set_expression() to choose a cutoff prior to filtering.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows

  • genes (list-like, optional (default: None)) – Integer column indices or string gene names included in gene set

  • starts_with (str or None, optional (default: None)) – If not None, select genes that start with this prefix

  • ends_with (str or None, optional (default: None)) – If not None, select genes that end with this suffix

  • exact_word (str, list-like or None, optional (default: None)) – If not None, select genes that contain this exact word.

  • regex (str or None, optional (default: None)) – If not None, select genes that match this regular expression

  • cutoff (float or tuple of floats, optional (default: None)) – Expression value above or below which to remove cells. Only one of cutoff and percentile should be specified.

  • percentile (int or tuple of ints, optional (Default: None)) – Percentile above or below which to retain a cell. Must be an integer between 0 and 100. Only one of cutoff and percentile should be specified.

  • library_size_normalize (bool, optional (default: False)) – Divide gene set expression by library size

  • keep_cells ({'above', 'below', 'between'} or None, optional (default: None)) – Keep cells above or below the cutoff. If None, defaults to ‘below’ for one cutoff and ‘between’ for two.

  • return_expression (bool, optional (default: False)) – If True, also return the values corresponding to the retained cells

  • sample_labels (Deprecated) –

  • filter_per_sample (Deprecated) –

Returns

  • data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples

  • filtered_expression (list-like, shape=[m_samples]) – Gene set expression corresponding to retained samples, returned only if return_expression is True

  • extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.

scprep.filter.filter_library_size(data, *extra_data, cutoff=None, percentile=None, keep_cells=None, return_library_size=False, sample_labels=None, filter_per_sample=None)[source]

Remove all cells with library size above or below a certain threshold.

It is recommended to use plot_library_size() to choose a cutoff prior to filtering.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows

  • cutoff (float or tuple of floats, optional (default: None)) – Library size above or below which to retain a cell. Only one of cutoff and percentile should be specified.

  • percentile (int or tuple of ints, optional (Default: None)) – Percentile above or below which to retain a cell. Must be an integer between 0 and 100. Only one of cutoff and percentile should be specified.

  • keep_cells ({'above', 'below', 'between'} or None, optional (default: None)) – Keep cells above, below or between the cutoff. If None, defaults to ‘above’ when a single cutoff is given and ‘between’ when two cutoffs are given.

  • return_library_size (bool, optional (default: False)) – If True, also return the library sizes corresponding to the retained cells

  • sample_labels (Deprecated) –

  • filter_per_sample (Deprecated) –

Returns

  • data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples

  • filtered_library_size (list-like, shape=[m_samples]) – Library sizes corresponding to retained samples, returned only if return_library_size is True

  • extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.
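The interaction of cutoff and keep_cells can be sketched in plain NumPy. This is an illustration of the documented semantics, not scprep's code; the cutoffs are chosen so that no cell sits exactly on a threshold, because whether the comparison is inclusive is not stated above.

```python
import numpy as np

data = np.array([
    [1, 0, 2],
    [10, 5, 5],
    [0, 0, 1],
    [30, 20, 10],
])
cell_labels = np.array(["c1", "c2", "c3", "c4"])  # plays the role of extra_data

library_size = data.sum(axis=1)  # 3, 20, 1, 60

# Single cutoff with keep_cells='above'.
keep_above = library_size > 2
data_above, labels_above = data[keep_above], cell_labels[keep_above]

# Two cutoffs with keep_cells='between'.
low, high = 2, 50
keep_between = (library_size > low) & (library_size < high)
data_between = data[keep_between]
filtered_library_size = library_size[keep_between]
```

The same boolean mask is applied to every extra array, which is why the filtered outputs stay aligned row by row.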

scprep.filter.filter_rare_genes(data, *extra_data, cutoff=0, min_cells=5)[source]

Filter all genes with negligible counts in all but a few cells.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[any, n_features], optional) – Optional additional data objects from which to select the same genes

  • cutoff (float, optional (default: 0)) – Number of counts above which expression is deemed non-negligible

  • min_cells (int, optional (default: 5)) – Minimum number of cells above cutoff in order to retain a gene

Returns

  • data (array-like, shape=[n_samples, m_features]) – Filtered output data, where m_features <= n_features

  • extra_data (array-like, shape=[any, m_features]) – Filtered extra data, if passed.
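A minimal NumPy sketch of the rule described by cutoff and min_cells: a gene is retained when at least min_cells cells have counts above cutoff. The toy matrix and the choice min_cells=2 are made up for illustration, and this is not scprep's implementation.

```python
import numpy as np

data = np.array([
    [0, 3, 1, 0],
    [0, 0, 2, 0],
    [0, 5, 4, 1],
])
cutoff, min_cells = 0, 2

# Number of cells in which each gene exceeds the cutoff.
cells_expressing = (data > cutoff).sum(axis=0)

# Keep genes detected in at least min_cells cells.
keep = cells_expressing >= min_cells
filtered = data[:, keep]
```

With cutoff=0 and min_cells=1 the same rule removes only genes with zero counts in every cell, which matches the equivalence stated for filter_empty_genes.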

scprep.filter.filter_values(data, *extra_data, values=None, cutoff=None, percentile=None, keep_cells='above', return_values=False, sample_labels=None, filter_per_sample=None)[source]

Remove all cells with values above or below a certain threshold.

It is recommended to use histogram() to choose a cutoff prior to filtering.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows

  • values (list-like, shape=[n_samples]) – Value upon which to filter

  • cutoff (float or tuple of floats, optional (default: None)) – Value above or below which to retain cells. Only one of cutoff and percentile should be specified.

  • percentile (int or tuple of ints, optional (Default: None)) – Percentile above or below which to retain cells. Must be an integer between 0 and 100. Only one of cutoff and percentile should be specified.

  • keep_cells ({'above', 'below', 'between'} or None, optional (default: 'above')) – Keep cells above, below or between the cutoff. If None, defaults to ‘above’ when a single cutoff is given and ‘between’ when two cutoffs are given.

  • return_values (bool, optional (default: False)) – If True, also return the values corresponding to the retained cells

  • sample_labels (Deprecated) –

  • filter_per_sample (Deprecated) –

Returns

  • data (array-like, shape=[m_samples, n_features]) – Filtered output data, where m_samples <= n_samples

  • filtered_values (list-like, shape=[m_samples]) – Values corresponding to retained samples, returned only if return_values is True

  • extra_data (array-like, shape=[m_samples, any]) – Filtered extra data, if passed.
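Percentile-based filtering can be sketched as follows: the percentiles are first converted to value thresholds, which are then used like an ordinary pair of cutoffs. This NumPy example assumes the default linear interpolation of np.percentile and uses made-up values; scprep's exact percentile handling may differ.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
data = np.arange(20).reshape(10, 2)  # one row per cell

# Convert the (20, 80) percentiles into value thresholds.
low, high = np.percentile(values, (20, 80))

# keep_cells='between': retain cells strictly inside the two thresholds.
keep = (values > low) & (values < high)
data_kept = data[keep]
filtered_values = values[keep]
```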

Normalization

Functions:

batch_mean_center(data[, sample_idx])

Perform batch mean-centering on the data.

library_size_normalize(data[, rescale, …])

Perform L1 normalization on input data.

scprep.normalize.batch_mean_center(data, sample_idx=None)[source]

Perform batch mean-centering on the data.

The features of the data are all centered such that the column means are zero. Each batch is centered separately.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • sample_idx (list-like, optional) – Batch indices. If None, data is assumed to be a single batch

Returns

data – Batch mean-centered output data.

Return type

array-like, shape=[n_samples, n_features]
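Batch mean-centering can be illustrated with a short NumPy sketch: within each batch, the per-gene mean of that batch is subtracted from its cells. The batch labels and values are made up, and this is a sketch of the idea rather than scprep's implementation.

```python
import numpy as np

data = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [10.0, 0.0],
    [20.0, 4.0],
])
sample_idx = np.array(["a", "a", "b", "b"])  # batch label of each cell

centered = data.copy()
for batch in np.unique(sample_idx):
    rows = sample_idx == batch
    # Subtract this batch's column means from its own cells only.
    centered[rows] -= data[rows].mean(axis=0)
```

After centering, the column means are zero within every batch, and therefore also over the whole dataset.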

scprep.normalize.library_size_normalize(data, rescale=10000, return_library_size=False)[source]

Perform L1 normalization on input data.

Performs L1 normalization on the input data so that the expression values for each cell sum to 1, then rescales the normalized matrix according to rescale (by default, to 10000 counts per cell), effectively scaling all cells as if they were sampled evenly.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • rescale ({‘mean’, ‘median’}, float or None, optional (default: 10000)) – Rescaling strategy. If ‘mean’ or ‘median’, normalized cells are scaled back up to the mean or median expression value. If a float, normalized cells are scaled up to the given value. If None, no rescaling is done and all cells will have normalized library size of 1.

  • return_library_size (bool, optional (default: False)) – If True, also return the library size pre-normalization

Returns

  • data_norm (array-like, shape=[n_samples, n_features]) – Library size normalized output data

  • library_size (list-like, shape=[n_samples]) – Library size of cells pre-normalization, returned only if return_library_size is True
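The arithmetic behind library size normalization is short enough to sketch in NumPy. This example assumes every cell has a non-zero library size (an empty cell would cause a division by zero here) and is not scprep's implementation.

```python
import numpy as np

data = np.array([
    [1.0, 1.0, 2.0],
    [0.0, 5.0, 5.0],
    [10.0, 30.0, 60.0],
])

library_size = data.sum(axis=1)  # 4, 10, 100

# L1-normalize each cell, then rescale to a fixed total (rescale=10000).
rescale = 10000
data_norm = data / library_size[:, None] * rescale

# rescale='median' would instead scale back up to the median library size.
data_norm_median = data / library_size[:, None] * np.median(library_size)
```

After normalization every cell has the same total, so differences between cells reflect relative expression rather than sequencing depth.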

Transformation

Functions:

arcsinh(data[, cofactor])

Inverse hyperbolic sine transform.

log(data[, pseudocount, base])

Log transform.

sqrt(data)

Square root transform.

scprep.transform.arcsinh(data, cofactor=5)[source]

Inverse hyperbolic sine transform.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • cofactor (float or None, optional (default: 5)) – Factor by which to divide data before arcsinh transform

Returns

data – Inverse hyperbolic sine transformed output data

Return type

array-like, shape=[n_samples, n_features]

Raises

ValueError – If cofactor <= 0
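The arcsinh transform divides the data by the cofactor before applying the inverse hyperbolic sine. A minimal NumPy sketch with made-up values:

```python
import numpy as np

data = np.array([[0.0, 5.0, 50.0]])
cofactor = 5

# Divide by the cofactor, then apply the inverse hyperbolic sine.
transformed = np.arcsinh(data / cofactor)
```

Zero stays at zero, values near the cofactor are compressed roughly linearly, and large values grow approximately logarithmically.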

scprep.transform.log(data, pseudocount=1, base=10)[source]

Log transform.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • pseudocount (int, optional (default: 1)) – Pseudocount to add to values before log transform. If data is sparse, pseudocount must be 1 such that log(0 + pseudocount) = 0

  • base ({2, 'e', 10}, optional (default: 10)) – Logarithm base.

Returns

data – Log transformed output data

Return type

array-like, shape=[n_samples, n_features]

Raises

  • ValueError – If data has zero or negative values

  • RuntimeWarning – If data is sparse and pseudocount != 1
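The role of the pseudocount can be seen in a small NumPy sketch: with pseudocount=1, zeros map to zero, which is why sparse data can stay sparse. The values are made up, and the change-of-base formula is used here only for illustration.

```python
import numpy as np

data = np.array([[0.0, 9.0, 99.0]])
pseudocount = 1
base = 10

# Change of base: log_b(x) = ln(x) / ln(b).
transformed = np.log(data + pseudocount) / np.log(base)
```

Any other pseudocount would turn every zero entry into a non-zero value, so a sparse matrix would become dense.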

scprep.transform.sqrt(data)[source]

Square root transform.

Parameters

data (array-like, shape=[n_samples, n_features]) – Input data

Returns

data – Square root transformed output data

Return type

array-like, shape=[n_samples, n_features]

Raises

ValueError – If data has negative values

Measurements

Functions:

gene_capture_count(data[, cutoff])

Measure the number of cells in which each gene has non-negligible counts.

gene_set_expression(data[, genes, …])

Measure the expression of a set of genes in each cell.

gene_variability(data[, kernel_size, …])

Measure the variability of each gene in a dataset.

library_size(data)

Measure the library size of each cell.

scprep.measure.gene_capture_count(data, cutoff=0)[source]

Measure the number of cells in which each gene has non-negligible counts.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • cutoff (float, optional (default: 0)) – Number of counts above which expression is deemed non-negligible

Returns

capture_count – Capture count for each gene

Return type

list-like, shape=[n_features]

scprep.measure.gene_set_expression(data, genes=None, library_size_normalize=False, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]

Measure the expression of a set of genes in each cell.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • genes (list-like, shape<=[n_features], optional (default: None)) – Integer column indices or string gene names included in gene set

  • library_size_normalize (bool, optional (default: False)) – Divide gene set expression by library size

  • starts_with (str or None, optional (default: None)) – If not None, select genes that start with this prefix

  • ends_with (str or None, optional (default: None)) – If not None, select genes that end with this suffix

  • exact_word (str, list-like or None, optional (default: None)) – If not None, select genes that contain this exact word.

  • regex (str or None, optional (default: None)) – If not None, select genes that match this regular expression

Returns

gene_set_expression – Sum over genes for each cell

Return type

list-like, shape=[n_samples]
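As an illustration of gene set selection and summation, the NumPy sketch below selects genes by prefix (the starts_with case) and sums them per cell, with optional division by library size. The gene names and counts are made up; this is not scprep's implementation.

```python
import numpy as np

gene_names = np.array(["MT-CO1", "ACTB", "MT-ND1", "GAPDH"])
data = np.array([
    [5, 10, 5, 20],
    [0, 8, 1, 11],
])

# starts_with='MT-': select genes whose name begins with the prefix.
in_gene_set = np.char.startswith(gene_names, "MT-")

# Sum over the selected genes for each cell.
expression = data[:, in_gene_set].sum(axis=1)

# library_size_normalize=True: divide by each cell's total counts.
normalized = expression / data.sum(axis=1)
```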

scprep.measure.gene_variability(data, kernel_size=0.005, smooth=5, return_means=False)[source]

Measure the variability of each gene in a dataset.

Variability is computed as the deviation from the rolling median of the mean-variance curve

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • kernel_size (float or int, optional (default: 0.005)) – Width of rolling median window. If a float between 0 and 1, the width is given by kernel_size * data.shape[1]. Otherwise should be an odd integer

  • smooth (int, optional (default: 5)) – Amount of smoothing to apply to the median filter

  • return_means (boolean, optional (default: False)) – If True, return the gene means

Returns

variability – Variability for each gene

Return type

list-like, shape=[n_features]

scprep.measure.library_size(data)[source]

Measure the library size of each cell.

Parameters

data (array-like, shape=[n_samples, n_features]) – Input data

Returns

library_size – Sum over all genes for each cell

Return type

list-like, shape=[n_samples]

Statistics

Functions:

EMD(x, y)

Compute Earth Mover’s Distance between samples.

differential_expression(X, Y[, measure, …])

Calculate the most significant genes between two datasets.

differential_expression_by_cluster(data, …)

Calculate the most significant genes for each cluster in a dataset.

knnDREMI(x, y[, k, n_bins, n_mesh, n_jobs, …])

Compute kNN conditional Density Resampled Estimate of Mutual Information.

mean_difference(X, Y)

Calculate the mean difference in genes between two datasets.

mutual_information(x, y[, bins])

Compute mutual information score with set number of bins.

pairwise_correlation(X, Y[, ignore_nan])

Compute pairwise Pearson correlation between columns of two matrices.

plot_knnDREMI(dremi, mutual_info, x, y, …)

Plot results of DREMI.

rank_sum_statistic(X, Y)

Calculate the Wilcoxon rank-sum (aka Mann-Whitney U) statistic.

t_statistic(X, Y)

Calculate Welch’s t statistic.

scprep.stats.EMD(x, y)[source]

Compute Earth Mover’s Distance between samples.

Calculates an approximation of Earth Mover’s Distance (also called Wasserstein distance) for 2 variables. This can be thought of as the distance between two probability distributions. This metric is useful for identifying differentially expressed genes between two groups of cells. For more information see https://en.wikipedia.org/wiki/Wasserstein_metric.

Note: this is a wrapper function for scipy.stats.wasserstein_distance and assumes the data is 1-dimensional.

Parameters
  • x (array-like, shape=[n_samples]) – Input data (feature 1)

  • y (array-like, shape=[n_samples]) – Input data (feature 2)

Returns

emd – Earth Mover’s Distance between x and y.

Return type

float
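
Because the function is described as a wrapper around scipy.stats.wasserstein_distance, the underlying quantity can be sketched by calling SciPy directly. The data below are synthetic and chosen so that the answer is easy to check: shifting a sample by a constant moves every point by that constant, so the distance equals the shift.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Synthetic 1-dimensional samples (illustrative only).
x = rng.normal(loc=0.0, scale=1.0, size=500)
y = x + 2.0  # same distribution shape, shifted by 2

# Earth Mover's Distance between the two empirical distributions.
emd = wasserstein_distance(x, y)
print(round(emd, 6))  # approximately 2.0, the size of the shift
```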

Examples

>>> import scprep
>>> data = scprep.io.load_csv("my_data.csv")
>>> emd = scprep.stats.EMD(data['GENE1'], data['GENE2'])

scprep.stats.differential_expression(X, Y, measure='difference', direction='both', gene_names=None, n_jobs=-2)[source]

Calculate the most significant genes between two datasets.

If using measure="emd", the test statistic is multiplied by the sign of the mean difference in order to allow for distinguishing between positive and negative shifts. To ignore this, use direction="both" to sort by the absolute value.

Parameters
  • X (array-like, shape=[n_cells, n_genes]) –

  • Y (array-like, shape=[m_cells, n_genes]) –

  • measure ({'difference', 'emd', 'ttest', 'ranksum'}, optional (default: ‘difference’)) – The measurement to be used to rank genes. ‘difference’ is the mean difference between genes. ‘emd’ refers to Earth Mover’s Distance. ‘ttest’ refers to Welch’s t-statistic. ‘ranksum’ refers to the Wilcoxon rank sum statistic (or the Mann-Whitney U statistic).

  • direction ({'up', 'down', 'both'}, optional (default: 'both')) – The direction in which to consider genes significant. If ‘up’, rank genes where X > Y. If ‘down’, rank genes where X < Y. If ‘both’, rank genes by absolute value.

  • gene_names (list-like or None, optional (default: None)) – List of gene names associated with the columns of X and Y

  • n_jobs (int, optional (default: -2)) – Number of threads to use if the measurement is parallelizable (currently used for EMD). If negative, -1 refers to all available cores.

Returns

result – Ordered DataFrame with a column “gene” and a column named measure.

Return type

pd.DataFrame
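
The effect of the direction parameter can be sketched with the ‘difference’ measure in plain NumPy. The helper `rank_by_mean_difference` below is hypothetical and written only for illustration; it returns a list of (gene, value) pairs rather than the DataFrame described above.

```python
import numpy as np


def rank_by_mean_difference(X, Y, gene_names, direction="both"):
    """Hypothetical helper: rank genes by mean(X) - mean(Y)."""
    diff = np.mean(X, axis=0) - np.mean(Y, axis=0)
    if direction == "up":
        order = np.argsort(-diff)  # largest X > Y first
    elif direction == "down":
        order = np.argsort(diff)  # largest X < Y first
    elif direction == "both":
        order = np.argsort(-np.abs(diff))  # largest absolute change first
    else:
        raise ValueError("direction must be 'up', 'down' or 'both'")
    return [(gene_names[i], float(diff[i])) for i in order]


# Toy data: 2 cells in X, 3 cells in Y, 3 genes.
X = np.array([[5.0, 1.0, 2.0], [7.0, 1.0, 2.0]])
Y = np.array([[1.0, 4.0, 2.0], [1.0, 6.0, 3.0], [1.0, 5.0, 1.0]])
genes = ["A", "B", "C"]

for direction in ("up", "down", "both"):
    ranked = rank_by_mean_difference(X, Y, genes, direction)
    print(direction, [name for name, _ in ranked])
```

Here the mean differences are 5, -4 and 0 for genes A, B and C, so ‘up’ ranks A first, ‘down’ ranks B first, and ‘both’ ranks A then B.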

scprep.stats.differential_expression_by_cluster(data, clusters, measure='difference', direction='both', gene_names=None, n_jobs=-2)[source]

Calculate the most significant genes for each cluster in a dataset.

Measurements are run for each cluster against the rest of the dataset.

Parameters
  • data (array-like, shape=[n_cells, n_genes]) –

  • clusters (list-like, shape=[n_cells]) –

  • measure ({'difference', 'emd', 'ttest', 'ranksum'}, optional) – (default: ‘difference’) The measurement to be used to rank genes. ‘difference’ is the mean difference between genes. ‘emd’ refers to Earth Mover’s Distance. ‘ttest’ refers to Welch’s t-statistic. ‘ranksum’ refers to the Wilcoxon rank sum statistic (or the Mann-Whitney U statistic).

  • direction ({'up', 'down', 'both'}, optional (default: 'both')) – The direction in which to consider genes significant, where X is the cluster and Y is the rest of the dataset. If ‘up’, rank genes where X > Y. If ‘down’, rank genes where X < Y. If ‘both’, rank genes by absolute value.

  • gene_names (list-like or None, optional (default: None)) – List of gene names associated with the columns of data

  • n_jobs (int, optional (default: -2)) – Number of threads to use if the measurement is parallelizable (currently used for EMD). If negative, -1 refers to all available cores.

Returns

result – Dictionary containing an ordered DataFrame with a column “gene” and a column named measure for each cluster.

Return type

dict(pd.DataFrame)
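
The "each cluster against the rest of the dataset" comparison can be sketched as a one-vs-rest loop. The helper `rank_genes_per_cluster` is hypothetical, uses only the mean difference ranked by absolute value, and returns plain lists instead of DataFrames.

```python
import numpy as np


def rank_genes_per_cluster(data, clusters, gene_names):
    """Hypothetical helper: one-vs-rest ranking by absolute mean difference."""
    data = np.asarray(data, dtype=float)
    clusters = np.asarray(clusters)
    result = {}
    for cluster in np.unique(clusters):
        in_cluster = clusters == cluster
        # Cells in the cluster play the role of X, all other cells of Y.
        diff = data[in_cluster].mean(axis=0) - data[~in_cluster].mean(axis=0)
        order = np.argsort(-np.abs(diff))
        result[cluster.item()] = [gene_names[i] for i in order]
    return result


# Toy data: 6 cells, 3 genes, 3 clusters with one marker gene each.
data = np.array([
    [10, 0, 0], [10, 0, 0],
    [0, 8, 0], [0, 8, 0],
    [0, 0, 6], [0, 0, 6],
])
clusters = np.array([0, 0, 1, 1, 2, 2])
ranking = rank_genes_per_cluster(data, clusters, ["A", "B", "C"])
print(ranking)
```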

scprep.stats.knnDREMI(x, y, k=10, n_bins=20, n_mesh=3, n_jobs=1, plot=False, return_drevi=False, **kwargs)[source]

Compute kNN conditional Density Resampled Estimate of Mutual Information.

Calculates k-Nearest Neighbor conditional Density Resampled Estimate of Mutual Information as defined in van Dijk et al., 2018 [1].

kNN-DREMI is an adaptation of DREMI (Krishnaswamy et al., 2014 [2]) for single cell RNA-sequencing data. DREMI captures the functional relationship between two genes across their entire dynamic range. The key change to kNN-DREMI is the replacement of the heat diffusion-based kernel-density estimator from Botev et al., 2010 [3] by a k-nearest neighbor-based density estimator (Sricharan et al., 2012 [4]), which has been shown to be an effective method for sparse and high dimensional datasets.

Note that kNN-DREMI, like Mutual Information and DREMI, is not symmetric. Here we are estimating I(Y|X).

Parameters
  • x (array-like, shape=[n_samples]) – Input data (independent feature)

  • y (array-like, shape=[n_samples]) – Input data (dependent feature)

  • k (int, range=[0:n_samples), optional (default: 10)) – Number of neighbors

  • n_bins (int, range=[0:inf), optional (default: 20)) – Number of bins for density resampling

  • n_mesh (int, range=[0:inf), optional (default: 3)) – In each bin, density will be calculated around (n_mesh ** 2) points

  • n_jobs (int, optional (default: 1)) – Number of threads used for kNN calculation

  • plot (bool, optional (default: False)) – If True, create plots of the data like those seen in Fig 5C/D of van Dijk et al. 2018 (doi:10.1016/j.cell.2018.05.061).

  • return_drevi (bool, optional (default: False)) – If True, return the DREVI normalized density matrix in addition to the DREMI score.

  • **kwargs (additional arguments for scprep.stats.plot_knnDREMI) –

Returns

  • dremi (float) – kNN conditional Density Resampled Estimate of Mutual Information

  • drevi (np.ndarray) – DREVI normalized density matrix. Only returned if return_drevi is True.

Examples

>>> import scprep
>>> data = scprep.io.load_csv("my_data.csv")
>>> dremi = scprep.stats.knnDREMI(data['GENE1'], data['GENE2'],
...                               plot=True,
...                               filename='dremi.png')

References

[1] van Dijk D et al. (2018), Recovering Gene Interactions from Single-Cell Data Using Data Diffusion, Cell.

[2] Krishnaswamy S et al. (2014), Conditional density-based analysis of T cell signaling in single-cell data, Science.

[3] Botev ZI et al. (2010), Kernel density estimation via diffusion, The Annals of Statistics.

[4] Sricharan K et al. (2012), Estimation of nonlinear functionals of densities with confidence, IEEE Transactions on Information Theory.

scprep.stats.mean_difference(X, Y)[source]

Calculate the mean difference in genes between two datasets.

In the case where the data has been log normalized, this is equivalent to the log fold change.

Parameters
  • X (array-like, shape=[n_cells, n_genes]) –

  • Y (array-like, shape=[m_cells, n_genes]) –

Returns

difference

Return type

list-like, shape=[n_genes]
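
The relation between the mean difference and fold change can be checked numerically. The sketch below assumes a plain log2 transform with no pseudocount, which is a simplification of typical preprocessing; under that assumption the difference of log-means equals the log2 ratio of the per-gene geometric means.

```python
import numpy as np

# Toy raw expression values: 2 cells x 2 genes in each group.
X = np.array([[4.0, 16.0], [16.0, 16.0]])
Y = np.array([[2.0, 16.0], [2.0, 4.0]])

# Mean difference computed on log2-transformed data.
difference = np.log2(X).mean(axis=0) - np.log2(Y).mean(axis=0)

# Ratio of per-gene geometric means on the raw scale.
geo_mean_X = np.exp(np.log(X).mean(axis=0))
geo_mean_Y = np.exp(np.log(Y).mean(axis=0))
log_fold_change = np.log2(geo_mean_X / geo_mean_Y)

print(difference.tolist())
print(log_fold_change.tolist())
```

Both vectors agree, with gene 1 changing four-fold (a difference of 2) and gene 2 two-fold (a difference of 1).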

scprep.stats.mutual_information(x, y, bins=8)[source]

Compute mutual information score with set number of bins.

Helper function for sklearn.metrics.mutual_info_score that builds a contingency table over a set number of bins. Credit: Warren Weckesser.

Parameters
  • x (array-like, shape=[n_samples]) – Input data (feature 1)

  • y (array-like, shape=[n_samples]) – Input data (feature 2)

  • bins (int or array-like, (default: 8)) – Passed to np.histogram2d to calculate a contingency table.

Returns

mi – Mutual information between x and y.

Return type

float
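
The binning step can be sketched in NumPy alone. The function `binned_mutual_information` below is a hypothetical stand-in that builds the contingency table with np.histogram2d and then evaluates the mutual information formula directly (in nats), instead of delegating to sklearn.metrics.mutual_info_score.

```python
import numpy as np


def binned_mutual_information(x, y, bins=8):
    """Hypothetical stand-in: mutual information (nats) from a 2D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)  # contingency table
    p_xy = counts / counts.sum()  # joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nonzero = p_xy > 0  # empty cells contribute nothing
    ratio = p_xy[nonzero] / (p_x @ p_y)[nonzero]
    return float(np.sum(p_xy[nonzero] * np.log(ratio)))


rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)

mi_dependent = binned_mutual_information(x, x + 0.1 * noise)
mi_independent = binned_mutual_information(x, noise)
print(mi_dependent > mi_independent)  # True
```

The estimate depends on the number of bins, so values are only comparable when the same bins setting is used.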

Examples

>>> import scprep
>>> data = scprep.io.load_csv("my_data.csv")
>>> mi = scprep.stats.mutual_information(data['GENE1'], data['GENE2'])

scprep.stats.pairwise_correlation(X, Y, ignore_nan=False)[source]

Compute pairwise Pearson correlation between columns of two matrices.

From https://stackoverflow.com/a/33651442/3996580

Parameters
  • X (array-like, shape=[n_samples, m_features]) – Input data

  • Y (array-like, shape=[n_samples, p_features]) – Input data

  • ignore_nan (bool, optional (default: False)) – If True, ignore NaNs, computing correlation over remaining values

Returns

cor

Return type

np.ndarray, shape=[m_features, p_features]
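
A minimal NumPy sketch of column-wise Pearson correlation between two matrices follows. The function `pairwise_pearson` is hypothetical, does not handle NaNs, and is checked against np.corrcoef rather than against scprep.

```python
import numpy as np


def pairwise_pearson(X, Y):
    """Hypothetical sketch: Pearson correlation of each X column with each Y column."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)  # center each column
    Yc = Y - Y.mean(axis=0)
    cov = Xc.T @ Yc  # unnormalized cross-covariance, shape (m, p)
    x_norm = np.sqrt(np.sum(Xc ** 2, axis=0))[:, None]
    y_norm = np.sqrt(np.sum(Yc ** 2, axis=0))[None, :]
    return cov / (x_norm * y_norm)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=(50, 2))
Y[:, 0] += X[:, 0]  # make one pair of columns correlated

cor = pairwise_pearson(X, Y)
print(cor.shape)  # (3, 2)
```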

scprep.stats.plot_knnDREMI(dremi, mutual_info, x, y, n_bins, n_mesh, density, bin_density, drevi, figsize=(12, 3.5), filename=None, xlabel='Feature 1', ylabel='Feature 2', title_fontsize=18, label_fontsize=16, dpi=150)[source]

Plot results of DREMI.

Create plots of the data like those seen in Fig 5C/D of van Dijk et al., 2018 [1]. Note that this function is not designed to be called manually. Instead, create plots by running scprep.stats.knnDREMI with plot=True.

Parameters
  • figsize (tuple, optional (default: (12, 3.5))) – Matplotlib figure size

  • filename (str or None, optional (default: None)) – If given, saves the results to a file

  • xlabel (str, optional (default: "Feature 1")) – The name of the gene shown on the x axis

  • ylabel (str, optional (default: "Feature 2")) – The name of the gene shown on the y axis

  • title_fontsize (int, optional (default: 18)) – Font size for figure titles

  • label_fontsize (int, optional (default: 16)) – Font size for axis labels

  • dpi (int, optional (default: 150)) – Dots per inch for saved figure

scprep.stats.rank_sum_statistic(X, Y)[source]

Calculate the Wilcoxon rank-sum (aka Mann-Whitney U) statistic.

Parameters
  • X (array-like, shape=[n_cells, n_genes]) –

  • Y (array-like, shape=[m_cells, n_genes]) –

Returns

rank_sum_statistic

Return type

list-like, shape=[n_genes]
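
A per-gene rank-sum statistic can be sketched with scipy.stats.ranksums, looping over columns. This is an illustration of the test only; the scaling of the statistic returned by scprep may differ from SciPy's normalized z-score.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)

# Synthetic data: 30 and 40 cells, 3 genes.
X = rng.normal(size=(30, 3))
Y = rng.normal(size=(40, 3))
Y[:, 0] += 3.0  # gene 0 is clearly higher in Y

# One statistic per gene (column); negative values mean X tends to be lower.
stat = np.array(
    [ranksums(X[:, j], Y[:, j]).statistic for j in range(X.shape[1])]
)
print(stat.shape)  # (3,)
```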

scprep.stats.t_statistic(X, Y)[source]

Calculate Welch’s t statistic.

Assumes data of unequal number of samples and unequal variance

Parameters
  • X (array-like, shape=[n_cells, n_genes]) –

  • Y (array-like, shape=[m_cells, n_genes]) –

Returns

t_statistic

Return type

list-like, shape=[n_genes]
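
Welch's t statistic divides the difference in means by the combined standard error, using each group's own variance and sample size. The sketch below writes this out in NumPy and compares it with scipy.stats.ttest_ind(equal_var=False); the helper name `welch_t` is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind


def welch_t(X, Y):
    """Hypothetical sketch: Welch's t statistic for each column (gene)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    mean_diff = X.mean(axis=0) - Y.mean(axis=0)
    # Unpooled standard error: each group keeps its own variance and size.
    standard_error = np.sqrt(
        X.var(axis=0, ddof=1) / X.shape[0] + Y.var(axis=0, ddof=1) / Y.shape[0]
    )
    return mean_diff / standard_error


rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=1.0, size=(25, 4))
Y = rng.normal(loc=0.0, scale=3.0, size=(60, 4))

t = welch_t(X, Y)
reference = ttest_ind(X, Y, axis=0, equal_var=False).statistic
print(np.allclose(t, reference))  # True
```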

Plotting

Functions:

histogram(data[, bins, log, cutoff, …])

Plot a histogram.

plot_gene_set_expression(data[, genes, …])

Plot the histogram of the expression of a gene set.

plot_library_size(data[, bins, log, cutoff, …])

Plot the library size histogram.

jitter(labels, values[, sigma, c, cmap, …])

Create a jitter plot.

marker_plot(data, clusters, markers[, …])

Plot marker gene enrichment.

rotate_scatter3d(data[, filename, …])

Create a rotating 3D scatter plot.

scatter(x, y[, z, c, cmap, cmap_scale, s, …])

Create a scatter plot.

scatter2d(data[, c, cmap, cmap_scale, s, …])

Create a 2D scatter plot.

scatter3d(data[, c, cmap, cmap_scale, s, …])

Create a 3D scatter plot.

scree_plot(singular_values[, cumulative, …])

Plot the explained variance of each principal component.

plot_gene_variability(data[, kernel_size, …])

Plot the histogram of gene variability.

scprep.plot.histogram(data, bins=100, log=False, cutoff=None, percentile=None, ax=None, figsize=None, xlabel=None, ylabel='Number of cells', title=None, fontsize=None, histtype='stepfilled', label=None, legend=True, alpha=None, filename=None, dpi=None, **kwargs)[source]

Plot a histogram.

Parameters
  • data (array-like, shape=[n_samples]) – Input data. Multiple datasets may be given as a list of array-likes.

  • bins (int, optional (default: 100)) – Number of bins to draw in the histogram

  • log (bool, or {'x', 'y'}, optional (default: False)) – If True, plot both axes on a log scale. If ‘x’ or ‘y’, only plot the given axis on a log scale. If False, plot both axes on a linear scale.

  • cutoff (float or None, optional (default: None)) – Absolute cutoff at which to draw a vertical line. Only one of cutoff and percentile may be given.

  • percentile (float or None, optional (default: None)) – Percentile between 0 and 100 at which to draw a vertical line. Only one of cutoff and percentile may be given.

  • ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.

  • figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)

  • [x,y]label (str, optional) – Labels to display on the x and y axis.

  • title (str or None, optional (default: None)) – Axis title.

  • fontsize (float or None (default: None)) – Base font size.

  • histtype ({'bar', 'barstacked', 'step', 'stepfilled'}, optional) – (default: ‘stepfilled’) The type of histogram to draw. ‘bar’ is a traditional bar-type histogram. If multiple data are given the bars are arranged side by side. ‘barstacked’ is a bar-type histogram where multiple data are stacked on top of each other. ‘step’ generates a lineplot that is by default unfilled. ‘stepfilled’ generates a lineplot that is by default filled.

  • label (str or None, optional (default: None)) – String, or sequence of strings to match multiple datasets.

  • legend (bool, optional (default: True)) – Show the legend if label is given.

  • alpha (float, optional (default: 1 for a single dataset, 0.5 for multiple)) – Histogram transparency

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **kwargs (additional arguments for matplotlib.pyplot.hist) –

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes
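
The cutoff and percentile parameters describe the same vertical line in two ways. As a sketch, assuming the percentile is converted to an absolute value with np.percentile (how scprep does this internally is not stated here), the equivalent cutoff can be computed as follows.

```python
import numpy as np

# Illustrative values: the integers 1 to 100.
values = np.arange(1, 101)

percentile = 90
# Absolute value below which 90% of the data fall (linear interpolation).
cutoff = np.percentile(values, percentile)

fraction_above = np.mean(values > cutoff)
print(cutoff, fraction_above)
```

Passing either this cutoff or percentile=90 would then mark the same position on the x axis, which is why only one of the two may be given.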

scprep.plot.plot_gene_set_expression(data, genes=None, starts_with=None, ends_with=None, exact_word=None, regex=None, bins=100, log=False, cutoff=None, percentile=None, library_size_normalize=False, ax=None, figsize=None, xlabel='Gene expression', title=None, fontsize=None, filename=None, dpi=None, **kwargs)[source]

Plot the histogram of the expression of a gene set.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data. Multiple datasets may be given as a list of array-likes.

  • genes (list-like, optional (default: None)) – Integer column indices or string gene names included in gene set

  • starts_with (str or None, optional (default: None)) – If not None, select genes that start with this prefix

  • ends_with (str or None, optional (default: None)) – If not None, select genes that end with this suffix

  • exact_word (str, list-like or None, optional (default: None)) – If not None, select genes that contain this exact word.

  • regex (str or None, optional (default: None)) – If not None, select genes that match this regular expression

  • bins (int, optional (default: 100)) – Number of bins to draw in the histogram

  • log (bool, or {'x', 'y'}, optional (default: False)) – If True, plot both axes on a log scale. If ‘x’ or ‘y’, only plot the given axis on a log scale. If False, plot both axes on a linear scale.

  • cutoff (float or None, optional (default: None)) – Absolute cutoff at which to draw a vertical line. Only one of cutoff and percentile may be given.

  • percentile (float or None, optional (default: None)) – Percentile between 0 and 100 at which to draw a vertical line. Only one of cutoff and percentile may be given.

  • library_size_normalize (bool, optional (default: False)) – Divide gene set expression by library size

  • ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.

  • figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)

  • [x,y]label (str, optional) – Labels to display on the x and y axis.

  • title (str or None, optional (default: None)) – Axis title.

  • fontsize (float or None (default: None)) – Base font size.

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **kwargs (additional arguments for matplotlib.pyplot.hist) –

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes

scprep.plot.plot_library_size(data, bins=100, log=True, cutoff=None, percentile=None, ax=None, figsize=None, xlabel='Library size', title=None, fontsize=None, filename=None, dpi=None, **kwargs)[source]

Plot the library size histogram.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data. Multiple datasets may be given as a list of array-likes.

  • bins (int, optional (default: 100)) – Number of bins to draw in the histogram

  • log (bool, or {'x', 'y'}, optional (default: True)) – If True, plot both axes on a log scale. If ‘x’ or ‘y’, only plot the given axis on a log scale. If False, plot both axes on a linear scale.

  • cutoff (float or None, optional (default: None)) – Absolute cutoff at which to draw a vertical line. Only one of cutoff and percentile may be given.

  • percentile (float or None, optional (default: None)) – Percentile between 0 and 100 at which to draw a vertical line. Only one of cutoff and percentile may be given.

  • ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.

  • figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)

  • [x,y]label (str, optional) – Labels to display on the x and y axis.

  • title (str or None, optional (default: None)) – Axis title.

  • fontsize (float or None (default: None)) – Base font size.

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **kwargs (additional arguments for matplotlib.pyplot.hist) –

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes

scprep.plot.jitter(labels, values, sigma=0.1, c=None, cmap=None, cmap_scale='linear', s=None, mask=None, plot_means=True, means_s=100, means_c='lightgrey', discrete=None, ax=None, legend=None, colorbar=None, shuffle=True, figsize=None, ticks=True, xticks=None, yticks=None, ticklabels=True, xticklabels=None, yticklabels=None, xlabel=None, ylabel=None, title=None, fontsize=None, legend_title=None, legend_loc='best', legend_anchor=None, vmin=None, vmax=None, filename=None, dpi=None, **plot_kwargs)[source]

Create a jitter plot.

Creates a 2D scatterplot showing the distribution of values for points that have associated labels.

Parameters
  • labels (array-like, shape=[n_cells]) – Class labels associated with each point.

  • values (array-like, shape=[n_cells]) – Values associated with each cell

  • sigma (float, optional (default: 0.1)) – Adjusts the amount of jitter.

  • c (list-like or None, optional (default: None)) – Color vector. Can be a single color value (RGB, RGBA, or named matplotlib colors), an array of these of length n_samples, or a list of discrete or continuous values of any data type. If c is not a single or list of matplotlib colors, the values in c will be used to populate the legend / colorbar with colors from cmap

  • cmap (matplotlib colormap, str, dict or None, optional (default: None)) – matplotlib colormap. If None, uses tab20 for discrete data and inferno for continuous data. If a dictionary, expects one key for every unique value in c, where values are valid matplotlib colors (hsv, rgb, rgba, or named colors)

  • cmap_scale ({‘linear’, ‘log’, ‘symlog’, ‘sqrt’} or matplotlib.colors.Normalize, optional (default: ‘linear’)) – Colormap normalization scale. For advanced use, see <https://matplotlib.org/users/colormapnorms.html>

  • s (float, optional (default: None)) – Point size. If None, set to 200 / sqrt(n_samples)

  • mask (list-like, optional (default: None)) – boolean mask to hide data points

  • plot_means (bool, optional (default: True)) – If True, plot the mean value for each label.

  • means_s (float, optional (default: 100)) – Point size for mean values.

  • means_c (string, list-like or matplotlib color, optional (default: 'lightgrey')) – Point color(s) for mean values.

  • discrete (bool or None, optional (default: None)) – If True, the legend is categorical. If False, the legend is a colorbar. If None, discreteness is detected automatically. Data containing non-numeric c is always discrete, and numeric data with 20 or fewer unique values is discrete.

  • ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created

  • legend (bool, optional (default: None)) – States whether or not to create a legend. If data is continuous, the legend is a colorbar. If None, a legend is created where possible

  • colorbar (bool, optional (default: None)) – Synonym for legend

  • shuffle (bool, optional (default: True)) – If True, shuffles the order of points on the plot.

  • figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.

  • ticks (True, False, or list-like (default: True)) – If True, keeps default axis ticks. If False, removes axis ticks. If a list, sets custom axis ticks

  • {x,y}ticks (True, False, or list-like (default: None)) – If set, overrides ticks

  • ticklabels (True, False, or list-like (default: True)) – If True, keeps default axis tick labels. If False, removes axis tick labels. If a list, sets custom axis tick labels

  • {x,y}ticklabels (True, False, or list-like (default: None)) – If set, overrides ticklabels

  • {x,y}label (str or None (default : None)) – Axis labels. If None, no label is set.

  • title (str or None (default: None)) – axis title. If None, no title is set.

  • fontsize (float or None (default: None)) – Base font size.

  • legend_title (str (default: None)) – title for the colorbar or legend

  • legend_loc (int or string or pair of floats, default: 'best') – Matplotlib legend location. Only used for discrete data. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.

  • legend_anchor (BboxBase, 2-tuple, or 4-tuple) – Box that is used to position the legend in conjunction with loc. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.

  • vmin, vmax (float, optional (default: None)) – Range of values to use as the range for the colormap. Only used if data is continuous

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **plot_kwargs (keyword arguments) – Extra arguments passed to matplotlib.pyplot.scatter.

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes
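
The idea behind a jitter plot can be sketched without any plotting: each label gets an integer x position, and a small random offset spreads the points so they do not overlap. The use of Gaussian noise with standard deviation sigma is an assumption made for this sketch, as are the toy labels and values.

```python
import numpy as np

rng = np.random.default_rng(42)

labels = np.array(["a", "b", "a", "c", "b", "a"])
values = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 5.0])
sigma = 0.1  # amount of jitter (assumed here to be a Gaussian std. dev.)

# Map each label to an integer position on the x axis.
unique_labels, label_idx = np.unique(labels, return_inverse=True)

# Jittered x coordinates: label position plus a small random offset.
x = label_idx + rng.normal(0, sigma, size=len(labels))

# Mean value per label, as drawn when plot_means=True.
means = np.array(
    [values[label_idx == i].mean() for i in range(len(unique_labels))]
)
print(unique_labels.tolist(), means.tolist())  # ['a', 'b', 'c'] [3.0, 4.0, 4.0]
```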

scprep.plot.marker_plot(data, clusters, markers, gene_names=None, normalize_expression=True, normalize_emd=True, reorder_tissues=True, reorder_markers=True, cmap='magma', title=None, figsize=None, ax=None, fontsize=None)[source]

Plot marker gene enrichment.

Generate a plot indicating the expression level and enrichment of a set of marker genes for each cluster.

Color of each point indicates the expression of each gene in each cluster. The size of each point indicates how differentially expressed each gene is in each cluster.

Parameters
  • data (array-like, shape=[n_cells, n_genes]) – Gene expression data for calculating expression statistics.

  • clusters (list-like, shape=[n_cells]) – Cluster assignments for each cell. Should be ints like the output of most sklearn.cluster methods.

  • markers (dict or list-like) – If a dictionary, keys represent tissues and values are lists of marker genes for each tissue. If a list, a list of marker genes.

  • gene_names (list-like, shape=[n_genes]) – List of gene names.

  • normalize_{expression,emd} (bool, optional (default: True)) – Normalize the expression and EMD of each row.

  • reorder_{tissues,markers} (bool, optional (default: True)) – Reorder tissues and markers according to hierarchical clustering.

  • cmap (str or matplotlib colormap, optional (default: 'magma')) – Colormap with which to color points.

  • title (str or None, optional (default: None)) – Title for the plot

  • figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)

  • ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.

  • fontsize (int or None, optional (default: None)) – Base fontsize.

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes

Example

>>> import scprep
>>> markers = {'Adaxial - Immature': ['myl10', 'myod1'],
...            'Adaxial - Mature': ['myog'],
...            'Presomitic mesoderm': ['tbx6', 'msgn1', 'tbx16'],
...            'Forming somites': ['mespba', 'ripply2'],
...            'Somites': ['meox1', 'ripply1', 'aldh1a2']}
>>> # data, clusters and gene_names are assumed to be loaded already
>>> scprep.plot.marker_plot(data, clusters, markers,
...                         gene_names=gene_names,
...                         title="Tailbud - PSM")

scprep.plot.rotate_scatter3d(data, filename=None, rotation_speed=30, fps=10, ax=None, figsize=None, elev=None, ipython_html='jshtml', dpi=None, **kwargs)[source]

Create a rotating 3D scatter plot.

Builds upon matplotlib.pyplot.scatter with nice defaults and handles categorical colors / legends better.

Parameters
  • data (array-like, phate.PHATE or scanpy.AnnData) – Input data. Only the first three dimensions are used.

  • filename (str, optional (default: None)) – If not None, saves a .gif or .mp4 with the output

  • rotation_speed (float, optional (default: 30)) – Speed of axis rotation, in degrees per second

  • fps (int, optional (default: 10)) – Frames per second. Increase this for a smoother animation

  • ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created

  • figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.

  • elev (int, optional (default: None)) – Elevation angle of viewpoint from horizontal, in degrees

  • ipython_html ({'html5', 'jshtml'}) – which html writer to use if using a Jupyter Notebook

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **kwargs (keyword arguments) – See scprep.plot.scatter3d.

Returns

ani – animation object

Return type

matplotlib.animation.FuncAnimation
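
The rotation_speed and fps parameters together determine how far the view turns between frames. The arithmetic below is a sketch using the default values and assumes the animation covers exactly one full 360 degree rotation, which is not stated above.

```python
rotation_speed = 30  # degrees per second (default)
fps = 10  # frames per second (default)

# Angle the view turns between two consecutive frames.
degrees_per_frame = rotation_speed / fps

# Duration and frame count of one full rotation (assumed animation length).
seconds_per_rotation = 360 / rotation_speed
n_frames = round(360 / degrees_per_frame)

print(degrees_per_frame, seconds_per_rotation, n_frames)  # 3.0 12.0 120
```

Raising fps at a fixed rotation_speed shrinks the angle per frame, which is why it gives a smoother animation at the cost of more frames.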

Examples

>>> import scprep
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(0, 1, [200, 3])
>>> # Continuous color vector
>>> colors = data[:, 0]
>>> scprep.plot.rotate_scatter3d(data, c=colors, filename="animation.gif")
>>> # Discrete color vector with custom colormap
>>> colors = np.random.choice(['a','b'], data.shape[0], replace=True)
>>> data[colors == 'a'] += 5
>>> scprep.plot.rotate_scatter3d(
...     data,
...     c=colors,
...     cmap={'a' : [1,0,0,1], 'b' : 'xkcd:sky blue'},
...     filename="animation.mp4"
... )

scprep.plot.scatter(x, y, z=None, c=None, cmap=None, cmap_scale='linear', s=None, mask=None, discrete=None, ax=None, legend=None, colorbar=None, shuffle=True, figsize=None, ticks=True, xticks=None, yticks=None, zticks=None, ticklabels=True, xticklabels=None, yticklabels=None, zticklabels=None, label_prefix=None, xlabel=None, ylabel=None, zlabel=None, title=None, fontsize=None, legend_title=None, legend_loc='best', legend_anchor=None, legend_ncol=None, vmin=None, vmax=None, elev=None, azim=None, filename=None, dpi=None, **plot_kwargs)[source]

Create a scatter plot.

Builds upon matplotlib.pyplot.scatter with nice defaults and handles categorical colors / legends better. For easy access, use scatter2d or scatter3d.

Parameters
  • x (list-like) – data for x axis

  • y (list-like) – data for y axis

  • z (list-like, optional (default: None)) – data for z axis

  • c (list-like or None, optional (default: None)) – Color vector. Can be a single color value (RGB, RGBA, or named matplotlib colors), an array of these of length n_samples, or a list of discrete or continuous values of any data type. If c is not a single or list of matplotlib colors, the values in c will be used to populate the legend / colorbar with colors from cmap

  • cmap (matplotlib colormap, str, dict or None, optional (default: None)) – matplotlib colormap. If None, uses tab20 for discrete data and inferno for continuous data. If a dictionary, expects one key for every unique value in c, where values are valid matplotlib colors (hsv, rgb, rgba, or named colors)

  • cmap_scale ({‘linear’, ‘log’, ‘symlog’, ‘sqrt’} or matplotlib.colors.Normalize, optional (default: ‘linear’)) – Colormap normalization scale. For advanced use, see <https://matplotlib.org/users/colormapnorms.html>

  • s (float, optional (default: None)) – Point size. If None, set to 200 / sqrt(n_samples)

  • mask (list-like, optional (default: None)) – boolean mask to hide data points

  • discrete (bool or None, optional (default: None)) – If True, the legend is categorical. If False, the legend is a colorbar. If None, discreteness is detected automatically. Data containing non-numeric c is always discrete, and numeric data with 20 or fewer unique values is discrete.

  • ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created

  • legend (bool, optional (default: None)) – States whether or not to create a legend. If data is continuous, the legend is a colorbar. If None, a legend is created where possible

  • colorbar (bool, optional (default: None)) – Synonym for legend

  • shuffle (bool, optional (default: True)) – If True, shuffles the order of points on the plot.

  • figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.

  • ticks (True, False, or list-like (default: True)) – If True, keeps default axis ticks. If False, removes axis ticks. If a list, sets custom axis ticks

  • {x,y,z}ticks (True, False, or list-like (default: None)) – If set, overrides ticks

  • ticklabels (True, False, or list-like (default: True)) – If True, keeps default axis tick labels. If False, removes axis tick labels. If a list, sets custom axis tick labels

  • {x,y,z}ticklabels (True, False, or list-like (default: None)) – If set, overrides ticklabels

  • label_prefix (str or None (default: None)) – Prefix for all axis labels. Axes will be labelled label_prefix1, label_prefix2, etc. Can be overridden by setting xlabel, ylabel, and zlabel.

  • {x,y,z}label (str, None or False (default : None)) – Axis labels. Overrides the automatic label given by label_prefix. If None and label_prefix is None, no label is set unless the data is a pandas Series, in which case the series name is used. Override this behavior with {x,y,z}label=False

  • title (str or None (default: None)) – axis title. If None, no title is set.

  • fontsize (float or None (default: None)) – Base font size.

  • legend_title (str (default: None)) – title for the colorbar or legend

  • legend_loc (int or string or pair of floats, default: 'best') – Matplotlib legend location. Only used for discrete data. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.

  • legend_anchor (BboxBase, 2-tuple, or 4-tuple) – Box that is used to position the legend in conjunction with loc. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.

  • legend_ncol (int or None, optional (default: None)) – Number of columns to show in the legend. If None, defaults to a maximum of entries per column.

  • vmin, vmax (float, optional (default: None)) – Range of values to use as the range for the colormap. Only used if data is continuous

  • elev (int, optional (default: None)) – Elevation angle of viewpoint from horizontal for 3D plots, in degrees

  • azim (int, optional (default: None)) – Azimuth angle in x-y plane of viewpoint for 3D plots, in degrees

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **plot_kwargs (keyword arguments) – Extra arguments passed to matplotlib.pyplot.scatter.

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes

Examples

>>> import scprep
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(0, 1, [200, 3])
>>> # Continuous color vector
>>> colors = data[:, 0]
>>> scprep.plot.scatter(x=data[:, 0], y=data[:, 1], c=colors)
>>> # Discrete color vector with custom colormap
>>> colors = np.random.choice(['a','b'], data.shape[0], replace=True)
>>> data[colors == 'a'] += 5
>>> scprep.plot.scatter(x=data[:, 0], y=data[:, 1], z=data[:, 2],
...                     c=colors, cmap={'a' : [1,0,0,1], 'b' : 'xkcd:sky blue'})
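The precedence between label_prefix and the {x,y,z}label parameters described above can be sketched in plain Python. This is a minimal illustration rather than scprep's implementation: the helper name axis_labels is hypothetical, and the pandas Series name fallback is left out.

```python
def axis_labels(label_prefix=None, xlabel=None, ylabel=None, zlabel=None, n_dims=2):
    """Resolve axis labels using the documented precedence (hypothetical helper)."""
    explicit = [xlabel, ylabel, zlabel][:n_dims]
    resolved = []
    for axis_number, label in enumerate(explicit, start=1):
        if label is False:
            # An explicit False suppresses the label entirely.
            resolved.append(None)
        elif label is not None:
            # An explicit label overrides the automatic one.
            resolved.append(label)
        elif label_prefix is not None:
            # Automatic label: the prefix followed by the 1-based axis number.
            resolved.append(f"{label_prefix}{axis_number}")
        else:
            resolved.append(None)
    return resolved


print(axis_labels("PC", n_dims=3))
print(axis_labels("PC", ylabel="Time"))
```

Here None stands for "no label set"; only the explicit arguments and the prefix are consulted.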
scprep.plot.scatter2d(data, c=None, cmap=None, cmap_scale='linear', s=None, mask=None, discrete=None, ax=None, legend=None, colorbar=None, shuffle=True, figsize=None, ticks=True, xticks=None, yticks=None, ticklabels=True, xticklabels=None, yticklabels=None, label_prefix=None, xlabel=None, ylabel=None, title=None, fontsize=None, legend_title=None, legend_loc='best', legend_anchor=None, legend_ncol=None, filename=None, dpi=None, **plot_kwargs)[source]

Create a 2D scatter plot.

Builds upon matplotlib.pyplot.scatter with nice defaults and handles categorical colors / legends better.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data. Only the first two components will be used.

  • c (list-like or None, optional (default: None)) – Color vector. Can be a single color value (RGB, RGBA, or named matplotlib colors), an array of these of length n_samples, or a list of discrete or continuous values of any data type. If c is not a single or list of matplotlib colors, the values in c will be used to populate the legend / colorbar with colors from cmap

  • cmap (matplotlib colormap, str, dict, list or None, optional (default: None)) – matplotlib colormap. If None, uses tab20 for discrete data and inferno for continuous data. If a list, expects one color for every unique value in c, otherwise interpolates between given colors for continuous data. If a dictionary, expects one key for every unique value in c, where values are valid matplotlib colors (hsv, rgb, rgba, or named colors)

  • cmap_scale ({'linear', 'log', 'symlog', 'sqrt'} or matplotlib.colors.Normalize, optional (default: 'linear')) – Colormap normalization scale. For advanced use, see <https://matplotlib.org/users/colormapnorms.html>

  • s (float, optional (default: None)) – Point size. If None, set to 200 / sqrt(n_samples)

  • mask (list-like, optional (default: None)) – boolean mask to hide data points

  • discrete (bool or None, optional (default: None)) – If True, the legend is categorical. If False, the legend is a colorbar. If None, discreteness is detected automatically. Data containing non-numeric c is always discrete, and numeric data with 20 or fewer unique values is discrete.

  • ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created

  • legend (bool, optional (default: None)) – States whether or not to create a legend. If data is continuous, the legend is a colorbar. If None, a legend is created where possible.

  • colorbar (bool, optional (default: None)) – Synonym for legend

  • shuffle (bool, optional (default: True)) – If True, shuffles the order of points on the plot.

  • figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.

  • ticks (True, False, or list-like (default: True)) – If True, keeps default axis ticks. If False, removes axis ticks. If a list, sets custom axis ticks

  • {x,y}ticks (True, False, or list-like (default: None)) – If set, overrides ticks

  • ticklabels (True, False, or list-like (default: True)) – If True, keeps default axis tick labels. If False, removes axis tick labels. If a list, sets custom axis tick labels

  • {x,y}ticklabels (True, False, or list-like (default: None)) – If set, overrides ticklabels

  • label_prefix (str or None (default: None)) – Prefix for all axis labels. Axes will be labelled label_prefix1, label_prefix2, etc. Can be overridden by setting xlabel and ylabel.

  • {x,y}label (str, None or False (default: None)) – Axis labels. Overrides the automatic label given by label_prefix. If None and label_prefix is None, no label is set unless the data is a pandas Series, in which case the series name is used. Override this behavior with {x,y}label=False

  • title (str or None (default: None)) – axis title. If None, no title is set.

  • fontsize (float or None (default: None)) – Base font size.

  • legend_title (str (default: None)) – Title for the colorbar or legend.

  • legend_loc (int or string or pair of floats, default: 'best') – Matplotlib legend location. Only used for discrete data. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.

  • legend_anchor (BboxBase, 2-tuple, or 4-tuple) – Box that is used to position the legend in conjunction with loc. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.

  • legend_ncol (int or None, optional (default: None)) – Number of columns to show in the legend. If None, defaults to a maximum of entries per column.

  • vmin, vmax – Range of values to use as the range for the colormap. Only used if data is continuous.

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **plot_kwargs (keyword arguments) – Extra arguments passed to matplotlib.pyplot.scatter.

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes

Examples

>>> import scprep
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(0, 1, [200, 2])
>>> # Continuous color vector
>>> colors = data[:, 0]
>>> scprep.plot.scatter2d(data, c=colors)
>>> # Discrete color vector with custom colormap
>>> colors = np.random.choice(['a','b'], data.shape[0], replace=True)
>>> data[colors == 'a'] += 10
>>> scprep.plot.scatter2d(
...     data, c=colors, cmap={'a' : [1,0,0,1], 'b' : 'xkcd:sky blue'}
... )
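Two of the defaults described in the parameter list, the automatic discreteness check and the default point size, are simple enough to sketch in plain Python. The helper names below are hypothetical and the code only mirrors the rules as documented; how scprep handles edge cases such as missing values is not covered.

```python
import math
from numbers import Number


def is_discrete(c, max_unique=20):
    """Documented heuristic: non-numeric values are always discrete, and
    numeric values are discrete when they have max_unique or fewer unique
    values (hypothetical helper)."""
    values = list(c)
    if any(not isinstance(v, Number) for v in values):
        return True
    return len(set(values)) <= max_unique


def default_point_size(n_samples):
    """Documented default for s when it is None: 200 / sqrt(n_samples)."""
    return 200 / math.sqrt(n_samples)


print(is_discrete(["a", "b"] * 50))
print(is_discrete([i / 100 for i in range(100)]))
print(default_point_size(400))
```

Passing discrete=True or discrete=False explicitly bypasses this kind of detection, which is useful for integer-valued data with many levels.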
scprep.plot.scatter3d(data, c=None, cmap=None, cmap_scale='linear', s=None, mask=None, discrete=None, ax=None, legend=None, colorbar=None, shuffle=True, figsize=None, ticks=True, xticks=None, yticks=None, zticks=None, ticklabels=True, xticklabels=None, yticklabels=None, zticklabels=None, label_prefix=None, xlabel=None, ylabel=None, zlabel=None, title=None, fontsize=None, legend_title=None, legend_loc='best', legend_anchor=None, legend_ncol=None, elev=None, azim=None, filename=None, dpi=None, **plot_kwargs)[source]

Create a 3D scatter plot.

Builds upon matplotlib.pyplot.scatter with nice defaults and handles categorical colors / legends better.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data. Only the first three components will be used.

  • c (list-like or None, optional (default: None)) – Color vector. Can be a single color value (RGB, RGBA, or named matplotlib colors), an array of these of length n_samples, or a list of discrete or continuous values of any data type. If c is not a single or list of matplotlib colors, the values in c will be used to populate the legend / colorbar with colors from cmap

  • cmap (matplotlib colormap, str, dict, list or None, optional (default: None)) – matplotlib colormap. If None, uses tab20 for discrete data and inferno for continuous data. If a list, expects one color for every unique value in c, otherwise interpolates between given colors for continuous data. If a dictionary, expects one key for every unique value in c, where values are valid matplotlib colors (hsv, rgb, rgba, or named colors)

  • cmap_scale ({'linear', 'log', 'symlog', 'sqrt'} or matplotlib.colors.Normalize, optional (default: 'linear')) – Colormap normalization scale. For advanced use, see <https://matplotlib.org/users/colormapnorms.html>

  • s (float, optional (default: None)) – Point size. If None, set to 200 / sqrt(n_samples)

  • mask (list-like, optional (default: None)) – boolean mask to hide data points

  • discrete (bool or None, optional (default: None)) – If True, the legend is categorical. If False, the legend is a colorbar. If None, discreteness is detected automatically. Data containing non-numeric c is always discrete, and numeric data with 20 or fewer unique values is discrete.

  • ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created

  • legend (bool, optional (default: None)) – States whether or not to create a legend. If data is continuous, the legend is a colorbar. If None, a legend is created where possible.

  • colorbar (bool, optional (default: None)) – Synonym for legend

  • shuffle (bool, optional (default: True)) – If True, shuffles the order of points on the plot.

  • figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.

  • ticks (True, False, or list-like (default: True)) – If True, keeps default axis ticks. If False, removes axis ticks. If a list, sets custom axis ticks

  • {x,y,z}ticks (True, False, or list-like (default: None)) – If set, overrides ticks

  • ticklabels (True, False, or list-like (default: True)) – If True, keeps default axis tick labels. If False, removes axis tick labels. If a list, sets custom axis tick labels

  • {x,y,z}ticklabels (True, False, or list-like (default: None)) – If set, overrides ticklabels

  • label_prefix (str or None (default: None)) – Prefix for all axis labels. Axes will be labelled label_prefix1, label_prefix2, etc. Can be overridden by setting xlabel, ylabel, and zlabel.

  • {x,y,z}label (str, None or False (default: None)) – Axis labels. Overrides the automatic label given by label_prefix. If None and label_prefix is None, no label is set unless the data is a pandas Series, in which case the series name is used. Override this behavior with {x,y,z}label=False

  • title (str or None (default: None)) – axis title. If None, no title is set.

  • fontsize (float or None (default: None)) – Base font size.

  • legend_title (str (default: None)) – Title for the colorbar or legend.

  • legend_loc (int or string or pair of floats, default: 'best') – Matplotlib legend location. Only used for discrete data. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.

  • legend_anchor (BboxBase, 2-tuple, or 4-tuple) – Box that is used to position the legend in conjunction with loc. See <https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html> for details.

  • legend_ncol (int or None, optional (default: None)) – Number of columns to show in the legend. If None, defaults to a maximum of entries per column.

  • vmin, vmax – Range of values to use as the range for the colormap. Only used if data is continuous.

  • elev (int, optional (default: None)) – Elevation angle of viewpoint from horizontal, in degrees

  • azim (int, optional (default: None)) – Azimuth angle in x-y plane of viewpoint, in degrees

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **plot_kwargs (keyword arguments) – Extra arguments passed to matplotlib.pyplot.scatter.

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes

Examples

>>> import scprep
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(0, 1, [200, 3])
>>> # Continuous color vector
>>> colors = data[:, 0]
>>> scprep.plot.scatter3d(data, c=colors)
>>> # Discrete color vector with custom colormap
>>> colors = np.random.choice(['a','b'], data.shape[0], replace=True)
>>> data[colors == 'a'] += 5
>>> scprep.plot.scatter3d(
...     data, c=colors, cmap={'a' : [1,0,0,1], 'b' : 'xkcd:sky blue'}
... )
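To make the effect of cmap_scale, vmin, and vmax concrete, the sketch below maps values onto the [0, 1] range that a colormap consumes. It is an illustration under stated assumptions, not scprep's code: the helper name normalize is hypothetical, 'sqrt' is assumed to behave like a power law with exponent 0.5, 'symlog' is omitted because it needs an extra linear threshold, and values are assumed to lie within [vmin, vmax].

```python
import math


def normalize(values, scale="linear", vmin=None, vmax=None):
    """Map values onto [0, 1] under a colormap scale (hypothetical helper)."""
    vmin = min(values) if vmin is None else vmin
    vmax = max(values) if vmax is None else vmax
    if scale == "linear":
        return [(v - vmin) / (vmax - vmin) for v in values]
    if scale == "sqrt":
        # Assumed equivalent to a power-law normalization with exponent 0.5.
        return [((v - vmin) / (vmax - vmin)) ** 0.5 for v in values]
    if scale == "log":
        # Logarithmic scaling requires strictly positive values.
        log_min, log_max = math.log10(vmin), math.log10(vmax)
        return [(math.log10(v) - log_min) / (log_max - log_min) for v in values]
    raise ValueError(f"unsupported scale: {scale!r}")


print(normalize([0, 1, 4], scale="sqrt"))
print(normalize([1, 10, 100], scale="log"))
```

Compared with the linear scale, both alternatives spend more of the color range on small values, which tends to help with heavy-tailed quantities such as counts.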
scprep.plot.scree_plot(singular_values, cumulative=False, ax=None, figsize=None, xlabel='Principal Component', ylabel='Explained Variance (%)', fontsize=None, filename=None, dpi=None, **kwargs)[source]

Plot the explained variance of each principal component.

Parameters
  • singular_values (list-like, shape=[n_components]) – Singular values returned by scprep.reduce.pca(data, return_singular_values=True)

  • cumulative (bool, optional (default=False)) – If True, plot the cumulative explained variance

  • ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.

  • figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)

  • {x,y}label (str, optional) – Labels to display on the x and y axis.

  • fontsize (float or None (default: None)) – Base font size.

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **kwargs (additional arguments for matplotlib.pyplot.plot) –

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes

Examples

>>> import scprep
>>> import numpy as np
>>> data = np.random.normal(0, 1, [200, 1000])
>>> pca_data, singular_values = scprep.reduce.pca(
...     data, n_components=100, return_singular_values=True
... )
>>> scprep.plot.scree_plot(singular_values)
>>> scprep.plot.scree_plot(singular_values, cumulative=True)
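The relationship between singular values and the plotted percentages can be sketched as follows. The variance along each component is proportional to its squared singular value; this sketch assumes the percentages are normalized over the singular values supplied, which may differ from how scree_plot normalizes internally. The helper name is hypothetical.

```python
from itertools import accumulate


def explained_variance_percent(singular_values, cumulative=False):
    """Percentage of variance per component, assuming variance is
    proportional to the squared singular value (hypothetical helper)."""
    squared = [s**2 for s in singular_values]
    total = sum(squared)
    percent = [100 * value / total for value in squared]
    # With cumulative=True, return the running total instead.
    return list(accumulate(percent)) if cumulative else percent


print(explained_variance_percent([4, 3]))
print(explained_variance_percent([4, 3], cumulative=True))
```

Note that when only the leading components are computed, the denominator here covers just those components, so the percentages describe their relative weight rather than a share of the total variance in the data.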
scprep.plot.plot_gene_variability(data, kernel_size=0.005, smooth=5, cutoff=None, percentile=90, ax=None, figsize=None, xlabel='Gene mean', ylabel='Standardized variance', title=None, fontsize=None, filename=None, dpi=None, **kwargs)[source]

Plot the histogram of gene variability.

Variability is computed as the deviation from a loess fit to the rolling median of the mean-variance curve.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data. Multiple datasets may be given as a list of array-likes.

  • kernel_size (float or int, optional (default: 0.005)) – Width of rolling median window. If a float between 0 and 1, the width is given by kernel_size * data.shape[1]. Otherwise should be an odd integer

  • smooth (int, optional (default: 5)) – Amount of smoothing to apply to the median filter

  • cutoff (float or None, optional (default: None)) – Absolute cutoff at which to draw a vertical line. Only one of cutoff and percentile may be given.

  • percentile (float or None, optional (default: 90)) – Percentile between 0 and 100 at which to draw a vertical line. Only one of cutoff and percentile may be given.

  • ax (matplotlib.Axes or None, optional (default: None)) – Axis to plot on. If None, a new axis will be created.

  • figsize (tuple or None, optional (default: None)) – If not None, sets the figure size (width, height)

  • {x,y}label (str, optional) – Labels to display on the x and y axis.

  • title (str or None, optional (default: None)) – Axis title.

  • fontsize (float or None (default: None)) – Base font size.

  • filename (str or None (default: None)) – file to which the output is saved

  • dpi (int or None, optional (default: None)) – The resolution in dots per inch. If None it will default to the value savefig.dpi in the matplotlibrc file. If ‘figure’ it will set the dpi to be the value of the figure. Only used if filename is not None.

  • **kwargs (additional arguments for matplotlib.pyplot.hist) –

Returns

ax – axis on which plot was drawn

Return type

matplotlib.Axes
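The kernel_size rule above (a fraction of the number of features, or an odd integer) can be sketched as a small helper. This is an illustration only: the name kernel_width is hypothetical, and rounding an even fractional width up to the next odd integer is an assumption, since the documentation only states that integer widths should be odd.

```python
def kernel_width(kernel_size, n_features):
    """Resolve the rolling-median window width (hypothetical helper)."""
    if 0 < kernel_size < 1:
        # Fractional sizes are relative to the number of features.
        width = round(kernel_size * n_features)
        # Assumption: nudge even widths up so the window has a center.
        return width + 1 if width % 2 == 0 else width
    if kernel_size != int(kernel_size) or int(kernel_size) % 2 == 0:
        raise ValueError("kernel_size must be a fraction in (0, 1) or an odd integer")
    return int(kernel_size)


print(kernel_width(0.005, 20000))
print(kernel_width(7, 20000))
```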

Dimensionality Reduction

Classes:

AutomaticDimensionSVD([n_components, eps, …])

Truncated SVD with automatic dimensionality selected by Johnson-Lindenstrauss.

InvertibleRandomProjection([n_components, …])

Gaussian random projection with an inverse transform using the pseudoinverse.

SparseInputPCA([n_components, eps, …])

Calculate PCA using random projections to handle sparse matrices.

Functions:

pca(data[, n_components, eps, method, seed, …])

Calculate PCA using random projections to handle sparse matrices.

class scprep.reduce.AutomaticDimensionSVD(n_components='auto', eps=0.3, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)[source]

Bases: sklearn.decomposition._truncated_svd.TruncatedSVD

Truncated SVD with automatic dimensionality selected by Johnson-Lindenstrauss.
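For orientation, the Johnson-Lindenstrauss bound gives a number of dimensions sufficient to preserve pairwise distances within a factor of (1 ± eps). The sketch below evaluates the standard form of that bound, the same quantity scikit-learn exposes as johnson_lindenstrauss_min_dim. Whether AutomaticDimensionSVD applies it in exactly this way is an assumption, and the helper name jl_min_dim is hypothetical.

```python
import math


def jl_min_dim(n_samples, eps):
    """Standard Johnson-Lindenstrauss bound on the embedding dimension:
    4 * ln(n_samples) / (eps^2 / 2 - eps^3 / 3), truncated to an integer."""
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")
    denominator = eps**2 / 2 - eps**3 / 3
    return int(4 * math.log(n_samples) / denominator)


# Smaller eps means a tighter distortion bound and more dimensions.
for eps in (0.5, 0.3, 0.1):
    print(eps, jl_min_dim(1_000_000, eps))
```

The bound depends on the number of samples but not on the number of features, which is why it suits data with many genes.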

Methods:

fit(X)

Fit model on training data X.

fit_transform(X[, y])

Fit model to X and perform dimensionality reduction on X.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Transform X back to its original space.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Perform dimensionality reduction on X.

fit(X)[source]

Fit model on training data X.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Returns the transformer object.

Return type

object

fit_transform(X, y=None)

Fit model to X and perform dimensionality reduction on X.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training data.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

X_new – Reduced version of X. This will always be a dense array.

Return type

ndarray of shape (n_samples, n_components)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

inverse_transform(X)

Transform X back to its original space.

Returns an array X_original whose transform would be X.

Parameters

X (array-like of shape (n_samples, n_components)) – New data.

Returns

X_original – Note that this is always a dense array.

Return type

ndarray of shape (n_samples, n_features)

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)

Perform dimensionality reduction on X.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data.

Returns

X_new – Reduced version of X. This will always be a dense array.

Return type

ndarray of shape (n_samples, n_components)

class scprep.reduce.InvertibleRandomProjection(n_components='auto', eps=0.3, orthogonalize=False, random_state=None)[source]

Bases: sklearn.random_projection.GaussianRandomProjection

Gaussian random projection with an inverse transform using the pseudoinverse.

Methods:

fit(X)

Generate a sparse random projection matrix.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Project the data by using matrix product with the random matrix.

Attributes:

pseudoinverse

Pseudoinverse of the random projection.

fit(X)[source]

Generate a sparse random projection matrix.

Parameters
  • X ({ndarray, sparse matrix} of shape (n_samples, n_features)) – Training set: only the shape is used to find optimal random matrix dimensions based on the theory referenced in the aforementioned papers.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – BaseRandomProjection class instance.

Return type

object

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

property pseudoinverse

Pseudoinverse of the random projection.

This inverts the projection operation for any vector in the span of the random projection. For small enough eps, this should be close to the correct inverse.
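The statement about vectors "in the span of the random projection" can be checked numerically with NumPy. This sketch builds its own Gaussian matrix rather than using InvertibleRandomProjection, so the matrix shape and scaling are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components = 50, 10

# Gaussian projection matrix mapping n_features down to n_components
# (illustrative scaling; not taken from the class itself).
R = rng.normal(size=(n_components, n_features)) / np.sqrt(n_components)
R_pinv = np.linalg.pinv(R)  # shape (n_features, n_components)

# A vector in the span of the projection (the row space of R)...
in_span = R.T @ rng.normal(size=n_components)
recovered_in_span = R_pinv @ (R @ in_span)

# ...versus an arbitrary vector, most of which lies outside that span.
arbitrary = rng.normal(size=n_features)
recovered_arbitrary = R_pinv @ (R @ arbitrary)

print(np.allclose(recovered_in_span, in_span))
print(np.allclose(recovered_arbitrary, arbitrary))
```

For an arbitrary vector, the pseudoinverse returns only its component within the span, which is why the inverse is described as approximate.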

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(X)

Project the data by using matrix product with the random matrix.

Parameters

X ({ndarray, sparse matrix} of shape (n_samples, n_features)) – The input data to project into a smaller dimensional space.

Returns

X_new – Projected array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_components)

class scprep.reduce.SparseInputPCA(n_components=2, eps=0.3, random_state=None, method='svd', **kwargs)[source]

Bases: sklearn.base.BaseEstimator

Calculate PCA using random projections to handle sparse matrices.

Uses the Johnson-Lindenstrauss Lemma to determine the number of dimensions of random projections prior to subtracting the mean.
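One way to see why projecting before mean subtraction is sound: projection is linear, so centering the small projected matrix gives the same result as projecting the centered data, without ever forming the dense centered matrix. The NumPy sketch below checks that identity on synthetic data; the matrix sizes and the projection scaling are assumptions for illustration, and a dense array stands in for a sparse one.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features, n_projected = 30, 200, 20

# Mostly-zero data standing in for a sparse count matrix.
X = rng.poisson(0.1, size=(n_samples, n_features)).astype(float)
# Gaussian random projection matrix (illustrative scaling).
R = rng.normal(size=(n_features, n_projected)) / np.sqrt(n_projected)

# Route 1: center first. X - mean is dense, which sparse input should avoid.
centered_then_projected = (X - X.mean(axis=0)) @ R

# Route 2: project first, then center the much smaller projected matrix.
projected = X @ R
projected_then_centered = projected - projected.mean(axis=0)

print(np.allclose(centered_then_projected, projected_then_centered))
```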

Parameters
  • n_components (int, optional (default: 2)) – Number of components to keep.

  • eps (strictly positive float, optional (default: 0.3)) – Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to 'auto'. Smaller values lead to more accurate embeddings but higher computational and memory costs.

  • method ({'svd', 'orth_rproj', 'rproj'}, optional (default: 'svd')) – Dimensionality reduction method applied prior to mean centering. The method choice affects accuracy (svd > orth_rproj > rproj), and higher accuracy comes with increased computational cost (but not memory).

  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • kwargs – Additional keyword arguments for sklearn.decomposition.PCA

Attributes:

components_

Principal axes in feature space, representing directions of maximum variance.

explained_variance_

The amount of variance explained by each of the selected components.

explained_variance_ratio_

Percentage of variance explained by each of the selected components.

singular_values_

Singular values of the PCA decomposition.

Methods:

fit(X)

Fit the model with X.

fit_transform(X)

Fit the model with X and apply the dimensionality reduction on X.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Transform data back to its original space.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Apply dimensionality reduction to X.

property components_

Principal axes in feature space, representing directions of maximum variance.

The components are sorted by explained variance.

property explained_variance_

The amount of variance explained by each of the selected components.

property explained_variance_ratio_

Percentage of variance explained by each of the selected components.

The sum of the ratios is equal to 1.0. If n_components is None then the number of components grows as eps gets smaller.

fit(X)[source]

Fit the model with X.

Parameters

X (array-like, shape=(n_samples, n_features)) –

fit_transform(X)[source]

Fit the model with X and apply the dimensionality reduction on X.

Parameters

X (array-like, shape=(n_samples, n_features)) –

Returns

X_new

Return type

array-like, shape=(n_samples, n_components)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

inverse_transform(X)[source]

Transform data back to its original space.

Parameters

X (array-like, shape=(n_samples, n_components)) –

Returns

X_new

Return type

array-like, shape=(n_samples, n_features)

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

property singular_values_

Singular values of the PCA decomposition.

transform(X)[source]

Apply dimensionality reduction to X.

Parameters

X (array-like, shape=(n_samples, n_features)) –

Returns

X_new

Return type

array-like, shape=(n_samples, n_components)

scprep.reduce.pca(data, n_components=100, eps=0.3, method='svd', seed=None, return_singular_values=False, n_pca=None, svd_offset=None, svd_multiples=None)[source]

Calculate PCA using random projections to handle sparse matrices.

Uses the Johnson-Lindenstrauss Lemma to determine the number of dimensions of random projections prior to subtracting the mean. Dense matrices are provided to sklearn.decomposition.PCA directly.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • n_components (int, optional (default: 100)) – Number of PCs to compute

  • eps (strictly positive float, optional (default=0.3)) – Parameter to control the quality of the embedding of sparse input. Smaller values lead to more accurate embeddings but higher computational and memory costs

  • method ({'svd', 'orth_rproj', 'rproj', 'dense'}, optional (default: 'svd')) – Dimensionality reduction method applied prior to mean centering of sparse input. The method choice affects accuracy (svd > orth_rproj > rproj) and comes with increased computational cost (but not memory.) On the other hand, method=’dense’ adds a memory cost but is faster.

  • seed (int, RandomState or None, optional (default: None)) – Random state.

  • return_singular_values (bool, optional (default: False)) – If True, also return the singular values

  • n_pca (Deprecated.) –

  • svd_offset (Deprecated.) –

  • svd_multiples (Deprecated.) –

Returns

  • data_pca (array-like, shape=[n_samples, n_components]) – PCA reduction of data

  • singular_values (list-like, shape=[n_components]) – Singular values corresponding to principal components. Returned only if return_singular_values is True.

Row/Column Selection

Functions:

get_cell_set(data[, starts_with, ends_with, …])

Get a list of cells from data.

get_gene_set(data[, starts_with, ends_with, …])

Get a list of genes from data.

highly_variable_genes(data, *extra_data[, …])

Select genes with high variability.

select_cols(data, *extra_data[, idx, …])

Select columns from a data matrix.

select_rows(data, *extra_data[, idx, …])

Select rows from a data matrix.

subsample(*data[, n, seed])

Subsample the number of points in a dataset.

scprep.select.get_cell_set(data, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]

Get a list of cells from data.

Parameters
  • data (array-like, shape=[n_samples, n_features] or [n_samples]) – Input pd.DataFrame, or list of cell names

  • starts_with (str, list-like or None, optional (default: None)) – If not None, only return cell names that start with this prefix.

  • ends_with (str, list-like or None, optional (default: None)) – If not None, only return cell names that end with this suffix.

  • exact_word (str, list-like or None, optional (default: None)) – If not None, only return cell names that contain this exact word.

  • regex (str, list-like or None, optional (default: None)) – If not None, only return cell names that match this regular expression.

Returns

cells – List of matching cells

Return type

list-like, shape<=[n_samples]

scprep.select.get_gene_set(data, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]

Get a list of genes from data.

Parameters
  • data (array-like, shape=[n_samples, n_features] or [n_features]) – Input pd.DataFrame, or list of gene names

  • starts_with (str, list-like or None, optional (default: None)) – If not None, only return gene names that start with this prefix.

  • ends_with (str, list-like or None, optional (default: None)) – If not None, only return gene names that end with this suffix.

  • exact_word (str, list-like or None, optional (default: None)) – If not None, only return gene names that contain this exact word.

  • regex (str, list-like or None, optional (default: None)) – If not None, only return gene names that match this regular expression.

Returns

genes – List of matching genes

Return type

list-like, shape<=[n_features]
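The four matching options shared by get_cell_set and get_gene_set can be sketched with the standard library. This is a simplified illustration, not scprep's implementation. The helper name match_names is hypothetical, and three behaviors are assumptions: a list passed to one option matches if any entry matches, different options are combined with a logical AND, and exact_word is interpreted using regular-expression word boundaries.

```python
import re


def _as_list(value):
    """Wrap a single string so every option can be treated as a list."""
    return [value] if isinstance(value, str) else list(value)


def match_names(names, starts_with=None, ends_with=None, exact_word=None, regex=None):
    """Return the names satisfying every option that is given (hypothetical helper)."""

    def keep(name):
        checks = []
        if starts_with is not None:
            checks.append(any(name.startswith(p) for p in _as_list(starts_with)))
        if ends_with is not None:
            checks.append(any(name.endswith(s) for s in _as_list(ends_with)))
        if exact_word is not None:
            checks.append(
                any(re.search(rf"\b{re.escape(w)}\b", name) for w in _as_list(exact_word))
            )
        if regex is not None:
            checks.append(any(re.search(r, name) for r in _as_list(regex)))
        # With no options given, every name is kept.
        return all(checks)

    return [name for name in names if keep(name)]


genes = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
print(match_names(genes, starts_with="MT-"))
print(match_names(genes, regex=r"^(ACTB|GAPDH)$"))
```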

scprep.select.highly_variable_genes(data, *extra_data, kernel_size=0.05, smooth=5, cutoff=None, percentile=80)[source]

Select genes with high variability.

Variability is computed as the deviation from a loess fit to the rolling median of the mean-variance curve

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[any, n_features], optional) – Optional additional data objects from which to select the same columns

  • kernel_size (float or int, optional (default: 0.05)) – Width of rolling median window. If a float between 0 and 1, the width is given by kernel_size * data.shape[1]. Otherwise should be an odd integer

  • smooth (int, optional (default: 5)) – Amount of smoothing to apply to the median filter

  • cutoff (float, optional (default: None)) – Variability above which expression is deemed significant

  • percentile (int, optional (Default: 80)) – Percentile above or below which to remove genes. Must be an integer between 0 and 100. Only one of cutoff and percentile should be specified.

Returns

  • data (array-like, shape=[n_samples, m_features]) – Filtered output data, where m_features <= n_features

  • extra_data (array-like, shape=[any, m_features]) – Filtered extra data, if passed.
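The choice between cutoff and percentile can be illustrated with a short sketch that starts from already-computed variability scores. It does not reproduce the loess fit or rolling median. The helper names are hypothetical, the percentile uses linear interpolation between ranks, and keeping genes strictly above the threshold is an assumption.

```python
def percentile_value(values, q):
    """q-th percentile (0 to 100) with linear interpolation between ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def select_variable(scores, cutoff=None, percentile=None):
    """Indices of genes whose variability exceeds the threshold
    (hypothetical helper; exactly one of cutoff and percentile is required)."""
    if (cutoff is None) == (percentile is None):
        raise ValueError("specify exactly one of cutoff and percentile")
    threshold = cutoff if cutoff is not None else percentile_value(scores, percentile)
    return [i for i, score in enumerate(scores) if score > threshold]


scores = [0.1, 2.5, 0.3, 1.8, 0.2, 0.9, 3.1, 0.4, 0.6, 1.2]
print(select_variable(scores, percentile=80))
print(select_variable(scores, cutoff=1.0))
```

A percentile keeps a fixed fraction of genes regardless of the data, while an absolute cutoff keeps however many genes exceed it, so the two are not interchangeable across datasets.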

scprep.select.select_cols(data, *extra_data, idx=None, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]

Select columns from a data matrix.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[any, n_features], optional) – Optional additional data objects from which to select the same columns

  • idx (list-like, shape=[m_features]) – Integer indices or string column names to be selected

  • starts_with (str, list-like or None, optional (default: None)) – If not None, select columns that start with this prefix.

  • ends_with (str, list-like or None, optional (default: None)) – If not None, select columns that end with this suffix.

  • exact_word (str, list-like or None, optional (default: None)) – If not None, select columns that contain this exact word.

  • regex (str, list-like or None, optional (default: None)) – If not None, select columns that match this regular expression.

Returns

  • data (array-like, shape=[n_samples, m_features]) – Subsetted output data.

  • extra_data (array-like, shape=[any, m_features]) – Subsetted extra data, if passed.

Examples

>>> data_subset = scprep.select.select_cols(
...     data, idx=np.random.choice([True, False], data.shape[1])
... )
>>> data_subset, metadata_subset = scprep.select.select_cols(
...     data, metadata, starts_with="MT"
... )

Raises

UserWarning – Raised if no columns are selected
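To make the effect of a prefix selection concrete, here is a rough NumPy analogue of selecting columns whose names start with "MT" and applying the same mask to an extra data object. It does not call scprep; the arrays and names are made up for illustration.

```python
import numpy as np

# Hypothetical toy inputs: 3 cells x 4 genes, plus per-gene metadata
gene_names = np.array(["MT-CO1", "ACTB", "MT-ND1", "GAPDH"])
data = np.arange(12).reshape(3, 4)
metadata = np.array([["mito", "cyto", "mito", "cyto"]])  # shape [1, n_features]

# Boolean mask over columns, analogous in spirit to starts_with="MT"
mask = np.char.startswith(gene_names, "MT")

# The same column mask is applied to data and to the extra data object
data_subset = data[:, mask]
metadata_subset = metadata[:, mask]
```

Both outputs keep their row count and share the reduced number of columns, which is the shape contract stated in the Returns section.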

scprep.select.select_rows(data, *extra_data, idx=None, starts_with=None, ends_with=None, exact_word=None, regex=None)[source]

Select rows from a data matrix.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • extra_data (array-like, shape=[n_samples, any], optional) – Optional additional data objects from which to select the same rows

  • idx (list-like, shape=[m_samples], optional (default: None)) – Integer indices or string index names to be selected

  • starts_with (str, list-like or None, optional (default: None)) – If not None, select rows that start with this prefix.

  • ends_with (str, list-like or None, optional (default: None)) – If not None, select rows that end with this suffix.

  • exact_word (str, list-like or None, optional (default: None)) – If not None, select rows that contain this exact word.

  • regex (str, list-like or None, optional (default: None)) – If not None, select rows that match this regular expression.

Returns

  • data (array-like, shape=[m_samples, n_features]) – Subsetted output data

  • extra_data (array-like, shape=[m_samples, any]) – Subsetted extra data, if passed.

Examples

>>> data_subset = scprep.select.select_rows(
...     data, idx=np.random.choice([True, False], data.shape[0])
... )
>>> data_subset, labels_subset = scprep.select.select_rows(
...     data, labels, ends_with="batch1"
... )

Raises

UserWarning – Raised if no rows are selected

scprep.select.subsample(*data, n=10000, seed=None)[source]

Subsample the number of points in a dataset.

Selects a random subset of (optionally multiple) datasets. Helpful for plotting, or for methods with computational constraints.

Parameters
  • data (array-like, shape=[n_samples, any]) – Input data. Any number of datasets can be passed at once, so long as n_samples remains the same.

  • n (int, optional (default: 10000)) – Number of samples to retain. Must be less than n_samples.

  • seed (int, optional (default: None)) – Random seed

Examples

>>> data_subsample, labels_subsample = scprep.select.subsample(data, labels, n=1000)
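The key property of subsampling several datasets at once is that one set of row indices is drawn and reused for every input. A minimal sketch of that idea follows, assuming sampling without replacement; `subsample_sketch` is a hypothetical helper, not the scprep function.

```python
import numpy as np


def subsample_sketch(*data, n=10000, seed=None):
    """Draw n row indices once and apply them to every input array."""
    n_samples = data[0].shape[0]
    if any(d.shape[0] != n_samples for d in data):
        raise ValueError("all inputs must have the same number of rows")
    if n > n_samples:
        raise ValueError("cannot retain more samples than exist")
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_samples, n, replace=False)  # assumed: no replacement
    return tuple(d[idx] for d in data)


# Row i of `data` is [2*i, 2*i + 1], so rows can be traced back to labels
data = np.arange(20).reshape(10, 2)
labels = np.arange(10)
data_sub, labels_sub = subsample_sketch(data, labels, n=4, seed=0)
```

Because the indices are shared, each retained row of `data_sub` still lines up with its entry in `labels_sub`.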

Utilities

Functions:

check_consistent_columns(data[, …])

Ensure that a set of data matrices have consistent columns.

combine_batches(data, batch_labels[, …])

Combine data matrices from multiple batches and store a batch label.

matrix_any(condition)

Check if a condition is true anywhere in a data matrix.

matrix_min(data)

Get the minimum value from a data matrix.

matrix_non_negative(data[, allow_equal])

Check if all values in a matrix are non-negative.

matrix_std(data[, axis])

Get the column-wise, row-wise, or total standard deviation of a matrix.

matrix_sum(data[, axis, ignore_nan])

Get the column-wise, row-wise, or total sum of values in a matrix.

matrix_transform(data, fun, *args, **kwargs)

Perform a numerical transformation to data.

matrix_transpose(X)

Transpose a matrix in a memory-efficient manner.

matrix_vector_elementwise_multiply(data, …)

Elementwise multiply a matrix by a vector.

sort_clusters_by_values(clusters, values)

Sort clusters in increasing order of values.

sparse_series_min(data)

Get the minimum value from a pandas sparse series.

to_array_or_spmatrix(x)

Convert an array-like to a np.ndarray or scipy.sparse.spmatrix.

toarray(x)

Convert an array-like to a np.ndarray.

scprep.utils.check_consistent_columns(data, common_columns_only=True)[source]

Ensure that a set of data matrices have consistent columns.

Parameters
  • data (list of array-likes) – List of matrices to be checked

  • common_columns_only (bool, optional (default: True)) – With pandas inputs, drop any columns that are not common to all matrices

Returns

data – List of matrices with consistent columns, subsetted if necessary

Return type

list of array-likes

Raises

ValueError – Raised if data has inconsistent number of columns and does not have column names for subsetting

scprep.utils.combine_batches(data, batch_labels, append_to_cell_names=None, common_columns_only=True)[source]

Combine data matrices from multiple batches and store a batch label.

Parameters
  • data (list of array-like, shape=[n_batch]) – All matrices must be of the same format and have the same number of columns (or genes.)

  • batch_labels (list of str, shape=[n_batch]) – List of names assigned to each batch

  • append_to_cell_names (bool, optional (default: None)) – If input is a pandas dataframe, add the batch label corresponding to each cell to its existing index (or cell name / barcode.) Default behavior is True for dataframes and False otherwise.

  • common_columns_only (bool, optional (default: True)) – With pandas inputs, drop any columns that are not common to all data matrices

Returns

  • data (data matrix, shape=[n_samples, n_features]) – Number of samples is the sum of numbers of samples of all batches. Number of features is the same as each of the batches.

  • sample_labels (list-like, shape=[n_samples]) – Batch labels corresponding to each sample
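The shapes in the Returns section can be illustrated with plain NumPy arrays: batches are stacked row-wise and each row receives the label of the batch it came from. This sketch covers only the dense-array case and does not reproduce the pandas-specific index handling.

```python
import numpy as np

# Two hypothetical batches with the same number of columns (genes)
batch_a = np.ones((3, 4))
batch_b = np.zeros((2, 4))
data = [batch_a, batch_b]
batch_labels = ["a", "b"]

# n_samples is the sum over batches; n_features is unchanged
combined = np.vstack(data)

# One batch label per row of the combined matrix
sample_labels = np.concatenate(
    [np.repeat(label, d.shape[0]) for d, label in zip(data, batch_labels)]
)
```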

scprep.utils.matrix_any(condition)[source]

Check if a condition is true anywhere in a data matrix.

np.any doesn’t handle matrices of type pd.DataFrame

Parameters

condition (array-like) – Boolean matrix

Returns

any – True if condition contains any True values, False otherwise

Return type

bool

scprep.utils.matrix_min(data)[source]

Get the minimum value from a data matrix.

Pandas SparseDataFrame does not handle np.min.

Parameters

data (array-like, shape=[n_samples, n_features]) – Input data

Returns

minimum – Minimum entry in data.

Return type

float

scprep.utils.matrix_non_negative(data, allow_equal=True)[source]

Check if all values in a matrix are non-negative.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • allow_equal (bool, optional (default: True)) – If True, min(data) can be equal to 0

Returns

is_non_negative

Return type

bool

scprep.utils.matrix_std(data, axis=None)[source]

Get the column-wise, row-wise, or total standard deviation of a matrix.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • axis (int or None, optional (default: None)) – Axis across which to calculate standard deviation. axis=0 gives column standard deviation, axis=1 gives row standard deviation. None gives the total standard deviation.

Returns

std – Standard deviation along desired axis.

Return type

array-like or float

scprep.utils.matrix_sum(data, axis=None, ignore_nan=False)[source]

Get the column-wise, row-wise, or total sum of values in a matrix.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • axis (int or None, optional (default: None)) – Axis across which to sum. axis=0 gives column sums, axis=1 gives row sums. None gives the total sum.

  • ignore_nan (bool, optional (default: False)) – If True, uses np.nansum instead of np.sum

Returns

sums – Sums along desired axis.

Return type

array-like or float
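The axis convention matches NumPy's, so a small dense example shows what each setting returns. The snippet uses `np.sum` and `np.nansum` directly, which the ignore_nan description names; it does not call scprep.

```python
import numpy as np

data = np.array([[1.0, 2.0, np.nan],
                 [4.0, 5.0, 6.0]])

col_sums = np.nansum(data, axis=0)   # one value per column (feature)
row_sums = np.nansum(data, axis=1)   # one value per row (sample)
total = np.nansum(data)              # axis=None: a single scalar

# Without NaN handling, a single NaN propagates to the result
total_with_nan = np.sum(data)
```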

scprep.utils.matrix_transform(data, fun, *args, **kwargs)[source]

Perform a numerical transformation to data.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • fun (callable) – Numerical transformation function, np.ufunc or similar.

  • args, kwargs – Arguments for fun. data is always passed as the first argument.

Returns

data – Transformed output data

Return type

array-like, shape=[n_samples, n_features]

scprep.utils.matrix_transpose(X)[source]

Transpose a matrix in a memory-efficient manner.

Pandas sparse dataframes are coerced to dense.

Parameters

X (array-like, shape=[n,m]) – Input data

Returns

X_T – Transposed input data

Return type

array-like, shape=[m,n]

scprep.utils.matrix_vector_elementwise_multiply(data, multiplier, axis=None)[source]

Elementwise multiply a matrix by a vector.

Parameters
  • data (array-like, shape=[n_samples, n_features]) – Input data

  • multiplier (array-like, shape=[n_samples, 1] or [1, n_features]) – Vector by which to multiply data

  • axis (int or None, optional (default: None)) – Axis across which to multiply. axis=0 multiplies each column, axis=1 multiplies each row. None guesses based on dimensions

Returns

product – Multiplied matrix

Return type

array-like
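A dense NumPy sketch of how the axis could be guessed from the vector length and applied by broadcasting is shown below. The mapping is an assumption read off the parameter shapes (a vector of length n_samples scales along axis 0, a vector of length n_features along axis 1), and `multiply_sketch` is a hypothetical helper.

```python
import numpy as np


def multiply_sketch(data, multiplier, axis=None):
    """Elementwise multiply a dense matrix by a vector via broadcasting."""
    data = np.asarray(data)
    multiplier = np.asarray(multiplier).ravel()
    if axis is None:
        if data.shape[0] == data.shape[1]:
            # A square matrix gives no way to tell the two cases apart
            raise ValueError("square matrix: axis cannot be guessed")
        axis = 0 if multiplier.size == data.shape[0] else 1
    if axis == 0:
        # Vector has one entry per sample: each column is multiplied by it
        return data * multiplier[:, None]
    # Vector has one entry per feature: each row is multiplied by it
    return data * multiplier[None, :]


data = np.ones((2, 3))
per_sample = multiply_sketch(data, [1, 2])
per_feature = multiply_sketch(data, [1, 2, 3])
```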

scprep.utils.sort_clusters_by_values(clusters, values)[source]

Sort clusters in increasing order of values.

Parameters
  • clusters (array-like) – An array of cluster assignments, like the output of a fit_predict() call.

  • values (array-like) – An associated value for each index in clusters to use for sorting the clusters.

Returns

new_clusters – Reordered cluster assignments. np.mean(values[new_clusters == 0]) will be less than np.mean(values[new_clusters == 1]) which will be less than np.mean(values[new_clusters == 2]) and so on.

Return type

array-like
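The ordering guarantee stated under Returns can be reproduced with a short relabeling routine. This is a sketch of the idea (rank clusters by their mean value, then renumber from 0), not the library's code; `sort_clusters_sketch` is a hypothetical name.

```python
import numpy as np


def sort_clusters_sketch(clusters, values):
    """Relabel clusters so that the mean of `values` increases with label."""
    clusters = np.asarray(clusters)
    values = np.asarray(values, dtype=float)
    labels = np.unique(clusters)
    means = np.array([values[clusters == c].mean() for c in labels])
    ordered = labels[np.argsort(means)]      # old labels, lowest mean first
    new_clusters = np.empty_like(clusters)
    for new_label, old_label in enumerate(ordered):
        new_clusters[clusters == old_label] = new_label
    return new_clusters


clusters = np.array([0, 0, 1, 1, 2, 2])
values = np.array([5.0, 6.0, 1.0, 2.0, 3.0, 4.0])
new_clusters = sort_clusters_sketch(clusters, values)
```

Here the original cluster 1 has the lowest mean and becomes 0, cluster 2 becomes 1, and cluster 0 becomes 2.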

scprep.utils.sparse_series_min(data)[source]

Get the minimum value from a pandas sparse series.

Pandas SparseDataFrame does not handle np.min.

Parameters

data (pd.Series[SparseArray]) – Input data

Returns

minimum – Minimum entry in data.

Return type

float

scprep.utils.to_array_or_spmatrix(x)[source]

Convert an array-like to a np.ndarray or scipy.sparse.spmatrix.

Parameters

x (array-like) – Array-like to be converted

Returns

x

Return type

np.ndarray or scipy.sparse.spmatrix

scprep.utils.toarray(x)[source]

Convert an array-like to a np.ndarray.

Parameters

x (array-like) – Array-like to be converted

Returns

x

Return type

np.ndarray

External Tools

Functions:

DyngenSimulate(backbone[, num_cells, …])

Simulate dataset with cellular backbone.

install_bioconductor([package, …])

Install a Bioconductor package.

install_github(repo[, lib, dependencies, …])

Install a Github repository.

Slingshot(data, cluster_labels[, …])

Perform lineage inference with Slingshot.

SplatSimulate([method, batch_cells, …])

Simulate count data from a fictional single-cell RNA-seq experiment using the Splat method.

Classes:

RFunction([args, setup, body, cleanup, verbose])

Run an R function from Python.

scprep.run.DyngenSimulate(backbone, num_cells=500, num_tfs=100, num_targets=50, num_hks=25, simulation_census_interval=10, compute_cellwise_grn=False, compute_rna_velocity=False, n_jobs=7, random_state=None, verbose=True, force_num_cells=False)[source]

Simulate dataset with cellular backbone.

The backbone determines the overall dynamic process during a simulation. It consists of a set of gene modules, which regulate each other such that expression of certain genes change over time in a specific manner.

DyngenSimulate is a Python wrapper for the R package Dyngen. Default values are taken from the GitHub vignettes. For more details, read about Dyngen on GitHub.

Parameters
  • backbone (string) – Backbone name from the dyngen list of backbones. Get the list with get_backbones().

  • num_cells (int, optional (default: 500)) – Number of cells.

  • num_tfs (int, optional (default: 100)) –

    Number of transcription factors. The TFs are the main drivers of the molecular changes in the simulation. A TF can only be regulated by other TFs or itself.

    NOTE: If num_tfs input is less than nrow(backbone$module_info), Dyngen will default to nrow(backbone$module_info). This quantity varies between backbones and with each run (without seed). It is generally less than 75. It is recommended to input num_tfs >= 100 to stabilize the output.

  • num_targets (int, optional (default: 50)) – Number of target genes. Target genes are regulated by a TF or another target gene, but are always downstream of at least one TF.

  • num_hks (int, optional (default: 25)) – Number of housekeeping genes. Housekeeping genes are completely separate from any TFs or target genes.

  • simulation_census_interval (int, optional (default: 10)) – Stores the abundance levels only after a specific interval has passed. The lower the interval, the higher detail of simulation trajectory retained, though many timepoints will contain similar information.

  • compute_cellwise_grn (boolean, optional (default: False)) – If True, computes the ground truth cellwise gene regulatory networks. Also outputs ground truth bulk (entire dataset) regulatory network. NOTE: Increases compute time significantly.

  • compute_rna_velocity (boolean, optional (default: False)) – If true, computes the ground truth propensity ratios after simulation. NOTE: Increases compute time significantly.

  • n_jobs (int, optional (default: 7)) – Number of cores to use.

  • random_state (int, optional (default: None)) – Fixes seed for simulation generator.

  • verbose (boolean, optional (default: True)) – Data generation verbosity.

  • force_num_cells (boolean, optional (default: False)) – Dyngen occasionally produces fewer cells than specified. Set this flag to True to rerun Dyngen until the correct cell count is reached.

Returns

  • Dictionary data of pd.DataFrames

  • data[‘cell_info’] (pd.DataFrame, shape (n_cells, 4)) – Columns: cell_id, step_ix, simulation_i, sim_time. sim_time is the simulated timepoint for a given cell.

  • data[‘expression’] (pd.DataFrame, shape (n_cells, n_genes)) – Log-transformed counts with dropout.

  • If compute_cellwise_grn is True,

  • data[‘bulk_grn’] (pd.DataFrame, shape (n_tf_target_interactions, 4)) – Columns: regulator, target, strength, effect. Strength is positive and unbounded. Effect is either +1 (for activation) or -1 (for inhibition).

  • data[‘cellwise_grn’] (pd.DataFrame, shape (n_tf_target_interactions_per_cell, 4)) – Columns: cell_id, regulator, target, strength. The output does not include all edges per cell. The regulatory effect lies between [−1, 1], where -1 is complete inhibition of target by TF, +1 is maximal activation of target by TF, and 0 is inactivity of the regulatory interaction between R and T.

  • If compute_rna_velocity is True,

  • data[‘rna_velocity’] (pd.DataFrame, shape (n_cells, n_genes)) – Propensity ratios for each cell.

Example

>>> import scprep
>>> scprep.run.dyngen.install()
>>> backbones = scprep.run.dyngen.get_backbones()
>>> data = scprep.run.DyngenSimulate(backbone=backbones[0])

scprep.run.install_bioconductor(package=None, site_repository=None, update=False, type='binary', version=None, verbose=True)[source]

Install a Bioconductor package.

Parameters
  • site_repository (string, optional (default: None)) – additional repository in which to look for packages to install. This repository will be prepended to the default repositories

  • update (boolean, optional (default: False)) – When False, don’t attempt to update old packages. When True, update old packages automatically.

  • type ({"binary", "source", "both"}, optional (default: "binary")) – Which package version to install if a newer version is available as source. “both” tries source first and uses binary as a fallback.

  • version (string, optional (default: None)) – Bioconductor version to install, e.g., version = “3.8”. The special symbol version = “devel” installs the current ‘development’ version. If None, installs from the current version.

  • verbose (boolean, optional (default: True)) – Install script verbosity.

scprep.run.install_github(repo, lib=None, dependencies=None, update=False, type='binary', build_vignettes=False, force=False, verbose=True)[source]

Install a Github repository.

Parameters
  • repo (string) – Github repository name to install.

  • lib (string) – Directory to install the package. If missing, defaults to the first element of .libPaths().

  • dependencies (boolean, optional (default: None/NA)) – When True, installs all packages specified under “Depends”, “Imports”, “LinkingTo” and “Suggests”. When False, installs no dependencies. When None/NA, installs all packages specified under “Depends”, “Imports” and “LinkingTo”.

  • update (string or boolean, optional (default: False)) – One of “default”, “ask”, “always”, or “never”. “default” Respects R_REMOTES_UPGRADE variable if set, falls back to “ask” if unset. “ask” prompts the user for which out of date packages to upgrade. For non-interactive sessions “ask” is equivalent to “always”. TRUE and FALSE also accepted, correspond to “always” and “never” respectively.

  • type ({"binary", "source", "both"}, optional (default: "binary")) – Which package version to install if a newer version is available as source. “both” tries source first and uses binary as a fallback.

  • build_vignettes (boolean, optional (default: False)) – Builds Github vignettes.

  • force (boolean, optional (default: False)) – Forces installation even if remote state has not changed since previous install.

  • verbose (boolean, optional (default: True)) – Install script verbosity.

class scprep.run.RFunction(args='', setup='', body='', cleanup=True, verbose=1)[source]

Bases: object

Run an R function from Python.

Parameters
  • args (str, optional (default: "")) – Comma-separated R argument names and optionally default parameters

  • setup (str, optional (default: "")) – R code to run prior to function definition (e.g. loading libraries)

  • body (str, optional (default: "")) – R code to run in the body of the function

  • cleanup (boolean, optional (default: True)) – If true, clear the R workspace after the function is complete. If false, this could result in memory leaks.

  • verbose (int, optional (default: 1)) – R script verbosity. For verbose==0, no messages are printed. For verbose==1, messages from the function body are printed. For verbose==2, messages from the function setup and body are printed.

scprep.run.Slingshot(data, cluster_labels, start_cluster=None, end_cluster=None, distance=None, omega=None, shrink=True, extend='y', reweight=True, reassign=True, thresh=0.001, max_iter=15, stretch=2, smoother='smooth.spline', shrink_method='cosine', allow_breaks=True, seed=None, verbose=1, **kwargs)[source]

Perform lineage inference with Slingshot.

Given a reduced-dimensional data matrix n by p and a vector of cluster labels (or matrix of soft cluster assignments, potentially including a -1 label for “unclustered”), this function performs lineage inference using a cluster-based minimum spanning tree and constructing simultaneous principal curves for branching paths through the tree.

For more details, read about Slingshot on GitHub and Bioconductor.

Parameters
  • data (array-like, shape=[n_samples, n_dimensions]) – matrix of (reduced dimension) coordinates to be used for lineage inference.

  • cluster_labels (list-like, shape=[n_samples]) – a vector of cluster labels, optionally including -1’s for “unclustered.”

  • start_cluster (string, optional (default: None)) – indicates the cluster(s) of origin. Lineages will be represented by paths coming out of this cluster.

  • end_cluster (string, optional (default: None)) – indicates the cluster(s) which will be forced leaf nodes. This introduces a constraint on the MST algorithm.

  • distance (callable, optional (default: None)) – method for calculating distances between clusters. Must take two matrices as input, corresponding to subsets of reduced_dim. If the minimum cluster size is larger than the number of dimensions, the default is to use the joint covariance matrix to find squared distance between cluster centers. If not, the default is to use the diagonal of the joint covariance matrix. Not currently implemented.

  • omega (float, optional (default: None)) – this granularity parameter determines the distance between every real cluster and the artificial cluster. It is parameterized such that this distance is omega / 2, making omega the maximum distance between two connected clusters. By default, omega = Inf.

  • shrink (boolean or float, optional (default: True)) – boolean or numeric between 0 and 1, determines whether and how much to shrink branching lineages toward their average prior to the split.

  • extend ({'y', 'n', 'pc1'}, optional (default: "y")) – how to handle root and leaf clusters of lineages when constructing the initial, piece-wise linear curve.

  • reweight (boolean, optional (default: True)) – whether to allow cells shared between lineages to be reweighted during curve-fitting. If True, cells shared between lineages will be iteratively reweighted based on the quantiles of their projection distances to each curve.

  • reassign (boolean, optional (default: True)) – whether to reassign cells to lineages at each iteration. If True, cells will be added to a lineage when their projection distance to the curve is less than the median distance for all cells currently assigned to the lineage. Additionally, shared cells will be removed from a lineage if their projection distance to the curve is above the 90th percentile and their weight along the curve is less than 0.1.

  • thresh (float, optional (default: 0.001)) – determines the convergence criterion. Percent change in the total distance from cells to their projections along curves must be less than thresh.

  • max_iter (int, optional (default: 15)) – maximum number of iterations

  • stretch (int, optional (default: 2)) – factor between 0 and 2 by which curves can be extrapolated beyond endpoints

  • smoother ({"smooth.spline", "lowess", "periodic_lowess"}, optional (default: "smooth.spline")) – choice of smoother. “periodic_lowess” allows one to fit closed curves. Beware, you may want to use iter = 0 with “lowess”.

  • shrink_method (string, optional (default: "cosine")) – how to determine the appropriate amount of shrinkage for a branching lineage. Accepted values: “gaussian”, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “triweight”, “cosine”, “optcosine”, “density”.

  • allow_breaks (boolean, optional (default: True)) – determines whether curves that branch very close to the origin should be allowed to have different starting points.

  • seed (int or None, optional (default: None)) – Seed to use for generating random numbers.

  • verbose (int, optional (default: 1)) – Logging verbosity between 0 and 2.

Returns

  • slingshot (dict) – Contains the following keys:

  • pseudotime (array-like, shape=[n_samples, n_curves]) – Pseudotime projection of each cell onto each principal curve. Value is np.nan if the cell does not lie on the curve

  • branch (list-like, shape=[n_samples]) – Branch assignment for each cell

  • curves (array_like, shape=[n_curves, n_samples, n_dimensions]) – Coordinates of each principal curve in the reduced dimension

Examples

>>> import scprep
>>> import phate
>>> data, clusters = phate.tree.gen_dla(n_branch=4, n_dim=200, branch_length=200)
>>> phate_op = phate.PHATE()
>>> data_phate = phate_op.fit_transform(data)
>>> slingshot = scprep.run.Slingshot(data_phate, clusters)
>>> ax = scprep.plot.scatter2d(
...     data_phate,
...     c=slingshot['pseudotime'][:,0],
...     cmap='magma',
...     legend_title='Branch 1'
... )
>>> scprep.plot.scatter2d(
...     data_phate,
...     c=slingshot['pseudotime'][:,1],
...     cmap='viridis',
...     ax=ax,
...     ticks=False,
...     label_prefix='PHATE',
...     legend_title='Branch 2'
...     )
>>> for curve in slingshot['curves']:
...     ax.plot(curve[:,0], curve[:,1], c='black')
>>> ax = scprep.plot.scatter2d(data_phate, c=slingshot['branch'],
...                        legend_title='Branch', ticks=False, label_prefix='PHATE')
>>> for curve in slingshot['curves']:
...     ax.plot(curve[:,0], curve[:,1], c='black')

scprep.run.SplatSimulate(method='paths', batch_cells=100, n_genes=10000, batch_fac_loc=0.1, batch_fac_scale=0.1, mean_rate=0.3, mean_shape=0.6, lib_loc=11, lib_scale=0.2, lib_norm=False, out_prob=0.05, out_fac_loc=4, out_fac_scale=0.5, de_prob=0.1, de_down_prob=0.1, de_fac_loc=0.1, de_fac_scale=0.4, bcv_common=0.1, bcv_df=60, dropout_type='none', dropout_prob=0.5, dropout_mid=0, dropout_shape=-1, group_prob=1, path_from=0, path_n_steps=100, path_skew=0.5, path_nonlinear_prob=0.1, path_sigma_fac=0.8, seed=None, verbose=1, path_length=None)[source]

Simulate count data from a fictional single-cell RNA-seq experiment using the Splat method.

SplatSimulate is a Python wrapper for the R package Splatter. For more details, read about Splatter on GitHub and Bioconductor.

Parameters
  • batch_cells (list-like or int, optional (default: 100)) – The number of cells in each batch.

  • n_genes (int, optional (default:10000)) – The number of genes to simulate.

  • batch_fac_loc (float, optional (default: 0.1)) – Location (meanlog) parameter for the batch effects factor log-normal distribution.

  • batch_fac_scale (float, optional (default: 0.1)) – Scale (sdlog) parameter for the batch effects factor log-normal distribution.

  • mean_shape (float, optional (default: 0.6)) – Shape parameter for the mean gamma distribution.

  • mean_rate (float, optional (default: 0.3)) – Rate parameter for the mean gamma distribution.

  • lib_loc (float, optional (default: 11)) – Location (meanlog) parameter for the library size log-normal distribution, or mean for the normal distribution.

  • lib_scale (float, optional (default: 0.2)) – Scale (sdlog) parameter for the library size log-normal distribution, or sd for the normal distribution.

  • lib_norm (bool, optional (default: False)) – Whether to use a normal distribution instead of the usual log-normal distribution.

  • out_prob (float, optional (default: 0.05)) – Probability that a gene is an expression outlier.

  • out_fac_loc (float, optional (default: 4)) – Location (meanlog) parameter for the expression outlier factor log-normal distribution.

  • out_fac_scale (float, optional (default: 0.5)) – Scale (sdlog) parameter for the expression outlier factor log-normal distribution.

  • de_prob (float, optional (default: 0.1)) – Probability that a gene is differentially expressed in each group or path.

  • de_down_prob (float, optional (default: 0.1)) – Probability that a differentially expressed gene is down-regulated.

  • de_fac_loc (float, optional (default: 0.1)) – Location (meanlog) parameter for the differential expression factor log-normal distribution.

  • de_fac_scale (float, optional (default: 0.4)) – Scale (sdlog) parameter for the differential expression factor log-normal distribution.

  • bcv_common (float, optional (default: 0.1)) – Underlying common dispersion across all genes.

  • bcv_df (float, optional (default: 60)) – Degrees of freedom for the BCV inverse chi-squared distribution.

  • dropout_type ({'none', 'experiment', 'batch', 'group', 'cell', 'binomial'}, optional (default: 'none')) – The type of dropout to simulate. “none” indicates no dropout, “experiment” is global dropout using the same parameters for every cell, “batch” uses the same parameters for every cell in each batch, “group” uses the same parameters for every cell in each group, “cell” uses a different set of parameters for each cell, and “binomial” performs post-hoc binomial undersampling.

  • dropout_mid (list-like or float, optional (default: 0)) – Midpoint parameter for the dropout logistic function.

  • dropout_shape (list-like or float, optional (default: -1)) – Shape parameter for the dropout logistic function.

  • dropout_prob (float, optional (default: 0.5)) – Probability for binomial undersampling dropout.

  • group_prob (list-like or int, optional (default: 1, shape=[n_groups])) – The probabilities that cells come from particular groups.

  • path_from (list-like, optional (default: 0, shape=[n_groups])) – Vector giving the originating point of each path.

  • path_n_steps (list-like, optional (default: 100, shape=[n_groups])) – Vector giving the number of steps to simulate along each path.

  • path_skew (list-like, optional (default: 0.5, shape=[n_groups])) – Vector giving the skew of each path.

  • path_nonlinear_prob (float, optional (default: 0.1)) – Probability that a gene changes expression in a non-linear way along the differentiation path.

  • path_sigma_fac (float, optional (default: 0.8)) – Sigma factor for non-linear gene paths.

  • seed (int or None, optional (default: None)) – Seed to use for generating random numbers.

  • verbose (int, optional (default: 1)) – Logging verbosity between 0 and 2.
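The dropout_mid and dropout_shape parameters above describe a logistic function. As a rough illustration of how they interact, the sketch below assumes the logistic is evaluated on the log of a cell's mean expression, which is how Splatter's dropout model is commonly described; check the Splatter documentation before relying on this. `dropout_probability` is a hypothetical helper.

```python
import math


def dropout_probability(cell_mean, dropout_mid=0.0, dropout_shape=-1.0):
    """Logistic dropout probability for a single positive mean value.

    Assumption: the logistic is applied to log(cell_mean), with
    dropout_mid as the midpoint and dropout_shape as the slope.
    """
    x = math.log(cell_mean)
    return 1.0 / (1.0 + math.exp(-dropout_shape * (x - dropout_mid)))


p_mid = dropout_probability(1.0)     # log(1) == dropout_mid, the midpoint
p_low = dropout_probability(0.1)     # weakly expressed gene
p_high = dropout_probability(10.0)   # strongly expressed gene
```

With the default negative shape, weakly expressed genes are more likely to drop out than strongly expressed ones, and the midpoint is where the probability equals 0.5.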

Returns

sim – Dictionary containing the following keys:

  • counts – Simulated expression counts.

  • group – The group or path the cell belongs to.

  • batch – The batch the cell was sampled from.

  • exp_lib_size – The expected library size for that cell.

  • step (paths only) – How far along the path each cell is.

  • base_gene_mean – The base expression level for that gene.

  • outlier_factor – Expression outlier factor for that gene. Values of 1 indicate the gene is not an expression outlier.

  • gene_mean – Expression level after applying outlier factors.

  • batch_fac_[batch] – The batch effects factor for each gene for a particular batch.

  • de_fac_[group] – The differential expression factor for each gene in a particular group. Values of 1 indicate the gene is not differentially expressed.

  • sigma_fac_[path] – Factor applied to genes that have non-linear changes in expression along a path.

  • batch_cell_means – The mean expression of genes in each cell after adding batch effects.

  • base_cell_means – The mean expression of genes in each cell after any differential expression and adjusted for expected library size.

  • bcv – The Biological Coefficient of Variation for each gene in each cell.

  • cell_means – The mean expression level of genes in each cell adjusted for BCV.

  • true_counts – The simulated counts before dropout.

  • dropout – Logical matrix showing which values have been dropped in which cells.

Return type

dict

Splatter

Functions:

SplatSimulate([method, batch_cells, …])

Simulate count data from a fictional single-cell RNA-seq experiment using the Splat method.

install([site_repository, update, version, …])

Install the required R packages to run Splatter.

scprep.run.splatter.SplatSimulate(method='paths', batch_cells=100, n_genes=10000, batch_fac_loc=0.1, batch_fac_scale=0.1, mean_rate=0.3, mean_shape=0.6, lib_loc=11, lib_scale=0.2, lib_norm=False, out_prob=0.05, out_fac_loc=4, out_fac_scale=0.5, de_prob=0.1, de_down_prob=0.1, de_fac_loc=0.1, de_fac_scale=0.4, bcv_common=0.1, bcv_df=60, dropout_type='none', dropout_prob=0.5, dropout_mid=0, dropout_shape=-1, group_prob=1, path_from=0, path_n_steps=100, path_skew=0.5, path_nonlinear_prob=0.1, path_sigma_fac=0.8, seed=None, verbose=1, path_length=None)[source]

Simulate count data from a fictional single-cell RNA-seq experiment using the Splat method.

SplatSimulate is a Python wrapper for the R package Splatter. For more details, read about Splatter on GitHub and Bioconductor.

Parameters
  • batch_cells (list-like or int, optional (default: 100)) – The number of cells in each batch.

  • n_genes (int, optional (default:10000)) – The number of genes to simulate.

  • batch_fac_loc (float, optional (default: 0.1)) – Location (meanlog) parameter for the batch effects factor log-normal distribution.

  • batch_fac_scale (float, optional (default: 0.1)) – Scale (sdlog) parameter for the batch effects factor log-normal distribution.

  • mean_shape (float, optional (default: 0.6)) – Shape parameter for the mean gamma distribution.

  • mean_rate (float, optional (default: 0.3)) – Rate parameter for the mean gamma distribution.

  • lib_loc (float, optional (default: 11)) – Location (meanlog) parameter for the library size log-normal distribution, or mean for the normal distribution.

  • lib_scale (float, optional (default: 0.2)) – Scale (sdlog) parameter for the library size log-normal distribution, or sd for the normal distribution.

  • lib_norm (bool, optional (default: False)) – Whether to use a normal distribution instead of the usual log-normal distribution.

  • out_prob (float, optional (default: 0.05)) – Probability that a gene is an expression outlier.

  • out_fac_loc (float, optional (default: 4)) – Location (meanlog) parameter for the expression outlier factor log-normal distribution.

  • out_fac_scale (float, optional (default: 0.5)) – Scale (sdlog) parameter for the expression outlier factor log-normal distribution.

  • de_prob (float, optional (default: 0.1)) – Probability that a gene is differentially expressed in each group or path.

  • de_down_prob (float, optional (default: 0.1)) – Probability that a differentially expressed gene is down-regulated.

  • de_fac_loc (float, optional (default: 0.1)) – Location (meanlog) parameter for the differential expression factor log-normal distribution.

  • de_fac_scale (float, optional (default: 0.4)) – Scale (sdlog) parameter for the differential expression factor log-normal distribution.

  • bcv_common (float, optional (default: 0.1)) – Underlying common dispersion across all genes.

  • bcv_df (float, optional (default: 60)) – Degrees of freedom for the BCV inverse chi-squared distribution.

  • dropout_type ({'none', 'experiment', 'batch', 'group', 'cell', 'binomial'}, optional (default: 'none')) – The type of dropout to simulate. “none” indicates no dropout, “experiment” is global dropout using the same parameters for every cell, “batch” uses the same parameters for every cell in each batch, “group” uses the same parameters for every cell in each group, “cell” uses a different set of parameters for each cell, and “binomial” performs post-hoc binomial undersampling.

  • dropout_mid (list-like or float, optional (default: 0)) – Midpoint parameter for the dropout logistic function.

  • dropout_shape (list-like or float, optional (default: -1)) – Shape parameter for the dropout logistic function.

  • dropout_prob (float, optional (default: 0.5)) – Probability for binomial undersampling dropout.

  • group_prob (list-like or int, optional (default: 1, shape=[n_groups])) – The probabilities that cells come from particular groups.

  • path_from (list-like, optional (default: 0, shape=[n_groups])) – Vector giving the originating point of each path.

  • path_n_steps (list-like, optional (default: 100, shape=[n_groups])) – Vector giving the number of steps to simulate along each path.

  • path_skew (list-like, optional (default: 0.5, shape=[n_groups])) – Vector giving the skew of each path.

  • path_nonlinear_prob (float, optional (default: 0.1)) – Probability that a gene changes expression in a non-linear way along the differentiation path.

  • path_sigma_fac (float, optional (default: 0.8)) – Sigma factor for non-linear gene paths.

  • seed (int or None, optional (default: None)) – Seed to use for generating random numbers.

  • verbose (int, optional (default: 1)) – Logging verbosity between 0 and 2.
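
The location/scale and shape/rate conventions used by the parameters above can be made concrete with a small standard-library sketch. It only illustrates the parameterization and is not Splatter's implementation; the helper names draw_gene_means and draw_library_sizes are hypothetical. Note that random.gammavariate expects a scale, so the rate has to be inverted.

```python
import random
import statistics


def draw_gene_means(n_genes, mean_shape=0.6, mean_rate=0.3, rng=None):
    """Draw per-gene base means from a gamma distribution (shape/rate form)."""
    if rng is None:
        rng = random.Random()
    # random.gammavariate takes (shape, scale); scale is 1 / rate.
    return [rng.gammavariate(mean_shape, 1.0 / mean_rate) for _ in range(n_genes)]


def draw_library_sizes(n_cells, lib_loc=11, lib_scale=0.2, rng=None):
    """Draw per-cell expected library sizes from a log-normal distribution."""
    if rng is None:
        rng = random.Random()
    # lib_loc and lib_scale are the mean and sd of log(library size).
    return [rng.lognormvariate(lib_loc, lib_scale) for _ in range(n_cells)]


rng = random.Random(0)
gene_means = draw_gene_means(20000, rng=rng)
lib_sizes = draw_library_sizes(20000, rng=rng)

# The gamma mean is shape / rate, so the sample mean should be near 0.6 / 0.3 = 2.
print(statistics.fmean(gene_means))
# The log-normal median is exp(lib_loc), so the sample median should be near exp(11).
print(statistics.median(lib_sizes))
```

With the defaults, most genes have a low base mean (the gamma shape is below 1), while library sizes cluster around exp(11), roughly 60,000 counts per cell.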

Returns

sim – Contains the following keys:

  • counts – Simulated expression counts.

  • group – The group or path the cell belongs to.

  • batch – The batch the cell was sampled from.

  • exp_lib_size – The expected library size for that cell.

  • step (paths only) – How far along the path each cell is.

  • base_gene_mean – The base expression level for that gene.

  • outlier_factor – Expression outlier factor for that gene. Values of 1 indicate the gene is not an expression outlier.

  • gene_mean – Expression level after applying outlier factors.

  • batch_fac_[batch] – The batch effects factor for each gene for a particular batch.

  • de_fac_[group] – The differential expression factor for each gene in a particular group. Values of 1 indicate the gene is not differentially expressed.

  • sigma_fac_[path] – Factor applied to genes that have non-linear changes in expression along a path.

  • batch_cell_means – The mean expression of genes in each cell after adding batch effects.

  • base_cell_means – The mean expression of genes in each cell after any differential expression and adjusted for expected library size.

  • bcv – The Biological Coefficient of Variation for each gene in each cell.

  • cell_means – The mean expression level of genes in each cell adjusted for BCV.

  • true_counts – The simulated counts before dropout.

  • dropout – Logical matrix showing which values have been dropped in which cells.

Return type

dict
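
How true_counts, dropout, and counts relate can be sketched in plain Python. The sketch assumes that dropout probability is a logistic function of the log cell mean, with dropout_mid as the midpoint and dropout_shape as the steepness, and that dropped entries are set to zero. That is an assumption about the model, not a copy of Splatter's code, and the helper names are hypothetical. It corresponds to the “experiment” setting; “binomial” dropout uses dropout_prob instead.

```python
import math
import random


def dropout_probability(cell_mean, dropout_mid=0.0, dropout_shape=-1.0):
    """Logistic dropout probability as a function of log mean expression.

    cell_mean must be positive because its logarithm is taken.
    """
    x = math.log(cell_mean)
    return 1.0 / (1.0 + math.exp(-dropout_shape * (x - dropout_mid)))


def apply_dropout(true_counts, cell_means, dropout_mid=0.0,
                  dropout_shape=-1.0, seed=None):
    """Zero out entries of true_counts according to a sampled dropout mask."""
    rng = random.Random(seed)
    counts, dropout = [], []
    for count_row, mean_row in zip(true_counts, cell_means):
        mask_row = [
            rng.random() < dropout_probability(m, dropout_mid, dropout_shape)
            for m in mean_row
        ]
        dropout.append(mask_row)
        counts.append([0 if dropped else c
                       for c, dropped in zip(count_row, mask_row)])
    return counts, dropout


# Toy data: 2 cells x 3 genes (values are made up for illustration).
true_counts = [[5, 0, 12], [3, 7, 1]]
cell_means = [[4.0, 0.2, 10.0], [2.5, 6.0, 1.0]]
counts, dropout = apply_dropout(true_counts, cell_means, seed=1)
print(counts)
print(dropout)
```

With the default dropout_shape of -1, the probability falls as mean expression rises, so lowly expressed genes are dropped more often.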

scprep.run.splatter.install(site_repository=None, update=False, version=None, verbose=True)[source]

Install the required R packages to run Splatter.

Parameters
  • site_repository (string, optional (default: None)) – Additional repository in which to look for packages to install. This repository will be prepended to the default repositories.

  • update (boolean, optional (default: False)) – When False, don’t attempt to update old packages. When True, update old packages automatically.

  • version (string, optional (default: None)) – Bioconductor version to install, e.g., version = “3.8”. The special symbol version = “devel” installs the current ‘development’ version. If None, installs from the current version.

  • verbose (boolean, optional (default: True)) – Install script verbosity.
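
A minimal sketch of wrapping the installer so it can be called once before a first simulation. The wrapper name ensure_splatter and its installer argument are hypothetical; the argument exists only so the sketch can be exercised without R. Running the default path requires scprep, rpy2, and a working R installation.

```python
def ensure_splatter(version=None, update=False, installer=None):
    """Call the Splatter installer with explicit keyword arguments."""
    if installer is None:
        # Imported lazily: needs scprep with its R dependencies available.
        import scprep

        installer = scprep.run.splatter.install
    return installer(version=version, update=update, verbose=True)


def record_call(**kwargs):
    """Stand-in installer that only reports what it was asked to do."""
    print("would install with", kwargs)
    return kwargs


# Dry run with the stand-in; drop the installer argument to install for real.
requested = ensure_splatter(version="3.8", installer=record_call)
```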

Slingshot

Functions:

Slingshot(data, cluster_labels[, …])

Perform lineage inference with Slingshot.

install([site_repository, update, version, …])

Install the required R packages to run Slingshot.

scprep.run.slingshot.Slingshot(data, cluster_labels, start_cluster=None, end_cluster=None, distance=None, omega=None, shrink=True, extend='y', reweight=True, reassign=True, thresh=0.001, max_iter=15, stretch=2, smoother='smooth.spline', shrink_method='cosine', allow_breaks=True, seed=None, verbose=1, **kwargs)[source]

Perform lineage inference with Slingshot.

Given a reduced-dimensional data matrix n by p and a vector of cluster labels (or matrix of soft cluster assignments, potentially including a -1 label for “unclustered”), this function performs lineage inference using a cluster-based minimum spanning tree and constructing simultaneous principal curves for branching paths through the tree.

For more details, read about Slingshot on GitHub and Bioconductor.

Parameters
  • data (array-like, shape=[n_samples, n_dimensions]) – matrix of (reduced dimension) coordinates to be used for lineage inference.

  • cluster_labels (list-like, shape=[n_samples]) – a vector of cluster labels, optionally including -1’s for “unclustered.”

  • start_cluster (string, optional (default: None)) – indicates the cluster(s) of origin. Lineages will be represented by paths coming out of this cluster.

  • end_cluster (string, optional (default: None)) – indicates the cluster(s) which will be forced leaf nodes. This introduces a constraint on the MST algorithm.

  • distance (callable, optional (default: None)) – method for calculating distances between clusters. Must take two matrices as input, corresponding to subsets of data. If the minimum cluster size is larger than the number of dimensions, the default is to use the joint covariance matrix to find squared distance between cluster centers. If not, the default is to use the diagonal of the joint covariance matrix. Not currently implemented.

  • omega (float, optional (default: None)) – this granularity parameter determines the distance between every real cluster and the artificial cluster. It is parameterized such that this distance is omega / 2, making omega the maximum distance between two connected clusters. By default, omega = Inf.

  • shrink (boolean or float, optional (default: True)) – boolean or numeric between 0 and 1, determines whether and how much to shrink branching lineages toward their average prior to the split.

  • extend ({'y', 'n', 'pc1'}, optional (default: "y")) – how to handle root and leaf clusters of lineages when constructing the initial, piece-wise linear curve.

  • reweight (boolean, optional (default: True)) – whether to allow cells shared between lineages to be reweighted during curve-fitting. If True, cells shared between lineages will be iteratively reweighted based on the quantiles of their projection distances to each curve.

  • reassign (boolean, optional (default: True)) – whether to reassign cells to lineages at each iteration. If True, cells will be added to a lineage when their projection distance to the curve is less than the median distance for all cells currently assigned to the lineage. Additionally, shared cells will be removed from a lineage if their projection distance to the curve is above the 90th percentile and their weight along the curve is less than 0.1.

  • thresh (float, optional (default: 0.001)) – determines the convergence criterion. Percent change in the total distance from cells to their projections along curves must be less than thresh.

  • max_iter (int, optional (default: 15)) – maximum number of iterations.

  • stretch (int, optional (default: 2)) – factor between 0 and 2 by which curves can be extrapolated beyond endpoints.

  • smoother ({"smooth.spline", "lowess", "periodic_lowess"}, optional (default: "smooth.spline")) – choice of smoother. “periodic_lowess” allows one to fit closed curves. Beware, you may want to use max_iter=0 with “lowess”.

  • shrink_method (string, optional (default: "cosine")) – how to determine the appropriate amount of shrinkage for a branching lineage. Accepted values: “gaussian”, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “triweight”, “cosine”, “optcosine”, “density”.

  • allow_breaks (boolean, optional (default: True)) – determines whether curves that branch very close to the origin should be allowed to have different starting points.

  • seed (int or None, optional (default: None)) – Seed to use for generating random numbers.

  • verbose (int, optional (default: 1)) – Logging verbosity between 0 and 2.
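
The interaction between thresh and max_iter described above can be sketched as a generic stopping rule. This is an illustration under stated assumptions, not Slingshot's code: the change is treated as a relative (fractional) change in total projection distance, and update_step is a hypothetical callable standing in for one round of curve fitting.

```python
def has_converged(previous_total, current_total, thresh=0.001):
    """Return True when the relative change in total distance is below thresh."""
    if previous_total == 0:
        return current_total == 0
    return abs(previous_total - current_total) / previous_total < thresh


def fit_until_converged(update_step, initial_total, thresh=0.001, max_iter=15):
    """Repeat update_step until convergence or until max_iter is reached.

    update_step takes the current total distance and returns the new one.
    Returns the final total distance and the number of iterations run.
    """
    total = initial_total
    for iteration in range(1, max_iter + 1):
        new_total = update_step(total)
        if has_converged(total, new_total, thresh):
            return new_total, iteration
        total = new_total
    return total, max_iter


# Toy update: each round halves the gap between the total distance and 10.
final_total, iterations = fit_until_converged(
    lambda total: 10 + 0.5 * (total - 10), initial_total=100.0
)
print(final_total, iterations)
```

A smaller thresh demands more iterations before stopping, and max_iter caps the work when the criterion is never met.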

Returns

  • slingshot (dict) – Contains the following keys:

  • pseudotime (array-like, shape=[n_samples, n_curves]) – Pseudotime projection of each cell onto each principal curve. Value is np.nan if the cell does not lie on the curve.

  • branch (list-like, shape=[n_samples]) – Branch assignment for each cell.

  • curves (array-like, shape=[n_curves, n_samples, n_dimensions]) – Coordinates of each principal curve in the reduced dimension.
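
Because pseudotime holds np.nan wherever a cell does not lie on a curve, downstream code has to skip those entries. One simple option, sketched here with only the standard library, is to average each cell's pseudotime over the curves it belongs to. The helper name mean_pseudotime is hypothetical, and averaging is just one possible way to collapse the columns.

```python
import math


def mean_pseudotime(pseudotime):
    """Average each cell's pseudotime over the curves it lies on.

    pseudotime is an iterable of rows, one row per cell and one value per
    curve. Cells that lie on no curve get NaN.
    """
    averaged = []
    for row in pseudotime:
        values = [float(v) for v in row if not math.isnan(v)]
        averaged.append(sum(values) / len(values) if values else math.nan)
    return averaged


# Toy input: 4 cells, 2 curves (values are made up for illustration).
toy_pseudotime = [
    [0.0, 0.0],            # shared by both curves
    [1.5, math.nan],       # only on the first curve
    [math.nan, 2.0],       # only on the second curve
    [math.nan, math.nan],  # on neither curve
]
combined = mean_pseudotime(toy_pseudotime)
print(combined)
```

The function iterates row by row, so it should also accept the rows of a two-dimensional NumPy array such as slingshot['pseudotime'].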

Examples

>>> import scprep
>>> import phate
>>> data, clusters = phate.tree.gen_dla(n_branch=4, n_dim=200, branch_length=200)
>>> phate_op = phate.PHATE()
>>> data_phate = phate_op.fit_transform(data)
>>> slingshot = scprep.run.Slingshot(data_phate, clusters)
>>> ax = scprep.plot.scatter2d(
...     data_phate,
...     c=slingshot['pseudotime'][:,0],
...     cmap='magma',
...     legend_title='Branch 1'
... )
>>> scprep.plot.scatter2d(
...     data_phate,
...     c=slingshot['pseudotime'][:,1],
...     cmap='viridis',
...     ax=ax,
...     ticks=False,
...     label_prefix='PHATE',
...     legend_title='Branch 2'
... )
>>> for curve in slingshot['curves']:
...     ax.plot(curve[:,0], curve[:,1], c='black')
>>> ax = scprep.plot.scatter2d(data_phate, c=slingshot['branch'],
...                        legend_title='Branch', ticks=False, label_prefix='PHATE')
>>> for curve in slingshot['curves']:
...     ax.plot(curve[:,0], curve[:,1], c='black')

scprep.run.slingshot.install(site_repository=None, update=False, version=None, verbose=True)[source]

Install the required R packages to run Slingshot.

Parameters
  • site_repository (string, optional (default: None)) – Additional repository in which to look for packages to install. This repository will be prepended to the default repositories.

  • update (boolean, optional (default: False)) – When False, don’t attempt to update old packages. When True, update old packages automatically.

  • version (string, optional (default: None)) – Bioconductor version to install, e.g., version = “3.8”. The special symbol version = “devel” installs the current ‘development’ version. If None, installs from the current version.

  • verbose (boolean, optional (default: True)) – Install script verbosity.