Specifically, we benchmarked 16 popular data integration tools on 13 data integration tasks consisting of up to 23 batches and 1 million cells, for both single-cell RNA-sequencing (scRNA-seq) and single-cell ATAC-sequencing (scATAC-seq) data. 4c), fully merged batches within cell-type clusters. Even unintegrated data in gene activity space lacked biological variation in cell identities compared to the same data on peaks or windows. 874656 awarded to F.J.T., by the Wellcome Trust grant no. We compared trajectories computed after integration for certain clusters that had been manually selected during the data preprocessing step. A test metric for assessing single-cell RNA-seq batch correction. Methods that can remove strong batch effects also tend to remove nuanced biological signals or require cell identity labels obtained via per-batch data processing. Poor method scalability particularly affected scATAC-seq integration, which typically has a larger feature space. Here, we present SCALEX, a deep-learning method that integrates single-cell data by projecting cells into a batch-invariant, common cell-embedding space in a truly online manner. This trend could also be seen on the large ATAC peak and window tasks, which proved prohibitively large for most methods due to poor scaling with the number of features. Oetjen, K. A. et al. 2a) to the integrated data plots (Fig. For a more complete description of the challenges of horizontal integration, as well as benchmarking.
Benchmarking atlas-level data integration in single-cell genomics. This score is computed as the sum of the partial average package and paper usability scores, and plotted on top in a barplot. Methods. As expected, using more features led to both longer runtimes and higher memory usage. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. Methods that use cell identity information (scGen and scANVI) must be considered separately in this tradeoff. Methods that failed to run for a particular task were assigned the unintegrated ranking for that task. This effect was particularly noticeable for the immune cell human and mouse and mouse brain tasks. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. We converted between R and Python data formats using anndata2ri (www.github.com/theislab/anndata2ri) and conversion functions in LIGER and Seurat. Using this approach, we were able to compute comparable overall performance scores even when different numbers of metrics were computed per run. supervised the work. All of the top-performing methods exhibited high trajectory conservation scores, whereas DESC (on scaled/HVG data), scGen (on scaled/full feature data) and Seurat v3 CCA (on scaled/HVG data) produced poor conservation of this trajectory due to overclustering (DESC), merging of cell types (Seurat v3 CCA) or lack of relevant biological latent structure (scGen). Second, we rescaled the LISI scores as follows: \(\mathrm{cLISI}: f(x) = \frac{B - x}{B - 1}\), where a 0 value corresponds to low cell-type separation, and \(\mathrm{iLISI}: g(x) = \frac{x - 1}{B - 1}\), where a 0 value corresponds to low batch integration.
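The LISI rescaling above can be illustrated with a minimal sketch. The function names are hypothetical, and B is assumed here to be the upper bound of the raw LISI range (raw scores lying in [1, B]); this excerpt does not define B, so treat that reading as an assumption.

```python
def rescale_clisi(x: float, upper_bound: float) -> float:
    """cLISI rescaling f(x) = (B - x) / (B - 1).

    A result of 0 corresponds to low cell-type separation.
    `upper_bound` plays the role of B (assumed maximum raw LISI value).
    """
    if upper_bound <= 1:
        raise ValueError("B must be greater than 1")
    return (upper_bound - x) / (upper_bound - 1)


def rescale_ilisi(x: float, upper_bound: float) -> float:
    """iLISI rescaling g(x) = (x - 1) / (B - 1).

    A result of 0 corresponds to low batch integration.
    """
    if upper_bound <= 1:
        raise ValueError("B must be greater than 1")
    return (x - 1) / (upper_bound - 1)


# Illustrative values only: B = 4, raw scores chosen by hand.
print(rescale_clisi(1.0, 4))  # raw cLISI at its minimum maps to 1.0
print(rescale_ilisi(2.5, 4))  # raw iLISI halfway through the range maps to 0.5
```

Both functions map the raw range onto [0, 1] so that higher is better for either score, which is what allows them to be averaged with other metrics.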
For example, metrics that run on kNN graphs can be run on all output types after preprocessing. The complexity of single-cell omics datasets is increasing.
& Theis, F. J. scGen predicts single-cell perturbation responses. A single-cell transcriptomic atlas characterizes ageing tissues in the mouse. In the absence of labels, given no further information on the integration task, we recommend the top-performing integration methods Scanorama and scVI, especially for sufficiently large datasets. Our freely available Python module and benchmarking pipeline can identify optimal data integration methods for new data, benchmark new methods and improve method development. b, The overall scores for the best performing methods on each task. To evaluate how integration methods scale with increasing numbers of features, we fitted further linear regression models with CPU time and memory, respectively, as the dependent variable and both the number of cells and the number of features on a log scale as the independent variables. Here, f(x) denotes the log-scaled CPU time or memory consumption, N denotes the number of cells in the task and F denotes the number of features. LISI scores are computed from neighborhood lists per node from integrated kNN graphs. Thus, joint analysis of atlas datasets requires reliable data integration. 5 for the full plot). Cell 71, 858–871.e8 (2018). Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2010). Individual and aggregated scores are represented by circles and bars, respectively. By taking the mean of these per-task rankings we ordered the methods by overall performance across tasks. Given runtime and memory limitations imposed in our benchmark, trVAE could not integrate datasets with >34,000 cells, while Seurat v3, MNN and scGen failed to integrate datasets with >100,000 cells (Supplementary Data 3). M. Colomé-Tatché or Fabian J. Theis.
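The mean-rank ordering described above can be sketched as follows. This is a minimal illustration, not the benchmark's pipeline: the function names, the method names and all scores are hypothetical, ties are not handled, and the rule that a failed method inherits the rank of unintegrated data is taken from the ranking statement elsewhere in this text.

```python
from statistics import mean


def rank_task(scores: dict) -> dict:
    """Rank methods on one task by descending overall score (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {method: position + 1 for position, method in enumerate(ordered)}


def mean_ranks(task_scores: dict, methods: list, baseline: str = "unintegrated") -> dict:
    """Average per-task ranks; a method missing from a task gets the baseline's rank."""
    collected = {method: [] for method in methods}
    for scores in task_scores.values():
        ranks = rank_task(scores)
        for method in methods:
            collected[method].append(ranks.get(method, ranks[baseline]))
    return {method: mean(ranks) for method, ranks in collected.items()}


# Hypothetical scores: method_b failed on task_2, so it has no entry there.
task_scores = {
    "task_1": {"method_a": 0.8, "method_b": 0.6, "unintegrated": 0.5},
    "task_2": {"method_a": 0.7, "unintegrated": 0.4},
}
overall = mean_ranks(task_scores, ["method_a", "method_b", "unintegrated"])
for method in sorted(overall, key=overall.get):
    print(method, overall[method])
```

Ranking per task before averaging keeps tasks with different score ranges from dominating the overall ordering.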
Using these subset kNN graphs, we computed the graph connectivity (GC) score using the equation: \(\mathrm{GC} = \frac{1}{|C|}\sum_{c \in C} \frac{|\mathrm{LCC}(G(N_c; E_c))|}{|N_c|}\), where \(G(N_c; E_c)\) denotes the subset kNN graph for cell identity c. Here, C represents the set of cell identity labels, \(|\mathrm{LCC}()|\) is the number of nodes in the largest connected component of the graph and \(|N_c|\) is the number of nodes with cell identity c. The resultant score has a range of (0;1], where 1 indicates that all cells with the same cell identity are connected in the integrated kNN graph and the lowest possible score indicates a graph where no cell is connected. In contrast, integrated graphs that are output by methods such as Conos or BBKNN typically contain far fewer than k=90 neighbors. The kBET algorithm (v.0.99.6, release 4c9dafa) determines whether the label composition of a k nearest neighborhood of a cell is similar to the expected (global) label composition11.
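As a hedged illustration of the GC score, the sketch below reads the definition as the mean, over cell identity labels, of the fraction of each label's cells that fall in the largest connected component of that label's subgraph. The function names and the toy graph are hypothetical, and treating the kNN graph as undirected is an assumption.

```python
from collections import defaultdict


def largest_component_size(nodes, adjacency) -> int:
    """Size of the largest connected component among `nodes` only."""
    nodes = set(nodes)
    seen = set()
    best = 0
    for start in nodes:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency.get(node, ()):
                # Edges leaving the label's node set are ignored.
                if neighbor in nodes and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best


def graph_connectivity(edges, labels: dict) -> float:
    """Mean over labels of |LCC of label subgraph| / |cells with that label|."""
    adjacency = defaultdict(set)
    for u, v in edges:  # assumption: edges are treated as undirected
        adjacency[u].add(v)
        adjacency[v].add(u)
    by_label = defaultdict(list)
    for node, label in labels.items():
        by_label[label].append(node)
    fractions = [
        largest_component_size(members, adjacency) / len(members)
        for members in by_label.values()
    ]
    return sum(fractions) / len(fractions)


# Toy graph: the three "T" cells are connected; the two "B" cells are not.
labels = {0: "T", 1: "T", 2: "T", 3: "B", 4: "B"}
edges = [(0, 1), (1, 2), (3, 0)]
print(graph_connectivity(edges, labels))
```

In the toy graph the edge between cells 3 and 0 crosses labels, so it does not connect the two "B" cells, and the "B" label contributes 1/2 to the average.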
The first six categories, grouped under a Package score (open source, version control, unit testing, GitHub repository, tutorial and function documentation), assess the quality of the code, its availability, the presence of a tutorial to guide users through one or more examples, GitHub issue activity and responsiveness and (ideally) usage in a nonnative language (that is, from Python to R or vice versa). Methods were ranked as detailed in the Ranking and metric aggregation section above and bars were shaded by rank. Even methods that did integrate datasets across species failed to reconstruct a consistent global trajectory structure (scGen and FastMNN) or poorly reflected the trajectory (LIGER). Polański, K. et al. We thank T. Walzthoeni for support with up-scaling the analysis provided at the Bioinformatics Core Facility, Institute of Computational Biology, Helmholtz Zentrum München. For example, Seurat v3 CCA removed variation within cells from a single batch that otherwise showed substructure in unintegrated data (lung task in Supplementary Note 3). Human bone marrow assessment by single-cell RNA sequencing, mass cytometry, and flow cytometry. We used the same set of cell-cycle genes for mouse and human data (using capitalization to convert between the gene symbols). This model was fit separately for each method using the ordinary least squares fit function ols from the statsmodels.formula.api module (statsmodels v.0.11.1) on unscaled data (using both full feature and HVG preprocessed data, as well as all ATAC results). a, Scatter plot of the mean overall batch correction score against mean overall bio-conservation score for the selected methods on RNA tasks. For example, metrics based on corrected embeddings (Silhouette scores, principal component regression and cell-cycle conservation) were not run where only a corrected graph was output.
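The per-method scalability fit can be sketched as an ordinary least squares regression of log-scaled CPU time on the log-scaled numbers of cells and features. The text states that ols from statsmodels.formula.api was used; the sketch below swaps in a small standard-library solver so that it runs without dependencies. The function names, the base-10 logarithm, the coefficient layout and the synthetic run table are all assumptions made for illustration.

```python
from math import log10


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_scaling(runs):
    """Least-squares fit of log10(cost) = b0 + b1*log10(N) + b2*log10(F).

    `runs` holds (n_cells, n_features, cost) tuples for one method.
    Returns [b0, b1, b2] from the normal equations (X^T X) b = X^T y.
    """
    rows = [[1.0, log10(n), log10(f)] for n, f, _ in runs]
    y = [log10(cost) for _, _, cost in runs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Synthetic runs: cost grows linearly in cells and with the square root of features.
runs = [
    (n, f, 2.0 * n * f ** 0.5)
    for n in (1_000, 10_000, 100_000)
    for f in (2_000, 20_000)
]
intercept, cell_slope, feature_slope = fit_scaling(runs)
print(round(cell_slope, 3), round(feature_slope, 3))
```

On the log scale the fitted slopes act as scaling exponents, so a feature slope well above zero flags a method that scales poorly with the size of the feature space.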
The HVG conservation score is a proxy for the preservation of the biological signal. kNN graphs were computed using the neighbors function where k=15 unless otherwise specified. We acknowledge M. Nawijn, H. Schiller and L. Simon for provision of data and expertise in mapping of lung cell annotations. For the real data tasks, we downloaded 23 published datasets (see Supplementary Data 2 for a per-batch overview of datasets). Overview of cellular identities in the pancreas and mouse brain integration tasks. Particular cell types such as endothelial cells perform different functions in these locations (for example, gas exchange in the parenchyma). Scaling the input data typically shifted results toward better batch removal but worse bio-conservation, while HVG selection improved overall performance. In contrast, data scaling had little influence on CPU time, but the reduced data sparsity after scaling did increase peak memory use. Methods that could not be run for a particular task were assigned the same rank as unintegrated data on this task. Overall, a batchASW of 1 represents ideal batch mixing and a value of 0 indicates strongly separated batches. Here, we benchmark 38 method and preprocessing combinations on 77 batches of gene expression, chromatin accessibility, and simulation data from 23 publications, altogether representing >1.2 million cells distributed in nine atlas-level integration tasks. Specifically, we chose the width as the difference of the log-scaled bounds and the height C as \(10^8\) s (3 years or 24 days on 48 cores) and \(10^7\) MB (10 TB), respectively. Methods that scale well have a low AUC and, consequently, a low scaled AUC. Thus, up to 323 GB of memory was available for each run. and K.C. b, The overall scores for the best performing method, preprocessing and output combinations on each task as well as their usability and scalability. Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.
9 Usability assessment of data integration methods. Both metrics were computed on the embeddings provided by integration methods or the PCA of expression matrices in case of feature output. Points marked with a cross use scaled features. Furthermore, MNN scaled least favorably in CPU time, while scGen and trVAE used the most CPU time on the tasks we tested. Subsequently, kBET scores for each label were averaged and subtracted from 1 to give a final kBET score. Methods that favor bio-conservation and output corrected expression matrices tended to better conserve cell state variation.
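The final kBET aggregation step is simple enough to show directly. In this sketch the function name and the per-label values are hypothetical, and the per-label inputs are assumed to be values in [0, 1] where lower means better mixing, so that the final score is higher for better-mixed data.

```python
from statistics import mean


def final_kbet_score(per_label_scores: dict) -> float:
    """Average the per-label kBET scores and subtract the mean from 1."""
    if not per_label_scores:
        raise ValueError("at least one label is required")
    return 1 - mean(per_label_scores.values())


# Hypothetical per-label values, one entry per cell identity label.
per_label = {"T cell": 0.25, "B cell": 0.75}
print(final_kbet_score(per_label))
```

Averaging per label rather than per cell gives every cell identity equal weight regardless of how many cells carry that label.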
The silhouette width measures the relationship between the within-cluster distances of a cell and the between-cluster distances of that cell to the closest cluster43. To restrict the feature space, we used only the most variable peaks, windows or genes that overlap between datasets (Methods and Supplementary Note 3). Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.
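The silhouette width described above can be written out with the standard definition s = (b - a) / max(a, b), where a is the mean distance from a cell to the other cells of its own cluster and b is the mean distance to the cells of the closest other cluster. The sketch below is a plain standard-library illustration, not the benchmark's implementation; the function name, the Euclidean distance and the toy embedding are assumptions, and at least two clusters are required.

```python
from math import dist
from statistics import mean


def silhouette_width(index: int, points, labels) -> float:
    """Silhouette width of one cell, using Euclidean distances."""
    own = labels[index]
    within = [
        dist(points[index], p)
        for j, p in enumerate(points)
        if labels[j] == own and j != index
    ]
    if not within:
        # Common convention: a cell alone in its cluster scores 0.
        return 0.0
    a = mean(within)
    # Mean distance to each other cluster; keep the closest one.
    b = min(
        mean(dist(points[index], p) for j, p in enumerate(points) if labels[j] == other)
        for other in set(labels)
        if other != own
    )
    return (b - a) / max(a, b)


# Toy two-dimensional embedding with two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = ["a", "a", "b", "b"]
widths = [silhouette_width(i, points, labels) for i in range(len(points))]
print([round(w, 3) for w in widths])
```

Values near 1 indicate a cell that sits well inside its own cluster, values near 0 a cell on a cluster boundary, and negative values a cell closer to another cluster than to its own.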