Dirichlet Process Cluster Analysis of TCR Distances — DirichletClusterAnalysis • tcrClustR

High-level entry point for exploring the natural clustering structure of pairwise TCR distances. For each level of splitField, the function extracts within-group distances from assayName, downsamples them to maxSamples using quantile-stratified sampling (deciles at the default nBins = 10), fits a Gaussian Dirichlet process mixture, and collects the per-cluster parameters (mean distance, spread, mixing weight) into a tidy data frame.

The resulting cluster means tell you where the natural modes of the distance distribution sit, which directly informs the dianaHeight parameter in RunTcrClustering.
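
The fitting step can be illustrated outside the package. The following is a minimal sketch, not the package's internal code: it assumes the dirichletprocess package is installed, uses simulated distances instead of values extracted from a Seurat object, and standardises the data before fitting because the default prior is centred at zero. Whether DirichletClusterAnalysis rescales its input in the same way is an assumption here.

```r
# Sketch only: the distances are simulated, not extracted from a Seurat object.
if (requireNamespace("dirichletprocess", quietly = TRUE)) {
  set.seed(42)
  # Two simulated modes standing in for within-group pairwise distances
  distances <- c(rnorm(60, mean = 20, sd = 3), rnorm(40, mean = 80, sd = 5))

  # Standardise before fitting (assumption: suits the default zero-centred prior)
  center <- mean(distances)
  spread <- sd(distances)
  scaled <- (distances - center) / spread

  dp_model <- dirichletprocess::DirichletProcessGaussian(scaled)
  dp_model <- dirichletprocess::Fit(dp_model, its = 200)

  # Per-cluster parameters, mapped back to the original distance scale
  fit_summary <- data.frame(
    Cluster = seq_len(dp_model$numberClusters),
    Mu = as.numeric(dp_model$clusterParameters[[1]]) * spread + center,
    Sigma = as.numeric(dp_model$clusterParameters[[2]]) * spread,
    PointsPerCluster = dp_model$pointsPerCluster
  )
  fit_summary$MixingProportion <-
    fit_summary$PointsPerCluster / sum(fit_summary$PointsPerCluster)
  print(fit_summary)
}
```

The column names mirror those of cluster_summary so the output is easy to compare with a real result. The number of clusters recovered varies between MCMC runs.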

DirichletClusterAnalysis(
  seuratObj,
  assayName,
  splitField,
  maxSamples = 100,
  nIterations = 1000,
  minClonesPerGroup = 2,
  nBins = 10,
  samplesPerBin = NULL,
  seed = 42,
  verbose = TRUE
)

Arguments

seuratObj: A Seurat object produced by CalculateTcrDistances (must have @misc$TCR_Distances).
assayName: Character. Distance assay to analyse (e.g., "TRA_fl", "TRB_cdr3", "TRA_TRB_fl").
splitField: Character. Metadata column whose levels define groups (e.g., "cDNA_ID", "Tissue").
maxSamples: Integer. Maximum number of pairwise distances to sample per group before fitting the Dirichlet process mixture. Controls compute cost. Default 100.
nIterations: Integer. Number of MCMC iterations for each Dirichlet process fit. Default 1000.
minClonesPerGroup: Integer. Groups with fewer clones are skipped. Default 2.
nBins: Integer. Number of equal-frequency quantile bins used during stratified downsampling (the default of 10 gives deciles). Increase to preserve rare modes at the tails of the distance distribution. Default 10.
samplesPerBin: Integer. Samples drawn from each bin. Defaults to max(1, floor(maxSamples / nBins)), which evenly distributes the budget across bins. Override to draw more observations from each stratum.
seed: Integer. RNG seed for reproducible downsampling. Default 42.
verbose: Logical. Print progress messages. Default TRUE.
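
How maxSamples, nBins, and samplesPerBin interact can be sketched in base R. The helper stratified_downsample below is hypothetical and is not part of tcrClustR; it only mirrors the documented default samplesPerBin = max(1, floor(maxSamples / nBins)) and the equal-frequency binning described above, applied to simulated distances.

```r
# Hypothetical helper, not part of tcrClustR: quantile-stratified downsampling
stratified_downsample <- function(x, maxSamples = 100, nBins = 10,
                                  samplesPerBin = NULL, seed = 42) {
  if (is.null(samplesPerBin)) {
    # Documented default: spread the sampling budget evenly across bins
    samplesPerBin <- max(1, floor(maxSamples / nBins))
  }
  set.seed(seed)
  # Equal-frequency bin edges; unique() guards against tied quantiles
  edges <- unique(quantile(x, probs = seq(0, 1, length.out = nBins + 1)))
  if (length(edges) < 2) {
    return(x[seq_len(min(length(x), samplesPerBin))])
  }
  bins <- cut(x, breaks = edges, include.lowest = TRUE)
  # Draw up to samplesPerBin values from each non-empty bin
  sampled <- lapply(split(x, bins, drop = TRUE), function(v) {
    v[sample.int(length(v), min(length(v), samplesPerBin))]
  })
  unlist(sampled, use.names = FALSE)
}

set.seed(1)
# Simulated distances: a dominant mode plus a small distant one
distances <- c(rnorm(5000, mean = 50, sd = 10), rnorm(50, mean = 120, sd = 3))

sub_default <- stratified_downsample(distances)                      # 10 per bin
sub_dense   <- stratified_downsample(distances, samplesPerBin = 100) # 100 per bin
c(default = length(sub_default), dense = length(sub_dense))
```

In this sketch an explicit samplesPerBin takes precedence over maxSamples, so the second call returns more than maxSamples values. Whether the package caps the total in that case is not something the sketch can confirm.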

Value

A list of class "tcrDirichletResult" containing:

cluster_summary: A tidy data.frame with one row per group-cluster combination. Columns: Cluster, Mu, Sigma, MixingProportion, PointsPerCluster, Group.
models: Named list of raw dirichletprocess model objects.
assayName: The assay used.
splitField: The metadata split field used.
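
As a sketch of how cluster_summary might be inspected when choosing a cut height, the example below builds a mock data frame with the documented columns (all values are invented) and computes midpoints between adjacent cluster means within each group. Treating those midpoints as candidate dianaHeight values is an assumed heuristic, not a rule defined by the package.

```r
# Mock cluster_summary with the documented columns; all values are invented
cluster_summary <- data.frame(
  Cluster = c(1, 2, 3, 1, 2),
  Mu = c(12.5, 48.0, 95.2, 15.1, 60.3),
  Sigma = c(3.1, 6.4, 8.0, 2.8, 7.5),
  MixingProportion = c(0.20, 0.50, 0.30, 0.35, 0.65),
  PointsPerCluster = c(20, 50, 30, 35, 65),
  Group = c("A", "A", "A", "B", "B")
)

# Order clusters by mean distance within each group
ordered <- cluster_summary[order(cluster_summary$Group, cluster_summary$Mu), ]

# Midpoints between adjacent cluster means (assumed heuristic for cut heights)
candidate_heights <- lapply(split(ordered$Mu, ordered$Group), function(mu) {
  if (length(mu) < 2) return(numeric(0))
  (head(mu, -1) + tail(mu, -1)) / 2
})
candidate_heights
```

With a real result, replace the mock data frame with dp$cluster_summary. Comparing the candidates across groups shows whether a single height is plausible for all of them.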

Examples

if (FALSE) { # \dontrun{
# after CalculateTcrDistances(), which populates @misc$TCR_Distances:
dp <- DirichletClusterAnalysis(seuratObj, "TRA_fl", "metadata_variable")

# diagnostic plots
PlotClusterMeans(dp)
PlotMixingProportions(dp)

# use cluster means to pick a dianaHeight:
dp$cluster_summary

# alternatively, draw more distances per quantile bin to better capture rare modes
dp <- DirichletClusterAnalysis(seuratObj, "TRA_fl", "metadata_variable",
                               nBins = 10, samplesPerBin = 100)

} # }