Clustering Configuration
The cluster parameters control the clustering, joining, and outlier detection algorithms.
The default recommended settings when using the continuity join method are:

```python
cluster_params = ClusterParameters(
    min_cloud_size=50,
    min_points=5,
    min_size_scale_factor=0.0,
    min_size_lower_cutoff=5,
    cluster_selection_epsilon=13.0,
    overlap_join=None,
    continuity_join=ContinuityOverlapParamters(
        join_radius_fraction=0.3,
        join_z_fraction=0.2,
    ),
    outlier_scale_factor=0.05,
)
```
The default recommended settings when using the circle overlap join method are:

```python
cluster_params = ClusterParameters(
    min_cloud_size=50,
    min_points=3,
    min_size_scale_factor=0.05,
    min_size_lower_cutoff=10,
    cluster_selection_epsilon=10.0,
    overlap_join=OverlapJoinParameters(
        min_cluster_size_join=15.0,
        circle_overlap_ratio=0.25,
    ),
    continuity_join=None,
    outlier_scale_factor=0.05,
)
```
A breakdown of each parameter:
min_cloud_size
The minimum size (in number of points) a point cloud must have to be sent through the clustering algorithm. Any smaller point cloud is discarded as noise.
min_points
The minimum number of samples (points) in a neighborhood for a point to be a core point. This is a re-exposure of the min_samples parameter of scikit-learn's HDBSCAN. Larger values make the algorithm more likely to identify points as noise. See the original HDBSCAN docs for details on why this parameter is important and how it can impact the data.
min_size_scale_factor
HDBSCAN requires a minimum size in samples (the min_cluster_size hyperparameter in scikit-learn's HDBSCAN) for a group to be considered a valid cluster. AT-TPC point clouds vary dramatically in size, from tens of points to thousands. To handle this wide range, we use a scale factor to determine the appropriate minimum size, where min_cluster_size = min_size_scale_factor * n_cloud_points. The default value was found through testing and may need serious adjustment to produce the best results. Note that the scale factor should be small; in the case of continuity joining, scaling was found to be unnecessary, so the factor is set to 0.0.
min_size_lower_cutoff
As discussed under min_size_scale_factor, we scale the min_cluster_size parameter to the size of the point cloud. However, there must be a lower limit (you can't have a minimum cluster size of 0). This parameter sets that lower limit: any min_cluster_size calculated using the scale factor that is smaller than this cutoff is replaced with the cutoff value. For example, if the cutoff is set to 10 and the calculated value is 50, the calculated value is used; if the calculated value is 5, the cutoff is used instead.
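Together, the scale factor and cutoff behave as sketched below (compute_min_cluster_size is a hypothetical helper for illustration, not part of the library):

```python
def compute_min_cluster_size(n_cloud_points: int,
                             min_size_scale_factor: float,
                             min_size_lower_cutoff: int) -> int:
    """Scale min_cluster_size to the point cloud, clamped at the lower cutoff."""
    scaled = int(min_size_scale_factor * n_cloud_points)
    return max(scaled, min_size_lower_cutoff)

# With the circle overlap defaults (factor 0.05, cutoff 10):
print(compute_min_cluster_size(1000, 0.05, 10))  # scaled value 50 is used
print(compute_min_cluster_size(100, 0.05, 10))   # scaled value 5, cutoff 10 wins
```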
cluster_selection_epsilon
A re-exposure of the cluster_selection_epsilon parameter of scikit-learn's HDBSCAN. This parameter merges clusters that are less than epsilon apart. Note that this epsilon must be on the scale of the scaled data (i.e. it is not in physical units). The impact of this parameter is large, and small changes to its value can produce dramatically different results. Larger values bias the clustering toward treating the point cloud as a single cluster (or all noise), while smaller values revert the algorithm to the default behavior of HDBSCAN. See the original HDBSCAN docs for details on why this parameter is important and how it can impact the data.
outlier_scale_factor
We use scikit-learn's LocalOutlierFactor as a final round of noise elimination on a cluster-by-cluster basis. This algorithm requires a number of neighbors to search over (the n_neighbors parameter). As with min_cluster_size in HDBSCAN, we scale this value with the size of the cluster: the factor multiplied by the cluster size gives the number of neighbors to search over (n_neighbors = outlier_scale_factor * cluster_size). This value tends to have a "sweet spot" where it is most effective. If it is too large, every point has essentially the same outlier factor because the entire cluster is included for every point. If it is too small, the variance between neighbors can be too large and the results become unpredictable. Note that if outlier_scale_factor * cluster_size is less than 2, n_neighbors is set to 2, the minimum allowed value.
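The neighbor-count scaling with its floor of 2 can be sketched as follows (compute_n_neighbors is a hypothetical helper, not the library's code):

```python
def compute_n_neighbors(cluster_size: int, outlier_scale_factor: float) -> int:
    """Number of neighbors for LocalOutlierFactor, floored at the minimum of 2."""
    return max(int(outlier_scale_factor * cluster_size), 2)

print(compute_n_neighbors(500, 0.05))  # 25
print(compute_n_neighbors(20, 0.05))   # scaled value 1, floored to 2
```

The result would then be passed as n_neighbors to sklearn.neighbors.LocalOutlierFactor.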
Overlap Join Parameters
min_cluster_size_join
The minimum size of a cluster for it to be considered in the joining step of the clustering. After HDBSCAN has made the initial clusters, we attempt to combine any clusters which have overlapping circles in the 2-D projection (see circle_overlap_ratio). However, small pockets of noise are often clustered and frequently sit within the larger trajectory. To avoid joining these, we require a cluster to have a minimum size.
circle_overlap_ratio
The minimum amount of overlap between circles fit to two clusters for the clusters to be joined into a single cluster. HDBSCAN often fractures trajectories into multiple clusters as the point density changes due to the pad size, gaps, etc. These fragments are grouped together based on how much the circles fit to their 2-D (X-Y) projections overlap.
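The text does not spell out how the overlap is measured; one plausible sketch compares the intersection area of the two fitted circles to the area of the smaller circle, joining the clusters when this ratio exceeds circle_overlap_ratio (circle_overlap_ratio here is a hypothetical function, not the library's implementation):

```python
import math

def circle_overlap_ratio(c1, r1, c2, r2):
    """Intersection area of two circles divided by the smaller circle's area."""
    d = math.dist(c1, c2)
    if d >= r1 + r2:
        return 0.0  # disjoint circles
    rmin = min(r1, r2)
    if d <= abs(r1 - r2):
        return 1.0  # smaller circle entirely inside the larger one
    # Standard circle-circle intersection ("lens") area.
    a1 = r1**2 * math.acos((d**2 + r1**2 - r2**2) / (2 * d * r1))
    a2 = r2**2 * math.acos((d**2 + r2**2 - r1**2) / (2 * d * r2))
    tri = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                          * (d - r1 + r2) * (d + r1 + r2))
    return (a1 + a2 - tri) / (math.pi * rmin**2)

# Two unit circles whose centers are one radius apart overlap by ~39%,
# which would pass the default circle_overlap_ratio of 0.25.
print(circle_overlap_ratio((0.0, 0.0), 1.0, (1.0, 0.0), 1.0))
```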
Continuity Join Parameters
join_radius_fraction
The fraction of the total radius range of the two clusters used to set the threshold for joining. Used as radius_threshold = join_radius_fraction * (max_radius - min_radius), where min_radius and max_radius are the minimum and maximum radius values over the clusters being compared.
join_z_fraction
The fraction of the total z range of the two clusters used to set the threshold for joining. Used as z_threshold = join_z_fraction * (max_z - min_z), where min_z and max_z are the minimum and maximum z values over the clusters being compared.
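Both threshold formulas can be sketched together (continuity_thresholds is a hypothetical helper; how the thresholds are then compared against the gap between clusters is not covered here):

```python
def continuity_thresholds(radii, zs, join_radius_fraction, join_z_fraction):
    """Joining thresholds from the combined radius and z ranges of two clusters.

    radii and zs hold the cylindrical radius and z values of the points
    in both clusters being compared.
    """
    radius_threshold = join_radius_fraction * (max(radii) - min(radii))
    z_threshold = join_z_fraction * (max(zs) - min(zs))
    return radius_threshold, z_threshold

# With the continuity join defaults (0.3 and 0.2):
print(continuity_thresholds([10.0, 25.0, 60.0], [0.0, 50.0, 100.0], 0.3, 0.2))
# (15.0, 20.0)
```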