subset_data does not converge #8

@MattScicluna

Hi, I have a large dataset (>100k samples) that contains a lot of duplicates.
MSPHATE does not converge during the Calculating partitions... step.

I can't share the dataset in question, but I believe I have reproduced the behavior with randomly generated data. See the following code and output:

import numpy as np
from multiscale_phate import compress

np.random.seed(42)

# spoof data
data = np.random.uniform(size=(10001, 200))
data = np.vstack([data] * 10)  # highly redundant: every row appears 10 times

# spoof MSPHATE compress step
N, features = data.shape
n_pca = 200
partitions = None

# Computing compression features
n_pca, partitions = compress.get_compression_features(
    N, features, n_pca, partitions, landmarks=2000
)

# subset_data was locally modified to print np.max(cluster_counts) and
# np.ceil(N / desired_num_clusters) on every iteration
_ = compress.subset_data(
    data, desired_num_clusters=partitions, n_jobs=8, num_cluster=100, random_state=None
)

output:

Calculating partitions...
np.max(cluster_counts):  3930
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  1120
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  70
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10

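The output stays the same over many more iterations: the largest cluster is stuck at 10 points (matching the 10 copies of each row), which is above the target of 6, so the step never finishes.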
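A quick way to check whether duplicates alone could explain the stall is to compare the largest group of identical rows with the per-cluster target. The sketch below is not part of Multiscale PHATE. It assumes the partitioning loop keeps splitting until np.max(cluster_counts) is no larger than np.ceil(N / desired_num_clusters), as the printed values suggest, and that identical rows can never be separated by clustering. The helper name max_duplicate_count and the small toy sizes are made up for illustration.

```python
import numpy as np


def max_duplicate_count(data):
    """Return the size of the largest group of exactly identical rows."""
    _, counts = np.unique(data, axis=0, return_counts=True)
    return int(counts.max())


# Small stand-in for the redundant dataset: every row appears 10 times.
rng = np.random.default_rng(0)
base = rng.uniform(size=(50, 4))
data = np.vstack([base] * 10)

desired_num_clusters = 100  # hypothetical value, chosen for this toy example
N = data.shape[0]

# Assumed stopping target: the largest cluster must not exceed this size.
target = int(np.ceil(N / desired_num_clusters))
largest = max_duplicate_count(data)

# If one group of identical rows is already bigger than the target,
# no amount of further splitting can bring the largest cluster below it.
can_converge = largest <= target
print(f"largest duplicate group: {largest}, target: {target}, "
      f"can converge: {can_converge}")
```

If can_converge is False for the real dataset, the stall is probably caused by the duplicates rather than by the clustering parameters.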

Note: I am using Python 3.8 and installed the package with pip install git+https://github.com/KrishnaswamyLab/Multiscale_PHATE
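One possible workaround, untested against Multiscale PHATE itself, is to collapse exact duplicates before the compression step and keep an index that maps every original row to its unique representative. The sketch below uses only NumPy. The helper name deduplicate_rows is hypothetical, and how the duplicate counts should be used downstream (for example as weights) is left open.

```python
import numpy as np


def deduplicate_rows(data):
    """Collapse identical rows.

    Returns the unique rows, an index array mapping each original row to
    its unique row, and the number of copies of each unique row.
    """
    unique_rows, inverse, counts = np.unique(
        data, axis=0, return_inverse=True, return_counts=True
    )
    # ravel() keeps the index 1-D regardless of the NumPy version in use.
    return unique_rows, inverse.ravel(), counts


# Toy data in which every row appears 10 times.
rng = np.random.default_rng(0)
base = rng.uniform(size=(50, 4))
data = np.vstack([base] * 10)

unique_rows, inverse, counts = deduplicate_rows(data)

# Any per-row result computed on unique_rows can be expanded back to the
# original samples by indexing with inverse, as done here for the rows.
reconstructed = unique_rows[inverse]
print(unique_rows.shape, reconstructed.shape)
```

The unique rows could then be passed on in place of the full array, with inverse used afterwards to assign each original sample the result of its representative.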
