Replication of Resilient K-Clustering (KDD, 2024)

Published in partial fulfillment of the COMP 5331 requirements, 2024

1 Introduction

1.1 Overview

Clustering is a fundamental problem in databases and plays an important role in data analysis and pattern recognition. One widely recognized algorithm in this domain is k-clustering. However, k-clustering is sensitive to various factors, including outliers, centroid initialization, and input perturbations. These sensitivities increase the computational cost of reaching a globally optimal solution, since multiple runs of the algorithm are often required to obtain stable and reliable clustering outcomes. In light of these challenges, there has been growing interest in approaches that enhance the stability of traditional k-clustering algorithms. The resilient k-clustering algorithm proposed in Ahmadian’s paper [1] is a recent advance that improves the resilience of clustering algorithms to input perturbations.

1.2 Objectives

The goal of this project is to implement resilient k-clustering according to Ahmadian’s paper [1] and apply it to another clustering problem. The algorithm should resiliently return similar clusters on noisy and slightly changed inputs.

To achieve this purpose, we will mainly focus on the following objectives:

  1. Understand the approach used by resilient k-clustering.
  2. Implement the resilient k-clustering algorithm.
  3. Replicate the results from the original resilient k-clustering paper.
  4. Test out the implemented algorithm on a new clustering problem.

2 Related Work

Gonzalez Algorithm [2]. The Gonzalez algorithm starts by selecting an arbitrary point as the first center. It then iteratively adds k - 1 additional centers by selecting the point that is farthest from the nearest existing center.
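For concreteness, a minimal sketch of this farthest-first traversal is shown below. The function name and NumPy-based implementation are ours; the original algorithm [2] specifies only the selection rule.

```python
# A minimal sketch of the Gonzalez farthest-first traversal for k-center.
import numpy as np

def gonzalez_k_center(points: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Return indices of k centers chosen by farthest-first traversal."""
    rng = np.random.default_rng(seed)
    n = len(points)
    centers = [int(rng.integers(n))]           # arbitrary first center
    # dist[i] = distance from point i to its nearest chosen center
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(k - 1):
        far = int(np.argmax(dist))             # point farthest from all centers
        centers.append(far)
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return centers
```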

Carving Algorithm [3]. This approach identifies the smallest value R (which represents the maximum distance from a newly added center to the points that it covers) such that the algorithm can open at most k centers. While there are uncovered points, it randomly selects an uncovered point, adds it to the center set, and marks all points (including the new center) within a distance of R as covered.
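The following sketch illustrates one carving pass for a fixed R, together with a simple linear scan over candidate radii (all pairwise distances) to find the smallest feasible R. The linear scan is our own simplification, not necessarily the search procedure used in [3].

```python
# A sketch of the carving step and a naive search for the smallest feasible R.
import numpy as np

def carve(points: np.ndarray, R: float, rng: np.random.Generator) -> list[int]:
    """Open centers greedily: pick a random uncovered point, cover every
    point within distance R of it (including itself), and repeat."""
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    while uncovered.any():
        choice = int(rng.choice(np.flatnonzero(uncovered)))
        centers.append(choice)
        d = np.linalg.norm(points - points[choice], axis=1)
        uncovered &= d > R                 # the new center covers itself too
    return centers

def smallest_feasible_radius(points: np.ndarray, k: int, seed: int = 0) -> float:
    """Smallest candidate R (a pairwise distance) opening at most k centers.
    O(n^2) candidates; fine for a sketch, too slow for large inputs."""
    rng = np.random.default_rng(seed)
    diffs = points[:, None, :] - points[None, :, :]
    radii = np.unique(np.linalg.norm(diffs, axis=2))
    for R in radii:                        # try radii in increasing order
        if len(carve(points, R, rng)) <= k:
            return float(R)
    return float(radii[-1])
```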

3 Methodology

In this section, we explain the problem definition and algorithm of resilient k-clustering. For the implementation details, such as changes in the algorithm for experimental needs, please refer to Section 4.3.

3.1 Formal definition of resiliency

Before introducing the algorithms, we need two concepts from Ahmadian’s paper [1]: ε-close point sets and γ-resiliency. Given two instances M = (P, μ) and M' = (P', μ'), where P and P' are point sets related by a bijection and μ and μ' are their distance functions, M and M' are ε-close if and only if, for any two points p and q, their distance differs by at most a factor of (1 + ε) between μ and μ'. For γ-resiliency, a k-clustering algorithm A is γ-resilient if and only if the outputs of A(P, k) and A(P', k) differ on at most a γε fraction of points for any ε-close point sets P and P' under the corresponding bijective mapping. Intuitively, a k-clustering algorithm is resilient if it returns similar results given a pair of close inputs.
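The check below makes the ε-closeness definition concrete for the common case where the bijection is the identity mapping between the rows of two coordinate arrays and μ is the Euclidean distance; the helper name is ours.

```python
# Check ε-closeness of two point sets under the identity bijection
# (row i of P maps to row i of P_prime). Assumes all points are distinct,
# so no off-diagonal pairwise distance is zero.
import numpy as np

def are_eps_close(P: np.ndarray, P_prime: np.ndarray, eps: float) -> bool:
    """True iff every pairwise distance differs by at most a (1+eps) factor."""
    def pdist(X):
        return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d, d2 = pdist(P), pdist(P_prime)
    mask = ~np.eye(len(P), dtype=bool)      # ignore zero self-distances
    ratio = np.maximum(d[mask] / d2[mask], d2[mask] / d[mask])
    return bool(np.all(ratio <= 1.0 + eps))
```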

3.2 Resilient k-Center Algorithm

Concerning the algorithms, we will first replicate the resilient k-center algorithm. Its inputs are a point set P, a natural number parameter k, and a positive real number ε. With these inputs, the algorithm returns a center set C of size k + 2k log(1/ε) and assigns the points in P to C so that the maximum distance between a point and its assigned center is minimized. Our implementation largely mirrors the authors’ two phases. In the first phase, we choose a sufficiently large set of random points (i.e. 2k log(1/ε) of them) as initial centers and assign the points that are very close to these centers. In the second phase, the remaining unassigned points are clustered using the well-established farthest-point traversal technique, which opens k more centers.

With enough random points selected in the first phase, there are very few remaining unassigned points. Hence, applying another clustering algorithm in the second phase will not affect the resilience guarantee.
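A simplified sketch of the two phases is given below. The close_radius threshold that decides which points are "very close" to a phase-1 center is a stand-in of ours; the paper’s first phase instead relies on the resilient minimum spanning tree of Section 3.3 to make these assignments.

```python
# A simplified two-phase sketch of resilient k-center (requires 0 < eps < 1).
import numpy as np

def resilient_k_center_sketch(points, k, eps, close_radius, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    # Phase 1: open 2k*log(1/eps) random centers.
    m = int(np.ceil(2 * k * np.log(1 / eps)))
    phase1 = rng.choice(n, size=min(m, n), replace=False)
    d1 = np.linalg.norm(points[:, None, :] - points[phase1][None, :, :], axis=2)
    assigned = d1.min(axis=1) <= close_radius   # points near a phase-1 center
    # Phase 2: farthest-point traversal over the leftovers opens k more.
    remaining = np.flatnonzero(~assigned)
    centers = [int(c) for c in phase1]
    if len(remaining) > 0:
        dist = np.full(len(remaining), np.inf)
        cur = int(remaining[0])
        for _ in range(min(k, len(remaining))):
            centers.append(cur)
            d = np.linalg.norm(points[remaining] - points[cur], axis=1)
            dist = np.minimum(dist, d)
            cur = int(remaining[int(np.argmax(dist))])
    return centers
```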

3.3 Resilient Minimum Spanning Tree

We will then continue to replicate the second algorithm, the resilient minimum spanning tree. It is called during the first phase of the resilient k-center algorithm to help form the clustering result by generating a minimum spanning tree. The algorithm uses discretization techniques: it selects some edges and adjusts their weights so that two edges with similar weights are treated as having the same weight. Hence, the algorithm can construct a minimum spanning tree that is robust to perturbations in the edge weights across the ε-close instances.
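As a rough illustration of the discretization idea, the sketch below rounds edge weights down to the nearest power of (1 + ε) before computing a minimum spanning tree with NetworkX. This rounding rule is a common choice of ours for illustration; the exact discretization in [1] may differ.

```python
# Discretize edge weights to powers of (1 + eps), then build an MST, so that
# edges whose weights differ by less than a (1 + eps) factor collapse to the
# same rounded weight and small perturbations cannot reorder them.
import math
import networkx as nx

def discretized_mst(G: nx.Graph, eps: float) -> nx.Graph:
    H = G.copy()
    base = 1.0 + eps
    for _, _, data in H.edges(data=True):
        w = data["weight"]
        # Snap each positive weight down to the nearest power of (1 + eps).
        data["weight"] = base ** math.floor(math.log(w, base)) if w > 0 else 0.0
    return nx.minimum_spanning_tree(H)
```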

4 Implementation

4.1 Environment Setup

In the implementation, we use Python as the programming language, since Python is commonly used in data mining tasks and provides useful libraries that facilitate our implementation. Windows is the OS used for development. To replicate our results, refer to the README.md file in our repository. The packages used in this project include NumPy, pandas, SciPy, scikit-learn, NetworkX, and Matplotlib.

4.2 Dataset

An ε-close dataset is needed to test the resilience of the clustering algorithm. Each ε-close dataset contains two sets of data that can be mapped bijectively and are ε-close.
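For testing purposes, one simple (and admittedly naive) way to build such a pair is to perturb the coordinates with small noise and then verify the multiplicative bound directly, since coordinate noise alone does not guarantee ε-closeness when points are nearly coincident. The noise scale and retry loop below are our own heuristics, not a procedure from [1]; the sketch assumes all points are distinct.

```python
# Generate an ε-close partner P' for a point set P by small Gaussian
# perturbation, verifying the (1 + eps) bound and shrinking the noise
# until it holds.
import numpy as np

def make_eps_close_pair(P: np.ndarray, eps: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    mask = ~np.eye(len(P), dtype=bool)
    # Scale noise relative to the smallest pairwise distance so that the
    # (1 + eps) bound is plausible on the first attempt.
    scale = eps * d[mask].min() / 4
    while True:
        P_prime = P + rng.normal(0.0, scale, size=P.shape)
        d2 = np.linalg.norm(P_prime[:, None, :] - P_prime[None, :, :], axis=2)
        ratio = np.maximum(d[mask] / d2[mask], d2[mask] / d[mask])
        if np.all(ratio <= 1.0 + eps):
            return P_prime
        scale /= 2   # too much noise: retry with a smaller perturbation
```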

4.2.1 Original Dataset

In order to test the correctness of our algorithm by replicating results from the original paper [1], we need to run our implementation on the ε-close datasets used there. The original datasets consist of three real-world datasets and one synthetic dataset. The real-world datasets capture users’ geolocations, which change gradually over time. The synthetic dataset represents a scenario in which consecutive measurements contain small errors. All datasets are open source, and the paper provides the processing procedure for each of them.

Real-world datasets. The three real-world datasets are Uber, Gowalla, and Brightkite [4]. Uber contains 18.8 million Uber pickup locations in New York City from April to September 2014 and from January to June 2015. Gowalla and Brightkite contain users’ check-in locations on the location-based social networking websites Gowalla and Brightkite, respectively. Gowalla has 6.4 million records from February 2009 to October 2010, and Brightkite has 4.5 million records from April 2008 to October 2010. We originally planned to replicate the results for Gowalla as well. However, even after processing, the dataset still consisted of 6,200 pairs of data, which took a long time to cluster for each clustering method under the different experimental settings. Given the similar nature of the Gowalla and Brightkite data and our limited computational power, we decided to replicate the results for Uber and Brightkite only.

Synthetic dataset. The synthetic dataset is Birch-sets [5], which comprises three synthetic datasets, each with 100,000 data points and 100 clusters generated under different settings. We denote them Birch1, Birch2, and Birch3, respectively. The original paper uses only Birch1, whose clusters form a regular grid structure. As with Gowalla, 100,000 data points are too intensive for our experiments. Since we still wanted to include this synthetic dataset, we downsampled it to 1,000 points while preserving its cluster shapes and distribution, as sketched below.
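The sketch below illustrates one way to perform such a downsampling: recover cluster labels with k-means and sample a roughly equal number of points from each cluster, so that all 100 clusters survive. Using k-means labels as strata is our own choice for this illustration.

```python
# Stratified downsampling of a clustered dataset: recover labels with
# k-means, then sample ~n_out/n_clusters points from each cluster.
import numpy as np
from sklearn.cluster import KMeans

def stratified_downsample(X: np.ndarray, n_clusters: int = 100,
                          n_out: int = 1000, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    per_cluster = max(1, n_out // n_clusters)   # ~10 points per cluster here
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, len(idx))
        keep.extend(rng.choice(idx, size=take, replace=False))
    return X[np.asarray(keep)]
```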

In short, we will replicate results for Uber, Brightkite, and Birch1 from the original paper.

Recommended citation: Yoanna Lo, Jason Li, Newt Nguyen, Kelvin Tam, Dennis Tsang (2024). "Replication of Resilient K-Clustering (KDD, 2024)".