Sunday, November 10, 2019

[ML] Machine Learning Basic Concepts - K-means


K-means
An unsupervised clustering algorithm.
Concepts
  • Searches for a pre-defined number of clusters within an unlabeled dataset
  • Each cluster center is the arithmetic mean of all the points belonging to that cluster
  • Every point in a cluster is closer to its own center than to the centers of the other clusters (a minimal sketch of this iterative procedure follows this list)
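
A minimal sketch of that iteration in plain NumPy, for illustration only (the helper name simple_kmeans and the generated blob data are hypothetical; in practice you would use sklearn.cluster.KMeans as in the example below):

import numpy as np
from sklearn.datasets import make_blobs

def simple_kmeans(X, n_clusters, rseed=2, n_iter=100):
    # 1. randomly pick the initial centers from the data points
    rng = np.random.RandomState(rseed)
    centers = X[rng.permutation(X.shape[0])[:n_clusters]]
    for _ in range(n_iter):
        # 2. assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # 3. move each center to the arithmetic mean of its assigned points
        new_centers = np.array([X[labels == k].mean(0) for k in range(n_clusters)])
        # 4. stop when the centers no longer move
        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return centers, labels

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
centers, labels = simple_kmeans(X, n_clusters=4)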
Issues
  • The number of clusters must be specified in advance
  • K-means is only suited to problems with linear cluster boundaries
  • It is slow on large datasets, because every iteration must visit every point in the dataset; Spark MLlib can be used to distribute the computation
  • It performs poorly on clusters that are not circular; Gaussian Mixture Models (GMM) can be used instead, whose predictions return a set of probabilities indicating how confidently each point belongs to each cluster (see the sketch after this list)
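
A minimal sketch of the GMM alternative mentioned above, using scikit-learn's GaussianMixture (the blob data here is only an illustrative stand-in; predict_proba returns the per-cluster membership probabilities described in the bullet):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# illustrative 2-D sample data: four blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignments, comparable to k-means labels
probs = gmm.predict_proba(X)   # shape (n_samples, 4): confidence per cluster
print(probs[:5].round(3))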

Example
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Let's visualize the results by plotting the data colored by these labels. We will also plot the cluster centers as determined by the k-means estimator:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);


from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))
kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X)

For non-linear cluster boundaries, use SpectralClustering

k-means is limited to linear cluster boundaries 

The fundamental model assumptions of k-means (points will be closer to their own cluster center than to others) mean that the algorithm will often be ineffective if the clusters have complicated geometries.
In particular, the boundaries between k-means clusters will always be linear, which means that it will fail for more complicated boundaries. Consider the following data, along with the cluster labels found by the typical k-means approach:
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=0)
labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');
This situation is reminiscent of the discussion in In-Depth: Support Vector Machines, where we used a kernel transformation to project the data into a higher dimension where a linear separation is possible. We might imagine using the same trick to allow k-means to discover non-linear boundaries.
One version of this kernelized k-means is implemented in Scikit-Learn within the SpectralClustering estimator. It uses the graph of nearest neighbors to compute a higher-dimensional representation of the data, and then assigns labels using a k-means algorithm:
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');

More examples:
Handwritten digit recognition - clustering the digits 0-9 to identify which digit an image shows (a minimal sketch follows below)
Color image compression - reducing the number of similar pixel colors in an image in order to compress it
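
A minimal sketch of the digit-clustering idea, assuming scikit-learn's bundled digits dataset; mapping each cluster to its most common true digit is only one illustrative way to read off the result:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()                      # 1,797 8x8 grayscale digit images
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)  # unsupervised: 10 clusters, no labels used

# map each cluster to the most common true digit inside it (for evaluation only)
labels = np.zeros_like(clusters)
for i in range(10):
    mask = clusters == i
    labels[mask] = np.bincount(digits.target[mask]).argmax()

print("clustering accuracy:", (labels == digits.target).mean())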

Ref:
  • Taiwan AI Academy (台灣人工智慧學校)
  • Jake VanderPlas, Python Data Science Handbook, O'Reilly (Traditional Chinese translation by 何敏煌)
