Unsupervised Learning - Finding Natural Patterns

2020-11-15

Classifying multidimensional data, that is, using dimensionality reduction to find variables worth exploring (multivariate analysis), helps with subsequent machine learning and statistical analysis. This post introduces PCA (principal component analysis) and Multidimensional Scaling in MATLAB.

Multidimensional Scaling

Step 1 - Calculate pairwise distances

You can use the function pdist to calculate the pairwise distance between the observations. Note that the input should be a numeric matrix.

D = pdist(data,"distance")   % "distance" is a placeholder for a metric name such as "euclidean" (the default) or "cityblock"
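A minimal sketch on a small, made-up matrix (the values here are purely for illustration):

data = [1 2; 3 4; 5 7];   % three observations, two variables
D = pdist(data)           % 1-by-3 vector of Euclidean pairwise distances
squareform(D)             % view D as a symmetric 3-by-3 distance matrix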

Step 2 - Perform multidimensional scaling

You can now use the dissimilarity vector D as an input to the function cmdscale.

[Y,e] = cmdscale(D);

% configMat = mdscale(distances,numDims);   % mdscale lets you request a specific number of dimensions

Step 3 - Pareto analysis

You can use the pareto function to create a Pareto chart, which visualizes relative magnitudes of a vector in descending order.

pareto(e)

Step 4 - Plot the reduced-dimension data

Based on the Pareto chart above, more than 90% of the total is captured by the first two coordinates, so a scatter plot of these two variables describes most of the relationship in the data.

scatter(Y(:,1),Y(:,2))
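If you want the exact figure rather than reading it off the chart, the eigenvalues in e can be converted to percentages directly (a small sketch; it assumes the eigenvalues are nonnegative, which holds for Euclidean distances):

pctExplained = 100*e/sum(e);   % percentage of the total captured by each coordinate
sum(pctExplained(1:2))         % cumulative percentage for the first two coordinates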

Principal Component Analysis

Use the function pca to perform principal component analysis.

[pcs,scrs,~,~,pexp] = pca(data)

  • pcs: An n-by-n matrix of principal components, where n is the number of variables.
  • scrs: An m-by-n matrix (m observations) containing the data transformed using the linear coordinate transformation matrix pcs (first output).
  • pexp: A vector of length n containing the percentage of variance explained by each principal component.

Afterwards, use a Pareto chart in the same way to see which components contribute the most variance, for example:
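pareto(pexp)   % pexp already holds percentages, so it can be passed to pareto directly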

More examples

The example below shows the classification results for a set of standardized multidimensional data.

1. Using classical multidimensional (CMD) scaling

D = pdist(statsNorm)

[Y,e] = cmdscale(D);

pareto(e)

You can use a 3-D scatter plot to examine the first three coordinates.

scatter3(Y(:,1),Y(:,2),Y(:,3))

view(110,40)

2. Similarly, using PCA to classify the data

[pcs,scrs,~,~,pexp] = pca(statsNorm)

pareto(pexp)
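As with the MDS result, the first two principal component scores can be drawn as a scatter plot (a minimal sketch; the axis labels are only for readability):

scatter(scrs(:,1),scrs(:,2))
xlabel("PC 1")
ylabel("PC 2")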

k-means Clustering (cluster analysis)

Having decided the number of groups to form, you can use the k-means clustering method to divide the observations into groups or clusters.

The kmeans function performs k-means clustering.

idx = kmeans(X,k)

  • X: Data, specified as a numeric matrix.
  • k: Number of clusters.
  • idx: Cluster indices, returned as a numeric column vector.

In the example below, k-means partitions the data into three clusters, and the result is drawn as a scatter plot.

grp = kmeans(X, 3)

scatter(X(:,1),X(:,2),10,grp)

% use grp to color the points by cluster
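kmeans can also return the cluster centroids, and its documented "Replicates" option reruns the algorithm from several random starts to avoid poor local minima. A sketch combining both with the scatter plot above:

[grp,C] = kmeans(X,3,"Replicates",5);   % C holds the three cluster centroids
scatter(X(:,1),X(:,2),10,grp)           % points colored by cluster index
hold on
plot(C(:,1),C(:,2),"kx","MarkerSize",12,"LineWidth",2)   % mark the centroids
hold off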

GMM Clustering (Gaussian Mixture Models)

Step 1 - Fit Gaussian Mixture Model

You can use the function fitgmdist to fit a mixture of multivariate Gaussian (normal) distributions.


gm = fitgmdist(X,2);   

% The command shown fits a mixture gm of two distributions.
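fitgmdist is sensitive to its random initialization and can fail to converge on ill-conditioned data. A hedged sketch using its documented "Replicates" and "RegularizationValue" options, neither of which is required for well-behaved data:

gm = fitgmdist(X,2,"Replicates",5,"RegularizationValue",1e-5);   % 5 random restarts; small ridge on the covariances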

Step 2 - Identify Clusters

Now the data can be clustered probabilistically by calculating each observation's posterior probability for each component.

g = cluster(gm,X);

You can also return the individual probabilities used to determine the clusters.


[g,~,p] = cluster(gm,X); 

% The matrix p has two columns, one for each of the two clusters.
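A small sketch of visualizing these soft assignments, assuming X has two columns as in the k-means example:

scatter(X(:,1),X(:,2),10,p(:,1))   % color by the posterior probability of the first component
colorbar                           % values near 0 or 1 indicate confident assignments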

Hierarchical Clustering

Step 1 - Determine Hierarchical Structure

Finding the hierarchical structure involves calculating the distance between each pair of points and then using these distances to link together pairs of "neighboring" points.

Use the linkage function to create the hierarchical tree.
The optional second and third inputs specify the methods for calculating the distance between clusters (default: "single") and calculating the distance between points (default: "euclidean"). 

Z = linkage(X,"ward","cosine"); 

You can use the dendrogram function to visualize the hierarchy. 

dendrogram(Z) 
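One way to check how faithfully the tree preserves the original pairwise distances is the cophenetic correlation coefficient, computed by MATLAB's cophenet function (values close to 1 indicate a good fit):

Y = pdist(X,"cosine");   % the same metric used to build the tree
c = cophenet(Z,Y)        % cophenetic correlation coefficient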

Step 2 - Divide Hierarchical Tree into Clusters

You can use the cluster function to assign observations into groups, according to the linkage distances Z. 

Z = linkage(X,"centroid","cosine");

dendrogram(Z) 

grp = cluster(Z,"maxclust",3)
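As with k-means, the returned indices can color a scatter plot of the first two variables:

scatter(X(:,1),X(:,2),10,grp)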

Miller : hhjoy222@gmail.com