監督式學習-分類方法

2020-11-21

KNN method

(1)Table data

如下有一表格資料 groupData (92X3)

step 1: 拆成訓練資料與測試資料

可繪製散點圖，來看訓練用資料的分群狀況

groupDataplotGroup(groupData,groupData.group,"x")

step 2: 訓練模型

Construct a k-nearest neighbor classifier named 'mdl' where k = 3. Train the model using 'dataTrain' with the response variable 'group'.

mdl = fitcknn(dataTrain,"group","NumNeighbors",3);

step 3: 測試模型

Predict the groups for the data in 'dataTest'. Assign the result to a variable named 'pg3'.

pg3 = predict(mdl,dataTest);

step 4: 可視化分類模型

可以與step1的分類一起比較，如下

Use 'plotGroup' to plot the predicted groups for 'dataTest' with the marker "o". Use hold on and hold off to add this plot to the existing figure.

hold on

plotGroup(dataTest,pg3,"o")

hold off

step 5: 驗證模型

一種確認模型是否能正確分類的方法是，採用 loss 函數，即誤報率。

Notice the point near (7,1) which has been misclassified. The true class (x marker) is different from the predicted class (o marker). What proportion of points have been misclassified?

Calculate the loss for the test data 'dataTest', and assign it to a variable named 'err3'.

err3 = loss(mdl,dataTest)

step 6: 修改模型

可以試著調整 nearest neighbors 來改善模型

mdl.NumNeighbors = 10;

step 7: 重新測試與可視化模型

pg10 = predict(mdl,dataTest);

hold on

plotGroup(dataTest,pg10,"p")

hold off

(2)Array data

與 table 資料類似，但現在改成陣列來處理

step 1: 拆成訓練資料與測試資料

step 2: 訓練模型

mdl = fitcknn(coordTrain,groupTrain,"NumNeighbors",3);

step3 : 測試模型

predGroups = predict(mdl,coordTest);

step4: 驗證模型

mdlLoss = loss(mdl,coordTest,groupTest)

Decision Trees

下表列出一組心臟病資料:

This code loads and formats the data.

heartData = readtable("heartDataNum.txt");

heartData.HeartDisease = categorical(heartData.HeartDisease);

This code partitions the data into training and test sets.

pt = cvpartition(heartData.HeartDisease,"HoldOut",0.3);

hdTrain = heartData(training(pt),:);

hdTest = heartData(test(pt),:);

Create a classification tree model named mdl using the training data 'hdTrain'

mdl = fitctree(hdTrain,"HeartDisease");

Change the level of branching of mdl to 3.

mdl = prune(mdl,"Level",3)

最後驗證模型

errTrain = resubLoss(mdl);

errTest = loss(mdl,hdTest);

disp("Training Error: " + errTrain)

disp("Test Error: " + errTest)

Naïve Bayes

決策樹與KNN分類法都沒有對資料進行假設，但‘單純貝氏分類’是一系列以假設特徵之間強（樸素）獨立下運用貝葉斯定理為基礎的簡單機率分類器

If we assume the data set comes from an underlying distribution, we can treat the data as a statistical sample. This can reduce the influence of the outliers on our model.

A naïve Bayes classifier assumes the independence of the predictors within each class. This classifier is a good choice for relatively simple problems.

Discriminant Analysis

判別分析，又稱為線性判別分析（Linear Discriminant Analysis），是利用已知類別的樣本建立判別模型，為未知類別的樣本判別的一種統計方法。

Similar to naive Bayes, discriminant analysis works by assuming that the observations in each prediction class can be modeled with a normal probability distribution. However, there is no assumption of independence in each predictor. Hence, a multivariate normal distribution is fitted to each class.

Linear Discriminant Analysis

By default, the covariance for each response class is assumed to be the same. This results in linear boundaries between classes.

daModel = fitcdiscr(dataTrain,"ResponseVarName")

Quadratic Discriminant Analysis

Removing the assumption of equal covariances results in a quadratic boundary between classes. Use the "DiscrimType" option to do this:

daModel = fitcdiscr(dataTrain,"ResponseVarName","DiscrimType","quadratic")

Support Vector Machines

A Support Vector Machine (SVM) algorithm classifies data by finding the "best" hyperplane that separates all data points.

Step1: process data

如下程式碼，已經將一組透過cvpartition 將資料分類

load groups

cvpt = cvpartition(groupData.group,"Holdout",0.35);

dataTrain = groupData(training(cvpt),:);

dataTest = groupData(test(cvpt),:);

Step2: 訓練模式與驗證

mdlSVM = fitcsvm(dataTrain,"group","KernelFunction","gaussian");

errSVM = loss(mdlSVM,dataTest)

Step3: 可視化資料

predGroups = predict(mdlSVM,dataTest);

plotGroup(groupData,groupData.group,"x")

hold on

plotGroup(dataTest,predGroups,"o")

hold off

其中，function plotGroup 如下:

function plotGroup(data,grp,mkr)

validateattributes(data,"table",{'nonempty','ncols',3})

% Plot data by group

colors = colororder;

p = gscatter(data.x,data.y,grp,colors([1 2 4],:),mkr,9);

% Format plot

[p.LineWidth] = deal(1.5);

legend("Location","eastoutside")

xlim([-0.5 10.5])

ylim([-0.5 10.5])

end

上圖可以發現，仍有一些資料沒有分類正確，下例中測試另一種數據，稱為同心圓數據。

Step1: process data

load points

rng(123)

cvpt = cvpartition(points.group,"Holdout",0.38);

trainPoints = points(training(cvpt),:);

testPoints = points(test(cvpt),:);

plotGroup(points,points.group,"x")

Step2: 訓練模式與驗證繪圖

mdl = fitcsvm(trainPoints,"group","KernelFunction","polynomial");

mdlLoss = loss(mdl,testPoints)

predGroups = predict(mdl,testPoints);

plotGroup(points,points.group,"x")

hold on

plotGroup(testPoints,predGroups,"o")

hold off

換言之，SVM適合分類非線性的資料

Multiclass Support Vector Machine Models

perform multiclass SVM classification by creating an error-correcting output codes (ECOC) classifier.

例如下圖應該有三個類別的數據

Suppose there are three classes in the data. By default, the ECOC model reduces the model to multiple, binary classifiers using the one-vs-one design.

The combination of the resulting classifiers is used for prediction.

Creating a multiclass SVM model is a two-step process.

如˙下一組資料，欲進行Multiclass Support Vector Machine Models

load groups

rng(0)

cvpt = cvpartition(groupData.group,"Holdout",0.35);

dataTrain = groupData(training(cvpt),:);

dataTest = groupData(test(cvpt),:);

plotGroup(groupData,groupData.group,"x")

Create a multiclass SVM classifier named 'mdlMSVM' using the training data 'dataTrain' and the response variable 'group'. Calculate the classification loss for the test data 'dataTest', and name it 'errMSVM'.

mdlMSVM = fitcecoc(dataTrain,"group");

errMSVM = loss(mdlMSVM,dataTest)

predGroups = predict(mdlMSVM,dataTest);

plotGroup(groupData,groupData.group,"x")

hold on

plotGroup(dataTest,predGroups,"o")

hold off