How to Optimize a Model
In supervised learning, we use a loss function or a confusion matrix to measure how accurate a model is. But how do we actually improve the model?
Also, when we split the data into training data and testing data, the evaluation depends on that particular split, so the model may only appear to perform well on a specific subset. How can we avoid this?
Cross Validation
The idea behind cross-validation is to repeat the model evaluation process multiple times. Each repetition uses a different split into training and test subsets, fits a model to the training set, and calculates the loss on the corresponding test set.
To create a cross-validated model, provide a cross-validation option, such as "KFold" or "CrossVal", in the model-creation function.
If you already have a partition created with the cvpartition function, you can instead pass it to the fitting function using the "CVPartition" option.
To evaluate a cross-validated model, use the kfoldLoss function.
mdlLoss = kfoldLoss(mdl)
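As a minimal sketch of this pattern, assuming the built-in fisheriris data set and a kNN classifier (both chosen here purely for illustration):
load fisheriris                            % meas: predictors, species: class labels
mdl = fitcknn(meas,species,"KFold",5);     % "KFold" makes fitcknn return a cross-validated model
mdlLoss = kfoldLoss(mdl)                   % average misclassification rate over the 5 folds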
For example, suppose we have a table of data named groupData.
Create a five-fold cross-validation partition named 'cvpt' of the data in the table 'groupData'. The response is the table variable 'group'.
cvpt = cvpartition(groupData.group,"KFold",5)
You can pass a partition to model-creation functions, such as fitcknn or fitcdiscr, using the "CVPartition" option.
mdl = fitcdiscr(groupData,"group","CVPartition",cvpt);
Finally, evaluate the cross-validated model with the kfoldLoss function. Note that function names are case-sensitive.
kfLoss = kfoldLoss(mdl)
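Putting the steps together, here is a runnable sketch that uses the built-in fisheriris data in place of groupData (an assumption for illustration; the response variable here is species rather than group):
load fisheriris
groupData = array2table(meas,"VariableNames",["x1","x2","x3","x4"]);
groupData.species = categorical(species);                    % response variable

cvpt   = cvpartition(groupData.species,"KFold",5);           % stratified five-fold partition
mdl    = fitcdiscr(groupData,"species","CVPartition",cvpt);  % cross-validated discriminant model
kfLoss = kfoldLoss(mdl)                                      % cross-validated loss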
Hyperparameter Optimization
You can modify the properties of a machine learning model to try to improve model performance.
For example, you can change a kNN model to use 5 nearest neighbors instead of 1, then calculate the loss of the new model.
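A sketch of that comparison, again assuming the fisheriris data purely for illustration, cross-validates both settings and compares their losses:
load fisheriris
mdl1 = fitcknn(meas,species,"NumNeighbors",1,"KFold",5);   % kNN with 1 nearest neighbor
mdl5 = fitcknn(meas,species,"NumNeighbors",5,"KFold",5);   % kNN with 5 nearest neighbors
loss1 = kfoldLoss(mdl1)                                    % compare the cross-validated losses
loss5 = kfoldLoss(mdl5)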
The process usually consists of repeatedly training the model with a variety of property values, and choosing the combination that produces the best accuracy.
These properties are often called hyperparameters. Hyperparameters can have a large impact on the performance of a model, but it's typically time consuming or difficult to find the optimal hyperparameter values.
To search for good values of several properties at once, you can perform hyperparameter optimization. Hyperparameter optimization allows you to select a subset of the model's properties and find the optimal settings for a specific data set.
During the optimization, iterative updates are displayed, along with a plot of the best objective function value against the iteration number.
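As a minimal sketch, assuming fisheriris and a kNN classifier again for illustration, the "OptimizeHyperparameters" option tells the fitting function to search over properties such as NumNeighbors and Distance, producing the iterative display and objective-function plot described above:
load fisheriris
mdl = fitcknn(meas,species,"OptimizeHyperparameters","auto");   % automated search over NumNeighbors and Distance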