A similar assignment is COMP7710 Homework 4 – Dimensionality Reduction. The content of Homework 5 follows.
Data pre-processing (1 mark)
1. Load the 'virusdata.csv' dataset.
2. Preprocess the data by dropping any unnecessary columns.
3. Extract a subset of data by considering all columns, but only the first 1000 rows.
4. Now extract the 4th and 5th columns from the above subset (similar to what was done in the tutorial). The new dataset should have shape (1000, 2). Use this sub-dataset for all subsequent problems.
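The four steps above can be sketched roughly as follows. This is only an illustration: the helper name make_subset, the dropped column "id", and the synthetic stand-in DataFrame are assumptions, since the handout does not say which columns of 'virusdata.csv' are unnecessary.

```python
import numpy as np
import pandas as pd

def make_subset(df, drop_cols=(), n_rows=1000, col_idx=(3, 4)):
    """Drop unneeded columns, keep the first n_rows rows, return two columns as an array."""
    df = df.drop(columns=list(drop_cols))           # step 2: columns to drop are an assumption
    head = df.iloc[:n_rows, :]                      # step 3: all columns, first n_rows rows
    return head.iloc[:, list(col_idx)].to_numpy()   # step 4: 4th and 5th columns (0-based 3 and 4)

# Stand-in data so the sketch runs without the file. For the assignment,
# replace this with: df = pd.read_csv("virusdata.csv")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(1500, 7)),
    columns=["id", "f1", "f2", "f3", "f4", "f5", "f6"],  # hypothetical column names
)

X = make_subset(df, drop_cols=["id"])
print(X.shape)  # (1000, 2)
```

Note that the column positions are counted after the unnecessary columns are dropped, which is one reading of the task; if the tutorial counts them before dropping, adjust col_idx accordingly.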
Choosing the optimal number of clusters (3 marks)
For this question, use the silhouette score metric to choose an optimal number of clusters from a range of clusters.
1. Define a range of cluster counts over which to apply GMM. The minimum number of clusters should be 2, and the maximum can be any number from 12 to 15.
2. For each number of clusters:
• Apply a Gaussian Mixture Model on your dataset.
• Using the trained GMM, predict the cluster of each datapoint.
• Calculate the average silhouette score for the number of clusters chosen at each iteration. Use the Euclidean distance as the distance metric to calculate this score.
3. Plot the average silhouette score against the number of clusters.
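One possible way to organise this loop with scikit-learn is sketched below. The function names silhouette_by_k and plot_scores, the fixed random seed, and the three-blob stand-in data are assumptions made for illustration; in the assignment, X would be the (1000, 2) sub-dataset.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values, seed=0):
    """Fit one GMM per cluster count and return the average silhouette score for each."""
    scores = []
    for k in k_values:
        gmm = GaussianMixture(n_components=k, random_state=seed)
        labels = gmm.fit(X).predict(X)  # hard cluster assignment for every point
        scores.append(silhouette_score(X, labels, metric="euclidean"))
    return scores

def plot_scores(k_values, scores):
    """Plot the average silhouette score against the number of clusters."""
    import matplotlib.pyplot as plt
    plt.plot(list(k_values), scores, marker="o")
    plt.xlabel("Number of clusters")
    plt.ylabel("Average silhouette score")
    plt.show()

# Stand-in data: three well-separated 2-D blobs (not the virus dataset).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

k_values = range(2, 13)  # 2 up to 12 clusters
scores = silhouette_by_k(X, k_values)
best_k = list(k_values)[int(np.argmax(scores))]  # cluster count with the highest score
```

Calling plot_scores(k_values, scores) then produces the plot requested in step 3.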
Optimal GMM (3 marks)
1. From the above plot, what would you choose as the optimal number of clusters?
2. Apply a GMM on your dataset with the optimal number of clusters that you chose.
3. Find the cluster centers for the optimal GMM.
4. Plot the clustered data points and the mean of each cluster. (Assign a different colour to each cluster for clear visualization.)
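Steps 2 to 4 might look roughly like the sketch below, which reads the cluster centres from the fitted model's means_ attribute. The helper names fit_gmm and plot_clusters, the choice of 3 clusters, and the stand-in data are all assumptions; the actual cluster count should come from the silhouette plot.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(X, n_clusters, seed=0):
    """Fit a GMM and return the model, the predicted labels, and the component means."""
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(X)
    return gmm, gmm.predict(X), gmm.means_

def plot_clusters(X, labels, means):
    """Scatter the points coloured by cluster and mark each cluster mean."""
    import matplotlib.pyplot as plt
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", s=10)
    plt.scatter(means[:, 0], means[:, 1], c="black", marker="x", s=100,
                label="cluster means")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.legend()
    plt.show()

# Stand-in data: three well-separated 2-D blobs (not the virus dataset).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

gmm, labels, means = fit_gmm(X, n_clusters=3)  # 3 is an assumed choice for this toy data
```

Calling plot_clusters(X, labels, means) draws the figure described in step 4.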
Discussion (3 marks)
1. Consider K-means, as discussed in the tutorial, and GMM for clustering. Explain which method you think is more suitable for the problem of clustering the virus-MNIST dataset and why.
2. GMM is considered to be an unsupervised clustering technique. However, a small part of training data might contain labels. Explain how this information can be used to improve the performance of GMM. Tip: think about the initialisation.
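As one illustration of the tip about initialisation, the sketch below seeds the component means with the per-class means of a few labelled points, using the means_init parameter of scikit-learn's GaussianMixture. It is only one possible approach, not the expected answer; the helper means_from_labels, the toy data, and the choice of 5 labelled points per class are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def means_from_labels(X_labelled, y_labelled):
    """Return one mean vector per class, ordered by sorted class label."""
    classes = np.unique(y_labelled)
    return np.vstack([X_labelled[y_labelled == c].mean(axis=0) for c in classes])

# Stand-in data: three well-separated 2-D blobs with known classes.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
y_true = np.repeat(np.arange(3), 100)

# Pretend that only 5 points per class carry labels.
labelled_idx = np.concatenate([np.flatnonzero(y_true == c)[:5] for c in range(3)])
means_init = means_from_labels(X[labelled_idx], y_true[labelled_idx])

# Start EM from the labelled class means instead of the default initial means.
gmm = GaussianMixture(n_components=3, means_init=means_init, random_state=0).fit(X)
labels = gmm.predict(X)
```

A side effect of this initialisation is that component k starts at the mean of class k, so the fitted component indices tend to line up with the known class labels.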
Reference book chapter:
Pattern Recognition and Machine Learning, by Christopher M. Bishop, Chapter 12.1
In Homework 4:
Questions
In this homework, you will use PCA and LDA to reduce the dimensions of the cyber attack dataset collected by the University of New South Wales.
You may use scikit-learn functions to answer these questions (or any implementations you built from scratch).
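A minimal sketch of both reductions with scikit-learn is given below. The synthetic three-class data, the standardisation step, and the choice of 2 output dimensions are assumptions for illustration; the real features and labels would come from the cyber attack dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Stand-in data: 3 classes, 10 features, class means shifted apart.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 100)
X = rng.normal(size=(300, 10)) + 2.0 * y[:, None]

# Standardise features so that no single scale dominates the projections.
X_std = StandardScaler().fit_transform(X)

# PCA is unsupervised: it keeps the directions of largest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# LDA is supervised: it uses the labels to find directions that separate classes.
# The number of components can be at most min(n_classes - 1, n_features), here 2.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_std, y)

print(X_pca.shape, X_lda.shape)  # (300, 2) (300, 2)
```

The attribute pca.explained_variance_ratio_ reports how much variance each retained principal component accounts for, which can help when choosing the number of components.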