Clustering: PCA vs t-SNE on the Fashion MNIST dataset

Principal Component Analysis. Recently I’ve been working on projects involving high-dimensional datasets with hundreds or thousands of variables, which naturally led me to dimension reduction techniques that make the data easier to visualise and model (e.g. for cluster analysis). The first port of call for most people will be Principal Component Analysis (“PCA”). In simple terms, PCA finds the directions (principal components) along which the data varies the most by decomposing the sample covariance matrix, \(S\), into its eigenvectors and eigenvalues....

April 26, 2022 · 13 min · Josh Cheema
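
As a rough illustration of the eigendecomposition idea in the excerpt above, here is a minimal two-dimensional sketch in plain Python. It is not taken from the post: the function names `sample_covariance_2d` and `eigen_symmetric_2x2` and the toy data are hypothetical, and the closed-form solution only works for a 2×2 symmetric matrix. For real high-dimensional data you would normally reach for a linear algebra library instead.

```python
import math

def sample_covariance_2d(points):
    """Return (s_xx, s_xy, s_yy), the entries of the 2x2 sample covariance matrix S."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return s_xx, s_xy, s_yy

def eigen_symmetric_2x2(a, b, d):
    """Closed-form eigendecomposition of [[a, b], [b, d]].

    Returns [(eigenvalue, unit eigenvector), ...] with the largest eigenvalue first.
    """
    centre = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    # Angle of the principal axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - d)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))
    return [(centre + radius, v1), (centre - radius, v2)]

# Hypothetical toy data lying roughly along the line y = x.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8)]
s_xx, s_xy, s_yy = sample_covariance_2d(points)
(lam1, pc1), (lam2, pc2) = eigen_symmetric_2x2(s_xx, s_xy, s_yy)

# Share of total variance captured by the first principal component.
explained = lam1 / (lam1 + lam2)

# Scores: centred points projected onto the first principal component.
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
scores = [(x - mean_x) * pc1[0] + (y - mean_y) * pc1[1] for x, y in points]

print("first principal component:", pc1)
print("explained variance ratio:", explained)
print("scores on PC1:", scores)
```

The eigenvector with the largest eigenvalue is the direction of greatest variance, and the eigenvalue itself is the variance of the data along that direction, which is why the ratio above can be read as the proportion of variance explained.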

Spatial Data: Adjusting the Boston Housing dataset

The original Boston Housing dataset contains U.S. Census data for the Greater Boston area in 1970, including metrics such as median value of owner-occupied housing, per capita crime rate and nitric oxide concentration for each census tract (a small collection of houses defined for the census). The corrected Boston Housing dataset includes the original variables with corrections for errors and additional spatial data for each tract such as longitude, latitude and the name of the town in which each tract is located....

April 25, 2022 · 5 min · Josh Cheema

LDA vs QDA

Introduction. When looking at binary classification problems, a common modelling approach is logistic regression, which uses the logistic function to estimate the probability that an observation belongs to one of \(K = 2\) classes. However, while logistic regression is a valid approach, alternative methods may be required, in particular for datasets where the classes are completely (or almost completely) separated. In this article, we discuss two methods that do not suffer from this class separation issue: linear discriminant analysis (“LDA”) and quadratic discriminant analysis (“QDA”)....

April 20, 2022 · 5 min · Josh Cheema