Suppose I have 20 independent variables and I am thinking of using PCA. Do we need to scale all 20 independent variables first, or will PCA handle it? And I hope the output of PCA will be scaled features...

Replies to This Discussion

PCA is one approach to it; however, what is your main goal? If you want to reduce the number of variables and group them into indices, PCA is an adequate approach. If you intend to use all 20 IVs, some form of linear or logistic regression may be best.

Is this survey data (categorical data) or strictly continuous? Has the survey been psychometrically validated previously or generated independently?

Thanks, but I want to know: do we need to scale the data before PCA, so that all independent variables lie between 0 and 1?

It depends. You can compute the PCA in two ways:

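1- By computing the eigenvalues and eigenvectors of the covariance matrix: in this case you generally need to standardize your data first (subtract each variable's mean and divide by its standard deviation), otherwise the variables with the largest variances dominate the components. Note that rescaling to the 0-1 range is not the same thing: it puts the variables on a common range, but their variances still differ.

2- By computing the eigenvalues and eigenvectors of the (Pearson) correlation matrix: in this case you don't need to standardize your data.

The proof of this is very easy: the covariance matrix of the standardized data IS the correlation matrix :)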
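To make that identity concrete, here is a small sketch in plain Python. The data and the helper names (`covariance`, `correlation`, `standardize`, `min_max`) are made up for illustration. It checks that the covariance of z-scored variables equals the Pearson correlation of the raw variables, and that the same is not true after 0-1 (min-max) scaling.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

def standardize(xs):
    """Z-score: subtract the mean, divide by the standard deviation."""
    m, s = mean(xs), sqrt(covariance(xs, xs))
    return [(x - m) / s for x in xs]

def min_max(xs):
    """Rescale to the [0, 1] interval."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Made-up data: two variables measured on very different scales.
height_cm = [150.0, 160.0, 165.0, 172.0, 181.0, 190.0]
income = [21000.0, 35000.0, 30000.0, 52000.0, 48000.0, 75000.0]

r = correlation(height_cm, income)
cov_z = covariance(standardize(height_cm), standardize(income))
cov_mm = covariance(min_max(height_cm), min_max(income))

print("correlation of raw data:       ", round(r, 6))
print("covariance after z-scoring:    ", round(cov_z, 6))
print("covariance after 0-1 rescaling:", round(cov_mm, 6))
```

The first two printed numbers agree; the third does not, which is why 0-1 scaling followed by covariance-based PCA is not equivalent to correlation-based PCA.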

My advice: if you're not implementing PCA yourself, then check how the package you're using implements it (using the covariance or the correlation matrix).

Scaling the input variables is your job as the analyst. Do you want to center and standardize the data? If so, center the data by subtracting the mean vector from all data vectors and sweep out the standard deviations (divide each variable by its standard deviation). The result will have a mean vector of zeroes and a variance of one for every variable. Or you could center by the vector of variable minima and scale by the vector of variable ranges; this gives you the desired [0, 1] interval for all your variables and leaves the correlations between them unchanged, although their variances will still differ.
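As a rough illustration of those two choices, the sketch below applies both to a small made-up data matrix using only the Python standard library. The function names and the numbers are assumptions for the example, not anything from a particular package.

```python
from statistics import mean, stdev

# Made-up data matrix: rows are observations, columns are variables.
data = [
    [150.0, 21000.0, 3.1],
    [160.0, 35000.0, 2.4],
    [165.0, 30000.0, 4.0],
    [172.0, 52000.0, 3.6],
    [181.0, 48000.0, 2.9],
]

columns = list(zip(*data))  # one tuple per variable

def standardize(columns):
    """Subtract each column's mean and divide by its standard deviation."""
    out = []
    for col in columns:
        m, s = mean(col), stdev(col)
        out.append([(x - m) / s for x in col])
    return out

def min_max(columns):
    """Subtract each column's minimum and divide by its range."""
    out = []
    for col in columns:
        lo, hi = min(col), max(col)
        out.append([(x - lo) / (hi - lo) for x in col])
    return out

z = standardize(columns)
u = min_max(columns)

for name, cols in (("z-score", z), ("min-max", u)):
    for j, col in enumerate(cols):
        print(f"{name} var{j}: mean={mean(col):.3f} sd={stdev(col):.3f} "
              f"min={min(col):.3f} max={max(col):.3f}")
```

After z-scoring every variable has mean 0 and standard deviation 1; after min-max scaling every variable spans 0 to 1, but the standard deviations are not equal, so a covariance-based PCA would still weight the variables differently.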

You also have to pay attention to the scales on which the variables are measured. Are they all ratio scale? If not, then consider using Gower's coefficient for mixed data (implemented by daisy() in the R cluster pkg) and then do a principal coordinates analysis. The ape pkg has a nice routine for that and I think it lets you create biplots as in PCA (it also gives you the broken-stick criterion for evaluating which dimensions are worth trying to interpret).
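For anyone not working in R, the basic idea behind Gower's coefficient can be sketched in a few lines of Python. This is a simplified illustration with made-up records and a hypothetical `gower_distance` helper; it is not how daisy() is implemented, and it ignores ordinal variables, missing values and variable weights. The principal coordinates step is not shown.

```python
def gower_distance(a, b, kinds, ranges):
    """Average per-variable dissimilarity between two records.

    kinds[k] is "num" or "cat"; ranges[k] is the observed range of
    numeric variable k (unused for categorical variables).
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / rng          # range-scaled difference
        else:
            total += 0.0 if x == y else 1.0    # simple mismatch
    return total / len(kinds)

# Made-up mixed data: age, income, area type, owns a car.
records = [
    (34.0, 52000.0, "urban", "yes"),
    (51.0, 38000.0, "rural", "no"),
    (29.0, 61000.0, "urban", "no"),
]
kinds = ("num", "num", "cat", "cat")

# Range of each numeric variable over the whole data set.
ranges = [
    max(r[k] for r in records) - min(r[k] for r in records)
    if kinds[k] == "num" else None
    for k in range(len(kinds))
]

# Full pairwise dissimilarity matrix; every entry lies in [0, 1].
dist = [[gower_distance(a, b, kinds, ranges) for b in records]
        for a in records]

for row in dist:
    print([round(d, 3) for d in row])
```

A matrix like `dist` is what a principal coordinates analysis takes as input in place of the raw variables.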

Biplots are handy because they show you something about the relationship between individual variables and their multivariate summary.

The result of the PCA won't be normalized.

PCA is an orthogonal linear transformation between your initial data space and a new space that is spanned by the eigenvectors of the covariance/correlation matrix.

The only thing you can say is that your data in this new space is expressed as a linear combination of orthogonal eigenvectors (the principal axes; most packages return them with unit length). The coordinates along these axes, the principal component scores, are uncorrelated, but their variances equal the corresponding eigenvalues, so they are not on a common scale. Working in an orthogonal basis makes the data easier to model (scalar products, for example, are simpler to work with).

I guess the projections of your data onto these eigenvectors (the component scores) could be your features, but you need to normalize them yourself :)
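Here is a minimal NumPy sketch of that last point, assuming you compute PCA yourself from the covariance matrix of standardized data. The data is randomly generated and all variable names are made up. It shows that the scores come out with variances equal to the eigenvalues, and that dividing by the square root of each eigenvalue puts them on unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 observations of 3 variables on different scales,
# the first two driven by a shared latent factor.
latent = rng.normal(size=(200, 1))
X = np.hstack([
    10.0 * latent + rng.normal(size=(200, 1)),
    0.5 * latent + rng.normal(scale=0.2, size=(200, 1)),
    rng.normal(scale=3.0, size=(200, 1)),
])

# Standardize, so the covariance matrix of Z is the correlation matrix of X.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]            # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the eigenvectors to get the component scores.
scores = Z @ eigvecs
score_var = scores.var(axis=0, ddof=1)

# Dividing each score column by sqrt(eigenvalue) gives unit variance.
scores_unit = scores / np.sqrt(eigvals)

print("eigenvalues:    ", np.round(eigvals, 3))
print("score variances:", np.round(score_var, 3))
print("after rescaling:", np.round(scores_unit.var(axis=0, ddof=1), 3))
```

Whether you want the rescaled scores depends on the downstream model: rescaling throws away the information about how much variance each component explains.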
