## Factor Analysis: some suggestions

Rather than have another zombie post (which I'll get back to in a day or two), I thought I would write a brief blog post about a couple of tricks I picked up regarding Factor Analysis (FA), a statistical procedure in the correlation family. FA is sort of the black sheep of correlations, though: while it is technically non-parametric, normality and linearity assumptions make for a much "healthier" interpretation. In addition, there are some other funny requirements; for instance, there should be at least one hundred sets of responses to your stimuli/questions.

FA's most obvious implementation (using Principal Component Analysis) is finding the Components that underlie a person's responses on a questionnaire. Essentially, a Component is the structure of how certain questions on that questionnaire are answered similarly. Another way of putting it is that it represents a hidden dimension running through several variables. So, for instance, I ask someone on a questionnaire, on a scale of 1-5 (1 being "yuck" and 5 being "yum"), how raspberries taste. They rate them a 2. Now, raspberries are quite sweet and are a berry, so when I ask them how much they like blackberries, they give those a 2 as well. I perform an FA after processing 100 participants and find that there is a hidden dimension: people who give an opinion on raspberries will have similar opinions on blackberries.

Over the course of many questions (some to do with berries, some to do with larger fruit, and so on), a Component emerges showing that participants answer certain questions similarly (or reverse-similarly, which I'll get to later). So if someone rates raspberries a 2, there's a strong chance they'll rate blackberries similarly. If another participant rates raspberries a 4, blackberries will probably get a 4 too. This set of similar responses is a Component. By the way, reverse-similarity usually occurs when you phrase a question once in the negative and again in the positive. One questionnaire I was exposed to had these two questions:

- Do you believe Jesus Christ is the Son of God?
- I don’t believe that Jesus Christ is the Savior and Son of God

It was a pretty bad questionnaire, because the questions are almost identical but reversed. If someone answered the first question with a 5 (strongly believe), then they will almost certainly answer the second with a 1 (strongly disagree with the statement). You'll find reverse-similarities in the Component Matrix in your statistical program, designated with a little minus sign.
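As a minimal sketch of why the minus sign appears, here are some made-up responses to a positively and a negatively phrased version of the same question (the numbers are illustrative, not from a real dataset). The two items correlate negatively until the reverse-phrased one is reverse-coded:

```python
import numpy as np

# Hypothetical 1-5 responses from five participants to the two
# near-identical belief questions (illustrative numbers only).
positive_item = np.array([5, 4, 5, 1, 2])
negative_item = np.array([1, 2, 1, 5, 4])  # the reverse-phrased twin

# The correlation comes out strongly negative. This is what shows up
# as the little minus sign in the Component Matrix.
r = np.corrcoef(positive_item, negative_item)[0, 1]
print(round(r, 2))

# Reverse-coding flips the scale so both items point the same way:
# on a 1-5 scale, a 2 becomes a 4, a 5 becomes a 1, and so on.
reversed_item = 6 - negative_item
print(np.corrcoef(positive_item, reversed_item)[0, 1] > 0)
```

The `6 - x` trick works for any 1-5 scale; for a 1-7 scale it would be `8 - x`.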

Anyway, that's how part of it works. To determine whether something is a Component, there are two common ways to go about it: the Kaiser criterion and scree plots.

The Kaiser criterion refers to retaining Components whose eigenvalues are greater than 1. Eigenvalues are kind of like Component contestants, each explaining a certain amount of the overall variance. The majority are very poor (less than 1) and are uniformly ignored, because they explain less variance than a single variable would. The Kaiser criterion, despite its aloof allusions, is actually pretty broad and gets progressively less useful the bigger the questionnaire is. As a side note, reverse-similarity questions actually muck around with the eigenvalues and subsequently with the criterion, so either design a questionnaire that has all its questions in the positive (or all in the negative), or painstakingly reverse-code the responses (a 2 becomes a 4, for instance) and say a short prayer to ethical statistical procedures, because doing that after the fact is not really a good thing to do.
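As a sketch of the Kaiser criterion in action, here is a made-up correlation matrix for six items where two clusters of questions hang together (the matrix and its values are assumptions for illustration). The eigenvalues come straight off the correlation matrix, and only those above 1 survive:

```python
import numpy as np

# A made-up correlation matrix for six questionnaire items:
# Q1-Q3 hang together (say, a "berry" cluster), Q4-Q6 hang together.
R = np.array([
    [1.0, 0.6, 0.5, 0.1, 0.1, 0.0],
    [0.6, 1.0, 0.5, 0.1, 0.0, 0.1],
    [0.5, 0.5, 1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.6, 0.5],
    [0.1, 0.0, 0.1, 0.6, 1.0, 0.5],
    [0.0, 0.1, 0.1, 0.5, 0.5, 1.0],
])

# Eigenvalues of the correlation matrix, sorted largest first.
eigvals = np.linalg.eigvalsh(R)[::-1]

# Kaiser criterion: keep only eigenvalues greater than 1.
kept = eigvals[eigvals > 1.0]
print(np.round(eigvals, 2))
print(len(kept))
```

With this matrix the criterion keeps two Components, matching the two built-in clusters; on a big questionnaire with messier correlations it tends to keep far more than you would want, which is the "too broad" problem described above.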

Scree plots on the other hand are useful. Consider this one:

Scree plots have "peaks" on the left and progressively weaker Components on the right. The trick is, apparently, to determine which Components are part of the slope, and which ones are the "scree" that should be ignored. As this graph hopefully shows you, that is sometimes a difficult interpretation. The standard advice is to find the first Component, draw a line diagonally downwards, and see how many Components trend along that line. After that, draw a "scree" line backwards from the lowest Component and see how many trend along that one. This is quite difficult, though, for a couple of reasons. Firstly, it looks like it would be fair to include Component 3 in the graph above, although whether it trends or not is up to interpretation. Secondly, if we're counting Component 3, Components 4, 5 and 6 are all pretty decent and could be up for inclusion too.
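If your statistics package doesn't draw one for you, a scree plot is easy to sketch by hand. This minimal example uses matplotlib with hypothetical eigenvalues chosen to echo the shape described above: a steep slope, an ambiguous third Component, and a flat scree tail:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical eigenvalues (illustrative numbers only): a steep slope,
# an ambiguous middle (Component 3), and a flat "scree" tail.
eigvals = [4.2, 2.1, 1.3, 0.9, 0.8, 0.7, 0.3, 0.2, 0.15, 0.1]
components = range(1, len(eigvals) + 1)

fig, ax = plt.subplots()
ax.plot(components, eigvals, marker="o")
ax.axhline(1.0, linestyle="--")  # Kaiser cut-off, for comparison
ax.set_xlabel("Component")
ax.set_ylabel("Eigenvalue")
ax.set_title("Scree plot")
fig.savefig("scree.png")
```

Plotting the Kaiser cut-off on the same axes is a handy way to see how often the two methods disagree about the middle Components.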

I think I've found a good way around this interpretative quandary. Although it arrives (at least in SPSS) with the initial findings, the Component Matrix readout is supposed to be about improving the questionnaire for later application. The Matrix is used to remove questions that don't load well onto any of the retained Components, but it can also be used to pick up reverse-similarity questions you may have missed. While you can alter these questions, I would suggest that rather than altering or deleting them, you keep them within the Component (unless they really drag down the eigenvalue), because they may well be interesting points of analysis and not just a negative phrasing of a previously positive question.
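The "remove questions that don't load" step can be sketched directly from the correlation matrix. This is a minimal, unrotated principal-component version (loadings computed as eigenvector times the square root of the eigenvalue); the matrix, the .3 cut-off from the post, and the choice to keep one Component are all assumptions for illustration:

```python
import numpy as np

# Made-up correlation matrix: Q1-Q3 hang together, while Q4 barely
# correlates with anything (numbers are illustrative only).
R = np.array([
    [1.0, 0.6, 0.5, 0.05],
    [0.6, 1.0, 0.5, 0.05],
    [0.5, 0.5, 1.0, 0.05],
    [0.05, 0.05, 0.05, 1.0],
])

# Unrotated principal-component loadings: eigenvector * sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_keep = 1  # pretend only C1 survived the Kaiser/scree checks
loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])

# Questions with no loading of at least .3 (in absolute value) on any
# retained Component are candidates for removal from the questionnaire.
weak = [f"Q{i + 1}" for i, row in enumerate(np.abs(loadings)) if row.max() < 0.3]
print(weak)
```

Here Q4 gets flagged, which is exactly the kind of question the Component Matrix readout exists to catch.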

Another thing you can do with the Component Matrix is determine the usefulness of your Components. Now, you aren't technically supposed to do this, but I have yet to find a reason why not. Let's say that you have a Component Matrix that looks like this:

|    | C1    | C2    | C3   | C4 |
|----|-------|-------|------|----|
| Q1 | .550  | .361  |      |    |
| Q2 | -.347 | .334  |      |    |
| Q3 | .716  |       |      |    |
| Q4 | .403  | .465  |      |    |
| Q5 | .595  | -.323 |      |    |
| Q6 | .300  | .300  | .400 |    |

Note that blank entries mean the Component's variance for that question is less than .3. Anyway, we can see that C1 is really quite strong, as is C2. Both share variance across most of the questions, but individually (and this is what we would hopefully see for all questions) each explains a lot of variance. If Components tended to share a question's variance heavily, you would consider oblique rotations (rather than the orthogonal rotation we are using here), but that's for another time.
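One rough way to put a number on "sharing the variance a lot" is to count questions that load on two or more Components at once. This is just a heuristic sketch, not a standard diagnostic, using the matrix above with blanks written as zeros:

```python
import numpy as np

# The Component Matrix from the post, with blank entries (< .3) as 0.0.
loadings = np.array([
    [0.550, 0.361, 0.0, 0.0],   # Q1
    [-0.347, 0.334, 0.0, 0.0],  # Q2
    [0.716, 0.0, 0.0, 0.0],     # Q3
    [0.403, 0.465, 0.0, 0.0],   # Q4
    [0.595, -0.323, 0.0, 0.0],  # Q5
    [0.300, 0.300, 0.400, 0.0], # Q6
])

# Count questions loading >= .3 (in absolute value) on two or more
# Components; a lot of these is one cue to try an oblique rotation.
cross_loaded = int(np.sum((np.abs(loadings) >= 0.3).sum(axis=1) >= 2))
print(cross_loaded)
```

Five of the six questions here are cross-loaded, which is the kind of pattern that would push you towards an oblique rotation.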

I attempt to find justification for a Component via the Matrix. I do this by looking at the variance it explains and at whether it can explain at least one question better than any other Component; doing so encourages its inclusion. As we can see, Component 4 doesn't do that at all: every question it puts some variance into already has more of it explained by one of the other Components. Component 4 is therefore up for swift removal. Even if it did have one or two questions where it explained more variance than the others, over the course of a 100-question questionnaire it would probably be prudent to remove those questions rather than include the Component. A good rule of thumb in my experience is this simple equation:

Where *q* is the number of questions, *C* is the proposed number of Components, and *a* is the number of questions for which the Component must explain the most variance in order to be included.
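The "does this Component win at least one question?" check above is easy to mechanize. This sketch counts, for the example matrix, how many questions each Component explains best (the matrix values are from the table earlier, with blanks as zeros):

```python
import numpy as np

# The same illustrative Component Matrix, blanks (< .3) as 0.0.
loadings = np.array([
    [0.550, 0.361, 0.0, 0.0],   # Q1
    [-0.347, 0.334, 0.0, 0.0],  # Q2
    [0.716, 0.0, 0.0, 0.0],     # Q3
    [0.403, 0.465, 0.0, 0.0],   # Q4
    [0.595, -0.323, 0.0, 0.0],  # Q5
    [0.300, 0.300, 0.400, 0.0], # Q6
])

# For each question, find which Component explains the most variance
# (largest absolute loading).
winners = np.abs(loadings).argmax(axis=1)

# Tally "wins" per Component; one that never wins a question is up
# for swift removal, as with Component 4 here.
wins = [int(np.sum(winners == c)) for c in range(loadings.shape[1])]
print(wins)
removable = [f"C{c + 1}" for c, w in enumerate(wins) if w == 0]
print(removable)
```

On this matrix C1 wins four questions, C2 and C3 one each, and C4 none, reproducing the verdict above; the threshold *a* from the rule of thumb would then be compared against each Component's win count.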

I included Q6 in the Matrix figure, wherein Component 3 explains more than Components 1 and 2 individually, but not more than the two combined. I haven't had this happen to me personally, but I guess it would be open to interpretation. Maybe it nudges the Component across the line of acceptability if you count it as important for that one question, or maybe the pattern repeats elsewhere, in which case you should start using oblique rotations.

Of course, this is all rule-of-thumb stuff, but it has helped me a long way with a really nifty statistical analysis that has a hundred and one applications. Take these suggestions with a grain of salt: I'm a statistics enthusiast but far removed from an expert.