11  Factor analysis: EFA/CFA, PCA

Dimension reduction, ‘carving reality at its joints’, and uncovering structural parameters?

11.1 Where have we used this/where do we anticipate using it?

11.2 The idea

11.3 PCA

11.4 Factor analysis

11.4.1 Conceptual question 1: How can we have correlated traits but no overlap in factor loadings?

Factors represent ‘real but latent ways people differ’ (in personality psych)

“We don’t want cross-loadings” (at least in the case where we use sum scores). That is, we don’t want the ‘factor loadings’ to be nonzero (or far from 0) on the same items for multiple factors.

Does this make sense?

… We don’t want our indicators (survey measures) to be measuring more than one true factor, I guess. We don’t want questions we think are about happiness to really be picking up sociability. But can such questions be designed? Suppose the questions do pick up multiple personality characteristics. If we redefine the ‘factors’ so that they only load on separate sets of questions, I don’t see how these redefined factors can truly reflect the personality characteristics.

We know

  1. the latent personality traits are correlated (e.g., happy people are more sociable)

  2. Thus, people who are happy will tend to be more sociable, and thus tend to respond more positively both to questions about happiness and to questions about sociability

  3. So if we restrict ‘only questions about sociability can load onto this factor’, I claim that the value of this factor will tend to be higher for people who are happy (I know we are not literally imposing this restriction, but that is the stated goal, so we should consider what it implies)

  4. Thus: does this imply that the factor is measuring happiness and not just sociability?

Perhaps the answer is something like

Yes, each sociability question will tend to be higher for people who are happier (higher on the true ‘happiness factor’). However, the estimated ‘sociability factor level’ for an individual (the sum of factor loadings times that individual’s item values) need not involve nonzero loadings on the happiness items. If we are ‘already adequately picking up sociability’ through the loadings on the sociability questions, what remains (the residual?) need not be related to the happiness items.
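A quick way to check this intuition is by simulation. Below is a minimal sketch in base R (the trait names, loadings, noise levels, and the 0.5 trait correlation are all invented for illustration): two correlated latent traits, each item caused by exactly one trait, and an oblique (promax) factor solution.

```r
# Simulate two *correlated* latent traits where each item loads on only one.
set.seed(1)
n <- 1000
happiness   <- rnorm(n)
sociability <- 0.5 * happiness + sqrt(1 - 0.5^2) * rnorm(n)  # cor ~ 0.5

make_item <- function(trait) 0.8 * trait + rnorm(n, sd = 0.6)
items <- data.frame(
  h1 = make_item(happiness),   h2 = make_item(happiness),
  h3 = make_item(happiness),   s1 = make_item(sociability),
  s2 = make_item(sociability), s3 = make_item(sociability)
)

# An oblique rotation lets the factors themselves correlate, so the
# happiness-sociability correlation need not show up as cross-loadings.
fa <- factanal(items, factors = 2, rotation = "promax", scores = "regression")
print(fa$loadings, cutoff = 0.2)  # expect near-zero cross-loadings
cor(fa$scores)                    # the estimated factors still correlate
```

If this logic is right, the trait-level correlation ends up in the correlation between the factors, leaving the loading matrix with (approximately) simple structure.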

I know what principal component analysis does: it reduces dimensionality. It projects your outcomes onto a series of vectors (components) so that each ‘individual’ can be described by a value (score) for each component. These scores ‘predict’ her value for each of the outcomes, and the algorithm chooses the components to minimize the sum of squared differences (SSE) between prediction and actual. A natural result of ‘minimizing the SSE’ is that each component vector is orthogonal to each other vector. How many vectors (components)? That’s another issue, and there are various rules of thumb and metrics.
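To make that description concrete, here is a minimal base-R sketch (the data are arbitrary, made up for the example):

```r
# PCA with prcomp(): project individuals' outcomes onto orthogonal
# component vectors and give each individual a score on each component.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # correlated with x1
x3 <- rnorm(n)                       # roughly independent of both
dat <- data.frame(x1, x2, x3)

pca <- prcomp(dat, center = TRUE, scale. = TRUE)
summary(pca)            # variance explained: input to 'how many components?' rules
round(pca$rotation, 2)  # the component vectors (orthogonal by construction)
head(round(pca$x, 2))   # each individual's score on each component
```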

I know what factor analysis intends to do, and why this is conceptually different from PCA.

Explanation here about ‘latent factors’, with genes as an example, before they could be observed directly… We only observe and measure physical and behavioral outcomes, like height, weight, math performance, and verbal performance on specific tests. We want to infer ‘how many genes determine these’ and ‘how best to predict the value of these genes for each individual’. Note that the genes themselves may be correlated. If there is a ‘growth’ gene and an ‘intelligence’ gene, they may tend to be positively related to one another because of other factors (assortative mating, health and nutrition in early childhood, etc.).

True ‘factors’ (genes): intelligence, growth

(However, we do not know how many true factors/genes there are… we just know there are some number.)

Measured ‘items’: height, weight, hand_size, verbal_score, math_score, spatial_score

What I’ve been stuck on is: mechanically, what is it that FA does to ‘better recover these genes (factors)’ that PCA does not do? It is clearly not merely trying to maximize prediction… because that would be PCA. Maybe it maximizes predictiveness (minimizes some error term) subject to some constraints, where imposing these constraints makes it more likely that we are identifying genes, i.e., makes the factors a better estimate of these gene values.
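For reference, the textbook common-factor model offers one mechanical answer (this is standard material, not something derived from the notes above): FA does not try to reproduce each individual’s item values; it tries to reproduce the covariance matrix of the items using shared variance only, granting each item its own unique variance:

\[
x = \Lambda f + \varepsilon, \qquad \Sigma = \Lambda \Phi \Lambda' + \Psi,
\]

where \(x\) is the vector of items, \(\Lambda\) the matrix of factor loadings, \(f\) the latent factors with correlation matrix \(\Phi\) (the identity if the factors are orthogonal), and \(\Psi\) a diagonal matrix of unique, item-specific variances. Because \(\Psi\) soaks up variance that is not shared across items, the factors are fit to the common variance only; PCA, by contrast, decomposes total variance, unique parts included.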

One possible ‘constraint’ that might be meaningful: maybe we know that specific genes cause (or ‘have a (causal) impact on’) certain behaviors and traits, although we do not know which ones or by how much. (A stronger assumption would be that ‘only a single gene causes each measured behavior or trait’… but that’s probably too strong and unrealistic.) This (weaker or stronger) assumption implies that certain traits, namely the traits ‘caused’ by the same gene, must indeed be correlated in the population. If the ‘growth’ gene causes measured height, weight, and hand size, then people will tend to have high, medium, or low values of all of these, depending (at least in part) on the value of their growth gene. If the ‘intelligence’ gene causes measured verbal, math, and spatial performance, these will be similarly related.
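Continuing the gene example in code (again a sketch: the gene correlation, loadings, and noise levels are all invented), we can generate the two correlated ‘genes’, let each cause three of the six items, and compare what FA and PCA recover:

```r
# Two correlated latent 'genes' each cause three of the six measured items.
set.seed(2)
n <- 1000
growth       <- rnorm(n)
intelligence <- 0.4 * growth + sqrt(1 - 0.4^2) * rnorm(n)  # correlated genes

item <- function(trait, b) b * trait + rnorm(n, sd = 0.5)
items <- data.frame(
  height        = item(growth, 0.9),
  weight        = item(growth, 0.8),
  hand_size     = item(growth, 0.7),
  verbal_score  = item(intelligence, 0.9),
  math_score    = item(intelligence, 0.8),
  spatial_score = item(intelligence, 0.7)
)

fa  <- factanal(items, factors = 2, rotation = "promax")
pca <- prcomp(items, scale. = TRUE)

print(fa$loadings, cutoff = 0.2)  # FA: roughly one 'gene' per factor
round(pca$rotation[, 1:2], 2)     # PCA: PC1 tends to mix all six items
```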

On the other hand, the axes of variation (‘principal component vectors’) that most succinctly predict all of the measured items above might involve ‘substantial mixing’ of the physical measures and scores, even measures that do not themselves tend to be strongly correlated with one another. In other words, multiple items given positive weights in the same vector might actually be negatively correlated in the population. Why?

Is this a helpful example? The data may mix people from two cities, one of which pushes fatty foods but has great math education. This may then lead to the most predictive vectors (here I state them as weighted sums over the item values) being something like (wait, I need to make these orthogonal… ugh) \(V_1 = -1\,\text{height} + 7\,\text{weight} + 8\,\text{math} - 1\,\text{verbal} - 1\,\text{spatial}\) and \(V_2 = 2\,\text{height} - 3\,\text{weight} + 2\,\text{math} + 1\,\text{verbal} + 8\,\text{spatial}\). But in fact (do we know this?), height and weight are correlated in the population, as are all of the ‘intelligence’ scores. The psychometrician doesn’t know this (because this is ‘exploratory’), but they do know that whatever the latent factors are, the items all caused by the same factor, and thus given positive weights within a factor, must (?) be correlated with one another. (At least if multiple, negatively correlated genes/factors are not each having important impacts on these items.)
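Here is a rough simulation of that two-city story (everything below is invented to make the point): within each city, weight and math are unrelated, but pooling the cities induces a correlation, and PC1 then mixes a physical measure with a cognitive one.

```r
# City A (city = 1) pushes fatty foods and has great math education;
# city B (city = 0) is the reverse. Height is unrelated to city.
set.seed(3)
n <- 500
city   <- rep(c(0, 1), each = n)
weight <- 2 * city + rnorm(2 * n)
math   <- 2 * city + rnorm(2 * n)
height <- rnorm(2 * n)

cor(weight, math)  # positive in the pooled data
sapply(split(data.frame(weight, math), city),
       function(d) cor(d$weight, d$math))  # ~0 within each city

pca <- prcomp(data.frame(height, weight, math), scale. = TRUE)
round(pca$rotation, 2)  # PC1 gives weight and math similar-signed weights
```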

So the psychometrician says (perhaps driven by some theoretical model of genes, stated in explicit equations):

I am going to require that, whatever my factors are, the items given substantial positive weights within them must be positively correlated in the population

But what are ‘substantial positive weights’? Perhaps the metric itself is weighted; the more positive the weights in each factor, the more correlated the items must be.

This then may lead them to estimate a sort of ‘constrained PCA’ model… maximizing the predictiveness under a constraint like \(R_{12} := \rho(\gamma_1 x_1, \gamma_2 x_2) > 0.6 \;\; \forall\, x_1, x_2\), where \(\rho\) is Pearson’s correlation coefficient and the \(\gamma\)’s represent the factor loadings on any two items \(x_1\) and \(x_2\).

Or perhaps, rather than a constraint, the FA may maximize a value function that considers both these \(R_{ij}\) measures and the predictive power (the latter as in PCA).

If the ‘theory is correct’ and ‘factors tend to contain correlated items’, then imposing the constraint (or ‘reward in the value function’) that items should be correlated will likely yield better estimates of the true factors.

But what does this say about ‘using FA to choose which items to include’?

11.4.2 Suggested readings

  1. Best Practices in Exploratory Factor Analysis (Jason W. Osborne)
  2. “Dynamic Fit Index Cutoffs for Confirmatory Factor Analysis Models” (McNeish & Wolf)

11.4.3 Integrate…

Willem’s work link: ‘How to Science’

https://willemsleegers.github.io/how-to-science/content/statistics/factor-analysis/

```r
knitr::include_url("https://willemsleegers.github.io/how-to-science/content/statistics/factor-analysis/")
```