Genes and ethnicity: can we?

The technical details

Aug 25, 2022

This is the last of three articles about this controversial topic. The previous two are here and here. This piece focuses more on technical issues, but is still aimed at the ordinary reader. I’m neither a geneticist nor a specialist statistician; read this not as gospel, but as a reasonably well-informed take, and do disagree or correct me if you know better.

The takeaway is that finding out whether genetics accounts for differences between ethnic groups is scientifically complex and difficult, but not impossible. For example, it is an easier question than whether democratic countries grow faster, which political economists have been trying to figure out for decades. There are challenges, but they are no worse than for many other social science questions.

Start with two groups A and B, and a variable we care about, Y. The abstraction is for a reason. The groups could be anything — any ethnic groups, or even other groups like rich and poor — and the variable could be anything too, from hair colour to education and income. For a concrete example, I’ll use height. Height is not deeply politically controversial, and it is intuitive enough to bring out the issues. Many of the statistical issues are the same, whether you are thinking about controversial characteristics, or unimportant ones.

The two groups differ on the outcome variable: Yᴬ ≠ Yᴮ, where these are the group averages. And the outcome itself may depend on individual genetics. Individual i’s outcome is Yᵢ = Yᵢ(gᵢ) where gᵢ is some genetic measure such as a polygenic score. (Nerds use “i” to mean “any given person”; let’s call him Ian.) Ian’s outcome might also depend on the environment; ignore that for now. So Yᴬ is the average of Yᵢ(gᵢ) over every individual i in group A; and Yᴮ is the average over everyone in group B.

Because the variable depends on genetics, we can ask counterfactual questions like “what would happen if someone had different genes?” So Yᵢ(g) is the outcome that Ian would have if he had genetic score g rather than his actual score gᵢ, holding everything else equal including the environment. You could think of this as the result of an unlikely experiment where a child with different genes is born to Ian’s parents. Or if you are OK assuming that adoptive and natural children have the same outcomes, then you can imagine Ian's parents adopting someone with genes g.

We can now ask: what if these groups had the same genetic scores? For example, suppose you draw genetic scores at random from group B and allocate them to people in group A? Then the counterfactual average would be

Yᴬ(gᴮ) = E[Yᵢ(gⱼ)]

with the expectation taken over all individuals i in group A and j in group B. And Yᴮ(gᴬ) is the same thing with the groups reversed.
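To make the definition concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the outcome function `height`, the personal baselines and the genetic scores are hypothetical, not data. It simply averages Yᵢ(gⱼ) over every pairing of a person i in group A with a score from a person j in group B.

```python
from itertools import product
from statistics import mean

# Hypothetical potential-outcome function Y_i(g): a person's height in cm,
# given a personal baseline (everything else, held fixed) and a genetic score g.
def height(baseline, g):
    return baseline + 2.0 * g

# Invented groups: each person is (baseline, own genetic score).
group_a = [(164.0, -0.5), (166.0, 0.0), (165.0, 0.5)]
group_b = [(170.0, 1.0), (172.0, 2.0), (171.0, 3.0)]

def actual_mean(group):
    # Average of Y_i(g_i): everyone keeps their own score.
    return mean(height(b, g) for b, g in group)

def counterfactual_mean(receivers, donors):
    # E[Y_i(g_j)]: every receiver i is paired with every donor score g_j.
    return mean(height(b_i, g_j)
                for (b_i, _), (_, g_j) in product(receivers, donors))

y_a = actual_mean(group_a)                          # 165.0
y_b = actual_mean(group_b)                          # 175.0
y_a_with_b = counterfactual_mean(group_a, group_b)  # 169.0
print(y_a, y_b, y_a_with_b)
```

With these made-up numbers, group A averages 165 cm, group B averages 175 cm, and group A with group B's scores would average 169 cm.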

Given that, you can define

Pᴬ = (Yᴬ(gᴮ) – Yᴬ) / (Yᴮ – Yᴬ)

as the proportion of the difference between the groups that would no longer exist if A had B’s genetic scores; in other words, the part of the difference that is accounted for by genetics.

Here’s an example. Suppose on average, group A is 165 centimeters tall and group B is 175 centimeters tall. But if group A had group B’s genetic scores, then they would be 169 centimeters tall on average. The equation looks like

Pᴬ = (169 – 165) / (175 – 165) = 4/10 = 40%.

4 cm of the 10 cm difference between the groups is explained by the genetic score.
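The same arithmetic as a small Python helper. The function name `proportion_explained` is my own label, not standard terminology.

```python
def proportion_explained(y_own, y_other, y_own_with_other_scores):
    """P = (Y_own(g_other) - Y_own) / (Y_other - Y_own)."""
    gap = y_other - y_own
    if gap == 0:
        raise ValueError("the groups have the same average, so P is undefined")
    return (y_own_with_other_scores - y_own) / gap

# The height example: A is 165 cm, B is 175 cm, A with B's scores is 169 cm.
p_a = proportion_explained(165.0, 175.0, 169.0)
print(f"{p_a:.0%}")  # 40%
```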

Now let’s discuss the complexities.

First of all, there’s no guarantee that Pᴬ is between 0 and 100%. Genes might work in the opposite way to what you expect. Suppose that Finns are shorter than the French (NB: they’re not) and that Finns are naturally very tall, but undernourished. If they had the same genes as Frenchmen they might be even shorter, and the difference would be even bigger. Pᴬ will be negative.

Or suppose that Finns are naturally very short, but very well-nourished. If they had French genetics they would be even taller than the French. Then Yᴬ(gᴮ) > Yᴮ and Pᴬ will be more than 100%.
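Both cases are easy to check numerically. The heights below are invented purely to reproduce the two scenarios; they say nothing about real Finns or Frenchmen.

```python
def proportion(y_own, y_other, y_own_with_other_scores):
    return (y_own_with_other_scores - y_own) / (y_other - y_own)

# Hypothetical averages in cm; Finns are group A, the French are group B.
y_finns, y_french = 170.0, 175.0

# Case 1: naturally tall but undernourished Finns get shorter with French scores.
p_negative = proportion(y_finns, y_french, 168.0)   # (168 - 170) / 5 = -0.4

# Case 2: naturally short but well-nourished Finns overtake the French.
p_above_one = proportion(y_finns, y_french, 178.0)  # (178 - 170) / 5 = 1.6

print(p_negative, p_above_one)
```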

Second, note that there’s not one proportion here but two. As well as Pᴬ, there’s

Pᴮ = (Yᴮ(gᴬ) – Yᴮ) / (Yᴬ – Yᴮ).

You can ask what if Finns had French genes, or what if French people had Finnish genes. These are not guaranteed to give you the same answer! Suppose that because Finnish people are very malnourished, they never grow beyond 165 cm whatever their genetics. But French people’s height depends on genetics. Then giving Finnish people French genes won’t change the difference between the groups at all. But doing the reverse would. Another very likely example is that the genetic score may have been created using one group, and it will be less predictive for the other group. (That is another variant of the “white people genetics” problem, described before.)
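Here is the malnourished-Finns scenario as a Python sketch. The two outcome functions and all the scores are hypothetical. Because everyone within a group shares the same outcome function here, the counterfactual average is just that function averaged over the other group's scores.

```python
from statistics import mean

finn_scores = [0.0, 1.0, 2.0]      # invented genetic scores
french_scores = [1.5, 2.5, 3.5]

def finn_height(g):
    # Hypothetical: malnutrition pins height at 165 cm whatever the score.
    return 165.0

def french_height(g):
    # Hypothetical: height responds to the score.
    return 170.0 + 2.0 * g

y_finn = mean(finn_height(g) for g in finn_scores)                  # 165
y_french = mean(french_height(g) for g in french_scores)            # 175
y_finn_with_french = mean(finn_height(g) for g in french_scores)    # 165
y_french_with_finn = mean(french_height(g) for g in finn_scores)    # 172

p_finn = (y_finn_with_french - y_finn) / (y_french - y_finn)        # 0 / 10
p_french = (y_french_with_finn - y_french) / (y_finn - y_french)    # -3 / -10

print(p_finn, p_french)
```

Giving the Finns French scores explains none of the gap, while giving the French Finnish scores explains 30% of it.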

Next, how do we get the counterfactual values Yᵢ(g)? The obvious answer is that within each group we run a regression:

Yᵢ = α + β gᵢ + error.

(In other words, we plot people’s genetic scores against their values of Y, and draw a straight line through the middle.) That will estimate β, the slope of the outcome on the genetic score. Savvy readers will already be thinking “what if something else correlates with genetics?”, such as, oh say, the area you live in. Yes, that is indeed a problem: gene-environment correlation, sometimes known as “rGE”. Some gene-environment correlation is truly causal, like when a child’s personality causes its parents to react differently to it; geneticists call this “evocative rGE”. But some, like where you were born, is not.

Luckily, we can estimate the regression using siblings from within the same family. Siblings have similar environments, and even better, they have genes allocated at random by the lottery of meiosis, so differences in g between them are independent of other factors (except for evocative rGE, which we want to keep because it is truly an effect of genetics). Doing this, we can get a true causal estimate of β for each group.
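A small simulation shows why the sibling comparison helps. This is a sketch under strong assumptions that I have made up: a true β of 2, a family environment that is correlated with the family's average score, exactly two siblings per family, and no evocative rGE. The "sibling" estimate here is a regression of sibling differences in Y on sibling differences in g, which is one simple way of doing a within-family regression.

```python
import random

random.seed(0)
TRUE_BETA = 2.0      # assumed causal effect of the score
N_FAMILIES = 20000

def slope(xs, ys):
    # Ordinary least squares slope of ys on xs (with an intercept).
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

all_g, all_y, diff_g, diff_y = [], [], [], []
for _family in range(N_FAMILIES):
    family_g = random.gauss(0.0, 1.0)
    # Shared environment, correlated with the family's genetics (non-causal rGE).
    family_env = 3.0 * family_g + random.gauss(0.0, 1.0)
    siblings = []
    for _child in range(2):
        g = family_g + random.gauss(0.0, 1.0)   # the lottery of meiosis
        y = TRUE_BETA * g + family_env + random.gauss(0.0, 1.0)
        siblings.append((g, y))
        all_g.append(g)
        all_y.append(y)
    diff_g.append(siblings[0][0] - siblings[1][0])
    diff_y.append(siblings[0][1] - siblings[1][1])

naive_beta = slope(all_g, all_y)        # contaminated by the shared environment
sibling_beta = slope(diff_g, diff_y)    # shared environment cancels out
print(round(naive_beta, 2), round(sibling_beta, 2))
```

Under these assumptions the naive regression should land somewhere around 3.5, well above the true value, while the sibling-difference regression should land close to 2.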

UPDATE 23 Oct 2023: I now think this mistakenly conflates “what if group A had group B’s polygenic score?” with the broader question we really want to answer, “what if group A had group B’s genetics?” Polygenic scores contain noise which is correlated with, but does not cause, the outcome that we care about; the difference between group A and group B may contain more (or less) noise than the difference between two members of any one group. This post has the details:

Schizophrenia: a coda (Wyclif's Dust, September 23, 2023)

Notice also: we don’t need much information to answer this question. We need to estimate β for one group; and we need to know the distribution of genetic scores in each group. In fact, if the model above is the right one, you only need the average genetic score in each group, because:

Yᴬ = α + β gᴬ + error

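where gᴬ is the average score in group A. Giving group A the scores of group B changes only that average, so Yᴬ(gᴮ) = α + β gᴮ + error, with the same α and the same average error. Hence the top of the Pᴬ equation is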

Yᴬ(gᴮ) – Yᴬ = β(gᴮ – gᴬ).
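In code, the linear shortcut needs only three ingredients: a slope and the two average scores, plus the observed gap in outcomes. The values below are hypothetical, chosen to match the height example.

```python
from statistics import mean

beta = 2.0                      # assumed slope, e.g. from a sibling regression
scores_a = [-0.5, 0.0, 0.5]     # invented genetic scores, group A
scores_b = [1.0, 2.0, 3.0]      # invented genetic scores, group B
y_a, y_b = 165.0, 175.0         # invented observed group averages in cm

# Top of the P_A equation under the linear model: beta * (g_B - g_A).
numerator = beta * (mean(scores_b) - mean(scores_a))
p_a = numerator / (y_b - y_a)

print(numerator, p_a)  # 4.0 0.4
```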

But in fact that’s a very idealized model, because it assumes that the effect of genetics is linear, and is the same for everyone in the group. Realistically, we might expect not just gene-environment correlation, but gene-environment interaction: the effect of your genetic score depends on your environment. If so, then β above will be estimating an average across all the different environments in the relevant group.

Also, we might worry that the effect of genetics is non-linear. If so, we’d probably want a model to reflect that, and we’d make estimates using the whole distribution of genetic scores from the “other group”, not just taking the average. For example, suppose g doesn’t affect your height unless it’s above some threshold value. And suppose the average in both groups is below the threshold, although one group has more people above the threshold. Then using group averages, genes would seem not to matter, but they would matter if we used the whole distribution.
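The threshold example can be sketched as follows. The outcome function and the scores are invented, and for simplicity everyone shares the same outcome function, so the counterfactual average for group A with group B's scores is just the function averaged over B's scores.

```python
from statistics import mean

THRESHOLD = 1.0

def height(g):
    # Hypothetical threshold effect: g adds 5 cm only above the threshold.
    return 175.0 if g > THRESHOLD else 170.0

scores_a = [-0.5, 0.0, 0.5, 2.0]   # mean 0.5, one person above the threshold
scores_b = [-1.0, 0.0, 1.5, 2.5]   # mean 0.75, two people above the threshold

# Using only the group averages: both sit below the threshold.
effect_from_averages = height(mean(scores_b)) - height(mean(scores_a))

# Using the whole distribution of scores.
effect_from_distribution = (mean(height(g) for g in scores_b)
                            - mean(height(g) for g in scores_a))

print(effect_from_averages, effect_from_distribution)  # 0.0 1.25
```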

And if genes and environments are also correlated, as well as interacting, then it gets worse! Then, if you randomly allocate every person in group A a genetic score from group B, that breaks the gene-environment correlation; the effect of this is interesting, but it’s not really about genetic differences between the groups. Instead, you might want to match pairs of people in different groups but similar environments, and swap their genetic scores so as to estimate the effect of shifting the distribution… oosh. At this point I get confused, because there is more than one possible counterfactual question. In effect, within each possible environment, there are subgroups of A and B, and each subgroup may have its own value of Pᴬ and Pᴮ!
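One of the several possible counterfactuals is to swap scores only within the same environment, which gives a separate Pᴬ for each environment. The sketch below does that for two made-up environments. The model, the baselines, the slopes and the scores are all hypothetical, and this is only one way to pose the question.

```python
from statistics import mean

# Hypothetical model: height = group/environment baseline + slope(env) * g.
SLOPE = {"rural": 1.0, "urban": 3.0}   # gene-environment interaction
BASE = {("A", "rural"): 164.0, ("B", "rural"): 166.0,
        ("A", "urban"): 168.0, ("B", "urban"): 170.0}
SCORES = {("A", "rural"): [0.0, 1.0], ("B", "rural"): [2.0, 3.0],
          ("A", "urban"): [0.0, 2.0], ("B", "urban"): [1.0, 3.0]}

def avg_height(group, env, scores):
    return mean(BASE[(group, env)] + SLOPE[env] * g for g in scores)

def p_a_within(env):
    # Swap scores only between people living in the same environment.
    y_a = avg_height("A", env, SCORES[("A", env)])
    y_b = avg_height("B", env, SCORES[("B", env)])
    y_a_with_b = avg_height("A", env, SCORES[("B", env)])
    return (y_a_with_b - y_a) / (y_b - y_a)

p_by_env = {env: p_a_within(env) for env in SLOPE}
print(p_by_env)  # {'rural': 0.5, 'urban': 0.6}
```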

The last tricky point is that I have been talking as if the genetic score g captured all of a person’s relevant genetics. Maybe one day we will have that, but today, we only have partial scores which don’t capture all the genetic information relevant to any given variable. So, it’s possible that differences on the measured genetic score go one way, while unmeasured genetic differences go the other way. This is hard to rule out completely, but you can look at the genetics we do know about and see how reliably they point in the same direction. For example, suppose there are 1000 different polymorphisms that affect height, and suppose that on average group A is predicted taller using a score created from these polymorphisms. Does group A have more of the “tall allele” on all 1000 of the polymorphisms? Or most of them? Or only on a few of the important ones? That might tell us what to expect from genetic variation that we have not yet measured.
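That check can be sketched in a few lines of Python. The five loci, their effect sizes and their allele frequencies are invented; a real analysis would have many more loci and would need to account for uncertainty in every number.

```python
# Each locus: (effect of one "tall allele" in cm, frequency in A, frequency in B).
# All values are hypothetical.
loci = [
    (0.50, 0.60, 0.40),
    (0.30, 0.55, 0.50),
    (0.10, 0.30, 0.45),
    (0.05, 0.20, 0.35),
    (0.05, 0.70, 0.65),
]

# Predicted gap in average score, A minus B (two allele copies per person).
score_gap = sum(2 * effect * (freq_a - freq_b) for effect, freq_a, freq_b in loci)

# Share of loci at which group A carries more of the tall allele.
share_favouring_a = sum(freq_a > freq_b for _, freq_a, freq_b in loci) / len(loci)

print(round(score_gap, 3), share_favouring_a)
```

In this invented example group A is predicted taller overall, but only three of the five loci point that way, and most of the gap comes from the single largest-effect locus.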

These are challenging complexities (and perhaps there are others I’ve missed). But they are not really more daunting than those facing many other scientific questions. Overall, they suggest that questions of genetic differences between groups are testable, but complex. I think that strengthens my claim that we should be trying to test them. They are likely to have complex answers, which means they will not map neatly on to people’s prejudices. But answering them is not a fool’s game. We can learn something from doing so, and since genes and environments interplay in complex ways, that will also help us learn about the environment — about society.

If you like this newsletter, check out my book Wyclif’s Dust: Western Cultures from the Printing Press to the Present. Get your hands on a copy from Amazon, or read more about it.
