Do genes and languages correlate?

I was recently asked by a linguist why we geneticists assume that the genes and languages of populations would directly correlate. The short answer is that we definitely don’t.

Apparently, the impression partly stemmed from some presentations in which geneticists had described their [Admixture] results in terms of a number of genetic components, one of which was often called ‘Uralic’.

I ended up answering the question at more length and in more detail in an email. In case the answer is useful or of interest to someone else as well, here it is:

We know – from the linguists – that the Uralic languages have spread to their current geographic distribution (and beyond) from an earlier, smaller area. When and from where and under what circumstances are open questions at least to some degree, considering that many people seem to have strong opinions about them which are nowhere close to a consensus.

It is also clear that genes and language in general can go hand in hand, or they can be [almost] totally decoupled, or something in between. As far as I know, there are even historically attested examples of the arrival of an elite language leading to a language shift with little gene flow (and possibly variable amounts of substrate traces, but let’s not go into that here). On the other hand, in the absence of an elite status, small numbers of immigrants might not affect the language of a population (beyond a possible loanword layer), whereas simple simulations show that small – even unintuitively small – amounts of continuous gene flow from a neighboring population can alter a population’s genetic make-up radically.
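To illustrate the last point, here is a minimal sketch of such a simple simulation. It is not taken from any particular study: it assumes a deterministic one-way migration model in which a fixed fraction m of the gene pool is replaced by migrants from the neighboring population in every generation, and it ignores drift, selection and back-migration. The function names and the example value m = 0.01 are hypothetical choices made for illustration only.

```python
def resident_ancestry(m, generations):
    """Fraction of the gene pool still tracing back to the original residents.

    Assumes (for illustration) that a constant fraction m of the gene pool
    is replaced by neighbor-derived ancestry in each generation.
    """
    if not 0.0 < m <= 1.0:
        raise ValueError("m must be in the interval (0, 1]")
    fraction = 1.0
    for _ in range(generations):
        fraction *= 1.0 - m  # each generation keeps a share (1 - m)
    return fraction


def generations_until(m, threshold):
    """Number of generations until resident ancestry drops to threshold or below."""
    if not 0.0 < m <= 1.0:
        raise ValueError("m must be in the interval (0, 1]")
    fraction = 1.0
    generations = 0
    while fraction > threshold:
        fraction *= 1.0 - m
        generations += 1
    return generations


if __name__ == "__main__":
    m = 0.01  # hypothetical: 1% of the gene pool replaced per generation
    for t in (10, 50, 100, 200):
        print(f"after {t:3d} generations: "
              f"{resident_ancestry(m, t):.3f} resident ancestry")
    print("generations until half is replaced:", generations_until(m, 0.5))
```

Under these assumptions, a rate as low as 1% per generation leaves only a little over a third of the original ancestry after 100 generations, which is a short time on the scale of language-family spreads. A more realistic model would add finite population size and two-way gene flow, but the qualitative point remains the same.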

With all this in mind, it is interesting to ask: What were the circumstances in the spreading process of the Uralic languages? To what degree was it a movement of people, and how much contact was there with the populations already inhabiting the area of expansion (and with subsequent geographic neighbors)? In terms of studying the genetics of extant populations, this would translate to: Do most of the Uralic-speaking populations share a genetic component that is absent from most other neighboring/worldwide populations? If such a component exists, how common is it in each population, and which populations are an exception to the pattern (Uralics without or non-Uralics with the component, indicating possible language shifts)? What other genetic components do the Uralic populations have? And especially with the rapidly increasing spatiotemporal density of the ancient-DNA data, it might even be possible to spot such a shared component in high proportion in an ancient population, giving hints of when and where the Uralic language may have originated and what archaeological phenomena (NB. in plural) its spread may have been related to, if any. (Of course, bits and pieces of the story are already known from earlier genetic studies, and for the most recent events also from historical evidence, such as some of the slavicisation (russification) of formerly Uralic-speaking populations as well as some Siberian populations shifting to a Samoyed language (I believe), but the interest lies in both systematizing the big picture and teasing out information on the earlier events.)

There’s no a priori reason to assume that the dynamics of the Uralic language spread would have been the same across the whole area and during the last several millennia; different answers at different corners of the spread and for different pre-proto-etc. stages of the language could very well be possible. Specifically, there’s no reason to assume that the language and genes would have correlated through the whole process, i.e., that the spread of language would always have resulted in a total genetic replacement of the previous inhabitants and that no further contact (= gene flow) would have taken place across language-family borders. In fact, if we believed that to be the case, it would hardly be worthwhile to conduct the genetic studies in the first place, as we would already know the answer: the Uralic-speaking populations would only differ from each other by the amount of genetic drift since their split (which might of course be of interest as a proxy for the split times), and would not resemble their geographic neighbors more than any other non-Uralic populations (with the exception of the populations they descend from – oh, wouldn’t that be a beautiful way of studying the relatedness of language families beyond the resolution provided by linguistic data!).

That such a pattern is not what is seen in the genetic data should be crystal clear from basically any genome-wide population structure study involving Uralic populations and their non-Uralic neighbors, I believe. Therefore, it is difficult for me to understand how geneticists could appear to favor an idea of direct correlation of genes and languages, as it would immediately lead to scenarios that are both absurd and obviously not true. Yes, we might talk about “Uralic genes” or a “Uralic component” of a population’s gene pool, and I naturally cannot vouch for all my fellow geneticists, but I myself would use such expressions as a mere shorthand for something like “a genetic component shared more often by Uralic than non-Uralic populations, i.e., present in most Uralic and absent from most non-Uralic populations, and therefore possibly related to their shared history and process of spread (and thus interesting)”. It most definitely would not mean that every person in every Uralic population would [be expected to] show 100% of such a component, nor that the component was always and only ever distributed along with the Uralic language.

I’m sorry for not seeing why it would be problematic to search for such components, or to call them “Uralic”, beyond the general problem of sloppily defined shorthand terminology, of course. Obviously, other labels could work for the same purpose equally well – or better, if more properly defined.

Elina Salmela

University of Helsinki
