Monday 27 May 2013

Finding Linguistic Tells

Leah Libresco at Unequally Yoked runs annual Ideological Turing Tests in which contestants write articles espousing opinions they do not hold and readers try to distinguish between real Christians/atheists and pretend Christians/atheists. (I participated last year.) Based on the comments, it seems that a common way readers judge entries is to look for tells in word choice. Does the author use academic jargon that no Christian would be likely to know, or theological jargon that no atheist would be likely to know?

It might seem like a silly method, but there are some good theoretical reasons to think it could work. If Mikhail Bakhtin is correct, different communities of people will use language differently. Words will mean slightly (or entirely) different things depending on who the speaker is, and this suggests that they would convey the same or similar ideas using a different vocabulary. So it would seem on the surface as though word choice would be a good tell.

That being said, "atheist" and "Christian" are not homogeneous communities. I presume that Maronite Catholics would have different ways of talking about religion than Mennonites would; I presume that humanities-educated atheists who are fans of Derrida or Marx would have different ways of talking about atheism than atheists without a post-secondary education who are fans of Less Wrong or The God Delusion. Furthermore, there will certainly be Christians with a post-secondary humanities education and atheists raised in a Mennonite household.

Indeed, most people on the Internet are members of multiple linguistic communities and their language will probably reflect that multiplicity. For this reason I would hypothesize that any given person will not be familiar with enough of the different language markers you would need to know to distinguish between atheist and Christian contestants on the basis of word choice. At best all you could do is tell whether a person is familiar with your own linguistic communities. However, just because any one person would be unable to use linguistic tells with any accuracy does not mean that it's a useless method. What we would need to do is to conduct a large-scale quantitative analysis of many samples (preferably, hundreds) of writing from a diversity of authors.

Ideally, we would attach a lot of meta-data to each sample: languages spoken, all countries and regions of residence (ever), education (including specializations), reading habits, political affiliations, workplaces. We would be looking to label the samples with every linguistic community the author is a member of. I don't know how to conduct such a study--in particular, I don't have access to the necessary programs--but this is the kind of thing people in the digital humanities do: track word use, syntax, etc. across numerous samples to get a large quantitative picture of a corpus.
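To make the idea of labelling samples a little more concrete, here is a minimal sketch in Python of what tracking word use per linguistic community might look like. Everything in it is an assumption for illustration: the three toy samples, the community labels, and the function names are invented, and a real study would need far more data and a proper tokenizer.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy samples, invented for illustration only.
# Each sample is (text, set of linguistic communities the author belongs to).
SAMPLES = [
    ("Grace is a gift and the liturgy shapes us", {"christian", "catholic"}),
    ("The evidence does not update my priors", {"atheist", "less_wrong"}),
    ("Grace and evidence both matter to me", {"christian", "humanities"}),
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def counts_by_community(samples):
    """Build one word-frequency table per community label.

    A sample counts toward every community its author is labelled with,
    since most authors belong to several communities at once.
    """
    counts = defaultdict(Counter)
    for text, labels in samples:
        tokens = tokenize(text)
        for label in labels:
            counts[label].update(tokens)
    return counts

counts = counts_by_community(SAMPLES)
print(counts["christian"].most_common(3))
```

Counting a sample under every one of its labels is a deliberate choice here: it keeps the overlap between communities visible instead of forcing each author into a single box.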

After enough work has been done and particular words start to show up frequently (especially if certain words show up in authors with different ideological meta-data) we would also hand-code the samples according to how these words are used, something a computer program cannot do; we would create a taxonomy of use and label the samples according to that taxonomy, and run this new meta-data through the computer again to see if anything shows up. For instance, maybe two unrelated groups of people use the word "modern," but they use it in different ways.
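Once the hand-coding is done, feeding it back into the computer could be as simple as a cross-tabulation. The sketch below assumes a hypothetical set of hand-coded records for the word "modern"; the community names and the two sense labels ("historical_period" and "up_to_date") are made up for the example and are not a proposed taxonomy.

```python
from collections import Counter

# Hypothetical hand-coded records, invented for illustration:
# (community of the author, word, sense label assigned by a human coder).
CODED = [
    ("humanities", "modern", "historical_period"),
    ("humanities", "modern", "historical_period"),
    ("less_wrong", "modern", "up_to_date"),
    ("less_wrong", "modern", "up_to_date"),
    ("less_wrong", "modern", "historical_period"),
]

def sense_table(records, word):
    """Count how often each community uses `word` in each hand-coded sense."""
    table = Counter()
    for community, coded_word, sense in records:
        if coded_word == word:
            table[(community, sense)] += 1
    return table

table = sense_table(CODED, "modern")
for (community, sense), n in sorted(table.items()):
    print(community, sense, n)
```

If two communities use the same word but the counts pile up under different senses, the word itself is a poor tell while the way it is used may be a good one.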

We would be looking for two things in the resulting data. The first would be to see whether there are any reliable linguistic tells at all; is the correlation between word choice (or word use) and community membership strong enough to be a reliable indicator, or is it too noisy? The second would be to see what those indicators (if there are any) are so that we could put together a checklist or procedure that we could use on samples. The procedure could be computerized, but since we'd be using hand-coded data we would probably be better off using a mix of computerized and hand-coded analysis.
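One possible way to ask whether a word is a tell or just noise is a smoothed log-odds ratio between two groups of authors. This is my own choice of measure for the sketch, not something the study design requires; the word counts, the smoothing value of 0.5, and the threshold of 1.0 are all assumptions picked for illustration.

```python
import math
from collections import Counter

def log_odds(word, counts_a, counts_b, smoothing=0.5):
    """Smoothed log-odds ratio of `word` in group A versus group B.

    Positive values mean the word leans toward A, negative toward B,
    and values near zero mean it does not distinguish the groups.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    odds_a = (counts_a[word] + smoothing) / (total_a - counts_a[word] + smoothing)
    odds_b = (counts_b[word] + smoothing) / (total_b - counts_b[word] + smoothing)
    return math.log(odds_a / odds_b)

def candidate_tells(counts_a, counts_b, threshold=1.0):
    """Return the words whose absolute log-odds meets the (assumed) threshold."""
    vocab = set(counts_a) | set(counts_b)
    scored = {w: log_odds(w, counts_a, counts_b) for w in vocab}
    return {w: s for w, s in scored.items() if abs(s) >= threshold}

# Hypothetical word counts, invented for illustration only.
christian = Counter({"grace": 8, "modern": 3, "the": 20})
atheist = Counter({"grace": 1, "modern": 3, "the": 20, "priors": 7})

tells = candidate_tells(christian, atheist)
print(sorted(tells))
```

With these toy counts, common words that both groups use equally often drop out and only the lopsided ones remain as candidates. Whether such candidates hold up on hundreds of real samples is exactly what the study would have to show.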

The scope of the project is too great for me to take on, even if I did have access to the necessary software (and I might just be able to get my hands on it if I had to). But I'd love to see the results if someone else conducted such a study.

(Of course, if you were going to use this for Leah's Turing Tests, you would be better off conducting the entire study with Turing Test samples, not just samples of people writing their own opinions.)
