In blogging, I rely a lot on the Global Web-Based English corpus, GloWBE, which has millions of words of internet data categori{z/s}ed by the country of the website. It's divided into excerpts from 'blogs' (which includes comments on blogs) and 'general', which includes all sorts of things, even some blogs.
It's an invaluable tool for judging whether a word or phrasing is used in a particular place. But national borders are very weak on the internet, and commenters comment on all kinds of things from all kinds of places. And there are even people like me who are blogging in the 'wrong' country for their dialect (and I have run into some of my own writing in the corpus!). So, how can we know how much of the data that's in the 'US' category is actually by Americans and so forth?
This is a problem that has struck me as I've tried to use GloWBE data to research politeness markers. (I'm giving a paper on please in GloWBE next week.) So, I came up with a little test for foreignness in UK and US--though it won't work for other countries, as you'll see.
The test is based on the -or/-our spelling distinction, and the question is: how much of the data for each of these two countries has the 'wrong' spelling for the country? This works because for the (AmE) thirty-some words that have this spelling variation (except glamo(u)r, which is a funny one that I'll blog about at some point):
- the 'correctness' of each spelling is long established in each country -- unlike -ise/-ize which is a more recent deviation
- there's little variation within the country -- unlike, say f(o)etus, where people of different professions might spell it differently, or theat{er/re}, where proper names of US establishments often use the UK spelling
- they are generally high-frequency words and therefore 'easy' to spell -- unlike paraly{s/z}e or mollus{c/k}, etc.
| word | GB -our | GB -or | US -or | US -our |
|---|---|---|---|---|
| hono(u)r | 10376 | 2985 | 17145 | 2437 |
| humo(u)r | 8903 | 1632 | 10461 | 1560 |
| neighbo(u)r | 5364 | 1029 | 8128 | 1037 |
| odo(u)r | 673 | 276 | 1224 | 119 |
| tumo(u)r | 2546 | 658 | 2269 | 244 |
| vigo(u)r | 910 | 201 | 745 | 197 |
| totals | 28772 | 6781 | 39972 | 5594 |
| 'foreign' rate | | 19% | | 12% |
In other words, 19% of the -o(u)r words I searched for in the GB corpus had the AmE -or spelling and 12% of the US -o(u)r words had -our spellings. A few of these may just be by people who can't spell or who are putting on airs by using another spelling system, but they're probably a very few. The percentages for the individual words range from 8.9% (odo(u)r) to 20.9% (vigo(u)r) for the US data and 15.5% (humo(u)r) to 29.1% (odo(u)r) for the GB data. It's important to use a number of words because the data will be skewed if it's just a word that Americans use more than British people (or vice versa), regardless of spelling. We can see that happening a bit with odo(u)r.
(I'd originally included colo(u)r in the list, which made the total difference more stark: 24% to 14%. I took it out because I suspected that the use of color in HTML coding might be skewing the result.)
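For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the 'foreign' rates from the six rows of counts in the table. The names (`counts`, `foreign_rate` and so on) are mine, purely for illustration. The rate is taken to be the 'wrong' spelling's share of all tokens of that word in that country's data, which is the calculation that reproduces the 19% and 12% figures.

```python
# Counts copied from the table: (GB -our, GB -or, US -or, US -our)
counts = {
    "hono(u)r":    (10376, 2985, 17145, 2437),
    "humo(u)r":    (8903, 1632, 10461, 1560),
    "neighbo(u)r": (5364, 1029, 8128, 1037),
    "odo(u)r":     (673, 276, 1224, 119),
    "tumo(u)r":    (2546, 658, 2269, 244),
    "vigo(u)r":    (910, 201, 745, 197),
}


def foreign_rate(native, foreign):
    """Share of tokens with the 'wrong' spelling for the country."""
    return foreign / (native + foreign)


# Per-word rates: in GB the 'foreign' spelling is -or, in the US it is -our.
gb_rates = {w: foreign_rate(c[0], c[1]) for w, c in counts.items()}
us_rates = {w: foreign_rate(c[2], c[3]) for w, c in counts.items()}

# Overall rates, pooling all six words.
gb_total = foreign_rate(sum(c[0] for c in counts.values()),
                        sum(c[1] for c in counts.values()))
us_total = foreign_rate(sum(c[2] for c in counts.values()),
                        sum(c[3] for c in counts.values()))

for word in counts:
    print(f"{word:12s} GB {gb_rates[word]:5.1%}   US {us_rates[word]:5.1%}")
print(f"{'all six':12s} GB {gb_total:5.1%}   US {us_total:5.1%}")
```

Pooling the counts before dividing (rather than averaging the six per-word percentages) lets the higher-frequency words weigh more heavily, which is how the totals row of the table works.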
So this can lead us to the hypotheses:
- About 12% of US blog data is not written by Americans.
- About 19% of UK blog data is written by people using American English spelling.

We cannot say that about 12% of the US data were written by British people, or even that 81% of the British data were written by British people, since -our spellings are used in the rest of the anglophone world. Half of the vigours written on British sites might have been written by Australians or Canadians, for all we know. The -or is more a marker of Americanness than -our is a marker of Britishness. But we also can't say that the people spelling -or are Americans; we can only assume they are people whose education in English spelling used American English standards. So, half of the -or spellings might have been written by people in the Philippines (where American spelling is used). It's unlikely that it's that many, but I've phrased the hypotheses to allow for these possibilities.
Why is the number of American spellings on British sites larger than the number of British spellings on American sites? Well, it might just be because there are nearly five times as many Americans on the internet as there are Brits. (The US has 86.9% internet penetration on a population of around 318 million, so that's over 276 million internet users. The UK has 89.8% penetration in a population of about 63 million, so that's nearly 57 million internet users. Source.) Of course, there are a lot more countries involved here (and I'm not going to go do all that adding-up at the moment), but that's a reasonable step toward(s) explaining the difference, and one that doesn't involve running around screaming 'the sky is falling on British English'!
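The internet-user sums can be sketched the same way. This uses only the rounded population and penetration figures quoted in the paragraph above; the variable names are mine.

```python
# Rounded figures quoted in the post
us_population, us_penetration = 318e6, 0.869
uk_population, uk_penetration = 63e6, 0.898

us_users = us_population * us_penetration   # a little over 276 million
uk_users = uk_population * uk_penetration   # a little under 57 million
ratio = us_users / uk_users                 # US users per UK user

print(f"US internet users: {us_users / 1e6:.1f} million")
print(f"UK internet users: {uk_users / 1e6:.1f} million")
print(f"US-to-UK ratio:    {ratio:.2f}")
```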
So, what this means is that if we look for differences between American and British Englishes on GloWBE and see that a form that's used in Britain is used 10% as much in America, we can't conclude that the Britishism is gaining traction in the US, because there's a fair likelihood that the people who used the Britishism weren't American. If we see one with 40% use in the US, however, we can aver that it's well on its way to being established there.
Anyhow, I'm glad I decided to explain that all in a blog post, as it makes it clearer in my mind for explaining it in about 10 seconds as I fly by that slide in my presentation next week. If you see any flaws in my thinking or math(s), please let me know!
In other news (aka shameless self-promotion):
The Odditorium people have made a podcast of my Catalyst Club talk about little words (especially the). It's a bit odd without the visuals, but they do call themselves 'oddpodcast', after all.
I'm speaking at two conferences in the next two weeks, plus have a few public speaking engagements in the future. Follow the links for more info. If you're nearby, come say (orig. AmE in this form) hello!
23 July (with Rachele de Felice): The politics of please in British and American English. Corpus Linguistics conference, University of Lancaster.
31 July Separated by a Common Politeness Marker: please in British and American English. International Pragmatics Association Conference, Antwerp.
17 Sept How America Saved the English Language. The Bedford Culture Club (Horsham, W Sussex).
Further ahead, titles yet to be confirmed:
27 Sept Sunday Assembly Brighton.
27 Nov (Thanksgiving dinner--a day late) English-Speaking Union, Chichester.