Jul 21, 2015

a 'foreign spelling' test for GloWBE corpus

In blogging, I rely a lot on the Global Web-Based English corpus, GloWBE, which has millions of words of internet data categori{z/s}ed by the country of the website. It's divided into excerpts from 'blogs' (which includes comments on blogs) and 'general', which includes all sorts of things, even some blogs.

It's an invaluable tool for judging whether a word or phrasing is used in a particular place. But national borders are very weak on the internet, and commenters comment on all kinds of things from all kinds of places. And there are even people like me who are blogging in the 'wrong' country for their dialect (and I have run into some of my own writing in the corpus!). So, how can we know how much of the data that's in the 'US' category is actually by Americans and so forth?

This is a problem that has struck me as I've tried to use GloWBE data to research politeness markers. (I'm giving a paper on please in GloWBE next week.) So, I came up with a little test for foreignness in UK and US--though it won't work for other countries, as you'll see.

The test is based on the -or/-our spelling distinction, and the question is: how much of the data for each of these two countries has the 'wrong' spelling for the country? This works because for the (AmE) thirty-some words that have this spelling variation (except glamo(u)r, which is a funny one that I'll blog about at some point):
  • the 'correctness' of each spelling is long established in each country -- unlike -ise/-ize which is a more recent deviation
  • there's little variation within the country -- unlike, say, f(o)etus, where people of different professions might spell it differently, or theat{er/re}, where proper names of US establishments often use the UK spelling
  • they are generally high-frequency words and therefore 'easy' to spell -- unlike paraly{s/z}e or mollus{c/k}, etc.
Here is the result, showing how many of each spelling were found in the 'US' and 'GB' (as it is named on GloWBE) data in the corpus:


                  GB -our    GB -or    US -or   US -our
hono(u)r            10376      2985     17145      2437
humo(u)r             8903      1632     10461      1560
neighbo(u)r          5364      1029      8128      1037
odo(u)r               673       276      1224       119
tumo(u)r             2546       658      2269       244
vigo(u)r              910       201       745       197
totals              28772      6781     39972      5594
'foreign' rate                  19%                 12%


In other words, 19% of the -o(u)r words I searched for in the GB corpus had the AmE -or spelling and 12% of the US -o(u)r words had -our spellings. A few of these may just be by people who can't spell or who are putting on airs by using another spelling system, but they're probably a very few. The percentages for the individual words range from 8.9% (odo(u)r) to 20.9% (vigo(u)r) for the US data and 15.5% (humo(u)r) to 29.1% (odo(u)r) for the GB data. It's important to use a number of words because the data will be skewed if it's just a word that Americans use more than British people (or vice versa), regardless of spelling. We can see that happening a bit with odo(u)r.
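
For anyone who wants to check the arithmetic, here is a minimal Python sketch of how the 'foreign' rates can be recomputed from the counts in the table. The function and variable names are my own illustration, not anything from GloWBE; the only assumption is that each rate is the 'wrong' spelling divided by the sum of both spellings for that country.

```python
# Counts copied from the table: (GB -our, GB -or, US -or, US -our).
COUNTS = {
    "hono(u)r":    (10376, 2985, 17145, 2437),
    "humo(u)r":    (8903, 1632, 10461, 1560),
    "neighbo(u)r": (5364, 1029, 8128, 1037),
    "odo(u)r":     (673, 276, 1224, 119),
    "tumo(u)r":    (2546, 658, 2269, 244),
    "vigo(u)r":    (910, 201, 745, 197),
}


def foreign_rate(foreign: int, native: int) -> float:
    """Share of tokens that have the 'wrong' spelling for the country."""
    return foreign / (foreign + native)


def rates(counts: dict) -> dict:
    """Return {word: (GB rate, US rate)}, plus a 'totals' entry."""
    result = {}
    gb_our = gb_or = us_or = us_our = 0
    for word, (g_our, g_or, u_or, u_our) in counts.items():
        # In GB data -or is the 'foreign' form; in US data -our is.
        result[word] = (foreign_rate(g_or, g_our), foreign_rate(u_our, u_or))
        gb_our += g_our
        gb_or += g_or
        us_or += u_or
        us_our += u_our
    result["totals"] = (foreign_rate(gb_or, gb_our), foreign_rate(us_our, us_or))
    return result


if __name__ == "__main__":
    for word, (gb, us) in rates(COUNTS).items():
        print(f"{word:12} GB {gb:6.1%}   US {us:6.1%}")
```

Working from the totals rather than averaging the six per-word percentages means the high-frequency words (hono(u)r, humo(u)r) count for more than the rare ones, which is what the 19% and 12% figures reflect.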

(I'd originally included colo(u)r in the list, which made the total difference more stark: 24% to 14%. I took it out because I suspected that the use of color in HTML coding might be skewing the result.)

So this can lead us to the hypotheses:
About 12% of US blog data is not written by Americans.
About 19% of UK blog data is written by people using American English spelling.
We cannot say that about 12% of the US data were written by British people, or even that 81% of the British data were written by British people, since -our spellings are used in the rest of the anglophone world. Half of the vigours written on British sites might have been written by Australians or Canadians, for all we know. The -or is more a marker of Americanness than -our is a marker of Britishness. But we also can't say that the people spelling -or are Americans; we can only assume they are people whose education in English spelling used American English standards. So, half of the -or spellings might have been written by people in the Philippines (where American spelling is used). It's unlikely that it's that many, but I've phrased the hypotheses to allow for these possibilities.

Why is the number of American spellings on British sites larger than the number of British spellings on American sites? Well, it might just be because there are nearly five times as many Americans on the internet than there are Brits. (The US has 86.9% internet penetration on a population of around 318 million, so that's over 276 million internet users. The UK has 89.8% penetration in a population of about 63 million, so that's nearly 57 million internet users. Source.) Of course, there are a lot more countries involved here (and I'm not going to go do all that adding-up at the moment), but that's a reasonable step toward(s) explaining the difference, and one that doesn't involve running around screaming 'the sky is falling on British English'!
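
The back-of-the-envelope sum in that paragraph can be sketched in a few lines of Python. The penetration rates and populations are the ones quoted above; the names are purely illustrative, and the populations are rounded, so treat the output as approximate.

```python
# Figures quoted in the paragraph: (internet penetration, approx. population).
US = (0.869, 318_000_000)
UK = (0.898, 63_000_000)


def internet_users(penetration: float, population: int) -> float:
    """Estimated number of internet users in a country."""
    return penetration * population


us_users = internet_users(*US)
uk_users = internet_users(*UK)
ratio = us_users / uk_users

if __name__ == "__main__":
    print(f"US users: {us_users / 1e6:.1f} million")
    print(f"UK users: {uk_users / 1e6:.1f} million")
    print(f"US/UK ratio: {ratio:.2f}")
```

The ratio comes out a little under five, which is where 'nearly five times' comes from.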

So, what this means is that if we look for differences between American and British Englishes on GloWBE and see that a form that's used in Britain is used 10% as much in America, we can't conclude that the Britishism is gaining traction in the US, because there's a fair likelihood that the people who used the Britishism weren't American. If we see one with 40% use in the US, however, we can aver that it's well on its way to being established there.

Anyhow, I'm glad I decided to explain that all in a blog post, as it makes it clearer in my mind for explaining it in about 10 seconds as I fly by that slide in my presentation next week. If you see any flaws in my thinking or math(s), please let me know!

In other news (aka shameless self-promotion):
The Odditorium people have made a podcast of my Catalyst Club talk about little words (especially the). It's a bit odd without the visuals, but they do call themselves 'oddpodcast', after all.



I'm speaking at two conferences in the next two weeks, plus have a few public speaking engagements in the future. Follow the links for more info. If you're nearby, come say (orig. AmE in this form) hello!

23 July (with Rachele de Felice): The politics of please in British and American English. Corpus Linguistics conference, University of Lancaster.

31 July Separated by a Common Politeness Marker: please in British and American English. International Pragmatics Association Conference, Antwerp.  
17 Sept How America Saved the English Language.  The Bedford Culture Club (Horsham, W Sussex). 

Further ahead, titles yet to be confirmed:
27 Sept  Sunday Assembly Brighton.  
27 Nov (Thanksgiving dinner--a day late)  English-Speaking Union, Chichester.

27 comments:

  1. Why is the number of American spellings on British sites larger than the number of British spellings on American sites?

    I don't know enough about the scope or the methodology of the corpus to do more than speculate, but I wouldn't be surprised if a large part of it was the prevalence of US media companies (eg Gawker group) having .co.uk spinoff sites. Some of the content will be written by Brits in British English, but most of it will be American and the same as would run on the main .com site. These companies don't seem to enforce a country-specific style guide very strongly.

  2. I thought I'd look at the odd word to see the percentages you'd omitted. I never find it easy calculating percentages so I ended up doing an Excel spreadsheet. There seems to be something wrong.

    1. Your percentages of the totals don't seem to be right.

    2. The percentage of BrE spelling odor is a remarkable 41%. Really?

    Could you have deleted something from your table without recalculating?

    Or is it me that has mistyped something in my spreadsheet?

  3. David: The percentages are of the total or+our for the country. So, divide column GB -or by the sum of GB -our + GB -or. (I didn't want to clutter the table too much with intermediate steps.) I'm guessing that you were taking column B as a percentage of column A?

  4. Well, it might just be because there are nearly five times as many Americans on the internet as there are Brits.

    Lynne: forgive me for being hyperfastidious -- I just felt this sentence of yours needed some first aid! ;)

  5. A propos of nothing, I recently discovered I'd been using the British spelling of mustache (i.e., moustache) without noticing it.

    My suspicion is that a nearby NYC restaurant, Moustache -- where I've actually eaten -- is to blame ... if that's the right word.

  6. Yes, thanks Lynne, that explains it. Even so, isn't it odd that BrE odor is as high as 29% frequency?

    Where I started from was a hunch that there'd be a difference between nouns with and without a cognate adjective spelled with -or-. I'd hoped neighbour would stand out. It doesn't.

  7. Dick: thanks, that was one of those sentences that I rearranged at least once. Iatrogenic errors.

    David: Since 'odor' is used more in AmE than in BrE, I think what's happening is that the AmE commenters on UK blogs are contributing 'odor' (as it were!) to the comment threads while maybe the BrE writers are writing 'smell' or 'aroma' instead (or being too polite to mention them!), therefore not adding to the 'odour' count. I had a glance at the data to make sure it wasn't being skewed by an acronym for the Office of Direct Orbital Returns or any such thing, but it just looked like a lot of Americans writing about 'odors' on the Daily Mail site and other places where American web browsers end up.

    I also had a look at 'vigo(u)r' and found that Vigor is a surname. But since it was just a few examples and the %ages were similar for the two countries, I decided to leave it. Any big corpus thing is going to have a bit of noise, but you count on the large numbers to show the bigger trends.

  8. And Ginger Yellow: I'm not sure about all the details of the corpus construction, but I've coded 500-item samples from each of the corpora and the website names from them look pretty country-specific. The UK one (which is open in the next window here) has UK newspapers, UK town councils, UK football club sites, etc.

    I'm not certain how they ascertain that the US ones are 'US', but there I've got Metafilter, Daily Kos, scienceblogs.com, Harper's, and a lot of little company and personal blogs and political and health sites (just looking at blog data at the moment, there are news sites in the larger corpus).

    Since anyone can write on some of these -- e.g. Metafilter -- I'd almost expect the data to go the other way so that there are more British/non-US comments on US sites, but I take that as part of the reason that having 5x more Americans on the internet doesn't mean 5x more American spellings on British websites than British ones on American websites.

  9. The other thing with Odo(u)r is it's relatively small compared to the others, so there's a lot of potential for noise in the sample.

  10. It's not that small, though--that's the beauty of having a 350+-million-word corpus. Looking at the main spellings of each, there are twice as many 'odor' in American data as 'odour' in British. It really looks like a word the British like a lot less than Americans.

  11. Can I ask whether there's any possible way of excluding citations from this, or does the possibility of quotations (with or without a fastidious [sic]) account for at least some of this cross-pollination?

  12. They try to weed out repetitions--like a blog comment being quoted in another one. It's not perfect, but it's better than it could be.

    It's not possible to search it and say 'nothing in quotation marks' (for various reasons), but that's part of the purpose of doing this exercise. Quotations of Americans in British posts should count in the 'not British author' percentage (and vice versa).

    Having dealt with this data a lot, I'd hazard that cross-pollination through quotation accounts for a minuscule proportion of the other-country spellings. Within the sample that I'm coding (about something else altogether), I weed out everything in quotation marks (because for my purposes it's 'inauthentic' data), and it's almost all (fan-)fictional dialogue authored by the blogger or he-said/she-said accounts of personal conversations.

    In the non-blog data, news sources usually nativize spellings to their own system when they're presenting quotations. I avoided 'labo(u)r' because its presence in proper names (political parties, unions) might skew things. There may be some proper names in the 'hono(u)r' data, but not the kind that make the international news much.

  13. I'm not sure I agree about "not that small" Lynne. I know you haven't actually posted directly on that word except in this post, but if someone else has with your kind of readership, I'd expect the word to crop up 1-200 times in the post and discussion, with a mix of UK and US spellings but show as part of the UK corpus if it's your blog. Similarly if it's a US blog, that's 50-100 uses of the UK spelling in the US straight away.

    One or two posts in the US could instantly generate all the UK spellings. It takes a bit more effort in the UK but 700 uses isn't that many - a few posts talking about it plus a few uses in the wild.

    I would agree your data suggest we avoid odour in preference for other smell/scent words though, compared to the Americans. The Americans seem to avoid tumo(u)r and vigo(u)r comparatively too.

  14. Another issue to bear in mind is the presence of US spellings in film/tv titles which are not changed to UK spelling when crossing the Atlantic (e.g. Prizzi's Honor). According to IMDB there are 618 titles with 'honor' in them and just 4 with 'honour' (including 'honourable'). Of course, a lot of those titles will be unreleased, unknown or too poor to get a UK media mention, but a handful of successful films (and for that matter music albums/singles etc) will get hefty coverage with the original US spelling.

  15. Eloise: what's not that small, though, is the corpus. It's more than 350 million words each for BrE & AmE. It's not full blog posts, but samples from them and their comments, so goodness knows how many writers. So, if it's reaching that wide and showing a difference in frequency between the two dialects in using that word, that's going to make me pretty confident that Americans use the word more than Brits do. Searching them on Google ngrams supports that reading--with more 'odors' (and 'odours') in American books than odo(u)rs in British books--whereas the numbers for hono(u)r look about the same in the two.

  16. Oh yes, I'm not arguing your conclusion about the different rates of usage at all. I did say that. It just seems like it's a relatively rarely used word - it's used about 1/4 as often as tumour in the UK corpus, and I'd have thought tumour was an uncommon word because it's fairly technical and you can talk about tumours in other ways.

    So places it crops up - one can't tell I'm sure, however, I'm suspicious they're atypical simply because it's so relatively rarely used in the UK. If I search for the word, once I get past definitions and so on, I get a hit for Chrysanthemums, hits for BO and then a lot of hits for farms and the bible. Maybe the Americans use it more because there's a lot more Americans talking about its use in Corinthians plus there are more farmers? It's probably enough to drive the traffic up appreciably when the numbers are small.

  17. Ok, thanks! I think words for smells can get negative connotations rather easily--it may be down to that.

  18. Although I'm British, I do tend to spell it "vigor", because of "vigorous", although quite why, when there is "humorous" and I don't spell it "humor", I don't know.... But then, my spelling does go a bit awry at times - I can never remember whether to use "-ise" or "-ize" endings, and as for "-ent" vs "-ant", let's not go there....

    Good luck with your talks - I'm sure you'll be brilliant.

  19. I think your "-or" vs "-our" is a good first discriminator, but you could conceivably refine your results depending on how the word data is organized. For example, if you can select only the American blog posts that use "-our" spellings, then run those specific posts against a selection of words with BrE spellings, you would likely find some that use more BrE spellings in other common words. So now you've found the posts with multiple instances of "foreign" spellings, plus probably some instances of mixed AmE and BrE spelling, and you could draw stronger conclusions about the nationality of at least some of the writers.

  20. Also take into account that there are a large number of Americans who are religious and use the King James Version of the Bible...which uses the British spellings of words. Some of those people often use quotes on the internet from that version of the Bible.

  21. Interesting point, thanks. I've just searched

    'Hono(u)r thy father and thy mother'

    US had 5 of each spelling.
    UK had 4 honour / 1 honor.

    The problem is knowing what proportion of the searchable web is biblical quotation...

  22. Could some of the appearances of US spellings in British writing be caused by auto-correction? Most software seems to default to USA spellings.

  23. As a US born, Am/E speaking blogger, I regularly use humour and behaviour and colour out of sheer affectation. Grew up seeing it "across the river" in Windsor, ON - visiting relatives, in stores and newspapers. Did not use it in school. But given the freedom to do as I liked on my own blogs... well, my husband calls me an "Ethnic Canadian."

    Muddying the waters for linguists all over. I'm sure I'm not the only one.


  24. I think Jeremy has a valid point. The spellings may be due to the dictionary used by the spellchecker. It is easier to just accept the corrections suggested by the spellchecker than have a lot of red squiggles all over the page because the default language is different. I myself prefer to use US English for all my documents unless I am writing for a school/college text book when I set the language options to British.

  25. Very nice blog. Greetings from Indonesia.

  26. Wow! I'd been wondering what corpora to use to collect some reliable US UK speech data for a project and I knew you would have some great insights. Fascinating! Of course it totally muddies the waters for my question but great post!

  27. Best of luck with your project, Vicki. Would be good to hear about it sometime!
