a 'foreign spelling' test for GloWBE corpus

In blogging, I rely a lot on the Global Web-Based English corpus, GloWBE, which has millions of words of internet data categori{z/s}ed by the country of the website. It's divided into excerpts from 'blogs' (which includes comments on blogs) and 'general', which includes all sorts of things, even some blogs.

It's an invaluable tool for judging whether a word or phrasing is used in a particular place. But national borders are very weak on the internet, and commenters comment on all kinds of things from all kinds of places. And there are even people like me who are blogging in the 'wrong' country for their dialect (and I have run into some of my own writing in the corpus!). So, how can we know how much of the data that's in the 'US' category is actually by Americans and so forth?

This is a problem that has struck me as I've tried to use GloWBE data to research politeness markers. (I'm giving a paper on please in GloWBE next week.) So, I came up with a little test for foreignness in UK and US--though it won't work for other countries, as you'll see.

The test is based on the -or/-our spelling distinction, and the question is: how much of the data for each of these two countries has the 'wrong' spelling for the country? This works because for the (AmE) thirty-some words that have this spelling variation (except glamo(u)r, which is a funny one that I'll blog about at some point):
  • the 'correctness' of each spelling is long established in each country -- unlike -ise/-ize which is a more recent deviation
  • there's little variation within the country -- unlike, say f(o)etus, where people of different professions might spell it differently, or theat{er/re}, where proper names of US establishments often use the UK spelling
  • they are generally high-frequency words and therefore 'easy' to spell -- unlike paraly{s/z}e or mollus{c/k}, etc.
Here is the result, showing how many of each spelling were found in the 'US' and 'GB' (as it is named on GloWBE) data in the corpus:


                 GB our    GB or    US or   US our
hono(u)r          10376     2985    17145     2437
humo(u)r           8903     1632    10461     1560
neighbo(u)r        5364     1029     8128     1037
odo(u)r             673      276     1224      119
tumo(u)r           2546      658     2269      244
vigo(u)r            910      201      745      197
totals            28772     6781    39972     5594
'foreign' rate               19%               12%


In other words, 19% of the -o(u)r words I searched for in the GB corpus had the AmE -or spelling and 12% of the US -o(u)r words had -our spellings. A few of these may just be by people who can't spell or who are putting on airs by using another spelling system, but they're probably a very few. The percentages for the individual words range from 8.9% (odo(u)r) to 20.9% (vigo(u)r) for the US data and 15.5% (humo(u)r) to 29.1% (odo(u)r) for the GB data. It's important to use a number of words because the data will be skewed if it's just a word that Americans use more than British people (or vice versa), regardless of spelling. We can see that happening a bit with odo(u)r.
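For anyone who wants to check the sums, here is a minimal sketch of the calculation in Python. The counts are copied from the table in this post; the names `counts` and `foreign_rate` are just labels made up for this illustration, not anything from GloWBE itself.

```python
# Counts copied from the table in this post, in its column order:
# (GB -our, GB -or, US -or, US -our)
counts = {
    "hono(u)r":    (10376, 2985, 17145, 2437),
    "humo(u)r":    (8903, 1632, 10461, 1560),
    "neighbo(u)r": (5364, 1029, 8128, 1037),
    "odo(u)r":     (673, 276, 1224, 119),
    "tumo(u)r":    (2546, 658, 2269, 244),
    "vigo(u)r":    (910, 201, 745, 197),
}


def foreign_rate(foreign, home):
    """Share of tokens that use the 'wrong' spelling for the site's country."""
    return foreign / (foreign + home)


# Column totals across all six words
gb_our, gb_or, us_or, us_our = (sum(col) for col in zip(*counts.values()))

gb_rate = foreign_rate(gb_or, gb_our)  # -or spellings on GB sites
us_rate = foreign_rate(us_our, us_or)  # -our spellings on US sites
print(f"GB 'foreign' rate: {gb_rate:.0%}")
print(f"US 'foreign' rate: {us_rate:.0%}")

# Per-word rates, to see how much the individual words vary
for word, (w_gb_our, w_gb_or, w_us_or, w_us_our) in counts.items():
    print(f"{word:12} GB {foreign_rate(w_gb_or, w_gb_our):.1%}  "
          f"US {foreign_rate(w_us_our, w_us_or):.1%}")
```

The rate is simply the 'wrong' spellings divided by all spellings of that word in that country's data, so a word with few tokens (like odo(u)r) can swing the per-word figure a lot more than it swings the total.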

(I'd originally included colo(u)r in the list, which made the total difference more stark: 24% to 14%. I took it out because I suspected that the use of color in HTML coding might be skewing the result.)

So this can lead us to the hypotheses:
  • About 12% of US blog data is not written by Americans.
  • About 19% of UK blog data is written by people using American English spelling.
We cannot say that about 12% of the US data were written by British people, or even that 81% of the British data were written by British people, since -our spellings are used in the rest of the anglophone world. Half of the vigours written on British sites might have been written by Australians or Canadians, for all we know. The -or is more a marker of Americanness than -our is a marker of Britishness. But we also can't say that the people spelling -or are Americans; we can only assume they are people whose education in English spelling used American English standards. So, half of the -or spellings might have been written by people in the Philippines (where American spelling is used). It's unlikely that it's that many, but I've phrased the hypotheses to allow for these possibilities.

Why is the number of American spellings on British sites larger than the number of British spellings on American sites? Well, it might just be because there are nearly five times as many Americans on the internet as there are Brits. (The US has 86.9% internet penetration on a population of around 318 million, so that's over 276 million internet users. The UK has 89.8% penetration in a population of about 63 million, so that's nearly 57 million internet users. Source.) Of course, there are a lot more countries involved here (and I'm not going to go do all that adding-up at the moment), but that's a reasonable step toward(s) explaining the difference, and one that doesn't involve running around screaming 'the sky is falling on British English'!
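The back-of-the-envelope arithmetic there can be redone in a few lines of Python. This is only a sketch using the rounded population and penetration figures quoted in the paragraph above; the variable names are invented for the illustration.

```python
# Rounded figures quoted in the paragraph above: population x internet penetration
us_users = 318_000_000 * 0.869
uk_users = 63_000_000 * 0.898

# How many American internet users there are for every British one
ratio = us_users / uk_users

print(f"US internet users: {us_users / 1e6:.1f} million")
print(f"UK internet users: {uk_users / 1e6:.1f} million")
print(f"US-to-UK ratio: {ratio:.2f}")
```

Because it compares only two countries, this says nothing about writers from the rest of the anglophone world, which is the caveat already made above.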

So, what this means is that if we look for differences between American and British Englishes on GloWBE and see that a form that's used in Britain is used 10% as much in America, we can't conclude that the Britishism is gaining traction in the US, because there's a fair likelihood that the people who used the Britishism weren't American. If we see one with 40% use in the US, however, we can aver that it's well on its way to being established there.

Anyhow, I'm glad I decided to explain that all in a blog post, as it makes it clearer in my mind for explaining it in about 10 seconds as I fly by that slide in my presentation next week. If you see any flaws in my thinking or math(s), please let me know!

In other news (aka shameless self-promotion):
The Odditorium people have made a podcast of my Catalyst Club talk about little words (especially the). It's a bit odd without the visuals, but they do call themselves 'oddpodcast', after all.



I'm speaking at two conferences in the next two weeks, plus have a few public speaking engagements in the future. Follow the links for more info. If you're nearby, come say (orig. AmE in this form) hello!

23 July (with Rachele de Felice): The politics of please in British and American English. Corpus Linguistics conference, University of Lancaster.

31 July Separated by a Common Politeness Marker: please in British and American English. International Pragmatics Association Conference, Antwerp.  
17 Sept How America Saved the English Language.  The Bedford Culture Club (Horsham, W Sussex). 

Further ahead, titles yet to be confirmed:
27 Sept  Sunday Assembly Brighton.  
27 Nov (Thanksgiving dinner--a day late)  English-Speaking Union, Chichester.

known them (to) and help them (to)

Yesterday, The Syntactician was asking me questions about semantic terminology in relation to particular uses of the verb know, as one does. And so, as one does, I looked for know in the indices of various books about verbs that I have, hoping to find a term that would suit her particular purposes. In doing so, I came across something that was completely new to me in F. R. Palmer's A linguistic study of the English verb (1965):


In case you can't read the photo, it says that you can 'help someone do something' or you can 'help someone to do something'.  So far, so familiar to me.

But then it goes on to say that know has the same pattern with  
(1)  Have you ever known them come on time?
and
(2)  Have you ever known them to come on time?

Now, if I have ever seen sentences of type (1) in the wild, I must have assumed them to have typos, because if I want to know someone/thing + verb, I must have the to-infinitive form of the verb. Yes to (2), no to (1). Absolutely, no question.

So, I turned to the (English) Syntactician, who said that yes, (1) is good in her BrE, "but old-fashioned". I then went onto Twitter to proclaim my ignorance/learning/disbelief, and many English people (many of whom are probably not terribly old-fashioned) replied to say "Yes, that's fine. I can say that."  No US people replied to say they could say it, and now that I look in Algeo's British or American English?, I see that he records this as a British form.

Palmer hasn't mentioned the big restriction on this, however. Algeo does, but I learned the restriction the hard way: by tweeting "Can you really know someone do something?" The answer there is 'no'--British English speakers can only use the to-less version in the perfect aspect (the 'have/had verbed' forms). So:
  • General (BrE or AmE) perfect: I have known them to frequent dark alleys.
  • BrE-only perfect:   I have known them frequent dark alleys.
  • General English present:  I know them to frequent dark alleys.
  • Nobody's English present:  *I know them frequent dark alleys.
 
(Overly academic side point. Skip this unless you can name at least two theories of grammar!
I'm wondering how you get a [say, Chomskyan] theory of grammar to account for a complementation structure that is particular to a certain aspect of a certain verb. Maybe all theories are now so lexical that it's possible--though you'd have to treat known and know as different lexical items, I guess. Would be easier to account for in a Construction Grammar, but still seems like a very heavy--or at least fiddly--cognitive load for a language to bear. If you know about such things, let me know in the comments, please!)


I should also say a bit about that help (to). As I said above, both of these are fine in AmE and BrE:

(3)  I helped them escape.
(4)  I helped them to escape.

...but what's interesting for us is that AmE prefers (3) [in 75% of the cases in the Brown corpus] and BrE prefers (4) [73% of the cases in the LOB corpus] (both figures from Algeo, p. 228).



And that, my friends, is how you write a blog post of less than 1000 words. When was the last time you had known me do that?  :)

Abbr.

AmE = American English
BrE = British English
OED = Oxford English Dictionary (online)