Tag Archives: corpus linguistics

Practical Corpus Linguistics by Martin Weisser

Practical Corpus Linguistics: An Introduction to Corpus-Based Language Analysis by Martin Weisser

Practical Corpus Linguistics is a great introduction to analyzing language data, with hands-on exercises using free software and websites. For anyone interested in textual analysis, corpus linguistics, and digital humanities, this book will get you started on the basics. Other introductions to corpus linguistics are available, but this appears to be the only one designed as a how-to guide rather than a theoretical overview.

Chapters include collecting and cleaning data, concordancing, querying mega corpora online, frequency analysis, keywords in context, and part-of-speech tagging. There are even chapters on regular expressions and XML. Each chapter features several exercises for you to try out, as well as solutions to and comments on the exercises. There is also a companion website for the book with more exercises as well as updates.

The BYU website was changed shortly after the book was published, so you will need to check the companion website for instructions on using the new interface for accessing the Corpus of Contemporary American English. However, the old website is still available. You will also need to download the free software, AntConc, and create a free account at the BNCweb site to follow along with the exercises.

If you prefer to analyze languages other than English, check out a simple analysis of telenovela Spanish phrases I carried out using DownThemAll to collect transcripts, Python to clean the text files, and AntConc to find the most common phrases.

More Corpus Linguistics

The 8-week Corpus Linguistics MOOC at FutureLearn is another great introduction to the methodology. Both Lancaster and Birmingham run Corpus Linguistics summer schools in the UK every year, if you’re able to travel to them.

If you’re interested in learning more programming for humanities research (usually in Python), try the lessons at Programming Historian and check out my list of Digital Humanities tools. Martin Weisser also wrote Essential Programming for Linguists (using Perl) that you might be interested in.

Frequent French Words in Lexique Database

French Language Database: Lexique

Lexique is a free database of frequent French words that you can download (text file or spreadsheet) or consult online. It contains 140,000 French words that can easily be filtered or sorted to look at patterns such as most frequent words or phrases, number of homophones, parts of speech, etc. The corpus that it is based on includes both literature and film subtitles so you can also compare differences among books and films. You can also search the corpus for the sentences containing certain words to see how they are used in context.

Frequent French Words: Verbs

One aspect of Lexique that I prefer over other databases or frequency lists is that verbs are not only included in the infinitive form. All conjugated forms are included, so you can easily see which tense or person/number is more frequent. Auxiliary verbs (avoir and être used in compound tenses) are separated from regular verbs, so if you are interested in form only rather than meaning, you’ll need to add up the frequencies. Homonyms such as va (imperative) and va (present tense) are not separated, but different parts of speech are: for example, danse as a verb and danse as a noun are two separate entries in the database.

If you download the Excel spreadsheet, apply a filter to show only AUX and VER, then sort the list by frequency, you can get some interesting data on verb forms. In the table below, you will see that the imperfect tense is quite common in books. There are also a few conditional forms, but no future or subjunctive, in the top 30 verb conjugations.

Verb Form  Infinitive  Aux/Verb  Frequency  Conjugation
est        être        VER       6331.76    ind:pre:3s;
était      être        VER       3688.99    ind:imp:3s;
avait      avoir       AUX       3116.42    ind:imp:3s;
a          avoir       AUX       2926.69    ind:pre:3s;
ai         avoir       AUX       2119.12    ind:pre:1s;
a          avoir       VER       1669.39    ind:pre:3s;
est        être        AUX       1600.27    ind:pre:3s;
était      être        AUX       1497.84    ind:imp:3s;
avait      avoir       VER       1496.15    ind:imp:3s;
été        être        VER       818.99     par:pas;
sont       être        VER       713.18     ind:pre:3p;
être       être        AUX       685.47     inf;
avoir      avoir       AUX       649.26     inf;
ai         avoir       VER       619.05     ind:pre:1s;
avais      avoir       AUX       566.76     ind:imp:2s;
suis       être        AUX       560.47     ind:pre:1s;
ont        avoir       AUX       553.31     ind:pre:3p;
étaient    être        VER       534.19     ind:imp:3p;ind:pre:3p;sub:pre:3p;
avaient    avoir       AUX       524.26     ind:imp:3p;
être       être        VER       505.61     inf;;inf;;inf;;
aurait     avoir       AUX       491.15     cnd:pre:3s;
eu         avoir       VER       436.76     par:pas;
étais      être        VER       403.11     ind:imp:1s;ind:imp:2s;
étaient    être        AUX       393.85     ind:imp:3p;
sont       être        AUX       386.35     ind:pre:3p;
avais      avoir       VER       351.96     ind:imp:1s;ind:imp:2s;
as         avoir       AUX       294.46     ind:pre:2s;
serait     être        VER       285.27     cnd:pre:3s;
fut        être        VER       284.46     ind:pas:3s;
es         être        VER       256.62     ind:pre:2s;
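
If you would rather script the filter-and-sort step than do it in a spreadsheet, here is a minimal Python sketch using only the standard library. The column names (ortho, lemme, cgram, freqlivres, infover) and the tab-separated layout are assumptions about the downloadable text file, so check them against the header row of your copy. The sample rows reuse figures from the table above, except for the NOM row, which is made up purely to show that non-verbs are dropped.

```python
import csv
import io

# Tiny tab-separated sample shaped like a Lexique export.
# ASSUMPTION: column names and delimiter; verify against your download.
# The "danse" row and its frequency are invented for illustration only.
SAMPLE = (
    "ortho\tlemme\tcgram\tfreqlivres\tinfover\n"
    "avait\tavoir\tAUX\t3116.42\tind:imp:3s;\n"
    "danse\tdanse\tNOM\t1.00\t\n"
    "est\têtre\tVER\t6331.76\tind:pre:3s;\n"
    "a\tavoir\tAUX\t2926.69\tind:pre:3s;\n"
    "était\têtre\tVER\t3688.99\tind:imp:3s;\n"
)


def load_rows(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def top_verb_forms(rows, n):
    """Keep AUX and VER entries, sort by book frequency, return the top n."""
    verbs = [row for row in rows if row["cgram"] in {"AUX", "VER"}]
    verbs.sort(key=lambda row: float(row["freqlivres"]), reverse=True)
    return verbs[:n]


for row in top_verb_forms(load_rows(SAMPLE), 3):
    print(row["ortho"], row["lemme"], row["cgram"], row["freqlivres"])
```

To run this on the real file, replace SAMPLE with the contents of the downloaded text file (opened with encoding="utf-8") and raise n to 30.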

This is something to keep in mind when learning/teaching French. Perhaps we should introduce the conditional before the future? Most textbooks tend to do the opposite, especially since the future and conditional use the same stems. However, the imperfect and conditional use the same endings, so the same argument could be made for teaching them together – which is strengthened by the fact that conditional forms are more frequent than future forms, as the Lexique database indicates.

I’ve always disagreed with teaching tenses separately (going from present to passé composé, then adding imperfect, followed by future, conditional, subjunctive, etc.). It seems more useful to me to teach the most common verbs and their forms regardless of tense. This is why I include imperfect and future forms when I first introduce avoir and être in my French Language Tutorial – though now I see that I should perhaps have included the conditional instead.

Thanks to corpus linguistics techniques, it is easier to design language learning materials that represent actual language use. Part of my PhD dissertation explores this topic if you’re interested in learning more.

Let me know if there are other databases of frequent French words that include conjugated verb forms instead of just infinitives!

A Linguistic Analysis of Telenovela Spanish - What are the most frequent phrases in telenovelas?

A Very Informal Corpus Linguistic Analysis of Telenovela Spanish: Pasión y Poder

A Linguistic Analysis of Telenovela Spanish, or How this Nerdy Linguist Spent her Friday Night

Ever since I discovered that Univision had started including transcripts of their telenovelas online, I had been wanting to experiment with the free corpus linguistics software AntConc to analyze the most common phrases used in telenovela Spanish. I chose Pasión y Poder because it had the most transcripts still available on the website, even though I rarely watched it. It was a fairly typical telenovela, unlike El Hotel de Los Secretos or Yago, with plenty of fighting and drama and a (mostly) happy ending. Unfortunately, Telemundo does not provide transcripts of their telenovelas (which tend to be better), which is a shame since I’d love to analyze the language of La Esclava Blanca, a Colombian telenovela set in the mid-1800s.

Here’s how I created the corpus and found the most frequent phrases, if you feel inclined to be as nerdy…

How to be a linguistics/telenovela nerd:

  1. Downloading the HTML files was quick and easy thanks to the DownThemAll add-on for Firefox. The URL of each episode differs only by its number, so I was able to use batch descriptors. (I know web scraping is possible with Python, but my programming knowledge is still pretty basic and I knew that I could get the files with the add-on in about 20 seconds.)
  2. Then I needed to find a way to extract the text from all of the <p> tags – since the transcript was the only text enclosed in these tags in all of the HTML code – and create a text file for each episode. After an hour of searching, I managed to find some Python/BeautifulSoup code online that did what I needed, after a couple of tweaks, a few tears, and many error messages.
  3. Finally, I loaded all the text files into AntConc and played around with the Clusters/N-Grams option and N-Gram Size to find the most frequent phrases between five and ten words.
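
For anyone who wants to try steps 2 and 3 entirely in Python, here is a minimal sketch. It is not the code used for this analysis: it swaps BeautifulSoup for the standard library's html.parser so that nothing needs to be installed, and it counts n-grams with a simple Counter rather than reproducing what AntConc does internally (for instance, n-grams here never cross paragraph boundaries). The class and function names and the sample HTML are all made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def _flush(self):
        # Join the collected pieces and normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # also copes with <p> tags that were never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of every <p> element in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def ngrams(text, n):
    """Return all n-word sequences in text, lowercased, punctuation dropped."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(texts, n):
    """Count n-grams across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return counts


# Made-up page standing in for one downloaded episode.
SAMPLE_HTML = """
<html><body><div>Menu</div>
<p>Todo va a estar bien.</p>
<p>Tranquila, todo va a estar bien.</p>
<script>var x = 1;</script>
</body></html>
"""

paragraphs = extract_paragraphs(SAMPLE_HTML)
print(count_ngrams(paragraphs, 5).most_common(1))
```

With real files, you would read each downloaded HTML file, write the extracted paragraphs to a text file for that episode, and then run count_ngrams for each size from five to ten.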

Most Frequent Phrases in Pasión y Poder

So here are the most frequent phrases used in Pasión y Poder, starting with ten word phrases and ending with five word phrases. Keep in mind that some of the phrases are typically Mexican, and some are overly dramatic because, well, they’re from a telenovela!

  • A ver, a ver, a ver, a ver, a ver. (A ver is usually translated as let’s see, but I have no idea what a good translation for this many a vers together would be in natural English.)
  • No te metas en lo que no te importa. (Don’t stick your nose where it doesn’t belong./Mind your own business.)
  • No sabes el gusto que me da que… (You don’t know how happy it makes me that…)
  • ¿No te das cuenta? ¿No te das cuenta? (Don’t you realize? Don’t you realize?)
  • Esto no se va a quedar así. (This isn’t over. [said as a threat of revenge])
  • No me lo tomes a mal, pero… (Don’t take this the wrong way, but…)
  • … lo que te voy a decir. (… what I’m going to tell you.)
  • Lo único que quiero es que… (The only thing I want is that…)
  • No, eso no va a pasar. (No, that is not going to happen.)
  • No tiene nada que ver con… (It has nothing to do with…)
  • Lo que pasa es que no… (What’s happening is that … not …)
  • No te lo voy a perdonar. (I’m not going to forgive you for it.)
  • No te voy a permitir que… (I won’t allow you to…)
  • Eres el amor de mi vida. (You are the love of my life.)
  • No tiene la culpa de nada. (S/he is not guilty of anything.)
  • A pesar de todo, lo que… (In spite of everything, what…)
  • Creo que lo mejor es que… (I think the best thing is that/to…)
  • Lo que me preocupa es que… (What worries me is that…)
  • Lo único que espero es que… (The only thing I hope is that…)
  • Todo va a estar bien. (Everything will be fine.)
  • Me da mucho gusto que… (I’m very happy that…)
  • No voy a dejar que… (I’m not going to let…)
  • No, por supuesto que no. (No, of course not.)
  • ¿Qué fue lo que pasó? (What happened?)
  • Sí, lo sé, lo sé. (Yes, I know, I know.)
  • Ya me tengo que ir. (I have to go now.)
  • No me importa lo que… (I don’t care what…)
  • … lo que vas a hacer. (… what you’re going to do.)
  • Te pido por favor que… (I am asking you please to…)
  • Ya me di cuenta que… (I already realized that…)
  • De una vez por todas. (Once and for all.)
  • ¿No te das cuenta que…? (Don’t you realize that…?)
  • Yo no tengo nada que… (I have nothing that…)
  • Y lo peor es que… (And the worst part is that…)

Telenovela Battle of Screams and Insults

I was also interested in finding out which of the words I constantly heard yelled were the most frequent:

In the battle suéltame (let go of me) vs. lárgate (get out), the winner is: ¡lárgate! (59 vs. 61)

And in the battle infeliz (fool) vs. desgraciado (bastard), the winner is: ¡infeliz! (74 vs. 69)

However, the winner of them all was ¡No puede ser! (It can’t be!) with a frequency count of 151.
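
Head-to-head counts like these can also be checked with a few lines of Python. This is only a sketch under simple assumptions: the function name is made up, the sample text is invented rather than taken from the transcripts, and matching is case-insensitive on whole words.

```python
import re


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Invented sample text, not a real transcript.
SAMPLE_TEXT = (
    "¡Lárgate! ¡No puede ser! "
    "Lárgate de mi casa, infeliz. No puede ser..."
)

for phrase in ["suéltame", "lárgate", "infeliz", "no puede ser"]:
    print(phrase, count_phrase(SAMPLE_TEXT, phrase))
```

In practice, SAMPLE_TEXT would be the concatenated contents of all the episode text files.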

So what have we learned?

To sum up, Telenovela Spanish is hilarious and corpus linguistics is amazing.

If you’d like to learn more about corpus linguistics, there is a great free Corpus Linguistics MOOC at FutureLearn, and the hands-on exercises in the new textbook Practical Corpus Linguistics will get you started with AntConc. There are also tutorials on YouTube on how to use this software. If you’re interested in learning Python, try Dr. Chuck’s Python for Everybody lessons.

Free Corpora of Spoken French for French Language Learners or Researchers

Free Corpora of Spoken French for French Learners or Researchers

Learn French with Free Corpora of Spoken French

I am always looking for corpora of spoken French for my research so I was quite surprised to come across several freely available resources on the internet in the past week. Most of these corpora contain audio and/or video with transcripts of authentic and spontaneous spoken French – perfect for self-study or use in a language lab.

  • SACODEYL (System-aided compilation: an open distribution of European youth language) is actually available in seven EU languages (English, French, German, Italian, Spanish, Romanian, and Lithuanian) and was designed specifically for teaching purposes. Click on Resources after choosing a corpus to access the learning packages.
  • TCOF (Traitement de Corpus Oraux en Français) includes recordings from the 1980s and 1990s, available under a Creative Commons license.
  • CFPP2000 (Corpus de français parlé parisien des années 2000) contains several interviews with Parisians from the early 2000s. Audio files and transcripts are available for download.
  • CFPQ (Corpus de français parlé au Québec) is a multimodal corpus that also includes information on non-verbal aspects of communication (such as gestures, facial movements, etc.). It also dates from the 2000s; however, only PDFs of the transcripts are available.

Other corpora of spoken French or simply videos with transcripts that I’ve mentioned in the past include:

And don’t forget my French Listening Resources, with plenty of transcripts and exercises.

If you know of other freely accessible corpora of French, please let me know.