A Linguistic Analysis of Telenovela Spanish - What are the most frequent phrases in telenovelas?

A Linguistic Analysis of Telenovela Spanish, or How this Nerdy Linguist Spent her Friday Night

Ever since I discovered that Univision had started posting transcripts of its telenovelas online, I had wanted to experiment with the free corpus linguistics software AntConc to analyze the most common phrases used in telenovela Spanish. I chose Pasión y Poder because it had the most transcripts still available on the website, even though I rarely watched it. It was a fairly typical telenovela, unlike El Hotel de Los Secretos or Yago, with plenty of fighting and drama and a (mostly) happy ending. Unfortunately, Telemundo does not provide transcripts of its telenovelas (which tend to be better). That is a shame, since I’d love to analyze the language of La Esclava Blanca, a Colombian telenovela set in the mid-1800s.

Here’s how I created the corpus and found the most frequent phrases, if you feel inclined to be as nerdy…

How to be a linguistics/telenovela nerd:

  1. Downloading the HTML files was quick and easy thanks to the DownThemAll add-on for Firefox. The URL of each episode differs only by the episode number, so I was able to use batch descriptors. (I know web scraping is possible with Python, but my programming knowledge is still pretty basic, and I knew that I could get the files with the add-on in about 20 seconds.)
  2. Then I needed to extract the text from all of the <p> tags (the transcript was the only text enclosed in these tags in the HTML) and create a text file for each episode. After an hour of searching online, I found some Python/BeautifulSoup code that did what I needed, after a couple of tweaks, a few tears, and many error messages.
  3. Finally, I loaded the 117 text files into AntConc and played around with the Clusters/N-Grams option and the N-Gram Size setting to find the most frequent phrases of five to ten words.
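
For anyone who wants to try step 2, here is a minimal sketch of the idea. It is not the code I actually used: it swaps BeautifulSoup for Python's built-in html.parser module so that it runs without installing anything, and the names ParagraphExtractor, extract_transcript and convert_folder are made up for illustration. It assumes, as described above, that the transcript is the only text inside <p> tags.

```python
from html.parser import HTMLParser
from pathlib import Path


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements, one string per paragraph."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_p = False      # True while we are inside a <p> element
        self.current = []      # text pieces of the paragraph being read
        self.paragraphs = []   # finished paragraphs

    def _flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.in_p:      # previous <p> was never closed
                self._flush()
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self._flush()
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)


def extract_transcript(html):
    """Return the text of every <p> in the page, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


def convert_folder(html_dir, txt_dir):
    """Write one .txt file per .html file; return how many were converted."""
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for page in sorted(Path(html_dir).glob("*.html")):
        text = extract_transcript(page.read_text(encoding="utf-8"))
        (out / (page.stem + ".txt")).write_text(text, encoding="utf-8")
        count += 1
    return count
```

Calling convert_folder("episodes_html", "episodes_txt") (both folder names are placeholders) would leave you with one text file per episode, ready to load into AntConc. The sketch also assumes the downloaded pages are saved as UTF-8.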

Most Frequent Phrases in Pasión y Poder

So here are the most frequent phrases used in Pasión y Poder, starting with ten-word phrases and ending with five-word phrases. Keep in mind that some of the phrases are typically Mexican, and some are overly dramatic because, well, they’re from a telenovela!

  • A ver, a ver, a ver, a ver, a ver. (A ver is usually translated as let’s see, but I have no idea what a good translation for this many a vers together would be in natural English.)
  • No te metas en lo que no te importa. (Don’t stick your nose where it doesn’t belong./Mind your own business.)
  • No sabes el gusto que me da que… (You don’t know how happy it makes me that…)
    ¿No te das cuenta? ¿No te das cuenta? (Don’t you realize? Don’t you realize?)
  • Esto no se va a quedar así. (This isn’t over. [said as a threat of revenge])
    No me lo tomes a mal, pero… (Don’t take this the wrong way, but…)
  • … lo que te voy a decir. (… what I’m going to tell you.)
    Lo único que quiero es que… (The only thing I want is that…)
    No, eso no va a pasar. (No, that is not going to happen.)
    No tiene nada que ver con… (It has nothing to do with…)
    Lo que pasa es que no… (What’s happening is that … not…)
    No te lo voy a perdonar. (I’m not going to forgive you for it.)
    No te voy a permitir que… (I won’t allow you to…)
    Eres el amor de mi vida. (You are the love of my life.)
    No tiene la culpa de nada. (S/he is not guilty of anything.)
    A pesar de todo, lo que… (In spite of everything, what…)
    Creo que lo mejor es que… (I think the best thing is that/to…)
    Lo que me preocupa es que… (What worries me is that…)
    Lo único que espero es que… (The only thing I hope is that…)
  • Todo va a estar bien. (Everything will be fine.)
    Me da mucho gusto que… (I’m very happy that…)
    No voy a dejar que… (I’m not going to let…)
    No, por supuesto que no. (No, of course not.)
    ¿Qué fue lo que pasó? (What happened?)
    Sí, lo sé, lo sé. (Yes, I know, I know.)
    Ya me tengo que ir. (I have to go now.)
    No me importa lo que… (I don’t care what…)
    … lo que vas a hacer. (…what you’re going to do.)
    Te pido por favor que… (I am asking you please to…)
    Ya me di cuenta que… (I already realized that…)
    De una vez por todas. (Once and for all.)
    ¿No te das cuenta que…? (Don’t you realize that…?)
    Yo no tengo nada que… (I have nothing that…)
    Y lo peor es que… (And the worst part is that…)
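
If you would rather count phrases in Python than in AntConc, the idea behind an n-gram list can be sketched roughly like this. It is only an approximation under my own assumptions (lowercase everything, ignore punctuation, let phrases run across sentence boundaries), so the counts would not necessarily match AntConc's. The function names and the sample lines are invented for the example and are not taken from the transcripts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters (accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ngram_counts(texts, n):
    """Count every sequence of n consecutive words across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented sample lines, just to show the shape of the result.
sample_texts = [
    "Todo va a estar bien. Todo va a estar bien, mi amor.",
    "No, eso no va a pasar.",
]

five_grams = ngram_counts(sample_texts, 5)
print(five_grams.most_common(1))  # [('todo va a estar bien', 2)]
```

Running ngram_counts once for each size from 10 down to 5 and looking at most_common() gives the same kind of ranked list as above.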

Telenovela Battle of Screams and Insults

I was also curious which of the words I hear yelled all the time are the most frequent:

In the battle suéltame (let go of me) vs. lárgate (get out), the winner is: ¡lárgate! (59 vs. 61)

And in the battle infeliz (fool) vs. desgraciado (bastard), the winner is: ¡infeliz! (74 vs. 69)

However, the winner of them all was ¡No puede ser! (It can’t be!) with a frequency count of 151.
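
A head-to-head count like these can also be sketched in a few lines of Python. Again, this is a rough illustration under my own assumptions (case-insensitive matching on whole words, punctuation ignored), with a made-up function name and made-up sample lines rather than the real transcripts, so it says nothing about the actual counts.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only runs of letters (accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def phrase_frequency(texts, phrase):
    """Count how often a word or multi-word phrase occurs across all texts."""
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return 0
    total = 0
    for text in texts:
        tokens = tokenize(text)
        total += sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
        )
    return total


# Invented sample lines, not real dialogue from the show.
lines = [
    "¡Lárgate! ¡Lárgate de mi casa!",
    "¡Suéltame, infeliz!",
    "¡No puede ser! No, no puede ser.",
]

for word in ["suéltame", "lárgate", "No puede ser"]:
    print(word, phrase_frequency(lines, word))
```

With the real episode files, texts would simply be the contents of the 117 transcripts.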

So what have we learned?

To sum up, telenovela Spanish is hilarious, and corpus linguistics is amazing.

If you’d like to learn more about corpus linguistics, there is a free MOOC at FutureLearn starting in September, and the hands-on exercises in the new textbook Practical Corpus Linguistics will get you started with AntConc. There are also tutorials on YouTube on how to use the software.

Free Corpora of Spoken French for French Language Learners or Researchers

Learn French with Free Corpora of Spoken French

I am always looking for corpora of spoken French for my research, so I was quite surprised to come across several freely available resources on the internet in the past week. Most of these corpora contain audio and/or video with transcripts of authentic and spontaneous spoken French, which makes them perfect for self-study or use in a language lab.

  • SACODEYL (System Aided Compilation and Open Distribution of European Youth Language) is actually available in seven EU languages (English, French, German, Italian, Spanish, Romanian, and Lithuanian) and was designed specifically for teaching purposes. Click on Resources after choosing a corpus to access the learning packages.
  • TCOF (Traitement de Corpus Oraux en Français) includes recordings from the 1980s and 1990s, available under a Creative Commons license.
  • CFPP2000 (Corpus de français parlé parisien des années 2000) contains several interviews with Parisians from the early 2000s. Audio files and transcripts are available for download.
  • CFPQ (Corpus de français parlé au Québec) is a multimodal corpus that also includes information on non-verbal aspects of communication (such as gestures, facial movements, etc.). It also dates from the 2000s; however, only PDFs of the transcripts are available.

Other corpora of spoken French or simply videos with transcripts that I’ve mentioned in the past include:

And don’t forget my French Listening Resources, with plenty of transcripts and exercises.

If you know of other freely accessible corpora of French, please let me know.