<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Text Analysis on rdata.lu Blog | Data science with R</title>
    <link>/categories/text-analysis/</link>
    <description>Recent content in Text Analysis on rdata.lu Blog | Data science with R</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>Copyright (c) rdata.lu. All rights reserved. &lt;br&gt; Content reblogged by &lt;a href=&#39;https://www.r-bloggers.com/&#39; target=&#39;_blank&#39;&gt;R-bloggers&lt;/a&gt; &amp; &lt;a href=&#39;http://www.rweekly.org/&#39; target=&#39;_blank&#39;&gt;RWeekly&lt;/a&gt;</copyright>
    <lastBuildDate>Fri, 26 Jan 2018 00:00:00 +0000</lastBuildDate>
    
        <atom:link href="/categories/text-analysis/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Analysis of the Renert - Part 3: Visualizations</title>
      <link>/post/2018-01-26-analysis-of-the-renert-part-3/</link>
      <pubDate>Fri, 26 Jan 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018-01-26-analysis-of-the-renert-part-3/</guid>
      <description>&lt;!--```{r, echo=FALSE}
knitr::include_graphics(&#34;/images/renert.jpg&#34;)
```--&gt;
&lt;p style=&#34;text-align:center&#34;&gt;
&lt;img src=&#34;/images/renert.jpg&#34; style=&#34;width:60vh; &#34;&gt;
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This is part 3 of a 3 part blog post. This post uses the data that was scraped in part 1 and prepared in part 2.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now that we have the data in a nice format, let’s make a frequency plot! First let’s load the data and the packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tidyverse&amp;quot;)
library(&amp;quot;ggthemes&amp;quot;) # To use different themes and colors
renert_tokenized = readRDS(&amp;quot;renert_tokenized.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the &lt;code&gt;ggplot2&lt;/code&gt; package, I can produce a plot of the most frequent words.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_tokenized %&amp;gt;%
  count(word, sort = TRUE) %&amp;gt;%
  filter(n &amp;gt; 50) %&amp;gt;%
  mutate(word = reorder(word, n)) %&amp;gt;%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme_minimal() &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2018-01-26-analysis-of-the-renert-part-3_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, the most frequent word is &lt;em&gt;kinnek&lt;/em&gt;, meaning &lt;em&gt;King&lt;/em&gt;! &lt;em&gt;kinnek&lt;/em&gt; is mentioned more times than &lt;em&gt;renert&lt;/em&gt;, the name of the hero. Next are &lt;em&gt;här&lt;/em&gt; and &lt;em&gt;wollef&lt;/em&gt; meaning &lt;em&gt;mister&lt;/em&gt; and &lt;em&gt;wolf&lt;/em&gt;. In fifth position we have &lt;em&gt;fuuss&lt;/em&gt;, for &lt;em&gt;fox&lt;/em&gt;. I’ll let you use Google Translate for the other words 😄.&lt;/p&gt;
&lt;p&gt;Now, I’m also going to do some sentiment analysis, using the AFINN list of words. Each word in this list carries a score that quantifies its sentiment. You can download the original list from &lt;a href=&#34;http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
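To make the structure concrete, here is a tiny AFINN-style table (the words and values below are illustrative only, not entries copied from the actual lexicon): each word is paired with an integer score, and the real AFINN scores run from -5 (very negative) to +5 (very positive).

```r
library("tibble")

# Toy AFINN-style entries (illustrative only, not the real lexicon):
# AFINN scores are integers from -5 (very negative) to +5 (very positive).
afinn_toy = tribble(
  ~word,      ~score,
  "abandon",  -2,
  "awesome",   4,
  "disaster", -3
)

afinn_toy
```

This word/score structure is what gets translated below and later merged with the tokenized text.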
&lt;p&gt;Because such a list is not available in Luxembourgish, I translated it using the Google Translate API. Here is the code to do that:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tidyverse&amp;quot;)
library(&amp;quot;translate&amp;quot;) # google translate api
library(&amp;quot;tidytext&amp;quot;) # to load the AFINN dictionary

api_key = &amp;quot;api_key_goes_here&amp;quot;

set.key(api_key)

afinn = get_sentiments(&amp;quot;afinn&amp;quot;)

# I wrap the `translate()` function around `purrr::possibly()` so that in case of an
# error, I get the translations that worked back.

possibly_translate = purrr::possibly(translate::translate, otherwise = &amp;quot;error&amp;quot;)

afinn_lux = afinn %&amp;gt;%
  mutate(lux = map(word, possibly_translate, source = &amp;quot;en&amp;quot;, target = &amp;quot;lb&amp;quot;)) %&amp;gt;%
  mutate(lux = unlist(lux))

write_csv(afinn_lux, &amp;quot;afinn_lux.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the above code to work, you need a Google Cloud account, which you can create for free.&lt;/p&gt;
&lt;p&gt;I did not check the quality of the translations, and I’m sure they’re far from perfect. The translated list is also available in the GitHub repository &lt;a href=&#34;https://github.com/b-rodrigues/stopwords_lu&#34;&gt;here&lt;/a&gt;. Again, contributions are more than welcome!&lt;/p&gt;
&lt;p&gt;Now, I need to merge the dictionary with the data from each song. First, let’s load the dictionary:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;afinn_lux = read.csv(&amp;quot;afinn_lux.csv&amp;quot;)

# I keep only the `lux` column (renamed to `word`) and the `score` column
afinn_lux = afinn_lux %&amp;gt;%
  select(word = lux, score)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What does this dictionary look like? Let’s see:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(afinn_lux)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##             word score
## 1       opzeginn    -2
## 2     verloossen    -2
## 3       opzeginn    -2
## 4 entfouert ginn    -2
## 5       entlooss    -2
## 6      entfouert    -2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s load the tokenized songs, and merge them with the dictionary:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_songs_tokenized = readRDS(&amp;quot;renert_songs_tokenized.rds&amp;quot;)
  
renert_songs_sentiment = map(renert_songs_tokenized, ~full_join(., afinn_lux))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can now merge the data in a single data frame and do some further cleaning:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_songs_sentiment = renert_songs_sentiment %&amp;gt;%
  bind_rows() %&amp;gt;%
  filter(!is.na(score)) %&amp;gt;%
  filter(!is.na(gesank))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What does the final data look like? Here it is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(renert_songs_sentiment)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 3
##   word  gesank  score
##   &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;int&amp;gt;
## 1 rifft éischte    -2
## 2 léiw  éischte    -3
## 3 fest  éischte     2
## 4 fest  éischte     2
## 5 räich éischte     2
## 6 räich éischte     3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that some words appear several times with different scores. That’s probably because the translation of the dictionary was not very good. Oh well, let’s do a boxplot of the sentiment for each song:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;order =  c(&amp;quot;éischte&amp;quot;, &amp;quot;zwete&amp;quot;, &amp;quot;drëtte&amp;quot;, &amp;quot;véierte&amp;quot;, &amp;quot;fënnefte&amp;quot;, &amp;quot;sechste&amp;quot;, &amp;quot;siwente&amp;quot;,
                   &amp;quot;aachte&amp;quot;, &amp;quot;néngte&amp;quot;, &amp;quot;zéngte&amp;quot;, &amp;quot;elefte&amp;quot;, &amp;quot;zwielefte&amp;quot;, &amp;quot;dräizengte&amp;quot;, &amp;quot;véierzengte&amp;quot;)

renert_songs_sentiment %&amp;gt;%
  ggplot(aes(gesank, score)) + 
  scale_x_discrete(limits = order) + 
  geom_boxplot() + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2018-01-26-analysis-of-the-renert-part-3_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, there is no discernible pattern. This can mean two things: either the general sentiment inside each song is fairly neutral, or the quality of the translation was too bad for the results to make any sense.&lt;/p&gt;
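One possible mitigation, not attempted above, would be to collapse the duplicated translations before joining, for example by averaging the scores of identical words. A minimal sketch, using toy rows shaped like `afinn_lux`:

```r
library("dplyr")

# Toy rows shaped like afinn_lux: after translation, the same word can
# appear several times with different scores (made-up values).
afinn_dups = tibble::tribble(
  ~word,        ~score,
  "opzeginn",   -2,
  "opzeginn",   -2,
  "fest",        2,
  "fest",        3
)

# Average the scores so that each word keeps a single sentiment value.
afinn_dedup = afinn_dups %>%
  group_by(word) %>%
  summarise(score = mean(score), .groups = "drop")

afinn_dedup
```

Whether averaging is the right aggregation is debatable; taking the median, or keeping only the first translation, would be equally defensible choices.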
&lt;p&gt;That’s it for this series of posts! I hope you enjoyed reading it as much as I enjoyed writing it and analyzing the data!&lt;/p&gt;
&lt;p&gt;
Don’t hesitate to follow us on twitter &lt;a href=&#34;https://twitter.com/rdata_lu&#34; target=&#34;_blank&#34;&gt;&lt;span class=&#34;citation&#34;&gt;@rdata_lu&lt;/span&gt;&lt;/a&gt; &lt;!-- or &lt;a href=&#34;https://twitter.com/brodriguesco&#34;&gt;@brodriguesco&lt;/a&gt; --&gt; and to &lt;a href=&#34;https://www.youtube.com/channel/UCbazvBnJd7CJ4WnTL6BI6qw?sub_confirmation=1&#34; target=&#34;_blank&#34;&gt;subscribe&lt;/a&gt; to our youtube channel. &lt;br&gt; You can also contact us if you have any comments or suggestions. See you for the next post!
&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Analysis of the Renert - Part 2: Data Processing</title>
      <link>/post/2018-01-24-analysis-of-the-renert-part-2/</link>
      <pubDate>Wed, 24 Jan 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018-01-24-analysis-of-the-renert-part-2/</guid>
      <description>&lt;!--```{r, echo=FALSE}
knitr::include_graphics(&#34;/images/renert.jpg&#34;)
```--&gt;
&lt;p style=&#34;text-align:center&#34;&gt;
&lt;img src=&#34;/images/renert.jpg&#34; style=&#34;width:60vh; &#34;&gt;
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This is part 2 of a 3 part blog post. This post uses the data that we scraped in &lt;a href=&#34;http://www.blog.rdata.lu/post/2018-01-22-analysis-of-the-renert-part-1/&#34;&gt;part 1&lt;/a&gt; and prepares it for further analysis, which is quite technical. If you’re only interested in the results of the analysis, skip to &lt;a href=&#34;http://www.blog.rdata.lu/post/2018-01-26-analysis-of-the-renert-part-3/&#34;&gt;part 3&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;First, let’s load the data that we prepared in &lt;a href=&#34;http://www.blog.rdata.lu/post/2018-01-22-analysis-of-the-renert-part-1/&#34;&gt;part 1&lt;/a&gt;. Let’s start with the full text:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tidyverse&amp;quot;)
library(&amp;quot;tidytext&amp;quot;)
renert = readRDS(&amp;quot;renert_full.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I want to study the frequencies of words, so for this, I will use a function from the &lt;code&gt;tidytext&lt;/code&gt; package called &lt;code&gt;unnest_tokens()&lt;/code&gt; which breaks the text down into tokens. Each token is a word, which will then make it possible to compute the frequencies of words.&lt;/p&gt;
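As a rough illustration of what tokenization means here (a base-R sketch; `unnest_tokens()` itself handles punctuation and special cases more carefully), using two made-up lines built from words of the poem:

```r
# A rough base-R sketch of tokenization: lowercase the text, drop
# punctuation, and split each line into one word per element.
# (These two lines are made up for illustration.)
lines = c("Den Renert war e Fuuss,", "de Kinnek rifft de Wollef.")

tokens = tolower(gsub("[[:punct:]]", " ", lines))
tokens = unlist(strsplit(tokens, "\\s+"))
tokens = tokens[tokens != ""]

tokens
```

The result is one lowercase word per element, which is exactly the one-word-per-row shape that `unnest_tokens()` produces as a data frame column.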
&lt;p&gt;So, let’s unnest the tokens:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert = renert %&amp;gt;%
  unnest_tokens(word, text)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We still need to do some cleaning before continuing. In Luxembourgish, &lt;em&gt;the&lt;/em&gt; is written &lt;em&gt;d’&lt;/em&gt; for feminine nouns. For example, &lt;em&gt;d’Kaz&lt;/em&gt; means &lt;em&gt;the cat&lt;/em&gt;. There’s also a bunch of &lt;em&gt;’t&lt;/em&gt;s in the text, which means &lt;em&gt;it&lt;/em&gt;. For example, the second line of the first song:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;’T stung Alles an der Bléi,&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Everything (it) was in bloom,&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We can remove these with a couple lines of code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_tokenized = renert %&amp;gt;%
  mutate(word = str_replace_all(word, &amp;quot;d&amp;#39;&amp;quot;, &amp;quot;&amp;quot;)) %&amp;gt;%
  mutate(word = str_replace_all(word, &amp;quot;&amp;#39;t&amp;quot;, &amp;quot;&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But that’s not all! We still need to remove so-called stop words. Stop words are very frequent words, such as “and”, that usually do not add anything to the analysis. There are no set rules for defining a list of stop words, so I took inspiration from the English and German lists and created my own, which you can get on &lt;a href=&#34;https://github.com/b-rodrigues/stopwords_lu&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;stopwords = read.csv(&amp;quot;stopwords_lu.csv&amp;quot;, header = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For my Luxembourgish-speaking compatriots, I’d be glad to get help to make this list better! This list is far from perfect, certainly contains typos, or even words that have no reason to be there! Please help 😅.&lt;/p&gt;
&lt;p&gt;Using this list of stop words, I can remove words that don’t add anything to the analysis. Creating a list of stop words for the Luxembourgish language is very challenging, because there are stop words that come from German, such as “awer”, from the German “aber”, meaning &lt;em&gt;but&lt;/em&gt;; you could also use “mä”, from the French &lt;em&gt;mais&lt;/em&gt;, which also means &lt;em&gt;but&lt;/em&gt;. Plus, as kids, we never really learned how to write Luxembourgish. Actually, most Luxembourgers don’t know how to write Luxembourgish 100% correctly. This is because, for a very long time, Luxembourgish was used for oral communication and French for formal written correspondence. This is changing, and more and more people are learning how to write it correctly. I definitely have a lot to learn! Thus, I have certainly missed a lot of stop words, but I am hopeful that others will contribute to the list and make it better! In the meantime, that’s what I’m going to use.&lt;/p&gt;
&lt;p&gt;Let’s take a look at some lines of the stop words data frame:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(stopwords, 20)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          word
## 1           a
## 2           à
## 3         äis
## 4          är
## 5         ärt
## 6        äert
## 7        ären
## 8         all
## 9       allem
## 10      alles
## 11   alleguer
## 12        als
## 13       also
## 14         am
## 15         an
## 16 anerefalls
## 17        ass
## 18        aus
## 19       awer
## 20        bei&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can remove the stop words from our tokens using an &lt;code&gt;anti_join()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_tokenized = renert_tokenized %&amp;gt;%
  anti_join(stopwords)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Joining, by = &amp;quot;word&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Column `word` joining character vector and factor, coercing into
## character vector&lt;/code&gt;&lt;/pre&gt;
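To see concretely what `anti_join()` does, here is a toy example (made-up words, not the real data): it keeps only the rows of the left table whose `word` does not appear in the right table.

```r
library("dplyr")

# Toy tokens and a toy stop word list (made up for illustration).
tokens = tibble::tibble(word = c("an", "fuuss", "der", "kinnek"))
stops  = tibble::tibble(word = c("an", "der"))

# Keep only the words that are absent from the stop word list.
kept = anti_join(tokens, stops, by = "word")

kept
```

Passing `by = "word"` explicitly also silences the “Joining, by” message, and reading the stop words with `read.csv(..., stringsAsFactors = FALSE)` would avoid the factor coercion warning above.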
&lt;p&gt;Let’s save this for later use:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;saveRDS(renert_tokenized, &amp;quot;renert_tokenized.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I now have to do the same for the data that is stored by song. Because this is a list where each element is a data frame, I have to use &lt;code&gt;purrr::map()&lt;/code&gt; to map each of the functions I used before to each data frame:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_songs = readRDS(&amp;quot;renert_songs_df.rds&amp;quot;)
renert_songs = map(renert_songs, ~unnest_tokens(., word, text))
renert_songs = map(renert_songs, ~anti_join(., stopwords))
renert_songs = map(renert_songs, ~mutate(., word = str_replace_all(word, &amp;quot;d&amp;#39;&amp;quot;, &amp;quot;&amp;quot;)))
renert_songs = map(renert_songs, ~mutate(., word = str_replace_all(word, &amp;quot;&amp;#39;t&amp;quot;, &amp;quot;&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at the object we have:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(renert_songs[[1]])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 1
##   word     
##   &amp;lt;chr&amp;gt;    
## 1 éischte  
## 2 gesank   
## 3 edit     
## 4 päischten
## 5 stung    
## 6 bléi&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looks pretty nice! But I can make it nicer by adding a column indicating which song the data refers to. Indeed, the first row of each data frame contains the number of the song. I can extract this information and add it to each data set:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_songs = map(renert_songs, ~mutate(., gesank = pull(.[1,1])))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look again:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(renert_songs[[1]])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   word      gesank 
##   &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;  
## 1 éischte   éischte
## 2 gesank    éischte
## 3 edit      éischte
## 4 päischten éischte
## 5 stung     éischte
## 6 bléi      éischte&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can save this object for later use:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;saveRDS(renert_songs, &amp;quot;renert_songs_tokenized.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the final part of this series, I will use the tokenized data as well as the list of songs to create a couple of visualizations!&lt;/p&gt;
&lt;p&gt;
Don’t hesitate to follow us on twitter &lt;a href=&#34;https://twitter.com/rdata_lu&#34; target=&#34;_blank&#34;&gt;&lt;span class=&#34;citation&#34;&gt;@rdata_lu&lt;/span&gt;&lt;/a&gt; &lt;!-- or &lt;a href=&#34;https://twitter.com/brodriguesco&#34;&gt;@brodriguesco&lt;/a&gt; --&gt; and to &lt;a href=&#34;https://www.youtube.com/channel/UCbazvBnJd7CJ4WnTL6BI6qw?sub_confirmation=1&#34; target=&#34;_blank&#34;&gt;subscribe&lt;/a&gt; to our youtube channel. &lt;br&gt; You can also contact us if you have any comments or suggestions. See you for the next post!
&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Analysis of the Renert - Part 1: Scraping</title>
      <link>/post/2018-01-22-analysis-of-the-renert-part-1/</link>
      <pubDate>Mon, 22 Jan 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018-01-22-analysis-of-the-renert-part-1/</guid>
      <description>&lt;!--```{r, echo=FALSE}
knitr::include_graphics(&#34;/images/renert.jpg&#34;)
```--&gt;
&lt;p style=&#34;text-align:center&#34;&gt;
&lt;img src=&#34;/images/renert.jpg&#34; style=&#34;width:60vh; &#34;&gt;
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This is part 1 of a 3 part blog post. This post presents the Luxembourgish language as well as the literary work I am going to analyze using the R programming language. &lt;a href=&#34;http://www.blog.rdata.lu/post/2018-01-24-analysis-of-the-renert-part-2/&#34;&gt;Part 2&lt;/a&gt; deals with preparing the data for analysis, and finally &lt;a href=&#34;http://www.blog.rdata.lu/post/2018-01-26-analysis-of-the-renert-part-3/&#34;&gt;part 3&lt;/a&gt; is the analysis. Hope you enjoy!&lt;/em&gt;&lt;/p&gt;
&lt;div id=&#34;luxembourg-and-the-luxembourgish-language&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Luxembourg and the Luxembourgish language&lt;/h2&gt;
&lt;p&gt;Luxembourg is a small European country, squeezed between France, Belgium and Germany. Over the course of its history, it’s been invaded over and over by either France or Prussia (later Germany). It eventually became a state under the personal possession of William I of the Netherlands in 1815, with a… Prussian garrison to guard its capital, Luxembourg City, from further French invasions. After the Belgian revolution of 1839, the purely French-speaking part of the country was ceded to Belgium and the Luxembourgish-speaking part became what is known today as the Grand-Duchy of Luxembourg. What’s a Grand-Duchy, you might wonder? &lt;br&gt;&lt;br&gt; Luxembourg is the only remaining Grand-Duchy in the world. A Grand-Duchy is like a kingdom, but instead of a king, we have a Grand Duke. The current monarch is Henri; Luxembourg is a constitutional monarchy, with the head of government being the prime minister, Xavier Bettel. As you can imagine, Luxembourg’s history has had a very important impact on the languages we speak today in the country; there are three official languages: French, German, and Luxembourgish. Unlike other countries with several official languages, in Luxembourg there is not a French-, German-, or Luxembourgish-speaking part. In Luxembourg, you use one of the three languages based on context.&lt;br&gt;&lt;br&gt; For example, the laws are all written in French, and French is mostly the language used for official or formal written correspondence. German has traditionally been the language of the press and the police. And finally, Luxembourgish is the language Luxembourgers use to speak with one another.
This means that on a given day, most people here might switch between these three languages; of course, add English to the pile, which is rapidly growing in the country due to all the English-speaking expats that come here to work (&lt;em&gt;cough&lt;/em&gt;brexit&lt;em&gt;cough&lt;/em&gt;).&lt;br&gt;&lt;br&gt; There is also a sizable Portuguese community in Luxembourg, so you’ll hear a lot of Portuguese on the streets too, as well as Italian. Around 50% of the inhabitants of Luxembourg are foreign-born, mostly from other EU countries. The Italians, Portuguese and many others immigrated to Luxembourg starting in the 60s to work in the metallurgic sector, and later in the construction sector. The children of these immigrants usually speak five languages: their mother tongue, say, Portuguese, the three official languages of the country, and finally English. &lt;br&gt;&lt;/p&gt;
&lt;p&gt;You might wonder what Luxembourgish sounds like? Here is a video of our Prime Minister talking in Luxembourgish:&lt;/p&gt;
&lt;iframe width=&#34;100%&#34; height=&#34;100%&#34; src=&#34;https://www.youtube.com/embed/NnUf6nkZInM&#34; frameborder=&#34;0&#34; allowfullscreen style=&#34;max-width:100%; height:55vh;&#34;&gt;
&lt;/iframe&gt;
&lt;p&gt;Here is another video of him speaking French:&lt;/p&gt;
&lt;iframe width=&#34;100%&#34; height=&#34;100%&#34; src=&#34;https://www.youtube.com/embed/U4G8P_z84GU&#34; frameborder=&#34;0&#34; allowfullscreen style=&#34;max-width:100%; height:55vh;&#34;&gt;
&lt;/iframe&gt;
&lt;p&gt;Here he’s speaking German :&lt;/p&gt;
&lt;iframe width=&#34;100%&#34; height=&#34;100%&#34; src=&#34;https://www.youtube.com/embed/tblafXTQ2_w?start=120&#34; frameborder=&#34;0&#34; allowfullscreen style=&#34;max-width:100%; height:55vh;&#34;&gt;
&lt;/iframe&gt;
&lt;p&gt;And here English :&lt;/p&gt;
&lt;iframe width=&#34;100%&#34; height=&#34;100%&#34; src=&#34;https://www.youtube.com/embed/-ensRTwpjXk?start=185&#34; frameborder=&#34;0&#34; allowfullscreen style=&#34;max-width:100%; height:55vh;&#34;&gt;
&lt;/iframe&gt;
&lt;p&gt;In the English video, you might notice the typical accent Luxembourgers have when speaking English 😄&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-text-were-analysing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The text we’re analysing&lt;/h2&gt;
&lt;p&gt;The text I’ll be analyzing is called &lt;em&gt;Renert oder de Fuuss am Frack an a Maansgréisst&lt;/em&gt;, published in 1872 by Michel Rodange. My high school was named after Michel Rodange, by the way! &lt;em&gt;Renert&lt;/em&gt; is a fable featuring a sly fox as the main character, called Renert. He gets in trouble because of his shenanigans and gets sentenced to death by the Lion King. However, through further lies and deceptions, he manages to escape. After some tribulations, he proves his worth to the King by winning a duel against the wolf and becomes an aristocrat. Because it was written in the 19th century, the way some words are written may be different from how we write them in modern Luxembourgish, which might create some problems when analyzing the text.&lt;/p&gt;
&lt;p&gt;Now starts the technical part. If you’re only interested in the results, you can skip to &lt;a href=&#34;http://www.blog.rdata.lu/post/2018-01-26-analysis-of-the-renert-part-3/&#34;&gt;part 3&lt;/a&gt;!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;scraping-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Scraping the data&lt;/h2&gt;
&lt;p&gt;First of all, let’s load (or install if you don’t have them) the needed packages:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(c(&amp;quot;tidyverse&amp;quot;,
                   &amp;quot;tidytext&amp;quot;,
                   &amp;quot;janitor&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;tidyverse&amp;quot;)
library(&amp;quot;tidytext&amp;quot;)
library(&amp;quot;janitor&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;tidyverse&lt;/code&gt; is a collection of packages that are very useful for a lot of different tasks. If you are not familiar with these packages, check out the tidyverse &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;tidytext&lt;/code&gt; is a package that follows the same principles as the &lt;code&gt;tidyverse&lt;/code&gt;, but for text analysis. You can learn more about it &lt;a href=&#34;https://www.tidytextmining.com/&#34;&gt;here&lt;/a&gt;, in the book I took inspiration from for this series of blog posts.&lt;/p&gt;
&lt;p&gt;The full text of the Renert is available &lt;a href=&#34;https://wikisource.org/wiki/Renert&#34;&gt;here&lt;/a&gt;, so I’m going to use &lt;code&gt;rvest&lt;/code&gt; to get the text into R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_link = &amp;quot;https://wikisource.org/wiki/Renert&amp;quot;

renert_raw = renert_link %&amp;gt;%
  xml2::read_html() %&amp;gt;%
  rvest::html_nodes(&amp;quot;.mw-parser-output&amp;quot;) %&amp;gt;%
  rvest::html_text() %&amp;gt;%
  str_split(&amp;quot;\n&amp;quot;, simplify = TRUE) %&amp;gt;%
  .[1, -c(1:24)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I download the text using &lt;code&gt;read_html()&lt;/code&gt; from the &lt;code&gt;xml2&lt;/code&gt; package (which gets loaded by the &lt;code&gt;tidyverse&lt;/code&gt;) and then find the nodes that interest me, in this case &lt;code&gt;mw-parser-output&lt;/code&gt;. Then I extract the text from this node and split it on the &lt;code&gt;\n&lt;/code&gt; character, to get a big vector where each element is a line of text. I also remove the first 24 lines, which are mostly blank. Let’s take a look at the first five lines:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_raw[1:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;Éischte Gesank.[edit]&amp;quot;       &amp;quot;&amp;quot;                           
## [3] &amp;quot;Et war esou ëm d&amp;#39;Päischten,&amp;quot; &amp;quot;&amp;#39;T stung Alles an der Bléi,&amp;quot;
## [5] &amp;quot;An d&amp;#39;Villercher di songen&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Renert is divided into 14 songs, so I’d like to create a list with 14 elements, where each element is the text of a song. Every song is titled “First Song”, “Second Song”, etc., so I first check which lines contain the word &lt;em&gt;Gesank&lt;/em&gt;, which identifies the start of a &lt;em&gt;song&lt;/em&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(indices = grepl(&amp;quot;Gesank&amp;quot;, renert_raw) %&amp;gt;% which())&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]    1  605  885 1172 1555 1906 2441 2664 2995 3686 4214 4625 5116 5963&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;indices&lt;/code&gt; contains the indices where the songs start, so I also need the indices where the songs end. If you think about it, the first song ends where the second song begins, minus 1. So I create a new vector of indices by removing the index of the first song, subtracting 1, and then appending the index of the last line (using &lt;code&gt;length(renert_raw)&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(indices2 = c(indices[-1] - 1, length(renert_raw)))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]  604  884 1171 1554 1905 2440 2663 2994 3685 4213 4624 5115 5962 6506&lt;/code&gt;&lt;/pre&gt;
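A quick sanity check on this pairing, using the numbers printed above: every start index should come before its matching end index, and consecutive songs should tile the text without gaps.

```r
# Start and end indices of the 14 songs, as printed above.
indices  = c(1, 605, 885, 1172, 1555, 1906, 2441, 2664, 2995, 3686, 4214, 4625, 5116, 5963)
indices2 = c(604, 884, 1171, 1554, 1905, 2440, 2663, 2994, 3685, 4213, 4624, 5115, 5962, 6506)

stopifnot(length(indices) == length(indices2))                  # 14 songs
stopifnot(all(indices <= indices2))                             # starts precede ends
stopifnot(all(indices[-1] == indices2[-length(indices2)] + 1))  # spans tile the text
```

All three assertions pass, so every line of the scraped text belongs to exactly one song.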
&lt;p&gt;I can now create a list of sequences, called &lt;code&gt;song_lines&lt;/code&gt; which contains the indices for all the songs:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;song_lines = map2(indices, indices2,  ~seq(.x,.y))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And using this list of indices, I can now extract the songs into a list:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_songs = map(song_lines, ~renert_raw[.x])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll save this object for later use, using &lt;code&gt;saveRDS()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;saveRDS(renert_songs, &amp;quot;renert_songs.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I will also save a version of the above list, but where each element of the list is a data frame. This will make analysis much easier later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_songs_df = map(renert_songs, ~data_frame(text = .))
saveRDS(renert_songs_df, &amp;quot;renert_songs_df.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also need the full text as a single object, so I reduce my list into one character vector, put it in a data frame (dropping the song titles), and save it:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;renert_full = reduce(renert_songs, c)

renert_full = data_frame(text = renert_full) %&amp;gt;%
  filter(!grepl(&amp;quot;Gesank&amp;quot;, text))

saveRDS(renert_full, &amp;quot;renert_full.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the end of part 1. In &lt;a href=&#34;http://www.blog.rdata.lu/post/2018-01-24-analysis-of-the-renert-part-2/&#34;&gt;part 2&lt;/a&gt;, we are going to prepare the data for analysis, and in &lt;a href=&#34;http://www.blog.rdata.lu/post/2018-01-26-analysis-of-the-renert-part-3/&#34;&gt;part 3&lt;/a&gt; we are going to analyze it. &lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;
Don’t hesitate to follow us on twitter &lt;a href=&#34;https://twitter.com/rdata_lu&#34; target=&#34;_blank&#34;&gt;&lt;span class=&#34;citation&#34;&gt;@rdata_lu&lt;/span&gt;&lt;/a&gt; &lt;!-- or &lt;a href=&#34;https://twitter.com/brodriguesco&#34;&gt;@brodriguesco&lt;/a&gt; --&gt; and to &lt;a href=&#34;https://www.youtube.com/channel/UCbazvBnJd7CJ4WnTL6BI6qw?sub_confirmation=1&#34; target=&#34;_blank&#34;&gt;subscribe&lt;/a&gt; to our youtube channel. &lt;br&gt; You can also contact us if you have any comments or suggestions. See you for the next post!
&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
