Introduction to Data Analysis

11.2. Network(d)s

This section brings a bit of text mining into network analysis: we will turn word associations into network ties and visualize the resulting word network. The code draws on the ideas of Cornelius Puschmann, and the example data is Julian Assange's address at the United Nations in September 2012.

packages <- c("intergraph", "GGally", "ggplot2", "network", "RColorBrewer", 
    "sna", "tm")
packages <- lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
    }
})

Our first step is to read the text file, extract all words from the speech, and extract all associations between adjacent words, like “welcome to” or “fine deeds”. We replace Cornelius Puschmann's original code with a functional approach to the job, using one of the many R apply functions (sapply, which is part of base R) for the main word association routine.

build.corpus <- function(x, skip = 0) {
    # Read the text source.
    src = scan(x, what = "char", sep = "\n", encoding = "UTF-8", skip = skip)
    # Extract all words.
    txt = unlist(strsplit(gsub("[[:punct:][:digit:]]", " ", tolower(src)), 
        "[[:space:]]+"))
    # Remove single letters.
    txt = txt[nchar(txt) > 1]
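    # Function to create word pairs (network ties): each word is paired
    # with its successor, and pairs containing a stopword are dropped.
    associate <- function(x) {
        y = c(txt[x], txt[x + 1])
        if (!any(y %in% c("", stopwords("en")))) 
            y
    }
    # Build word network; simplify = FALSE keeps the result a list
    # even when no pair is dropped, as do.call expects.
    net = do.call(rbind, sapply(1:(length(txt) - 1), associate, simplify = FALSE))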
    # Return network object.
    return(network(net))
}
# Example data.
net <- build.corpus("data/assange.txt")
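To make the word association step concrete, here is a minimal base R sketch of the same idea on a toy sentence. The sentence and the short `stops` vector are made up for illustration (the function above uses the full stopword list from the tm package), and the pairs are built in a vectorized way rather than with the `associate` helper.

```r
# Toy input; the sentence and the stopword list are assumptions for illustration.
src <- "Welcome to the United Nations, where fine words meet fine deeds."
stops <- c("to", "the", "where")

# Same cleaning steps as build.corpus: lower-case, strip punctuation and
# digits, split on whitespace, drop single letters.
txt <- unlist(strsplit(gsub("[[:punct:][:digit:]]", " ", tolower(src)),
                       "[[:space:]]+"))
txt <- txt[nchar(txt) > 1]

# Pair each word with its successor.
pairs <- cbind(from = head(txt, -1), to = tail(txt, -1))

# Keep only the pairs in which neither word is a stopword.
keep <- !(pairs[, "from"] %in% stops | pairs[, "to"] %in% stops)
edges <- pairs[keep, , drop = FALSE]
print(edges)
```

Each remaining row is one directed tie from a word to the word that follows it, which is the kind of two-column edge list that the function passes to network.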

The word network is plotted as a very sparse network, trimmed to non-trivial words that take part in at least three word associations in the speech (the subset argument, which filters on degree). Trivial words are removed by matching them against the list of English stopwords found in the tm package. We again use the ggnet function, but have a look at the igraph package for an alternative.

# Plot with ggnet.
ggnet(net, weight = "degree", subset = 3,
      alpha = 1, segment.color = "grey", label = TRUE, vjust = - 2,
      legend = "none")
52 nodes, weighted by freeman 

       id indegree outdegree freeman
39 states       11         6      17
45 united        0        14      14
47     us        1        12      13
15   fine        1         8       9
50  words        8         0       8
26  obama        6         1       7

[Figure: word association network of the Assange speech]
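The indegree, outdegree and freeman columns in the output above can be reproduced by hand from an edge list. The sketch below uses a small, hypothetical edge list (not taken from the speech) and base R only: the Freeman degree is the sum of incoming and outgoing ties, which matches the table (for "states", 11 + 6 = 17).

```r
# Hypothetical directed edge list (sender, receiver), for illustration only.
edges <- rbind(
  c("united", "states"),
  c("united", "nations"),
  c("states", "must"),
  c("fine",   "words"),
  c("fine",   "deeds")
)

# All words that appear as sender or receiver.
words <- sort(unique(as.vector(edges)))

# Count outgoing ties (first column) and incoming ties (second column).
outdegree <- table(factor(edges[, 1], levels = words))
indegree  <- table(factor(edges[, 2], levels = words))

deg <- data.frame(id = words,
                  indegree = as.vector(indegree),
                  outdegree = as.vector(outdegree))
# Freeman degree: total number of ties, incoming plus outgoing.
deg$freeman <- deg$indegree + deg$outdegree

# Trim to words with at least two ties, similar in spirit to the
# threshold passed to the plotting function.
kept <- deg[deg$freeman >= 2, ]
print(kept)
```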

Since the corpus is built by a single function call, we can now take any text, prepare it and plot it in just a few lines. The next example is a plot of word associations in Cory Doctorow's speech to the Chaos Communication Congress in December 2011. Try running the same graph on any plain text speech file, such as one by Barack Obama.

# Target locations
link = "https://raw.github.com/jwise/28c3-doctorow/master/transcript.md"
file = "data/doctorow.txt"
# Download speech.
if (!file.exists(file)) download.file(link, file, mode = "wb")
# Build corpus.
net <- build.corpus(file, skip = 37)
# Plot with ggnet.
ggnet(net, weight = "indegree", subset = 5,
      alpha = 1, segment.color = "grey", label = TRUE, vjust = - 2,
      legend = "none")
29 nodes, weighted by indegree 

          id indegree outdegree freeman
11    laughs        7         1       8
3   computer        5         1       6
17   program        5         1       6
26      wars        5         0       5
4  computers        4         3       7
2        can        3         2       5

[Figure: word association network of the Doctorow speech]
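The skip argument matters when a text file starts with lines that are not part of the speech, such as a transcript header. As a minimal sketch, the following writes a made-up four-line file to a temporary location and reads it back with the same scan call as above, skipping the first two lines. The file contents are assumptions for illustration.

```r
# Write a small stand-in file: two header lines, then two lines of text.
tmp <- tempfile(fileext = ".txt")
writeLines(c("# Transcript",
             "Speaker: someone",
             "First line of the speech.",
             "Second line."), tmp)

# skip = 2 drops the first two lines before any text is read.
body <- scan(tmp, what = "char", sep = "\n", encoding = "UTF-8",
             skip = 2, quiet = TRUE)
print(body)

# Clean up the temporary file.
unlink(tmp)
```

When adapting the example to another file, count the header lines in a text editor first and pass that number as skip.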

Next week: Data-driven advances.