Introduction to Data Analysis

11.1. Influence

This session is inspired by Tony Hirst's exploration of Twitter networks and content. His method is not based on R: he uses OpenRefine to process the data, then Gephi to visualize it as network plots and run some basic network analysis. For consistency, we will run everything in R, knowing that there are alternative workflows like Hirst's.

# Load packages.
packages <- c("downloader", "intergraph", "GGally", "ggplot2", "network", "RColorBrewer", 
    "sna")
packages <- lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
    }
})

We will use our very own ggnet function to produce the plots with ggplot2: see this blog post (in French) and these slides (in English/French) for some construction notes, and see David Sparks's networks with Bézier curves for an elegant variation of network plots drawn with ggplot2.

Regarding the data, it used to be pretty straightforward to mine Twitter data with R and the twitteR package, and there are nice examples of such exercises in Gaston Sanchez's “Mining Twitter” and in the Oxford Internet Institute's “Network Visualization”. Both are on GitHub if you want to take a look at the code.

You can still replicate these examples, but only if you authenticate with Twitter first, which we will skip. Instead, we will rely on the data that were collected to illustrate the ggnet function. This network contains 339 Twitter accounts used by French MPs in May 2013 (see this blog post for data construction details).

# Locate and save the network data.
net = "data/network.tsv"
ids = "data/nodes.tsv"
zip = "data/twitter.an.zip"
if (!file.exists(zip)) {
    download("https://raw.github.com/briatte/ggnet/master/network.tsv", net)
    download("https://raw.github.com/briatte/ggnet/master/nodes.tsv", ids)
    zip(zip, file = c(net, ids))
    file.remove(net, ids)
}
# Get data on current French MPs.
ids = read.csv(unz(zip, ids), sep = "\t")
# Get the follower/following links between their Twitter accounts.
net = read.csv(unz(zip, net), sep = "\t")
# Copy network data for later use.
ndf = net
# Convert it to a network object.
net = network(net)

Once the links have been converted to a network object, plotting the network is very easy: we just pass the object to the ggnet function, along with some information on how to color the nodes by parliamentary group and how to size them by degree. The README file for the ggnet function has more examples.

mps = data.frame(Twitter = network.vertex.names(net))
# Set the French MP party colours.
mp.groups = merge(mps, ids, by = "Twitter")$Groupe
mp.colors = brewer.pal(9, "Set1")[c(3, 1, 9, 6, 8, 5, 2)]
# First ggnet example plot.
ggnet(net, 
      weight = "degree", 
      quantize = TRUE,
      node.group = mp.groups, 
      node.color = mp.colors,
      names = c("Group", "Links")) + 
  theme(text = element_text(size = 16))
339 nodes, weighted by freeman 

                 id indegree outdegree freeman
60  claudebartolone      200       238     438
223       marclefur       74       180     254
39    c_capdevielle       72       181     253
178        JJUrvoas      150        91     241
108    faureolivier      135        95     230
331       vpecresse      142        87     229

[Figure: ggnet plot of the network, with nodes coloured by parliamentary group and sized by degree.]

The method used here to position the nodes in a force-directed layout is the Fruchterman-Reingold algorithm. The algorithm places the nodes at random at its initial stage and therefore generates a different result on each run. Run the ggnet call above several times to view the same network under similar layouts obtained from different random starting positions.
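
To get the same layout on every run, one option is to fix R's random seed before computing it. The sketch below is only illustrative: it assumes that the layout comes from the gplot.layout.fruchtermanreingold function of the sna package (our understanding of how ggnet positions nodes, not something shown above), and it uses a made-up six-node ring rather than the MP network.

```r
library(sna)

# Toy adjacency matrix (hypothetical data): six nodes joined in a ring.
n <- 6
ring <- matrix(0, n, n)
for (i in 1:n) {
  j <- i %% n + 1  # next node around the ring
  ring[i, j] <- 1
  ring[j, i] <- 1
}

# Same seed, same random starting positions, hence the same coordinates.
set.seed(42)
layout_a <- gplot.layout.fruchtermanreingold(ring, layout.par = NULL)
set.seed(42)
layout_b <- gplot.layout.fruchtermanreingold(ring, layout.par = NULL)
identical(layout_a, layout_b)

# Without resetting the seed, the next run draws new starting positions.
layout_c <- gplot.layout.fruchtermanreingold(ring, layout.par = NULL)
identical(layout_a, layout_c)
```

If that assumption holds, calling set.seed() just before the plotting call should freeze the plot in the same way.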

Network centrality

The nodes of the network are MPs with Twitter accounts, and the network is formed by all “follower/following” directed links between them. Use these simple custom functions to explore the network by asking simple questions, like “who is following…” or “how many members of each group are following…”:

# Recall network data structure.
head(ndf)
# Load network functions.
code = "https://raw.github.com/briatte/ggnet/master/functions.R"
downloader::source_url(code, prompt = FALSE)
# A few simple examples.
x = who.follows(ndf, "nk_m")
y = who.is.followed.by(ndf, "JacquesBompard")
# A more subtle measure.
lapply(levels(ids$Groupe), top.group.outlinks, net = ndf)

In this network, the indegree of a node is its number of followers, i.e. the number of nodes that link to it, and the outdegree is the number of outgoing connections from that same node, i.e. the number of accounts that it follows. The total degree of a node (the sum of its indegree and outdegree, reported as “freeman” in the output above) is a possible measure of network centrality, as is betweenness:

# Calculate network betweenness.
top.mps = order(betweenness(net), decreasing = TRUE)
# Get the names of the vertices.
top.mps = cbind(top.mps, network.vertex.names(net)[top.mps])
# Show the top 6.
head(top.mps)
     top.mps                  
[1,] "60"    "claudebartolone"
[2,] "74"    "Dbussereau"     
[3,] "223"   "marclefur"      
[4,] "331"   "vpecresse"      
[5,] "178"   "JJUrvoas"       
[6,] "108"   "faureolivier"   
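
The degree measures described above can also be recomputed by hand from an edge list. The sketch below uses base R only, on a made-up five-link edge list with hypothetical column names (from, to) rather than the MP data.

```r
# Toy directed edge list: each row is a link from `from` to `to`.
links <- data.frame(
  from = c("a", "b", "c", "a", "d"),
  to   = c("b", "c", "a", "c", "a"),
  stringsAsFactors = FALSE
)

# All nodes that appear at either end of a link.
nodes <- sort(unique(c(links$from, links$to)))

# Indegree counts incoming links, outdegree counts outgoing links.
# Using a factor with all node levels keeps nodes with zero links.
indegree  <- table(factor(links$to, levels = nodes))
outdegree <- table(factor(links$from, levels = nodes))

degree <- data.frame(
  id        = nodes,
  indegree  = as.vector(indegree),
  outdegree = as.vector(outdegree)
)

# Total degree is the sum of indegree and outdegree.
degree$total <- degree$indegree + degree$outdegree

# Rank nodes from most to least connected.
degree[order(degree$total, decreasing = TRUE), ]
```

Ranking by total degree and ranking by betweenness need not agree, since a node with few links can still sit on many shortest paths.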

Centrality is useful to detect influential or important network members, as illustrated by Claude Bartolone's central position among all French MPs on Twitter (Bartolone currently chairs the lower house of the French parliament). For a fun exploration of network centrality, see Kieran Healy's brilliant take on the issue, using R in the 18th century.
