Introduction to Data Analysis

# 11.1. Influence

This session is inspired by Tony Hirst's exploration of Twitter networks and content. His method is not based on R: he uses OpenRefine to process the data, then Gephi to visualize it as network plots and run some basic network analysis. For consistency, we will run everything in R, knowing that there are alternative workflows like Hirst's.

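    # Load packages (list inferred from the functions used in this session).
    packages <- c("downloader", "ggplot2", "network", "RColorBrewer", "sna")
    packages <- lapply(packages, FUN = function(x) {
      if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
      }
    })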


We will use our very own ggnet function to produce the plots with ggplot2: see this blog post (in French) and these slides (in English/French) for some construction notes, and see David Sparks's networks with Bézier curves for an elegant variation of network plots drawn with ggplot2.

Regarding the data, it used to be pretty straightforward to mine Twitter data with R and the twitteR package, and there are nice examples of such exercises in Gaston Sanchez's “Mining Twitter” and the Oxford Internet Institute's “Network Visualization”. Both are on GitHub if you want to take a look at the code.

You can still replicate these examples, but only if you authenticate with Twitter first, which we will skip. Instead, we will rely on the data that were collected to illustrate the ggnet function. This network contains 339 Twitter accounts used by French MPs in May 2013 (see this blog post for data construction details).

    # Locate and save the network data.
    net = "data/network.tsv"
    ids = "data/nodes.tsv"
    # Archive path: the original name is not shown, so this one is an assumption.
    zip = "data/network.zip"
    if (!file.exists(zip)) {
      # Both files must already be in data/ (the step that fetches them is not shown).
      zip(zip, files = c(net, ids))
      file.remove(net, ids)
    }
    # Get data on current French MPs.
    ids = read.csv(unz(zip, ids), sep = "\t")
    # Get data on their Twitter accounts.
    net = read.csv(unz(zip, net), sep = "\t")
    # Copy network data for later use.
    ndf = net
    # Convert it to a network object.
    net = network(net)


Once the network data have been converted to a network object, plotting the network is very easy: we just pass the object to the ggnet function, along with some information on how to color the points by parliamentary group and weight them by degree. The README file for the ggnet function has more examples.

    mps = data.frame(Twitter = network.vertex.names(net))
    # Set the French MP party colours.
    mp.groups = merge(mps, ids, by = "Twitter")$Groupe
    mp.colors = brewer.pal(9, "Set1")[c(3, 1, 9, 6, 8, 5, 2)]
    # First ggnet example plot.
    ggnet(net,
          weight = "degree",
          quantize = TRUE,
          node.group = mp.groups,
          node.color = mp.colors,
          names = c("Group", "Links")) +
      theme(text = element_text(size = 16))

    339 nodes, weighted by freeman

                     id indegree outdegree freeman
    60  claudebartolone      200       238     438
    223       marclefur       74       180     254
    39    c_capdevielle       72       181     253
    178        JJUrvoas      150        91     241
    108    faureolivier      135        95     230
    331       vpecresse      142        87     229

The method used here to position the data points in a force-directed graph is the Fruchterman-Reingold algorithm. The algorithm contains a random component at its initial stage and therefore generates a different layout on each run. Rerun the ggnet call above several times to view the same network under similar layouts with different random parameters.

## Network centrality

The nodes of the network are MPs with Twitter accounts, and the network is formed by all “follower/following” directed links between them. Use these simple custom functions to explore the network by asking simple questions, like “who is following…” or “how many members of each group are following…”:

    # Recall network data structure.
    head(ndf)
    # Load network functions.
    code = "https://raw.github.com/briatte/ggnet/master/functions.R"
    downloader::source_url(code, prompt = FALSE)
    # A few simple examples.
    x = who.follows(ndf, "nk_m")
    y = who.is.followed.by(ndf, "JacquesBompard")
    # A more subtle measure.
    lapply(levels(ids$Groupe), top.group.outlinks, net = ndf)
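
To get a feel for what such helper functions do, here is a minimal base R sketch of the same kind of queries on a toy edge list. The column names `from` and `to`, the accounts, the groups and the two function names are all made up for illustration; they are not the functions loaded from the ggnet repository, which may work differently.

```r
# Toy directed edge list: each row means `from` follows `to`.
# Column names and accounts are hypothetical.
edges <- data.frame(
  from = c("a", "b", "c", "a", "d"),
  to   = c("b", "c", "a", "c", "c"),
  stringsAsFactors = FALSE
)

# Accounts that follow `user` (sources of its incoming links).
followers_of <- function(edges, user) {
  sort(unique(edges$from[edges$to == user]))
}

# Accounts that `user` follows (targets of its outgoing links).
followed_by <- function(edges, user) {
  sort(unique(edges$to[edges$from == user]))
}

followers_of(edges, "c")  # "a" "b" "d"
followed_by(edges, "a")   # "b" "c"

# Hypothetical group membership, to count followers per group.
groups <- c(a = "X", b = "X", c = "Y", d = "Y")
table(groups[followers_of(edges, "c")])  # X: 2, Y: 1
```

Both queries are plain subsets of the edge list, which is why keeping a copy of the data frame (`ndf` above) is useful even after converting it to a network object.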


In this network, the indegree of a node is its number of followers, i.e. the number of nodes that link to it, and the outdegree is the number of outgoing connections from this same node. The total degree of a node (the sum of its indegree and outdegree) is a possible measure of network centrality, as is betweenness:

    # Calculate network betweenness.
    top.mps = order(betweenness(net), decreasing = TRUE)
    # Get the names of the vertices.
    top.mps = cbind(top.mps, network.vertex.names(net)[top.mps])
    # Show the top 5.
    head(top.mps, 5)
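
The degree measures described above need nothing more than an edge list. The following sketch, which assumes a toy edge list with made-up column names `from` and `to`, counts incoming and outgoing links in base R and sums them into a total degree (labelled `freeman`, as in the ggnet output).

```r
# Toy directed edge list: each row means `from` follows `to`.
# Column names and accounts are hypothetical.
edges <- data.frame(
  from = c("a", "b", "c", "a", "d"),
  to   = c("b", "c", "a", "c", "c"),
  stringsAsFactors = FALSE
)
nodes <- sort(unique(c(edges$from, edges$to)))

# Indegree: number of incoming links (followers).
indegree <- table(factor(edges$to, levels = nodes))
# Outdegree: number of outgoing links (accounts followed).
outdegree <- table(factor(edges$from, levels = nodes))

degrees <- data.frame(
  id        = nodes,
  indegree  = as.vector(indegree),
  outdegree = as.vector(outdegree)
)
# Total degree is the sum of both.
degrees$freeman <- degrees$indegree + degrees$outdegree

# Rank nodes by total degree, most connected first.
degrees[order(degrees$freeman, decreasing = TRUE), ]
```

Using `factor()` with explicit levels keeps nodes that have no incoming (or no outgoing) links in the counts, with a degree of zero.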