Weighting co-authorship networks
This note explains how to implement two edge weighting schemes that are relevant to co-authorship networks, based on my work on legislative cosponsorship networks.
For the rest of this note, the networks under consideration are directed one-mode networks that connect the first author of a text to his or her co-authors. Since co-authorship can occur more than once, and since the number of co-authors can vary, weighting the ties should be considered.
Newman-Fowler weights
In his research on legislative cosponsorship in the U.S. Congress, Fowler uses the same weighting scheme as Newman used on co-authorship networks. The only difference is that he applies the weights to directed graphs, which means that the weight of the edge from first author $i$ to coauthor $j$ is not necessarily (and in practice, rarely) equal to the reverse edge weight.
Newman and Fowler take the number of coauthors $c$ on each text into account by taking the inverse of that quantity to represent the intensity of the tie. The overall intensity of the tie between authors $i$ and $j$ is the sum of these fractions,
$$ w_{ij} = \sum_{k} \frac{ a_{k} }{ c_{k} } $$
where $a_{k} = 1$ if $i$ and $j$ are coauthors of text $k$ and $0$ otherwise.
This weighting scheme produces strictly positive weights with no upper bound. It is easy to compute by hand: if coauthor $j$ is the sole coauthor of first author $i$ on three texts, then the intensity of the tie between them will be $3$. If each of these three texts instead has two coauthors, then the intensity of the tie between $i$ and each coauthor drops to $3 \cdot \frac{1}{2} = 1.5$.
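The arithmetic above can be reproduced in a couple of lines of R. This is only a sketch: the vectors `c_sole` and `c_pair` are hypothetical names that hold the number of coauthors on each of the three texts.

```r
# Number of coauthors on each of the three texts shared by i and j
# (hypothetical vectors, for illustration only)
c_sole <- c(1, 1, 1)  # j is the sole coauthor every time
c_pair <- c(2, 2, 2)  # every text has two coauthors

# Newman-Fowler weight: sum of the inverse coauthor counts
w_sole <- sum(1 / c_sole)
w_pair <- sum(1 / c_pair)

print(c(sole = w_sole, pair = w_pair))
```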
See Fowler's Political Analysis paper, pages 468-469, for further details and examples.
Gross-Kirkland-Shalizi weights
In their paper on cosponsorship in the U.S. Senate, Gross, Kirkland and Shalizi suggest normalizing Newman-Fowler weights by the maximum possible value that these weights might take if $j$ appears on every text authored by $i$, i.e. if $a_{k} = 1$ on all $k$ texts. The resulting weights,
$$ w_{ij} = \sum_{k} \frac{ a_{k} }{ c_{k} } \cdot \Big( \sum_{k} \frac{ 1 }{ c_{k} } \Big) ^ {-1} $$
are bounded between $0$ and $1$, and stand for the weighted propensity that $j$ is a coauthor of $i$.
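To make the normalization concrete, here is a small sketch for a single pair of authors. The vectors `c_k` and `a_k` are assumptions made up for the illustration: first author $i$ has two texts, with two coauthors and one coauthor respectively, and $j$ appears only on the first one.

```r
# Texts first-authored by i (hypothetical values, for illustration only)
c_k <- c(2, 1)  # number of coauthors on each text
a_k <- c(1, 0)  # 1 if j is a coauthor of text k, 0 otherwise

nfw <- sum(a_k / c_k)      # Newman-Fowler weight of the tie i -> j
gsw <- nfw / sum(1 / c_k)  # divided by its maximum possible value

print(c(nfw = nfw, gsw = gsw))
```

Had $j$ appeared on both texts, the numerator and the denominator would be equal and the weight would reach its upper bound of $1$.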
The Gross, Kirkland and Shalizi paper, initially written by Gross alone, has not yet been published, and the online versions are not always dated. My code uses the `gsw` acronym to designate these weights because the first version of the paper that I encountered was signed only by Gross and Shalizi.
Implementation in R
Let's find a way to implement the two edge weighting schemes outlined above, while also computing the “raw” edge weights equal to the total number of co-authorship ties between two authors.
The example data look like this:
i j w
A A 2
A B 2
A C 2
B B 1
B A 1
A A 1
A C 1
B B 3
B C 3
B D 3
B E 3
This is the edge list that you get when
- author `A` has written two texts, the first one co-authored by `B` and `C` (number of co-authors: 2), the second one co-authored by `C` alone (number of co-authors: 1)
- author `B` has also written two texts, the first one co-authored by `A` alone (number of co-authors: 1), the second one co-authored by `C`, `D` and `E` (number of co-authors: 3)
There are two first authors, `A` and `B`, who are also co-authors, and three more co-authors, `C`, `D` and `E`.
Note that the edge list identifies the first authors through the self-loops, which also tell you where each text “starts” in the edge list: the first three rows correspond to the first text, the next two rows correspond to the second text, and so on. The order of the co-authors might or might not be relevant.
Here's how to get the example data in R, using the `dplyr` package to generate it from row-bound data frames, one per text. The `i` column contains the first author of each text, the `j` column contains the co-authors, and the `w` column holds the number of co-authors, which is just the number of rows in the data frame of that text, minus 1 (i.e. minus the first author):
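A minimal sketch of that step, assuming base R's `rbind` as a stand-in for `dplyr` so that it runs without extra packages. The helper `make_text` is a hypothetical name, not something from the original code.

```r
# Hypothetical helper: builds the rows of one text, first author first.
# The first row (i == j) is the self-loop; w is the number of co-authors,
# i.e. the number of rows minus 1.
make_text <- function(authors) {
  data.frame(
    i = authors[1],
    j = authors,
    w = length(authors) - 1,
    stringsAsFactors = FALSE
  )
}

# Row-bind the four texts of the example
d <- rbind(
  make_text(c("A", "B", "C")),
  make_text(c("B", "A")),
  make_text(c("A", "C")),
  make_text(c("B", "C", "D", "E"))
)

print(d)
```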
Let's extract the self-loops and create a table object, `n_au`, which contains the number of texts by each first author. In this example, both `A` and `B` have authored two texts:
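One way to sketch this in base R, with the example edge list re-entered by hand so that the snippet runs on its own:

```r
# Example edge list (self-loops mark the first author of each text)
d <- data.frame(
  i = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B"),
  j = c("A", "B", "C", "B", "A", "A", "C", "B", "C", "D", "E"),
  w = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Self-loops: one row per text
self <- subset(d, i == j)

# Number of texts by each first author
n_au <- table(self$i)
print(n_au)
```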
Going back to the main edge list, we drop the self-loops and count how many texts were co-authored by each author, storing the result in the `n_co` table object. In this example, the most active co-author is `C`, who co-authored three texts:
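A possible base R version of that step, again with the example data re-entered by hand:

```r
d <- data.frame(
  i = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B"),
  j = c("A", "B", "C", "B", "A", "A", "C", "B", "C", "D", "E"),
  w = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Drop the self-loops to keep only first author -> co-author ties
d <- subset(d, i != j)

# Number of texts co-authored by each author
n_co <- table(d$j)
print(n_co)
```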
At that stage, remember that we also want to compute the “raw” edge weights, i.e. the number of times a tie exists between two authors. In order to get that quantity, we collapse the (directed) edge list to the character vector `ij` of the form `X->Y`, where `X` is the first author and `Y` the co-author:
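A minimal sketch of the collapse and of the tabulation that follows it, assuming `paste` to build the `X->Y` strings:

```r
d <- data.frame(
  i = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B"),
  j = c("A", "B", "C", "B", "A", "A", "C", "B", "C", "D", "E"),
  w = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)
d <- subset(d, i != j)  # self-loops are not ties

# Directed ties as "first author->co-author" strings
ij <- paste(d$i, d$j, sep = "->")

# Raw edge weights: number of times each directed tie occurs
raw <- table(ij)
print(raw)
```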
The `raw` object, which contains the tabulation of the `ij` vector, correctly indicates that `A` and `C` have co-authored two texts together, while all other co-authorship ties are unique.
Let's now compute the Newman-Fowler weights. Since we have a column with the number of co-authors per text, these are pretty easy to get: all it takes is to sum the inverses of that column over the rows of each tie.
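A sketch of that aggregation with base R's `aggregate`, assuming the `ij` strings are stored as a column of the edge list and that the result is called `edges`:

```r
d <- data.frame(
  i = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B"),
  j = c("A", "B", "C", "B", "A", "A", "C", "B", "C", "D", "E"),
  w = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)
d <- subset(d, i != j)
d$ij <- paste(d$i, d$j, sep = "->")

# Newman-Fowler weights: sum of 1 / (number of co-authors) per unique tie
edges <- aggregate(w ~ ij, data = d, FUN = function(x) sum(1 / x))
print(edges)
```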
The operation above has collapsed the edge list into the following object, where the `w` column now holds the Newman-Fowler weights of each unique tie in the network. As expected, the strongest edge weight is that of the tie between authors `A` and `C`.
Let's now re-expand the `edges` object into a proper edge list by cutting the `ij` column into its two parts, while adding the raw number of co-authorship ties as the `raw` column, and renaming the `w` column to `nfw`, for “Newman-Fowler weights”:
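One way to sketch the re-expansion, assuming `strsplit` with a fixed separator to cut the `ij` strings; the intermediate `parts` object is a hypothetical name:

```r
d <- data.frame(
  i = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B"),
  j = c("A", "B", "C", "B", "A", "A", "C", "B", "C", "D", "E"),
  w = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)
d <- subset(d, i != j)
d$ij <- paste(d$i, d$j, sep = "->")
raw <- table(d$ij)
edges <- aggregate(w ~ ij, data = d, FUN = function(x) sum(1 / x))

# Split "X->Y" back into first author X and co-author Y
ij <- as.character(edges$ij)
parts <- strsplit(ij, "->", fixed = TRUE)

edges <- data.frame(
  i = sapply(parts, `[`, 1),
  j = sapply(parts, `[`, 2),
  raw = as.vector(raw[ij]),  # raw counts, matched on the tie name
  nfw = edges$w,             # Newman-Fowler weights
  stringsAsFactors = FALSE
)
print(edges)
```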
Note that the code above requires that none of the author names in the network contain the character string `->`.
We are left with the Gross-Kirkland-Shalizi weights to compute. The denominator of these weights is the maximum value that the Newman-Fowler weights might take, which can be computed from the `self` object that we created by extracting the self-loops. Here's the complete trick:
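A self-contained sketch of the trick in base R. To keep it short, it takes a shortcut that is my assumption rather than the steps above: the edge list is rebuilt by grouping on `i` and `j` directly instead of going through the `ij` strings. The `denom` object is a hypothetical name.

```r
d <- data.frame(
  i = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B"),
  j = c("A", "B", "C", "B", "A", "A", "C", "B", "C", "D", "E"),
  w = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)
self <- subset(d, i == j)  # one row per text
ties <- subset(d, i != j)  # first author -> co-author rows

# Edge list with raw counts and Newman-Fowler weights (shortcut)
ties$raw <- 1
ties$nfw <- 1 / ties$w
edges <- aggregate(cbind(raw, nfw) ~ i + j, data = ties, FUN = sum)

# Maximum possible Newman-Fowler weight for each first author
denom <- aggregate(w ~ i, data = self, FUN = function(x) sum(1 / x))

# Merge the denominator into the edge list and take the ratio
edges <- merge(edges, denom, by = "i")
edges$gsw <- edges$nfw / edges$w
print(edges)
```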
What did we do here?
- The `aggregate` function computed the maximum possible value of the Newman-Fowler weight involving each first author `i`, storing the result into the `w` column.
- The result was merged into the edge list, which now has Newman-Fowler weights in the `nfw` column and the denominator of the Gross-Kirkland-Shalizi weights in the `w` column.
- The `gsw` column was then computed as the ratio of the two columns, which will vary between $0$ and $1$.

We can actually check that this is the case by adding one line of control flow:
Last, we finalize the edge list by dropping the denominator of the Gross-Kirkland-Shalizi weights:
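Both the check and the final clean-up can be sketched as follows. So that the snippet runs on its own, the merged edge list is re-entered by hand with the values of the example, and `stopifnot` is my assumption for the one-line check.

```r
# Edge list as it stands after the merge (values from the example above):
# nfw = Newman-Fowler weights, w = Gross-Kirkland-Shalizi denominator
edges <- data.frame(
  i   = c("A", "A", "B", "B", "B", "B"),
  j   = c("B", "C", "A", "C", "D", "E"),
  raw = c(1, 2, 1, 1, 1, 1),
  nfw = c(1 / 2, 3 / 2, 1, 1 / 3, 1 / 3, 1 / 3),
  w   = c(3 / 2, 3 / 2, 4 / 3, 4 / 3, 4 / 3, 4 / 3),
  stringsAsFactors = FALSE
)
edges$gsw <- edges$nfw / edges$w

# Halt if any weight falls outside (0, 1]
stopifnot(edges$gsw > 0, edges$gsw <= 1)

# Drop the denominator, which is no longer needed
edges$w <- NULL
print(edges)
```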
Creating the weighted network is fairly straightforward from there on, and the `n_au` and `n_co` objects can further be used to create vertex attributes indicating how many texts were first-authored or co-authored by each author.
This Gist contains all code shown in this note. The dependency on the `dplyr` package can be easily removed if necessary.
- First published on September 18th, 2015