## Weighting co-authorship networks

This note explains how to implement two edge weighting schemes that are relevant to co-authorship networks, based on my work on legislative cosponsorship networks.

For the rest of this note, the networks under consideration are directed one-mode networks that connect the first author of a text to his or her co-authors. Since co-authorship can occur more than once, and since the number of co-authors can vary, weighting the ties should be considered.

## Newman-Fowler weights

In his research on legislative cosponsorship in the U.S. Congress, Fowler uses the same weighting scheme as Newman used on co-authorship networks. The only difference is that he applies the weights to directed graphs, which means that the weight of the edge from first author $i$ to coauthor $j$ is not necessarily (and in practice, rarely) equal to the reverse edge weight.

Newman and Fowler take the number of coauthors $c$ on each text into account by taking the inverse of that quantity to represent the intensity of the tie. The overall intensity of the tie between authors $i$ and $j$ is the sum of these fractions,

$$ w_{ij} = \sum_{k} \frac{ a_{k} }{ c_{k} } $$

where $a_{k} = 1$ if $i$ and $j$ are coauthors of text $k$ and $0$ otherwise.

This weighting scheme produces strictly positive weights with no upper boundary. It is easy to compute by hand: if coauthor $j$ is the sole coauthor of first author $i$ on three texts, then the intensity of the tie between them will be $3$. If each of these three texts has two coauthors, then the intensity of the tie between $i$ and each coauthor drops to $3 \cdot \frac{1}{2} = 1.5$.
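The arithmetic above can be reproduced in a couple of lines of R. This is only an illustrative sketch: `nf_weight` is a hypothetical helper name, and its argument is assumed to be the vector of co-author counts $c_k$ for the texts on which $j$ co-authors with $i$.

```r
# Hypothetical helper: Newman-Fowler weight of one tie, given the number of
# co-authors c_k on each text where j is a co-author of first author i
nf_weight <- function(n_coauthors) sum(1 / n_coauthors)

nf_weight(c(1, 1, 1)) # j is the sole co-author on three texts: 3
nf_weight(c(2, 2, 2)) # each of the three texts has two co-authors: 1.5
```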

See Fowler's *Political Analysis* paper, pages 468-469, for further details and examples.

## Gross-Kirkland-Shalizi weights

In their paper on cosponsorship in the U.S. Senate, Gross, Kirkland and Shalizi suggest normalizing Newman-Fowler weights by the maximum possible value that these weights might take if $j$ appears on every text authored by $i$, i.e. if $a_{k} = 1$ on all $k$ texts. The resulting weights,

$$ w_{ij} = \sum_{k} \frac{ a_{k} }{ c_{k} } \cdot \Big( \sum_{k} \frac{ 1 }{ c_{k} } \Big) ^ {-1} $$

are bounded between $0$ and $1$, and stand for the weighted propensity that $j$ is a coauthor of $i$.
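As a small worked example, suppose that first author $i$ has written two texts, one with two co-authors and one with a single co-author. The sketch below, in which `gks_weight` and its argument names are hypothetical, computes the weight for a co-author who appears on both texts and for one who appears only on the first.

```r
# Hypothetical helper: c_shared holds the co-author counts of the texts by i
# that include j, and c_all the co-author counts of every text by i
gks_weight <- function(c_shared, c_all) sum(1 / c_shared) / sum(1 / c_all)

c_all <- c(2, 1)            # two texts by i: two co-authors, then one
gks_weight(c(2, 1), c_all)  # j is on both texts: 1
gks_weight(2, c_all)        # j is only on the first text: 1/3
```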

The Gross, Kirkland and Shalizi paper, initially written by Gross alone, has not yet been published, and the online versions are not always dated. My code uses the `gsw` acronym to designate these weights because the first version of the paper that I encountered was signed only by Gross and Shalizi.

## Implementation in R

Let's find a way to implement the two edge weighting schemes outlined above, while also computing the “raw” edge weights equal to the total number of co-authorship ties between two authors.

The example data look like this:

```
A A 2
A B 2
A C 2
B B 1
B A 1
A A 1
A C 1
B B 3
B C 3
B D 3
B E 3
```

This is the edge list that you get when

- author `A` has written two texts, the first one co-authored by `B` and `C` (number of co-authors: 2), the second one co-authored by `C` alone (number of co-authors: 1)
- author `B` has also written two texts, the first one co-authored by `A` alone (number of co-authors: 1), the second one co-authored by `C`, `D` and `E` (number of co-authors: 3)

There are two first authors, `A` and `B`, who are also co-authors, and three more co-authors, `C`, `D` and `E`.

Note that the edge list identifies the first authors through the self-loops, which also tell you where each text “starts” in the edge list: the first three rows correspond to the first text, the next two rows correspond to the second text, and so on. The order of the co-authors might or might not be relevant.

Here's how to get the example data in R, using the `dplyr` package to generate it from row-bound data frames. The `i` column contains the first author of each text, the `j` column contains the co-authors, and the `w` column holds the number of co-authors, which is just the number of rows in the data frame, minus 1 (i.e. minus the first author):
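A minimal sketch of this step follows. It is not the original listing: the helper `make_text` is a hypothetical name, and base `rbind` stands in for the `dplyr` row-binding mentioned above so that the sketch runs without extra packages.

```r
# Hypothetical helper: one data frame per text, with the first author listed
# first in `authors` so that the first row is the self-loop
make_text <- function(authors) {
  data.frame(
    i = authors[1],          # first author, recycled over all rows
    j = authors,             # first author, then the co-authors
    w = length(authors) - 1, # number of co-authors, i.e. rows minus 1
    stringsAsFactors = FALSE
  )
}

# Base rbind is used here instead of dplyr (an assumption of this sketch)
d <- rbind(
  make_text(c("A", "B", "C")),
  make_text(c("B", "A")),
  make_text(c("A", "C")),
  make_text(c("B", "C", "D", "E"))
)
print(d)
```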

Let's extract the self-loops and create a table object, `n_au`, which contains the number of texts by each first author. In this example, both `A` and `B` have authored two texts:
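One possible way to write this step in base R is sketched below. The data frame `d` is rebuilt compactly so that the snippet runs on its own, and the object name `self` follows the name used later in this note.

```r
# Example edge list, rebuilt compactly (same rows as the listing above)
d <- data.frame(
  i = rep(c("A", "B", "A", "B"), times = c(3, 2, 2, 4)),
  j = c("A", "B", "C", "B", "A", "A", "C", "B", "C", "D", "E"),
  w = rep(c(2, 1, 1, 3), times = c(3, 2, 2, 4)),
  stringsAsFactors = FALSE
)

# Self-loops mark the first author of each text
self <- subset(d, i == j)

# Number of texts by each first author
n_au <- table(self$i)
print(n_au)
```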

Going back to the main edge list, we drop the self-loops and count how many texts were co-authored by each author, storing the result in the `n_co` table object. In this example, the most active co-author is `C`, who co-authored three texts:
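A base R sketch of this step, again rebuilding `d` so that it is self-contained (the object name `edges` follows the name used later in this note):

```r
# Example edge list, rebuilt compactly (same rows as the listing above)
d <- data.frame(
  i = rep(c("A", "B", "A", "B"), times = c(3, 2, 2, 4)),
  j = c("A", "B", "C", "B", "A", "A", "C", "B", "C", "D", "E"),
  w = rep(c(2, 1, 1, 3), times = c(3, 2, 2, 4)),
  stringsAsFactors = FALSE
)

# Keep only the ties from a first author to a co-author
edges <- subset(d, i != j)

# Number of texts co-authored by each author
n_co <- table(edges$j)
print(n_co)
```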

At that stage, remember that we also want to compute the “raw” edge weights, i.e. the number of times a tie exists between two authors. In order to get that quantity, we collapse the (directed) edge list to the character vector `ij` of the form `X->Y`, where `X` is the first author and `Y` the co-author:
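The collapse and tabulation might look like the sketch below. The edge list without self-loops is typed in directly from the example data so that the snippet runs on its own; storing `ij` as a column of `edges` is an assumption of this sketch.

```r
# Edge list without self-loops, as obtained from the example data
edges <- data.frame(
  i = c("A", "A", "B", "A", "B", "B", "B"),
  j = c("B", "C", "A", "C", "C", "D", "E"),
  w = c(2, 2, 1, 1, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Directed tie identifier of the form "X->Y"
edges$ij <- paste(edges$i, edges$j, sep = "->")

# Raw edge weights: number of times each directed tie occurs
raw <- table(edges$ij)
print(raw)
```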

The `raw` object, which contains the tabulation of the `ij` vector, correctly indicates that `A` and `C` have co-authored two texts together, while all other co-authorship ties are unique.

Let's now compute the Newman-Fowler weights. Since we have a column with the number of co-authors per text, these are pretty easy to get: all it takes is to sum the inverse of that column over each tie.
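A sketch of that aggregation, assuming the formula interface of base `aggregate` and an edge list that already carries the `ij` identifier (typed in here from the example data):

```r
# Edge list without self-loops, with one row per tie occurrence
edges <- data.frame(
  ij = c("A->B", "A->C", "B->A", "A->C", "B->C", "B->D", "B->E"),
  w = c(2, 2, 1, 1, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Sum the inverse co-author counts over every occurrence of each tie
edges <- aggregate(w ~ ij, data = edges, FUN = function(x) sum(1 / x))
print(edges) # the A->C tie gets 1/2 + 1/1 = 1.5
```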

The operation above has collapsed the edge list into the following object, where the `w` column now holds the Newman-Fowler weights of each unique tie in the network. As expected, the strongest edge weight is that of the tie between authors `A` and `C`.

Let's now re-expand the `edges` object into a proper edge list by cutting the `ij` column into its two parts, while adding the raw number of co-authorship ties as the `raw` column, and renaming the `w` column to `nfw`, for “Newman-Fowler weights”:
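One way to do this in base R is sketched below. Using `strsplit` to cut the identifier is an assumption of this sketch, and the collapsed edge list and raw counts are typed in from the example data so that the snippet runs on its own.

```r
# Collapsed edge list and raw counts, as obtained from the example data
edges <- data.frame(
  ij = c("A->B", "A->C", "B->A", "B->C", "B->D", "B->E"),
  w = c(0.5, 1.5, 1, 1 / 3, 1 / 3, 1 / 3),
  stringsAsFactors = FALSE
)
raw <- table(c("A->B", "A->C", "B->A", "A->C", "B->C", "B->D", "B->E"))

# Split "X->Y" into first author X and co-author Y
parts <- strsplit(edges$ij, "->", fixed = TRUE)

edges <- data.frame(
  i = vapply(parts, `[`, character(1), 1),
  j = vapply(parts, `[`, character(1), 2),
  raw = as.vector(raw[edges$ij]), # raw counts, looked up by tie identifier
  nfw = edges$w,                  # Newman-Fowler weights
  stringsAsFactors = FALSE
)
print(edges)
```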

Note that the code above requires that no author name in the network contains the character string `->`.

We are left with the Gross-Kirkland-Shalizi weights to compute. The denominator of these weights is the maximum value that the Newman-Fowler weights might take, which can be computed from the `self` object that we created by extracting the self-loops. Here's the complete trick:
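A sketch of how this could be written with base `aggregate` and `merge`; the intermediate name `denom` is hypothetical, and the inputs are typed in from the example data so that the snippet runs on its own.

```r
# Edge list and self-loops, as obtained from the example data
edges <- data.frame(
  i = c("A", "A", "B", "B", "B", "B"),
  j = c("B", "C", "A", "C", "D", "E"),
  raw = c(1, 2, 1, 1, 1, 1),
  nfw = c(0.5, 1.5, 1, 1 / 3, 1 / 3, 1 / 3),
  stringsAsFactors = FALSE
)
self <- data.frame(
  i = c("A", "B", "A", "B"),
  j = c("A", "B", "A", "B"),
  w = c(2, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Maximum possible Newman-Fowler weight for each first author i
denom <- aggregate(w ~ i, data = self, FUN = function(x) sum(1 / x))

# Attach the denominator to every tie, then take the ratio
edges <- merge(edges, denom, by = "i")
edges$gsw <- edges$nfw / edges$w
print(edges)
```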

What did we do here?

- The `aggregate` function computed the maximum possible value of the Newman-Fowler weight involving each first author `i`, storing the result into the `w` column.
- The result was merged into the edge list, which now has Newman-Fowler weights in the `nfw` column and the denominator of the Gross-Kirkland-Shalizi weights in the `w` column.
- The `gsw` column was created as the ratio of the two columns, which will vary between $0$ and $1$.

We can actually check that this is the case by adding one line of control flow:
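One way to write such a check is with `stopifnot`, used here as a stand-in since the original line is not reproduced. The edge list is typed in from the example data so that the snippet runs on its own.

```r
# Edge list with Newman-Fowler weights (nfw) and their maximum values (w)
edges <- data.frame(
  i = c("A", "A", "B", "B", "B", "B"),
  j = c("B", "C", "A", "C", "D", "E"),
  nfw = c(0.5, 1.5, 1, 1 / 3, 1 / 3, 1 / 3),
  w = c(1.5, 1.5, 4 / 3, 4 / 3, 4 / 3, 4 / 3),
  stringsAsFactors = FALSE
)
edges$gsw <- edges$nfw / edges$w

# Halts with an error if any weight is not positive or exceeds 1
stopifnot(edges$gsw > 0, edges$gsw <= 1)
```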

Last, we finalize the edge list by dropping the denominator of the Gross-Kirkland-Shalizi weights:
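A sketch of that last step, selecting the columns to keep by name (the column order is an assumption of this sketch):

```r
# Edge list with all weights, as obtained from the example data
edges <- data.frame(
  i = c("A", "A", "B", "B", "B", "B"),
  j = c("B", "C", "A", "C", "D", "E"),
  raw = c(1, 2, 1, 1, 1, 1),
  nfw = c(0.5, 1.5, 1, 1 / 3, 1 / 3, 1 / 3),
  w = c(1.5, 1.5, 4 / 3, 4 / 3, 4 / 3, 4 / 3),
  stringsAsFactors = FALSE
)
edges$gsw <- edges$nfw / edges$w

# Drop the denominator column `w`, keeping the three sets of weights
edges <- edges[, c("i", "j", "raw", "nfw", "gsw")]
print(edges)
```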

Creating the weighted network is fairly straightforward from there on, and the `n_au` and `n_co` objects can further be used to create vertex attributes indicating how many texts were first-authored or co-authored by each author.

This Gist contains all the code shown in this note. The dependency on the `dplyr` package can easily be removed if necessary.

- First published on September 18th, 2015