String manipulations on full names

This note shows how to use the stringr package to clean a list of full names that need to be turned into unique identifiers, i.e. something that can be assigned as row names to a data frame.

Example

Let's start with a list of real names: the 183 full names of the people currently sitting in the lower chamber of the Austrian parliament. We use the rvest package to find the link to the full list, and then to scrape the name values from that list:

The names extracted through this method need a lot of fixing before we can use them as readable unique identifiers:

"Doris Bures"                        "Karlheinz Kopf"                    
"Ing. Norbert Hofer"                 "\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n"
"\r\n\t\t\r\nAmon Werner, MBA\r\n"   "\r\n\t\t\r\nAngerer Erwin\r\n"

There is, in fact, a method to get clean names, but it involves scraping one page per row in the data, which is not always desirable or feasible.

Method

Before we start, let's remark that text manipulation almost always calls for an idiosyncratic solution: depending on how messy the text is, the solution will rely on specific conditions being met (or, just as importantly, never being met) in the data. Here, we assume that we are working with full names.

Working with names means that you are not working with full sentences, so you will want to eliminate some sentence-related characters, such as endmarks. However, with full names, you might need to preserve other punctuation marks, such as apostrophes, dashes or even periods, which are used in titles.

What you need in this scenario is a set of functions that deal with character duplication, exclusion and spacing. Spacing matters both because of punctuation rules and because a long list of human inputs is very likely to get it wrong at least once. Last, we will add subsetting to that list.

We will clean the data by writing short string functions in the style of the stringr package, which provides a front-end to stringi that uses fast C code to process strings. All stringr functions start with str_ and use sensible verbs. The functions below also start with str_ but are not necessarily verb-based.

Loading stringr also makes the %>% pipe available, because stringr re-exports it from magrittr, so we will be able to use pipes without loading that last package. We will also load the dplyr package when we get to postprocessing and de-duplication.

Whitespace

Whitespace at the beginning or at the end of a word is a common feature of badly formatted text data, as are problematic whitespace characters such as carriage returns. The other error that frequently pops up is the presence of multiple spaces instead of a single one, as in "this  example".

Extra spaces are easy to fix, and fixing them also offers the opportunity to treat special characters such as line breaks (\n), carriage returns (\r) and tabs (\t) as whitespace. A simple call to gsub, which we will embed into a stringr-like function, is sufficient here:
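A minimal sketch of such a function, assuming that every run of whitespace should collapse into one space, could look as follows. The name str_space is an assumption made for illustration.

```r
# Hypothetical helper (the name str_space is an assumption):
# replace every run of whitespace, including \r, \n and \t, with one space.
str_space <- function(x) {
  gsub("[[:space:]]+", " ", x)
}

x <- c("Doris Bures", "\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n", "this  example")
str_space(x)
```

This sketch deliberately leaves one leading or trailing space in place where the input started or ended with whitespace; trimming is left to a later step.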

After running the function on the data, the names are stripped of line breaks, tabs and repeated spaces, as shown in the results below (original data in left column, processed data in right column):

                              from                   to
                       Doris Bures          Doris Bures
                    Karlheinz Kopf       Karlheinz Kopf
                Ing. Norbert Hofer   Ing. Norbert Hofer
\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n  Alm Nikolaus, Mag. 
  \r\n\t\t\r\nAmon Werner, MBA\r\n    Amon Werner, MBA 
     \r\n\t\t\r\nAngerer Erwin\r\n       Angerer Erwin

Note that some names still end with a space, an issue that we will fix right at the end of the cleaning process.

Punctuation

Assuming that you are working with names and that you aim at matching some set of punctuation rules, you will want to treat some punctuation characters as whitespace and remove them, while preserving and adding space after some others.

Names are easy to process because only a few punctuation marks need to be preserved. Working with full sentences would require some coding effort to handle endmarks correctly; instead, we are taking an important shortcut by removing those.

Since some punctuation marks are being preserved, such as dashes, we will also want to make sure that there is a single punctuation mark between two words, in order to treat this-example and this--example as duplicated items.

The following function applies all these rules sequentially:
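As a hedged sketch of these rules, the function below assumes that apostrophes, dashes, periods and commas are the only marks worth keeping. The name str_punct and the exact set of preserved marks are assumptions, not something the data dictate.

```r
library(stringr)

# Hypothetical helper (name and rule set are assumptions).
str_punct <- function(x) {
  # 1. collapse repeated preserved marks: "this--example" becomes "this-example"
  x <- str_replace_all(x, "([\\-'.,])\\1+", "\\1")
  # 2. treat every other punctuation character as whitespace
  x <- str_replace_all(x, "[^\\p{L}\\p{N}\\s\\-'.,]", " ")
  # 3. no space before a period or comma, exactly one space after it
  x <- str_replace_all(x, "\\s*([.,])\\s*", "\\1 ")
  # 4. remove periods, commas and spaces located at the end of the string
  x <- str_replace(x, "[\\s.,]+$", "")
  # 5. collapse any run of spaces created by the previous steps
  str_replace_all(x, "\\s+", " ")
}

str_punct(c("Ing. Norbert Hofer", "Alm Nikolaus, Mag. ", "this--example"))
```

Because step 4 anchors on the end of the string, the period in "Ing." survives while the one in "Mag." does not.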

The top of the results shows no big difference, except for one endmark that was removed because it was located at the end of the string:

               from                 to
        Doris Bures        Doris Bures
     Karlheinz Kopf     Karlheinz Kopf
 Ing. Norbert Hofer Ing. Norbert Hofer
Alm Nikolaus, Mag.   Alm Nikolaus, Mag
  Amon Werner, MBA    Amon Werner, MBA
     Angerer Erwin       Angerer Erwin

Subsetting

The data contain both prefixes, like "Ing.", and suffixes, like "MBA". Let's write a short function to find either part of the names, in order to remove them. The function is not called str_subset because there is already such a function in the stringr package.

The str_filter function takes three arguments:

sep is the separator that starts or ends the part of the string that we want to remove
side is the side of the string on which the part of the string is expected to be found
greedy asks if all prefixes or suffixes, if there are more than one, should be removed

The function can match either prefixes or suffixes, in any quantity:
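One way to write str_filter with the three arguments described above is sketched below. The regular expressions are an assumption about how the matching could be done, and the names "Jane Doe" and "Dr." in the usage lines are made up to show the greedy argument.

```r
library(stringr)

# A sketch of str_filter; sep is a regular expression.
str_filter <- function(x, sep, side = c("left", "right"), greedy = TRUE) {
  side <- match.arg(side)
  sep <- paste0("(?:", sep, ")")  # group the separator so alternations stay intact
  if (side == "left") {
    # greedy: drop everything up to the last separator; otherwise up to the first
    pattern <- paste0("^.*", if (greedy) "" else "?", sep, "\\s*")
  } else if (greedy) {
    # drop everything from the first separator onwards
    pattern <- paste0("\\s*", sep, ".*$")
  } else {
    # drop only the last separator and what follows it
    pattern <- paste0("\\s*", sep, "(?:(?!", sep, ").)*$")
  }
  str_replace(x, pattern, "")
}

str_filter("Ing. Norbert Hofer", "\\.", "left")                  # "Norbert Hofer"
str_filter("Alm Nikolaus, Mag", ",", "right")                    # "Alm Nikolaus"
str_filter("Dr. Ing. Jane Doe", "\\.", "left", greedy = FALSE)   # "Ing. Jane Doe"
str_filter("Jane Doe, MBA, PhD", ",", "right", greedy = FALSE)   # "Jane Doe, MBA"
str_filter("Doris Bures", ".", "left")                           # "" (unescaped separator)
```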

Note that the function needs the user to escape any special character in the sep argument: using . as a separator will create a destructive regular expression that will eliminate the entire string. Also note that the function will strip any space located around the separator.

We run the function twice on our list of names: first with side set to "right" and sep set to ",", in order to remove "MBA" and similar suffixes, and then with side set to "left" and sep set to "\\." to remove "Ing." and similar prefixes. The results show neither of these:

              from             to
       Doris Bures    Doris Bures
    Karlheinz Kopf Karlheinz Kopf
Ing. Norbert Hofer  Norbert Hofer
 Alm Nikolaus, Mag   Alm Nikolaus
  Amon Werner, MBA    Amon Werner
     Angerer Erwin  Angerer Erwin

It is important to clean the "right-hand side" of the names before cleaning the "left-hand side": a suffix that contains a period would otherwise be mistaken for the end of a prefix, and the name in the middle would be lost.

It is also important for our solution that the prefixes do not contain any commas, otherwise the prefix and suffix patterns would get mixed up and the results would fail to identify the name in the middle. If the prefixes contained commas, we would need to be more cautious and use str_locate_all to subset the names more carefully.

Detaching

A related function can extract the prefix or suffix and the name to a list:

The function uses the str_filter function to find the part of the string that is not considered as a prefix or suffix, and then uses the (vectorized) str_replace function from the stringr package to remove that part of the string from the original text. When there is no prefix or suffix, the result is a missing value:
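The sketch below follows that description under a few assumptions: the list element names (name, affix) are invented, str_filter is repeated in a simplified, always-greedy form so that the block runs on its own, and the leftover separator is trimmed from the extracted part.

```r
library(stringr)

# Simplified, always-greedy str_filter, repeated here for self-containment.
str_filter <- function(x, sep, side) {
  pattern <- if (side == "left") paste0("^.*", sep, "\\s*") else paste0("\\s*", sep, ".*$")
  str_replace(x, pattern, "")
}

# A sketch of str_detach: returns the name and the detached prefix or suffix.
str_detach <- function(x, sep, side = c("left", "right")) {
  side <- match.arg(side)
  name <- str_filter(x, sep, side)
  # remove the name, treated as literal text, from the original string
  rest <- str_replace(x, fixed(name), "")
  # trim the separator and spaces left on either end of the remainder
  rest <- str_replace_all(rest, paste0("^(?:\\s|", sep, ")+|(?:\\s|", sep, ")+$"), "")
  rest[rest == ""] <- NA_character_  # nothing detached: missing value
  list(name = name, affix = rest)
}

x <- c("Doris Bures", "Ing. Norbert Hofer", "Alm Nikolaus, Mag")
str_detach(x, "\\.", "left")$affix   # NA "Ing" NA
str_detach(x, ",", "right")$affix    # NA NA "Mag"
```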

Applying the function to our data allows us to extract the prefix or suffix of the names. In the data extract below, the prefix column is where we used str_detach on the "left" side with separator "\\.", and suffix is the column where we targeted the "right" side with separator ",":

              from prefix suffix
       Doris Bures   <NA>   <NA>
    Karlheinz Kopf   <NA>   <NA>
Ing. Norbert Hofer    Ing   <NA>
 Alm Nikolaus, Mag   <NA>    Mag
  Amon Werner, MBA   <NA>    MBA
     Angerer Erwin   <NA>   <NA>

Postprocessing

Let's finally wrap all processing functions in one, which returns a data frame of cleaned names with their prefixes and (cleaned) suffixes:
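A compact sketch of such a wrapper is given below. Every helper is a simplified stand-in written for this example, and the wrapper name str_clean is an assumption; the order of operations (whitespace, punctuation, suffix, prefix) follows the steps described so far.

```r
library(stringr)

# Simplified stand-ins for the helpers discussed in this note (assumptions).
str_space <- function(x) str_replace_all(x, "\\s+", " ")
str_punct <- function(x) {
  x <- str_replace_all(x, "([\\-'.,])\\1+", "\\1")          # collapse repeated marks
  x <- str_replace_all(x, "[^\\p{L}\\p{N}\\s\\-'.,]", " ")  # drop other punctuation
  x <- str_replace_all(x, "\\s*([.,])\\s*", "\\1 ")         # space after . and ,
  str_replace(x, "[\\s.,]+$", "")                           # drop trailing marks
}
str_filter <- function(x, sep, side) {
  pattern <- if (side == "left") paste0("^.*", sep, "\\s*") else paste0("\\s*", sep, ".*$")
  str_replace(x, pattern, "")
}
str_detach <- function(x, sep, side) {
  name <- str_filter(x, sep, side)
  rest <- str_replace(x, fixed(name), "")
  rest <- str_replace_all(rest, paste0("^(?:\\s|", sep, ")+|(?:\\s|", sep, ")+$"), "")
  rest[rest == ""] <- NA_character_
  list(name = name, affix = rest)
}

# Hypothetical wrapper: clean, then detach suffixes before prefixes.
str_clean <- function(x) {
  y <- str_trim(str_punct(str_space(x)))
  suffix <- str_detach(y, ",", "right")
  prefix <- str_detach(suffix$name, "\\.", "left")
  data.frame(from = x, prefix = prefix$affix, name = prefix$name,
             suffix = suffix$affix, stringsAsFactors = FALSE)
}

str_clean(c("Ing. Norbert Hofer", "\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n"))
```

The call to str_trim is where the leftover leading and trailing spaces are finally removed.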

The combined results of all previous functions are shown below, with the prefix and suffix columns using str_filter, str_detach and some further text replacement to extract clean prefixes and suffixes:

                              from prefix           name suffix
                       Doris Bures   <NA>    Doris Bures   <NA>
                    Karlheinz Kopf   <NA> Karlheinz Kopf   <NA>
                Ing. Norbert Hofer    Ing  Norbert Hofer   <NA>
\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n   <NA>   Alm Nikolaus    Mag
  \r\n\t\t\r\nAmon Werner, MBA\r\n   <NA>    Amon Werner    MBA
     \r\n\t\t\r\nAngerer Erwin\r\n   <NA>  Angerer Erwin   <NA>

After inspection of the data, the code gets only one of the 186 rows wrong, due to one person having his name written differently than all others (row #76). This problematic case will be fixed in one line of code.

It also appears that there is only one name with a prefix (row #3), and that name comes from the first three rows of the data, which designate people who re-appear in the later rows but with their names ordered differently (compare rows #1-3 to #4-6). The only step left is therefore to remove the first three rows of the results.

Both steps outlined above (dropping the extra rows and fixing the sole problematic case) can be performed together with the dplyr package:
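The pattern can be sketched on the six rows shown earlier. The values bad_name and good_name are placeholders, since the actual content of row #76 is not shown here; with this toy data the recoding step therefore changes nothing.

```r
library(dplyr)

# The six example rows; the real data frame has 186 rows.
d <- data.frame(
  prefix = c(NA, NA, "Ing", NA, NA, NA),
  name   = c("Doris Bures", "Karlheinz Kopf", "Norbert Hofer",
             "Alm Nikolaus", "Amon Werner", "Angerer Erwin"),
  suffix = c(NA, NA, NA, "Mag", "MBA", NA),
  stringsAsFactors = FALSE
)

# Placeholders (assumptions): substitute the real wrong and corrected values.
bad_name  <- "Wrongly Parsed Name"
good_name <- "Correctly Parsed Name"

d <- d %>%
  slice(-(1:3)) %>%                                         # drop the first three rows
  mutate(name = ifelse(name == bad_name, good_name, name))  # fix the sole problem case
d
```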

Inverting

Let's now notice that the names are presented as family names, followed by a space, followed by first names, optionally followed by a space and one initial. Using str_count to count the number of spaces found in the names seems to confirm that this is how the data are structured.
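As a quick illustration of that check on three names taken from the data, str_count returns the number of spaces in each string, and tabulating the counts shows how many names have one, two or more separators:

```r
library(stringr)

x <- c("Alm Nikolaus", "Riemer Josef A", "El Habbassi Asdin")
n_spaces <- str_count(x, " ")
n_spaces         # 1 2 2
table(n_spaces)  # how many names have one space, two spaces, etc.
```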

If the pattern described above is fixed, a simple function can "invert" the names to have first names (and their optional initial) at the front of the family names:
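A minimal sketch, assuming that the family name is always the first word of the string, moves that word to the end. The name str_invert is an assumption.

```r
library(stringr)

# Hypothetical helper: move the first word (family name) to the end.
str_invert <- function(x) {
  str_replace(x, "^(\\S+)\\s+(.+)$", "\\2 \\1")
}

str_invert(c("Alm Nikolaus", "Riemer Josef A", "El Habbassi Asdin"))
# "Nikolaus Alm" "Josef A Riemer" "Habbassi Asdin El"
```

The last result is wrong, which is exactly what happens when a family name is made of two words.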

Inspecting the results will reveal one problematic case where the family name is made of two words ("El Habbassi Asdin"), so let's fix that by "protecting" the "El" prefix before inverting. The code below does so and then shows all cases where the name inversion might have gone wrong:
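One possible way to do this, sketched below, is to glue "El" to the next word with an underscore before inverting and to restore the space afterwards. This assumes that no name contains an underscore; str_invert is the hypothetical helper described above, repeated so that the block runs on its own.

```r
library(stringr)

# Hypothetical helper: move the first word (family name) to the end.
str_invert <- function(x) str_replace(x, "^(\\S+)\\s+(.+)$", "\\2 \\1")

x <- c("El Habbassi Asdin", "Riemer Josef A", "Alm Nikolaus")

# protect the two-word family name, invert, then restore the space
protected <- str_replace(x, "^El ", "El_")
inverted  <- str_replace_all(str_invert(protected), "_", " ")

# names with more than one space are the ones worth checking by eye
data.frame(from = x, to = inverted)[str_count(inverted, " ") > 1, ]
```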

At that stage, the results look fine even when the names are ambiguous:

                 from                    to
  Aslan Aygül Berivan   Aygül Berivan Aslan
 Bösch Reinhard Eugen  Reinhard Eugen Bösch
    El Habbassi Asdin     Asdin El Habbassi
   Eßl Franz Leonhard    Franz Leonhard Eßl
Feichtinger Klaus Uwe Klaus Uwe Feichtinger
Fekter Maria Theresia Maria Theresia Fekter
 Gamon Claudia Angela  Claudia Angela Gamon
  Karlsböck Andreas F   Andreas F Karlsböck
      Krainer Kai Jan       Kai Jan Krainer
       Riemer Josef A        Josef A Riemer

De-duplication

There are no nominal duplicates in the data, so there is no need to process the names further. However, if there were duplicates among processed names, the dplyr package would come in handy to do something like appending numbers to the duplicate names, so that they would read "Jon Example-1" and "Jon Example-2".

We therefore finalize the data by running the following code to drop the original names, invert the processed names as shown above, and then de-duplicate them if needed:
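The de-duplication step alone can be sketched as follows, on toy data that contains a deliberate duplicate. The column name uid is an assumption; the identifier is written to a new column because dplyr does not allow a grouping variable to be modified inside mutate.

```r
library(dplyr)

# Toy data with one duplicated name (not taken from the real data).
d <- data.frame(name = c("Jon Example", "Nikolaus Alm", "Jon Example"),
                stringsAsFactors = FALSE)

d <- d %>%
  group_by(name) %>%
  # append "-1", "-2", ... only within groups that hold more than one row
  mutate(uid = if (n() > 1) paste0(name, "-", row_number()) else name) %>%
  ungroup() %>%
  as.data.frame()

# the identifiers are now unique and can serve as row names
rownames(d) <- d$uid
d
```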

Inspecting the final data frame for any occurrence of "-1" confirms that the data did not contain duplicates, and we have reached our goal: all names in the data have been cleaned up and made unique.

The code featured in this note is available from this Gist, which contains a backup of the example data. As previously remarked, the code is problem-dependent: it fits the example data that we used in this note. However, there is a fair chance that the code might be reusable without too many changes in different contexts.

First published on January 8th, 2016
