String manipulations on full names
This note shows how to use the stringr package to clean a list of full names that need to be turned into unique identifiers, i.e. something that can be assigned as row names to a data frame.
Example
Let's start by getting a list of real names by scraping the 183 full names of the people currently sitting in the lower chamber of the Austrian parliament, using the rvest package to find the link to the full list, and then to scrape the name values from the list:
The names extracted through this method need a lot of fixing before we can use them as readable unique identifiers:
"Doris Bures" "Karlheinz Kopf"
"Ing. Norbert Hofer" "\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n"
"\r\n\t\t\r\nAmon Werner, MBA\r\n" "\r\n\t\t\r\nAngerer Erwin\r\n"
There is, in fact, a method to get clean names, but it involves scraping one page per row in the data, which is not always desirable or feasible.
Method
Before we start, let's remark that text manipulation almost always calls for an idiosyncratic solution: depending on how messy the text is, the solution will rely on specific conditions being met (or, as importantly, being never met) in the data. Here, we assume that we are working with full names.
Working with names means that you are not working with full sentences, so you will want to eliminate some sentence-related characters, such as endmarks. However, with full names, you might need to preserve other punctuation marks, such as apostrophes, dashes or even periods, which are used in titles.
What you need in this scenario is a set of functions that can deal with character duplication, exclusion and spacing. Spacing matters because of punctuation rules, and because a long list of human inputs is very likely to get it wrong at least once. Last, we will add subsetting to that list.
We will clean the data by writing short string functions in the style of the stringr package, which provides a front-end to stringi, a package that uses fast C code to process strings. All stringr functions start with str_ and use meaningful verbs. The functions below also start with str_ but are not necessarily verb-based.
Loading stringr also makes the %>% pipe from magrittr available, so we will be able to use pipes even without loading that last package. We will also load the dplyr package when we get to postprocessing and de-duplication.
Whitespace
Whitespace at the beginning or at the end of a word is a common feature of badly formatted text data, as are problematic whitespace characters such as carriage returns. The other error that frequently pops up is the presence of multiple spaces instead of a single one, as in "this  example".
Extra spaces are easy to fix, and fixing them also offers the opportunity to treat the special characters for line returns or tabs, such as \n, as whitespace. A simple call to gsub, which we will embed into a stringr-like function, is sufficient here:
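A minimal sketch of such a function, using base R only. The name str_space and the final trimming step are assumptions of this example; the results shown below still carry some trailing spaces, so the function used there trims less than this one does.

```r
# Hypothetical helper: the name str_space is an assumption of this sketch.
str_space <- function(x) {
  # Collapse every run of whitespace (spaces, tabs, \r, \n) into one space.
  x <- gsub("[[:space:]]+", " ", x)
  # Drop the single space that may remain at either end of the string.
  gsub("^ | $", "", x)
}

raw <- c("Doris Bures", "\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n", "this  example")
str_space(raw)
```

Treating \r, \n and \t as ordinary whitespace means that a single regular expression covers both the line-return debris and the duplicated spaces.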
After running the function on the data, the names are now stripped of any issue caused by whitespace, as shown in the results below (original data in left column, processed data in right column):
from                               | to
Doris Bures                        | Doris Bures
Karlheinz Kopf                     | Karlheinz Kopf
Ing. Norbert Hofer                 | Ing. Norbert Hofer
\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n | Alm Nikolaus, Mag.
\r\n\t\t\r\nAmon Werner, MBA\r\n   | Amon Werner, MBA
\r\n\t\t\r\nAngerer Erwin\r\n      | Angerer Erwin
Note that some names still end with a space, an issue that we will fix right at the end of the cleaning process.
Punctuation
Assuming that you are working with names and that you aim at matching some set of punctuation rules, you will want to treat some punctuation characters as whitespace and remove them, while preserving and adding space after some others.
Names are easy to process because only a few punctuation marks need to be preserved. Working with full sentences would require some coding effort to handle endmarks correctly; instead, we are taking an important shortcut by removing those.
Since some punctuation marks are being preserved, such as dashes, we will also want to make sure that there is a single punctuation item between two words, in order to treat this-example and this--example as duplicated items.
The following function applies all these rules sequentially:
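As an illustration, here is one way such a sequence of rules could be written in base R. The name str_punct, the set of preserved marks (comma, period, apostrophe, dash) and the order of the steps are all assumptions of this sketch, chosen to match the rules described above.

```r
# Hypothetical helper: the name str_punct and the exact rules are assumptions.
str_punct <- function(x) {
  # 1. Treat anything that is not a letter, digit, space, comma, period,
  #    apostrophe or dash as whitespace.
  x <- gsub("[^[:alnum:][:space:],.'-]", " ", x)
  # 2. Collapse repeated preserved marks ("this--example" -> "this-example").
  x <- gsub("([,.'-])\\1+", "\\1", x)
  # 3. Put exactly one space after commas and periods, and none before.
  x <- gsub("[[:space:]]*([,.])[[:space:]]*", "\\1 ", x)
  # 4. Remove endmarks: punctuation or spaces left at the end of the string.
  x <- gsub("[[:space:][:punct:]]+$", "", x)
  # 5. Collapse any run of spaces created by the steps above.
  gsub(" +", " ", x)
}

str_punct(c("Alm Nikolaus, Mag.", "Ing. Norbert Hofer", "this--example"))
```

Because the steps run sequentially, the order matters: the endmark removal comes after the spacing rule, which would otherwise re-add a space after a final period.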
The top of the results shows no big difference, except for one endmark that was removed because it was located at the end of the string:
from               | to
Doris Bures        | Doris Bures
Karlheinz Kopf     | Karlheinz Kopf
Ing. Norbert Hofer | Ing. Norbert Hofer
Alm Nikolaus, Mag. | Alm Nikolaus, Mag
Amon Werner, MBA   | Amon Werner, MBA
Angerer Erwin      | Angerer Erwin
Subsetting
The data contain both prefixes, like "Ing.", and suffixes, like "MBA". Let's write a short function to find either part of the names, in order to remove them. The function is not called str_subset because there is already such a function in the stringr package.
The str_filter function takes three arguments:
- sep is the separator that starts or ends the part of the string that we want to remove
- side is the side of the string on which the part of the string is expected to be found
- greedy asks if all prefixes or suffixes, if there are more than one, should be removed
The function can match either prefixes or suffixes, in any quantity:
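A possible implementation of a function with these three arguments is sketched below in base R. The regular expressions, the default values and the use of perl = TRUE are assumptions of this example, not a reproduction of the original code.

```r
# Hypothetical sketch of str_filter(); defaults and patterns are assumptions.
str_filter <- function(x, sep, side = c("left", "right"), greedy = TRUE) {
  side <- match.arg(side)
  if (side == "left") {
    # Greedy: drop everything up to the last separator (all prefixes).
    # Non-greedy: drop everything up to the first separator only.
    pattern <- paste0(if (greedy) "^.*" else "^.*?", sep, "\\s*")
  } else if (greedy) {
    # Drop everything from the first separator onwards (all suffixes).
    pattern <- paste0("\\s*", sep, ".*$")
  } else {
    # Drop only the last suffix: no further separator may follow the match.
    pattern <- paste0("\\s*", sep, "((?!", sep, ").)*$")
  }
  sub(pattern, "", x, perl = TRUE)
}

x <- c("Doris Bures", "Ing. Norbert Hofer", "Alm Nikolaus, Mag")
str_filter(str_filter(x, ",", "right"), "\\.", "left")
```

The sep argument is pasted into the pattern as-is, which is why a special character such as the period has to be escaped by the caller.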
Note that the function needs the user to escape any special character in the sep argument: using "." as a separator will create a destructive regular expression that will eliminate the entire string. Also note that the function will strip any space located around the separator.
We run the function twice on our list of names: first with side set to "right" and sep set to ",", in order to remove "MBA" and similar suffixes, and then with side set to "left" and sep set to "\\." to remove "Ing." and similar prefixes. The results show neither of these:
from               | to
Doris Bures        | Doris Bures
Karlheinz Kopf     | Karlheinz Kopf
Ing. Norbert Hofer | Norbert Hofer
Alm Nikolaus, Mag  | Alm Nikolaus
Amon Werner, MBA   | Amon Werner
Angerer Erwin      | Angerer Erwin
It is important to clean the "right-hand side" of the names before cleaning the "left-hand side", to avoid any pattern where the two get confused, which would lose the name in the middle.
It is also important for our solution that the prefixes do not contain any commas, otherwise the prefix and suffix patterns would get mixed up and the results would fail to identify the name in the middle. If the prefixes contained commas, we would need to be more cautious and use str_locate_all to subset the names more carefully.
Detaching
A related function can extract the prefix or suffix and the name to a list:
The function uses the str_filter function to find the part of the string that is not considered as a prefix or suffix, and then uses the (vectorized) str_replace function from the stringr package to remove that part of the string from the original text. When there is no prefix or suffix, the result is a missing value:
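The logic just described can be sketched as follows. This is a simplified version under stated assumptions: it returns only the detached part as a character vector rather than a list, it uses base R's sub() with fixed = TRUE as a stand-in for stringr's str_replace, and it relies on a compact, greedy-only stand-in for str_filter.

```r
# Compact, greedy-only stand-in for the str_filter function described above.
str_filter <- function(x, sep, side = c("left", "right")) {
  side <- match.arg(side)
  pattern <- if (side == "left") {
    paste0("^.*", sep, "\\s*")
  } else {
    paste0("\\s*", sep, ".*$")
  }
  sub(pattern, "", x, perl = TRUE)
}

# Hypothetical sketch of str_detach(): returns the prefix or suffix, or NA.
str_detach <- function(x, sep, side = c("left", "right")) {
  side <- match.arg(side)
  core <- str_filter(x, sep, side)  # the name without its prefix or suffix
  # Remove the core from the original text, element by element.
  part <- mapply(function(full, keep) sub(keep, "", full, fixed = TRUE),
                 x, core, USE.NAMES = FALSE)
  # Strip the separator and spaces left around the detached part.
  part <- gsub("^[[:space:][:punct:]]+|[[:space:][:punct:]]+$", "", part)
  part[part == ""] <- NA_character_  # nothing detached: missing value
  part
}

x <- c("Doris Bures", "Ing. Norbert Hofer", "Alm Nikolaus, Mag")
str_detach(x, "\\.", "left")
str_detach(x, ",", "right")
```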
Applying the function to our data allows us to extract the prefix or suffix of the names. In the data extract below, the prefix column is where we used str_detach on the "left" side with separator "\\.", and suffix is the column where we targeted the "right" side with separator ",":
from               | prefix | suffix
Doris Bures        | <NA>   | <NA>
Karlheinz Kopf     | <NA>   | <NA>
Ing. Norbert Hofer | Ing    | <NA>
Alm Nikolaus, Mag  | <NA>   | Mag
Amon Werner, MBA   | <NA>   | MBA
Angerer Erwin      | <NA>   | <NA>
Postprocessing
Let's finally wrap all processing functions in one, which returns a data frame of cleaned names with their prefixes and (cleaned) suffixes:
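To illustrate the shape of such a wrapper, the sketch below inlines the whitespace, punctuation and subsetting steps as plain regular expressions instead of calling the helper functions. The name str_clean and the column names (taken from the results table) are assumptions, and the endmark rule is reduced to its simplest form.

```r
# Hypothetical wrapper; the name str_clean is an assumption of this sketch.
str_clean <- function(x) {
  s <- gsub("[[:space:]]+", " ", x)           # collapse whitespace
  s <- gsub("^ | $", "", s)                   # trim both ends
  s <- gsub("[[:space:][:punct:]]+$", "", s)  # drop endmarks
  # Suffix: whatever follows the first comma, if there is one.
  suffix <- ifelse(grepl(",", s),
                   sub("^.*?,\\s*", "", s, perl = TRUE), NA_character_)
  s <- sub("\\s*,.*$", "", s)                 # right-hand side first
  # Prefix: whatever precedes the last period, if there is one.
  prefix <- ifelse(grepl("\\.", s), sub("\\.[^.]*$", "", s), NA_character_)
  s <- sub("^.*\\.\\s*", "", s)               # then the left-hand side
  data.frame(from = x, prefix = prefix, name = s, suffix = suffix,
             stringsAsFactors = FALSE)
}

raw <- c("Doris Bures", "Ing. Norbert Hofer",
         "\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n",
         "\r\n\t\t\r\nAmon Werner, MBA\r\n")
str_clean(raw)
```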
The combined results of all previous functions are shown below, with the prefix and suffix columns using str_filter, str_detach and some further text replacement to extract clean prefixes and suffixes:
from                               | prefix | name           | suffix
Doris Bures                        | <NA>   | Doris Bures    | <NA>
Karlheinz Kopf                     | <NA>   | Karlheinz Kopf | <NA>
Ing. Norbert Hofer                 | Ing    | Norbert Hofer  | <NA>
\r\n\t\t\r\nAlm Nikolaus, Mag.\r\n | <NA>   | Alm Nikolaus   | Mag
\r\n\t\t\r\nAmon Werner, MBA\r\n   | <NA>   | Amon Werner    | MBA
\r\n\t\t\r\nAngerer Erwin\r\n      | <NA>   | Angerer Erwin  | <NA>
After inspection of the data, the code gets only one of the 186 rows wrong, because one person's name is written differently from all the others (row #76). This problematic case will be fixed in one line of code.
It also appears that there is only one name with a prefix (row #3), and that name comes from the first three rows of the data, which designate people who re-appear in the later rows but with their names ordered differently (compare rows #1-3 to #4-6). The only step left is therefore to remove the first three rows of the results.
Both steps outlined above (dropping the extra rows and fixing the sole problematic case) can be performed together with the dplyr package:
Inverting
Let's now notice that the names are presented as family names, followed by a space, followed by first names, optionally followed by a space and one initial. Using str_count to count the number of spaces found in the names seems to confirm that this is how the data are structured.
If the pattern described above is fixed, a simple function can "invert" the names to have first names (and their optional initial) at the front of the family names:
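A minimal sketch of such an inversion, assuming that the family name is always the first word. The name str_invert is an assumption, and the space count uses base R as a stand-in for stringr's str_count.

```r
# Hypothetical helper: move the first word (the family name) to the end.
str_invert <- function(x) {
  sub("^([^[:space:]]+)[[:space:]]+(.*)$", "\\2 \\1", x)
}

x <- c("Angerer Erwin", "Krainer Kai Jan", "Riemer Josef A")

# Number of spaces per name: 1 for "family first", 2 when there is a
# second first name or an initial.
lengths(regmatches(x, gregexpr(" ", x, fixed = TRUE)))

str_invert(x)
```

Names made of a single word do not match the pattern and are returned unchanged.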
Inspecting the results will reveal one problematic case where the family name is made of two words ("El Habbassi Asdin"), so let's fix that by "protecting" the "El" prefix before inverting. The code below does so and then shows all cases where the name inversion might have gone wrong:
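One way to sketch this protection step is to glue "El" to the following word with a placeholder character before inverting, and to restore the space afterwards. The underscore placeholder and the str_invert helper below are assumptions of this example.

```r
# Hypothetical helper: move the first word (the family name) to the end.
str_invert <- function(x) {
  sub("^([^[:space:]]+)[[:space:]]+(.*)$", "\\2 \\1", x)
}

x <- c("Angerer Erwin", "El Habbassi Asdin", "Krainer Kai Jan")

# Protect the two-word family name with an arbitrary placeholder ...
protected <- sub("^El ", "El_", x)
# ... invert, and then turn the placeholder back into a space.
inverted <- gsub("_", " ", str_invert(protected), fixed = TRUE)

# Show the ambiguous cases: names that contain two spaces or more.
data.frame(from = x, to = inverted, stringsAsFactors = FALSE)[grepl(" .* ", x), ]
```

The placeholder only works if it never occurs in the names themselves, which should be checked on the data first.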
At that stage, the results look fine even when the names are ambiguous:
from                  | to
Aslan Aygül Berivan   | Aygül Berivan Aslan
Bösch Reinhard Eugen  | Reinhard Eugen Bösch
El Habbassi Asdin     | Asdin El Habbassi
Eßl Franz Leonhard    | Franz Leonhard Eßl
Feichtinger Klaus Uwe | Klaus Uwe Feichtinger
Fekter Maria Theresia | Maria Theresia Fekter
Gamon Claudia Angela  | Claudia Angela Gamon
Karlsböck Andreas F   | Andreas F Karlsböck
Krainer Kai Jan       | Kai Jan Krainer
Riemer Josef A        | Josef A Riemer
De-duplication
There are no nominal duplicates in the data, so there is no need to process the names further. However, if there were duplicates among processed names, the dplyr package would come in handy to do something like appending numbers to the duplicate names, so that they would read "Jon Example-1" and "Jon Example-2".
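The numbering idea can be sketched as follows. To keep the example free of extra packages, it uses base R's ave() in place of dplyr; the function name str_dedupe is an assumption.

```r
# Hypothetical helper: append "-1", "-2", ... to names that occur repeatedly.
str_dedupe <- function(x) {
  n   <- ave(seq_along(x), x, FUN = length)     # size of each name group
  idx <- ave(seq_along(x), x, FUN = seq_along)  # position within the group
  ifelse(n > 1, paste0(x, "-", idx), x)         # unique names stay untouched
}

str_dedupe(c("Jon Example", "Erwin Angerer", "Jon Example"))
```

Names that occur only once are left as they are, so searching the result for "-1" is enough to tell whether any duplicate was found.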
We therefore finalize the data by running the following code to drop the original names, invert the processed names as shown above, and then de-duplicate them if needed:
Inspecting the final data frame for any occurrence of "-1" confirms that the data did not contain duplicates, and we have reached our goal: all names in the data have been cleaned up and made unique.
The code featured in this note is available from this Gist, which contains a backup of the example data. As previously remarked, the code is problem-dependent: it fits the example data that we used in this note. However, there is a fair chance that the code might be reusable without too many changes in different contexts.
- First published on January 8th, 2016