Quick shell commands for R users
This note explains how to use an application launcher along with text expansion and shell commands to accomplish a few specific tasks that can be useful to R users.
Software requirements
This note assumes that you are equipped with an application launcher that supports text expansion and can process shell commands. On Mac OS X, I recommend Alfred: the small price of its Powerpack is amply repaid by the many useful features (including shell support) that it adds to the application.
This note also assumes that you are equipped with a text expander, which is a small utility that will transform keyboard inputs into longer ones, thereby saving you the hassle of memorising and typing shell commands. On Mac OS X, I recommend aText, because it is cheap, fast and stable.
If you are going to use the Alfred + aText combination that I recommend, remember to set the "Preferences → Appearance → Options → Focusing" setting in Alfred to "Compatibility mode" in order to enable aText in the Alfred prompt, and to write your aText shortcuts as plain text content.
Table of contents
Below are several example cases of using Alfred and aText to perform some specific tasks that can be useful to the R user:
- Find open network ports with lsof
- Launch a local Web server with xampp
- Diagnose an online connection with curl
- Download batches of files with wget
- Find files anywhere with find
- Scan large files with cat and sed
- Extension: Calling shell commands from R, illustrated with pdftk
- Extension: Quicker Git commands
For even more useful command-line one-liners, see this list by Arturo Herrero.
Find open network ports
There are a variety of situations in which listing open network ports is useful to the R user, such as when using Shiny or any other software that serves results from R on a specific local network port.
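For instance, here is a minimal sketch of the kind of software that binds R to a local port (it assumes that the shiny package is installed and that "app" is a hypothetical directory containing a Shiny app):

# Serve a (hypothetical) Shiny app on a fixed local port;
# lsof, used below, will then show an R process listening on localhost:4321
library(shiny)
runApp("app", port = 4321, launch.browser = FALSE)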
With Alfred installed, typing the following in its prompt will open a Terminal window (because of the leading > character) and list every open file in a summary table:
> lsof
Since the definition of files in UNIX includes network connections, listing only open network sockets is as simple as passing a few options to lsof
and then subsetting the result to connections that are currently being "listened to" (i.e. awaiting connections):
> lsof -i -P | grep -i "listen"
At that stage, the command has become too long to memorize if you are not a frequent shell user. This is where text expansion comes in handy, as the entire command above (including the leading > character) can be assigned to an abbreviation like lsof. Typing those four letters in Alfred will now return results that might include something like
rsession 21309 fr 5u IPv4 0xbb252b45e6a7134d 0t0 TCP localhost:42136 (LISTEN)
rsession 21309 fr 19u IPv4 0xbb252b45d9f58b35 0t0 TCP localhost:18323 (LISTEN)
… which shows the specific ports that are being used by my R session.
If you are planning to use lsof for tasks other than the one described above, you will naturally want to pick a different abbreviation than lsof in your text expander.
Launch a local Web server
R can be used with many other technologies, and a lot of effort is currently being put into making R play nicely with languages that can be used in Web pages, such as JavaScript.
If your R code produces results that require PHP to be viewed, you will need to run a PHP server on your local computer. An easy way to do this is to install a "solution stack package" such as XAMPP, which will run an Apache HTTP server with Perl, PHP and a few other things on it.
On Mac OS X, one way to launch or stop XAMPP is to use the following commands:
sudo /Applications/XAMPP/xamppfiles/xampp start
sudo /Applications/XAMPP/xamppfiles/xampp stop
The first thing you will want to do is to create a symlink to the XAMPP executable in a directory that is part of your PATH, such as /usr/local/bin, where Homebrew, for instance, also creates symlinks for the packages that it installs. Use the following command to create that symlink:
ln -s /Applications/XAMPP/xamppfiles/xampp /usr/local/bin
The two commands above can now be run with less typing:
> sudo xampp start
> sudo xampp stop
An easy way to execute any of these commands is then to assign their common part (> sudo xampp) to the xampp abbreviation in your text expander, so that in your application launcher, typing xampp start or xampp stop will either launch or stop your local server.
Diagnose an online connection
A very useful tool for diagnosing all sorts of issues with Internet connections is curl, which is included with Mac OS X and many other operating systems. This is the same command that R will try to use to download files if it finds it installed.
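As a quick illustration of that link, here is a minimal sketch of explicitly asking R to delegate a download to curl (the URL and destination file name are hypothetical examples):

# Ask R to use the system's curl binary for the download
# (the URL and destination file name are hypothetical examples)
download.file("https://example.com/data.csv", destfile = "data.csv", method = "curl")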
If a particular Web page is not downloading or parsing properly when using R functions like download.file or rvest::read_html (which calls xml2::read_xml), it can be helpful to use curl with the following arguments:
curl -ILv <URL>
- The -I argument will only return the header of the target URL.
- The -L argument allows curl to follow redirects.
- The -v argument makes curl verbose.
- And the <URL> argument is the target address to check.
To use this command through a shortcut, just assign
> curl -ILv
to an abbreviation like ccurl
in your text expander. As in other examples, do not forget the >
prefix if you are using this shortcut with Alfred, and include a trailing space at the end, so that only the target URL will be missing.
There are many other ways to use curl. For instance, I have the following curl command assigned to the sspeed keyword:
curl -o /dev/null http://speedtest.sea01.softlayer.com/downloads/test10.zip
The command will attempt to download an 11 MB file from a speed-testing server. Since the download path is set to /dev/null, the file will not be saved anywhere on disk, although it will be downloaded in full.
The curl
progress text produced by this command will indicate the average speed of the download, which is handy to troubleshoot a buggy connection.
Download batches of files
R is not necessarily the most appropriate hammer for every nail, and when it comes to batch downloads, it can be far more practical to use another tool, especially in situations where there is no particular need to make the download step replicable.
A very efficient tool in such situations is wget
, which Mac OS X users can install with the following Homebrew command:
brew install wget
A generic wget
command to download all the files at a given address might look like:
wget -c --proxy=off -Q0 --passive-ftp -r -l5 -A '*.*' --no-parent --progress=dot:binary <URL>
- The -c argument will resume partial downloads, which is useful if the download stalls: just interrupt the process with Ctrl+C, and run it again.
- The --proxy=off argument means that we are not using any particular proxy.
- The -Q0 argument sets the download quota to 0 (i.e. no limit), which means that wget will try to download all the files that it finds at the given address.
- The --passive-ftp option means that we are letting the server set the FTP data port, which is a way to avoid issues with firewalls.
- The -r argument makes the download recursive, and -l5 sets the recursion depth: wget will dig up to 5 levels to find files, which is its (convenient) default.
- -A is the "accept list" pattern: here, '*.*' means that all file types will be downloaded; change -A to -R to set a "reject list" instead.
- --no-parent means that wget will not ascend to the parent directory of the given address, which is required to constrain the download.
- The --progress=dot:binary argument allows you to track the progress of the downloads in the Terminal.
- And the <URL> argument is the target address to scrape from.
Another useful argument is --directory-prefix=<PATH>, which sets the destination directory of the download. If it is omitted, wget will download to the current working directory, which is fine by me.
One way to run this generic command via Alfred is to copy
> wget -c --proxy=off -Q0 --passive-ftp -r -l5 -A '*.*' --no-parent --progress=dot:binary
and to assign it to an abbreviation like wwget. From there, it only takes a few keystrokes followed by a paste of the target URL to retrieve all files from it.
Find files anywhere
The find
utility is a powerful tool to search for files (and folders) everywhere on the hard drive, including in system folder hierarchies and in places that are invisible to the user or to search tools like Mac OS X Spotlight.
Using find is helpful for locating files outside of the current working directory. Using it inside the working directory will return results similar to those of the list.files function with a specific pattern argument.
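For comparison, a rough R equivalent restricted to the working directory might look like the following sketch (the pattern shown is only an example):

# List all README.md files below the current working directory,
# roughly what find . -name README.md would return
list.files(path = ".", pattern = "^README\\.md$", recursive = TRUE, full.names = TRUE)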
To search for all files called README.md
on disk, use this command:
find / 2>/dev/null -name README.md
The first part of the command, /, means that find will search the entire hard drive, which is how I usually use it. Change the argument to . to search only from the current working directory, or to ~ to search only the folders of the current user.
The second part of the command will remove the "permission denied" error messages that will inevitably pop up when using find
without superuser privileges.
The third part of the command controls the file name. As hinted above, it will also return folders, including invisible ones, as in this example, which returns all .git
folders on disk:
find / 2>/dev/null -name .git
The find utility also accepts wildcard patterns (and, through other arguments, regular expressions), so that you can search for more complex file names. The following command, for instance, will match all README files that have a file extension:
find / 2>/dev/null -name 'README.*'
Last, the -name
argument can be used several times, with other arguments that stand for AND
and OR
logical operators. The example below will search for all README.md
and README.txt
files:
find / 2>/dev/null -name README.md -o -name README.txt
To use find
through Alfred, simply assign
> find / 2>/dev/null -name
to an abbreviation like ffind
in your text expander.
Scan large files
When your R workflow involves peeking into very large files, such as gigabyte-sized server logs, it becomes impractical to explore these files solely from within R. Several shell commands can help, however.
The basic command to read a file from the shell is cat, but the head and tail commands are better suited for very large files, as they will only print a given number of lines from the beginning or the end of the file:
head -500 <FILE>
tail -500 <FILE>
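For the head part at least, base R can do something similar without reading the whole file into memory (the file name below is a hypothetical example):

# Read only the first 500 lines of a (hypothetical) large log file into R
first_lines <- readLines("server.log", n = 500)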
If you are looking for specific lines, use cat and pipe its output through grep to filter for the line(s) that match a regular expression:
cat -b <FILE> | grep -i hello
The -b
argument will print the line number(s) of the line(s) that you are looking for. From there on, you might want to use sed
to read selected lines in your file, or an interval of them, as in this example:
sed -n 'FIRST,LASTp' <FILE>
The command above means "print (p) the lines of <FILE> from line FIRST to line LAST".
All these commands are worth learning by heart if you find yourself regularly working with large text files. I personally use them frequently when working with data dumps from PubMed or from Wikipedia.
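If you eventually need such selected lines back in R, a minimal sketch is to run the same kind of sed command through a pipe connection (the file name and line interval are hypothetical examples):

# Let sed extract lines 100 to 110 of a (hypothetical) large log file,
# and read only those lines into R instead of loading the whole file
x <- readLines(pipe("sed -n '100,110p' server.log"))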
Extension: Calling shell commands from R
Let's not forget that R can run shell commands through the system
function. For instance, the syntax of the PDFtk utility, which can read metadata from PDF files, goes as follows:
pdftk <FILE> dump_data_utf8
This command can be used from within R to access PDF metadata, as shown in this Gist, which renames PDF files based on their metadata title and author fields.
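As a minimal sketch of that idea (it assumes that pdftk is on your PATH, and the file name is a hypothetical example), the metadata can be captured from R like this:

# Capture the metadata dump produced by pdftk as a character vector
meta <- system("pdftk document.pdf dump_data_utf8", intern = TRUE)
# Keep only the Info* lines, which hold the key/value metadata pairs
grep("^Info", meta, value = TRUE)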
Extension: Quicker Git commands
The logic followed in this note can be applied to many other use cases, such as shortening Git commands.
Git and GitHub allow you to track revisions of any content, especially code or any other form of plain text, but also many other forms of media like images or maps. There are excellent tutorials to learn them, such as this short guide and its summary of the most useful Git commands, both by Karl Broman.
Even the excellent GitHub desktop client cannot cover the thousands of different functionalities that Git offers; as a consequence, it is necessary to use the command line to use some parts of Git, such as Git submodules or Git Large File Storage, which is supported by GitHub but only minimally configurable through its desktop client.
Any part of your Git workflow can be shortened for quicker use. For instance, if you are tracking a repository that comes with release tags, it might be helpful to review the history of the commits to the repository with these tags apparent. The Git command to do this (i.e. “visualize the version tree”) is
> git log --pretty=oneline --graph --decorate --all
… or any text shortcut that expands to it, such as gittree.
Update (April 13, 2016): added the curl, wget, find and sleepimage sections.
Update (April 16, 2016): added the cat/sed and pdftk sections.
Update (February 2, 2019): removed the sleepimage section, which is obsolete on recent versions of macOS.
- First published on February 7th, 2016