Quick shell commands for R users
This note explains how to use an application launcher along with text expansion and shell commands to accomplish a few specific tasks that can be useful to R users.
Software requirements
This note assumes that you are equipped with an application launcher that supports text expansion and can process shell commands. On Mac OS X, I recommend Alfred: the small price of its Powerpack is amply repaid by the many useful features (including shell support) that it adds to the application.
This note also assumes that you are equipped with a text expander, which is a small utility that will transform keyboard inputs into longer ones, thereby saving you the hassle of memorising and typing shell commands. On Mac OS X, I recommend aText, because it is cheap, fast and stable.
If you are going to use the Alfred + aText combination that I recommend, remember to set the "Preferences → Appearance → Options → Focusing" setting in Alfred to "Compatibility mode" in order to enable aText in the Alfred prompt, and to write your aText shortcuts as plain text content.
Table of contents
Below are several example cases of using Alfred and aText to perform some specific tasks that can be useful to the R user:
- Find open network ports with lsof
- Launch a local Web server with xampp
- Diagnose an online connection with curl
- Download batches of files with wget
- Find files anywhere with find
- Scan large files with cat and sed
- Extension: Calling shell commands from R, illustrated with pdftk
- Extension: Quicker Git commands
For even more useful command-line one-liners, see this list by Arturo Herrero.
Find open network ports
There are a variety of situations in which listing open network ports is useful to the R user, such as when using Shiny or any other software that serves results from R on a specific local network port.
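For instance, here is a minimal sketch of the kind of software that binds R to a local port (it assumes that the shiny package is installed and that "app" is a hypothetical directory containing a Shiny app):

# Serve a (hypothetical) Shiny app on a fixed local port;
# lsof, used below, will then show an R process listening on localhost:4321
library(shiny)
runApp("app", port = 4321, launch.browser = FALSE)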
With Alfred installed, typing the following in its prompt will open a Terminal window (because of the leading > character) and list every open file in a summary table:
> lsof
Since the definition of files in UNIX includes network connections, listing only open network sockets is as simple as passing a few options to lsof
and then subsetting the result to connections that are currently being "listened to" (i.e. awaiting connections):
> lsof -i -P | grep -i "listen"
At that stage, the command has become too long to memorize if you are not a frequent shell user. This is where text expansion comes in handy, as the entire command above (including the leading > character) can be assigned to an abbreviation like lsof. Typing those four letters in Alfred will now return results that might include something like
rsession 21309 fr 5u IPv4 0xbb252b45e6a7134d 0t0 TCP localhost:42136 (LISTEN)
rsession 21309 fr 19u IPv4 0xbb252b45d9f58b35 0t0 TCP localhost:18323 (LISTEN)
… which shows the specific ports that are being used by my R session.
If you are planning to use lsof for tasks other than the one described above, you will naturally want to pick a different abbreviation than lsof in your text expander.
Launch a local Web server
R can be used with many other technologies, and a lot of effort is currently being put into making R play nicely with languages that can be used in Web pages, such as JavaScript.
If your R code produces results that require PHP to be viewed, you will need to run a PHP server on your local computer. An easy way to do this is to install a "solution stack package" such as XAMPP, which will run an Apache HTTP server with Perl, PHP and a few other things on it.
On Mac OS X, one way to launch or stop XAMPP is to use the following commands:
sudo /Applications/XAMPP/xamppfiles/xampp start
sudo /Applications/XAMPP/xamppfiles/xampp stop
The first thing you will want to do is to create a symlink to the XAMPP executable in a directory that is part of your PATH, such as /usr/local/bin, where Homebrew, for instance, also creates symlinks for the packages that it installs. Use the following command to create that symlink:
ln -s /Applications/XAMPP/xamppfiles/xampp /usr/local/bin
The two commands above can now be run with less typing:
> sudo xampp start
> sudo xampp stop
An easy way to execute any of these commands is then to assign their common part (> sudo xampp) to the xampp abbreviation in your text expander, so that in your application launcher, typing xampp start or xampp stop will either launch or stop your local server.
Diagnose an online connection
A very useful tool for diagnosing all sorts of issues with Internet connections is curl, which is included with Mac OS X and many other operating systems. This is the same command that R will try to use to download files if it finds it installed.
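As a quick illustration of that link, here is a minimal sketch of explicitly asking R to delegate a download to curl (the URL and destination file name are hypothetical examples):

# Ask R to use the system's curl binary for the download
# (the URL and destination file name are hypothetical examples)
download.file("https://example.com/data.csv", destfile = "data.csv", method = "curl")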
If a particular Web page is not downloading or parsing properly when using R functions like download.file or rvest::read_html (which calls xml2::read_xml), it can be helpful to use curl with the following arguments:
curl -ILv <URL>
- The -I argument will only return the header of the target URL.
- The -L argument allows curl to follow redirects.
- The -v argument makes curl verbose.
- And the <URL> argument is the target address to check.
To use this command through a shortcut, just assign
> curl -ILv
to an abbreviation like ccurl
in your text expander. As in other examples, do not forget the >
prefix if you are using this shortcut with Alfred, and include a trailing space at the end, so that only the target URL will be missing.
There are many other ways to use curl. For instance, I have the following curl command assigned to the sspeed keyword:
curl -o /dev/null http://speedtest.sea01.softlayer.com/downloads/test10.zip
The command will attempt to download an 11 MB file from a speed-testing server. Since the download path is set to /dev/null, the file will not be saved anywhere on disk, although it will be downloaded in full.
The curl
progress text produced by this command will indicate the average speed of the download, which is handy to troubleshoot a buggy connection.
Download batches of files
R is not necessarily the most appropriate hammer for every nail, and when it comes to batch downloads, it can be far more practical to use another tool, especially in situations where there is no particular need to make the download step replicable.
A very efficient tool in such situations is wget
, which Mac OS X users can install with the following Homebrew command:
brew install wget
A generic wget
command to download all the files at a given address might look like:
wget -c --proxy=off -Q0 --passive-ftp -r -l5 -A '*.*' --no-parent --progress=dot:binary <URL>
- The -c argument will resume partial downloads, which is useful if the download stalls: just interrupt the process with Ctrl+C, and run it again.
- The --proxy=off argument means that we are not using any particular proxy.
- The -Q0 argument sets the download quota to 0 (i.e. no limit), which means that wget will try to download all the files that it finds at the given address.
- The --passive-ftp option means that we are letting the server set the FTP data port, which is a way to avoid issues with firewalls.
- The -r argument makes the download recursive, and -l5 sets the recursion depth: wget will dig up to 5 levels to find files, which is its (convenient) default.
- -A is the "accept list" pattern: here, '*.*' means that all file types will be downloaded; change -A to -R to set a "reject list" instead.
- --no-parent means that wget will not ascend to the parent directory of the given address, which is required to constrain the download.
- The --progress=dot:binary argument allows you to track the progress of the downloads in the Terminal.
- And the <URL> argument is the target address to scrape from.
Another useful argument is --directory-prefix=<PATH>, which sets the destination directory of the download. If it is omitted, wget will download to the current working directory, which is fine by me.
One way to run this generic command via Alfred is to copy
> wget -c --proxy=off -Q0 --passive-ftp -r -l5 -A '*.*' --no-parent --progress=dot:binary
and to assign it to an abbreviation like wwget. From there, it only takes a few keystrokes followed by a paste of the target URL to retrieve all files from it.
Find files anywhere
The find
utility is a powerful tool to search for files (and folders) everywhere on the hard drive, including in system folder hierarchies and in places that are invisible to the user or to search tools like Mac OS X Spotlight.
Using find is helpful for locating files outside of the current working directory. Using it inside the working directory will return results similar to those of the list.files function with a specific pattern argument.
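For comparison, a rough R equivalent restricted to the working directory might look like the following sketch (the pattern shown is only an example):

# List all README.md files below the current working directory,
# roughly what find . -name README.md would return
list.files(path = ".", pattern = "^README\\.md$", recursive = TRUE, full.names = TRUE)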
To search for all files called README.md
on disk, use this command:
find / 2>/dev/null -name README.md
The first part of the command, /, means that find will search the entire hard drive, which is how I usually use it. Change the argument to . to search only from the current working directory, or to ~ to search only the folders of the current user.
The second part of the command will remove the "permission denied" error messages that will inevitably pop up when using find
without superuser privileges.
The third part of the command controls the file name. As hinted above, it will also return folders, including invisible ones, as in this example, which returns all .git
folders on disk:
find / 2>/dev/null -name .git
The find utility also accepts wildcard patterns (and, through other arguments, regular expressions), so that you can search for more complex file names. The following command, for instance, will match all README files that have a file extension:
find / 2>/dev/null -name 'README.*'
Last, the -name
argument can be used several times, with other arguments that stand for AND
and OR
logical operators. The example below will search for all README.md
and README.txt
files:
find / 2>/dev/null -name README.md -o -name README.txt
To use find
through Alfred, simply assign
> find / 2>/dev/null -name
to an abbreviation like ffind
in your text expander.
Scan large files
When your R workflow involves peeking into very large files, such as gigabyte-sized server logs, it becomes impractical to explore these files solely from within R. Several shell commands can help, however.
The basic command to read a file from the shell is cat, but the head and tail commands are better suited for very large files, as they will only print a given number of lines from the beginning or the end of the file:
head -500 <FILE>
tail -500 <FILE>
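For the head part at least, base R can do something similar without reading the whole file into memory (the file name below is a hypothetical example):

# Read only the first 500 lines of a (hypothetical) large log file into R
first_lines <- readLines("server.log", n = 500)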
If you are looking for specific lines, use cat and pipe its output through grep to filter for the line(s) that match a regular expression:
cat -b <FILE> | grep -i hello
The -b
argument will print the line number(s) of the line(s) that you are looking for. From there on, you might want to use sed
to read selected lines in your file, or an interval of them, as in this example:
sed -n 'FIRST,LASTp' <FILE>
The command above means "print (p) the lines of <FILE> from line FIRST to line LAST".
All these commands are worth learning by heart if you find yourself regularly working with large text files. I personally use them frequently when working with data dumps from PubMed or from Wikipedia.
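If you eventually need such selected lines back in R, a minimal sketch is to run the same kind of sed command through a pipe connection (the file name and line interval are hypothetical examples):

# Let sed extract lines 100 to 110 of a (hypothetical) large log file,
# and read only those lines into R instead of loading the whole file
x <- readLines(pipe("sed -n '100,110p' server.log"))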
Extension: Calling shell commands from R
Let's not forget that R can run shell commands through the system
function. For instance, the syntax of the PDFtk utility, which can read metadata from PDF files, goes as follows:
pdftk <FILE> dump_data_utf8
This command can be used from within R to access PDF metadata, as shown in this Gist, which renames PDF files based on their metadata title and author fields.
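As a minimal sketch of that idea (it assumes that pdftk is on your PATH, and the file name is a hypothetical example), the metadata can be captured from R like this:

# Capture the metadata dump produced by pdftk as a character vector
meta <- system("pdftk document.pdf dump_data_utf8", intern = TRUE)
# Keep only the Info* lines, which hold the key/value metadata pairs
grep("^Info", meta, value = TRUE)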
Extension: Quicker Git commands
The logic followed in this note can be applied to many other use cases, such as shortening Git commands.
Git and GitHub allow you to track revisions of any content, especially code or any other form of plain text, but also many other forms of media like images or maps. There are excellent tutorials to learn them, such as this short guide and its summary of the most useful Git commands, both by Karl Broman.
Even the excellent GitHub desktop client cannot cover the thousands of different functionalities that Git offers; as a consequence, it is necessary to use the command line to use some parts of Git, such as Git submodules or Git Large File Storage, which is supported by GitHub but only minimally configurable through its desktop client.
Any part of your Git workflow can be shortened for quicker use. For instance, if you are tracking a repository that comes with release tags, it might be helpful to review the history of the commits to the repository with these tags apparent. The Git command to do this (i.e. “visualize the version tree”) is
> git log --pretty=oneline --graph --decorate --all
… or any text shortcut that expands to it, such as gittree.
Update (April 13, 2016): added the curl, wget, find and sleepimage sections.
Update (April 16, 2016): added the cat/sed and pdftk sections.
Update (February 2, 2019): removed the sleepimage section, which is obsolete on recent versions of macOS.
- First published on February 7th, 2016