This note explains how to use an application launcher along with text expansion and shell commands to accomplish a few specific tasks that can be useful to R users.
This note assumes that you are equipped with an application launcher that supports text expansion and can process shell commands. On Mac OS X, I recommend Alfred, because the small price of its Powerpack is amply repaid by the many useful features (including shell support) that it adds to the application.
This note also assumes that you are equipped with a text expander, which is a small utility that will transform keyboard inputs into longer ones, thereby saving you the hassle of memorising and typing shell commands. On Mac OS X, I recommend aText, because it is cheap, fast and stable.
If you are going to use the Alfred + aText combination that I recommend, remember to set the "Preferences → Appearance → Options → Focusing" setting in Alfred to "Compatibility mode" in order to enable aText in the Alfred prompt, and to write your aText shortcuts as plain text content.
Table of contents
Below are several example cases of using Alfred and aText to perform some specific tasks that can be useful to the R user:
- Find open network ports with lsof
- Launch a local Web server with xampp
- Diagnose an online connection with curl
- Download batches of files with wget
- Find files anywhere with find
- Scan large files with head, tail and sed
- Extension: Calling shell commands from R, illustrated with pdftk
- Extension: Quicker Git commands
For even more useful command-line one-liners, see this list by Arturo Herrero.
There are a variety of situations in which listing open network ports is useful to the R user, such as using Shiny or any other software that sends its results from R to a specific local network port.
Since the definition of files in UNIX includes network connections, listing only open network sockets is as simple as passing a few options to
lsof and then subsetting the result to connections that are currently being "listened to" (i.e. awaiting connections):
> lsof -i -P | grep -i "listen"
At that stage, the command has become too long to memorize if you are not a frequent shell user. This is where text expansion becomes handy, as the entire command above (including the leading
> character) can be assigned to an abbreviation like
lsof. Typing those four letters in Alfred will now return results that might include something like
rsession 21309 fr 5u IPv4 0xbb252b45e6a7134d 0t0 TCP localhost:42136 (LISTEN)
rsession 21309 fr 19u IPv4 0xbb252b45d9f58b35 0t0 TCP localhost:18323 (LISTEN)
… which shows the specific ports that are being used by my R session.
If you are planning to use
lsof for tasks other than the one described above, you will naturally want to use an abbreviation other than
lsof in your text expander.
If your R code produces results that require PHP to be viewed, you will need to run a PHP server on your local computer. An easy way to do this is to install a "solution stack package" such as XAMPP, which will run an Apache HTTP server with Perl, PHP and a few other things on it.
On Mac OS X, one way to launch or stop XAMPP is to use the following commands:
sudo /Applications/XAMPP/xamppfiles/xampp start
sudo /Applications/XAMPP/xamppfiles/xampp stop
The first thing you will want to do is to create a symlink for the XAMPP executable into a directory that is part of your
PATH, such as the
/usr/local/bin directory, where Homebrew, for instance, also creates symlinks for the packages that it installs. Use the following command to create that symlink:
ln -s /Applications/XAMPP/xamppfiles/xampp /usr/local/bin
The two commands above are now executable through less text:
> sudo xampp start
> sudo xampp stop
An easy way to execute any of these commands is then to assign their common part (
> sudo xampp) to the
xampp abbreviation in your text expander, so that in your application launcher, typing
xampp start or
xampp stop will either launch or stop your local server.
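If you want to see how the symlink step behaves before touching /usr/local/bin, here is a self-contained sketch of the same idea using throwaway paths (the fake executable and its output are made up for the demo):

```shell
# Build a fake "application" and a fake bin directory, then symlink the
# executable into the bin directory, exactly like the xampp example above.
tmp=$(mktemp -d)
mkdir -p "$tmp/app" "$tmp/bin"
printf '#!/bin/sh\necho started\n' > "$tmp/app/xampp"
chmod +x "$tmp/app/xampp"
ln -s "$tmp/app/xampp" "$tmp/bin"   # link lands at $tmp/bin/xampp
out=$("$tmp/bin/xampp")             # run the executable through the symlink
echo "$out"                         # prints: started
rm -r "$tmp"
```

Note that ln -s, when given a directory as its second argument, creates the link inside that directory under the executable's own name, which is why xampp can then be called from anywhere on your PATH.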
A very useful tool for diagnosing all sorts of issues with Internet connections is
curl, which is included with Mac OS X and many other operating systems. This is the same command that R will try to use to download files if it finds it installed.
If a particular Web page is not downloading or parsing properly when using R functions like
rvest::read_html (which calls
xml2::read_xml), it can be helpful to use
curl with the following arguments:
curl -ILv <URL>
- The -I argument returns only the header of the target URL.
- The -L argument tells curl to follow redirects.
- The -v argument makes the output verbose.
- And the <URL> argument is the target address to check.
To use this command through a shortcut, just assign
> curl -ILv
to an abbreviation like
ccurl in your text expander. As in other examples, do not forget the
> prefix if you are using this shortcut with Alfred, and include a trailing space at the end, so that only the target URL will be missing.
There are many other ways to use curl. For instance, I have the following curl command assigned to a text shortcut:
curl -o /dev/null http://speedtest.sea01.softlayer.com/downloads/test10.zip
The command will attempt to download an 11 MB file from a speed-testing server. Since the download path is set to
/dev/null, the file will not be saved anywhere on disk, although it will be downloaded in full.
The curl progress text produced by this command will indicate the average speed of the download, which is handy for troubleshooting a buggy connection.
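If you prefer a single summary figure over the progress text, curl's -w ("write-out") flag can print transfer statistics after the download. The snippet below is a self-contained sketch that "downloads" a small local file through a file:// URL, so it runs without a network connection:

```shell
# Create a 5-byte file, download it to /dev/null, and report its size.
tmp=$(mktemp)
printf 'hello' > "$tmp"
bytes=$(curl -s -o /dev/null -w '%{size_download}' "file://$tmp")
echo "$bytes"   # prints: 5
rm "$tmp"
```

Swap %{size_download} for %{speed_download} to get the average transfer speed instead, which reproduces the speed-test idea above in a single number.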
R is not necessarily the most appropriate hammer for every nail, and when it comes to batch downloads, it can be far more practical to use another tool, especially in situations where there is no particular need to make the download step replicable.
On Mac OS X, wget can be installed through Homebrew:

brew install wget

A wget command to download all the files at a given address might look like:
wget -c --proxy=off -Q0 --passive-ftp -r -l5 -A *.* --no-parent --progress=dot:binary <URL>
- The -c argument will resume partial downloads, which is useful if the download stalls: just interrupt the process with Ctrl-C, and run it again.
- The --proxy=off argument means that we are not using any particular proxy.
- The -Q0 argument sets the download quota to 0, i.e. no limit, which means that wget will try to download all the files that it finds at the given address.
- The --passive-ftp option means that we are letting the server set the FTP port, which is a way to avoid issues with server-side firewalls.
- The -r argument turns on recursive retrieval, and the -l5 argument sets its depth: wget will dig up to 5 levels to find files, which is its (convenient) default.
- The -A argument sets the "accept list" pattern: here, *.* means that all file types will be downloaded; use -R to set a "reject list" pattern instead.
- The --no-parent option means that wget will not ascend to the parent directory of the given address, which is required to constrain the download.
- The --progress=dot:binary argument makes it possible to track the progress of the downloads in the Terminal.
- And the <URL> argument is the target address to scrape from.
Another useful argument is
--directory-prefix=<PATH>, which sets the destination directory of the download. If it is omitted,
wget will download to the default working directory, which is fine by me.
One way to run this generic command via Alfred is to copy
> wget -c --proxy=off -Q0 --passive-ftp -r -l5 -A *.* --no-parent --progress=dot:binary
and to assign it to an abbreviation like
wwget. From there, it only takes a few keystrokes followed by a paste of the target URL to retrieve all files from it.
The find utility is a powerful tool to search for files (and folders) everywhere on the hard drive, including in system folder hierarchies and in places that are invisible to the user or to search tools like Mac OS X Spotlight.
find is helpful to locate files outside of the current working directory. Using it inside the working directory will return results similar to those of the
list.files function with a specific pattern argument.
To search for all files called
README.md on disk, use this command:
find / 2>/dev/null -name README.md
The first part of the command,
/, means that
find will search the entire hard drive, which is how I usually use it. Change the argument to
. to search only from the current working directory, or to
~ to only search the folders of the current user.
The second part of the command will remove the "permission denied" error messages that will inevitably pop up when using
find without superuser privileges.
The third part of the command controls the file name. As hinted above, it will also return folders, including invisible ones, as in this example, which returns all
.git folders on disk:
find / 2>/dev/null -name .git
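To see what the 2>/dev/null redirection is doing, here is a small self-contained demo: it searches a path that does not exist, which, like a permission error, makes find complain on the standard error stream:

```shell
# Count the error lines with and without the redirection.
missing="$(mktemp -d)/does-not-exist"
noisy=$(find "$missing" -name x 2>&1 | wc -l)
quiet=$(find "$missing" -name x 2>/dev/null | wc -l)
echo "$noisy $quiet"   # one error line versus none
```

The same silencing works for any command that writes its errors to stream 2 (standard error).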
The find utility supports wildcard patterns and, through other arguments, regular expressions, so that you can search for more complex names. The following command, for instance, will match all
README files with a file extension:
find / 2>/dev/null -name README.*
The -name argument can be used several times, joined by the -o argument, which stands for the OR logical operator. The example below will search for all README.md and README.txt files:
find / 2>/dev/null -name README.md -o -name README.txt
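Since the -o operator is easy to get wrong, here is a self-contained check of the same pattern, run inside a throwaway directory rather than over the entire disk:

```shell
tmp=$(mktemp -d)
touch "$tmp/README.md" "$tmp/README.txt" "$tmp/NOTES.md"
# Only the two README files should match the ORed -name tests.
matches=$(find "$tmp" -name README.md -o -name README.txt | wc -l | tr -d ' ')
echo "$matches"   # prints: 2
rm -r "$tmp"
```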
To run find through Alfred, simply assign
> find / 2>/dev/null -name
to an abbreviation like
ffind in your text expander.
When your R workflow involves peeking into very large files, such as gigabyte-sized server logs, it becomes impractical to explore these files solely from within R. Several shell commands can help, however.
The basic command to read a file from the shell is
cat, but the
head and tail
commands are better suited for very large files, as they will only print a given number of lines from the beginning or the end of the file:
head -500 <FILE>
tail -500 <FILE>
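As a quick self-contained check of how head and tail slice a file, the demo below builds a 1,000-line throwaway file (in real use, this would be your log file):

```shell
tmp=$(mktemp)
seq 1 1000 > "$tmp"            # one number per line
first=$(head -n 2 "$tmp")      # first two lines: 1 and 2
last=$(tail -n 2 "$tmp")       # last two lines: 999 and 1000
printf '%s\n%s\n' "$first" "$last"
rm "$tmp"
```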
If you are looking for specific lines, use
cat with a piped regular expression to filter out the line(s):
cat -b <FILE> | grep -i hello
The -b argument will print the line number(s) of the line(s) that you are looking for. From there on, you might want to use
sed to read selected lines in your file, or an interval of them, as in this example:
sed -n 'FIRST,LASTp' <FILE>
The command above means "print (p) lines FIRST through LAST" of the file.
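The two filtering steps above can be checked end to end on a small throwaway file; the words in it are placeholders:

```shell
tmp=$(mktemp)
printf 'alpha\nhello world\ngamma\ndelta\nepsilon\n' > "$tmp"
hits=$(cat -b "$tmp" | grep -ic hello)   # count lines matching "hello", any case
middle=$(sed -n '3,4p' "$tmp")           # print lines 3 through 4 only
echo "$hits"                             # prints: 1
echo "$middle"                           # prints: gamma, then delta
rm "$tmp"
```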
All these commands are worth learning by heart if you find yourself regularly working with large text files. I personally use them frequently when working with data dumps from PubMed or from Wikipedia.
As an extension of the above, shell commands can also be called from R itself. The pdftk utility, for instance, dumps the metadata of a PDF file with the following command:

pdftk <FILE> dump_data_utf8
This command can be used from within R to access PDF metadata, as shown in this Gist, which renames PDF files based on their metadata title and author fields.
The logic followed in this note can be applied to many other use cases, such as shortening Git commands.
Git and GitHub make it possible to track revisions on any content, especially code or any other form of plain text, but also many other forms of media like images or maps. There are excellent tutorials to learn them, such as this short guide and its summary of most useful Git commands, both by Karl Broman.
Even the excellent GitHub desktop client cannot cover the thousands of different functionalities that Git offers; as a consequence, it is sometimes necessary to turn to the command line for some parts of Git, such as Git submodules or Git Large File Storage, which is supported by GitHub but only minimally configurable through its desktop client.
Any part of your Git workflow can be shortened for quicker use. For instance, if you are tracking a repository that comes with release tags, it might be helpful to review the history of the commits to the repository with these tags apparent. The Git command to do this (i.e. “visualize the version tree”) is
> git log --pretty=oneline --graph --decorate --all
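To preview what this command prints without touching one of your real projects, here is a self-contained demo in a throwaway repository (the commit message and tag name are made up):

```shell
tmp=$(mktemp -d)
git init -q "$tmp"
git -C "$tmp" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit"
git -C "$tmp" tag v0.1
# One line per commit, with the tag decoration shown next to it.
log=$(git -C "$tmp" log --pretty=oneline --graph --decorate --all)
echo "$log"
rm -rf "$tmp"
```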
This command, or a text shortcut that expands to it, will save you several keystrokes every time you review your commit history.

Updates to this note:

- Added the find and sleepimage sections.
- Removed the sleepimage section, which is obsolete on recent versions of macOS.
- First published on February 7th, 2016