Technologies worth learning for data science

As a complement to my note on R as a data science language, this note lists ten other technologies that you might want to learn to use, or at least monitor, if you are interested in learning data science.

Communication

  • Git is a distributed version control system that is easy to use through platforms like GitHub or GitLab. It is the best tool that I know of to keep track of code and text over long periods of time (meaning: over 24 hours).

    Learning the basics of Git will also force you to learn the fundamentals of the command line, which is a crucial skill in its own right, and one that many other useful tools require.

  • Knowing HTML and CSS is simply invaluable. Learning to use both will also teach you basic HTTP and FTP communications, and learning HTML will serve as an introduction to XML.

    Learning to structure and style content using HTML and CSS will also make clear to you why you need to equip yourself with a code/plain text editor, and why you need to learn how to use regular expressions as soon as possible.

  • LaTeX is an extremely powerful typesetting language. Unlike Markdown and its variants, such as R Markdown, it requires considerable effort to use; however, it is unbeatable for typesetting high-quality reports, especially when mathematical notation is involved.

    Given the complexity of its inner workings and the nice replacements offered by tools such as Markdown and Pandoc, LaTeX is perhaps the least important item to learn on this list.
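
As a rough illustration of why regular expressions become useful as soon as you start handling HTML, here is a minimal Python sketch. The HTML fragment and both patterns are invented for this example, and patterns this simple only work on small, predictable snippets; for real pages, a proper HTML parser is the safer choice.

```python
import re

# Hypothetical HTML fragment, made up for this example.
html = (
    '<p>See <a href="https://example.org/a">A</a> '
    'and <a href="https://example.org/b">B</a>.</p>'
)

# Capture the value of every double-quoted href attribute.
links = re.findall(r'href="([^"]+)"', html)

# Remove anything that looks like a tag, keeping only the text.
text = re.sub(r"<[^>]+>", "", html)

print(links)
print(text)
```

The same two operations, finding and replacing by pattern, are available in most code/plain text editors.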

Data storage

  • The Apache Software Foundation hosts several tools for big data analysis: Hadoop, which can be interfaced with R via the RHadoop packages; Spark, which can be interfaced with R via sparklyr; and Arrow, which can be interfaced with R (or with Python) via feather.

    If you are interested in big data technology, the Apache Software Foundation is probably the first organization to watch, followed by Amazon and Microsoft, which both offer cloud computing solutions.

  • Google services like Google My Maps or Google Docs can be useful to collaborate on documents with a large team of users, and/or with users who can only make use of simple interfaces.

    Learning to use Google Docs, Google Drive and Google Maps efficiently is very straightforward and has saved me lots of time while working on data collection tasks with non-technical users.

  • SQL is the standard language for querying and managing relational databases. It is implemented, with some variation, by many database systems, including MySQL, PostgreSQL and SQLite, all of which can be interfaced with R via dplyr and other packages.

    Learning a bit of SQL is like learning a bit of JavaScript or PHP – it takes only a few hours and an Internet connection, and it will always prove very useful at some point.

  • MonetDB is a column-oriented database system that speaks SQL. If you come from the social sciences, you are likely to work with data column by column, which is the use case that it targets. It interfaces with R via the MonetDB.R and MonetDBLite packages.
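
To give an idea of what "a bit of SQL" looks like, here is a minimal sketch that uses SQLite through Python's built-in sqlite3 module. The table name, column names and values are all invented for this example, and other database systems might need slightly different syntax.

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk.
con = sqlite3.connect(":memory:")

# Hypothetical table of survey respondents.
con.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, country TEXT, age INTEGER)"
)
con.executemany(
    "INSERT INTO respondents (country, age) VALUES (?, ?)",
    [("FR", 34), ("FR", 52), ("DE", 41)],
)

# Average age by country, sorted by country code.
rows = con.execute(
    "SELECT country, AVG(age) FROM respondents GROUP BY country ORDER BY country"
).fetchall()
con.close()

print(rows)
```

The query only reads two columns of the table, whatever the number of rows, which is the kind of analytical query that column-oriented databases are designed for.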

Data visualization

  • Several JavaScript visualization libraries, such as Leaflet.js and Three.js, can be interfaced with R – in these cases, via the leaflet and threejs packages. For more examples of such interfaces, browse the reverse imports of the htmlwidgets package.

    Note that in some cases, as with d3.js or Sigma.js, R will not let you use these libraries to their full potential, and you will therefore need to work with JavaScript directly.

  • Acquiring basic knowledge of image editing is a good way to learn the very basics of visualization, such as color definitions, raster and vector image types, and how image resolution works.

    Adobe Illustrator and Adobe Photoshop are both standards in the graphics editing industry. Free and open-source alternatives are Inkscape and GIMP.

    Note that if you want to become well-versed in visualizing geographic data on maps, some basic knowledge of cartography will also come in very handy. Unfortunately, I have no further guidance to provide on the topic.

  • Similarly, basic knowledge of typesetting is a good way to learn a bit about typography and desktop publishing, and to become more familiar with the measurement units used in data visualization.

    I learnt a bit of typesetting a long time ago, using QuarkXPress and, later on, Adobe InDesign. A free and open-source alternative is Scribus.
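
The color definitions, image resolution and measurement units mentioned above come down to a little arithmetic, sketched below in Python. The function names are hypothetical, and the sketch assumes the desktop publishing convention of 72 points per inch and colors written as "#RRGGBB".

```python
POINTS_PER_INCH = 72  # desktop publishing (PostScript) point; an assumption


def hex_to_rgb(color):
    """Convert a '#RRGGBB' string to a tuple of three integers in 0-255."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def points_to_pixels(points, dpi):
    """Size in pixels of a length given in points, at a given resolution."""
    return points * dpi / POINTS_PER_INCH


def raster_size(width_in, height_in, dpi):
    """Pixel dimensions of a raster image of a given size in inches."""
    return (width_in * dpi, height_in * dpi)


print(hex_to_rgb("#FF8000"))      # an orange, as red/green/blue values
print(points_to_pixels(12, 300))  # 12-point text at 300 dots per inch
print(raster_size(6, 4, 300))     # a 6 x 4 inch plot at 300 dots per inch
```

Vector images have no fixed pixel dimensions, so the last two conversions only matter when a plot is exported to a raster format.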


This note lists a few technologies and tools that I do not use myself, or that I learnt only to complete a specific task, and then later forgot about.

This note says nothing of RStudio and Shiny, but if you are serious about learning to use R, I assume that you will get introduced to either or both of these tools soon enough.

  • First published on January 5th, 2017