<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>R / Notes</title>
<link href="https://f.briatte.org/r/" ></link>
<id>urn:uuid:67c7c123-4d56-8d2f-277d-a33fbe4257ef</id>
<author>R / Notes - f@briatte.org</author>
<updated>2026-04-15T02:48:21+02:00</updated>
<entry>
<title>Current views on generative AI</title>
<link href="https://f.briatte.org/r/current-views-on-generative-ai" ></link>
<id>urn:uuid:ba27eb1e-1172-8e32-b09c-f49253ee37ee</id>
<updated>2026-03-14T00:00:00+01:00</updated>
<summary type="html" ><![CDATA[<p>This post contains my current views on <a href="https://en.wikipedia.org/wiki/Generative_artificial_intelligence">generative artificial intelligence</a>, and <a href="https://en.wikipedia.org/wiki/Large_language_model">Large Language Models</a> in particular. The context is mostly academia, which is about research and teaching.</p>

<h1>Personal context</h1>

<p>Generative AI is slowly creeping into my professional workflow, not because I am using it myself (I don't, although I guess that I will, at some point), but because everyone around me is.</p>

<p>My students use <a href="https://en.wikipedia.org/wiki/ChatGPT">ChatGPT</a> and other tools like, I believe, <a href="https://en.wikipedia.org/wiki/NotebookLM">NotebookLM</a> and <a href="https://en.wikipedia.org/wiki/Comet_(browser)">Perplexity Comet</a>. My RSS news feed (that's how old I am) recently had an <a href="https://www.newyorker.com/magazine/2026/02/16/what-is-claude-anthropic-doesnt-know-either">article on Claude</a>, and I use Google applications, so I keep getting passive-aggressively asked to use <a href="https://en.wikipedia.org/wiki/Gemini_(language_model)">Gemini</a>, which I might do one day via <a href="https://scholar.googleblog.com/2025/11/scholar-labs-ai-powered-scholar-search.html">Scholar Labs</a>.</p>

<p>My workplace, which is a university, has taken a very basic stance on generative AI: unless stated otherwise, students are to follow <a href="https://info.lse.ac.uk/staff/divisions/Eden-Centre/Artificial-Intelligence-Education-and-Assessment/School-position-on-generative-AI">LSE Position 1</a> (no use of generative AI in graded work), which I suppose goes both ways (no use of generative AI in grading, either).</p>

<p>I do not know of any equivalent position on generative AI in research. It seems like everyone wants to discuss the topic and play around with whatever is available for free online, but no one wants to make hard decisions about it yet, possibly due to <a href="https://artificialintelligenceact.eu/">upcoming EU-level regulations</a>.</p>

<h1>Risks for teaching and learning</h1>

<p>From a teaching perspective, generative AI is only useful to me if it helps students go through the following process:</p>

<ol>
<li>Learn</li>
<li>Draft</li>
<li>Revise</li>
<li>Submit</li>
<li>Defend</li>
</ol>

<p>Part of what I teach is code, and code is the topic of this blog. As it happens, generative AI is already very good with code, and I am confident that it can be put to good use to go through Steps 1–3 of the process above.</p>

<p>There are, however, at least four reasons why I am currently taking ‘LSE Position 1’ on using generative AI in graded work that relies on code:</p>

<ol>
<li>Many students are using AI to <a href="https://www.chronicle.com/article/is-ai-enhancing-education-or-replacing-it">bypass the learning process</a>, rather than enhance it. This creates <a href="/r/ai-generated-code-security-risks">security risks</a>, and violates <strong>academic ethics</strong> in the same way that hiring an external party would. This comes on top of other breaches of student ethics, such as plagiarism.</li>
<li>The two issues mentioned in the previous point <strong>cannot be defended against</strong> at my level, at least not with my current resources. I can spot security risks, but I cannot reliably detect AI-generated code, which is neither watermarked nor scannable through anti-plagiarism tools.</li>
<li>The software that I use in class is mostly open-source, and <strong>reproducibility</strong> is part of the core principles that I teach in class. As far as I understand, and unless proven otherwise, the kind of generative AI technology used by my students does not enforce these principles.</li>
<li>To make things worse, most generative AI also violates <strong>intellectual property</strong> (<a href="https://arxiv.org/abs/2601.02671">Ahmed <em>et al.</em> 2026</a>), rather than reconfigure it around the ‘<a href="https://en.wikipedia.org/wiki/Copyleft">copyleft</a>’ and ‘<a href="https://creativecommons.org/">creative commons</a>’ principles that many of us have spent years pushing within fields such as academic publishing.</li>
</ol>

<p>I have not been exposed to any argument that makes any attempt at solving the ethical, logistical, moral and eventually legal issues that I have outlined above. Until I do, I will treat generative AI as a form of <a href="https://en.wikipedia.org/wiki/Doping_in_sport">doping</a>, and will keep banning it.</p>

<p>The analogy above with doping is not an innocent one. There is, in my view, a very real rhetorical arc that goes from generative AI to the <a href="https://en.wikipedia.org/wiki/Enhanced_Games">Enhanced Games</a>. Higher education does not approve of students taking <a href="https://en.wikipedia.org/wiki/Adderall">Adderall</a>, and neither do I.</p>

<h1>Risks for scientific research</h1>

<p>From a research perspective, generative AI is only useful to me if it helps me go through the following process:</p>

<ol>
<li>Compile existing <strong>evidence</strong></li>
<li>Collect meaningful <strong>data</strong></li>
<li>Produce meaningful <strong>measures</strong></li>
<li>Formulate correct <strong>interpretations</strong></li>
<li>Enhance existing <strong>knowledge</strong></li>
</ol>

<p>There is no doubt that generative AI can help with every step above, especially perhaps at the level of data collection and, in the case of ‘big data’ or whatever people call it today, classification. I am also very interested in what it can contribute with regard to compiling scientific studies, in the same way that it is already helping with <a href="https://spectrum.ieee.org/ai-proof-verification">mathematical problems</a>.</p>

<p>The risks that I have heard about so far when it comes to generative AI and social science research (which is what I do) are the following:</p>

<ol>
<li>Generative AI can <strong>poison the evidence base</strong> (<a href="https://doi.org/10.1073/pnas.2314021121">Bail 2024</a>) through the mass production of low-quality academic output, or by compromising data such as online surveys (<a href="https://doi.org/10.1073/pnas.2518075122">Westwood 2025</a>, <a href="https://doi.org/10.1073/pnas.2537420123">Westwood and Frederick 2026</a>). This is already happening.</li>
<li>Generative AI does not yet produce <strong>reliable data annotations</strong> for the kind of data that I am interested in (<a href="https://www.eddieyang.net/research/llm_annotation.pdf">Yang <em>et al.</em> 2025</a>), and even if its coding reliability improves, it will require additional effort to mitigate related issues (<a href="https://doi.org/10.48550/arXiv.2509.08825">Baumann <em>et al.</em> 2025</a>).</li>
<li>Relatedly, generative AI cannot improve organically if it maintains its <strong>human bias</strong> towards evidence produced in the Global North (<a href="https://doi.org/10.31219/osf.io/w8q3y_v1">Ramirez-Ruiz and Senninger 2025</a>), mostly by ‘WEIRD’ individuals (<a href="https://doi.org/10.31234/osf.io/5b26t">Atari <em>et al.</em> 2023</a>). This will be hard and slow to solve.</li>
<li>Last but not least, generative AI will be used to <strong>erode scientific authority</strong> at the profit of <a href="https://en.wikipedia.org/wiki/Merchants_of_Doubt">those</a> who are interested in attacking the contribution that scientific (and higher education) institutions make to society. This is of course far from a trivial issue.</li>
</ol>

<p>The issues listed are all real, hard to solve, and are controversial insofar as some people have a vested interest in seeing them <em>not</em> addressed, at least not in the short term.</p>

<p>None of these issues will stop me from installing and trying out <a href="https://tidyverse.org/blog/2025/11/ellmer-0-4-0/"><code>ellmer</code></a> one day. However, I do expect this to happen within a scientific environment that will have acknowledged each issue in one way or another, and formulated guidelines to address them.</p>

<p>Are we there yet?</p>

<hr />

<p>This post was inspired by the <a href="https://www.bydamo.la/p/ai-manifesto">/ai ‘manifesto’</a>, which I discovered <a href="https://www.andrewheiss.com/ai/">thanks to Andrew Heiss</a>. I obtained some of the cited references through Jessica Hullman's <a href="https://statmodeling.stat.columbia.edu/2026/03/10/new-course-on-generative-ai-for-behavioral-science/">‘New course on generative AI for behavioral science’</a> blog post.</p>

<p><ins id="update-2026-04-04">Update (April 4, 2026)</ins>: added the reference to <a href="https://arxiv.org/abs/2601.02671">Ahmed <em>et al.</em> (2026)</a> after reading <a href="https://nataliebhogg.com/2026/03/09/find-the-stable-and-pull-out-the-bolt/">Natalie Hogg's blog post on LLMs</a>, which I found via <a href="https://ergosphere.blog/posts/the-machines-are-fine/">Minas Karamanis' blog post on the same topic</a> (which I found via <a href="https://pinboard.in/u:cshalizi/b:0aa0e4be9ed7">Cosma Shalizi</a> and <a href="https://pinboard.in/u:kjhealy/b:b0fefbddc1e7">Kieran Healy</a>). I also wrote a short introduction to this post <a href="https://politbistro.hypotheses.org/8651">on my other blog</a> (in French).</p>
]]></summary>
</entry>
<entry>
<title>Tutorials in Applied Statistics with R (and RStudio)</title>
<link href="https://f.briatte.org/r/applied-statistics-with-r-and-rstudio" ></link>
<id>urn:uuid:4d17241d-7e8a-6aca-6e85-827aa232ae32</id>
<updated>2026-02-27T00:00:00+01:00</updated>
<summary type="html" ><![CDATA[<p>This note documents my ongoing <a href="https://github.com/briatte/asr">Tutorials in Applied Statistics with R (and RStudio)</a>, which are aimed at first-year undergraduate social scientists.</p>

<p>Three years ago, I published a <a href="/r/data-science-with-r-and-rstudio">Data Science with R</a> course that has gone through a few iterations since then.</p>

<p>This year, I started teaching a short series of eight tutorials that cover more or less the same ground, although the audience is now first-year undergraduate (instead of postgraduate) students in political science.</p>

<p>The tutorials come with a lecture, taught by two colleagues (I teach two slightly different versions of the tutorials for each of them), which makes up time to focus on R + RStudio in class.</p>

<h2>Teaching material</h2>

<p>As usual, every bit of the course that I feel comfortable posting online has been published as a <a href="https://github.com/briatte/asr">GitHub repository</a>.</p>

<p>Some of the material repeats examples used in other courses (including a <a href="https://f.briatte.org/teaching/syllabus-qss.pdf">‘failed’ course</a> that I was forced to teach last year), but much if not most of the material is brand new.</p>

<p>I have also worked on three documents, which are linked to in the <code>README</code> file of the repository, and which are hosted on Google Docs:</p>

<ul>
<li><a href="https://docs.google.com/document/d/1QxZUho_ZmsdlIJZ5x87sijz3f6JGiq7t-S_S2bd6NBg/edit?usp=sharing">What to learn and to revise for the tutorials</a></li>
<li><a href="https://docs.google.com/document/d/1pTb-IY5qYQvIURK8PJItLBQjRzCnMlO7YHOrDOPJZq8/edit?usp=sharing">Survey research project instructions</a> (if relevant)</li>
<li><a href="https://docs.google.com/document/d/1_JZ48kJnAo4etDJbOeISbGRqMXDdsxpalppB7woEY7w/edit?usp=sharing">Troubleshooting</a></li>
</ul>

<p>The second document, in particular, is very much connected to how the course is assessed in one of the versions of the course. The other version uses class exams, which I cannot publish beyond the 'mock exam' that is already online.</p>

<p>I might post the tutorial slides one day, if I manage to purge them of links that lead to student information, and can send the exercise solutions to whoever wants them.</p>

<h2>Possible improvements</h2>

<p>This is very much version ‘0.x’ of the tutorials. I might try to <a href="https://github.com/briatte/asr/issues">improve a few things</a> in the future. If the course could change its name to something more specific to social scientists, I would also welcome that.</p>

<p>As I wrote somewhere in the <code>README</code> file, the ultimate goal of the tutorials is to cover more or less the same content as Matthew Blackwell does in his <a href="https://gov51.mattblackwell.org/">Data Analysis and Politics</a> course at Harvard University.</p>

<hr />

<p>This year will be my 13th year teaching R and RStudio. Time flies!</p>
]]></summary>
</entry>
<entry>
<title>sfReapportion</title>
<link href="https://f.briatte.org/r/sfReapportion" ></link>
<id>urn:uuid:eedaa585-0769-7f3f-6575-c57ff64bdcbb</id>
<updated>2025-12-26T00:00:00+01:00</updated>
<summary type="html" ><![CDATA[<p>This note documents the release of the <a href="https://cran.r-project.org/package=sfReapportion"><code>sfReapportion</code></a> package, which performs areal-weighted interpolation on spatial objects such as census tracts and voting districts.</p>

<p>A colleague of mine recently shared some code for a <a href="https://github.com/briatte/selection-bv">research project</a> on the upcoming municipal elections in France, but the code required the <a href="https://github.com/joelgombin/spReapportion"><code>spReapportion</code></a> package, which has been hard to install and use for a few years, due to some of its dependencies, <a href="https://cran.r-project.org/package=maptools"><code>maptools</code></a> and <a href="https://cran.r-project.org/package=rgeos"><code>rgeos</code></a>, having been <a href="https://r-spatial.org/r/2022/04/12/evolution.html">retired</a> in favour of the <a href="https://r-spatial.github.io/sf/index.html"><code>sf</code></a> package.</p>

<p>The <a href="https://github.com/joelgombin/spReapportion"><code>spReapportion</code></a> package, which performs <a href="https://cloud.r-project.org/web/packages/areal/vignettes/areal-weighted-interpolation.html">areal-weighted interpolation</a>, was coded by <a href="https://github.com/joelgombin">a friend of mine</a>. I decided to port his package in order for it to lose its retired dependencies, and to have it accept <a href="https://r-spatial.github.io/sf/index.html"><code>sf</code></a> objects as well as <a href="https://cran.r-project.org/package=sp"><code>sp</code></a> ones. The result is available <a href="https://cran.r-project.org/package=sfReapportion">on CRAN</a> and <a href="https://github.com/briatte/sfReapportion">on GitHub</a> as the <code>sfReapportion</code> package.</p>

<p>In parallel, I rewrote my other colleague's code in order to use that new package and to perform several other improvements. The first set of maps shown below comes from early results obtained with that code, which is also <a href="https://github.com/briatte/selection-bv">on GitHub</a>.</p>

<h2>Rationale</h2>

<p>In France as in many if not most other countries, the census tracts, which are called <a href="https://www.insee.fr/fr/metadonnees/definition/c1523">IRIS</a>, are spatially incongruous with voting districts. If one wants to use data collected at the tract-level with voting data collected at the district-level, then one first has to interpolate/reapportion that data to the spatial boundaries of voting districts.</p>
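
<p>As a rough illustration of what reapportioning counts involves, here is a minimal base R sketch. It does not use the <code>sfReapportion</code> package: the overlap areas and population counts are made up, and it assumes that every tract is fully covered by the districts.</p>

<pre><code># Hypothetical overlap areas between 2 census tracts (rows)
# and 3 voting districts (columns); all values are made up
overlap &lt;- matrix(
  c(2, 2, 0,
    0, 1, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("tract_A", "tract_B"),
    c("district_1", "district_2", "district_3")
  )
)

# Hypothetical population counts, one per tract
population &lt;- c(tract_A = 1000, tract_B = 2000)

# Share of each tract's area that falls within each district (rows sum to 1)
shares &lt;- overlap / rowSums(overlap)

# Each tract's count is split across districts according to its area shares
district_pop &lt;- colSums(shares * population)
district_pop
</code></pre>

<p>With real data, the overlap areas would come from intersecting the two sets of polygons, which is the part that spatial packages take care of.</p>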

<p>The two maps below show the polling stations (or <em>bureaux de vote</em>, in French) of the city of <a href="https://en.wikipedia.org/wiki/Lille">Lille</a>, the boundaries of which have been stable for several years. Each map shows the results of a distinct principal components analysis, followed by a <a href="http://factominer.free.fr/factomethods/hierarchical-clustering-on-principal-components.html">hierarchical clustering of its principal components</a>.</p>

<p><img src="images/sfReapportion-example.png" alt="" /></p>

<p>The map on the left is the interesting one. The data used for the underlying principal components analysis come from the French official statistics agency, <a href="https://www.insee.fr/">Insee</a>, which publishes that data <a href="https://www.insee.fr/fr/statistiques/6543200">at the tract-level</a>. The data were reapportioned with <a href="https://cran.r-project.org/package=sfReapportion"><code>sfReapportion</code></a> in order to coincide with the boundaries of the voting districts.</p>

<h2>Features</h2>

<p>The <a href="https://github.com/joelgombin/spReapportion"><code>spReapportion</code></a> package can reapportion three kinds of data:</p>

<ul>
<li><strong>counts</strong>, e.g. number of working-age adults in a given geographic area</li>
<li><strong>proportions</strong>, e.g. percentage of pensioners in a given geographic area</li>
<li><strong>weighted points</strong>, e.g. number of residents at a given set of coordinates</li>
</ul>

<p>The latter case is the most complex one to illustrate. The two maps below show the <a href="https://en.wikipedia.org/wiki/20th_arrondissement_of_Paris">20th arrondissement of Paris</a>. The map on the left shows the spatial incongruity between its polling stations and its census tracts, whereas the map on the right also shows where the voters of that arrondissement live, according to the <a href="https://www.data.gouv.fr/datasets/bureaux-de-vote-et-adresses-de-leurs-electeurs"><em>Répertoire électoral unique</em></a> (REU).</p>

<p><img src="images/sfReapportion-20e.png" alt="" /></p>

<p>When interpolating from one (spatial) geometry to another, we want to take that information into account, in order to reapportion the data to the areas where actual observations are to be found. The results are starkly different once that correction is taken into account:</p>

<p><img src="images/sfReapportion-20e-results.png" alt="" /></p>

<p>The example above is based on approximate data, as we are looking at voter <em>addresses</em>, rather than at the exact number of voters at a given address, but the corrective effect is still notable and possibly sufficient for our purposes.</p>

<h2>Limitations</h2>

<p>The <a href="https://cran.r-project.org/package=sfReapportion"><code>sfReapportion</code></a> package has only been lightly tested when it comes to its weighted modes. However, the main function, which uses unweighted population counts by default, has been thoroughly tested, and its results have been successfully reproduced with the <a href="https://cran.r-project.org/package=areal"><code>areal</code></a> package.</p>

<p>The <a href="https://cran.r-project.org/package=sfReapportion"><code>sfReapportion</code></a> package only performs <a href="https://r-spatial.org/book/05-Attributes.html#sec-extensiveintensive">extensive</a> areal-weighted interpolation: for intensive or multiple (mixed) interpolation, users should turn to the <a href="https://cran.r-project.org/package=areal"><code>areal</code></a> package. Additional methods are also available from the <a href="https://cran.r-project.org/package=populR"><code>populR</code></a> package.</p>

<hr />

<p>I do not plan to update the <a href="https://cran.r-project.org/package=sfReapportion"><code>sfReapportion</code></a> package much, as it was coded for reproducibility purposes, but users might open issues on its <a href="https://github.com/briatte/sfReapportion">GitHub repository</a> in order to ask questions or suggest improvements.</p>

<p><ins id="update-2026-03-28">Update (March 28, 2026)</ins>: version 0.2.0 of the package has been submitted to CRAN and should become available there soon. This post has been updated to document some of the new features. The code used to produce the last two plots is available from <a href="https://gist.github.com/briatte/7b04f7e3c78ca9b581be8a46c2a20399">this Gist</a>, which expands on the code provided in the README file of the package.</p>
]]></summary>
</entry>
<entry>
<title>AI-generated code comes with security risks</title>
<link href="https://f.briatte.org/r/ai-generated-code-security-risks" ></link>
<id>urn:uuid:933262a5-5f08-c7b3-ff4d-0bac64a8970a</id>
<updated>2025-04-14T00:00:00+02:00</updated>
<summary type="html" ><![CDATA[<p>More and more people are using <a href="https://en.wikipedia.org/wiki/Generative_artificial_intelligence">AI-generated</a> code in their work, without necessarily understanding the security risks that comes with that practice.</p>

<h1>How AI-generated code happens</h1>

<p>Generative AI services such as <a href="https://en.wikipedia.org/wiki/ChatGPT">ChatGPT</a> use <a href="https://en.wikipedia.org/wiki/Large_language_model">Large Language Models</a> to generate computer code. These models are ‘trained’ against a dataset of <strong>publicly available code</strong>.</p>

<p>Many users of generative AI do not seem fully aware of what ‘publicly available code’ might contain, and therefore do not really seem aware, in turn, of the security risks that come with executing AI-generated code.</p>

<p>Using generative AI services in a learning environment such as academia raises many concerns. Security is only one of them.</p>

<h1>Where the security risk lies</h1>

<p><strong>Programming languages like R are not <a href="https://en.wikipedia.org/wiki/Sandbox_(computer_security)">sandboxed</a>.</strong> This means that these languages can execute malicious instructions such as ‘erase every image on the hard drive of that laptop,’ or ‘replace all occurrences of “Jewish” in that text with “<a href="https://en.wikipedia.org/wiki/Kike">kike</a>”.’</p>

<p>Sandboxing the execution of R code is <a href="https://www.jstatsoft.org/article/view/v055i07">possible</a>, but this is not how R runs by default.</p>

<h1>The risk is real and already active</h1>

<p>Just like human languages have already been ‘poisoned’ in various ways, some of the computer code that makes up the public codebase on which Large Language Models are trained has already been ‘poisoned’ in various ways.</p>

<p>One of the ways that this has happened is through software packages. <strong>It is very easy to bundle harmful or malicious code into a software package</strong>, and then to give it a name that resembles the name of a legitimate software package.</p>

<p>Executing R code that contains such a package will pose a security threat to the user, equivalent to that of opening emails or attachments sent by unknown sources. The consequences can be relatively innocuous, or extremely serious.</p>

<p>Both AI-generated code and inattentive users can be misled into referring to these harmful or malicious software packages in their own code. The vulnerability will be triggered when the code is executed.</p>
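
<p>One modest safeguard is to compare the package names found in a script against a list of names that one trusts, and to flag near-misses for review. The following is a minimal sketch that uses the base R <code>adist()</code> function: the misspelled names and the distance threshold of 2 are arbitrary choices made for illustration only.</p>

<pre><code># Names that we trust, and names found in some script
# (the misspelled ones are made up for this example)
trusted &lt;- c("ggplot2", "dplyr", "igraph", "sf")
requested &lt;- c("ggplot2", "dpylr", "igraf", "sf")

# Edit distance between each requested name and each trusted name
distances &lt;- adist(requested, trusted)

# Distance from each requested name to its closest trusted name
nearest &lt;- apply(distances, 1, min)

# Flag names that are close to a trusted name without matching it exactly
suspicious &lt;- requested[nearest &gt; 0 &amp; nearest &lt;= 2]
suspicious
</code></pre>

<p>A check of this kind will produce false positives, especially with short package names, so its output is best treated as a list of names to look up by hand rather than as a verdict.</p>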

<p>This scenario is not a view of the future. <strong><a href="https://www.theregister.com/2025/04/12/ai_code_suggestions_sabotage_supply_chain/">It is already happening.</a></strong></p>

<h1>Real-world example threat</h1>

<p>Even a user like myself, who has learnt how to code in R for research purposes, can very easily write up a malicious software package.</p>

<p>An example of such a software package might do the following:</p>

<ol>
<li>Scan all text files on disk for credit card information</li>
<li>Hide that information in a website address</li>
<li>Automatically open a Web browser and point it to that address</li>
<li>Collect the credit card information server-side</li>
<li>Delete as many files as possible on the hard drive</li>
</ol>

<p>The steps above can be executed without the user noticing at all, or might execute in part or in full before the user can stop them from happening.</p>

<p><strong>Privacy and security breaches of the sort are very easy to implement,</strong> and have been implemented in virtually all programming languages.</p>

<p>The risk is of course not limited to AI-generated code. Executing computer code from any untrusted source can lead to the same issues.</p>

<h1>How to minimise the risk</h1>

<p>R users should always check where their packages come from.</p>

<p>R users who use AI-generated code should be even more careful, and should also <strong>warn other users that their code was at least in part AI-generated.</strong></p>
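
<p>As a first and admittedly weak check, one can list the installed packages that neither ship with R nor record CRAN as their repository. This is a sketch, not a security audit: the <code>Repository</code> field is read from the <code>DESCRIPTION</code> file of each package, so it is informative rather than tamper-proof, and packages legitimately installed from other sources, such as GitHub, will also be listed.</p>

<pre><code># Installed packages, with the Repository field of their DESCRIPTION file
pkgs &lt;- installed.packages(fields = "Repository")
rownames(pkgs) &lt;- NULL
pkgs &lt;- as.data.frame(pkgs, stringsAsFactors = FALSE)

# Packages that ship with R have a Priority of "base" or "recommended";
# packages installed from CRAN normally record "CRAN" as their Repository
from_r &lt;- !is.na(pkgs$Priority)
from_cran &lt;- !is.na(pkgs$Repository) &amp; pkgs$Repository == "CRAN"

# Everything else deserves a closer look
to_review &lt;- pkgs[!from_r &amp; !from_cran, c("Package", "Version", "Repository")]
to_review
</code></pre>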

<hr />

<p>It goes without saying that I have never designed, and will never design, the kind of <a href="https://en.wikipedia.org/wiki/Attack_vector">attack</a> described in this note.</p>
]]></summary>
</entry>
<entry>
<title>Making R work in government</title>
<link href="https://f.briatte.org/r/r-in-government" ></link>
<id>urn:uuid:f1e39ceb-bd7a-76df-659a-05bb085cb42d</id>
<updated>2024-09-29T00:00:00+02:00</updated>
<summary type="html" ><![CDATA[<p>This year's <a href="https://www.auckland.ac.nz/en/science/about-the-faculty/department-of-statistics/ihaka-lecture-series.html">Ihaka Lecture</a> is about <a href="https://www.youtube.com/watch?v=GnEqv1mcNsk">making R work in government</a>. It was delivered by Peter Ellis, the Director of the Statistics for Development Division at the <a href="https://www.spc.int/">Pacific Community</a> (SPC).</p>

<p>A lot of the talk is based on very direct comparisons between R and other software:</p>

<p><img src="images/r-in-government-1.png" alt="" /></p>

<p>The purpose of these comparisons is often to assert that R can do many forms of government analytics better than other software such as a spreadsheet editor (typically, Microsoft Excel). In this case, R replaces the older solution.</p>

<p>However, the talk also provides multiple examples of situations where R will have to be articulated with other tools, such as SQL or JavaScript:</p>

<p><img src="images/r-in-government-2.png" alt="" /></p>

<p>In these cases, R integrates with the solutions in place. This is a very likely scenario in most organizations, and one that should be given a lot of attention when teaching R to anyone who is already immersed in an analytics workflow such as government analytics.</p>
]]></summary>
</entry>
<entry>
<title>igraph 2.0.0</title>
<link href="https://f.briatte.org/r/igraph-2-0-0" ></link>
<id>urn:uuid:51a839c4-a3c5-2fd3-f606-823b94c50243</id>
<updated>2024-06-13T00:00:00+02:00</updated>
<summary type="html" ><![CDATA[<p>The <a href="https://r.igraph.org/"><code>igraph</code></a> R package has reached version 2.0.0.</p>

<p>The <code>igraph</code> package is based on a C library, which is now <a href="https://igraph.org/2024/05/21/rigraph-2.0.0.html">fully available</a> under the newer versions of the package:</p>

<blockquote>
  <p>This major release brings development in line with the <a href="https://igraph.org/c/">igraph C library</a>. Version 1.6.0 of the R package used version 0.9.10 of the C core. The changes in the 0.10 series of the C core are now taken up in version 2.0 of the R package.</p>
</blockquote>

<p>There are <a href="https://r.igraph.org/news/index.html#breaking-changes-2-0-0">a few breaking changes</a>, but not that many.</p>

<p>A lot of work has been done upstream to have the authors of packages that use <code>igraph</code> update their own code if needed. Thanks to the help of <a href="https://github.com/krlmlr">Kirill Müller</a>, I recently updated my own <a href="https://github.com/briatte/ggnetwork"><code>ggnetwork</code></a> package <a href="https://github.com/briatte/ggnetwork/commit/fc0c8edd2a80d4faa192d29d33dba30add5259be">to that effect</a>.</p>

<p>The <code>igraph</code> package has arguably the same stature in the <a href="/r/mapping-the-r-software-ecology">R software ecology</a> as the <a href="https://networkx.org/"><code>networkx</code></a> package has in the Python software ecology. It is great news for the R community that it has made such progress in the recent months/years.</p>
]]></summary>
</entry>
<entry>
<title>A security issue with R serialization</title>
<link href="https://f.briatte.org/r/security-issue-with-r-serialization" ></link>
<id>urn:uuid:1720255a-98d3-f17f-f28c-ead9da39fb81</id>
<updated>2024-05-24T00:00:00+02:00</updated>
<summary type="html" ><![CDATA[<p>A security issue has been found with how the R language serializes objects, and patched since.</p>

<p>The security issue is documented under <a href="https://www.cve.org/CVERecord?id=CVE-2024-27322">CVE-2024-27322</a>. It affects the serialization functions that were advertised in an <a href="use-the-rds-format">earlier note</a>.</p>

<p>The R Core Team <a href="https://blog.r-project.org/2024/05/10/statement-on-cve-2024-27322/">recently reported</a> that the issue has been fixed as of R 4.4.0, and that ‘any attack vector associated with it has been removed.’</p>
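
<p>A quick way to find out whether a given R installation includes the fix is to compare its version number against 4.4.0, which is the version cited in the statement above. The sketch below does only that, and then runs a round trip through the RDS format with a file created locally. It says nothing about files obtained from elsewhere, which should be treated with care regardless of the R version.</p>

<pre><code># According to the R Core Team statement, the issue is fixed as of R 4.4.0
patched &lt;- getRversion() &gt;= "4.4.0"

if (!patched) {
  message(R.version.string, " predates the fix: consider upgrading")
}

# Round trip through the RDS format, using a file that we created ourselves
path &lt;- tempfile(fileext = ".rds")
saveRDS(head(mtcars), path)
restored &lt;- readRDS(path)
identical(restored, head(mtcars))
</code></pre>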

<p>This episode is a reminder that R is a programming language, and as such, that it raises the same security concerns as any other programming language.</p>

<p>Slightly over a decade ago, these concerns led Jeroen Ooms to develop <a href="https://doi.org/10.18637/jss.v055.i07">the <code>RAppArmor</code> package</a>, in order to enable users to restrict the execution environment of R through dynamic sandboxing.</p>

<p><ins id="update-2024-05-28">Update (May 28, 2024)</ins>: thanks to R Weekly for <a href="https://rweekly.org/2024-W22.html">mentioning this note</a>.</p>
]]></summary>
</entry>
<entry>
<title>Data Science with R (and RStudio)</title>
<link href="https://f.briatte.org/r/data-science-with-r-and-rstudio" ></link>
<id>urn:uuid:5467b250-d644-88a9-87be-14d124c7cd52</id>
<updated>2023-08-23T00:00:00+02:00</updated>
<summary type="html" ><![CDATA[<p>This blog has been silent for a while, and the Covid-19 pandemic has forced me to ditch my <a href="/r/a-personal-r-to-do-list-for-2021">R to-do list for 2021</a>. I did, however, manage to assemble a few R-related things in the past couple of years. This note documents the main one, a <a href="https://github.com/briatte/dsr">Data Science with R (and RStudio)</a> course aimed at social scientists.</p>

<h2>Historical side note</h2>

<p>Around two years ago, I was offered the chance to teach R again <a href="https://www.sciencespo.fr/">at Sciences Po, in Paris</a>, in a spirit close to the <a href="https://f.briatte.org/teaching/quanti/">Stata-based course</a> that I have been teaching there for over ten years.</p>

<p>I first taught R to social scientists <a href="/teaching/ida/">in 2013</a>, but had not repeated the experience since then, except through various short and often focused workshops. I almost got to teach such a course in 2017, just as <a href="https://posit.co/blog/announcing-rstudio-v1-0/">RStudio Desktop was turning 1.0</a>, but that course failed to materialize.</p>

<p><em>Many</em> things have changed since 2013, and there is now much higher demand to teach R (and RStudio) to social science audiences. R and RStudio have improved a lot, and the <a href="https://www.tidyverse.org/">tidyverse</a>, which <a href="https://www.tidyverse.org/blog/2023/03/tidyverse-2-0-0/">recently turned 2.0</a> while still <a href="https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23/">changing a lot</a>, has become a core component of most courses, including mine.</p>

<h2>Teaching material</h2>

<p>My own attempt to teach R, RStudio and the tidyverse in 2023 has been online for a few months, in the form of a <a href="https://github.com/briatte/dsr">GitHub repository</a> with a few <a href="https://github.com/briatte/dsr/wiki">wiki pages</a>, including a long list of <a href="https://github.com/briatte/dsr/wiki/readings">readings, videos and Web links</a>, and another list of <a href="https://github.com/briatte/dsr/wiki/elsewhere">other R courses</a>.</p>

<p>I have also uploaded a <a href="/teaching/syllabus-dsr.pdf">tentative syllabus</a> for the course:</p>

<p><a href="/teaching/syllabus-dsr.pdf"><img src="/r/images/data-science-r-syllabus.png" alt="" /></a></p>

<p>The course has only run once so far, and there are many <a href="https://github.com/briatte/dsr/issues">issues</a> with it that I will try to fix in the coming months. The repository is also missing some essential course items (the slides, and the solutions to the exercises), which I am happy to share privately by email.</p>

<p>A cool aspect of the course is that another instructor, <a href="https://antuki.github.io/">Kim Antunez</a>, will be teaching her own fork of it in the next few weeks. Kim has invested a lot into turning the course into a full-fledged <a href="https://quarto.org/docs/websites/">Quarto website</a>, which I will share in a follow-up post once she is done building it.</p>

<p>My own way of teaching the course is more old-school, as I rely on weekly emails and a shared Google Drive folder. I will, however, put some effort into improving the slides and giving the course a Web page, in order to make it more fully and easily accessible online.</p>

<h2>Going forward</h2>

<p>I feel that I already have enough material to assemble a more advanced R course for social scientists, but first need to streamline this introductory course a bit more, in order to make the reading list, especially, a bit more focused and manageable.</p>

<p>I also feel that there will soon be more changes to the tidyverse that I will have to take into account. I still, for instance, use the <a href="https://cran.r-project.org/package=magrittr"><code>%&gt;%</code> pipe</a> to chain operations, whereas the current trend is to use the <a href="https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/">native <code>|&gt;</code> pipe, introduced in R version 4.1.0</a>, whenever possible.</p>
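
<p>As a minimal sketch of the difference (the vector and object names below are made up for illustration), both pipes pass their left-hand side as the first argument of the next call, so simple chains can usually be translated one-to-one:</p>

<pre><code class="language-r"># Illustrative example only: x, with_magrittr and with_native are made-up names.
library(magrittr) # provides %&gt;%; the native |&gt; pipe needs R &gt;= 4.1.0

x &lt;- c(1, 4, 9, 16)

# magrittr pipe
with_magrittr &lt;- x %&gt;% sqrt() %&gt;% sum()

# native pipe
with_native &lt;- x |&gt; sqrt() |&gt; sum()

# Both chains are equivalent to the nested call sum(sqrt(x))
identical(with_magrittr, with_native)
</code></pre>

<p>The two pipes are not fully interchangeable in more complex cases (placeholders, for instance, work differently), so a course that switches pipes would still need its code checked chain by chain.</p>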

<hr />

<p>This note is tangentially related to my previous notes on <a href="/r/teaching-with-rstudio">teaching with RStudio</a>, on <a href="/r/r-as-a-data-science-language">R as a data science language</a> and on <a href="/r/technologies-for-data-science">other technologies for data science</a>.</p>
]]></summary>
</entry>
<entry>
<title>A personal R to-do list for 2021</title>
<link href="https://f.briatte.org/r/a-personal-r-to-do-list-for-2021" ></link>
<id>urn:uuid:9101f23e-8408-56bf-024b-e66037d12090</id>
<updated>2020-12-13T00:00:00+01:00</updated>
<summary type="html" ><![CDATA[<p>This note lists the main things that I will be doing with R next year.</p>

<p>I took some kind of a break from R over the past 18 months. I plan to change that this coming year, and have compiled the following list of things that I want to explore, or come back to.</p>

<h2>R Markdown</h2>

<p>While I have fully transitioned towards “tidy data” and <a href="https://www.tidyverse.org/">its wonderful packages</a>, I am still not the type of R user who works in <a href="https://rmarkdown.rstudio.com/">R Markdown</a> documents (notebooks), as <a href="https://www.youtube.com/user/safe4democracy">David Robinson</a> so brilliantly illustrates in his videos.</p>

<p>I hope to get there next year, because it makes sense from a reproducibility perspective.</p>

<h2>Panel models</h2>

<p>Most of my <a href="https://f.briatte.org/teaching/#quanti">quantitative methods courses</a> involve either basic frequentist regression models, or models used on panel data by political scientists, who have imported a lot of their modelling standards from econometrics and political economy. The models and replication material that come with published articles are mostly coded in <a href="https://www.stata.com/">Stata</a>.</p>

<p>One thing that I need to do this coming year is to check how easy (or difficult) it is, in 2021, to replicate those Stata-coded panel data models in R. This will involve looking mostly at <a href="https://www.jstor.org/stable/2082979">cross-section time-series</a> (CSTS) data, usually measured at the country level, and fitting some of the regression models available in Stata.</p>

<h2>Web scraping</h2>

<p>I have been wanting to upgrade my Web scraping skills for some time, and while my initial plan was to learn enough <a href="https://www.python.org/">Python</a> to switch to that language (and its excellent libraries) for Web scraping, I keep coming back to R due to a lack of proper learning time and too few professional incentives to code in Python.</p>

<p>The specific thing that I will be coming back to is <a href="https://en.wikipedia.org/wiki/Headless_browser">headless browsing</a>.</p>

<h2>Network models</h2>

<p>Network models, and <a href="/r/exponential-random-graph-models-with-r">exponential random graph models</a> (ERGMs) in particular, have improved a lot in the past few years. I have followed the literature at a distance, and need to dive into it again, especially for the part that focuses either on generalizing ERGMs beyond binary responses, or on taking time (temporal dependence) into account.</p>

<h2>Bayesian models</h2>

<p>I received my copy of Gelman, Hill and Vehtari's <em><a href="https://avehtari.github.io/ROS-Examples/">Regression and Other Stories</a></em> this summer, and do not want that book to end up with Harrell's <em><a href="https://www.springer.com/gp/book/9783319194240">Regression Modeling Strategies</a></em> and McElreath's <em><a href="https://xcelab.net/rm/statistical-rethinking/">Statistical Rethinking</a></em> on my list of books that I want to read, but might never get to. (The list is much longer than that, and also has everything by <a href="https://en.wikipedia.org/wiki/Trevor_Hastie">Hastie</a> and <a href="https://en.wikipedia.org/wiki/Robert_Tibshirani">Tibshirani</a> on it.)</p>

<p>Let's call this to-do item my annual attempt at <a href="https://www.fharrell.com/post/journey/">transitioning further towards Bayesianism</a>, which is made easy in R thanks to the <a href="https://mc-stan.org/rstanarm/"><code>rstanarm</code></a> package, to Bürkner's <a href="https://paul-buerkner.github.io/brms/"><code>brms</code></a> package, to Harrell's <a href="https://hbiostat.org/R/rmsb/"><code>rmsb</code></a> package, and to McElreath's <a href="https://github.com/rmcelreath/rethinking"><code>rethinking</code></a> package.</p>

<h2>Machine learning</h2>

<p>The <a href="https://www.tidymodels.org/"><code>tidymodels</code></a> framework and <a href="https://www.youtube.com/channel/UCTTBgWyJl2HrrhQOOc710kA">Julia Silge's videos</a> offer a nice invitation to dive deeper into those things that many of us explored when "machine learning" was <em>the</em> keyword that any aspiring methodologist (or data scientist, or anything else) had to know something about.</p>

<p>Going back to learning about machine learning is something that I look forward to, and which I plan to do while looking again at Cosma Shalizi's course on <a href="http://www.stat.cmu.edu/~cshalizi/dm/19/">data mining</a> from 2019.</p>

<hr />

<p>The to-do list above will have to find its place alongside coding in Stata for one of my oldest courses, plus reading about many other things that have little to do with R or statistics. Let's see if that will happen.</p>

<p><ins id="update-2020-12-14">Update (December 14, 2020)</ins>: corrected the authors of the <code>brms</code> package, with thanks to Dieter Menne, who spotted and reported the error.</p>
]]></summary>
</entry>
<entry>
<title>Remember the change in the sample() function of R 3.6.0</title>
<link href="https://f.briatte.org/r/change-in-sample-function-r-3-6-0" ></link>
<id>urn:uuid:bd5bd043-f224-4b43-b9ae-70385567f9b3</id>
<updated>2019-08-07T00:00:00+02:00</updated>
<summary type="html" ><![CDATA[<p>This note documents how the <code>sample()</code> function has changed since R 3.6.0, and how to reproduce its previous behaviour.</p>

<p>A <a href="https://xianblog.wordpress.com/2019/05/21/biased-sample/">recent blog post by Christian Robert</a> reminded me that R had to fix its <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html"><code>sample()</code></a> function in <a href="https://stat.ethz.ch/pipermail/r-announce/2019/000641.html">R 3.6.0</a> and above.</p>

<p>The issue that used to affect the pseudo-random number generator (<a href="https://en.wikipedia.org/wiki/Pseudorandom_number_generator">PRNG</a>) at the core of the function is documented in a note by Kellie Ottoboni and Philip B. Stark, “<a href="https://arxiv.org/abs/1809.06520">Random problems with R</a>,” which was extensively discussed <a href="https://stat.ethz.ch/pipermail/r-devel/2018-September/076817.html">on the R-devel mailing-list</a>.</p>

<p>The note explains that (part of) the <a href="https://github.com/wch/r-source/blob/efed16c945b6e31f8e345d2f18e39a014d2a57ae/src/main/RNG.c#L787-L791">PRNG used by R 3.5.1</a> does not correct for the <a href="https://www.exploringbinary.com/the-spacing-of-binary-floating-point-numbers/">uneven spacing</a> of binary <a href="https://en.wikipedia.org/wiki/Floating-point_arithmetic#IEEE_754:_floating_point_in_modern_computers">floating-point numbers</a>. The resulting <a href="https://en.wikipedia.org/wiki/Quantization_(signal_processing)">quantization</a> effect/error produces biased selection probabilities, severely enough for the PRNG not to qualify as adequately pseudo-random.</p>
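
<p>To get an intuition for that kind of bias, here is a deliberately simplified toy example. It is <em>not</em> R's actual generator: it assumes a hypothetical PRNG that can only return 16 equally spaced values, and maps them to 6 outcomes by multiplying and truncating.</p>

<pre><code class="language-r"># Toy illustration (NOT R's actual generator): a hypothetical PRNG that can
# only return 16 equally spaced values in [0, 1), i.e. 4 bits of randomness.
bits &lt;- 4
u &lt;- (0:(2^bits - 1)) / 2^bits

# Map each possible value to one of n = 6 outcomes by multiplying and
# truncating (a simplified stand-in for a "multiply, then round down" sampler).
n &lt;- 6
outcome &lt;- floor(n * u) + 1

# Number of grid points that land on each outcome
counts &lt;- table(outcome)
counts

# Implied selection probabilities, to compare against the uniform 1/6
counts / length(u)
</code></pre>

<p>Since 16 grid points cannot be split evenly between 6 outcomes, some outcomes necessarily get more grid points than others. With real generators, the grid is vastly finer and the bias is correspondingly smaller for small <code>n</code>, but it grows as <code>n</code> gets closer to the size of the grid.</p>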

<p>The issue got <a href="https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494">patched</a> in R <a href="https://stat.ethz.ch/pipermail/r-announce/2019/000641.html">3.6.0</a>, which led to the introduction of a method for reproducing the former behaviour of the PRNG. The method is well documented on R blogs like <a href="https://blog.revolutionanalytics.com/2019/05/whats-new-in-r-360.html">Revolution Analytics</a> or J. Kenneth Tay's <a href="https://statisticaloddsandends.wordpress.com/2019/06/19/rs-sample-function-works-differently-from-r-3-6-0-onwards/">Statistical Odds &amp; Ends</a>, and consists of adjusting the <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Random.html"><code>RNGkind</code></a> settings before calling the <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html"><code>sample()</code></a> function:</p>

<script src="https://gist.github.com/briatte/0b542cadfa76b5cbc2231f1e6106b42c.js"></script>

<p>This is of course only useful if one depends on a particular random number generator and seed number, as set through <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Random.html"><code>set.seed</code></a>, to reproduce the behaviour and results of R code written and executed before R 3.6.0.</p>
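
<p>For readers who cannot load the embedded gist, here is a minimal sketch of the same idea. It is not necessarily identical to the gist, and the seed and object names are arbitrary choices of mine:</p>

<pre><code class="language-r"># Requires R &gt;= 3.6.0, where the sample.kind argument was introduced.

# Make sure that the default sampler of R &gt;= 3.6.0 ("Rejection") is in use
RNGkind(sample.kind = "Rejection")
set.seed(1)
new_draw &lt;- sample(10)

# Switch to the pre-3.6.0 sampler ("Rounding"); R warns that this sampler
# is non-uniform, hence the call to suppressWarnings()
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(1)
old_draw &lt;- sample(10)

# Restore the default sampler for the rest of the session
RNGkind(sample.kind = "Rejection")

# Same seed, but the two samplers are expected to return different draws
identical(new_draw, old_draw)
</code></pre>

<p>The <code>set.seed</code> function also accepts a <code>sample.kind</code> argument in R 3.6.0 and above, which sets the seed and the sampler in a single call.</p>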

<p>At that stage, you might be tempted to add <a href="https://twitter.com/coolbutuseless/status/1150172942963073024">generating random numbers</a> to the short list of <a href="https://skeptics.stackexchange.com/questions/19836/has-phil-karlton-ever-said-there-are-only-two-hard-things-in-computer-science">hard things in computer science</a>—which would be, in my view, entirely correct.</p>

<hr />

<p>This note is obviously of the ‘note to self’ kind. The previous one in that category (successfully) aimed at reminding me to <a href="/r/use-the-rds-format">use the RDS format</a>.</p>

<p><ins id="update-2019-08-14">Update (August 14, 2019)</ins>: to change the local state of the PRNG without affecting its global state, see <a href="http://www.questionflow.org/2019/08/13/local-randomness-in-r/">this note</a> by Evgeni Chasnovski.</p>

<p><ins id="update-2019-08-14">Update (August 14, 2019)</ins>: thanks to R Weekly for <a href="https://rweekly.org/2019-32.html">mentioning this note</a>.</p>
]]></summary>
</entry>
</feed>