Efficient R Project Management for Small Teams

mbayly

Sun, 01/14/2024 - 04:51

Course Dates: [Customized for groups - contact to book]
Location: [Vancouver/Lower Mainland BC - contact to book]
Course Duration: 7.5 hours (6.5 hours of course material and 1 hour for lunch)
Course Fees: Cost per person is scaled to group size: $100 (for a group of 8), $150 (for a group of 6), $250 (for a group of 4) and $400 (for a group of 2).
Prerequisites: All participants should classify themselves (generally) as "R users".

Streamlining R Projects: Collaborative Strategies and Efficient Structures for Small Teams

R continues to grow in popularity as the data analysis workhorse in the field of environmental management and monitoring. Over the last 15 years, R has developed from a small scripting language for academics into a tool widely used by data scientists across disciplines. The rise in popularity is primarily due to R’s exceptional data analysis and visualization capabilities and its capacity to process large datasets, including large volumes of complex spatial data. The number of new R packages released on CRAN each year continues to climb, and just about every graduate student entering the workforce is also well-versed in R.

However, despite the strengths and popularity of R, internal teams and working groups often struggle to manage larger projects efficiently. This is especially true of “brownfield” projects, where a code base is transferred between individuals, and of cases where last-minute changes need to be implemented for a revision request. Challenges often stem from complex file structures, a lack of version control or documentation, or verbose code that is difficult to follow. In some instances, teams have found that rewriting code from scratch is faster than trying to reverse-engineer an unstructured codebase. Additionally, minor errors can easily slip into code and hide like a needle in a haystack, resulting in significant issues that are almost impossible to catch, even with multiple detailed code reviews.

These complications are unfortunate; however, they are also largely avoidable. I am here to teach you about a system of (relatively easy) solutions and frameworks your team can follow to overcome these challenges. I have worked with large and complex R projects in small team settings for over ten years. I’ve seen everything, but I have also worked hard to develop solutions that merge concepts from professional software architecture with fast-paced data analysis exercises and workflows. These systems do not require your team to invest additional time and effort into each project but instead focus on leveraging existing tools to do things slightly differently.

I put this course together because I felt that existing resources were not well suited to small teams working collaboratively on R projects for general data analysis. Over time, it became clear that efficient structures are essential for collaborative workflows. Not only can these structures and systems be implemented without added costs, but they also save teams significant time and hardship and deliver substantial long-term benefits. This course helps teams move away from over-reliance on one person to “run the code” and from situations where it becomes too hard to implement new features given existing complexities in the code base (i.e., the complexity asymptote). Throughout this course, we will work towards developing project structures that can be quickly picked up years into the future or passed off entirely. The key underlying theme is a focus on keeping the code simple, supported by an understanding of several other vital systems and principles that work together to achieve this:

The R Package File Structure: All external R packages follow a standardized file structure. The R package file structure offers incredible opportunities if integrated into existing projects. We will introduce the R package file structure and show how we can quickly make unpublished R packages to store reusable functions and files within our organizations. We will then review alternative file structures for various analysis projects and show how we can combine these two frameworks to transform a hairy and complex project into something that reads like a children’s book.
Version Control with Git: Git is one of the most widely used version control systems. However, despite its numerous benefits, attempts to onboard and train new team members to use Git can often be unnecessarily challenging. Most online training resources are geared toward large teams of full-time developers (e.g., Google employees) and vastly overcomplicate workflows for small teams focused on data analysis. There are hundreds of Git commands, but we will focus on just four. We will also focus on the .gitignore file. The .gitignore file is rarely discussed in introductory courses on Git version control, but it is arguably the most critical component for managing inputs and outputs in a data analysis project.
Documentation and Workflows: There are constant calls to “write good documentation”, but what exactly does that mean for small teams with too much work and tight deadlines? The qualifier “good documentation” is too ambiguous and unstructured. We will go over standardized R package documentation with roxygen2. We will also describe where documentation is most useful and where it may be a poor use of time. The focus here will be on efficiency and reusability.
Unit Tests and Code Review: Large analysis pipelines often traverse numerous files, functions, and datasets. As with “good documentation”, the concept of “a code review” to catch errors, issues, and bugs is not only ambiguous but, in many cases, impossible to carry out without investing numerous hours to reverse-engineer each line. Even then, the probability that a colleague will catch an error in the code (if one exists) is extremely low. Fortunately, better systems exist for running a code review faster and more efficiently with a predefined structure. In this course, we will introduce the concept of unit tests with testthat and show how we can implement them on functions embedded within the R package file structure. We will focus on where and when to apply unit tests and how to set up a system where we can run all tests automatically each time we change something in the code or rerun our analysis.
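
To give a flavour of how the documentation and unit-test pieces fit together, here is a minimal sketch in R. The function mm_to_cm() and its test are hypothetical illustrations and not course material; the sketch assumes the testthat package is installed. In a package layout, the function would typically live in a file under R/ and the test in a file under tests/testthat/.

```r
library(testthat)

# Hypothetical helper function, documented with roxygen2-style comments.
# In a package, roxygen2 turns these "#'" lines into a help page.

#' Convert lengths from millimetres to centimetres
#'
#' @param length_mm Numeric vector of lengths in millimetres.
#' @return Numeric vector of lengths in centimetres.
#' @export
mm_to_cm <- function(length_mm) {
  # Fail early on bad input rather than silently returning nonsense
  stopifnot(is.numeric(length_mm))
  length_mm / 10
}

# A unit test: it records the expected behaviour once, so it can be
# re-checked automatically whenever the code changes.
test_that("mm_to_cm converts values and rejects non-numeric input", {
  expect_equal(mm_to_cm(c(10, 25)), c(1, 2.5))
  expect_error(mm_to_cm("10"))
})
```

Inside a package project, calling devtools::document() regenerates the help files from the roxygen2 comments, and devtools::test() runs every test file in one step.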

The course is meant to strengthen the structural foundations of a collaborative analysis team. Each component is geared towards engineers, ecologists, grad students, and other professionals who use R frequently but don’t have time to sift through the literature on software architecture to design customized systems and workflows that work for their team. The long-term goal of this course is to help your organization build projects with clean, small, and discrete functions and packages that can be reused for years into the future. The concepts and principles will have long-term benefits and help an evolving team strengthen its foundational projects, programs, and services.
