Brief Thoughts on R as a Beginner - Research Data Services

In this series, we’ll cover different aspects of the R programming language. This second post describes my experience learning R as a beginner programmer without a science or statistics background, and collects what I think might be helpful for other beginners with a similar background. I welcome any feedback from those more experienced!

My biggest challenge in learning the R language was that although I have prior basic programming experience, R is truly a language for statisticians and mathematicians. At face value, it looks like it should act like other programming languages you’re familiar with, but it’s important to remember that the purpose of R is data oriented. If you have a statistical or sciences background and are comfortable with SAS, Matlab, or other statistical languages and environments, R will likely be a little easier to wrap your head around. However, if you are coming from another background, there may be a learning curve.

Below I will run through a brief overview of some thoughts and technical issues that challenged me this semester. I am still a beginner in programming, so I’m not yet equipped to discuss in depth the conceptual differences when working with a programming language versus a statistical language, but hopefully the observations will still be helpful.

Basic Thoughts:

You will see this everywhere when you start looking at R resources – R has a goofy and inconsistent syntax. You will often see <- instead of = for an assignment operator. This is a historical holdover and though at first I really didn’t like it, I now find myself using it exclusively. The R blog, Revolutions, gives a great quick history on the reasons that R uses this quirky assignment operator. For more information on inconsistencies and general quirks, this blog post is great.
R feels too flexible and dense to me. It allows missing values and empty structures, and if I really want to push through an error it will let me. The flexibility makes it somewhat freeing to engage with the language in a new way, but it also means that I had to account for problems I was not accustomed to. By dense, I mean that R never quite acts how you think it will, and the documentation and naming often aren’t much help.
Recycling as a concept was something I struggled with until midway through the semester. Recycling means that R will allow operations between vectors of different lengths: when it reaches the end of the shorter vector, it starts over from that vector’s first value and keeps going until the longer vector is used up. In practice, however, I always found myself using trial and error to get the values to come out the way I wanted. It was extremely difficult for me to predict how the two vectors would line up and to choose inputs that produced the output I expected.
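To make the points about missing values and recycling more concrete, here is a minimal sketch. The vector names (readings, long, short) are made up for illustration, and the outputs in the comments are what I would expect from base R.

```r
# Missing values are allowed, but they propagate unless you ask otherwise.
readings <- c(4, NA, 7)
mean(readings)                # NA, because one value is missing
mean(readings, na.rm = TRUE)  # 5.5, missing value dropped first

# Empty structures are also legal and quietly produce "empty" answers.
empty <- numeric(0)
length(empty)                 # 0
sum(empty)                    # 0

# Recycling: the shorter vector is reused from its start
# until it has covered the longer one.
long  <- c(10, 20, 30, 40, 50, 60)
short <- c(1, 2)
long + short                  # 11 22 31 42 51 62

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning.
uneven <- suppressWarnings(c(1, 2, 3, 4, 5) + c(10, 20))
uneven                        # 11 22 13 24 15
```

Adding a single number to a vector, as in y + 1, is the same mechanism with a shorter vector of length one.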

Data Structures:

Some of the data structures in R, like the matrix, array, and data frame (similar to a dataset), resembled concepts I had encountered elsewhere and so were fairly straightforward. However, some structures were entirely new to me.

Vectors –

This was a new data structure for me and formed the basis for a lot of the classwork for our semester. Vectors are the most basic one-dimensional data structure in R. A vector can hold any of several data types, but only one type at a time: you can have numeric vectors, character vectors, and logical (boolean) vectors. You can also assign names to the values in a vector and access them via those names. While vectors are actually quite simple, I found that it took some time to get used to working with them. For examples of a vector see below.

> x <- c(1, 2, 3)  # c() combines its arguments into a vector

> x

[1] 1 2 3

> y <- seq(from=1, to=10, by =2)

> y

[1] 1 3 5 7 9

> y <- y + 1

> y

[1]  2  4  6  8 10
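Named vectors and the one-type-per-vector rule can be sketched as follows. The names here (scores, mixed) are invented for illustration.

```r
# A numeric vector whose values carry names.
scores <- c(math = 90, art = 85, music = 78)

names(scores)               # "math" "art" "music"
scores["art"]               # the named element: art 85
scores[["art"]]             # just the value: 85
scores[c("math", "music")]  # several elements by name

# A vector holds one type. Mixing types does not fail;
# R silently converts everything to the most general type.
mixed <- c(1, "a", TRUE)
class(mixed)                # "character"
mixed                       # "1" "a" "TRUE"
```

The silent conversion in the last lines is one of the places where R's flexibility caught me off guard, so it is worth checking class() when a vector behaves strangely.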

Lists –

Lists can hold multiple data types, and my professor told us the easiest way to think of a list is as a train. The train is made up of multiple cars, and each car has values inside it that can be accessed by position, similarly to how you would access a matrix. While this also appears fairly straightforward, we had an assignment this semester where lists got messy and it was fairly difficult to pick apart the structures.

> x <- list(c(1, 2), c("Hello"))

> x

[[1]]

[1] 1 2

[[2]]

[1] "Hello"

To access the contents of x –

> x[[1]]

[1] 1 2

> x[[1]][[2]]

[1] 2
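The part of lists that tripped me up most was the difference between single and double brackets. Here is a small sketch in terms of the train analogy; the list name train and its element names are made up for illustration.

```r
# A list with named elements ("cars") of different types.
train <- list(numbers = c(1, 2), greeting = "Hello")

# Single brackets return a smaller train: still a list.
car <- train["numbers"]
class(car)                   # "list"

# Double brackets return what is inside the car: the vector itself.
contents <- train[["numbers"]]
class(contents)              # "numeric"

# $ is shorthand for [[ ]] with a name.
train$greeting               # "Hello"

# Once you have the contents, index them like any vector.
train[["numbers"]][2]        # 2

# str() prints a compact summary, which helps when lists get messy.
str(train)
```

When a list is nested several levels deep, calling str() on it before trying to index it can save a lot of trial and error.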

Working with Data:

In R, you have more flexibility to perform operations on entire datasets rather than looping through your data one element at a time, as you would in many non-statistical languages. For example, to find the values in a vector that are less than one, you can use the function which() to get a vector of the indices at which a logical condition you pass (x < 1) is true. You can then subset the data with those indices to retrieve the values. In other languages you would typically write a loop to complete this task, for example (written here in R syntax):

for (i in data) {

  if (i < 1) print(i)

}

However, in R this can be done with one function. If our vector is set as below:

> x <- c(0.5, 6, 9, 2.3, 0.9, 23, 0.07)

> x

[1]  0.50  6.00  9.00  2.30  0.90 23.00  0.07

You can access those values in this way:

> which(x<1)

[1] 1 5 7

> x[which(x<1)]

[1] 0.50 0.90 0.07
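As a small aside, the comparison x < 1 already produces a logical vector, and R lets you subset with that directly. This is a sketch of that shortcut using the same x as above; the name is_small is my own.

```r
x <- c(0.5, 6, 9, 2.3, 0.9, 23, 0.07)

# The comparison is applied to every element at once.
is_small <- x < 1
is_small             # TRUE FALSE FALSE FALSE TRUE FALSE TRUE

# Subsetting with the logical vector keeps the TRUE positions.
x[is_small]          # 0.50 0.90 0.07
x[x < 1]             # same result, written in one step

# TRUE counts as 1, so sum() tells you how many values matched.
sum(x < 1)           # 3
```

One difference to be aware of: if the data contains missing values, x[x < 1] keeps an NA for each of them, while x[which(x < 1)] drops them.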

I also encountered this in the apply functions (tapply, lapply, etc.), which are often preferred to loops in R because they are more compact; they are not necessarily faster than a well-written loop. The apply functions let you apply a function to each element of a structure, or to each group within your data. The different apply functions also take and return different structures depending on which one you’re working with. I’m extremely comfortable with loops from my previous programming experience, so I actually found the apply functions hard to learn. I suggest checking out R-bloggers as they have a lot of great information for R newbies; their posts on using tapply and the other apply functions will provide a better introduction than I can.
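For readers who, like me, think in loops, here is a rough side-by-side sketch of a loop and the apply functions that replace it. The data (values, heights, group) is made up for illustration.

```r
values <- list(a = c(1, 2, 3), b = c(10, 20))

# Loop version: compute the mean of each element, one at a time.
means_loop <- numeric(length(values))
for (i in seq_along(values)) {
  means_loop[i] <- mean(values[[i]])
}
means_loop                           # 2 15

# lapply does the same thing and returns a list.
means_list <- lapply(values, mean)   # list with a = 2, b = 15

# sapply simplifies the result to a named vector when it can.
means_vec <- sapply(values, mean)    # a = 2, b = 15

# tapply splits a vector into groups and applies a function to each group.
heights <- c(150, 160, 170, 180)
group   <- c("a", "b", "a", "b")
group_means <- tapply(heights, group, mean)
group_means                          # a = 160, b = 170
```

The main thing that differs between these functions is the shape of what goes in and what comes out, which is why it helps to check the result with str() or class().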

Although R can be quite difficult, it is also enjoyable. I suggest diving in, playing around, and taking advantage of the resources available to help you learn. Elizabeth Wickes (who has fantastic Python resources) recently pointed me to “R for Everyone” and my professor recommends “R in a Nutshell” and then “Advanced R” after you’ve had a little practice.

Research Data Services (RDS) is an interdisciplinary organization committed to advancing research data management practice on the UW-Madison campus. We focus on providing researchers with the tools and resources that support their efforts to store, analyze and share data.