R is a language targeted at statistics, but it has an interesting way of working with data. In this introduction to R we take a programmer's point of view and celebrate the fact that R is based on Lisp. First we need to get started with the most basic of R data types - the vector.
A Programmer's Guide To R
- Getting Started And The Vector
- Lists and more advanced data
- Working With Data
R is a language targeting statistics. It is open source, and hence free, and it seems to be taking over the world... well, the statistical world at least.
With the rise of "big data" as a serious application, stats and statistical languages are also becoming more important. If you have used SPSS, SAS, Mathematica or Maple to do stats, then it is worth looking at R as an alternative - if only because it is free. If you are considering becoming involved in "big data", starting out with R is a good idea.
First - how to characterise R?
The most important thing to realise is that R isn't a particularly ground-breaking language, but it does have some facilities that you might consider strange if you know a classical object-oriented language like C# or Java. It is targeted at a fairly narrow set of tasks, and as such its real power comes from having a range of functions that do statistical analysis. Many users can get all the results they need without ever worrying about any programming aspects of R.
So to the non-programmer R looks very simple, but as a programmer you are going to want to do more with it, so you really do need to get under the skin of R.
The best way to describe the language is as being related to Lisp, but with some additions to make it easier to use and to make sophisticated data easier to work with. Its basic data type is the List and this is used to implement all of the more sophisticated data structures needed for statistics and to make them easier to use. It is a typed language, and you could even say that it has too many definitions of "type", but it attempts to be weakly typed in the way that it treats its base data and its data structures.
Overall the language can be written in a procedural style with a very strong leaning towards the functional style. It is often said that R is object-oriented, but this is more wishful thinking on the part of its supporters than any real facilities provided. In reality it calls everything an object and provides a manual type system that is used to work out which form of a function should be used to process the data. This could be called "polymorphism", but you could just as well call it "dynamic generics" or something similar. Essentially it provides function overloading based on the first parameter. What is surprising is that this simple scheme, coupled with the user-friendly form of the List, turns out to be quite powerful and easy to use.
That is, R is not object-oriented in the usual meaning of the term and trying to understand it in this way is doomed to failure.
It is often said that R is a language that supports a procedural approach with the option of using an object-oriented approach. This seems to overstate the case. R is a language that supports a functional procedural approach with some typing to support simple function overloading.
If you try to treat it as an object-oriented language you are going to spend a long time looking for the objects as you know them.
For example, the entities referred to in R as objects don't have properties and methods that are specified by compound names. If L is a List (covered in the next article in this series) you write length(L) and not L.length() to retrieve its length.
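The idea of function overloading based on the first parameter can be sketched in a few lines. This is only an illustration: the generic function describe() and the class name "temperature" are made up for the example and are not part of R itself.

```r
# A hypothetical generic function: it hands the call on to a version
# chosen according to the class of its first argument.
describe <- function(x, ...) UseMethod("describe")

# Fallback version, used when no class-specific version exists.
describe.default <- function(x, ...) "some other object"

# Version chosen when the first argument has class "temperature".
describe.temperature <- function(x, ...) {
  paste("temperature of", unclass(x), "degrees")
}

# The "manual type system": a class is just a label you attach yourself.
t1 <- 21.5
class(t1) <- "temperature"

describe(t1)        # "temperature of 21.5 degrees"
describe(c(1, 2))   # "some other object"

# Functions take the object as an argument rather than hanging off it.
length(c(1, 2))     # 2
```

Notice that nothing here is a method belonging to an object; describe is an ordinary function that selects one of several implementations.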
As already said, R is mostly functional.
This said, everything in R is referred to as an object and this is appropriate because everything in R is a data structure that includes what you might call "metadata" that helps you determine how it should be treated. That is, in R there are no atomic or primitive data types, everything is a structure.
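A small sketch of what this "metadata" looks like in practice, using a numeric vector with names. The attribute name "units" is an assumption chosen purely for illustration.

```r
# A numeric vector whose elements carry names.
x <- c(a = 1, b = 2, c = 3)

typeof(x)       # "double"  - the underlying storage type
length(x)       # 3
attributes(x)   # a list holding the names a, b, c

# You can attach your own metadata; "units" is a made-up attribute name.
attr(x, "units") <- "cm"
attributes(x)   # now holds both the names and the units
```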
So to sum up.
The best way to describe the language is as being related to Lisp, but with some additions to make it easier to use and to make sophisticated data easier to work with.
Getting and installing R is simple.
Just go to the website, download the appropriate binary and run the installation. If you are working under Windows then it will automatically install and use either a 32-bit or 64-bit version.
The installation includes the RGui, which provides an easy-to-use R Console that you can use to type instructions into directly.
Once you have it installed you can run the R Console and type in commands.
Notice that R uses a persistent environment. That is, any data structures or objects you create remain accessible and you have an option to save and reload the user environment complete with data objects each time you quit and start a session. If you are not used to it, this can seem strange at first.
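As a rough sketch of how this persistence works, the following saves an object to a file, removes it and then restores it. The variable name session_data and the use of a temporary file are assumptions made for the example; save.image() does the same job for the whole workspace.

```r
# Create an object; it stays in the workspace until you remove it.
session_data <- c(10, 20, 30)
"session_data" %in% ls()                      # TRUE

# Save the object to a file (a temporary file, for this sketch only).
snapshot <- tempfile(fileext = ".RData")
save(session_data, file = snapshot)

# Remove the object, then bring it back from the file.
rm(session_data)
exists("session_data", inherits = FALSE)      # FALSE
load(snapshot)
session_data                                  # 10 20 30
```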
All of the examples given in the rest of this article can be typed into the command console and tried out at once.
While R isn't a purely functional language, the fact that it supports some sophisticated data types and provides functions which perform complicated operations on them does give it the flavour of a functional language. To see this in action we need to first look at the fundamental data structure - the vector.
As already mentioned, R doesn't have primitive data types in the way that other languages do. In R even the simplest numeric value is an example of a vector.
An R vector is what would be called a one-dimensional array in other languages.
A vector is an indexed set of data all of the same type.
R allows you to use six different types of vector:
- logical
- integer
- double (real numbers)
- complex
- character (strings)
- raw (bytes)
Notice that even a single number like 4.3 is an example of a vector of length one. This might seem like a crazy idea and potentially inefficient, but it fits in well with the sort of calculations you want to do in R.
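You can check this at the console. The sketch below uses only base R functions; the variable names are arbitrary.

```r
# A single number is a vector of length one.
x <- 4.3
is.vector(x)        # TRUE
length(x)           # 1
x[1]                # 4.3

# Each kind of value reports its vector type.
typeof(4.3)         # "double"
typeof(4L)          # "integer"
typeof(TRUE)        # "logical"
typeof("a")         # "character"
typeof(1i)          # "complex"
typeof(as.raw(1))   # "raw"

# The same operation works on one value or on many.
doubled_one  <- x * 2              # 8.6
doubled_many <- c(1, 2, 3) * 2     # 2 4 6
```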
You can create a vector using c, the concatenation function, which will take a set of arguments and return a vector. For example:
c(1,2,3,4)
is the vector 1 2 3 4. If one of the values is a different numeric type then all of the other values are coerced to the highest type used.
For example, c(1.1,2,3,4) is the vector:
1.1 2.0 3.0 4.0
If one of the arguments is a string, then all of the values are coerced to strings.
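A short sketch of these coercion rules in action; the variable names are arbitrary.

```r
# All integers: the result is an integer vector.
ints <- c(1L, 2L, 3L)
typeof(ints)               # "integer"

# One double among integers: everything becomes double.
mixed <- c(1L, 2.5)
typeof(mixed)              # "double"

# One string among the values: everything becomes a string.
with_string <- c(1, "a", TRUE)
with_string                # "1" "a" "TRUE"
typeof(with_string)        # "character"
```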
Assignment to a symbol is generally done using the <- operator as in:
v <- c(1,2,3,4)
However, it is just shorthand for a call to the assign function.
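As a minimal sketch, the two forms below create the same vector when typed at the console. The name w is chosen here only so that the two results can be compared.

```r
# Assignment with the arrow operator.
v <- c(1, 2, 3, 4)

# The same effect using assign; the symbol name is passed as a string.
assign("w", c(1, 2, 3, 4))

identical(v, w)   # TRUE
```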
To access a single element of a vector you can use the [ [ ] ] notation. For example:
v[[2]]
This raises the question of why the notation is [ [ ] ] and not [ ], as it is in most other languages.
Indeed, many R programmers think that you access an array using [ ], as in v[2] say, and in this case there is little practical difference between [ [ ] ] and [ ].
The difference is that the [ [ ] ] notation extracts the element from the vector, whereas [ ] is a more general indexing notation which can be used to extract a sub-vector. We will return to this topic later, but a simple example will make things clear:
v[c(2,4)]
This is indexing by a vector, and the result is a vector containing the elements v[[2]] and v[[4]], i.e. 2 4 in this case.
The rule is that if you index using an integer vector the result is a vector consisting of the elements indexed by the integer vector. This allows you to pick out arbitrary elements from one vector to create a new vector.
You can now see that:
v[2]
is a vector indexed by the vector 2 - recall that primitive data is in the form of a vector with a single element. This means that v[2] returns the same as v[[2]], i.e. a vector with a single element.
There are some important differences, however, and in particular you cannot use a general index in [ [ ] ], only a single integer.
- [ [ ] ] returns an element of the vector.
- [ ] returns a sub-vector of the vector.
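The two forms of indexing can be compared in a short sketch, assuming the vector v defined earlier. The tryCatch call is there only so that the failing case can be shown without stopping the script.

```r
v <- c(1, 2, 3, 4)

# Double brackets extract a single element.
v[[2]]                    # 2

# Single brackets accept an integer vector and return a sub-vector.
v[c(2, 4)]                # 2 4
v[c(4, 1, 1)]             # 4 1 1 - any order, repeats allowed

# With a single index the two forms give the same result here.
identical(v[2], v[[2]])   # TRUE

# A general index is not allowed inside double brackets.
outcome <- tryCatch(v[[c(2, 4)]], error = function(e) "error")
outcome                   # "error"
```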