R for Everyone: Advanced Analytics and Graphics (Addison-Wesley Data and Analytics) by Jared P. Lander

  • Print Length: 464 Pages
  • Publisher: Addison-Wesley Professional
  • Publication Date: December 29, 2013
  • Language: English
  • ISBN-10: 0321888030
  • ISBN-13: 978-0321888037
  • File Format: PDF, EPUB

To my mother and father

Contents

Foreword

Preface

Acknowledgments

About the Author

1 Getting R

1.1 Downloading R

1.2 R Version

1.3 32-bit versus 64-bit

1.4 Installing

1.5 Revolution R Community Edition

1.6 Conclusion

2 The R Environment

2.1 Command Line Interface

2.2 RStudio

2.3 Revolution Analytics RPE

2.4 Conclusion

3 R Packages

3.1 Installing Packages

3.2 Loading Packages

3.3 Building a Package

3.4 Conclusion

4 Basics of R

4.1 Basic Math

4.2 Variables

4.3 Data Types

4.4 Vectors

4.5 Calling Functions

4.6 Function Documentation

4.7 Missing Data

4.8 Conclusion

5 Advanced Data Structures

5.1 data.frames

5.2 Lists

5.3 Matrices

5.4 Arrays

5.5 Conclusion

6 Reading Data into R

6.1 Reading CSVs

6.2 Excel Data

6.3 Reading from Databases

6.4 Data from Other Statistical Tools

6.5 R Binary Files

6.6 Data Included with R

6.7 Extract Data from Web Sites

6.8 Conclusion

7 Statistical Graphics

7.1 Base Graphics

7.2 ggplot2

7.3 Conclusion

8 Writing R Functions

8.1 Hello, World!

8.2 Function Arguments

8.3 Return Values

8.4 do.call

8.5 Conclusion

9 Control Statements

9.1 if and else

9.2 switch

9.3 ifelse

9.4 Compound Tests

9.5 Conclusion

10 Loops, the Un-R Way to Iterate

10.1 for Loops

10.2 while Loops

10.3 Controlling Loops

10.4 Conclusion

11 Group Manipulation

11.1 Apply Family

11.2 aggregate

11.3 plyr

11.4 data.table

11.5 Conclusion

12 Data Reshaping

12.1 cbind and rbind

12.2 Joins

12.3 reshape2

12.4 Conclusion

13 Manipulating Strings

13.1 paste

13.2 sprintf

13.3 Extracting Text

13.4 Regular Expressions

13.5 Conclusion

14 Probability Distributions

14.1 Normal Distribution

14.2 Binomial Distribution

14.3 Poisson Distribution

14.4 Other Distributions

14.5 Conclusion

15 Basic Statistics

15.1 Summary Statistics

15.2 Correlation and Covariance

15.3 T-Tests

15.4 ANOVA

15.5 Conclusion

16 Linear Models

16.1 Simple Linear Regression

16.2 Multiple Regression

16.3 Conclusion

17 Generalized Linear Models

17.1 Logistic Regression

17.2 Poisson Regression

17.3 Other Generalized Linear Models

17.4 Survival Analysis

17.5 Conclusion

18 Model Diagnostics

18.1 Residuals

18.2 Comparing Models

18.3 Cross-Validation

18.4 Bootstrap

18.5 Stepwise Variable Selection

18.6 Conclusion

19 Regularization and Shrinkage

19.1 Elastic Net

19.2 Bayesian Shrinkage

19.3 Conclusion

20 Nonlinear Models

20.1 Nonlinear Least Squares

20.2 Splines

20.3 Generalized Additive Models

20.4 Decision Trees

20.5 Random Forests

20.6 Conclusion

21 Time Series and Autocorrelation

21.1 Autoregressive Moving Average

21.2 VAR

21.3 GARCH

21.4 Conclusion

22 Clustering

22.1 K-means

22.2 PAM

22.3 Hierarchical Clustering

22.4 Conclusion

23 Reproducibility, Reports and Slide Shows with knitr

23.1 Installing a LaTeX Program

23.2 LaTeX Primer

23.3 Using knitr with LaTeX

23.4 Markdown Tips

23.5 Using knitr and Markdown

23.6 pandoc

23.7 Conclusion

24 Building R Packages

24.1 Folder Structure

24.2 Package Files

24.3 Package Documentation

24.4 Checking, Building and Installing

24.5 Submitting to CRAN

24.6 C++ Code

24.7 Conclusion

A Real-Life Resources

A.1 Meetups

A.2 Stack overflow

A.3 Twitter

A.4 Conferences

A.5 Web Sites

A.6 Documents

A.7 Books

A.8 Conclusion

B Glossary

List of Figures

List of Tables

General Index

Index of Functions

Index of Packages

Index of People

Data Index

Foreword

R has had tremendous growth in popularity over the last three years. Based on that, you’d think that it was a new, up-and-coming language. But surprisingly, R has been around since 1993. Why the sudden uptick in popularity? The somewhat obvious answer seems to be the emergence of data science as a career and a field of study. But the underpinnings of data science have been around for many decades. Statistics, linear algebra, operations research, artificial intelligence, and machine learning all contribute parts to the tools that a modern data scientist uses. R, more than most languages, has been built to make most of these tools only a single function call away.

That’s why I’m very excited to have this book as one of the first in the Addison-Wesley Data and Analytics Series. R is indispensable for many data science tasks. Many algorithms useful for prediction and analysis can be accessed through only a few lines of code, which makes it a great fit for solving modern data challenges. Data science as a field isn’t just about math and statistics, and it isn’t just about programming and infrastructure. This book provides a well-balanced introduction to the power and expressiveness of R and is aimed at a general audience.

I can’t think of a better author to provide an introduction to R than Jared Lander. Jared and I first met through the New York City machine learning community in late 2009. Back then, the New York City data community was small enough to fit in a single conference room, and many of the other data meetups had yet to be formed. Over the last four years, Jared has been at the forefront of the emerging data science profession.

Through running the Open Statistical Programming Meetup, speaking at events, and teaching a course at Columbia on R, Jared has helped grow the community by educating programmers, data scientists, journalists, and statisticians alike. But Jared’s expertise isn’t limited to teaching. As an everyday practitioner, he puts these tools to use while consulting for clients big and small.

This book provides an introduction both to programming in R and to the various statistical methods and tools an everyday R programmer uses. Examples use publicly available datasets that Jared has helpfully cleaned and made accessible through his Web site. By using real data and setting up interesting problems, this book stays engaging to the end.

—Paul Dix, Series Editor

Preface

With the increasing prevalence of data in our daily lives, new and better tools are needed to analyze the deluge. Traditionally there have been two ends of the spectrum: lightweight, individual analysis using tools like Excel or SPSS and heavy duty, high-performance analysis built with C++ and the like. With the increasing strength of personal computers grew a middle ground that was both interactive and robust. Analysis done by an individual on his or her own computer in an exploratory fashion could quickly be transformed into something destined for a server, underpinning advanced business processes. This area is the domain of R, Python, and other scripted languages.

R, invented by Robert Gentleman and Ross Ihaka of the University of Auckland in 1993, grew out of S, which was invented by John Chambers at Bell Labs. It is a high-level language that was originally intended to be run interactively where the user runs a command, gets a result, and then runs another command. It has since evolved into a language that can also be embedded in systems and tackle complex problems.

In addition to transforming and analyzing data, R can produce amazing graphics and reports with ease. It is now being used as a full stack for data analysis: extracting and transforming data, fitting models, drawing inferences, making predictions, and plotting and reporting results.

R’s popularity has skyrocketed since the late 2000s, as it has stepped out of academia and into banking, marketing, pharmaceuticals, politics, genomics and many other fields. Its new users are often shifting from low-level, compiled languages like C++, other statistical packages such as SAS or SPSS, and from the 800-pound gorilla, Excel. This time period also saw a rapid surge in the number of add-on packages—libraries of prewritten code that extend R’s functionality.

While R can sometimes be intimidating to beginners, especially for those without programming experience, I find that programming analysis, instead of pointing and clicking, soon becomes much easier, more convenient and more reliable. It is my goal to make that learning process easier and quicker.

This book lays out information in a way I wish I were taught when learning R in graduate school. Coming full circle, the content of this book was developed in conjunction with the data science course I teach at Columbia University. It is not meant to cover every minute detail of R, but rather the 20% of functionality needed to accomplish 80% of the work. The content is organized into self-contained chapters as follows.

Chapter 1, Getting R: Where to download R and how to install it. This deals with the varying operating systems and 32-bit versus 64-bit versions. It also gives advice on where to install R.

Chapter 2, The R Environment: An overview of using R, particularly from within RStudio. RStudio projects and Git integration are covered, as is customizing and navigating RStudio.

Chapter 3, Packages: How to locate, install and load R packages.

Chapter 4, Basics of R: Using R for math. Variable types such as numeric, character and Date are detailed as are vectors. There is a brief introduction to calling functions and finding documentation on functions.

Chapter 5, Advanced Data Structures: The most powerful and commonly used data structure, data.frames, along with matrices and lists, are introduced.

Chapter 6, Reading Data into R: Before data can be analyzed it must be read into R. There are numerous ways to ingest data, including reading from CSVs and databases.

Chapter 7, Statistical Graphics: Graphics are a crucial part of preliminary data analysis and communicating results. R can make beautiful plots using its powerful plotting utilities. Base graphics and ggplot2 are introduced and detailed here.

Chapter 8, Writing R Functions: Repeatable analysis is often made easier with user-defined functions. The structure, arguments and return rules are discussed.

Chapter 9, Control Statements: Controlling the flow of programs using if, ifelse and complex checks.

Chapter 10, Loops, the Un-R Way to Iterate: Iterating using for and while loops. While these are generally discouraged they are important to know.

Chapter 11, Group Manipulation: A better alternative to loops, vectorization does not quite iterate through data so much as operate on all elements at once. This is more efficient and is primarily performed with the apply functions and plyr package.

Chapter 12, Data Reshaping: Combining multiple datasets, whether by stacking or joining, is commonly necessary as is changing the shape of data. The plyr and reshape2 packages offer good functions for accomplishing this in addition to base tools such as rbind, cbind and merge.

Chapter 13, Manipulating Strings: Most people do not associate character data with statistics but it is an important form of data. R provides numerous facilities for working with strings, including combining them and extracting information from within. Regular expressions are also detailed.

Chapter 14, Probability Distributions: A thorough look at the normal, binomial and Poisson distributions. The formulas and functions for many distributions are noted.

Chapter 15, Basic Statistics: These are the first statistics most people are taught, such as mean, standard deviation and t-tests.

Chapter 16, Linear Models: The most powerful and common tool in statistics, linear models are extensively detailed.

Chapter 17, Generalized Linear Models: Linear models are extended to include logistic and Poisson regression. Survival analysis is also covered.

Chapter 18, Model Diagnostics: Determining the quality of models and variable selection using residuals, AIC, cross-validation, the bootstrap and stepwise variable selection.

Chapter 19, Regularization and Shrinkage: Preventing overfitting using the Elastic Net and Bayesian methods.

Chapter 20, Nonlinear Models: When linear models are inappropriate, nonlinear models are a good solution. Nonlinear least squares, splines, generalized additive models, decision trees and random forests are discussed.

Chapter 21, Time Series and Autocorrelation: Methods for the analysis of univariate and multivariate time series data.

Chapter 22, Clustering: Clustering, the grouping of data, is accomplished by various methods such as K-means and hierarchical clustering.

Chapter 23, Reproducibility, Reports and Slide Shows with knitr: Generating reports, slide shows and Web pages from within R is made easy with knitr, LaTeX and Markdown.

Chapter 24, Building R Packages: R packages are great for portable, reusable code. Building these packages has been made incredibly easy with the advent of devtools and Rcpp.

Appendix A, Real-Life Resources: A listing of our favorite resources for learning more about R and interacting with the community.

Appendix B, Glossary: A glossary of terms used throughout this book.

A good deal of the text in this book is either R code or the results of running code. Code and results are most often in a separate block of text and set in a distinctive font, as shown in the following example. The different parts of code also have different colors. Lines of code start with >, and if code is continued from one line to another the continued line begins with +.

> # this is a comment

>

> # now basic math

> 10 * 10

[1] 100

>

> # calling a function

> sqrt(4)

[1] 2

Certain Kindle devices do not display color so the digital edition of this book will be viewed in greyscale on those devices.

There are occasions where code is shown inline and looks like sqrt(4).

In the few places where math is necessary, the equations are indented from the margin and are numbered.

Within equations, normal variables appear as italic text (x), vectors are bold lowercase letters (x) and matrices are bold uppercase letters (X). Greek letters, such as α and β, follow the same convention.

Function names will be written as join and package names as plyr. Objects generated in code that are referenced in text are written as object1.

Learning R is a gratifying experience that makes life so much easier for so many tasks. I hope you enjoy learning with me.

Acknowledgments

To start, I must thank my mother, Gail Lander, for encouraging me to become a math major. Without that I would never have followed the path that led me to statistics and data science. In a similar vein, I have to thank my father, Howard Lander, for paying all those tuition bills. He has been a valuable source of advice and guidance throughout my life and someone I have aspired to emulate in many ways. While they both insist they do not understand what I do, they love that I do it and have helped me all along the way. Staying with family, I should thank my sister and brother-in-law, Aimee and Eric Schechterman, for letting me teach math to Noah, their five-year-old son.

There are many teachers who have helped shape me over the years. The first is Rochelle Lecke, who tutored me in middle school math even when my teacher told me I did not have worthwhile math skills.

Then there is Beth Edmondson, my precalc teacher at Princeton Day School. After I wasted the first half of high school as a mediocre student, she told me I had “some nerve signing up for next year’s AP Calc” given my grades. She agreed to let me take AP Calc if I went from a C to an A+ in her class, never thinking I stood a chance. Three months later, she was in shock as I not only earned the A+, but turned around my entire academic career. She changed my life and without her, I do not know where I would be today. I am forever grateful that she was my teacher.

For the first two years at Muhlenberg College, I was determined to be a business and communications major, but took math classes because they came naturally to me. My professors, Dr. Penny Dunham, Dr. Bill Dunham, and Dr. Linda McGuire, all convinced me to become a math major, a decision that has greatly shaped my life. Dr. Greg Cicconetti gave me my first glimpse of rigorous statistics, my first research opportunity and planted the idea in my head that I should go to grad school for statistics.

While earning my M.A. at Columbia University, I was surrounded by brilliant minds in statistics and programming. Dr. David Madigan opened my eyes to modern machine learning, and Dr. Bodhi Sen got me thinking about statistical programming. I had the privilege to do research with Dr. Andrew Gelman, whose insights have been immeasurably important to me. Dr. Richard Garfield showed me how to use statistics to help people in disaster and war zones when he sent me on my first assignment to Myanmar. His advice and friendship over the years have been dear to me. Dr. Jingchen Liu allowed and encouraged me to write my thesis on New York City pizza, which has brought me an inordinate amount of attention.1

1. http://slice.seriouseats.com/archives/2010/03/the-moneyball-of-pizza-statistician-uses-statistics-to-find-nyc-best-pizza.html

While at Columbia, I also met my good friend—and one time TA—Dr. Ivor Cribben who filled in so many gaps in my knowledge. Through him, I met Dr. Rachel Schutt, a source of great advice, and who I am now honored to teach alongside at Columbia.

Grad school might never have happened without the encouragement and support of Shanna Lee. She helped maintain my sanity while I was incredibly overcommitted to two jobs, classes and Columbia’s hockey team. I am not sure I would have made it through without her.

Steve Czetty gave me my first job in analytics at Sky IT Group and taught me about databases, while letting me experiment with off-the-wall programming. This sparked my interest in statistics and data. Joe DeSiena, Philip du Plessis, and Ed Bobrin at the Bardess Group are some of the finest people I have ever had the pleasure to work with, and I am proud to be working with them to this day. Mike Minelli, Rich Kittler, Mark Barry, David Smith, Joseph Rickert, Dr. Norman Nie, James Peruvankal, Neera Talbert and Dave Rich at Revolution Analytics let me do one of the best jobs I could possibly imagine: explaining to people in business why they should be using R. Kirk Mettler, Richard Schultz, Dr. Bryan Lewis and Jim Winfield at Big Computing encouraged me to have fun, tackling interesting problems in R. Vincent Saulys, John Weir, and Dr. Saar Golde at Goldman Sachs made my time there both enjoyable and educational.

Throughout the course of writing this book, many people helped me with the process. First and foremost is Yin Cheung, who saw all the stress I constantly felt and supported me through many ruined nights and days.

My editor, Debra Williams, knew just how to encourage me and her guiding hand has been invaluable. Paul Dix, the series editor and a good friend, was the person who suggested I write this book, so none of this would have happened without him. Thanks to Caroline Senay and Andrea Fox for being great copy editors. Without them, this book would not be nearly as well put together. Robert Mauriello’s technical review was incredibly useful in honing the book’s presentation.

The folks at RStudio, particularly JJ Allaire and Josh Paulson, make an amazing product, which made the writing process far easier than it would have been otherwise. Yihui Xie, the author of the knitr package, provided numerous feature changes that I needed to write this book. His software, and his speed at implementing my requests, is greatly appreciated.

Numerous people have provided valuable feedback as I produced this book, including Chris Bethel, Dr. Dirk Eddelbuettel, Dr. Ramnath Vaidyanathan, Dr. Eran Bellin, Avi Fisher, Brian Ezra, Paul Puglia, Nicholas Galasinao, Aaron Schumaker, Adam Hogan, Jeffrey Arnold, and John Houston.

Last fall was my first time teaching, and I am thankful to the students from the Fall 2012 Introduction to Data Science class at Columbia University for being the guinea pigs for the material that ultimately ended up in this book.

Thank you to everyone who helped along the way.

About the Author

Jared P. Lander is the founder and CEO of Lander Analytics, a statistical consulting firm based in New York City, the organizer of the New York Open Statistical Programming Meetup, and an adjunct professor of statistics at Columbia University. He is also a tour guide for Scott’s Pizza Tours and an advisor to Brewla Bars, a gourmet ice pop start-up. With an M.A. from Columbia University in statistics and a B.A. from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations spans politics, tech start-ups, fund-raising, music, finance, healthcare and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, visualization and statistical computing.

Chapter 1. Getting R

R is a wonderful tool for statistical analysis, visualization and reporting. Its usefulness is best seen in the wide variety of fields where it is used. We alone have used R for projects with banks, political campaigns, tech startups, food startups, international development and aid organizations, hospitals and real estate developers. Other areas where we have seen it used are online advertising, insurance, ecology, genetics and pharmaceuticals. R is used by statisticians with advanced machine learning training and by programmers familiar with other languages, and also by people who are not necessarily trained in advanced data analysis but are tired of using Excel.

Before it can be used it needs to be downloaded and installed, a process that is no more complicated than installing any other program.

1.1. Downloading R

The first step in using R is getting it on the computer. Unlike with languages such as C++, R must be installed in order to run.1 The program is easily obtainable from the Comprehensive R Archive Network (CRAN), the maintainer of R, at http://cran.r-project.org/. At the top of the page are links to download R for Windows, Mac OS X and Linux.

1. Technically C++ cannot be set up on its own without a compiler, so something would still need to be installed anyway.

There are prebuilt installations available for Windows and Mac OS X while those for Linux usually compile from source. Installing R on any of these platforms is just like installing any other program.

Windows users should click the link Download R for Windows, then base and then Download R 3.x.x for Windows; the x’s indicate the version of R. This changes periodically as improvements are made.

Similarly, Mac users should click Download R for (Mac) OS X and then R-3.x.x.pkg; again, the x’s indicate the current version of R. This will also install both 32- and 64-bit versions.

Linux users should download R using their standard distribution mechanism whether that is apt-get (Ubuntu and Debian), zypper (SUSE) or another source. This will also build and install R.

1.2. R Version

As of this writing, R is at version 3.0.2, which is a big jump from the previous version, 2.15.3. CRAN follows a one-year release cycle where each major version change increases the middle of the three numbers in the version. For instance, version 3.0.0 was released in 2013. In 2014 the version will be incremented to 3.1.0 with 3.2.0 coming in 2015. The last number in the version is for minor updates to the current major version.

Most R functionality is usually backward compatible with previous versions.
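
To confirm which version is installed, it can be checked from the console. This is a minimal check using base R; the exact output will, of course, depend on the version installed on your machine.

> R.version.string

[1] "R version 3.0.2 (2013-09-25)"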

1.3. 32-bit versus 64-bit

The choice between using 32-bit and using 64-bit comes down to whether the computer supports 64-bit—most new machines do—and the size of the data to be worked with. The 64-bit versions can address arbitrarily large amounts of memory (or RAM), so they might as well be used.

This is especially important starting with version 3.0.0, as that adds support for 64-bit integers, meaning far greater amounts of data can be stored in R objects.

In the past, certain packages required the 32-bit version of R but that is exceedingly rare these days. The only reason for installing the 32-bit version now is to support some legacy analysis or for use on a machine with a 32-bit processor such as Intel’s low-power Atom chip.

1.4. Installing

Installing R on Windows and Mac is just like installing any other program.

1.4.1. Installing on Windows

Find the appropriate installer where it was downloaded. For Windows users it will look like Figure 1.1.

Figure 1.1 Location of R installer.

R should be installed using administrator privileges. This means right-clicking the installer and then selecting Run as Administrator. This brings up a prompt where the administrator password should be entered.

The first dialog, shown in Figure 1.2, offers a choice of language, defaulted at English. Choose the appropriate language and click OK.

Figure 1.2 Language selection.

Next, the caution shown in Figure 1.3 recommends that all other programs be closed. This advice is rarely followed or necessary anymore, so clicking Next is appropriate.

Figure 1.3 With modern versions of Windows, this suggestion can be safely ignored.

The software license is then displayed, as in Figure 1.4. R cannot be used without agreeing to this (important) license, so the only recourse is to click Next.

Figure 1.4 The license agreement must be acknowledged to use R.

The installer then asks for a destination location. Even though the official advice from CRAN is that R should be installed in a directory with no spaces in the name, the default installation directory is often Program Files\R, which causes trouble if we try to build packages that require compiled code such as C++ or FORTRAN. Figure 1.5 shows this dialog.

Figure 1.5 It is important to choose a destination folder with no spaces in the name.

If that is the case, click the Browse button to bring up folder options like the ones shown in Figure 1.6.

Figure 1.6 This dialog is used to choose the destination folder.

It is best to choose a destination folder that is on the C: drive (or another hard disk drive) or inside My Documents, which despite that user-friendly name is actually located at C:\Users\UserName\Documents, which contains no spaces. Figure 1.7 shows a proper destination for the installation.

Figure 1.7 This is a proper destination, with no spaces in the name.

Next, Figure 1.8, shows a list of components to install. Unless there is a specific need for 32-bit files, that option can be unchecked. Everything else should be selected.

Figure 1.8 It is best to select everything except 32-bit components.

The startup options should be left at the default, No, as in Figure 1.9, because there are not a lot of options and we recommend using RStudio as the front end anyway.

Figure 1.9 Accept the default startup options, as we recommend using RStudio as the front end and these will not be important.

Next, choose where to put the start menu shortcuts. We recommend simply using R and putting every version in there as shown in Figure 1.10.

Figure 1.10 Choose the Start Menu folder where the shortcuts will be installed.

We have many versions of R, all inside the same Start Menu folder, which allows code to be tested in different versions. This is illustrated in Figure 1.11.

Figure 1.11 We have multiple versions of R installed to allow development and testing with different versions.

The last option is choosing whether to complete some additional tasks such as creating a desktop icon (not too useful if using RStudio). We highly recommend saving the version number in the registry and associating R with RData files. These options are shown in Figure 1.12.

Figure 1.12 We recommend saving the version number in the registry and associating R with RData files.

Clicking Next begins installation and displays a progress bar, as shown in Figure 1.13.

Figure 1.13 A progress bar is displayed during installation.

The last step, shown in Figure 1.14, is to click Finish and the installation is complete.

Figure 1.14 Confirmation that installation is complete.

1.4.2. Installing on Mac OS X

Find the appropriate installer, which ends in .pkg, and launch it by double-clicking. This brings up the introduction, shown in Figure 1.15. Click Continue to begin the installation process.

Figure 1.15 Introductory screen for installation on a Mac.

This brings up some information about the version of R being installed. There is nothing to do except click Continue, as shown in Figure 1.16.

Figure 1.16 Version selection.

Then the license information is displayed, as in Figure 1.17. Click Continue to proceed, the only viable option in order to use R.

Figure 1.17 The license agreement, which must be acknowledged to use R.

Click Agree to confirm that the license is agreed to, which is mandatory to use R as is evidenced in Figure 1.18.

Figure 1.18 The license agreement must also be agreed to.

To install R for all users, click Install; otherwise, click Change Install Location to pick a different location. This is shown in Figure 1.19.

Figure 1.19 By default R is installed for all users, although there is the option to choose a specific location.

If prompted, enter the necessary password as shown in Figure 1.20.

Figure 1.20 The administrator password might be required for installation.

This starts the installation process, which displays a progress bar as shown in Figure 1.21.

Figure 1.21 A progress bar is displayed during installation.

When done, the installer signals success as Figure 1.22 shows. Click Close to finish the installation.

Figure 1.22 This signals a successful installation.

1.4.3. Installing on Linux

Retrieving R from its standard distribution mechanism will download, build and install R in one step.

1.5. Revolution R Community Edition

Revolution Analytics offers a community version of its build of R featuring an Integrated Development Environment based on Visual Studio and built with the Intel Math Kernel Library (MKL), allowing for much faster matrix computations. It is available for free at http://www.revolutionanalytics.com/products/revolution-r.php. They also offer a paid version that provides specialized algorithms to work on very large data. More information is available at http://www.revolutionanalytics.com/products/revolution-enterprise.php.

1.6. Conclusion

At this point R is fully usable and comes with a crude GUI. However, it is best to install RStudio and use its interface, which is detailed in Section 2.2. The process involves downloading and launching an installer, just as with any other program.

Chapter 2. The R Environment

Now that R is downloaded and installed, it is time to get familiar with how to use R. The basic R interface on Windows is fairly Spartan as seen in Figure 2.1. The Mac interface (Figure 2.2) has some extra features and Linux has far fewer, being just a terminal.

Figure 2.1 The standard R interface in Windows.

Figure 2.2 The standard R interface on Mac OS X.

Unlike other languages, R is very interactive. That is, results can be seen one command at a time. Languages such as C++ require that an entire section of code be written, compiled and run in order to see results. The state of objects and results can be seen at any point in R. This interactivity is one of the most amazing aspects of working with R.

There have been numerous Integrated Development Environments (IDEs) built for R. For the purposes of this book we will assume that RStudio is being used, which is discussed in Section 2.2.

2.1. Command Line Interface

The command line interface is what makes R so powerful, and also frustrating to learn. There have been attempts to build point-and-click interfaces for R, such as Rcmdr, but none have truly taken off. This is a testament to how typing in commands is much better than using a mouse. That might be hard to believe, especially for those coming from Excel, but over time it becomes easier and less error prone.

For instance, fitting a regression in Excel takes at least seven mouse clicks, often more: Data >> Data Analysis >> Regression >> OK >> Input Y Range >> Input X Range >> OK. Then it may need to be done all over again to make one little tweak or because there are new data. Even harder is walking a colleague through those steps via email. In contrast, the same command is just one line in R, which can easily be repeated and copied and pasted. This may be hard to believe initially, but after some time the command line makes life much easier.
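
As a rough illustration (regression itself is covered in Chapter 16), that whole Excel procedure collapses to a single call to lm in R; here theData, x and y are hypothetical placeholder names for a data.frame and its columns.

> # fit a linear regression of y on x; theData, x and y are placeholders

> lm(y ~ x, data = theData)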

To run a command in R, type it into the console next to the > symbol and press the Enter key. Entries can be as simple as the number 2 or complex functions, such as those seen in Chapter 8.

To repeat a line of code, simply press the Up Arrow key and hit Enter again. All previous commands are saved and can be accessed by repeatedly using the Up and Down Arrow keys to cycle through them.
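
The saved commands can also be listed with the history function; for example, the following shows the 25 most recent commands.

> # show the 25 most recent commands

> history(25)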

Interrupting a command is done with Esc in Windows and Mac and Ctrl-C in Linux.

Often when working on a large analysis it is good to have a file of the code used. Until recently, the most common way to handle this was to use a text editor1 such as TextPad or UltraEdit to write code and then copy and paste it into the R console. While this worked, it was sloppy and led to a lot of switching between programs.

1. This means a programming text editor as opposed to a word processor such as Microsoft Word. A text editor preserves the structure of the text whereas word processors may add formatting that makes it unsuitable for insertion into the console.

2.2. RStudio

While there are a number of IDEs available, the best right now is RStudio, created by a team led by JJ Allaire whose previous products include ColdFusion and Windows Live Writer. It is available for Windows, Mac and Linux and looks identical in all of them. Even more impressive is the RStudio server, which runs an R instance on a Linux server and allows the user to run commands through the standard RStudio interface in a Web browser. It works with any version of R (greater than 2.11.1) including Revolution R from Revolution Analytics. RStudio has so many options that it can be a bit overwhelming. We will cover some of the most useful or frequently used features.

RStudio is highly customizable but the basic interface looks roughly like Figure 2.3. In this case the lower left pane is the R console, which can be used just like the standard R console. The upper left pane takes the place of a text editor but is far more powerful. The upper right pane holds information about the workspace, command history, files in the current folder and Git version control. The lower right pane displays plots, package information and help files.

Figure 2.3 The general layout of RStudio.

There are a number of ways to send and execute commands from the editor to the console. To send one line, place the cursor at the desired line and press Ctrl+Enter (Command+Enter on Mac). To run a selection, simply highlight the selection and press Ctrl+Enter. To run an entire file of code, press Ctrl+Shift+S.

When typing code, such as an object name or function name, hitting Tab will autocomplete the code. If more than one object or function matches the letters typed so far, a dialog will pop up giving the matching options as shown in Figure 2.4.

Figure 2.4 Object Name Autocomplete in RStudio.

Typing Ctrl+1 moves the cursor to the text editor area and Ctrl+2 moves it to the console. To move to the previous tab in the text editor, press Ctrl+Alt+Left in Windows, Ctrl+PageUp in Linux and Ctrl+Option+Left on Mac. To move to the next tab in the text editor, press Ctrl+Alt+Right in Windows, Ctrl+PageDown in Linux and Ctrl+Option+Right on Mac. For a complete list of shortcuts click Help >> Keyboard Shortcuts.

2.2.1. RStudio Projects

A primary feature of RStudio is projects. A project is a collection of files—and possibly data, results and graphs—that are all related to each other.2 Each project even has its own working directory. This is a great way to keep organized.

2. This is different from an R session, which is all the objects and work done in R and kept in memory for the current usage period, which usually resets upon restarting R.

The simplest way to start a new project is to click File >> New Project as in Figure 2.5.

Figure 2.5 Clicking File >> New Project begins the project creation process.

Three options are available, shown in Figure 2.6: starting a new project in a new directory, associating a project with an existing directory or checking out a project from a version control repository such as Git or SVN. In all three cases a .Rproj file is put into the resulting directory and keeps track of the project.

Figure 2.6 Three options are available to start a new project: a new directory, associating a project with an existing directory or checking out a project from a version control repository.

Choosing to create a new directory brings up a dialog, shown in Figure 2.7, that requests a project name and where to create a new directory.

Figure 2.7 Dialog to choose the location of a new project directory.

Choosing an existing directory asks for the name of the directory, seen in Figure 2.8.

Figure 2.8 Dialog to choose an existing directory in which to start a project.

Choosing to use version control (we prefer Git) first asks whether to use Git or SVN as in Figure 2.9.

Figure 2.9 Here is the option to choose which type of repository to start a new project from.

Selecting Git asks for a repository URL, such as git@github.com:jaredlander/coefplot.git, which will then fill in the project directory name, as shown in Figure 2.10. As with creating a new directory, this will ask where to put this new directory.

Figure 2.10 Enter the URL for a Git repository, as well as the folder where this should be cloned to.

2.2.2. RStudio Tools

RStudio is highly customizable with a lot of options. Most are contained in the Options dialog accessed by clicking Tools >> Options, as seen in Figure 2.11.

Figure 2.11 Clicking Tools >> Options brings up RStudio options.

First are the General options, shown in Figure 2.12. There is a control for selecting which version of R to use. This is a powerful tool when a computer has a number of versions of R. However, RStudio must be restarted after changing the R version. In the future, RStudio is slated to offer the ability to set different versions of R for each project. It is also a good idea to not restore or save .RData files on startup and exiting.3

3. RData files are a convenient way of saving and sharing R objects and are discussed in Section 6.5.

Figure 2.12 General options in RStudio.

The Code Editing options, shown in Figure 2.13, control the way code is entered and displayed in the text editor. It is generally considered good practice to replace tabs with spaces, either two or four. Some hard-core programmers will appreciate vim mode. As of now there is no Emacs mode.

Figure 2.13 Options for customizing the code editing pane.

Appearance options, shown in Figure 2.14, change the way code looks, aesthetically. The font, size and color of the background and text can all be customized here.

Figure 2.14 Options for code appearance.

The Pane Layout options, shown in Figure 2.15, simply rearrange the panes that make up RStudio.

Figure 2.15 These options control the placement of the various panes in RStudio.

The Packages options, shown in Figure 2.16, set options regarding packages, although the most important is the CRAN mirror. While this is changeable from the console, this is the default setting. It is best to pick the mirror that is geographically the closest.

Figure 2.16 Options related to packages. The most important is the CRAN mirror selection.
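
Changing the mirror from the console, as mentioned above, can be done with chooseCRANmirror, which presents a list of mirrors to choose from for the current session.

> # interactively pick a CRAN mirror for this session

> chooseCRANmirror()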

Sweave, Figure 2.17, may be a bit misnamed, as this is where to choose between using Sweave or knitr. Both are used for the generation of PDF documents with knitr also enabling the creation of HTML documents. knitr, detailed in Chapter 23, is by far the better option, although it must be installed first, which is explained in Section 3.1. This is also where the PDF viewer is selected.

Figure 2.17 This is where to choose whether to use Sweave or knitr and select the PDF viewer.

RStudio contains a spelling checker for writing LaTeX and Markdown documents (using knitr, preferably), which is controlled from the Spelling options, Figure 2.18. Not much needs to be set here.

Figure 2.18 These are the options for the spelling check dictionary, which allows language selection and the custom dictionaries.

The last option, Git/SVN, Figure 2.19, indicates where the executables for Git and SVN exist. This needs to be set only once but is necessary for version control.

Figure 2.19 This is where to set the location of Git and SVN executables so they can be used by RStudio.

2.2.3. Git Integration

Using version control is a great idea for many reasons. First and foremost it provides snapshots of code at different points in time and can easily revert to those snapshots. Ancillary benefits include having a backup of the code and the ability to easily transfer the code between computers with little effort.

While SVN used to be the gold standard in version control, it has since been superseded by Git, so that will be our focus. After associating a project with a Git repository,4 RStudio has a pane for Git like the one shown in Figure 2.20.

4. A Git account should be set up with either GitHub (https://github.com/) or Bitbucket (https://bitbucket.org/) beforehand.

Figure 2.20 The Git pane shows the Git status of files under version control. A blue square with a white M indicates a file has been changed and needs to be committed. A yellow square with a white question mark indicates a new file that is not being tracked by Git.

The main functionality is committing changes, pushing them to the server and pulling changes made by other users. Clicking the Commit button brings up a dialog, Figure 2.21, which displays files that have been modified, or new files. Clicking on one of these files displays the changes; deletions are colored pink and additions are colored green. There is also a space to write a message describing the commit.

Figure 2.21 This displays files and the changes made to the files, with green being additions and pink being deletions. The upper right contains a space for writing commit messages.

Clicking Commit will stage the changes and clicking Push will send them to the server.

2.3. Revolution Analytics RPE

Revolution Analytics provides an IDE based on Visual Studio called the R Productivity Environment (RPE). The greatest benefit of the RPE is the visual debugger. If this feature is not needed,5 we recommend using Revolution with RStudio as the front-end, which can be set in the General options detailed in Section 2.2.2.

5. The latest version of RStudio now also offers a visual debugger.

2.4. Conclusion

R’s usability has greatly improved over the past few years, mainly thanks to Revolution Analytics’ RPE and RStudio. Using an IDE can greatly improve proficiency, and change working with R from merely tolerable to actually enjoyable.6 RStudio’s code completion, text editor, Git integration and projects are indispensable for a good programming work flow.

6. One of our students relayed that he preferred Matlab to R until he used RStudio.

Chapter 3. R Packages

Perhaps the biggest reason for R’s phenomenal rise in popularity is its collection of user-contributed packages. As of mid-September 2013, there were 4,845 packages available on CRAN,1 written by an estimated 2,000 different people. Odds are good that if a statistical technique exists, it has been written in R and contributed to CRAN. Not only are there an incredibly large number of packages, many are written by the authorities in the field such as Andrew Gelman, Trevor Hastie, Dirk Eddelbuettel and Hadley Wickham.

1. http://cran.r-project.org/web/packages/

A package is essentially a library of prewritten code designed to accomplish some task or a collection of tasks. The survival package is used for survival analysis, ggplot2 is used for plotting and sp is for dealing with spatial data.

It is important to remember that not all packages are of the same quality. Some are built to be very robust and are well-maintained, while others are built with good intentions but can fail with unforeseen errors and others still are just plain poor. Even with the best packages, it is important to remember that most were written by statisticians for statisticians, so they may differ from what a computer engineer would expect.

This book will not attempt to provide an exhaustive list of good packages to use because that is constantly changing. However, there are some packages that are so pervasive that they will be used in this book as if they were part of base R. Some of these are ggplot2, reshape2 and plyr by Hadley Wickham; glmnet by Trevor Hastie, Robert Tibshirani and Jerome Friedman; Rcpp by Dirk Eddelbuettel; and knitr by Yihui Xie. We have written a package on CRAN, coefplot, with more to follow.

3.1. Installing Packages

As with many tasks in R, there are multiple ways to install packages. The simplest is to install them using the GUI provided by RStudio and shown in Figure 3.1. Access the Packages pane shown in this figure either by clicking its tab or by pressing Ctrl+7 on the keyboard.

Figure 3.1 RStudio’s Packages pane.

In the upper-left corner, click the Install Packages button to bring up the dialog in Figure 3.2.

Figure 3.2 RStudio’s package installation dialog.

From here simply type the name of a package (RStudio has a nice autocomplete feature for this) and click Install. Multiple packages can be specified, separated by commas. This downloads and installs the desired package, which is then available for use. Selecting the Install dependencies checkbox will automatically download and install all packages that the desired package requires to work. For example, our coefplot package depends on ggplot2, plyr, useful, stringr and reshape2, and each of those may have further dependencies.

An alternative is to type a very simple command into the console:

> install.packages("coefplot")

This will accomplish the same thing as working in the GUI.

There has been a movement recently to install packages directly from GitHub or BitBucket repositories, especially to get the development versions of packages. This can be accomplished using devtools.

> require(devtools)

> install_github(repo = "coefplot", username = "jaredlander")

If the package being installed from a repository contains source code for a compiled language—generally C++ or FORTRAN—then the proper compilers must be installed. More information is in Section 24.6.

Sometimes there is a need to install a package from a local file, either a zip of a prebuilt package or a tar.gz of package code. This can be done using the installation dialog mentioned before but switching the Install from: option to Package Archive File as shown in Figure 3.3. Then browse to the file and install. Note that this will not install dependencies, and if they are not present the installation will fail. Be sure to install dependencies first.

Figure 3.3 RStudio’s package installation dialog to install from an archive file.

Similarly to before, this can be accomplished using install.packages.

> install.packages("coefplot_1.1.7.zip")

3.1.1. Uninstalling Packages

In the rare instance when a package needs to be uninstalled, it is easiest to click the white X inside a grey circle on the right of the package description in RStudio’s Packages pane shown in Figure 3.1. Alternatively, this can be done with remove.packages where the first argument is a character vector naming the packages to be removed.
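
For example, removing our coefplot package from the console would look like the following.

> remove.packages("coefplot")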

3.2. Loading Packages

Now that packages are installed they are almost ready to use and just need to be loaded first. There are two commands that can be used, either library or require. They both accomplish the same thing—loading the package—but require will return TRUE if it succeeds and FALSE with a warning if it cannot find the package. This returned value is useful when loading a package from within a function, a practice considered acceptable to some, improper to others. In general usage there is not much of a difference, so it comes down to personal preference. The argument to either function is the name of the desired package, with or without quotes. So loading the coefplot package would look like:

> require(coefplot)

Loading required package: coefplot

Loading required package: ggplot2

It prints out the dependent packages that get loaded as well. This can be suppressed by setting the argument quietly to TRUE.

> require(coefplot, quietly = TRUE)

A package only needs to be loaded when starting a new R session. Once loaded, it remains available until either R is restarted or the package is unloaded, as described in Section 3.2.1.
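
Loading with library looks nearly identical; the only practical difference is that library stops with an error if the package is not installed, whereas require merely warns and returns FALSE.

> library(coefplot)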

An alternative to loading a package through code is to select the checkbox next to the package name in RStudio’s Packages pane, seen on the left of Figure 3.1. This will load the package by running the code just shown.

3.2.1. Unloading Packages

Sometimes a package needs to be unloaded. This is simple enough either by clearing the checkbox in RStudio’s Packages pane or by using the detach function. The function takes the package name preceded by package: all in quotes.

> detach("package:coefplot")

It is not uncommon for functions in different packages to have the same name. For example, coefplot is in both arm (by Andrew Gelman) and coefplot.2 If both packages are loaded, the function in the package loaded last will be invoked when calling that function. A way around this is to precede the function with the name of the package, separated by two colons (::).

2. This particular instance is because we built coefplot as an improvement on the one available in arm. There are other instances where the names have nothing in common.

> arm::coefplot(object)

> coefplot::coefplot(object)

Not only does this call the appropriate function, it also allows the function to be called without even loading the package beforehand.

3.3. Building a Package

Building a package is one of the more rewarding parts of working with R, especially sharing that package with the community through CRAN. Chapter 24 discusses this process in detail.

3.4. Conclusion

Packages make up the backbone of the R community and experience. They are often considered what makes working with R so desirable. This is how the community makes its work, and so many of the statistical techniques, available to the world. With such a large number of packages, finding the right one can be overwhelming. CRAN Task Views (http://cran.r-project.org/web/views/) offers a curated listing of packages for different needs. However, the best way to find a new package might just be to ask the community. Appendix A gives some resources for doing just that.

Chapter 4. Basics of R

R is a powerful tool for all manner of calculations, data manipulation and scientific computations. Before getting to the complex operations possible in R we must start with the basics. Like most languages R has its share of mathematical capability, variables, functions and data types.

4.1. Basic Math

Being a statistical programming language, R can certainly be used to do basic math and that is where we will start.

We begin with the “Hello, World!” of basic math: 1 + 1. In the console there is a right angle bracket (>) where code should be entered. Simply test R by running

> 1 + 1

[1] 2

If this returns 2, then everything is great; if not, then something is very, very wrong. Assuming it worked, let’s look at some slightly more complicated expressions:

> 1 + 2 + 3

[1] 6

> 3 * 7 * 2

[1] 42

> 4/2

[1] 2

> 4/3

[1] 1.333

These follow the basic order of operations: Parentheses, Exponents, Multiplication, Division, Addition and Subtraction (PEMDAS). This means operations inside parentheses take priority over other operations. Next on the priority list is exponentiation. After that, multiplication and division are performed, followed by addition and subtraction.

This is why the first two lines in the following code have the same result while the third is different.

> 4 * 6 + 5

[1] 29

> (4 * 6) + 5

[1] 29

> 4 * (6 + 5)

[1] 44
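
Exponentiation, written with the caret (^), follows the same rules, taking priority over the multiplication in the first line here.

> 2 * 3^2

[1] 18

> (2 * 3)^2

[1] 36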

So far we have put white space around each operator, such as * and /. This is not necessary but is encouraged as good coding practice.

4.2. Variables

Variables are an integral part of any programming language and R offers a great deal of flexibility. Unlike statically typed languages such as C++, R does not require variable types to be declared. A variable can take on any available data type as described in Section 4.3. It can also hold any R object such as a function, the result of an analysis or a plot. A single variable can at one point hold a number, then later hold a character and then later a number again.

4.2.1. Variable Assignment

There are a number of ways to assign a value to a variable, and again, this does not depend on the type of value being assigned.

The valid assignment operators are <- and = with the first being preferred.

For example, let’s save 2 to the variable x and 5 to the variable y.

> x <- 2

> x

[1] 2

> y = 5

> y

[1] 5

The arrow operator can also point in the other direction.

> 3 -> z

> z

[1] 3

The assignment operation can be used successively to assign a value to multiple variables simultaneously.

> a <- b <- 7

> a

[1] 7

> b

[1] 7

A more laborious, though sometimes necessary, way to assign variables is to use the assign function.

> assign("j", 4)

> j

[1] 4

Variable names can contain any combination of alphanumeric characters along with periods (.) and underscores (_). However, they cannot start with a number or an underscore.
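
For instance, the following names are valid, while starting a name with a number produces a syntax error (the exact wording of the error may vary).

> total.sales <- 100

> total_sales2 <- 200

> 2sales <- 300

Error: unexpected symbol in "2sales"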

The most common form of assignment in the R community is the left arrow (<-), which may seem awkward to use at first but eventually becomes second nature. It even seems to make sense, as the variable is sort of pointing to its value. There is also a particularly nice benefit for people coming from languages like SQL, where a single equal sign (=) tests for equality.

It is generally considered best practice to use actual names, usually nouns, for variables instead of single letters. This provides more information to the person reading the code. This is seen throughout this book.

4.2.2. Removing Variables

For various reasons a variable may need to be removed. This is easily done using remove or its shortcut rm.

> j

[1] 4

> rm(j)

> # now it is gone

> j

Error: object 'j' not found

This frees up memory so that R can store more objects, although it does not necessarily free up memory for the operating system. To guarantee that, use gc, which performs garbage collection, releasing unused memory to the operating system. R automatically does garbage collection periodically, so this function is not essential.
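
A call to gc takes no arguments for everyday use; it prints a small table of memory usage whose numbers vary from session to session, so the output is omitted here.

> # explicitly trigger garbage collection

> gc()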

Variable names are case sensitive, which can trip up people coming from a language like SQL or Visual Basic.

> theVariable <- 17

> theVariable

[1] 17

> THEVARIABLE

Error: object 'THEVARIABLE' not found

4.3. Data Types

There are numerous data types in R that store various kinds of data. The four main types of data most likely to be used are numeric, character (string), Date/POSIXct (time-based) and logical (TRUE/FALSE).

The type of data contained in a variable is checked with the class function.

> class(x)

[1] "numeric"

4.3.1. Numeric Data

As expected, R excels at crunching numbers, so numeric data is the most common type in R. The most commonly used numeric data type is numeric. This is similar to a float or double in other languages. It handles integers and decimals, both positive and negative, and, of course, zero. A numeric value stored in a variable is automatically assumed to be numeric. Testing whether a variable is numeric is done with the function is.numeric.

> is.numeric(x)

[1] TRUE

Another important, if less frequently used, type is integer. As the name implies this is for whole numbers only, no decimals. To set an integer to a variable it is necessary to append the value with an L. As with checking for a numeric, the is.integer function is used.

> i <- 5L

> i

[1] 5

> is.integer(i)

[1] TRUE

Do note that, even though i is an integer, it will also pass a numeric check.

> is.numeric(i)

[1] TRUE

R nicely promotes integers to numeric when needed. This is obvious when multiplying an integer by a numeric, but importantly it works when dividing an integer by another integer, resulting in a decimal number.

> class(4L)

[1] "integer"

> class(2.8)

[1] "numeric"

> 4L * 2.8

[1] 11.2

> class(4L * 2.8)

[1] "numeric"

>

> class(5L)

[1] "integer"

> class(2L)

[1] "integer"

> 5L/2L

[1] 2.5

> class(5L/2L)

[1] "numeric"

4.3.2. Character Data

Even though it is not explicitly mathematical, the character (string) data type is very common in statistical analysis and must be handled with care. R has two primary ways of handling character data: character and factor. While they may seem similar on the surface, they are treated quite differently.

> x <- "data"

> x

[1] "data"

> y <- factor("data")

> y

[1] data

Levels: data

Notice that x contains the word “data” encapsulated in quotes, while y has the word “data” without quotes and a second line of information about the levels of y. That is explained further in Section 4.4.2 about vectors.

Characters are case sensitive, so “Data” is different from “data” or “DATA.”
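
A minimal check, not part of the original example, confirms the case sensitivity.

> "Data" == "data"

[1] FALSE

> "data" == "data"

[1] TRUE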

To find the length of a character (or numeric) use the nchar function.

> nchar(x)

[1] 4

> nchar("hello")

[1] 5

> nchar(3)

[1] 1

> nchar(452)

[1] 3

This will not work for factor data.

> nchar(y)

Error: 'nchar()' requires a character vector

4.3.3. Dates

Dealing with dates and times can be difficult in any language, and to further complicate matters R has numerous different types of dates. The most useful are Date and POSIXct. Date stores just a date while POSIXct stores a date and time. Both objects are actually represented as the number of days (Date) or seconds (POSIXct) since January 1, 1970.

> date1 <- as.Date("2012-06-28")

> date1

[1] "2012-06-28"

> class(date1)

[1] "Date"

> as.numeric(date1)

[1] 15519

>

> date2 <- as.POSIXct("2012-06-28 17:42")

> date2

[1] "2012-06-28 17:42:00 EDT"

> class(date2)

[1] "POSIXct" "POSIXt"

> as.numeric(date2)

[1] 1340919720

Easier manipulation of date and time objects can be accomplished using the lubridate and chron packages.
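
For example, here is a brief sketch using lubridate (assuming the package is installed; this snippet is not from the book), whose ymd function parses a year-month-day string directly into a Date.

> library(lubridate)

> # parse a year-month-day string into a Date object

> date3 <- ymd("2012-06-28")

> date3

[1] "2012-06-28"

> class(date3)

[1] "Date"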

Using functions such as as.numeric or as.Date does not merely change the formatting of an object but actually changes the underlying type.

> class(date1)

[1] "Date"

> class(as.numeric(date1))

[1] "numeric"

4.3.4. Logical

logicals are a way of representing data that can be either TRUE or FALSE. Numerically, TRUE is the same as 1 and FALSE is the same as 0. So TRUE * 5 equals 5 while FALSE * 5 equals 0.

> TRUE * 5

[1] 5

> FALSE * 5

[1] 0

Similar to other types, logicals have their own test, using the is.logical function.

> k <- TRUE

> class(k)

[1] "logical"

> is.logical(k)

[1] TRUE

R provides T and F as shortcuts for TRUE and FALSE, respectively, but it is best practice not to use them, as they are simply variables storing the values TRUE and FALSE and can be overwritten, which can cause a great deal of frustration as seen in the following example.

> TRUE

[1] TRUE

> T

[1] TRUE

> class(T)

[1] "logical"

> T <- 7

> T

[1] 7

> class(T)

[1] "numeric"
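
Removing the variable restores the shortcut to its built-in value, a small follow-up to the example above.

> rm(T)

> T

[1] TRUE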

logicals can result from comparing two numbers or two characters.

> # does 2 equal 3?

> 2 == 3

[1] FALSE

> # does 2 not equal three?

> 2 != 3

[1] TRUE

> # is two less than three?

> 2 < 3

[1] TRUE

> # is two less than or equal to three?

> 2 <= 3

[1] TRUE

> # is two greater than three?

> 2 > 3

[1] FALSE

> # is two greater than or equal to three?

> 2 >= 3

[1] FALSE

> # is 'data' equal to 'stats'?

> "data" == "stats"

[1] FALSE

> # is 'data' less than 'stats'?

> "data" < "stats"

[1] TRUE

4.4. Vectors

A vector is a collection of elements, all of the same type. For instance, c(1, 3, 2, 1, 5) is a vector consisting of the numbers 1, 3, 2, 1, 5, in that order. Similarly, c("R", "Excel", "SAS", "Excel") is a vector of the character elements “R,” “Excel,” “SAS” and “Excel.” A vector cannot be of mixed type.

vectors play a crucial, and helpful, role in R. More than being simple containers, vectors in R are special in that R is a vectorized language. That means operations are applied to each element of the vector automatically, without the need to loop through the vector. This is a powerful concept that may seem foreign to people coming from other languages, but it is one of the greatest things about R.

vectors do not have a dimension, meaning there is no such thing as a column vector or row vector. These vectors are not like the mathematical vector where there is a difference between row and column orientation.1

1. Column or row vectors can be represented as one-dimensional matrices, which are discussed in Section 5.3.

The most common way to create a vector is with c. The “c” stands for combine because multiple elements are being combined into a vector.

> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

> x

[1] 1 2 3 4 5 6 7 8 9 10

4.4.1. Vector Operations

Now that we have a vector of the first ten numbers, we might want to multiply each element by 3. In R this is a simple operation using just the multiplication operator (*).

> x * 3

[1] 3 6 9 12 15 18 21 24 27 30

No loops are necessary. Addition, subtraction and division are just as easy. This also works for any number of operations.

> x + 2

[1] 3 4 5 6 7 8 9 10 11 12

> x - 3

[1] -2 -1 0 1 2 3 4 5 6 7

> x/4

[1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50

> x^2

[1] 1 4 9 16 25 36 49 64 81 100

> sqrt(x)

[1] 1.000 1.414 1.732 2.000 2.236 2.449 2.646 2.828 3.000 3.162

Earlier we created a vector of the first ten numbers using the c function. A shortcut for consecutive numbers is the : operator, which generates a sequence in either direction.

> 1:10

[1] 1 2 3 4 5 6 7 8 9 10

> 10:1

[1] 10 9 8 7 6 5 4 3 2 1

> -2:3

[1] -2 -1 0 1 2 3

> 5:-7

[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7

Vector operations can be extended even further. Let’s say we have two vectors of equal length. Each of the corresponding elements can be operated on together.

> # create two vectors of equal length

> x <- 1:10

> y <- -5:4

> # add them

> x + y

[1] -4 -2 0 2 4 6 8 10 12 14

> # subtract them

> x - y

[1] 6 6 6 6 6 6 6 6 6 6

> # multiply them

> x * y

[1] -5 -8 -9 -8 -5 0 7 16 27 40

> # divide them - notice division by 0 results in Inf

> x/y

[1] -0.2 -0.5 -1.0 -2.0 -5.0 Inf 7.0 4.0 3.0 2.5

> # raise one to the power of the other

> x^y

[1] 1.000e+00 6.250e-02 3.704e-02 6.250e-02 2.000e-01 1.000e+00

[7] 7.000e+00 6.400e+01 7.290e+02 1.000e+04

> # check the length of each

> length(x)

[1] 10

> length(y)

[1] 10

> # the length of them added together should be the same

> length(x + y)

[1] 10

In the preceding code block, notice the hash # symbol. This is used for comments. Anything following the hash, on the same line, will be commented out and not run.

Things get a little more complicated when operating on two vectors of unequal length. The shorter vector gets recycled, that is, its elements are repeated, in order, until they have been matched up with every element of the longer vector. If the longer one is not a multiple of the shorter one, a warning is given.

> x + c(1, 2)

[1] 2 4 4 6 6 8 8 10 10 12

> x + c(1, 2, 3)

Warning: longer object length is not a multiple of shorter object length

[1] 2 4 6 5 7 9 8 10 12 11

Comparisons also work on vectors. Here the result is a vector of the same length containing TRUE or FALSE for each element.

> x <= 5

[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

> x > y

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

> x < y

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

To test whether all the resulting elements are TRUE, use the all function. Similarly, the any function checks whether any element is TRUE.

> x <- 10:1

> y <- -4:5

> any(x < y)

[1] TRUE

> all(x < y)

[1] FALSE

The nchar function also acts on each element of a vector.

> q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",

+ "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")

> nchar(q)

[1] 6 8 8 7 5 8 10 6 7 6

> nchar(y)

[1] 2 2 2 2 1 1 1 1 1 1

Accessing individual elements of a vector is done using square brackets ([ ]). The first element of x is retrieved by typing x[1], the first two elements by x[1:2] and nonconsecutive elements by x[c(1, 4)].

> x[1]

[1] 10

> x[1:2]

[1] 10 9

> x[c(1, 4)]

[1] 10 7

This works for all types of vectors whether they are numeric, logical, character and so forth.

It is possible to give names to a vector either during creation or after the fact.

> # provide a name for each element of a vector using a name-value pair

> c(One = "a", Two = "y", Last = "r")

One Two Last

"a" "y" "r"

>

> # create a vector

> w <- 1:3

> # name the elements

> names(w) <- c("a", "b", "c")

> w

a b c

1 2 3
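
Once the elements are named they can be retrieved by name as well as by position; this is a short extension of the example above.

> w["b"]

b

2

> names(w)

[1] "a" "b" "c"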

4.4.2. Factor Vectors

factors are an important concept in R, especially when building models. Let’s create a simple vector of text data that has a few repeats. We will start with the vector q we created earlier and add some elements to it.

> q2 <- c(q, "Hockey", "Lacrosse", "Hockey", "Water Polo",

+ "Hockey", "Lacrosse")

Converting this to a factor is easy with as.factor.

> q2Factor <- as.factor(q2)

> q2Factor

[1] Hockey Football Baseball Curling Rugby Lacrosse

[7] Basketball Tennis Cricket Soccer Hockey Lacrosse

[13] Hockey Water Polo Hockey Lacrosse

11 Levels: Baseball Basketball Cricket Curling Football … Water Polo

Notice that after printing out every element of q2Factor, R also prints the levels of q2Factor. The levels of a factor are the unique values of that factor variable. Technically, R is giving each unique value of a factor a unique integer tying it back to the character representation. This can be seen with as.numeric.

> as.numeric(q2Factor)

[1] 6 5 1 4 8 7 2 10 3 9 6 7 6 11 6 7

In ordinary factors the order of the levels does not matter and one level is no different from another. Sometimes, however, it is important to understand the order of a factor, such as when coding education levels. Setting the ordered argument to TRUE creates an ordered factor with the order given in the levels argument.

> factor(x=c("High School", "College", "Masters", "Doctorate"),

+ levels=c("High School", "College", "Masters", "Doctorate"),

+ ordered=TRUE)

[1] High School College Masters Doctorate

Levels: High School < College < Masters < Doctorate

factors can drastically reduce the size of the variable because they are storing only the unique values, but they can cause headaches if not used properly. This will be discussed further throughout the book.
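
To get a rough sense of the storage difference, object.size can compare a long character vector with its factor version (a sketch, not from the book; exact sizes depend on the data and the R build).

> # many repeats of a few unique strings

> sports <- rep(c("Hockey", "Lacrosse", "Soccer"), times = 10000)

> sportsFactor <- as.factor(sports)

> # the factor stores integers plus the unique levels, so it is typically smaller

> object.size(sports)

> object.size(sportsFactor)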

4.5. Calling Functions

Earlier we briefly used a few basic functions like nchar, length and as.Date to illustrate some concepts. Functions are very important and helpful in any language because they make code easily repeatable. Almost every step taken in R involves using functions, so it is best to learn the proper way to call them. R function calling is filled with a good deal of nuance, so we are going to focus on the gist of what is needed to know. Of course, throughout the book there will be many examples of calling functions.

Let’s start with the simple mean function, which computes the average of a set of numbers. In its simplest form it takes a vector as an argument.

> mean(x)

[1] 5.5

More complicated functions have multiple arguments that can be either specified by the order they are entered or by using their name with an equal sign. We will see further use of this throughout the book.
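
As a small illustration (not one of the book's examples), mean has an na.rm argument that can be supplied by name to drop missing values before averaging.

> # the vector is matched by position, na.rm by name

> mean(c(1, 2, NA, 4), na.rm = TRUE)

[1] 2.333333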

R provides an easy way for users to build their own functions, which we will cover in more detail in Chapter 8.

4.6. Function Documentation

Any function provided in R has accompanying documentation, of varying quality of course. The easiest way to access that documentation is to place a question mark in front of the function name, like this: ?mean.

To get help on binary operators like +, * or == surround them with back ticks (`).

> ?`+`

> ?`*`

> ?`==`

There are occasions when we have only a sense of the function we want to use. In that case we can look up the function by using part of the name with apropos.

> apropos("mea")

[1] ".cache/mean-simple_ce29515dafe58a90a771568646d73aae"

[2] ".colMeans"

[3] ".rowMeans"

[4] "colMeans"

[5] "influence.measures"

[6] "kmeans"

[7] "mean"

[8] "mean.Date"

[9] "mean.default"

[10] "mean.difftime"

[11] "mean.POSIXct"

[12] "mean.POSIXlt"

[13] "mean_cl_boot"

[14] "mean_cl_normal"

[15] "mean_sdl"

[16] "mean_se"

[17] "rowMeans"

[18] "weighted.mean"

4.7. Missing Data

Missing data plays a critical role in both statistics and computing, and R has two types of missing data, NA and NULL. While they are similar, they behave differently and that difference needs attention.
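
As a brief hedged sketch of the distinction: NA marks a missing element inside a vector, while NULL is the absence of an object entirely and has zero length.

> z <- c(1, NA, 3)

> is.na(z)

[1] FALSE  TRUE FALSE

> v <- NULL

> is.null(v)

[1] TRUE

> length(v)

[1] 0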


Statistical Computation for Programmers, Scientists, Quants, Excel Users, and Other Professionals

Using the open source R language, you can build powerful statistical models to answer many of your most challenging questions. R has traditionally been difficult for non-statisticians to learn, and most R books assume far too much knowledge to be of help. R for Everyone is the solution.

Drawing on his unsurpassed experience teaching new users, professional data scientist Jared P. Lander has written the perfect tutorial for anyone new to statistical programming and modeling. Organized to make learning easy and intuitive, this guide focuses on the 20 percent of R functionality you’ll need to accomplish 80 percent of modern data tasks.

Lander’s self-contained chapters start with the absolute basics, offering extensive hands-on practice and sample code. You’ll download and install R; navigate and use the R environment; master basic program control, data import, and manipulation; and walk through several essential tests. Then, building on this foundation, you’ll construct several complete models, both linear and nonlinear, and use some data mining techniques.

By the time you’re done, you won’t just know how to write R programs, you’ll be ready to tackle the statistical problems you care about most.

 

COVERAGE INCLUDES

• Exploring R, RStudio, and R packages

• Using R for math: variable types, vectors, calling functions, and more

• Exploiting data structures, including data.frames, matrices, and lists

• Creating attractive, intuitive statistical graphics

• Writing user-defined functions

• Controlling program flow with if, ifelse, and complex checks

• Improving program efficiency with group manipulations

• Combining and reshaping multiple datasets

• Manipulating strings using R’s facilities and regular expressions

• Creating normal, binomial, and Poisson probability distributions

• Programming basic statistics: mean, standard deviation, and t-tests

• Building linear, generalized linear, and nonlinear models

• Assessing the quality of models and variable selection

• Preventing overfitting, using the Elastic Net and Bayesian methods

• Analyzing univariate and multivariate time series data

• Grouping data via K-means and hierarchical clustering

• Preparing reports, slideshows, and web pages with knitr

• Building reusable R packages with devtools and Rcpp

• Getting involved with the R global community

