If you are using R to do data analysis inside a company, most of the data you need probably already lives in a database (it’s just a matter of figuring out which one!). However, you will learn how to load data into a local database in order to demonstrate dplyr’s database tools. At the end, I’ll also give you a few pointers if you do need to set up your own database. dplyr provides a grammar for manipulating tables in R. This cheat sheet will guide you through the grammar, reminding you how to select, filter, arrange, mutate, summarise, group, and join data frames and tibbles. Updated January 2017.
New cheat-sheet for the dplyrXdf package. Hadley Wickham's dplyr package is an amazing tool for restructuring, filtering, and aggregating data sets using its elegant grammar of data manipulation. By default, it works on in-memory data frames, which means you're limited to the amount of data you can fit into R's memory. If you are new to dplyr, the best place to start is the data import chapter in R for data science. Installation # The easiest way to get dplyr is to install the whole tidyverse: install.packages('tidyverse') # Alternatively, install just dplyr: install.packages('dplyr') # Or the development version from GitHub: # install.packages('devtools. Work with strings with stringr:: CHEAT SHEET Detect Matches str_detect(string, pattern) Detect the presence of a pattern match in a string. str_detect(fruit, 'a') str_which(string, pattern) Find the indexes of strings that contain a pattern match. str_which(fruit, 'a') str_count(string, pattern) Count the number of matches in a string.
Meeting time: Tuesdays from 1:30 to 3 pm (beginning June 11) | Instructor: Kelsey Moty
This workshop will help you to learn the fundamentals of R needed to manipulate, visualize, and describe your data. This workshop has a particular emphasis on producing clean and reproducible code in line with coding and open science best practices. Due to time limitations, we will not be able to go over how to do statistical modeling in R; however, I will provide a series of resources at the end of this list that you can look over on your own.
Some of these resources may at times be redundant with one another. Feel free to skip over material that you feel comfortable with. Most importantly, make sure to work through the exercises! The best way to learn how to code is by actually coding :)
Each week, we will meet for an hour and a half to go over the topic for that week. This meeting is meant to be collaborative. You will work together with other people in our lab to get started with each week's R skill. However, you will also need to complete some of the lesson on your own time, as an hour and a half is likely not enough time to practice that week's topic. If questions come up outside of our Tuesday meeting, feel free to post them on the R Workshop Slack channel!
You will get to apply the skills learned in this workshop to a dataset from a research project you are currently working on in the lab. At the end of the workshop, you will share with other members of the lab the dataset you cleaned up, a plot you created from that dataset, and some kind of analysis you did on that dataset (whether descriptive or inferential).
Before we begin, this workshop pulls from resources written by a lot of amazing people and they deserve credit for it!
A number of the book chapters and other resources we are reading were written by Hadley Wickham, Danielle Navarro, Jenny Bryan, Jim Hester, Kieran Healy, and Andy Field. Several of the tutorials we are working through are from a course that was taught by Dale Barr and Lisa DeBruine.
Getting your data ready for statistical analysis
- Downloading R: Download the appropriate version for your operating system (Mac or Windows)
Downloading RStudio: RStudio makes it much easier to code in R
- Reading + exercises:Learning basics about R (Part 1)
Reading + exercises:Learning basics about R (Part 2)
Reading:More about packages
Reading:More about variables
Reading:More about vectors
Resource:Cheat sheet on how to use RStudio
Resource:Cheat sheet on basic R functions
- Reading:What makes a good plot?
Notes + exercises:Making plots
Reading + exercises:Getting a better understanding of the code used to make plots (Chapter 3, especially 3.3 - 3.10)
Resource:Examples of plots with corresponding R code
Resource:Resource for helping you select the best way to visualize your data
Resource:R Graphics Cookbook: A practical guide to help you build graphs in R
Resource:Cheat sheet on data visualization
- Reading:Tidy Data
Reading:Using pipes to tidy data
Notes + exercises:Learning tidyr
Reading (optional):Manipulating your data using tidyr (this reading provides similar information as the other readings, but may be useful to you if the other materials weren't clear)
Resource:Cheat sheet on importing and tidying data
Resource:Cheat sheet on processing dates using lubridate
- Reading:Describing data
Reading + exercises:Data transformation
Notes + exercises:Learning the main 6 dplyr verbs
Resource:Cheat sheet on data transformation with dplyr
- Reading + exercises:Relational data
Notes + exercises:Joining data using dplyr's join verbs
Resource:Cheat sheet on data transformation with dplyr
- Reading + exercises:Using loops in R
Reading + exercises:Using branches in R
Reading + exercises:Creating your own functions in R
Notes + exercises:Iterating and more practice creating your own functions in R
Reading (optional):More about loops and iterating in R
Reading (optional):More about writing your own functions in R
- Reading:Using R Markdown
Notes + exercises:Creating reproducible code in R
Reading:How to properly set paths
Slides:How to name files
Reading:How to debug your R code
Resource:Cheat sheet on R Markdown
- Reading:Why GitHub?
Reading + exercises:Installing Git (Read Chapters 4 - 7; 8 is optional)
Reading + exercises:Connecting GitHub and RStudio (Chapters 9 & 12; 14 is helpful if you are having problems connecting!)
Reading + exercises:Using GitHub to store R code (Read through Chapter 15; 16 and 17 are for your reference for future projects)
Reading + exercises:Basics of Git (Chapter 20; Chapter 21 - 23 for more advanced stuff)
data.table and dplyr cheat-sheet
This is a cheat-sheet on data manipulation using the data.table and dplyr packages (sqldf will be included soon…). The package dplyr is an excellent and intuitive tool for data manipulation in R. Because its data processing steps are intuitive and its concepts are somewhat similar to SQL, dplyr is becoming increasingly popular. Another reason is that it can be integrated in SparkR seamlessly. Mastering dplyr will be a must if you want to get started with SparkR.
I found this cheat-sheet very useful when using dplyr, and my post is inspired by it. This cheat sheet shows data manipulation with data.table / data.frame and dplyr side by side. It is especially useful for those who want to convert their data manipulation style from data.table to dplyr. Six kinds of data investigation and manipulation are covered:
- Summary of data
- subset rows
- subset columns
- summarize data
- group data
- create new data
Select rows that meet logical criteria:
dplyr
data.frame / data.table
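A minimal sketch of row filtering in the three styles, assuming the built-in iris data set. The threshold of 7 and the object names are illustrative choices, not taken from the original post:

```r
library(dplyr)
library(data.table)

# dplyr: keep the rows where the condition is TRUE
big_dplyr <- filter(iris, Sepal.Length > 7)

# base data.frame: logical index on rows, keep all columns
big_base <- iris[iris$Sepal.Length > 7, ]

# data.table: columns can be referenced by name inside [ ]
iris_dt <- as.data.table(iris)  # iris_dt is a hypothetical name
big_dt <- iris_dt[Sepal.Length > 7]
```

All three objects should hold the same set of rows; only the syntax differs.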
Remove duplicate rows:
dplyr
data.table
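One way to drop duplicate rows, sketched on the built-in iris data set (the object names are assumptions made for illustration):

```r
library(dplyr)
library(data.table)

# dplyr: distinct() keeps one copy of each unique row
dedup_dplyr <- distinct(iris)

# data.table: unique() has a data.table method that does the same
iris_dt <- as.data.table(iris)
dedup_dt <- unique(iris_dt)
```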
Randomly select fraction of rows
dplyr
Randomly select n rows
dplyr
data.table / data.frame
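Random sampling could look roughly like this, assuming the built-in iris data set. The seed, the fraction 0.5, and the sample size 10 are arbitrary illustrative values; newer dplyr versions also offer slice_sample() as a replacement for the two dplyr functions used here:

```r
library(dplyr)
library(data.table)

set.seed(42)  # arbitrary seed so the draws are repeatable

# dplyr: sample a fraction of the rows, or a fixed number of rows
half_dplyr <- sample_frac(iris, 0.5)
ten_dplyr <- sample_n(iris, 10)

# base data.frame: sample row indices, then subset
ten_base <- iris[sample(nrow(iris), 10), ]

# data.table: .N is the number of rows in the table
iris_dt <- as.data.table(iris)
ten_dt <- iris_dt[sample(.N, 10)]
```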
Select rows by position
dplyr
data.table / data.frame
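Selecting rows by position might be written as follows. This is a sketch on the built-in iris data set, and the positions 10 to 15 are chosen only for illustration:

```r
library(dplyr)
library(data.table)

# dplyr: slice() takes row positions
rows_dplyr <- slice(iris, 10:15)

# base data.frame: integer index on rows
rows_base <- iris[10:15, ]

# data.table: a single index inside [ ] refers to rows
iris_dt <- as.data.table(iris)
rows_dt <- iris_dt[10:15]
```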
Select and order top n entries (by group if group data)
dplyr
data.table
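A possible way to take the top entries per group, assuming the built-in iris data set and dplyr 1.0 or later (slice_max() is used here; older code often used top_n()). The choice of the two longest sepals per species is illustrative:

```r
library(dplyr)
library(data.table)

# dplyr: the two largest Sepal.Length values within each Species
top_dplyr <- iris %>%
  group_by(Species) %>%
  slice_max(Sepal.Length, n = 2, with_ties = FALSE) %>%
  ungroup()

# data.table: sort descending first, then take the head of each group
iris_dt <- as.data.table(iris)
top_dt <- iris_dt[order(-Sepal.Length), head(.SD, 2), by = Species]
```

Without the grouping step, the same calls return the overall top entries instead.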
dplyr
data.frame
iris[c('Sepal.Width', 'Petal.Length', 'Species')]
data.table
Select columns whose name contains a character string
Select columns whose name ends with a character string
Select every column
dplyr
data.frame
Select columns whose name matches a regular expression
Select columns names x1,x2,x3,x4,x5
select(iris, num_range('x', 1:5))
Select columns whose names are in a group of names
Select column whose name starts with a character string
Select all columns between Sepal.Length and Petal.Width (inclusive)
Select all columns except Species.
dplyr
data.frame
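The column-selection helpers listed above can be sketched as follows, assuming the built-in iris data set. The search strings and object names are illustrative, and all_of() assumes a reasonably recent dplyr/tidyselect:

```r
library(dplyr)

has_dot <- select(iris, contains("."))            # name contains a string
lengths_only <- select(iris, ends_with("Length")) # name ends with a string
all_cols <- select(iris, everything())            # every column
has_t <- select(iris, matches(".t."))             # name matches a regex
picked <- select(iris, all_of(c("Species", "Sepal.Width")))  # names in a set
sepals <- select(iris, starts_with("Sepal"))      # name starts with a string
measures <- select(iris, Sepal.Length:Petal.Width)  # inclusive range
no_species <- select(iris, -Species)              # everything except Species

# base data.frame: drop a column by comparing names
no_species_base <- iris[, names(iris) != "Species"]
```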
The package dplyr allows you to easily compute first, last, nth, n, n_distinct, min, max, mean, median, var, sd of a vector as a summary of the table.
Summarize data into single row of values
dplyr
Apply summary function to each column
Note: mean cannot be applied to a factor column.
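A sketch of both kinds of summary, assuming the built-in iris data set and dplyr 1.0 or later (across() is used for the per-column case; older code used summarise_each()). Restricting to numeric columns sidesteps the factor issue noted above:

```r
library(dplyr)

# One row of summary values for a single column
one_row <- summarise(iris, avg = mean(Sepal.Length), n = n())

# Apply mean to every numeric column; Species (a factor) is skipped
col_means <- summarise(iris, across(where(is.numeric), mean))

# base equivalent: select numeric columns, then sapply
num_cols <- sapply(iris, is.numeric)
col_means_base <- sapply(iris[num_cols], mean)
```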
Count number of rows with each unique value of variable (with or without weights)
dplyr
data.table:
aggregate {stats}
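Counting rows per value might look like this in each style, assuming the built-in iris data set. Using Sepal.Length as the weight is only an illustration of the wt argument:

```r
library(dplyr)
library(data.table)

# dplyr: unweighted count, then a count weighted by Sepal.Length
n_dplyr <- count(iris, Species)
n_weighted <- count(iris, Species, wt = Sepal.Length)

# data.table: .N is the number of rows in each group
iris_dt <- as.data.table(iris)
n_dt <- iris_dt[, .N, by = Species]

# stats::aggregate: count by taking the length of any column per group
n_agg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = length)
```

With wt supplied, the n column holds the sum of the weights per group rather than a row count.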
Group data into rows with the same value of Species
dplyr
data.table: this is usually performed with some aggregation computation
Remove grouping information from data frame
dplyr
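Grouping and ungrouping in dplyr can be sketched as below, assuming the built-in iris data set (the object names are illustrative):

```r
library(dplyr)

# group_by() attaches grouping metadata; the rows themselves are unchanged
grouped <- group_by(iris, Species)
group_vars(grouped)      # names of the grouping variables

# ungroup() removes the grouping metadata again
ungrouped <- ungroup(grouped)
is_grouped_df(ungrouped)
```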
Compute separate summary row for each group
dplyr
data.frame
data.table
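A per-group summary could be written like this in each style, assuming the built-in iris data set and taking the mean sepal length as an illustrative statistic:

```r
library(dplyr)
library(data.table)

# dplyr: one summary row per Species
by_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Length))

# base data.frame: aggregate() with a formula
by_base <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# data.table: .() builds the result columns, by= defines the groups
iris_dt <- as.data.table(iris)
by_dt <- iris_dt[, .(avg = mean(Sepal.Length)), by = Species]
```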
mutate uses window functions, which take a vector of values and return another vector of values, such as:
Compute and append one or more new columns
data.frame / data.table
dplyr
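Appending new columns might look like this, assuming the built-in iris data set. The column names sepal_sum, rank, and running are hypothetical, and min_rank() and cumsum() stand in as examples of window functions:

```r
library(dplyr)
library(data.table)

# dplyr: mutate() returns a copy with the new columns appended
new_dplyr <- mutate(iris,
                    sepal_sum = Sepal.Length + Sepal.Width,
                    rank = min_rank(Sepal.Length),
                    running = cumsum(Sepal.Length))

# base data.frame: assign into a copy
new_base <- iris
new_base$sepal_sum <- new_base$Sepal.Length + new_base$Sepal.Width

# data.table: := adds the column by reference (modifies new_dt in place)
new_dt <- as.data.table(iris)
new_dt[, sepal_sum := Sepal.Length + Sepal.Width]
```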
Apply window function to each column
dplyr
base
data.table
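Applying a window function to every column can be sketched as follows, assuming the built-in iris data set, dplyr 1.0 or later, and cumsum() as the illustrative window function. Only numeric columns are transformed, since cumsum() is not meaningful for the Species factor:

```r
library(dplyr)
library(data.table)

# Names of the numeric columns (num_cols is a hypothetical helper name)
num_cols <- names(iris)[sapply(iris, is.numeric)]

# dplyr: across() applies the function to each selected column
win_dplyr <- mutate(iris, across(all_of(num_cols), cumsum))

# base: lapply over the columns, then rebuild a data.frame
win_base <- as.data.frame(lapply(iris[num_cols], cumsum))

# data.table: .SD holds the columns named in .SDcols
iris_dt <- as.data.table(iris)
win_dt <- iris_dt[, lapply(.SD, cumsum), .SDcols = num_cols]
```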
Compute one or more new columns. Drop original columns
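In dplyr this is typically done with transmute(); a minimal sketch on the built-in iris data set, where sepal_area is a hypothetical column name:

```r
library(dplyr)

# transmute() keeps only the columns it creates
area_only <- transmute(iris, sepal_area = Sepal.Length * Sepal.Width)
names(area_only)
```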
Compute new variable by group.
dplyr
iris %>% group_by(Species) %>% mutate(ave = mean(Sepal.Length))
data.table
iris[, ave:=mean(Sepal.Length), by = Species]
data.frame
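For the data.frame case, one possible base R equivalent is ave(), whose default function is the mean. The sketch below assumes the built-in iris data set and checks the base result against the dplyr and data.table versions; the object names are illustrative, and the data.table call assumes iris has first been converted with as.data.table():

```r
library(dplyr)
library(data.table)

# base data.frame: ave() returns a group-wise value for every row
df_base <- iris
df_base$ave <- ave(df_base$Sepal.Length, df_base$Species)

# dplyr version
df_dplyr <- iris %>%
  group_by(Species) %>%
  mutate(ave = mean(Sepal.Length)) %>%
  ungroup()

# data.table version, on a converted copy
iris_dt <- as.data.table(iris)
iris_dt[, ave := mean(Sepal.Length), by = Species]

# The new column should agree across all three approaches
all.equal(df_base$ave, df_dplyr$ave)
all.equal(df_base$ave, iris_dt$ave)
```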
You can verify the result df1, df2 using: