Introduction to SQL (for an R user)

Although SQL is commonly used in industry, it’s not something that’s often used or taught in academia. I learned it on my own a few years ago, but since I don’t use it regularly, it’s hard to retain. To resolve this, I’ve created the following guide for basic SQL commands along with their equivalents in R/dplyr. Hopefully, this will allow me to pick it back up again more quickly in the future.

This guide is primarily based on sqlzoo.

Selecting columns and rows

SELECT

The SELECT command selects or creates columns from the table. This is like dplyr::select or dplyr::mutate.

SELECT city, country
FROM table

Here are some other ways to use SELECT:

  • Select all columns: SELECT *
  • Create a new column from existing columns: SELECT gdp/population
  • Rename a column with alias: SELECT gdp/100 AS gdp2 – AS is optional
  • Round a column: SELECT ROUND(gdp/100, 2) – round to 2 decimal places
  • Distinct entries: SELECT DISTINCT country
  • Case when: SELECT CASE WHEN country = ‘France’ THEN ‘Europe’ WHEN country = ‘India’ THEN ‘Asia’ ELSE ‘Other’ END continent
  • Return first value that is not NULL: SELECT COALESCE(city, ‘None’)
  • Concatenate a string: SELECT concat(city, “City”)
  • Select top 2 rows: SELECT TOP 2 * FROM table – the command is LIMIT, used after FROM, for MySQL

Note that since you can use SELECT to essentially create new columns, it can be useful to use a SELECT within SELECT.

WHERE

The WHERE command filters rows from the table. This is like dplyr::filter.

SELECT city
FROM table
WHERE country = 'France'

Here are some other ways to use WHERE:

  • Not equal: WHERE country <> ‘France’
  • Element in a list:
    • WHERE country IN (‘France’, ‘Germany’)
    • WHERE country NOT IN (‘France’, ‘Germany’)
  • Pattern match strings:
    • WHERE country LIKE ‘F%’ – country starts with F
    • WHERE country LIKE ‘_F%’ – second letter of country is F
  • Numeric ranges: WHERE population BETWEEN 100 and 200
  • NULL values: WHERE population IS NULL
  • Boolean operators:
    • WHERE population > 100 OR country = ‘France’ – either condition must be true
    • WHERE population > 100 AND country = ‘France’ – both conditions must be true
    • WHERE population > 100 XOR country = ‘France’ – one of the conditions must be true but not both
  • SELECT within SELECT:
    • WHERE population > (SELECT population FROM table where country = ‘France’) – greater than population of France
    • WHERE population > ALL(SELECT population FROM table WHERE country IN (‘France’, ‘Germany’)) – greater than population of France and Germany

ORDER BY

The ORDER BY command orders the rows in the table. By default, it arranges rows from “smallest” to “largest.” This is like dplyr::arrange.

SELECT city
FROM table
ORDER BY country

Here are some other ways to use ORDER BY:

  • Order from largest to smallest: ORDER BY country DESC
  • Order with multiple columns: ORDER BY country, population
  • Order with a boolean value: ORDER BY country IN (‘France’, ‘Germany’) – puts France and Germany last
  • Order randomly: ORDER BY RAND()

Aggregating rows

SUM, COUNT, MAX, AVG

These commands calculate aggregate values, optionally by groups. For example, below I count the number of cities and the sum of their populations for each country. This is like dplyr::summarize and dplyr::group_by.

SELECT COUNT(city), SUM(population)
FROM table
GROUP BY country
  • Sum of column: SELECT SUM(population)
  • Max of column: SELECT MAX(population)
  • Mean of column: SELECT AVG(population)
  • Count number of entries/rows: SELECT COUNT(city)

Note that to filter on groups, we use the HAVING command instead of the WHERE command.

  • Filter on groups: HAVING SUM(population) > 100

Joining tables

JOIN

The JOIN command joins tables with a matching column(s). This is like the join commands from dplyr: dplyr::inner_join, dplyr::left_join, dplyr::right_join, and dplyr::full_join.

SELECT brand, price
FROM sales JOIN purchases ON (sales.id = purchases.id)

Here are some other ways to use JOIN:

  • Match on multiple columns: FROM sales JOIN purchases ON (sales.id = purchases.id AND sales.date = purchases.date)
  • Only return rows in both tables (default when you use JOIN): FROM sales INNER JOIN purchases ON (sales.id = purchases.id)
  • Only return rows in left table: FROM sales LEFT JOIN purchases ON (sales.id = purchases.id)
  • Only return rows in right table: FROM sales RIGHT JOIN purchases ON (sales.id = purchases.id)
  • Return rows in either table: FROM sales FULL OUTER JOIN purchases ON (sales.id = purchases.id)

CROSS JOIN

The CROSS JOIN command creates every combination of columns from two tables. I don’t believe I’ve ever needed to do a cross join in R, but as described here, you can modify the dplyr::full_join command to do a cross join.

SELECT sales.brand, purchases.brand
FROM sales CROSS JOIN purchases

UNION

The UNION command stacks rows (distinct values only). The UNION ALL command does the same thing, allowing for duplicate values. This is like rbind or dplyr::bind_rows.

SELECT id FROM A
UNION
SELECT id FROM B

Window functions

RANK

The RANK command creates the ranked order. You can create a ranking in R by using dplyr::mutate and rank.

SELECT RANK() OVER (ORDER BY population DESC)
FROM table
  • To rank within partitions: SELECT RANK() OVER (PARTITION BY city ORDER BY population)

LAG and LEAD

The LAG command shows data from the preceding row. Similarly, the LEAD command shows data from the following row. This is like dplyr::lag and dplyr::lead.

SELECT cases, LAG(cases, 1) OVER (ORDER BY day) AS pcases
FROM covid
  • To lag within partitions: SELECT LAG(cases, 1) OVER (PARTITION BY city ORDER BY day)

Insert, update, and delete

The INSERT command inserts rows into a table. The UPDATE command modifies rows. The DELETE command deletes rows.

INSERT INTO table VALUES ('Beijing', 'China')
UPDATE table SET city = 'Shanghai', country = 'China' WHERE city = 'Beijing'
DELETE FROM table WHERE country = 'China'
  • Insert with select: INSERT INTO table (SELECT city, country FROM table WHERE population > 100)