Introduction to SQL (for an R user)
Although SQL is commonly used in industry, it’s not something that’s often used or taught in academia. I learned it on my own a few years ago, but since I don’t use it regularly, it’s hard to retain. To resolve this, I’ve created the following guide for basic SQL commands along with their equivalents in R/dplyr. Hopefully, this will allow me to pick it back up again more quickly in the future.
This guide is primarily based on sqlzoo.
Selecting columns and rows
SELECT
The SELECT
command selects or creates columns from the table. This is like dplyr::select
or dplyr::mutate
.
SELECT city, country
FROM table
Here are some other ways to use SELECT
:
- Select all columns: SELECT *
- Create a new column from existing columns: SELECT gdp/population
- Rename a column with alias: SELECT gdp/100 AS gdp2 – AS is optional
- Round a column: SELECT ROUND(gdp/100, 2) – round to 2 decimal places
- Distinct entries: SELECT DISTINCT country
- Case when: SELECT CASE WHEN country = ‘France’ THEN ‘Europe’ WHEN country = ‘India’ THEN ‘Asia’ ELSE ‘Other’ END continent
- Return first value that is not NULL: SELECT COALESCE(city, ‘None’)
- Concatenate a string: SELECT concat(city, “City”)
- Select top 2 rows: SELECT TOP 2 * FROM table – the command is LIMIT, used after FROM, for MySQL
Note that since you can use SELECT
to essentially create new columns, it can be useful to use a SELECT
within SELECT
.
WHERE
The WHERE
command filters rows from the table. This is like dplyr::filter
.
SELECT city
FROM table
WHERE country = 'France'
Here are some other ways to use WHERE
:
- Not equal: WHERE country <> ‘France’
- Element in a list:
- WHERE country IN (‘France’, ‘Germany’)
- WHERE country NOT IN (‘France’, ‘Germany’)
- Pattern match strings:
- WHERE country LIKE ‘F%’ – country starts with F
- WHERE country LIKE ‘_F%’ – second letter of country is F
- Numeric ranges: WHERE population BETWEEN 100 and 200
- NULL values: WHERE population IS NULL
- Boolean operators:
- WHERE population > 100 OR country = ‘France’ – either condition must be true
- WHERE population > 100 AND country = ‘France’ – both conditions must be true
- WHERE population > 100 XOR country = ‘France’ – one of the conditions must be true but not both
- SELECT within SELECT:
- WHERE population > (SELECT population FROM table where country = ‘France’) – greater than population of France
- WHERE population > ALL(SELECT population FROM table WHERE country IN (‘France’, ‘Germany’)) – greater than population of France and Germany
ORDER BY
The ORDER BY
command orders the rows in the table. By default, it arranges rows from “smallest” to “largest.” This is like dplyr::arrange
.
SELECT city
FROM table
ORDER BY country
Here are some other ways to use ORDER BY
:
- Order from largest to smallest: ORDER BY country DESC
- Order with multiple columns: ORDER BY country, population
- Order with a boolean value: ORDER BY country IN (‘France’, ‘Germany’) – puts France and Germany last
- Order randomly: ORDER BY RAND()
Aggregating rows
SUM, COUNT, MAX, AVG
These commands calculate aggregate values, optionally by groups. For example, below I count the number of cities and the sum of their populations for each country. This is like dplyr::summarize
and dplyr::group_by
.
SELECT COUNT(city), SUM(population)
FROM table
GROUP BY country
- Sum of column: SELECT SUM(population)
- Max of column: SELECT MAX(population)
- Mean of column: SELECT AVG(population)
- Count number of entries/rows: SELECT COUNT(city)
Note that to filter on groups, we use the HAVING
command instead of the WHERE
command.
- Filter on groups: HAVING SUM(population) > 100
Joining tables
JOIN
The JOIN
command joins tables with a matching column(s). This is like the join commands from dplyr: dplyr::inner_join
, dplyr::left_join
, dplyr::right_join
, and dplyr::full_join
.
SELECT brand, price
FROM sales JOIN purchases ON (sales.id = purchases.id)
Here are some other ways to use JOIN
:
- Match on multiple columns: FROM sales JOIN purchases ON (sales.id = purchases.id AND sales.date = purchases.date)
- Only return rows in both tables (default when you use JOIN): FROM sales INNER JOIN purchases ON (sales.id = purchases.id)
- Only return rows in left table: FROM sales LEFT JOIN purchases ON (sales.id = purchases.id)
- Only return rows in right table: FROM sales RIGHT JOIN purchases ON (sales.id = purchases.id)
- Return rows in either table: FROM sales FULL OUTER JOIN purchases ON (sales.id = purchases.id)
CROSS JOIN
The CROSS JOIN
command creates every combination of columns from two tables. I don’t believe I’ve ever needed to do a cross join in R, but as described here, you can modify the dplyr::full_join
command to do a cross join.
SELECT sales.brand, purchases.brand
FROM sales CROSS JOIN purchases
UNION
The UNION
command stacks rows (distinct values only). The UNION ALL
command does the same thing, allowing for duplicate values. This is like rbind
or dplyr::bind_rows
.
SELECT id FROM A
UNION
SELECT id FROM B
Window functions
RANK
The RANK
command creates the ranked order. You can create a ranking in R by using dplyr::mutate
and rank
.
SELECT RANK() OVER (ORDER BY population DESC)
FROM table
- To rank within partitions: SELECT RANK() OVER (PARTITION BY city ORDER BY population)
LAG and LEAD
The LAG
command shows data from the preceding row. Similarly, the LEAD
command shows data from the following row. This is like dplyr::lag
and dplyr::lead
.
SELECT cases, LAG(cases, 1) OVER (ORDER BY day) AS pcases
FROM covid
- To lag within partitions: SELECT LAG(cases, 1) OVER (PARTITION BY city ORDER BY day)
Insert, update, and delete
The INSERT
command inserts rows into a table. The UPDATE
command modifies rows. The DELETE
command deletes rows.
INSERT INTO table VALUES ('Beijing', 'China')
UPDATE table SET city = 'Shanghai', country = 'China' WHERE city = 'Beijing'
DELETE FROM table WHERE country = 'China'
- Insert with select: INSERT INTO table (SELECT city, country FROM table WHERE population > 100)