Understanding SQL GROUP BY and DISTINCT with Examples
In SQL, the GROUP BY
and DISTINCT
clauses play a crucial role in data manipulation, particularly in organizing and summarizing data. Here, we explore the usage, syntax, and working of GROUP BY
and DISTINCT
with example data, along with insights into how GROUP BY
functions behind the scenes.
Sample Data: Employees Table
We'll use the following sample dataset to demonstrate both GROUP BY
and DISTINCT
:
employee_id | name | department | job_title | salary |
---|---|---|---|---|
1 | Alice | Sales | Sales Manager | 70000 |
2 | Bob | Sales | Sales Associate | 50000 |
3 | Charlie | HR | HR Manager | 60000 |
4 | David | HR | Recruiter | 45000 |
5 | Eve | IT | Software Engineer | 80000 |
6 | Frank | IT | IT Support | 55000 |
7 | Grace | IT | Software Engineer | 80000 |
GROUP BY Clause
The GROUP BY
clause organizes data into groups based on specified columns, often used with aggregate functions like SUM
, COUNT
, or MAX
to summarize information within each group.
Basic Syntax
SELECT column1, column2, AGGREGATE_FUNCTION(column3)
FROM table_name
GROUP BY column1, column2;
Example 1: Counting Employees per Department
To find the number of employees in each department:
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;
Result:
department | employee_count |
---|---|
Sales | 2 |
HR | 2 |
IT | 3 |
This query groups employees by department, then counts the employees within each group.
Example 2: Grouping by Multiple Columns
If we want to group by both department and job title to get a more detailed count:
SELECT department, job_title, COUNT(*) AS employee_count
FROM employees
GROUP BY department, job_title;
Result:
department | job_title | employee_count |
---|---|---|
Sales | Sales Manager | 1 |
Sales | Sales Associate | 1 |
HR | HR Manager | 1 |
HR | Recruiter | 1 |
IT | Software Engineer | 2 |
IT | IT Support | 1 |
Here, we get counts of employees for each unique combination of department and job title.
DISTINCT Clause
The DISTINCT
clause removes duplicate rows in the result set, returning only unique values for the specified columns.
Basic Syntax
SELECT DISTINCT column1, column2
FROM table_name;
Example: Listing Unique Job Titles
To list unique job titles from the employees
table:
SELECT DISTINCT job_title
FROM employees;
Result:
job_title |
---|
Sales Manager |
Sales Associate |
HR Manager |
Recruiter |
Software Engineer |
IT Support |
This returns each unique job title once, eliminating duplicates.
How GROUP BY Works Behind the Scenes
When GROUP BY
is executed, SQL follows several steps to efficiently organize and summarize the data:
-
Scan and Sort: SQL scans through the data in the table. If an index is present on the grouped column(s), the database engine can locate relevant data more quickly. SQL then sorts rows by the specified columns (e.g.,
department
,job_title
). -
Divide into Groups: After sorting, SQL divides the data into groups based on distinct values in the specified columns.
-
Apply Aggregations: Within each group, SQL applies aggregate functions such as
COUNT
,SUM
, orMAX
. These functions operate on all rows in each group to return summarized results. -
Return Results: Once all groups have been processed and aggregate calculations completed, the results are compiled and returned as the final output.
Example of GROUP BY Internals
For the query:
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;
SQL sorts the employees
data by department
, creating "buckets" for each unique department. It then counts the entries within each bucket to get the employee_count
for each department.
Key Differences Between GROUP BY and DISTINCT
Aspect | GROUP BY | DISTINCT |
---|---|---|
Purpose | Groups rows based on column values | Eliminates duplicate rows |
Usage with Aggregates | Commonly used with aggregate functions | Used without aggregate functions |
Multiple Columns | Groups data based on multiple columns | Ensures unique combinations of multiple columns |
Summary
GROUP BY
: Ideal when you want to organize data into groups and use aggregate functions.DISTINCT
: Useful for retrieving a unique set of rows without aggregating.
Understanding these clauses enables you to better manage and summarize data, offering powerful ways to analyze datasets efficiently.