One post tagged with "data manipulation"

Understanding SQL GROUP BY and DISTINCT with Examples

October 26, 2024 · 4 min read

SDE @ Intralinks

In SQL, the GROUP BY and DISTINCT clauses play a crucial role in data manipulation, particularly in organizing and summarizing data. Here, we explore the usage, syntax, and working of GROUP BY and DISTINCT with example data, along with insights into how GROUP BY functions behind the scenes.

Sample Data: Employees Table

We'll use the following sample dataset to demonstrate both GROUP BY and DISTINCT:

employee_id	name	department	job_title	salary
1	Alice	Sales	Sales Manager	70000
2	Bob	Sales	Sales Associate	50000
3	Charlie	HR	HR Manager	60000
4	David	HR	Recruiter	45000
5	Eve	IT	Software Engineer	80000
6	Frank	IT	IT Support	55000
7	Grace	IT	Software Engineer	80000

GROUP BY Clause

The GROUP BY clause organizes data into groups based on specified columns, often used with aggregate functions like SUM, COUNT, or MAX to summarize information within each group.

Basic Syntax

SELECT column1, column2, AGGREGATE_FUNCTION(column3)
FROM table_name
GROUP BY column1, column2;

Example 1: Counting Employees per Department

To find the number of employees in each department:

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

Result:

department	employee_count
Sales	2
HR	2
IT	3

This query groups employees by department, then counts the employees within each group.

Example 2: Grouping by Multiple Columns

If we want to group by both department and job title to get a more detailed count:

SELECT department, job_title, COUNT(*) AS employee_count
FROM employees
GROUP BY department, job_title;

Result:

department	job_title	employee_count
Sales	Sales Manager	1
Sales	Sales Associate	1
HR	HR Manager	1
HR	Recruiter	1
IT	Software Engineer	2
IT	IT Support	1

Here, we get counts of employees for each unique combination of department and job title.

DISTINCT Clause

The DISTINCT clause removes duplicate rows in the result set, returning only unique values for the specified columns.

Basic Syntax

SELECT DISTINCT column1, column2
FROM table_name;

Example: Listing Unique Job Titles

To list unique job titles from the employees table:

SELECT DISTINCT job_title
FROM employees;

Result:

job_title
Sales Manager
Sales Associate
HR Manager
Recruiter
Software Engineer
IT Support

This returns each unique job title once, eliminating duplicates.

How GROUP BY Works Behind the Scenes

When GROUP BY is executed, SQL follows several steps to efficiently organize and summarize the data:

Scan and Sort: SQL scans through the data in the table. If an index is present on the grouped column(s), the database engine can locate relevant data more quickly. SQL then sorts rows by the specified columns (e.g., department, job_title).
Divide into Groups: After sorting, SQL divides the data into groups based on distinct values in the specified columns.
Apply Aggregations: Within each group, SQL applies aggregate functions such as COUNT, SUM, or MAX. These functions operate on all rows in each group to return summarized results.
Return Results: Once all groups have been processed and aggregate calculations completed, the results are compiled and returned as the final output.

Example of GROUP BY Internals

For the query:

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

SQL sorts the employees data by department, creating "buckets" for each unique department. It then counts the entries within each bucket to get the employee_count for each department.

Key Differences Between GROUP BY and DISTINCT

Aspect	GROUP BY	DISTINCT
Purpose	Groups rows based on column values	Eliminates duplicate rows
Usage with Aggregates	Commonly used with aggregate functions	Used without aggregate functions
Multiple Columns	Groups data based on multiple columns	Ensures unique combinations of multiple columns

Summary

GROUP BY: Ideal when you want to organize data into groups and use aggregate functions.
DISTINCT: Useful for retrieving a unique set of rows without aggregating.

Understanding these clauses enables you to better manage and summarize data, offering powerful ways to analyze datasets efficiently.

Sample Data: Employees Table​

GROUP BY Clause​

Basic Syntax​

Example 1: Counting Employees per Department​

Example 2: Grouping by Multiple Columns​

DISTINCT Clause​

Basic Syntax​

Example: Listing Unique Job Titles​

How GROUP BY Works Behind the Scenes​

Example of GROUP BY Internals​

Key Differences Between GROUP BY and DISTINCT​

Summary​

Sample Data: Employees Table

GROUP BY Clause

Basic Syntax

Example 1: Counting Employees per Department

Example 2: Grouping by Multiple Columns

DISTINCT Clause

Basic Syntax

Example: Listing Unique Job Titles

How GROUP BY Works Behind the Scenes

Example of GROUP BY Internals

Key Differences Between GROUP BY and DISTINCT

Summary