Understanding SQL GROUP BY and DISTINCT with Examples

· 4 min read
PSVNL SAI KUMAR
SDE @ Intralinks

In SQL, GROUP BY and DISTINCT are two common tools for organizing and summarizing data. This post covers the syntax and behavior of each with example data, and looks at how a database engine typically processes GROUP BY behind the scenes.

Sample Data: Employees Table

We'll use the following sample dataset to demonstrate both GROUP BY and DISTINCT:

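| employee_id | name | department | job_title | salary |
| --- | --- | --- | --- | --- |
| 1 | Alice | Sales | Sales Manager | 70000 |
| 2 | Bob | Sales | Sales Associate | 50000 |
| 3 | Charlie | HR | HR Manager | 60000 |
| 4 | David | HR | Recruiter | 45000 |
| 5 | Eve | IT | Software Engineer | 80000 |
| 6 | Frank | IT | IT Support | 55000 |
| 7 | Grace | IT | Software Engineer | 80000 |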
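To try the queries in this post yourself, the sample table can be loaded into an in-memory database. The sketch below is one way to do that with Python's standard sqlite3 module; the choice of SQLite and the column types are assumptions, since the post does not name a database engine.

```python
import sqlite3

# Rows copied from the sample Employees table above.
ROWS = [
    (1, "Alice", "Sales", "Sales Manager", 70000),
    (2, "Bob", "Sales", "Sales Associate", 50000),
    (3, "Charlie", "HR", "HR Manager", 60000),
    (4, "David", "HR", "Recruiter", 45000),
    (5, "Eve", "IT", "Software Engineer", 80000),
    (6, "Frank", "IT", "IT Support", 55000),
    (7, "Grace", "IT", "Software Engineer", 80000),
]

# In-memory SQLite database; column types are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "job_title TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", ROWS)

# Sanity check: all seven sample rows were inserted.
total = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(total)  # 7
```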

GROUP BY Clause

The GROUP BY clause collects rows that share the same values in the specified columns into groups. It is most often used with aggregate functions such as SUM, COUNT, or MAX to summarize the rows within each group.

Basic Syntax

SELECT column1, column2, AGGREGATE_FUNCTION(column3)
FROM table_name
GROUP BY column1, column2;

Example 1: Counting Employees per Department

To find the number of employees in each department:

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

Result:

| department | employee_count |
| --- | --- |
| Sales | 2 |
| HR | 2 |
| IT | 3 |

This query groups employees by department, then counts the employees within each group. Without an ORDER BY clause, the order of the result rows is not guaranteed.
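COUNT is only one of the aggregates mentioned earlier. As a rough sketch, the same grouping can also report SUM and MAX of salary per department. The example assumes SQLite through Python's sqlite3 module, and the ORDER BY is added only to make the row order predictable.

```python
import sqlite3

# Rows copied from the sample Employees table in this post.
ROWS = [
    (1, "Alice", "Sales", "Sales Manager", 70000),
    (2, "Bob", "Sales", "Sales Associate", 50000),
    (3, "Charlie", "HR", "HR Manager", 60000),
    (4, "David", "HR", "Recruiter", 45000),
    (5, "Eve", "IT", "Software Engineer", 80000),
    (6, "Frank", "IT", "IT Support", 55000),
    (7, "Grace", "IT", "Software Engineer", 80000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "job_title TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", ROWS)

# One output row per department, with three aggregates computed per group.
summary = conn.execute(
    "SELECT department, COUNT(*), SUM(salary), MAX(salary) "
    "FROM employees "
    "GROUP BY department "
    "ORDER BY department"
).fetchall()

for row in summary:
    print(row)
# ('HR', 2, 105000, 60000)
# ('IT', 3, 215000, 80000)
# ('Sales', 2, 120000, 70000)
```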

Example 2: Grouping by Multiple Columns

If we want to group by both department and job title to get a more detailed count:

SELECT department, job_title, COUNT(*) AS employee_count
FROM employees
GROUP BY department, job_title;

Result:

| department | job_title | employee_count |
| --- | --- | --- |
| Sales | Sales Manager | 1 |
| Sales | Sales Associate | 1 |
| HR | HR Manager | 1 |
| HR | Recruiter | 1 |
| IT | Software Engineer | 2 |
| IT | IT Support | 1 |

Here, we get counts of employees for each unique combination of department and job title.
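One way to convince yourself that the multi-column grouping is just a finer split of the per-department grouping is to add the detailed counts back up. The sketch below does this in Python on top of SQLite (an assumed engine); the rollup dictionary is a helper introduced here for illustration.

```python
import sqlite3

# Rows copied from the sample Employees table in this post.
ROWS = [
    (1, "Alice", "Sales", "Sales Manager", 70000),
    (2, "Bob", "Sales", "Sales Associate", 50000),
    (3, "Charlie", "HR", "HR Manager", 60000),
    (4, "David", "HR", "Recruiter", 45000),
    (5, "Eve", "IT", "Software Engineer", 80000),
    (6, "Frank", "IT", "IT Support", 55000),
    (7, "Grace", "IT", "Software Engineer", 80000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "job_title TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", ROWS)

# One row per (department, job_title) combination.
detailed = conn.execute(
    "SELECT department, job_title, COUNT(*) "
    "FROM employees "
    "GROUP BY department, job_title "
    "ORDER BY department, job_title"
).fetchall()

# Summing the detailed counts per department recovers the Example 1 counts.
rollup = {}
for department, _job_title, n in detailed:
    rollup[department] = rollup.get(department, 0) + n

print(len(detailed))  # 6
print(rollup)         # {'HR': 2, 'IT': 3, 'Sales': 2}
```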

DISTINCT Clause

The DISTINCT keyword removes duplicate rows from the result set. It applies to the whole select list, so each unique combination of the selected columns is returned once.

Basic Syntax

SELECT DISTINCT column1, column2
FROM table_name;

Example: Listing Unique Job Titles

To list unique job titles from the employees table:

SELECT DISTINCT job_title
FROM employees;

Result:

job_title
Sales Manager
Sales Associate
HR Manager
Recruiter
Software Engineer
IT Support

This returns each unique job title once, eliminating duplicates.
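The basic syntax allows more than one column, in which case DISTINCT compares whole rows rather than each column separately. A small sketch, again assuming SQLite via Python's sqlite3 module, with ORDER BY added only for a stable row order:

```python
import sqlite3

# Rows copied from the sample Employees table in this post.
ROWS = [
    (1, "Alice", "Sales", "Sales Manager", 70000),
    (2, "Bob", "Sales", "Sales Associate", 50000),
    (3, "Charlie", "HR", "HR Manager", 60000),
    (4, "David", "HR", "Recruiter", 45000),
    (5, "Eve", "IT", "Software Engineer", 80000),
    (6, "Frank", "IT", "IT Support", 55000),
    (7, "Grace", "IT", "Software Engineer", 80000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "job_title TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", ROWS)

# DISTINCT over two columns: each (department, job_title) pair appears once.
pairs = conn.execute(
    "SELECT DISTINCT department, job_title "
    "FROM employees "
    "ORDER BY department, job_title"
).fetchall()

# DISTINCT over one column: each department appears once.
departments = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT department FROM employees ORDER BY department"
    )
]

print(len(pairs))   # 6 (Eve and Grace share the same pair)
print(departments)  # ['HR', 'IT', 'Sales']
```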

How GROUP BY Works Behind the Scenes

The exact strategy depends on the database engine and its query planner, but conceptually a GROUP BY query is processed in several steps:

  1. Scan and Sort: The engine scans the rows of the table. If an index exists on the grouped column(s), the engine may be able to read the rows already in grouped order. Otherwise it typically sorts the rows by the grouping columns (e.g., department, job_title). Many engines can instead build a hash table keyed on those columns and skip the sort.

  2. Divide into Groups: The engine then divides the rows into groups, one for each distinct combination of values in the grouping columns.

  3. Apply Aggregations: Within each group, the engine applies aggregate functions such as COUNT, SUM, or MAX. These functions operate on all rows in the group and return one summarized value per group.

  4. Return Results: Once all groups have been processed and aggregate calculations completed, the results are compiled and returned as the final output.
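The four steps can be imitated in a few lines of plain Python. This is only a conceptual sketch of the sort-based strategy, not how any particular database implements it; sort_group_count and the two-column EMPLOYEES list are names made up for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# (name, department) pairs taken from the sample Employees table.
EMPLOYEES = [
    ("Alice", "Sales"),
    ("Bob", "Sales"),
    ("Charlie", "HR"),
    ("David", "HR"),
    ("Eve", "IT"),
    ("Frank", "IT"),
    ("Grace", "IT"),
]


def sort_group_count(rows, key_index):
    """Hypothetical helper: COUNT(*) per group using sort-then-group."""
    key = itemgetter(key_index)
    # Step 1: scan all rows and sort them by the grouping column.
    ordered = sorted(rows, key=key)
    result = []
    # Step 2: consecutive rows with the same key form one group.
    for value, group in groupby(ordered, key=key):
        # Step 3: apply the aggregate (here COUNT) to the group's rows.
        result.append((value, sum(1 for _ in group)))
    # Step 4: return one summarized row per group.
    return result


counts = sort_group_count(EMPLOYEES, 1)
print(counts)  # [('HR', 2), ('IT', 3), ('Sales', 2)]
```

Note that itertools.groupby only merges adjacent rows, which is why the sort in step 1 has to come first.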

Example of GROUP BY Internals

For the query:

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

The engine arranges the employees rows by department (by sorting or hashing), creating a "bucket" for each unique department. It then counts the rows within each bucket to get the employee_count for each department.
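The "bucket" picture also matches the hash-based approach that many engines can use in place of sorting: a single pass over the rows with a running count per key. A minimal Python sketch, where hash_group_count is a hypothetical helper and a dict stands in for the engine's hash table:

```python
# (name, department) pairs taken from the sample Employees table.
EMPLOYEES = [
    ("Alice", "Sales"),
    ("Bob", "Sales"),
    ("Charlie", "HR"),
    ("David", "HR"),
    ("Eve", "IT"),
    ("Frank", "IT"),
    ("Grace", "IT"),
]


def hash_group_count(rows, key_index):
    """Hypothetical helper: COUNT(*) per group using one hash-table pass."""
    buckets = {}
    for row in rows:
        key = row[key_index]
        # Create the bucket on first sight, then keep a running count.
        buckets[key] = buckets.get(key, 0) + 1
    return buckets


counts = hash_group_count(EMPLOYEES, 1)
print(counts)  # {'Sales': 2, 'HR': 2, 'IT': 3}
```

No sort is needed here, so the groups come out in first-seen order rather than sorted order, one reason result order should not be relied on without ORDER BY.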

Key Differences Between GROUP BY and DISTINCT

| Aspect | GROUP BY | DISTINCT |
| --- | --- | --- |
| Purpose | Groups rows based on column values | Eliminates duplicate rows |
| Usage with Aggregates | Commonly used with aggregate functions | Usually used without aggregate functions, though it can appear inside one, e.g. COUNT(DISTINCT column) |
| Multiple Columns | Groups data based on multiple columns | Ensures unique combinations of multiple columns |
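The two features overlap more than the table might suggest. As a sketch (assuming SQLite via Python's sqlite3 module), a GROUP BY with no aggregate returns the same rows as DISTINCT on the same column, and DISTINCT can be combined with GROUP BY inside an aggregate such as COUNT:

```python
import sqlite3

# Rows copied from the sample Employees table in this post.
ROWS = [
    (1, "Alice", "Sales", "Sales Manager", 70000),
    (2, "Bob", "Sales", "Sales Associate", 50000),
    (3, "Charlie", "HR", "HR Manager", 60000),
    (4, "David", "HR", "Recruiter", 45000),
    (5, "Eve", "IT", "Software Engineer", 80000),
    (6, "Frank", "IT", "IT Support", 55000),
    (7, "Grace", "IT", "Software Engineer", 80000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "job_title TEXT, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", ROWS)

# Without aggregates, these two queries return the same set of rows.
via_distinct = conn.execute(
    "SELECT DISTINCT department FROM employees ORDER BY department"
).fetchall()
via_group_by = conn.execute(
    "SELECT department FROM employees GROUP BY department ORDER BY department"
).fetchall()

# DISTINCT inside an aggregate: unique job titles per department.
titles_per_dept = conn.execute(
    "SELECT department, COUNT(DISTINCT job_title) "
    "FROM employees GROUP BY department ORDER BY department"
).fetchall()

print(via_distinct == via_group_by)  # True
print(titles_per_dept)  # [('HR', 2), ('IT', 2), ('Sales', 2)]
```

IT has three employees but only two distinct job titles, which is the difference between COUNT(*) and COUNT(DISTINCT job_title).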

Summary

  • GROUP BY: Ideal when you want to organize data into groups and use aggregate functions.
  • DISTINCT: Useful for retrieving a unique set of rows without aggregating.

Understanding both makes it easier to pick the right tool when you need to summarize or deduplicate data.