**Introduction**

Data aggregation is the process of gathering and expressing data in a summary to get more information about particular groups based on specific conditions. HiveQL offers several built-in aggregate functions, such as max, min, avg,..etc. It also supports advanced aggregation using keywords such as Variance and Standard Deviation and different types of window functions. In this article, we will demonstrate HiveQL aggregation functions with examples.

**Prerequisites**

Let’s start by creating our Managed table from ‘orders.csv’ file

CREATE TABLEOrders(ProductID int , Quantity string , UnitPrice decimal, Discount int, OrderID int, CustomerID string , OrdersID int, ShipVia int , Freight decimal, ShipName string, ShipAddress string, ShipCity string, ShipRegion string, ShipPostalCode int, ShipCountry string)ROW FORMAT DELIMITEDFIELDS TERMINATED BY','STORED AS TEXTFILELOCATION 'hdfs://<name node host name>:<name node port>/<path of the data files>';

Refer to our previous article about Hive tables to know the difference between Managed and External tables

**Aggregation functions**

**SUM**

Returns the sum of the column in the group or sum of the distinct values of the column in the group.

SELECTsum(Quantity)FROMOrders;

SELECTSum(Quantity)FROMOrdersWHEREShipCountry = ‘France’;

**Count**

In Count function, we have three variations we can use as following:

**COUNT**(*) – Returns the total number of retrieved rows, including rows containing NULL values;**COUNT**(<expression>) – Returns the number of rows for which the supplied expression is non-NULL;**COUNT**(**DISTINCT **expr[, expr]) – Returns the number of rows for which the supplied expression(s) are unique

SELECTCOUNT(*)FROMOrders;

SELECT COUNT(distinctShipCountry)FROMOrders;

SELECT COUNT(*)FROMOrdersWHEREUnitPrice >10;

**Average**

Returns the average of the elements in the group or the average of the distinct values of the column in the group.

**AVG**(<expression or column name>)**AVG**(**DISTINCT **<expression or column name>)

SELECT AVG(UnitPrice)FROMOrdersWHEREShipCountry = 'Germany';

**Minimum**

Returns the minimum of the column in the group.

**MIN**(<expression or column>

SELECT MIN(UnitPrice)FROMOrders;

**Maximum**

Returns the maximum of the column in the group.

**MAX**(<expression or column>

SELECT MAX(UnitPrice)FROMOrders;

**Variance**

Returns the variance of a numeric column in the group.

**VARIANCE**(<expression or column>

SELECT VARIANCE(Quantity)FROMOrders;

**Standard Deviation**

Returns the Standard Deviation of a numeric column in the group.

**STDDEV_POP**(<expression or column>

SELECT STDDEV_POP(UnitPrice)FROMOrders;

**Covariance**

Returns the population covariance of a pair of numeric columns in the group.

**COVAR_POP**(<expression or column> , <expression or column>)

SELECT COVAR_POP(UnitPrice, OrderID)FROMOrders;

**Correlation**

Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group.

**CORR**(<expression or column> , <expression or column>)

SELECT CORR(UnitPrice, OrderID)FROMOrders;

**Collections**

Returns a set of objects with duplicate elements eliminated, repeated values will be rejected in the output you will get a distinct list of your input column or expression in the output.

**COLLECT_SET**(<expression or column>)

SELECT COLLECT_SET(UnitPrice)FROMOrders;

**Histogram**

Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.

SELECT HISTOGRAM_NUMERIC(UnitPrice,5)FROMOrders;