Calculate TF-IDF using DuckDB
This SQL query calculates the term frequency-inverse document frequency (TF-IDF) scores of words in a set of 4 documents represented by the sentences CTE.
Loading editor...
WITH
sentences(sentence, doc) AS (
VALUES
('Air quality in the sunny island improved gradually throughout Wednesday.', 1),
('Air quality in Singapore on Wednesday continued to get worse as haze hit the island.', 2),
('The air quality in Singapore is monitored through a network of air monitoring stations located in different parts of the island', 3),
('The air quality in Singapore got worse on Wednesday.', 4)
), base_tokenized AS (
SELECT
UNNEST(STRING_SPLIT(REGEXP_REPLACE(LOWER(sentence),'[^-0-9A-Za-z ]','', 'g'),' ')) AS token,
doc
FROM sentences
), base_stopworded AS (
SELECT *
FROM base_tokenized
WHERE token NOT IN ('the', 'on', 'was', 'in', 'got', 'of', 'to', 'as', 'is', 'a')
), tf_num AS (
SELECT
token,
doc,
COUNT(*) AS doc_tok_cnt
FROM base_stopworded
GROUP BY 1,2
), tf_den AS (
SELECT
doc,
SUM(doc_tok_cnt) AS doc_tot_sum
FROM tf_num
GROUP BY 1
), idf_den AS (
SELECT COUNT(DISTINCT doc) AS ndocs
FROM base_stopworded
), idf_num AS (
SELECT
token,
COUNT(DISTINCT doc) AS docs
FROM
base_stopworded
GROUP BY 1
), calc_tf AS (
SELECT
token,
doc,
tf_num.doc_tok_cnt,
tf_den.doc_tot_sum,
tf_num.doc_tok_cnt/tf_den.doc_tot_sum AS tf
FROM tf_num
JOIN tf_den USING (doc)
), calc_idf AS (
SELECT
token,
idf_den.ndocs,
idf_num.docs,
LOG(idf_den.ndocs/idf_num.docs) AS idf
FROM idf_den, idf_num
), calc_tfidf AS (
SELECT
*,
calc_tf.tf * calc_idf.idf AS tfidf
FROM calc_tf
JOIN calc_idf USING (token)
)
SELECT *
FROM calc_tfidf
ORDER BY
doc ASC,
tfidf DESCIt performs the following operations:
- Tokenize the lowercase sentences into individual words by splitting on spaces, after removing all characters that are not letters or numbers.
- Remove stop words (“the”, “on”, “was”, “in”, “got”, “of”, “to”, “as”, “is”, “a”).
- Calculate term frequency (TF) as the number of times a word appears in a document divided by the total number of words in that document.
- Calculate inverse document frequency (IDF) as the logarithm of the total number of documents divided by the number of documents containing the word.
- Multiply TF and IDF to get the TF-IDF score for each word in each document.
- Return a table of the words, their TF-IDF scores, and the document they appear in, sorted by document and then by TF-IDF score in descending order.
CTEs Breakdown

sentences: Creates a table with 4 sentences representing the 4 documents. Each row contains a sentence and the document number it belongs to.
base_tokenized: Tokenizes each sentence into individual words. It first removes all characters from the sentence that are not letters or numbers, and then splits the sentence into words by spaces. The result is a table with columns for the word (token) and the document number it belongs to.
base_stopworded: Removes stop words from the base_tokenized table. It only includes words that are not in the list of stop words.
tf_num: Calculates the number of times each word appears in each document. It groups the words and documents from base_stopworded and counts the number of occurrences of each word in each document.
tf_den: Calculates the total number of words in each document. It sums the word counts from tf_num for each document.
idf_den: Calculates the total number of documents.
idf_num: Calculates the number of documents containing each word. It groups the words from base_stopworded and counts the number of unique documents for each word.
calc_tf: Calculates the term frequency (TF) for each word in each document. It joins the tf_num and tf_den CTEs to get the necessary information and calculates TF as the number of times a word appears in a document divided by the total number of words in that document.
calc_idf: Calculates the inverse document frequency (IDF) for each word. It joins the idf_den and idf_num CTEs to get the necessary information and calculates IDF as the logarithm of the total number of documents divided by the number of documents containing the word.
calc_tfidf: Calculates the TF-IDF score for each word in each document. It joins the calc_tf and calc_idf CTEs to get both TF and IDF, and multiplies them to get the TF-IDF score.
TF-IDF example with a larger dataset
Now, lets try with 50,000 HackerNews titles:
Loading editor...
WITH
sentences AS (
SELECT
title AS sentence,
id AS doc
FROM 'https://storage.sekuel.com/hn_titles.parquet'
-- LIMIT 1000
), base_tokenized AS (
SELECT
UNNEST(STRING_SPLIT(REGEXP_REPLACE(LOWER(sentence),'[^-0-9A-Za-z ]','', 'g'),' ')) AS token,
doc
FROM sentences
), base_stopworded AS (
SELECT *
FROM base_tokenized
WHERE token NOT IN ('the', 'on', 'was', 'in', 'got', 'of', 'to', 'as', 'is', 'a')
), tf_num AS (
SELECT
token,
doc,
COUNT(*) AS doc_tok_cnt
FROM base_stopworded
GROUP BY 1,2
), tf_den AS (
SELECT
doc,
SUM(doc_tok_cnt) AS doc_tot_sum
FROM tf_num
GROUP BY 1
), idf_den AS (
SELECT COUNT(DISTINCT doc) AS ndocs
FROM base_stopworded
), idf_num AS (
SELECT
token,
COUNT(DISTINCT doc) AS docs
FROM
base_stopworded
GROUP BY 1
), calc_tf AS (
SELECT
token,
doc,
tf_num.doc_tok_cnt,
tf_den.doc_tot_sum,
tf_num.doc_tok_cnt/tf_den.doc_tot_sum AS tf
FROM tf_num
JOIN tf_den USING (doc)
), calc_idf AS (
SELECT
token,
idf_den.ndocs,
idf_num.docs,
LOG(idf_den.ndocs/idf_num.docs) AS idf
FROM idf_den, idf_num
), calc_tfidf AS (
SELECT
*,
calc_tf.tf * calc_idf.idf AS tfidf
FROM calc_tf
JOIN calc_idf USING (token)
)
SELECT *
FROM calc_tfidf
ORDER BY
doc ASC,
tfidf DESC