The Encoding Extension enables text encoding detection and UTF-8 conversion from within DuckDB
Installing and Loading
INSTALL encoding FROM community;
LOAD encoding;
Example
-- Forces text with potentially mixed or unknown encoding to UTF-8
D SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as utf8_text;
┌─────────────────────────────────────┐
│             utf8_text               │
│               varchar               │
├─────────────────────────────────────┤
│ [{"Langue":["Français","English"]}] │
└─────────────────────────────────────┘
-- Use with json() function to parse converted text
D SELECT json(force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D')) as parsed_json;
┌─────────────────────────────────────┐
│             parsed_json             │
│                json                 │
├─────────────────────────────────────┤
│ [{"Langue":["Français","English"]}] │
└─────────────────────────────────────┘
-- Detect the encoding of input text
D SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as detected_encoding;
┌───────────────────┐
│ detected_encoding │
│      varchar      │
├───────────────────┤
│ windows-1252      │
└───────────────────┘
-- Convert simple hex-encoded text
D SELECT force_utf8_encoding('\x48\x65\x6C\x6C\x6F') as simple_conversion;
┌───────────────────┐
│ simple_conversion │
│      varchar      │
├───────────────────┤
│ Hello             │
└───────────────────┘
-- Handle regular UTF-8 text (pass-through)
D SELECT force_utf8_encoding('Hello World') as passthrough;
┌─────────────┐
│ passthrough │
│   varchar   │
├─────────────┤
│ Hello World │
└─────────────┘
About encoding
This community extension implements text encoding detection and UTF-8 conversion functions for DuckDB. It solves the common problem of working with text data from various sources that may not be UTF-8 encoded.
Problem Solved
DuckDB's built-in decode() function fails when trying to convert non-UTF-8 encoded data:
SELECT json(decode('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D'));
-- Error: Failure in decode: could not convert blob to UTF8 string,
-- the blob contained invalid UTF8 characters
This extension automatically detects the encoding and converts it to UTF-8:
SELECT json(force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D')) as json;
-- Returns: [{"Langue":["Français","English"]}]
Functions
        
        force_utf8_encoding(input_text)
        
      
    
Forces text with potentially mixed or unknown encoding to UTF-8. Handles hex-encoded byte sequences and attempts automatic encoding detection.
Parameters:
input_text(VARCHAR): Text that may contain encoded bytes (like '\x5B\x7B…')
Returns: VARCHAR - The text converted to UTF-8, or NULL on error
Examples:
-- Convert hex-encoded JSON with French characters
SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as utf8_text;
-- Returns: [{"Langue":["Français","English"]}]
-- Convert simple hex-encoded text
SELECT force_utf8_encoding('\x48\x65\x6C\x6C\x6F') as text;
-- Returns: Hello
-- Handle regular UTF-8 text (pass-through)
SELECT force_utf8_encoding('Hello World') as text;
-- Returns: Hello World
-- Mixed hex and regular characters
SELECT force_utf8_encoding('Hello \x57\x6F\x72\x6C\x64') as text;
-- Returns: Hello World
        
        detect_encoding(input_text)
        
      
    
Detects the likely encoding of input text and returns the encoding name.
Parameters:
input_text(VARCHAR): Text that may contain encoded bytes
Returns: VARCHAR - The detected encoding name, or NULL on error
Examples:
-- Detect encoding of regular UTF-8 text
SELECT detect_encoding('Hello World') as encoding;
-- Returns: UTF-8
-- Detect encoding of hex-encoded text
SELECT detect_encoding('\x48\x65\x6C\x6C\x6F') as encoding;
-- Returns: UTF-8
-- Detect encoding of French text with Windows-1252 encoding
SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as encoding;
-- Returns: windows-1252
Supported Encodings
The extension automatically detects and converts from these encodings:
- UTF-8: Already valid UTF-8 (pass-through)
 - Windows-1252: Very common for Western text
 - Windows-1251: Cyrillic text
 - Windows-1250: Central European text
 - Shift_JIS: Japanese text
 - GB18030: Chinese text
 - EUC-KR: Korean text
 - ISO-8859-15: Latin-9 (with Euro sign)
 - ISO-8859-2: Latin-2
 
Usage Examples
Basic Encoding Conversion
-- Convert encoded data to UTF-8 for JSON parsing
SELECT json(force_utf8_encoding('\x5B\x7B\x22test\x22\x3A\x22value\x22\x7D\x5D')) as data;
-- Convert and process in WHERE clause
SELECT * FROM documents
WHERE json_extract(force_utf8_encoding(raw_data), '$.language') = 'français';
-- Batch convert multiple encoded strings
SELECT
    id,
    force_utf8_encoding(encoded_text) as clean_text
FROM raw_data_table
WHERE force_utf8_encoding(encoded_text) IS NOT NULL;
Encoding Detection and Analysis
-- Analyze encoding distribution in your data
SELECT
    detect_encoding(raw_text) as encoding,
    COUNT(*) as count
FROM documents
GROUP BY detect_encoding(raw_text)
ORDER BY count DESC;
-- Find records that need encoding conversion
SELECT *
FROM documents
WHERE detect_encoding(raw_text) != 'UTF-8'
AND detect_encoding(raw_text) IS NOT NULL;
This extension is experimental and potentially unstable. See README for full examples.
Added Functions
| function_name | function_type | description | comment | examples | 
|---|---|---|---|---|
| force_utf8_encoding | scalar | Forces text with potentially mixed or unknown encoding to UTF-8. Handles hex-encoded byte sequences and attempts automatic encoding detection. | NULL | [SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D');] | 
| detect_encoding | scalar | Detects the likely encoding of input text and returns the encoding name. | NULL | [SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D');] |