1/jecolon/ziglyph v0.32
ziglyph
Unicode text processing for the Zig Programming Language.
In-Depth Articles on Unicode Processing with Zig and Ziglyph
The Unicode Processing with Zig series of articles over on ZigNEWS covers important aspects of Unicode in general and in particular how to use this library to process Unicode text.
Looking for a UTF-8 String Type?
Zigstr is a UTF-8 string type that incorporates many of Ziglyph's Unicode processing tools. You can learn more in the Zigstr repo.
Status
This is pre-1.0 software. Although breaking changes are less frequent with each minor version release, they will still occur until we reach 1.0.
Integrating Ziglyph in your Project
Using Zigmod
$ zigmod aq add 1/jecolon/ziglyph
$ zigmod fetch
Now in your build.zig, add this import:
const deps = @import("deps.zig");
In the exe section for the executable where you wish to have Ziglyph available, add the following (a full build.zig sketch appears at the end of this section):
deps.addAllTo(exe);
Manually via Git
In a libs subdirectory under the root of your project, clone this repository via:
$ git clone https://github.com/jecolon/ziglyph.git
Now in your build.zig, you can add:
exe.addPackagePath("ziglyph", "libs/ziglyph/src/ziglyph.zig");
to the exe section for the executable where you wish to have Ziglyph available. Now in the code, you can import components like this:
const ziglyph = @import("ziglyph");
const letter = @import("ziglyph").letter; // or const letter = ziglyph.letter;
const number = @import("ziglyph").number; // or const number = ziglyph.number;
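Putting it together, a minimal build.zig for either integration method might look like the sketch below. The executable name my-app and root source src/main.zig are placeholders, and this assumes the pre-0.11 std.build API in use when this version was published.

const std = @import("std");
// Zigmod route only: deps.zig is generated by `zigmod fetch`.
const deps = @import("deps.zig");

pub fn build(b: *std.build.Builder) void {
    const target = b.standardTargetOptions(.{});
    const mode = b.standardReleaseOptions();

    // Placeholder executable; use your own name and root source file.
    const exe = b.addExecutable("my-app", "src/main.zig");
    exe.setTarget(target);
    exe.setBuildMode(mode);

    // Zigmod route: add all fetched dependencies, including Ziglyph.
    deps.addAllTo(exe);

    // Manual clone route (instead of the two Zigmod lines above):
    // exe.addPackagePath("ziglyph", "libs/ziglyph/src/ziglyph.zig");

    exe.install();
}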
Using the ziglyph Namespace
The ziglyph namespace provides convenient access to the most frequently-used functions related to Unicode code points and strings.
const std = @import("std");
const expect = std.testing.expect;
const expectEqual = std.testing.expectEqual;
const ziglyph = @import("ziglyph");
test "ziglyph namespace" {
const z = 'z';
try expect(ziglyph.isLetter(z));
try expect(ziglyph.isAlphaNum(z));
try expect(ziglyph.isPrint(z));
try expect(!ziglyph.isUpper(z));
const uz = ziglyph.toUpper(z);
try expect(ziglyph.isUpper(uz));
try expectEqual(uz, 'Z');
// String toLower, toTitle, and toUpper.
var allocator = std.testing.allocator;
var got = try ziglyph.toLowerStr(allocator, "AbC123");
errdefer allocator.free(got);
try expect(std.mem.eql(u8, "abc123", got));
allocator.free(got);
got = try ziglyph.toUpperStr(allocator, "aBc123");
errdefer allocator.free(got);
try expect(std.mem.eql(u8, "ABC123", got));
allocator.free(got);
got = try ziglyph.toTitleStr(allocator, "thE aBc123 moVie. yes!");
defer allocator.free(got);
try expect(std.mem.eql(u8, "The Abc123 Movie. Yes!", got));
}
Category Namespaces
Namespaces for frequently-used Unicode General Categories are available. See ziglyph.zig for a full list of all components.
const std = @import("std");
const expect = std.testing.expect;
const expectEqual = std.testing.expectEqual;
const letter = @import("ziglyph").letter;
const punct = @import("ziglyph").punct;
test "Category namespaces" {
const z = 'z';
try expect(letter.isLetter(z));
try expect(!letter.isUpper(z));
try expect(!punct.isPunct(z));
try expect(punct.isPunct('!'));
const uz = letter.toUpper(z);
try expect(letter.isUpper(uz));
try expectEqual(uz, 'Z');
}
Normalization
In addition to the basic functions to detect and convert code point case, the Normalizer struct provides code point and string normalization methods. All normalization forms are supported (NFC, NFKC, NFD, NFKD).
const std = @import("std");
const testing = std.testing;
const Normalizer = @import("ziglyph").Normalizer;
test "normalizeTo" {
var allocator = std.testing.allocator;
var normalizer = try Normalizer.init(allocator);
defer normalizer.deinit();
// Canonical Composition (NFC)
const input_nfc = "Complex char: \u{03D2}\u{0301}";
const want_nfc = "Complex char: \u{03D3}";
var got_nfc = try normalizer.nfc(allocator, input_nfc);
defer got_nfc.deinit();
try testing.expectEqualSlices(u8, want_nfc, got_nfc.slice);
// Compatibility Composition (NFKC)
const input_nfkc = "Complex char: \u{03A5}\u{0301}";
const want_nfkc = "Complex char: \u{038E}";
var got_nfkc = try normalizer.nfkc(allocator, input_nfkc);
defer got_nfkc.deinit();
try testing.expectEqualSlices(u8, want_nfkc, got_nfkc.slice);
// Canonical Decomposition (NFD)
const input_nfd = "Complex char: \u{03D3}";
const want_nfd = "Complex char: \u{03D2}\u{0301}";
var got_nfd = try normalizer.nfd(allocator, input_nfd);
defer got_nfd.deinit();
try testing.expectEqualSlices(u8, want_nfd, got_nfd.slice);
// Compatibility Decomposition (NFKD)
const input_nfkd = "Complex char: \u{03D3}";
const want_nfkd = "Complex char: \u{03A5}\u{0301}";
var got_nfkd = try normalizer.nfkd(allocator, input_nfkd);
defer got_nfkd.deinit();
try testing.expectEqualSlices(u8, want_nfkd, got_nfkd.slice);
// String comparisons.
try testing.expect(try normalizer.eql(allocator, "foé", "foe\u{0301}"));
try testing.expect(try normalizer.eql(allocator, "foϓ", "fo\u{03D2}\u{0301}"));
try testing.expect(try normalizer.eqlCaseless(allocator, "Foϓ", "fo\u{03D2}\u{0301}"));
try testing.expect(try normalizer.eqlCaseless(allocator, "FOÉ", "foe\u{0301}")); // foÉ == foé
// Note: eqlIdentifiers is not a method, it's just a function in the Normalizer namespace.
try testing.expect(try Normalizer.eqlIdentifiers(allocator, "Foé", "foé")); // Unicode Identifiers caseless match.
}
Collation (String Ordering)
One of the most common operations required by string processing is sorting and ordering comparisons.
The Unicode Collation Algorithm was developed to address this area of string processing. The Collator struct implements the algorithm, allowing for proper sorting and order comparison of Unicode strings.
Aside from the usual init function, there's initWithReader, which you can use to initialize the struct with an alternate weights table file (allkeys.bin), be it a file, a network stream, or anything else that exposes a std.io.Reader. This allows for tailoring of the sorting algorithm.
const std = @import("std");
const testing = std.testing;
const Collator = @import("ziglyph").Collator;
test "Collation" {
var allocator = std.testing.allocator;
var collator = try Collator.init(allocator);
defer collator.deinit();
// Collation weight levels overview:
// * .primary: different letters.
// * .secondary: same base letters, but marks (like accents) differ.
// * .tertiary: same letters and marks but case is different.
// So cab < dab at .primary, and cab < cáb at .secondary, and cáb < Cáb at .tertiary level.
try testing.expect(collator.tertiaryAsc("abc", "def"));
try testing.expect(collator.tertiaryDesc("def", "abc"));
// At only the primary level, José and jose are equal because the base letters are the same;
// only marks and case differ, which are .secondary and .tertiary differences respectively.
try testing.expect(try collator.orderFn("José", "jose", .primary, .eq));
// Full Unicode sort.
var strings: [3][]const u8 = .{ "xyz", "def", "abc" };
collator.sortAsc(&strings);
try testing.expectEqual(strings[0], "abc");
try testing.expectEqual(strings[1], "def");
try testing.expectEqual(strings[2], "xyz");
// ASCII-only binary sort. If you know the strings are ASCII only, this is much faster.
strings = .{ "xyz", "def", "abc" };
collator.sortAsciiAsc(&strings);
try testing.expectEqual(strings[0], "abc");
try testing.expectEqual(strings[1], "def");
try testing.expectEqual(strings[2], "xyz");
}
Tailoring With allkeys.txt
To tailor the sorting algorithm, you can create a modified allkeys.txt and generate a new compressed binary allkeys.bin file from it. Follow these steps:
# Change to the Ziglyph source directory.
cd <path to ziglyph>/src/
# Build the UDDC tool for your platform.
zig build-exe -O ReleaseSafe uddc.zig
# Create a new directory to store the UDDC tool and modified data files.
mkdir <path to new data dir>
# Move the tool and copy the data file to the new directory.
mv uddc <path to new data dir>/
cp data/uca/allkeys.txt <path to new data dir>/
# Change into the new data dir.
cd <path to new data dir>/
# Modify the allkeys.txt file with your favorite editor.
vim allkeys.txt
# Generate the new compressed binary allkeys.bin
./uddc allkeys.txt
After running these commands, you can then use the new allkeys.bin file with the initWithReader method:
const std = @import("std");
const Collator = @import("ziglyph").Collator;
// Assumes `allocator` is any std.mem.Allocator already in scope.
var file = try std.fs.cwd().openFile("<path to new data dir>/allkeys.bin", .{});
defer file.close();
var buf_reader = std.io.bufferedReader(file.reader());
var collator = try Collator.initWithReader(allocator, buf_reader.reader());
defer collator.deinit();
// ...use the collator as usual.
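Since initWithReader accepts anything that exposes a std.io.Reader, you could also embed the tailored table directly in your program. Here is a minimal sketch, assuming the generated allkeys.bin sits next to the source file doing the @embedFile:

const std = @import("std");
const Collator = @import("ziglyph").Collator;

test "Collator from an embedded allkeys.bin" {
    const allocator = std.testing.allocator;

    // Embed the tailored table; fixedBufferStream exposes it as a std.io.Reader.
    const tailored_allkeys = @embedFile("allkeys.bin");
    var fbs = std.io.fixedBufferStream(tailored_allkeys);

    var collator = try Collator.initWithReader(allocator, fbs.reader());
    defer collator.deinit();

    // Use the collator as usual.
    try std.testing.expect(collator.tertiaryAsc("abc", "def"));
}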
Text Segmentation (Grapheme Clusters, Words, Sentences)
Ziglyph has iterators to traverse text as Grapheme Clusters (what most people recognize as characters),
Words, and Sentences. All of these text segmentation functions adhere to the Unicode Text Segmentation rules,
which may surprise you in terms of what's included and excluded at each break point. Test before assuming any
results! There are also non-allocating compile-time versions for use with string literals or embedded files.
Note that for compile-time versions, you may need to increase the compile-time branch evaluation quota via @setEvalBranchQuota.
const std = @import("std");
const testing = std.testing;
const Grapheme = @import("ziglyph").Grapheme;
const GraphemeIterator = Grapheme.GraphemeIterator;
const Sentence = @import("ziglyph").Sentence;
const SentenceIterator = Sentence.SentenceIterator;
const ComptimeSentenceIterator = Sentence.ComptimeSentenceIterator;
const Word = @import("ziglyph").Word;
const WordIterator = Word.WordIterator;
test "GraphemeIterator" {
const input = "H\u{0065}\u{0301}llo";
var iter = try GraphemeIterator.init(input);
const want = &[_][]const u8{ "H", "\u{0065}\u{0301}", "l", "l", "o" };
var i: usize = 0;
while (iter.next()) |grapheme| : (i += 1) {
try testing.expect(grapheme.eql(want[i]));
}
// Need your grapheme clusters at compile time?
comptime {
var ct_iter = try GraphemeIterator.init(input);
var j = 0;
while (ct_iter.next()) |grapheme| : (j += 1) {
try testing.expect(grapheme.eql(want[j]));
}
}
}
test "SentenceIterator" {
var allocator = std.testing.allocator;
const input =
\\("Go.") ("He said.")
;
var iter = try SentenceIterator.init(allocator, input);
defer iter.deinit();
// Note the space after the closing right parenthesis is included as part
// of the first sentence.
const s1 =
\\("Go.")
;
const s2 =
\\("He said.")
;
const want = &[_][]const u8{ s1, s2 };
var i: usize = 0;
while (iter.next()) |sentence| : (i += 1) {
try testing.expectEqualStrings(sentence.bytes, want[i]);
}
// Need your sentences at compile time?
@setEvalBranchQuota(2_000);
comptime var ct_iter = ComptimeSentenceIterator(input){};
const n = comptime ct_iter.count();
var sentences: [n]Sentence = undefined;
comptime {
var ct_i: usize = 0;
while (ct_iter.next()) |sentence| : (ct_i += 1) {
sentences[ct_i] = sentence;
}
}
for (sentences) |sentence, j| {
try testing.expect(sentence.eql(want[j]));
}
}
test "WordIterator" {
const input = "The (quick) fox. Fast! ";
var iter = try WordIterator.init(input);
const want = &[_][]const u8{ "The", " ", "(", "quick", ")", " ", "fox", ".", " ", "Fast", "!", " " };
var i: usize = 0;
while (iter.next()) |word| : (i += 1) {
try testing.expectEqualStrings(word.bytes, want[i]);
}
// Need your words at compile time?
@setEvalBranchQuota(2_000);
comptime {
var ct_iter = try WordIterator.init(input);
var j = 0;
while (ct_iter.next()) |word| : (j += 1) {
try testing.expect(word.eql(want[j]));
}
}
}
Code Point and String Display Width
When working with environments in which text is rendered in a fixed-width font, such as terminal
emulators, it's necessary to know how many cells (or columns) a particular code point or string will
occupy. The display_width namespace provides functions to do just that.
const std = @import("std");
const expectEqual = std.testing.expectEqual;
const expectEqualSlices = std.testing.expectEqualSlices;
const dw = @import("ziglyph").display_width;
test "Code point / string widths" {
// The width methods take a second parameter of value .half or .full to determine the width of
// ambiguous code points as per the Unicode standard. .half is the most common case.
// Note that codePointWidth returns an i3 because code points like backspace have width -1.
try expectEqual(dw.codePointWidth('é', .half), 1);
try expectEqual(dw.codePointWidth('😊', .half), 2);
try expectEqual(dw.codePointWidth('统', .half), 2);
var allocator = std.testing.allocator;
// strWidth returns usize because it can never be negative, regardless of the code points it contains.
try expectEqual(try dw.strWidth("Hello\r\n", .half), 5);
try expectEqual(try dw.strWidth("\u{1F476}\u{1F3FF}\u{0308}\u{200D}\u{1F476}\u{1F3FF}", .half), 2);
try expectEqual(try dw.strWidth("Héllo 🇵🇷", .half), 8);
try expectEqual(try dw.strWidth("\u{26A1}\u{FE0E}", .half), 1); // Text sequence
try expectEqual(try dw.strWidth("\u{26A1}\u{FE0F}", .half), 2); // Presentation sequence
// padLeft, center, padRight
const right_aligned = try dw.padLeft(allocator, "w😊w", 10, "-");
defer allocator.free(right_aligned);
try expectEqualSlices(u8, "------w😊w", right_aligned);
const centered = try dw.center(allocator, "w😊w", 10, "-");
defer allocator.free(centered);
try expectEqualSlices(u8, "---w😊w---", centered);
const left_aligned = try dw.padRight(allocator, "w😊w", 10, "-");
defer allocator.free(left_aligned);
try expectEqualSlices(u8, "w😊w------", left_aligned);
}
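The width and padding functions compose well. As a sketch, a hypothetical padColumn helper (not part of Ziglyph) could right-pad a set of strings to the display width of the widest one, for example to line up a column of terminal output:

const std = @import("std");
const dw = @import("ziglyph").display_width;

// Hypothetical helper, not part of Ziglyph: pad every string to the display
// width of the widest one. The caller owns the returned strings and the outer slice.
// (Error handling is kept minimal for brevity.)
fn padColumn(allocator: std.mem.Allocator, strings: []const []const u8) ![][]const u8 {
    var widest: usize = 0;
    for (strings) |s| {
        widest = std.math.max(widest, try dw.strWidth(s, .half));
    }

    var padded = try allocator.alloc([]const u8, strings.len);
    for (strings) |s, i| {
        padded[i] = try dw.padRight(allocator, s, widest, " ");
    }
    return padded;
}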
Word Wrap
If you need to wrap a string to a specific number of columns according to Unicode Word boundaries and display width, you can use the display_width namespace's wrap function. You can also specify a threshold value indicating how close a word boundary can be to the column limit before a line break is triggered.
const std = @import("std");
const testing = std.testing;
const dw = @import("ziglyph").display_width;
test "display_width wrap" {
var allocator = testing.allocator;
var input = "The quick brown fox\r\njumped over the lazy dog!";
var got = try dw.wrap(allocator, input, 10, 3);
defer allocator.free(got);
var want = "The quick\n brown \nfox jumped\n over the\n lazy dog\n!";
try testing.expectEqualStrings(want, got);
}
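Outside of a test, the same call works with any allocator. A minimal sketch of a small program that wraps some text to 20 columns (the width and threshold here are arbitrary) and prints the result:

const std = @import("std");
const dw = @import("ziglyph").display_width;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const text = "The quick brown fox jumped over the lazy dog!";

    // 20 columns, with the same threshold of 3 used in the test above.
    const wrapped = try dw.wrap(allocator, text, 20, 3);
    defer allocator.free(wrapped);

    std.debug.print("{s}\n", .{wrapped});
}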
Package Contents
- .gitattributes
- LICENSE
- build.zig
- src/ascii.zig
- src/data/ucd/GraphemeBreakTest.txt
- src/data/ucd/UnicodeData.txt
- src/data/ucd/NormalizationTest.txt
- src/data/ucd/Decompositions.txt.gz
- src/data/ucd/WordBreakTest.txt
- src/data/ucd/SentenceBreakTest.txt
- src/data/ucd/Composites.txt.gz
- src/data/uca/CollationTest_NON_IGNORABLE_SHORT.txt
- src/data/uca/allkeys.txt
- src/data/uca/allkeys.bin
- src/data/license/UnicodeLicenseAgreement.html
- src/data/license/standard_styles.css
- src/normalizer/Normalizer.zig
- src/segmenter/CodePoint.zig
- src/segmenter/Sentence.zig
- src/segmenter/Word.zig
- src/segmenter/Grapheme.zig
- src/ziglyph.zig
- src/display_width.zig
- src/tests.zig
- src/uddc.zig
- src/tests/readme_tests.zig
- src/autogen/word_break_property.zig
- src/autogen/derived_general_category.zig
- src/autogen/title_map.zig
- src/autogen/derived_numeric_type.zig
- src/autogen/emoji_data.zig
- src/autogen/blocks.zig
- src/autogen/derived_east_asian_width.zig
- src/autogen/prop_list.zig
- src/autogen/derived_combining_class.zig
- src/autogen/case_folding.zig
- src/autogen/derived_core_properties.zig
- src/autogen/derived_normalization_props.zig
- src/autogen/grapheme_break_property.zig
- src/autogen/hangul_syllable_type.zig
- src/autogen/sentence_break_property.zig
- src/autogen/upper_map.zig
- src/autogen/lower_map.zig
- src/category/letter.zig
- src/category/mark.zig
- src/category/punct.zig
- src/category/symbol.zig
- src/category/number.zig
- src/collator/Collator.zig
- src/collator/CollatorTrie.zig
- src/collator/AllKeysFile.zig
- README.md
- zig.mod
- .gitignore
History
Version | Published On | Tree @ Commit | Size |
---|---|---|---|
v0.37 | Sun, 05 Mar 2023 23:05:02 UTC | Tree | 7.233 MB |
v0.36 | Sun, 26 Feb 2023 13:38:36 UTC | Tree | 7.233 MB |
v0.35 | Thu, 09 Feb 2023 23:06:35 UTC | Tree | 7.233 MB |
v0.34 | Wed, 14 Dec 2022 01:49:34 UTC | Tree | 7.232 MB |
v0.33 | Fri, 02 Dec 2022 14:57:26 UTC | Tree | 7.232 MB |
v0.32 | Fri, 25 Nov 2022 01:25:14 UTC | Tree | 10.144 MB |
v0.31 | Fri, 25 Nov 2022 01:18:53 UTC | Tree | 10.144 MB |
v0.30 | Sat, 19 Nov 2022 14:29:40 UTC | Tree | 10.202 MB |
v0.29 | Sat, 19 Nov 2022 02:09:25 UTC | Tree | 10.205 MB |
v0.28 | Sat, 19 Nov 2022 01:37:54 UTC | Tree | 10.204 MB |
v0.27 | Thu, 03 Nov 2022 11:57:54 UTC | Tree | 10.195 MB |
v0.26 | Tue, 06 Sep 2022 15:57:44 UTC | Tree | 10.434 MB |
v0.25 | Sun, 26 Dec 2021 13:25:45 UTC | Tree | 10.434 MB |
v0.24 | Sun, 26 Dec 2021 13:07:14 UTC | Tree | 10.434 MB |
v0.23 | Sun, 26 Dec 2021 12:50:27 UTC | Tree | 10.434 MB |
v0.22 | Fri, 24 Dec 2021 13:45:05 UTC | Tree | 10.457 MB |
v0.21 | Thu, 23 Dec 2021 16:03:13 UTC | Tree | 10.457 MB |
v0.20 | Thu, 23 Dec 2021 15:58:47 UTC | Tree | 10.457 MB |
v0.19 | Sat, 25 Sep 2021 00:29:44 UTC | Tree | 10.457 MB |
v0.18 | Wed, 22 Sep 2021 00:13:07 UTC | Tree | 10.457 MB |
v0.17 | Mon, 20 Sep 2021 10:50:08 UTC | Tree | 10.459 MB |
v0.16 | Fri, 17 Sep 2021 00:15:40 UTC | Tree | 10.457 MB |
v0.15 | Thu, 16 Sep 2021 01:30:12 UTC | Tree | 10.455 MB |
v0.14 | Tue, 14 Sep 2021 22:07:09 UTC | Tree | 10.251 MB |
v0.13 | Tue, 14 Sep 2021 01:29:25 UTC | Tree | 10.251 MB |
v0.12 | Fri, 27 Aug 2021 14:12:57 UTC | Tree | 10.251 MB |
v0.11 | Fri, 27 Aug 2021 14:10:18 UTC | Tree | 10.251 MB |
v0.10 | Fri, 27 Aug 2021 10:46:32 UTC | Tree | 10.263 MB |
v0.9 | Fri, 27 Aug 2021 02:20:32 UTC | Tree | 10.263 MB |
v0.8 | Fri, 27 Aug 2021 01:42:44 UTC | Tree | 10.263 MB |
v0.7 | Thu, 26 Aug 2021 16:01:47 UTC | Tree | 10.260 MB |
v0.6 | Thu, 26 Aug 2021 15:58:25 UTC | Tree | 10.260 MB |
v0.5 | Thu, 26 Aug 2021 15:57:17 UTC | Tree | 10.260 MB |
v0.4 | Thu, 26 Aug 2021 15:51:06 UTC | Tree | 10.260 MB |
v0.3 | Mon, 23 Aug 2021 10:55:09 UTC | Tree | 10.228 MB |
v0.2 | Mon, 23 Aug 2021 00:41:38 UTC | Tree | 10.228 MB |
v0.1 | Sun, 22 Aug 2021 23:46:25 UTC | Tree | 10.228 MB |